Monday, June 11, 2012

trouble with utf-8 chars & apache2 rewrite rules

I saw the post and think it is great, but I have a more fundamental problem to solve first:

I need to extend my site to handle UTF-8 characters in query string parameters, in directory and file names, and in text displayed to users.

I configured Apache with AddDefaultCharset utf-8, and PHP likewise, if that matters. My original rewrite rule rejected everything except A-Za-z, underscore, and hyphen, and it worked: anything else gave a 404, which is what I want. Now, however, everything seems to match, including input I don't want. Yet although it seems to match, the value doesn't end up in the query string unless it is a plain A-Za-z_- string.

I find this confusing, because the rule says to put whatever was matched into the query string.

Here is the original rule:

RewriteRule ^/puzzle/([A-Za-z_-]+)$ /puzzle.php?g=$1 [NC]

and here is the revised rule:

RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]

I made the change because I read somewhere that \w matches ALL the alphabetic characters, whereas A-Z etc. only matches the ones without accents.

It doesn't seem to matter which of those rules I use. Here is what happens.
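Whether \w covers accented letters depends on whether the regex engine sees raw bytes or decoded text. The sketch below uses Python only as an illustration, not Apache itself; it assumes (this is not confirmed in the post) that the rewrite engine compares the pattern against the UTF-8 bytes of the path without a Unicode mode switched on. The helper name `matches` is made up for the example. It also shows that \w is not a drop-in replacement for the original class: it accepts digits and rejects the hyphen.

```python
import re

# Character class from the original rule, applied to raw bytes.
ORIGINAL_CLASS = re.compile(rb"[A-Za-z_-]+")
# \w over raw bytes: ASCII letters, digits and underscore only.
WORD_BYTES = re.compile(rb"\w+")
# \w over decoded text: Unicode-aware, so accented letters count.
WORD_TEXT = re.compile(r"\w+")


def matches(pattern, value):
    """Return True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


name = "México"
raw = name.encode("utf-8")  # the bytes behind M%C3%A9xico

print(matches(ORIGINAL_CLASS, raw))     # False: \xc3\xa9 is outside A-Za-z_-
print(matches(WORD_BYTES, raw))         # False: byte-level \w is ASCII-only
print(matches(WORD_TEXT, name))         # True: text-level \w knows about é
print(matches(WORD_BYTES, b"a-b"))      # False: \w does not include the hyphen
print(matches(WORD_BYTES, b"abc123"))   # True: \w does include digits
```

If that assumption holds, neither rule would match /puzzle/México, which would fit with g being undefined for that request.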

In the application I have this:

echo $_GET['g'];

If I feed it a url like it echoes out "USA" and works fine.

If I feed it a URL like /México, it echoes nothing, warns me that index g is not defined, and of course doesn't fetch the resources for Mexico.

if I feed it a url like it does the same thing.

This last case should be a 404!

And it does this no matter which of the above rules I use. I also configured a rewrite log:

RewriteLogLevel 5
RewriteLog /opt/local/apache2/logs/puzzles.httpd.rewrite

but it is empty.

Here is the regular access log (it gives a status of 200):

[26/May/2010:11:21:42 -0700] "GET /puzzle/M%C3%A9xico HTTP/1.1" 200 342
[26/May/2010:11:21:54 -0700] "GET /puzzle/M/ HTTP/1.1" 200 342

What can I do to get these $%#$@(*#@!!! characters but not slash, dot or other non-alpha into my program, and once there, will it decode them correctly??? Would posix char classes work any better? Is there anything else I need to configure?

Source: Tips4all

1 comment:

  1. On...
    RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]

    Someone correct me if I'm wrong, but wouldn't this mean GET requests asking for subdirectories simply bypass this rule?

    Also, a lazy way to solve this is to add the '%' character to the group. As far as I know, all you're allowed to work with in any URL path is URL encoding. Actually, see:

    I'm sure there are more advanced and better ways to do this, but that should solve your immediate problem.
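
One alternative to whitelisting '%' is to decode the path segment first and then validate the decoded text. The sketch below shows that idea in Python purely as an illustration, since the application in the question is PHP. The function names `decode_segment`, `is_valid_name` and `lookup_key` are hypothetical, and the policy (letters in any script plus underscore and hyphen) is an assumption based on what the question asks for.

```python
from urllib.parse import unquote


def decode_segment(encoded):
    """Percent-decode one path segment as UTF-8.

    Returns None when the decoded bytes are not valid UTF-8.
    """
    try:
        return unquote(encoded, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def is_valid_name(name):
    """Accept non-empty names made of letters, underscore or hyphen."""
    return bool(name) and all(ch.isalpha() or ch in "_-" for ch in name)


def lookup_key(encoded):
    """Return the decoded name, or None where the caller should answer 404."""
    name = decode_segment(encoded)
    if name is None or not is_valid_name(name):
        return None
    return name


print(lookup_key("M%C3%A9xico"))  # México
print(lookup_key("USA"))          # USA
print(lookup_key("M/"))           # None: slash is rejected
print(lookup_key("a.b"))          # None: dot is rejected
print(lookup_key("%C3"))          # None: truncated UTF-8 sequence
```

Validating after decoding keeps an encoded slash or dot (such as %2F) from slipping through, which a pattern that simply allows '%' would not catch.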