Monday, June 11, 2012

Trouble with UTF-8 chars & Apache2 rewrite rules


I saw the post http://stackoverflow.com/questions/2565864/validating-utf-8-in-htaccess-rewrite-rule and I think it is great, but I am having a more fundamental problem first.



I need to expand my application to handle UTF-8 characters in query string parameters, in directory and file names, and in text displayed to users.



I configured Apache with AddDefaultCharset utf-8, and PHP as well, if that matters. My original rewrite rule accepted only A-Z, a-z, underscore, and hyphen, and it worked: anything else gave a 404, which is what I want. Now, however, everything seems to match, including things I don't want. Yet although the URL seems to match, the value only reaches the query string when it is a plain A-Za-z_- string.



I find this confusing, because the rule says to put whatever was matched into the query string.



Here is the original rule:




RewriteRule ^/puzzle/([A-Za-z_-]+)$ /puzzle.php?g=$1 [NC]



and here is the revised rule:




RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]



I made the change because I read somewhere that \w matches all alphabetic characters, whereas A-Z and so on match only the unaccented ones.
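To make the \w question concrete, here is a minimal sketch in Python, used only as a stand-in because the regex engine behind RewriteRule cannot be run in isolation. It assumes the engine compares raw bytes rather than decoded Unicode characters, which is how a PCRE pattern behaves unless UTF-8 mode is switched on. Whether that applies to a particular Apache build is an assumption to verify. The helper names are hypothetical.

```python
import re

# Hypothetical helpers for illustration; Python's `re` stands in for the
# regex engine behind RewriteRule.

def matches_as_bytes(segment: str) -> bool:
    """Apply \\w+ to the UTF-8 bytes, where \\w covers only [A-Za-z0-9_]."""
    return re.fullmatch(rb"\w+", segment.encode("utf-8")) is not None

def matches_as_text(segment: str) -> bool:
    """Apply \\w+ to decoded text, where \\w also covers accented letters."""
    return re.fullmatch(r"\w+", segment) is not None

if __name__ == "__main__":
    for segment in ["USA", "México", "New-York", "abc123"]:
        print(segment, matches_as_bytes(segment), matches_as_text(segment))
```

Note that \w is not a superset of the original class either way: it adds digits and drops the hyphen, so a name like New-York fails both checks.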



It doesn't seem to matter which of those rules I use. Here is what happens.



In the application I have this:




echo $_GET['g'];



If I feed it a URL like http://mydomain.com/puzzle/USA it echoes out "USA" and works fine.

If I feed it a URL like http://mydomain.com/puzzle/México it echoes nothing, warns me that index g is not defined, and of course doesn't load the resources for Mexico.

If I feed it a URL like http://mydomain.com/puzzle/fuzzle/buzzle/j.qle it does the same thing.

This last case should be a 404!



And it does this no matter which of the above rules I use. I configured a rewrite log:




RewriteLogLevel 5
RewriteLog /opt/local/apache2/logs/puzzles.httpd.rewrite



but it is empty.



Here is an excerpt from the regular access log (both requests return status 200):




[26/May/2010:11:21:42 -0700] "GET /puzzle/M%C3%A9xico HTTP/1.1" 200 342
[26/May/2010:11:21:54 -0700] "GET /puzzle/M/l.foo HTTP/1.1" 200 342
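The %C3%A9 in the first log line is the percent-encoded form of the two UTF-8 bytes for é. A short Python sketch (Python is only a stand-in here, not part of the original setup) shows the round trip between the logged form, the decoded text, and the raw bytes:

```python
from urllib.parse import quote, unquote

# Path segment exactly as it appears in the access log above.
logged = "M%C3%A9xico"

decoded = unquote(logged)            # percent-decoding, bytes read as UTF-8
raw_bytes = decoded.encode("utf-8")  # the bytes a byte-oriented regex would see

print(decoded)         # México
print(raw_bytes)       # b'M\xc3\xa9xico'
print(quote(decoded))  # M%C3%A9xico
```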



What can I do to get these $%#$@(*#@!!! characters, but not slash, dot, or other non-alphabetic characters, into my program? And once they are there, will it decode them correctly? Would POSIX character classes work any better? Is there anything else I need to configure?
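One way to reason about the requirement, sketched here in Python rather than the PHP the application uses: decode the segment first, then check it against a Unicode-aware pattern that allows letters, underscore, and hyphen only. The function name and the exact character set are assumptions for illustration, not a tested fix for the Apache configuration.

```python
import re
from urllib.parse import unquote

# [^\W\d] means "a word character that is not a decimal digit": roughly
# letters in any script plus underscore. The hyphen is added explicitly.
ALLOWED = re.compile(r"(?:[^\W\d]|-)+")

def is_valid_segment(encoded_segment: str) -> bool:
    """Hypothetical check: decoded segment holds only letters, '_' or '-'."""
    return ALLOWED.fullmatch(unquote(encoded_segment)) is not None

if __name__ == "__main__":
    for candidate in ["USA", "M%C3%A9xico", "M/l.foo", "fuzzle/buzzle/j.qle"]:
        print(candidate, is_valid_segment(candidate))
```

In PHP the same idea would rely on preg_match with the u modifier, so that the pattern is applied to UTF-8 characters rather than single bytes.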


Source: Tips4all

1 comment:

  1. On...
    RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]


    Someone correct me if I'm wrong, but wouldn't this mean that GET requests asking for subdirectories simply bypass this rule?

    Also, a lazy way to solve this is to include the '%' character in the group. As far as I know, all you're allowed to work with in any URL path is URL encoding. See: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

    I'm sure there are more advanced and better ways to do this, but that should solve your immediate problem.
