Thursday, April 12, 2012

How do I remove accents from characters in a PHP string?


I'm trying to remove accents from characters in a PHP string as the first step towards making the string usable in a URL.



I'm using the following code:




$input = "Fóø Bår";

setlocale(LC_ALL, "en_US.utf8");
$output = iconv("utf-8", "ascii//TRANSLIT", $input);

print($output);



I would expect output something like this:




F'oo Bar



However, instead of being transliterated, the accented characters are replaced with question marks:




F?? B?r



Everything I can find online indicates that setting the locale will fix this problem; however, I'm already doing that. I've also checked the following details:



  1. The locale I am setting is supported by the server (it is included in the list produced by locale -a).

  2. The source and target encodings (UTF-8 and ASCII) are supported by the server's version of iconv (they are included in the list produced by iconv -l).

  3. The input string is UTF-8 encoded (verified using PHP's mb_check_encoding function, as suggested in the answer by mercator).

  4. The call to setlocale is successful (it returns 'en_US.utf8' rather than FALSE).
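
Checks 3 and 4 can also be scripted, which makes it easier to compare servers. This is only a minimal sketch, assuming the mbstring and iconv extensions are loaded; the helper name diagnose_translit is hypothetical and does not come from the original question:

```php
<?php
// Hypothetical helper: runs checks 3 and 4 from the list above for one input string.
function diagnose_translit(string $input, string $locale = 'en_US.utf8'): array
{
    return [
        // Check 3: is the input valid UTF-8?
        'valid_utf8' => mb_check_encoding($input, 'UTF-8'),
        // Check 4: setlocale() returns the locale name on success, false on failure.
        'locale'     => setlocale(LC_ALL, $locale),
        // The conversion itself; '?' in the result means transliteration failed.
        // @ suppresses the notice some iconv builds raise for unconvertible input.
        'translit'   => @iconv('UTF-8', 'ASCII//TRANSLIT', $input),
    ];
}

var_dump(diagnose_translit("Fóø Bår"));
```

If valid_utf8 is true and locale is not false but translit still contains question marks, the iconv implementation itself is the likely culprit.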





The cause of the problem:



The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version.




Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results.

PHP manual's introduction to iconv




Details about the iconv implementation that is used by PHP are included in the output of the phpinfo function.
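
The same information can be read without scanning the full phpinfo output. A short sketch, assuming the iconv extension is loaded (it defines the ICONV_IMPL and ICONV_VERSION constants):

```php
<?php
// ICONV_IMPL names the implementation PHP was built against,
// typically "glibc" or "libiconv"; ICONV_VERSION is its version string.
printf("iconv implementation: %s (version %s)\n", ICONV_IMPL, ICONV_VERSION);
```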



(I'm not able to re-compile PHP with the correct iconv library on the server I'm working with for this project, so the answer I've accepted below is the one that was most useful for removing accents without iconv support.)


Source: Tips4all

8 comments:

  1. I think the problem here is that your encodings consider ä and å different symbols to 'a'. In fact, the PHP documentation for strtr offers a sample for removing accents the ugly way :(

    http://ie2.php.net/strtr

  2. You could use urlencode. It does not quite do what you want (remove accents), but it will give you a URL-usable string:

    $output = urlencode ($input);


    In Perl I could use a translate regex, but I cannot think of the PHP equivalent

    $input =~ tr/áâàå/aaaa/;


    etc...

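    You could do this using preg_replace:

    // The u modifier makes PCRE treat pattern and subject as UTF-8, so each
    // accented character in a class is matched as a whole character.
    $patterns[0] = '/[áâàåä]/u';
    $patterns[1] = '/[ðéêèë]/u';
    $patterns[2] = '/[íîìï]/u';
    $patterns[3] = '/[óôòøõö]/u';
    $patterns[4] = '/[úûùü]/u';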
    $patterns[5] = '/æ/';
    $patterns[6] = '/ç/';
    $patterns[7] = '/ß/';
    $replacements[0] = 'a';
    $replacements[1] = 'e';
    $replacements[2] = 'i';
    $replacements[3] = 'o';
    $replacements[4] = 'u';
    $replacements[5] = 'ae';
    $replacements[6] = 'c';
    $replacements[7] = 'ss';

    $output = preg_replace($patterns, $replacements, $input);


    (Please note this was typed from a foggy, beer-ridden Friday afternoon memory, so it may not be 100% correct.)

    Or you could make a hash table and do a replacement based on that.

  3. This is some code I found and use often:

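    function stripAccents($string){
    // The two-string form of strtr() maps byte by byte and corrupts multi-byte UTF-8,
    // so split both lists into characters and use the array form (assumes a UTF-8 source file).
    $from = preg_split('//u', 'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ', -1, PREG_SPLIT_NO_EMPTY);
    return strtr($string, array_combine($from, str_split('aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY')));
    }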

  4. I agree with georgebrock's comment.

    If you find a way to get //TRANSLIT to work, you can build friendly URLs:


    1. Use iconv with //TRANSLIT (ñ => n~).
    2. Remove non-alphanumeric, non-whitespace characters inside words: $url = preg_replace( '/(\w)[^\w\s](\w)/', '$1$2', $url );
    3. Replace remaining separators: $url = preg_replace( '/[^a-z0-9]+/', '-', $url );
    4. Remove double, leading, and trailing '-', e.g.: $url = preg_replace( '/(?:(^|\-)\-+|\-$)/', '', $url );


    If you can't get it to work, replace step 1 with strtr/character-based replacement, like Xetius' solution.

  5. I can't reproduce your problem. I get the expected result.

    How exactly are you using mb_detect_encoding() to verify your string is in fact UTF-8?

    If I simply call mb_detect_encoding($input) on both a UTF-8 and ISO-8859-1 encoded version of your string, both of them return "UTF-8", so that function isn't particularly reliable.

    iconv() gives me a PHP "notice" when it gets the wrongly encoded string and only echoes "F", but that might just be because of different PHP/iconv settings/versions (?).

    I suggest you try calling mb_check_encoding($input, "utf-8") first to verify that your string really is UTF-8. I think it probably isn't.

  6. When using iconv, the locale must be set:

    function test_enc($text = 'ěščřžýáíé ĚŠČŘŽÝÁÍÉ fóø bår FÓØ BÅR æ')
    {
    echo '<tt>';
    echo iconv('utf8', 'ascii//TRANSLIT', $text);
    echo '</tt><br/>';
    }

    test_enc();
    setlocale(LC_ALL, 'cs_CZ.utf8');
    test_enc();
    setlocale(LC_ALL, 'en_US.utf8');
    test_enc();


    This yields:

    ????????? ????????? f?? b?r F?? B?R ae
    escrzyaie ESCRZYAIE fo? bar FO? BAR ae
    escrzyaie ESCRZYAIE fo? bar FO? BAR ae


    I don't have any locales other than cs_CZ and en_US installed, so I can't test others.

    In C# I have seen a solution that converts the string to a Unicode normalized form: the accents are split off from their base characters and then filtered out by the nonspacing mark Unicode category.

  7. One of the tricks I stumbled upon on the web is using htmlentities and then stripping the encoded characters:

    $stripped = preg_replace('`&[^;]+;`','',htmlentities($string));


    It's not perfect, but it works well in some cases.

    However, you're writing about creating a URL string, so urlencode and its counterpart urldecode may be better. Or, if you are creating a query string, use http_build_query.

  8. You can use this class for removing unwanted characters, but it still does not solve your problem.

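
The suggestions in comments 2 to 4 can be combined into a slug function that does not depend on iconv at all. This is a minimal sketch, assuming the mbstring extension is loaded and that both the source file and the input are UTF-8; the function name slugify and the short character map are illustrative only and would need extending for other languages:

```php
<?php
// Hypothetical helper, not taken verbatim from any of the comments.
function slugify(string $text): string
{
    // The array form of strtr() replaces whole substrings, so it is safe
    // for multi-byte UTF-8 characters (the two-string form works byte by byte).
    static $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ä' => 'a', 'å' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'í' => 'i', 'ì' => 'i', 'î' => 'i', 'ï' => 'i',
        'ó' => 'o', 'ò' => 'o', 'ô' => 'o', 'ö' => 'o', 'ø' => 'o',
        'ú' => 'u', 'ù' => 'u', 'û' => 'u', 'ü' => 'u',
        'ñ' => 'n', 'ç' => 'c', 'æ' => 'ae', 'ß' => 'ss',
    ];

    // Lowercase first so the lowercase-only map also covers capitals.
    $text = strtr(mb_strtolower($text, 'UTF-8'), $map);
    // Collapse every run of remaining non-alphanumeric bytes into one hyphen.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);
    // Drop hyphens left at either end.
    return trim($text, '-');
}

echo slugify("Fóø Bår"), "\n"; // foo-bar
```

Characters missing from the map are not transliterated; they end up as hyphens, so the result is always URL-safe even when the map is incomplete.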