Tuesday, June 5, 2012

Detect encoding and make everything UTF-8


I'm reading out lots of texts from various RSS feeds and inserting them into my database.



Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO-8859-1.



Unfortunately, there are sometimes problems with the encodings of the texts. Example:



1) The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly.



2) Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course.



3) In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.



What can I do to avoid the cases 2 and 3?



How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?



Can you help me and tell me how to make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are: 1) how to find out what encoding the text uses, and 2) how to convert it to UTF-8, whatever the old encoding is.



Thanks in advance!



EDIT: Would a function like this work?




function correct_encoding($text) {
$current_encoding = mb_detect_encoding($text, 'auto');
$text = iconv($current_encoding, 'UTF-8', $text);
return $text;
}



I've tested it but it doesn't work. What's wrong with it?


Source: Tips4all

19 comments:

  1. You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.



    Edit   Here is what I probably would do:

    I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field, which holds the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specification says to assume UTF-8.

    $url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

    $accept = array(
        'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
        'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
    );
    $header = array(
        'Accept: '.implode(', ', $accept['type']),
        'Accept-Charset: '.implode(', ', $accept['charset']),
    );
    $encoding = null;
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    $response = curl_exec($curl);
    if (!$response) {
        // error fetching the response
    } else {
        // split the raw response into header and body
        $offset = strpos($response, "\r\n\r\n");
        $header = substr($response, 0, $offset);
        $body = substr($response, $offset + 4);
        if (!$header || !preg_match('/^Content-Type:\s+([^;\r\n]+)(?:;\s*charset=([^;\r\n]*))?/im', $header, $match)) {
            // error parsing the response
        } else {
            if (!in_array(strtolower(trim($match[1])), array_map('strtolower', $accept['type']))) {
                // type not accepted
            }
            // the charset parameter is optional, so $match[2] may be missing
            $encoding = isset($match[2]) ? strtolower(trim($match[2], "\"' \t\r\n")) : null;
        }
        if (!$encoding) {
            // no charset in the HTTP header: look at the XML declaration instead
            if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
                $encoding = strtolower(trim($match[1], '"\''));
            }
        }
        if (!$encoding) {
            $encoding = 'utf-8';
        } else {
            if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
                // encoding not accepted
            }
            if ($encoding != 'utf-8') {
                $body = mb_convert_encoding($body, 'utf-8', $encoding);
                // keep the XML declaration in sync, or the parser would convert the body a second time
                $body = preg_replace('/^(<\?xml[^>]*encoding=)("[^"]*"|\'[^\']*\')/', '$1"utf-8"', $body, 1);
            }
        }
        $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
        if (!$simpleXML) {
            // parse error
        } else {
            echo $simpleXML->asXML();
        }
    }

  2. Detecting the encoding is hard.

    mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean different things). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.

    As long as you only deal with Western European languages, the three major encodings to consider are UTF-8, ISO-8859-1 and CP-1252. Since these are the defaults on many platforms, they are also the ones most likely to be reported incorrectly. People who use a different encoding are likely to declare it honestly, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider unless the encoding is reported as one of those three. You should still double-check that the input is indeed valid, using mb_check_encoding (note that being valid in an encoding is not the same as being in that encoding; the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.

    Once you've detected the encoding, you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
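
The strategy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name feedToUtf8 and the $declared parameter (whatever charset the feed claims) are hypothetical, and the explicit UTF-8 validity check before mb_detect_encoding is a choice made here so that the result does not depend on how a given PHP version ranks candidates.

```php
<?php
// Hypothetical helper: $declared is the charset the provider reported (may be null).
function feedToUtf8(string $text, ?string $declared = null): string
{
    // The three encodings that are most often misreported for Western European text.
    $suspects = array('utf-8', 'iso-8859-1', 'windows-1252');
    $known = array_map('strtolower', mb_list_encodings());
    $name = $declared === null ? '' : strtolower(trim($declared));

    // Trust the provider if it names a known encoding outside the usual suspects,
    // but still verify that the bytes are valid in that encoding.
    if ($name !== '' && in_array($name, $known, true) && !in_array($name, $suspects, true)) {
        if (!mb_check_encoding($text, $name)) {
            throw new RuntimeException("Input is not valid $declared");
        }
        return mb_convert_encoding($text, 'UTF-8', $name);
    }

    // Otherwise distinguish between the suspects: valid UTF-8 is kept as-is.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    $detected = mb_detect_encoding($text, array('ISO-8859-1', 'Windows-1252'), true);
    if ($detected === false) {
        throw new RuntimeException('Could not determine the encoding');
    }
    return mb_convert_encoding($text, 'UTF-8', $detected);
}
```

A feed that claims ISO-8859-1 but actually delivers UTF-8 is returned unchanged, while real ISO-8859-1 bytes are converted once.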

  3. If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output.
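
A small demonstration of that effect, as a sketch: it uses mb_convert_encoding from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode() (utf8_encode itself is deprecated as of PHP 8.2). The sample word is taken from the question.

```php
<?php
// "Fußball" as UTF-8 bytes; the ß is the two-byte sequence C3 9F.
$utf8 = "Fu\xC3\x9Fball";

// Treating the UTF-8 bytes as ISO-8859-1 and encoding them again
// turns each of the two bytes into its own two-byte sequence.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;   // 4675c39f62616c6c
echo bin2hex($twice), PHP_EOL;  // 4675c383c29f62616c6c
echo strlen($utf8), ' -> ', strlen($twice), PHP_EOL; // 8 -> 10
```

The one character that was a single byte in ISO-8859-1 now occupies four bytes, which is the garbling seen in case 2 of the question.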

    I made a function that addresses all these issues. It's called Encoding::toUTF8().

    You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1) or UTF8, or the string can have a mix of the two. Encoding::toUTF8() will convert everything to UTF8.

    I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

    Usage:

    $utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

    $latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);


    Download:

    http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip

    Update:

    I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

    Usage:

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);


    Examples:

    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");
    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂÂÂÂ©dÃÂÂÂÂ©ration Camerounaise de Football");


    will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football


    Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

  4. This cheatsheet lists some common caveats related to UTF-8 handling in PHP:
    http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

    This function detecting multibyte characters in a string might also prove helpful (source):


    function detectUTF8($string)
    {
        return preg_match('%(?:
            [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
            |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
            |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )+%xs',
            $string);
    }

  5. A little heads-up: you said that the "ß" should be displayed as "ÃŸ" in your database.

    This is probably because you're using a database with the latin1 character encoding, or possibly your PHP-MySQL connection is set up wrong. That is, PHP believes MySQL is set to use UTF-8, so it sends data as UTF-8, but MySQL believes PHP is sending data encoded as ISO-8859-1, so it may once again try to encode your data as UTF-8, causing this kind of trouble.

    Take a look at this, may help you: http://php.net/manual/en/function.mysql-set-charset.php

  6. Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.

    So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).

  7. A really nice way to implement an isUTF8-function can be found on php.net:

    function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
    }
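
One caveat with the round trip above: utf8_decode replaces every character that does not exist in ISO-8859-1 with "?", so a perfectly valid UTF-8 string containing, say, "€" is reported as not being UTF-8. A minimal alternative sketch, assuming the mbstring extension is available (the name isValidUtf8 is made up for this example):

```php
<?php
// Checks byte-level validity only; it does not prove the text was meant as UTF-8.
function isValidUtf8(string $string): bool
{
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(isValidUtf8("Fu\xC3\x9Fball")); // UTF-8 bytes for "Fußball"
var_dump(isValidUtf8("Fu\xDFball"));     // ISO-8859-1 bytes for "Fußball"
var_dump(isValidUtf8("\xE2\x82\xAC"));   // UTF-8 bytes for "€"
```

This also avoids utf8_encode and utf8_decode, which are deprecated as of PHP 8.2.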

  8. It's simple: when you get something that's not UTF8, you must ENCODE that INTO utf8.

    So, when you're fetching a certain feed that's ISO-8859-1, pass it through utf8_encode.

    However, if you're fetching a UTF-8 feed, you don't need to do anything.

  9. Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had iso-8859-1, converted from iso-8859-1 to utf-8, and treated the new string as iso-8859-1 for another conversion into UTF-8.

    Here's some pseudocode of what you did:

    $inputstring = getFromUser();
    $utf8string = iconv($current_encoding, 'utf-8', $inputstring);
    $flawedstring = iconv($current_encoding, 'utf-8', $utf8string);


    You should try:


    1) Detect the encoding using mb_detect_encoding() or whatever you like to use.
    2) If it's UTF-8, convert into iso-8859-1, and repeat step 1.
    3) Finally, convert back into UTF-8.


    That is presuming that in the "middle" conversion you used iso-8859-1. If you used windows-1252, then convert into windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.

    This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

    German language also uses iso-8859-2 and windows-1250 (latin2).
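
The repair steps above might look like this in PHP. It is a sketch, not a definitive fix: the function name undoDoubleUtf8 and the $maxRounds limit are invented for the example, and it assumes the flawed "middle" conversion used ISO-8859-1. Instead of calling mb_detect_encoding it checks validity directly and only accepts a round if converting back reproduces the input exactly.

```php
<?php
// Peels off layers of accidental "ISO-8859-1 to UTF-8" re-encoding.
function undoDoubleUtf8(string $text, string $middle = 'ISO-8859-1', int $maxRounds = 3): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        if (!mb_check_encoding($text, 'UTF-8')) {
            break; // not UTF-8 at all, nothing to peel off
        }
        $decoded = mb_convert_encoding($text, $middle, 'UTF-8');

        // Stop when nothing changed (pure ASCII), when the result is no longer
        // valid UTF-8 (we reached the original), or when the step was lossy.
        if ($decoded === $text
            || !mb_check_encoding($decoded, 'UTF-8')
            || mb_convert_encoding($decoded, 'UTF-8', $middle) !== $text) {
            break;
        }
        $text = $decoded;
    }
    return $text;
}

// Build a double-encoded sample and repair it.
$good = "Fu\xC3\x9Fball";
$double = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');
echo bin2hex(undoDoubleUtf8($double)), PHP_EOL; // 4675c39f62616c6c
```

Note the heuristic can misfire on text that legitimately contains sequences such as "Ã©", so it is best applied only to data known to be damaged.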

  10. I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

    Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

    //Convert everything in our vars to UTF-8 for playing nice with the database...
    //Use some auto detection here to help us not double-encode...
    //Suppress possible warnings with @'s for when encoding cannot be detected
    try
    {
        $process = array(&$_GET, &$_POST, &$_REQUEST);
        //each() was removed in PHP 8, so walk the growing queue by index instead
        for ($key = 0; $key < count($process); $key++) {
            $val = $process[$key];
            foreach ($val as $k => $v) {
                unset($process[$key][$k]);
                $newKey = @mb_convert_encoding($k,'UTF-8','auto');
                if (is_array($v)) {
                    $process[$key][$newKey] = $v;
                    //queue nested arrays so they get converted too
                    $process[] = &$process[$key][$newKey];
                } else {
                    $process[$key][$newKey] = @mb_convert_encoding($v,'UTF-8','auto');
                }
            }
        }
        unset($process);
    }
    catch(Exception $ex){}

  11. When you try to handle multiple languages such as Japanese and Korean, you might get into trouble. mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help, since it will detect EUC-* wrongly.

    I concluded that as long as the input strings come from HTML, the 'charset' in a meta element should be used. I use Simple HTML DOM Parser because it supports invalid HTML.

    The snippet below extracts the title element from a web page. If you would like to convert the entire page, you may want to remove some lines.

    <?php
    require_once 'simple_html_dom.php';

    echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;

    function convert_title_to_utf8($contents)
    {
    $dom = str_get_html($contents);
    $title = $dom->find('title', 0);
    if (empty($title)) {
    return null;
    }
    $title = $title->plaintext;
    $metas = $dom->find('meta');
    $charset = 'auto';
    foreach ($metas as $meta) {
    if (!empty($meta->charset)) { // html5
    $charset = $meta->charset;
    } else if (preg_match('@charset=(.+)@', $meta->content, $match)) {
    $charset = $match[1];
    }
    }
    if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
    $charset = 'auto';
    }
    return mb_convert_encoding($title, 'UTF-8', $charset);
    }

  12. php.net/mb_detect_encoding

    echo mb_detect_encoding($str, "auto");


    or

    echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");


    I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and check whether mb_detect_encoding works or not.

    Update:
    "auto" expands to a list that depends on the mbstring.language setting; for Japanese it is "ASCII,JIS,UTF-8,EUC-JP,SJIS". mb_detect_encoding returns the detected charset, which you can use to convert the string to UTF-8 with iconv.

    <?php
    function convertToUTF8($str) {
    $enc = mb_detect_encoding($str);

    if ($enc && $enc != 'UTF-8') {
    return iconv($enc, 'UTF-8', $str);
    } else {
    return $str;
    }
    }
    ?>


    I haven't tested it, so no guarantee. And maybe there's a simpler way.

  13. As already mentioned above: encoding issues can be quite tedious.

    I've used a guide on
    http://www.phpwact.org/php/i18n/charsets (with a link to a dedicated utf-8 guide), and this resolved my issues. The page is still under construction, but it does provide a very precise description of the relevant issues when using utf-8.

    It sounds like case 3 is what you actually want: the characters are correct in the database. Usually it is sufficient to apply utf8_encode once before displaying the string.

  14. @harpax that worked for me. In my case, this is good enough:

    if (isUTF8($str)) {
    echo $str;
    }
    else
    {
    echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
    }

  15. I have been looking for solutions to encoding problems for AGES, and this page is probably the conclusion of years of searching!
    I tested some of the suggestions you mentioned, and here are my notes:

    This is my test string:

    this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chàrs to see thèm, convertèd by fùnctìon!! & that's it!

    I want to save this string in a DB field whose collation is set to "utf8_general_ci".

    charset of my page is "UTF-8"

    If I do an INSERT just like that, in my DB I get some chars probably coming from Mars...
    so I need to convert them into some "sane" UTF-8.
    I tried utf8_encode(), but alien chars were still invading my database...

    So I tried to use the function "forceUTF8" posted on number 8, but in the DB the saved string looks like this:

    this is a "wrÃ²ng wrÃ¬tten" string bÃ¹t I nÃ¨ed to pÃ¹ 'sÃ²me' special chÃ rs to see thÃ¨m, convertÃ¨d by fÃ¹nctÃ¬on!! & that's it!

    So, collecting some more info on this page and merging it with info from other pages, I solved my problem with this solution:

    $finallyIDidIt = mb_convert_encoding($string,mysql_client_encoding($resourceID),mb_detect_encoding($string));

    Now in my database I have my string with the correct encoding.

    NOTE:
    The only thing to take care of is the function mysql_client_encoding!
    You need to be connected to the DB, because this function wants a resource ID as a parameter.

    But I just do that re-encoding right before my INSERT, so for me it is not a problem.

    I hope this will help someone like this page helped me!

    thanks to everybody!

    Mauro

  16. You need to test the charset of the input, since requests can arrive in different encodings.
    I force all incoming content into UTF-8 by doing detection and conversion with the following function:

    function fixRequestCharset()
    {
        $ref = array( &$_GET, &$_POST, &$_REQUEST );
        foreach ( $ref as &$var )
        {
            foreach ( $var as $key => $val )
            {
                // nested arrays (e.g. name[]=...) are skipped; only plain strings are converted
                if ( !is_string( $val ) ) continue;
                $encoding = mb_detect_encoding( $val, mb_detect_order(), true );
                if ( !$encoding ) continue;
                if ( strcasecmp( $encoding, 'UTF-8' ) != 0 )
                {
                    $converted = iconv( $encoding, 'UTF-8', $val );
                    if ( $converted === false ) continue;
                    $var[ $key ] = $converted;
                }
            }
        }
        unset( $var ); // break the reference left over from the loop
    }


    That routine will turn all PHP variables that come from the remote host into UTF-8,
    or leave a value untouched if its encoding could not be detected or converted.
    You can customize it to your needs.
    Just invoke it before using the variables.

  17. After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.

    Example: set character set utf8

    Passing utf8 data to a latin1 table in a latin1 I/O session gives those nasty bird's feet. I see this every other day in osCommerce shops. Read back through the same connection, it might seem right, but phpMyAdmin will show the truth. By telling MySQL what charset you are passing, you let it handle the conversion of the data for you.
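
One way to declare the connection charset from PHP is in the PDO connection string. The following is a sketch only: the helper name utf8Dsn, the host, the database name and the credentials are placeholders, and it uses utf8mb4 (MySQL's full UTF-8 charset, which needs a reasonably recent server) where the example above uses utf8.

```php
<?php
// Hypothetical helper that builds a PDO MySQL DSN with an explicit connection charset.
function utf8Dsn(string $host, string $database): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $database);
}

$dsn = utf8Dsn('localhost', 'feeds');
echo $dsn, PHP_EOL;

// Placeholder credentials; uncomment once a real server is available:
// $pdo = new PDO($dsn, 'user', 'secret');
// The mysqli equivalent is $mysqli->set_charset('utf8mb4');
```

Setting the charset at connection time has the same effect as issuing the SET statement yourself, and it also lets the client library escape strings correctly.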

    How to recover existing scrambled mysql data is another thread to discuss. :)

  18. This version is for the German language, but you can modify $CHARSETS and $TESTCHARS.


    class CharsetDetector
    {
    private static $CHARSETS = array(
    "ISO_8859-1",
    "ISO_8859-15",
    "CP850"
    );
    private static $TESTCHARS = array(
    "€",
    "ä",
    "Ä",
    "ö",
    "Ö",
    "ü",
    "Ü",
    "ß"
    );
    public static function convert($string)
    {
    return self::__iconv($string, self::getCharset($string));
    }
    public static function getCharset($string)
    {
    $normalized = self::__normalize($string);
    if(!strlen($normalized))return "UTF-8";
    $best = "UTF-8";
    $charcountbest = 0;
    foreach (self::$CHARSETS as $charset) {
    $str = self::__iconv($normalized, $charset);
    $charcount = 0;
    $stop = mb_strlen( $str, "UTF-8");

    for( $idx = 0; $idx < $stop; $idx++)
    {
    $char = mb_substr( $str, $idx, 1, "UTF-8");
    foreach (self::$TESTCHARS as $testchar) {

    if($char == $testchar)
    {

    $charcount++;
    break;
    }
    }
    }
    if($charcount>$charcountbest)
    {
    $charcountbest=$charcount;
    $best=$charset;
    }
    //echo $text."<br />";
    }
    return $best;
    }
    private static function __normalize($str)
    {

    $len = strlen($str);
    $ret = "";
    for($i = 0; $i < $len; $i++){
    $c = ord($str[$i]);
    if ($c > 128) {
    if (($c > 247)) $ret .=$str[$i];
    elseif ($c > 239) $bytes = 4;
    elseif ($c > 223) $bytes = 3;
    elseif ($c > 191) $bytes = 2;
    else $ret .=$str[$i];
    if (($i + $bytes) > $len) $ret .=$str[$i];
    $ret2=$str[$i];
    while ($bytes > 1) {
    $i++;
    $b = ord($str[$i]);
    if ($b < 128 || $b > 191) {$ret .=$ret2; $ret2=""; $i+=$bytes-1;$bytes=1; break;}
    else $ret2.=$str[$i];
    $bytes--;
    }
    }
    }
    return $ret;
    }
    private static function __iconv($string, $charset)
    {
    return iconv ( $charset, "UTF-8" , $string );
    }
    }

  19. The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:

    // $input is actually UTF-8

    mb_detect_encoding($input, "ISO-8859-9, UTF-8");
    // ISO-8859-9 (WRONG!)

    mb_detect_encoding($input, "UTF-8, ISO-8859-9");
    // UTF-8 (OK)


    So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
