Wednesday, May 30, 2012

What is the best collation to use for MySQL (with PHP)


I'm wondering if there is a "best" choice for collation in MySQL for a general website where you aren't 100% sure what will be entered. I understand that all the encodings should be the same across MySQL, Apache, the HTML, and anything inside PHP.



In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.
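
For context, this is roughly what I do at the moment after connecting from PHP (just a sketch, with utf8_unicode_ci as a placeholder for whichever collation turns out to be the right choice, not a recommendation):

-- sent from PHP right after connecting (e.g. with mysqli_query() or PDO::exec());
-- the collation name here is only a placeholder
SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';

-- tables are then declared with the same character set/collation
CREATE TABLE `example` (
    `title` VARCHAR(255)
) CHARACTER SET utf8 COLLATE utf8_unicode_ci;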


Source: Tips4all

7 comments:

  1. The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin, which compares characters by their binary values.

    utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The language-specific utf8 collations (such as utf8_swedish_ci) contain additional rules that make them the most accurate for sorting in those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.

    You can read more about the specific Unicode character sets in the MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
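
    As a quick illustration (just a sketch; the table and column names are made up), the collation can be chosen per table, per column, or even per query:

    -- table-level default plus a per-column override (names are made up)
    CREATE TABLE `articles` (
        `title` VARCHAR(255),
        `tag`   VARCHAR(64) COLLATE utf8_bin
    ) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

    -- ...or override the collation for a single sort or comparison
    SELECT `title` FROM `articles` ORDER BY `title` COLLATE utf8_swedish_ci;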

  2. Be very, very aware of this problem that can occur when using utf8_general_ci.

    MySQL will not distinguish between some characters in SELECT statements if the utf8_general_ci collation is used. This can lead to very nasty bugs, especially where usernames are involved, for example. Depending on the implementation that uses the database tables, this problem could allow a malicious user to create a username matching an administrator account.

    This problem exposes itself at the very least in early 5.x versions; I'm not sure whether this behaviour has changed in later releases.

    I'm no DBA, but to avoid this problem, I always go with 'utf8_bin' instead of a case-insensitive collation.

    The script below describes the problem by example.

    -- first, create a sandbox to play in
    CREATE DATABASE `sandbox`;
    use `sandbox`;

    -- next, make sure that your client connection uses the same
    -- character set/collation as the one we're going to test next:
    SET NAMES utf8 COLLATE utf8_general_ci;

    -- now, create the table and fill it with values
    CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
    CHARACTER SET utf8 COLLATE utf8_general_ci;

    INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');

    -- (verify)
    SELECT * FROM `test`;

    -- now, expose the problem/bug:
    SELECT * FROM test WHERE `value` = 'value';

    --
    -- Note that we get BOTH keys here! MySQL's utf8 collations that are
    -- case-insensitive (ending with _ci) do not distinguish between
    -- the two values!
    --
    -- collate 'utf8_bin' doesn't have this problem, as I'll show next:
    --

    -- first, reset the client connection character set/collation
    SET NAMES utf8 COLLATE utf8_bin;

    -- next, convert the values that we've previously inserted in the table
    ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

    -- now, re-check for the bug
    SELECT * FROM test WHERE `value` = 'value';

    --
    -- Note that we get just one key now, as you'd expect.
    --
    -- This problem appears to be specific to utf8. Next, I'll try to
    -- do the same with the 'latin1' charset:
    --

    -- first, reset the client connection character set/collation
    SET NAMES latin1 COLLATE latin1_general_ci;

    -- next, convert the values that we've previously inserted
    -- in the table
    ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;

    -- now, re-check for the bug
    SELECT * FROM test WHERE `value` = 'value';

    --
    -- Again, only one key is returned (expected). This shows
    -- that the problem with utf8/utf8_general_ci isn't present
    -- in latin1/latin1_general_ci
    --
    -- To complete the example, I'll check with the binary collate
    -- of latin1 as well:

    -- first, reset the client connection character set/collation
    SET NAMES latin1 COLLATE latin1_bin;

    -- next, convert the values that we've previously inserted in the table
    ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;

    -- now, re-check for the bug
    SELECT * FROM test WHERE `value` = 'value';

    --
    -- Again, only one key is returned (expected).
    --
    -- Finally, I'll re-introduce the problem in the exact same
    -- way (for any sceptics out there):

    -- first, reset the client connection character set/collation
    SET NAMES utf8 COLLATE utf8_general_ci;

    -- next, convert the values that we've previously inserted in the table
    ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

    -- now, re-check for the problem/bug
    SELECT * FROM test WHERE `value` = 'value';

    --
    -- Two keys.
    --

    DROP DATABASE sandbox;

  3. Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.

    Example from the documentation:


    utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.


    So it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice; for other languages, like Swedish, special collations have been created.
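
    A quick way to see the ‘ß’ difference for yourself (a sketch; assumes a utf8 client connection and MySQL's utf8 collations):

    -- 1 means equal, 0 means not equal under the given collation
    SELECT _utf8 'ß' COLLATE utf8_general_ci = _utf8 's'  AS general_s,   -- 1
           _utf8 'ß' COLLATE utf8_general_ci = _utf8 'ss' AS general_ss,  -- 0
           _utf8 'ß' COLLATE utf8_unicode_ci = _utf8 'ss' AS unicode_ss;  -- 1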

  4. Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.


    utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
    utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages


    However, if you are only using this to store English text, these shouldn't differ.
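
    To compare the two against your own data (a sketch; `people` and `name` are made up, and `name` is assumed to be stored as utf8), you can force each collation in the ORDER BY:

    -- hypothetical table: the same data, sorted under each collation
    SELECT `name` FROM `people` ORDER BY `name` COLLATE utf8_general_ci;
    SELECT `name` FROM `people` ORDER BY `name` COLLATE utf8_unicode_ci;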

  5. Essentially, it depends on how you think of a string.

    I always use utf8_bin because of the problem highlighted by Guus. In my opinion, as far as the database is concerned, a string is still just a string: a number of UTF-8 characters. A character has a binary representation, so why does the database need to know the language you're using? Usually, people will be constructing databases for systems with the scope for multilingual sites. This is the whole point of using UTF-8 as a character set. I'm a bit of a purist, but I think the bug risks heavily outweigh the slight advantage you may get on indexing. Any language-related rules should be applied at a much higher level than the DBMS.

    In my book, "value" should never in a million years be equal to "valúe".

    If I want to store a text field and do a case-insensitive search, I will use MySQL string functions such as LOWER() together with PHP functions such as strtolower().
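
    For example (a sketch; `users` and `username` are made up), a case-insensitive lookup against a utf8_bin column could look like this, with the search term lower-cased in PHP via strtolower() before it is bound:

    -- `username` is declared with COLLATE utf8_bin; the literal is assumed
    -- to have been lower-cased in PHP (strtolower) before the query is built
    SELECT * FROM `users` WHERE LOWER(`username`) = 'guus';

    Wrapping the column in LOWER() does mean MySQL can't use an index on `username` for this lookup, so it's a trade-off.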

  6. For UTF-8 textual information, you should use utf8_general_ci because...


    utf8_bin: compares strings by the binary value of each character in the string
    utf8_general_ci: compares strings using general language rules and case-insensitive comparisons


    In other words, it should make searching and indexing the data faster, more efficient, and more useful.
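
    Roughly, the difference looks like this (a sketch; assumes a utf8 client connection):

    -- 1 means equal, 0 means not equal under the given collation
    SELECT _utf8 'A' COLLATE utf8_general_ci = _utf8 'a' AS ci_match,   -- 1: case-insensitive
           _utf8 'A' COLLATE utf8_bin        = _utf8 'a' AS bin_match;  -- 0: byte-for-byte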

  7. For the case highlighted by Guus, I would strongly suggest using utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).

    If the field is intended to be searched, as opposed to matched exactly for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive, and one matches loosely (‘ß’ is equal to ‘s’, and not to ‘ss’). There are also language-specific versions, like utf8_german_ci, where the loose matching is more suitable for the language specified.
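
    One way to apply that (a sketch; the table and columns are hypothetical, and utf8_bin stands in where a case-sensitive Unicode collation isn't available) is to pick the collation per column according to how it will be queried:

    -- hypothetical table: strict matching for login names, loose matching for search
    CREATE TABLE `members` (
        `username`     VARCHAR(32) COLLATE utf8_bin,         -- matched exactly
        `display_name` VARCHAR(64) COLLATE utf8_general_ci   -- searched case/accent-insensitively
    ) CHARACTER SET utf8;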
