What's collate

12/27/2023

If you are reading this in 2020 or later, there are newer options that may be better than both utf8mb4_unicode_ci and utf8mb4_general_ci: for example, utf8mb4_unicode_520_ci. All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared. _unicode_ci and _general_ci are two different sets of rules for sorting and comparing text according to the way we expect. Newer versions of MySQL introduce new sets of rules, too, such as _unicode_520_ci for equivalent rules based on Unicode 5.2, or the MySQL 8.x specific _0900_ai_ci for equivalent rules based on Unicode 9.0 (and with no equivalent _general_ci variant). People reading this now should probably use one of these newer collations instead of either _unicode_ci or _general_ci. The description of those older collations below is provided for interest only.

MySQL is currently transitioning away from an older, flawed UTF-8 implementation. For now, you need to use utf8mb4 instead of utf8 for the character encoding part, to ensure you are getting the fixed version. The flawed version remains for backward compatibility, though it is being deprecated.

utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, which sorts accurately in a wide range of languages. utf8mb4_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many short-cuts designed to improve speed. It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations, such as when using particular languages or characters. On modern servers, this performance boost will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.

Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci

utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (i.e. "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences:

For example, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE" as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).

Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.

In non-Latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.

There is almost certainly no reason to use utf8mb4_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by other bottlenecks than this.

In the past, some people recommended using utf8mb4_general_ci except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all.
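The "ß"/"Œ" expansions and the ignorable characters described above can be mimicked with a toy sort key in Python. This is only a sketch under loud assumptions: the three lookup tables, the choice of the soft hyphen and zero-width space as ignorable characters, and both function names are hypothetical and chosen for illustration. They are not MySQL's real weight tables, and the code is not the algorithm MySQL uses.

```python
# Toy model only: these tables are illustrative assumptions,
# not the weight tables of any real MySQL collation.

# "general_ci"-like shortcut: each character maps to exactly one weight.
SINGLE_WEIGHT = {"ß": "s", "Œ": "e", "œ": "e"}

# "unicode_ci"-like rules: one character may expand to several weights...
EXPANSIONS = {"ß": "ss", "Œ": "oe", "œ": "oe"}
# ...and some characters are skipped entirely (picked here for illustration).
IGNORABLE = {"\u00ad", "\u200b"}  # soft hyphen, zero-width space


def general_like_key(text: str) -> str:
    """One weight per character: no expansions, nothing ignored."""
    return "".join(SINGLE_WEIGHT.get(ch, ch.lower()) for ch in text)


def unicode_like_key(text: str) -> str:
    """Expansions applied and ignorable characters dropped before comparing."""
    return "".join(
        EXPANSIONS.get(ch, ch.lower()) for ch in text if ch not in IGNORABLE
    )


if __name__ == "__main__":
    words = ["of", "Œuvre", "od"]
    # The two keys can place "Œuvre" in different positions.
    print(sorted(words, key=general_like_key))
    print(sorted(words, key=unicode_like_key))
    # Under the expansion rule, "Straße" and "Strasse" compare as equal.
    print(unicode_like_key("Straße") == unicode_like_key("Strasse"))
```

In this toy model the single-weight key treats "Œuvre" as if it started with "e", while the expansion key treats it as "oeuvre", which is the kind of difference the two collations show for such characters.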
