Collation Order


What is Collation Order?

Different languages around the world use different schemes for sorting or ordering textual values. It is not uncommon for computer programs to order text values based on the character code of each successive character. This works pretty well for English, since the 26 letters that make up the English alphabet have consecutive ASCII codes, but it results in strange orderings when there are letters with accents. For example, most dictionaries would place the word "naïve" between the words "nag" and "name", but since ï (i with a diaeresis) has a character code of 239 in the ANSI extension of ASCII, which comes after z with its code of 122, a straight character-code sort places "naïve" after "navigate".
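The effect is easy to reproduce outside Analytica. The following minimal Python sketch (Python is used here only as an illustration; the variable names are made up for the example) sorts the four words by raw character code:

```python
# Sort by raw character code, the way a naive text comparison does.
# "\u00ef" is the single character i-with-diaeresis (code 239).
words = ["nag", "na\u00efve", "name", "navigate"]

code_point_order = sorted(words)
print(code_point_order)

# The accented letter has a larger code than every unaccented letter.
print(ord("\u00ef"), ord("z"))
```

Because 239 is greater than the code of "v" (118) as well, "naïve" lands after "navigate" rather than between "nag" and "name".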

Language-specific collation uses the rules of a particular language (e.g., Finnish, Russian, Spanish, etc.) to determine the ordering of words. In almost all languages, you'll find that all variations of accented e (e.g., é, è, ë, ȇ etc.) come between d and f. There are also some orderings that change between languages. For example, in a Spanish-language collation order, "cha" would come after "cza", whereas in English the order would be reversed. In Swedish, ö > z, whereas in German it is the other way around.
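To make the Swedish and German difference concrete, here is a toy Python sketch. The two key functions are hypothetical simplifications written for this example: the Swedish one ranks letters by their position in an alphabet that ends in å, ä, ö, and the German one treats an umlauted vowel like its base vowel. Neither is the algorithm Analytica uses, and both handle only lowercase letters of the respective alphabet.

```python
# Toy alphabet: a-z followed by a-ring, a-diaeresis, o-diaeresis.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    # Rank each letter by its position in the Swedish alphabet.
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

# Map a/o/u with diaeresis onto the plain vowels.
GERMAN_FOLD = str.maketrans("\u00e4\u00f6\u00fc", "aou")

def german_key(word):
    # Compare umlauted vowels as if they were the base vowels.
    return word.lower().translate(GERMAN_FOLD)

words = ["zebra", "\u00f6l", "ost"]  # "\u00f6l" is o-diaeresis + "l"

swedish_order = sorted(words, key=swedish_key)
german_order = sorted(words, key=german_key)
print(swedish_order)
print(german_order)
```

Under the Swedish key the word starting with ö sorts last, after "zebra"; under the German key it sorts first, ahead of "ost".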

TextLocale

In Analytica 4.5, the new system variable TextLocale holds the locale that determines collation order. This usually holds the name of the language. For example, if you want to use the Swedish collation order, you would set this system variable to Swedish. To do this, make sure no nodes are selected, then select Definition → System Variables → TextLocale from the Analytica menus. The object window for the system variable appears, where you can set the definition. The value should not be surrounded by quotes.

When you set TextLocale, the value is a property of your model, not a property of your computer or your installation of Analytica. Changing the system variable does not invalidate previously computed values that compared or sorted text.

ANSI order

You also have the option of setting TextLocale to "ANSI" (or synonymously "ASCII") if you really want pure character-code collation, or to "Regional" if you want it to use the end-user's native language (when you are sharing models between people in different countries). The "ANSI" order may be desirable if you have legacy (pre-Analytica 4.4) models containing algorithms that rely on ASCII ordering. If you don't set the system variable, it will behave as if set to "Regional"; however, you should be aware that the results of your model (those things that rely on text ordering) might not be identical for users in different countries.

Functions and operators that use TextLocale

TextLocale determines how the operators <, >, <=, and >= compare text. It also impacts how the functions SortIndex, Rank, Sort and RankCorrel order text values relative to each other.

Properties

Collation order lacks several properties you might be inclined to assume (see Unicode Collation Algorithm). A few of these are listed here.

Order is not determined by the character codes of characters: a character with a smaller code will often come after one with a larger code, and vice versa.

Collation order is not preserved under concatenation or substring operations in general.

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y

This last property implies that you cannot determine the ordering in general by looking at the first N characters, even if the text differs on the first N characters. Also, there are cases where A < B and C < D, but when you concatenate these, AC > BD.

Although these properties do not hold for non-ANSI collation orders, they do hold (only) for the ANSI collation order.
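The first of these non-properties can be sketched with the Spanish example from earlier on this page. The key function below is a hypothetical toy, not Analytica's algorithm: it assumes lowercase unaccented words and models the traditional Spanish treatment of "ch" as a single letter after "c" by rewriting it to "c{", since "{" has the character code right after "z".

```python
def traditional_spanish_key(word):
    # Toy model: "ch" behaves as one letter that follows "cz...".
    return word.lower().replace("ch", "c{")

def spanish_less(a, b):
    # True when a sorts before b under the toy Spanish key.
    return traditional_spanish_key(a) < traditional_spanish_key(b)

x, y, z = "c", "cz", "h"

# Under the toy collation, x < y but appending z reverses the order.
collation_before = spanish_less(x, y)
collation_after = spanish_less(x + z, y + z)

# Under plain character-code comparison the order survives for this pair.
ansi_before = x < y
ansi_after = x + z < y + z

print(collation_before, collation_after)
print(ansi_before, ansi_after)
```

Here "c" sorts before "cz" in both orders, but "ch" sorts after "czh" only under the toy Spanish key, because the appended "h" combines with the preceding "c" to form a different collation unit.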

In Analytica (but not in all other programs that use collation order), it does hold that exactly one of x < y, x > y or x = y is true for every two text values x and y. Also, x < y or x = y if and only if x <= y; x > y or x = y if and only if x >= y; and x < y or x > y if and only if x <> y.

Case Insensitivity

Analytica's comparison operators, <, <=, > and >=, perform case-sensitive comparison. However, several functions including SortIndex, Sort, Rank, FindInText, SplitText and TextReplace include optional parameters that you can use to specify case-insensitive comparison. But case-sensitivity / insensitivity is more nuanced than it is with straight ANSI orders (Case Sensitive Collation Sort Order).

Consider the following text values:

"ab"
"ac"
"Ab"
"Ac"

Since "a" < "A" in non-ANSI collation orders, the order in which these are listed would appear to be the correct case-sensitive collation order, with all text values starting with "a" coming before all text values starting with "A". However, that is not the case. The correct case-sensitive collation order is

"ab"
"Ab"
"ac"
"Ac"

This is an example where collation order is not preserved under concatenation or substring operations. In a natural collation order, the fact that "a" < "A" is relevant, but it is trumped by the ordering of the second character.

Conceptually, in a case-sensitive collation ordering, the case-insensitive and accent-insensitive comparison is usually considered first (but the accent treatment varies by language). In the event of a tie, accent orderings will usually break the tie first, and if the values are still tied after considering accents, upper/lower case will break the tie. For case-insensitive comparisons, that last tie breaker isn't used. The actual algorithm doesn't use three stages, and there are some minor variations on this by language (especially with regard to accents), but the ordering can be best understood by thinking of it in these three stages. See Case Sensitive Collation Sort Order.
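The three conceptual stages can be sketched as a sort key in Python. This is an illustration of the mental model only, under simplifying assumptions: the function name and its case_sensitive parameter are invented for the example, accents are compared by raw combining-mark code, and no language-specific rules are applied. As noted above, the real algorithm is not implemented this way.

```python
import unicodedata

def three_stage_key(text, case_sensitive=True):
    letters = []  # base characters, in order
    accents = []  # combining marks attached to each base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and letters:
            accents[-1] += ch
        else:
            letters.append(ch)
            accents.append("")
    # Stage 1: letters only, ignoring case and accents.
    primary = [ch.casefold() for ch in letters]
    # Stage 2: accents break ties left by stage 1.
    secondary = accents
    # Stage 3: case breaks remaining ties, lowercase first (False < True).
    tertiary = [ch.isupper() for ch in letters] if case_sensitive else []
    return (primary, secondary, tertiary)

case_example = sorted(["Ac", "ab", "Ab", "ac"], key=three_stage_key)
accent_example = sorted(["name", "na\u00efve", "navigate", "nag"],
                        key=three_stage_key)
print(case_example)
print(accent_example)
```

With this key the four values from the example come out as "ab", "Ab", "ac", "Ac", because the second letter is decided in stage 1 before case is ever consulted, and "naïve" falls between "nag" and "name". Passing case_sensitive=False drops the third stage, so "ab" and "Ab" compare as tied.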

Equality

The equality operator (=) and not-equal operator (<>) identify text as equal (x = y) only when both contain precisely the same character sequence. In Unicode there are often multiple ways of representing the same logical character or characters. For example, an accented á can be either Chr(225), or it can be the two-character sequence a & Chr(769). "fi" can either be a two-character sequence 'f' & 'i', or the single ligature character 'ﬁ' (Chr(64257)). The Ω character can be either Chr(937) or Chr(8486). When characters have more than one combining character, the combining characters can appear in any order. In all these cases, even though the resulting text is visually identical, the values are not considered equal by Analytica's = operator because the precise sequence of characters is not identical. In all these cases, text using these characters will be adjacent when sorted, as if the text is nearly equal. When collation order alone does not distinguish between strings, Analytica breaks ties using ANSI order.
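The same character codes can be inspected in Python, whose == operator on strings also compares exact character sequences. The sketch below is only an illustration of the Unicode facts involved; the use of Python's unicodedata normalization forms is an assumption made for the example and says nothing about how Analytica compares or sorts these values internally.

```python
import unicodedata

precomposed = chr(225)              # a-acute as one character
combining = "a" + chr(769)          # "a" followed by a combining acute accent
greek_omega, ohm_sign = chr(937), chr(8486)
ligature = chr(64257)               # single-character fi ligature

# Exact sequence comparison sees two different texts.
exact_equal = precomposed == combining

# Canonical normalization (NFC) maps both accent forms and both
# omega forms onto the same character sequence.
accent_nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
omega_nfc_equal = unicodedata.normalize("NFC", ohm_sign) == greek_omega

# The ligature is only a compatibility equivalent of "f" + "i",
# so NFC leaves it alone while NFKC expands it.
ligature_nfc_equal = unicodedata.normalize("NFC", ligature) == "fi"
ligature_nfkc_equal = unicodedata.normalize("NFKC", ligature) == "fi"

print(exact_equal, accent_nfc_equal, omega_nfc_equal)
print(ligature_nfc_equal, ligature_nfkc_equal)
```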

History

Introduced in Analytica 4.5.
