Collation in ICU
  Mark Davis, Chief SW Globalization Architect, IBM
  21st International Unicode Conference

What is ICU?
  Premier Unicode enablement library
  Open-source: non-viral license
  Full-featured, cross-platform
  C, C++, Java APIs
  Collation, Charset Conversion, Resources, Boundaries, Calendars, Transforms (case, norm., translit., …), Format/Parse (dates, times, msgs, nums., curr., …), Unicode strings/props
  http://oss.software.ibm.com/icu/
  21st International Unicode Conference, Dublin, Ireland, 9/9/2020

Collation = Sorting Order
  How hard can it be? A < B < C < …
  Complications:
    Languages are complex and varied
    Unicode is a big set of characters
    Performance is crucial

Varies By:
  Language: Swedish: z < ö; German: ö < z
  Usage: Dictionary: öf < of; Telephone: of < öf
  Customizations: A < a; a < A
  Versioning: Fixes; New Gov. Stds; New Characters

Strength Levels: L1, L2, L3
  1. Base characters: a < b
  2. Accents: as < às < at (ignored if there is an L1 character difference)
  3. Case: ao < Ao < aò (ignored if there is an L1 or L2 difference)
  4. Punctuation: ab < a-b < aB (ignored* if there is an L1, L2, or L3 difference)
  5. Tie-breaker: NFD code point order

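The level ordering above can be illustrated with a toy comparison key. This is only a sketch, not ICU's implementation: it assumes that NFD decomposition is enough to separate base letters from accents, treats any combining mark as an L2 weight and upper case as an L3 weight, and leaves out punctuation (L4) and the tie-breaker. The names `levels` and `sort_key` are hypothetical.

```python
import unicodedata

def levels(s):
    """Split s into toy (primary, secondary, tertiary) weight lists."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # Accent: record it on the preceding base character (L2).
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())                 # L1: base letter
        secondary.append(())                       # L2: no accent yet
        tertiary.append(1 if ch.isupper() else 0)  # L3: case
    return primary, secondary, tertiary

def sort_key(s, strength=3):
    """Key that compares all of L1 first, then all of L2, then all of L3."""
    return tuple(levels(s)[:strength])

accent_example = sorted(["at", "às", "as"], key=sort_key)
case_example = sorted(["aò", "Ao", "ao"], key=sort_key)
```

Because a whole level is compared before the next one is consulted, an accent difference only matters when the base letters are identical, which is the behavior the slide describes. Passing `strength=1` makes "ao", "Ao" and "aò" compare equal.
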
Context Sensitivity
  Contractions: H < Z, but CZ < CH
  Expansions: OE < Œ < OF
  Both: カー < カイ, キー > キイ

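A minimal sketch of how contractions and expansions change the mapping from characters to weights. The weight table is hypothetical and covers primary weights only, so "œ" and "oe" come out equal here, whereas the slide's OE < Œ relies on a lower-level difference that this toy omits. The greedy longest-match lookup is an assumption for illustration, not a description of ICU's data structures.

```python
# Hypothetical primary weights. "ch" is a contraction (one element that
# sorts after "h"); "œ" is an expansion (one character, two weights).
TABLE = {
    "c": 1, "e": 2, "f": 3, "h": 4, "ch": 5, "o": 6, "z": 7,
    "œ": (6, 2),
}

def elements(s, table=TABLE, max_len=2):
    """Greedy longest-match split of s into a list of primary weights."""
    out, i = [], 0
    while i < len(s):
        for n in range(max_len, 0, -1):
            chunk = s[i:i + n]
            if chunk in table:
                w = table[chunk]
                out.extend(w if isinstance(w, tuple) else (w,))
                i += len(chunk)
                break
        else:
            raise KeyError(s[i])  # character not covered by the toy table
    return out
```

With this table, `elements("cz")` is `[1, 7]` and `elements("ch")` is `[5]`, so "cz" sorts before "ch" even though "h" sorts before "z".
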
Canonical Equivalence
  Å (U+00C5) ≡ Å (U+212B) ≡ A + combining ring above
  x + dot below + circumflex ≡ x + circumflex + dot below
  ự ≡ u + horn + dot below ≡ ư + dot below ≡ ụ + horn ≡ u + dot below + horn

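Canonical equivalence can be checked by comparing normalized forms. The sketch below uses Python's standard `unicodedata` module; the helper name `canonically_equivalent` is an assumption for illustration.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# U+00C5 (A with ring), U+212B (angstrom sign), A + U+030A (combining ring)
ring_forms = ["\u00C5", "\u212B", "A\u030A"]

# u with horn and dot below, spelled five different ways
horn_dot_forms = [
    "\u1EF1",         # precomposed
    "u\u031B\u0323",  # u + horn + dot below
    "\u01B0\u0323",   # u-with-horn + dot below
    "\u1EE5\u031B",   # u-with-dot-below + horn
    "u\u0323\u031B",  # u + dot below + horn
]
```

Marks with different combining classes (dot below and circumflex, or horn and dot below) may be reordered without changing the text, so a collator has to treat all of these spellings as the same string.
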
Oddities
  Normal accents: cote < coté < côte < côté
    first accent difference determines order
  French accents: cote < côte < coté < côté
    last accent difference determines order
  Logical Order Exception (Thai, Lao): a leading vowel followed by a consonant sorts as if the consonant came first

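The two accent orderings differ only in the direction in which secondary weights are compared. A toy sketch, assuming every accent gets the same weight of 1 and that NFD separates accents from base letters; `accent_key` is a hypothetical helper, not an ICU function.

```python
import unicodedata

def accent_key(word, french=False):
    """Toy (primary, secondary) key; french=True reverses the secondaries."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] = 1  # toy weight: accented
        else:
            base.append(ch)
            accents.append(0)    # toy weight: unaccented
    if french:
        accents.reverse()        # compare from the end of the word
    return base, accents

words = ["côté", "coté", "côte", "cote"]
normal_order = sorted(words, key=accent_key)
french_order = sorted(words, key=lambda w: accent_key(w, french=True))
```

Here `normal_order` is cote, coté, côte, côté and `french_order` is cote, côte, coté, côté, matching the two lines on the slide.
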
Merging Database Fields
  F1 = LastName, F2 = FirstName
  Sequential: F1, then F2
  Weak 1st: F1 (L1), F2
  Merged: L1, L2, L3
  diSilva, John; diSilva, Fred; di Silva, John; di Silva, Fred; dísilva, John; dísilva, Fred; diSilva, John; dísilva, John; di Silva, Fred; diSilva, Fred; dísilva, Fred; diSilva, John; di Silva, John; dísilva, John; diSilva, Fred; di Silva, Fred; dísilva, Fred

Customizations
  Parameters that change collation behavior
  Choice of language (locale)
  Runtime choices
  Examples to follow

Parametric Customizations
  Strength: Base + Accent + Case, etc.
  Case: A < a; a < A
  Punctuation: di Silva < di Silva

Punctuation / Spaces (Alternates)
  Base Character    Ignorable
  di silva          Dickens
  di Silva          di silva
  Di silva          di Silva
  Di Silva          diSilva
  Dickens           Di silva
  disilva           Di Silva
  diSilva           DiSilva
  Disilva
  DiSilva

Extended Customizations
  User-defined: “&” ≡ “ampersand”
  Script Order: b < ב < β < б; β < b < б < ב
  Merging tailorings: Iranian + French
  Numbers: A-10 < A-2 < A-10

Other Uses: String Searching
  Match according to locale conventions, e.g. w = v for Swedish
  Use collation options: ignore case, accent
  Other customizations

Other Uses: Selection Bounds
  Return all records where: Zoë ≤ name < Zorma
  Ignore case / accents: Zoe / zoe / Zoë / zoë / …

UCA
  UTS #10: Unicode Collation Algorithm
  Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  Default ordering: all Unicode code points
  Provides for tailoring to given languages
  Also see: The Unicode Standard, §5.17: Sorting and Searching
  Aligned with ISO 14651

APIs
  String Compare
  Sort Keys
  String Search
  Selection Boundaries
  Merged sortkeys

Sort Keys
  Transform string into series of bytes which will binary-compare
         Level 1          Level 2        Level 3
  a:     06 C3         01 20          01 02          00
  A:     06 C3         01 20          01 08          00
  á:     06 C3         01 20 32       01 02 02       00
  ab:    06 C3 06 D7   01 20 20       01 02 02       00
  b:     06 D7         01 20          01 02          00

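The byte layout in the table can be reproduced with a small sketch: primary weights for all characters, a 01 separator, secondary weights, another 01, tertiary weights, and a 00 terminator. The weight values are copied from the slide; the function `sort_key` and its restriction to the characters a, b and the acute accent are assumptions made for illustration, not ICU's key builder.

```python
import unicodedata

PRIMARY = {"a": b"\x06\xC3", "b": b"\x06\xD7"}  # values from the slide
SECONDARY_BASE, SECONDARY_ACUTE = 0x20, 0x32
TERTIARY_LOWER, TERTIARY_UPPER = 0x02, 0x08

def sort_key(s):
    """Build a three-level toy sort key for strings over a, b and acute."""
    primary, secondary, tertiary = bytearray(), bytearray(), bytearray()
    for ch in unicodedata.normalize("NFD", s):
        if ch == "\u0301":  # combining acute: no primary weight
            secondary.append(SECONDARY_ACUTE)
            tertiary.append(TERTIARY_LOWER)
        else:
            primary += PRIMARY[ch.lower()]
            secondary.append(SECONDARY_BASE)
            tertiary.append(TERTIARY_UPPER if ch.isupper() else TERTIARY_LOWER)
    return (bytes(primary) + b"\x01" + bytes(secondary) + b"\x01"
            + bytes(tertiary) + b"\x00")

ordered = sorted(["b", "ab", "á", "A", "a"], key=sort_key)
```

Sorting by the raw bytes gives a, A, á, ab, b. The 01 separator is lower than any weight, which is why a shorter primary sequence sorts ahead of a longer one that starts the same way.
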
String Compare vs. Sort Keys
  Same results in either case
  String compare (SC) faster for single comparisons
    average 5 to 10 times!
  Sort keys (SK) faster for multiple comparisons
    index once
    binary compare many times

String Search
  Naïve approach: key matches in target at <x, y> iff target.substring(x, y) ≡ key
  Boundary complications:
    Ignorables: “a” matches in “(a)”?
      at <0, 2> & <1, 2> & <0, 3> & <1, 3>?
    Contractions: “c” matches in “churo”?
    Normalization: “å” matches in “a¸˚”?

WARNING 1: Basics
  Not aligned with character set or repertoire
    Latin-1: Swedish and German sorting differs
  Not code point (binary) order
    Binary: Z < a < v < w
    English: Z > a
    Swedish: v ≡ w
  Not a property of strings: same database
    Swedish user: views/selects
    German user: views/selects

WARNING 2: Operations
  Order not preserved under concatenation / substringing
    x < y ↛ xz < yz
    x < y ↛ zx < zy
    xz < yz ↛ x < y
    zx < zy ↛ x < y

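One way to see why concatenation breaks ordering is a contraction. In the toy table below (hypothetical weights, primary level only), "ch" is a single element that sorts after "h", as in the earlier CZ < CH example.

```python
# Hypothetical primary weights with one contraction, "ch".
WEIGHTS = {"c": 1, "d": 2, "h": 3, "ch": 4}

def key(s):
    """Map s to primary weights, preferring a two-character contraction."""
    out, i = [], 0
    while i < len(s):
        n = 2 if s[i:i + 2] in WEIGHTS and i + 1 < len(s) else 1
        out.append(WEIGHTS[s[i:i + n]])
        i += n
    return out

x, y, z = "c", "d", "h"
```

Here `key(x) < key(y)`, yet `key(x + z)` is `[4]` and `key(y + z)` is `[2, 3]`, so "ch" sorts after "dh". Read in the other direction, "dh" < "ch" does not imply "d" < "c", which is the substringing case.
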
WARNING 3: Dependence
  Collation is a relation over strings
  Sort keys embody part of that relation
  Thus, comparing sort keys from different tailorings (or parameters) gives undefined results.

WARNING 4: Stability
  Stable Sort
    Records with equal comparison come out in original order
    Property of algorithm, not comparison
  Semi-Stable Comparison
    x ≠ y → x ≢ y
    Property of comparison, not algorithm
    Degrades performance
    Doesn’t do what people think (or really want)!

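The difference between the two notions can be shown with Python's built-in `sorted`, which is documented to be a stable sort. The record data and the use of `str.casefold` as a stand-in for a primary-strength comparison are assumptions for illustration only.

```python
records = [
    ("diSilva", "John"),
    ("Adams", "Zoe"),
    ("disilva", "Fred"),
    ("DiSilva", "Ann"),
]

# Stable sort: last names that compare equal (ignoring case) keep their
# original relative order. This is a property of the sort algorithm.
stable = sorted(records, key=lambda r: r[0].casefold())

# Semi-stable comparison: a code point tie-breaker makes any two different
# strings compare unequal. This is a property of the comparison.
semi_stable = sorted(records, key=lambda r: (r[0].casefold(), r[0]))
```

In `stable` the three Silva records stay in input order (John, Fred, Ann). In `semi_stable` they are reordered by code point (DiSilva, diSilva, disilva), which has nothing to do with the original record order and is usually not what a caller asking for "stability" wants.
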
ICU/Java Collation Architecture
  L1-3, contractions, expansions, …
  Locale tailorings
  Fully rule-based specification
  Arbitrary runtime user customizations:
    & '?' = 'question mark'
    & '$' = 'dollar sign'
    & z < 'george'

Java
  Sun licensed and includes an early version of ICU collation in Java
  ICU version:
    Dramatically faster
    Much reduced memory consumption
    Halved sort-key length
    Many additional features

ICU Collation I
  Full UCA compliance
  Full supplementary character support
  Solid performance
  Small sort keys
  Small memory footprint

ICU Collation II
  Parametric control
  Tailorable to any language
  Simultaneous multiple versions
  Merging sort keys
  Selection bounds

Memory-Mappable, Fast Init
  Old: separate allocations
  New: offsets within mem-map

Delta Tailoring: Minimize Memory Usage
  [Diagram] input → FR: found → output
  not found → UCA (one copy; ≈80 K): found → output
  not found → code: synthesize → output

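The lookup chain in the diagram can be sketched as a fall-through across two tables and a fallback function. All table contents, the `synthesize` formula, and the names below are hypothetical; the point is only that each locale stores its differences while a single default table is shared.

```python
# Hypothetical shared default table: one copy for every locale.
UCA = {"a": 10, "b": 20, "z": 260}

# Hypothetical per-locale deltas: only the weights that differ from UCA.
FR_DELTA = {"b": 15}
DE_DELTA = {}

def synthesize(ch):
    """Hypothetical fallback: derive a weight from the code point."""
    return 0x10000 + ord(ch)

def lookup(ch, delta, uca=UCA):
    """Look in the locale delta, then the shared UCA table, then synthesize."""
    if ch in delta:
        return delta[ch]
    if ch in uca:
        return uca[ch]
    return synthesize(ch)
```

For example, `lookup("b", FR_DELTA)` comes from the French delta, `lookup("a", FR_DELTA)` falls through to the shared table, and a character in neither table gets a synthesized weight.
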
Simultaneous Multiple Versions
  Programs can link against different versions of ICU, simultaneously.
  Preserves exact binary order over time.
  [Diagram] Application: New DB with ICU 2.1; Old DB with ICU 2.0

Performance
  Checks for identical prefixes first
  Invokes normalization only when needed
  Fast paths for common cases
  Minimizes comparison time
  Minimizes sort key length

Sort Key Compression
  Common weights are 1 byte
    Primary, secondary, tertiary, quaternary
  Sequences are compressed
  UTF-16 values for “Märk Davis” (22 bytes):
    004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000
  Sort key (L3, ignorable punctuation, 19 bytes):
    2F 17 39 2B 1D 17 41 27 3B 01 77 96 0A 01 8F 80 8F 07 00

ICU vs. Windows, glibc
  Full UCA!
  String comparison: comparable speed
    ≈ -20% .. +400%
  Sort keys: much shorter
    ≈ 50%
  Warning: speed comparisons are approximate!
    Depends on data, parameters, features, CPU

More Information
  ICU: http://oss.software.ibm.com/icu/
  Design Document: http://oss.software.ibm.com/cvs/icuhtml/design/collation/
  Latest version of these slides: http://www.macchiato.com

Q & A

Fast C or D (FCD)
  Accepts all NFD, most NFC, without normalization

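A hedged sketch of an FCD check, based on the usual definition: a string is FCD if decomposing each character separately, without reordering, already yields canonically ordered text. The helper `is_fcd` is an assumption written with Python's `unicodedata`; ICU uses precomputed per-character data rather than normalizing each character on the fly.

```python
import unicodedata

def is_fcd(s):
    """True if s can be collated without a normalization pass (FCD)."""
    prev_trail = 0
    for ch in s:
        nfd = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(nfd[0])    # class of first decomposed char
        trail = unicodedata.combining(nfd[-1])  # class of last decomposed char
        # A non-starter must not follow a mark with a higher combining class.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

nfd_text = "a\u0323\u030A"     # a + dot below + ring above (NFD)
nfc_text = "\u1EA1\u030A"      # a-with-dot-below + ring above (NFC)
unordered = "\u00E5\u0323"     # a-with-ring + dot below (neither NFC nor NFD)
```

Both the NFD and the NFC spelling pass the check, so neither needs to be normalized before comparison; only the third string, where a dot below (class 220) follows an embedded ring above (class 230), forces normalization.
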
Backup Slides
  Not used in the presentation, except in response to questions

Performance: Coding
  Avoided unnecessary function calls
    Example: strlen too expensive!
  Avoided use of objects
    Rewrote core code in C
    C++ API wraps the C core code
  Fast-pathed common cases
  Used stack memory buffers (with expansion if necessary)
  Made inner loops as tight as possible

WARNING 5: Math Relation
  S = {Unicode Strings}
  Reflexive: ∀a ∊ S: a ≤ a
  Antisymmetric: ∀a, b ∊ S: a ≤ b & b ≤ a → a = b
  Transitive: ∀a, b, c ∊ S: a ≤ b & b ≤ c → a ≤ c
  Total: ∀a, b ∊ S: a ≤ b ∨ b ≤ a

Identical Prefixes
  Sorting / Searching Databases
    Many comparisons to “close” strings
  Check initial prefixes with binary compare
  Drop into collation loop at first difference
  Complication…

Initial Prefix Complication
  Need to back up if in “bad” position:

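A minimal sketch of the prefix optimization with a back-up step. The slide does not say what counts as a "bad" position; the sketch assumes one plausible case, namely that the first difference falls on a combining mark, which belongs to the preceding base character. Other cases such as a difference inside a contraction are not handled. The function name `compare_start` is hypothetical.

```python
import unicodedata

def compare_start(a, b):
    """Index at which the full collation comparison should begin."""
    # Skip the identical prefix with a plain code unit comparison.
    i, n = 0, min(len(a), len(b))
    while i < n and a[i] == b[i]:
        i += 1

    def is_bad(s, pos):
        # Assumed "bad" position: the next character is a combining mark.
        return pos < len(s) and unicodedata.combining(s[pos]) != 0

    # Back up until neither string resumes in the middle of a sequence.
    while i > 0 and (is_bad(a, i) or is_bad(b, i)):
        i -= 1
    return i
```

For "abcx" and "abcy" the collation loop can start at index 3. For "resume" and "resume" followed by U+0301, the binary prefix is six characters long, but the comparison has to restart at index 5 so that the final "e" and its accent are weighed together.
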
Fractional UCA
  Fractional weights for compression
  Gaps for tailoring, future UCA additions
  Only stores differences in tailoring file
    Reduces memory footprint

Exceptional Values
  Normal weight storage
  Special weight storage
    NOT_FOUND, EXPANSION, CONTRACTION, THAI, …

Minimal Memory
  Flat-file (memory mapped)
    speeds initialization
    reduces memory footprint (next slide)
  Delta Tailoring
    Single copy of UCA (≈80 K)
    Small delta files per locale