Multilingual Collation and CJK Sorts in Oracle 9i
- Slides: 29
Multilingual Collation and CJK Sorts in Oracle 9i
Claire Ho, Winson Chu
Server Globalization Group, Oracle Corp
9/15/2021 Copyright 2001, Oracle Corp
Introduction
• High demand for multilingual collation support
• Collation features
• Generic multilingual sort
• SQL collation functions
• Linguistic sorts in 9i
• Performance
• Flexibility and extensibility
• Demo
Collation Features (introduction)
• Sorting of Latin-based characters
   Example:
   s1: code
   s2: cote
   s3: Cote
   s4: côte
   Sort key levels: primary, secondary, tertiary
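The three sort key levels can be illustrated with a minimal Python sketch. It is not Oracle's implementation: it assumes the primary level compares base letters, the secondary level compares accents, and the tertiary level compares case, and the helper name `sort_key` is hypothetical.

```python
import unicodedata

def sort_key(word):
    """Build a three-level key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            # Attach the accent to the letter it follows (secondary level).
            secondary[-1] += (ord(ch),)
        else:
            primary.append(ch.lower())                  # primary: base letter
            secondary.append(())                        # no accent seen yet
            tertiary.append(1 if ch.isupper() else 0)   # tertiary: case
    return (primary, secondary, tertiary)

words = ["côte", "Cote", "code", "cote"]
print(sorted(words, key=sort_key))
```

Because Python compares tuples left to right, a primary difference (d versus t) decides first, an accent difference decides next, and case only breaks the remaining ties, which gives the s1 to s4 order from the slide.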
Collation Features (contract/expand)
• Sorting of contracting characters
   Example:
   s1: Ciencia
   s2: Cheremoya
   s3: Deportes
   => s1 < s2 < s3 in traditional Spanish sort.
   => s2 < s1 < s3 in other sorts.
• Sorting of expanding characters
   Example: ‘Æ’ => ‘A’ + ‘E’
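A small Python sketch of both ideas follows. It assumes a toy alphabet of ASCII letters in which "ch" is inserted as a single letter after "c" (traditional Spanish treats "ll" the same way, which is omitted here); the names `traditional_key`, `modern_key`, and `expand` are hypothetical and nothing below reflects Oracle's internal tables.

```python
import string

# Toy alphabet: the contraction "ch" is one collating element right after "c".
ELEMENTS = list(string.ascii_lowercase)
ELEMENTS.insert(ELEMENTS.index("c") + 1, "ch")
WEIGHT = {element: i for i, element in enumerate(ELEMENTS)}

def traditional_key(word):
    """Primary key that treats 'ch' as a single letter (contraction)."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(WEIGHT["ch"])
            i += 2
        else:
            key.append(WEIGHT[word[i]])
            i += 1
    return key

def modern_key(word):
    """Primary key with one weight per letter (no contraction)."""
    return [WEIGHT[ch] for ch in word.lower()]

def expand(word):
    """Expansion: one character is sorted as if it were two."""
    return word.replace("\u00c6", "AE").replace("\u00e6", "ae")

words = ["Deportes", "Cheremoya", "Ciencia"]
print(sorted(words, key=traditional_key))
print(sorted(words, key=modern_key))
print(expand("\u00c6on"))
```

The sketch only handles unaccented ASCII letters; a real collation table would carry weights for every character.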
Collation Features (context sensitive)
• Sorting of context-sensitive characters
   In Japanese: ka a, ka i, ki a, ki i
Collation Features (surrogates)
• Sorting of surrogate characters
   - High-surrogate range: U+D800 ~ U+DBFF
   - Low-surrogate range: U+DC00 ~ U+DFFF
   - Capability to support 1 million surrogate pairs in a single sort
• Canonical equivalence
   Example:
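The "1 million" figure follows from the two ranges: 1,024 high surrogates times 1,024 low surrogates. The Python sketch below works through that arithmetic and the standard UTF-16 mapping from a supplementary code point to its pair; the function name `to_surrogates` is only illustrative.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF

def to_surrogates(code_point):
    """Map a supplementary code point to its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = HIGH_START + (offset >> 10)    # top 10 bits of the offset
    low = LOW_START + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

# Number of distinct pairs the two ranges can express.
pair_count = (HIGH_END - HIGH_START + 1) * (LOW_END - LOW_START + 1)
print(pair_count)
print([hex(unit) for unit in to_surrogates(0x20000)])
```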
Collation Features (special sorts)
• Backward sorting of accented characters
   Example:
   s1: élève
   s2: élevé
   => s1 < s2 in French sort
   => s1 > s2 in other sorts
• Character rearrangement for Thai/Lao characters
   - Thai vowels (U+0E40 ~ U+0E44) and Lao vowels (U+0EC0 ~ U+0EC4) need to be swapped with the next character if it is a consonant
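Both special cases can be sketched in Python. The sketch assumes that "backward" means the accent level is compared from the end of the word, and that Thai and Lao consonants fall in U+0E01 to U+0E2E and U+0E81 to U+0EAE (an assumption not stated on the slide). The function names are hypothetical.

```python
import unicodedata

def split_levels(word):
    """Return (base letters, accents per letter) for a word."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += (ord(ch),)
        else:
            base.append(ch.lower())
            accents.append(())
    return base, accents

def forward_key(word):
    base, accents = split_levels(word)
    return (base, accents)            # accents compared left to right

def french_key(word):
    base, accents = split_levels(word)
    return (base, accents[::-1])      # accents compared right to left

VOWELS = [(0x0E40, 0x0E44), (0x0EC0, 0x0EC4)]       # ranges from the slide
CONSONANTS = [(0x0E01, 0x0E2E), (0x0E81, 0x0EAE)]   # assumed ranges

def in_ranges(ch, ranges):
    return any(low <= ord(ch) <= high for low, high in ranges)

def rearrange(text):
    """Swap a leading Thai/Lao vowel with the consonant that follows it."""
    chars, i = list(text), 0
    while i + 1 < len(chars):
        if in_ranges(chars[i], VOWELS) and in_ranges(chars[i + 1], CONSONANTS):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

words = ["\u00e9lev\u00e9", "\u00e9l\u00e8ve"]   # élevé, élève
print(sorted(words, key=french_key))
print(sorted(words, key=forward_key))
```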
SQL Collation Functions
• Normalization functions
   -- Composition function
   -- Decomposition function
• Collation key functions
   -- NLSSORT()
SQL Collation Functions
Normalization Form D (NFD)
   Description: Canonical Decomposition
   SQL statement: Select decompose(string, CANONICAL) from dual;
Normalization Form C (NFC)
   Description: Canonical Decomposition followed by Canonical Composition
   SQL statement: Select compose(string) from dual;
Normalization Form KD (NFKD)
   Description: Compatibility Decomposition
   SQL statement: Select decompose(string, COMPATIBILITY) from dual;
Normalization Form KC (NFKC)
   Description: Compatibility Decomposition followed by Canonical Composition
   SQL statement: Select compose(decompose(string, COMPATIBILITY)) from dual;
SQL Collation Functions
Canonical Decomposition
   Select decompose('ộĳ', CANONICAL) from dual;
   Original string: “ộĳ”
      ộ = 0x1ED9
      ĳ = 0x0133
   Return string:
      o = 0x006F
      . = 0x0323 (combining dot below)
      ˆ = 0x0302 (combining circumflex)
      ĳ = 0x0133
Compatibility Decomposition
   Select decompose('ộĳ', COMPATIBILITY) from dual;
   Original string: “ộĳ”
      ộ = 0x1ED9
      ĳ = 0x0133
   Return string:
      o = 0x006F
      . = 0x0323 (combining dot below)
      ˆ = 0x0302 (combining circumflex)
      i = 0x0069
      j = 0x006A
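The same code points can be reproduced outside the database with Python's standard `unicodedata` module. This is only a cross-check of the Unicode normalization forms, not a description of how Oracle's DECOMPOSE and COMPOSE are implemented; the helper `code_points` is hypothetical.

```python
import unicodedata

def code_points(text):
    """List the code points of a string as hex literals."""
    return [f"0x{ord(ch):04X}" for ch in text]

original = "\u1ed9\u0133"   # the two characters with code points 0x1ED9 and 0x0133

nfd = unicodedata.normalize("NFD", original)     # canonical decomposition
nfkd = unicodedata.normalize("NFKD", original)   # compatibility decomposition
nfc = unicodedata.normalize("NFC", nfd)          # recompose the canonical form
nfkc = unicodedata.normalize("NFKC", original)   # compatibility, then compose

print(code_points(nfd))
print(code_points(nfkd))
print(code_points(nfc))
print(code_points(nfkc))
```

Canonical decomposition leaves 0x0133 intact because the ligature only has a compatibility mapping, which is why the i and j code points appear only in the compatibility result.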
SQL Collation Function
Normalization Form C: Canonical Decomposition followed by Canonical Composition
   Select compose(‘ ’) from dual;
SQL Collation Function
Normalization Form KC: Compatibility Decomposition followed by Canonical Composition
   Select compose(decompose('abc', COMPATIBILITY)) from dual;
Conformance
   The normalization functions in Oracle 9i comply with the Unicode conformance tests.
SQL Collation Functions
Collation key function: NLSSORT()
   -- An SQL function that overrides the client linguistic setting.
Example: client linguistic setting: Generic_M
   SQL> select item_name from category_tab order by item_name;
would return
   ITEM_NAME
   ------------------------------
   TV
   (course in Japanese)
   (computer in Japanese)
SQL Collation Functions
-- Changing the sorting method at the SQL level
   SQL> select item_name from category_tab order by nlssort(item_name, 'nls_sort=Japanese_M');
would return
   ITEM_NAME
   ------------------------
   TV
   (course in Japanese)
   (computer in Japanese)
Generic multilingual sort
• A common template for multilingual collation
• Based on ISO 14651 (International String Ordering)
• Named Generic_M
• Defines accented characters and punctuation characters as ignorable characters
• Covers Latin-based letters
CJK Sorts
Japanese Sort
   -- Based on JIS X 4061 Japanese collation
   -- Includes JIS X 0208 and JIS X 0212 planes
   -- Reserved space for JIS X 0213
   -- Latin letters are based on ISO 14651 order
CJK Sorts
Chinese Sorts for Traditional Chinese
   1. Sorted by stroke count
   2. Sorted by radical
   3. Sorted by Big5 character set order
   4. Sorted by HKSCS character set order
CJK Sorts
Chinese Sorts for Simplified Chinese
   1. Sorted by stroke count
   2. Sorted by radical order
   3. Sorted by Pinyin order
   4. Sorted by GB character set order
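The difference between a stroke sort and a Pinyin sort can be seen with a three-character toy example in Python. The lookup tables below are hand-written for illustration only and cover just these characters; a real sort would use complete tables and break ties between characters with equal stroke counts.

```python
# Hand-written toy tables for three characters (one, two, three).
STROKES = {"\u4e00": 1, "\u4e8c": 2, "\u4e09": 3}
PINYIN = {"\u4e00": "yi", "\u4e8c": "er", "\u4e09": "san"}

def stroke_key(text):
    """Order by stroke count of each character."""
    return [STROKES[ch] for ch in text]

def pinyin_key(text):
    """Order by the romanized reading of each character."""
    return [PINYIN[ch] for ch in text]

chars = ["\u4e09", "\u4e00", "\u4e8c"]
by_stroke = sorted(chars, key=stroke_key)
by_pinyin = sorted(chars, key=pinyin_key)
print(by_stroke)
print(by_pinyin)
```

The same three characters come out in different orders, which is why the choice of sort has to be made explicitly per application or per query.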
CJK Sorts
Korean Sort
   -- Latin letters are based on ISO 14651 order
   -- Hangul characters are based on Unicode binary order
   -- Hanja characters are based on pronunciation order
Linguistic Sorts in Oracle 9i
Monolingual sorts:
   ARABIC_MATCH, ARABIC_ABJ_SORT, ARABIC_ABJ_MATCH, ASCII7, BENGALI, BIG5, BULGARIAN, CANADIAN FRENCH, CATALAN, XCATALAN, CROATIAN, XCROATIAN, CZECH, XCZECH, DANISH, XDANISH, DUTCH, XDUTCH, EEC_EUROPA3, ESTONIAN, FINNISH, FRENCH, XFRENCH, GERMAN, XGERMAN_DIN, GBK, GREEK, HEBREW, HKSCS, HUNGARIAN, XHUNGARIAN, ICELANDIC, INDONESIAN, ITALIAN, JAPANESE, LATIN, LATVIAN, LITHUANIAN, MALAY, NORWEGIAN, POLISH, PUNCTUATION, XPUNCTUATION, ROMANIAN, RUSSIAN, SLOVAK, XSLOVAK, SLOVENIAN, XSLOVENIAN, SPANISH, XSPANISH, SWEDISH, SWISS, XSWISS, THAI_DICTIONARY, THAI_TELEPHONE, TURKISH, XTURKISH, UKRAINIAN, UNICODE_BINARY, VIETNAMESE, WEST_EUROPEAN
Multilingual sorts:
   CANADIAN_M, DANISH_M, FRENCH_M, GENERIC_M, JAPANESE_M, KOREAN_M, SPANISH_M, THAI_M, SCHINESE_STROKE_M, SCHINESE_PINYIN_M, SCHINESE_RADICAL_M, TCHINESE_STROKE_M
Performance Features
• Support sorting of 1 million characters with a four-byte collation key
   -- Four-byte collation key: primary key 2 bytes, secondary key 1 byte, tertiary key 1 byte
   -- The more bytes a collation key uses, the greater the performance and memory cost
   -- We can support 1 million characters because our architecture allows us to extend the primary key
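The 2 + 1 + 1 byte layout can be sketched with Python's `struct` module. The sketch assumes big-endian packing so that a plain byte comparison agrees with a level-by-level comparison; the weight values are made up, and the way Oracle extends the primary key beyond two bytes is not shown because the slide does not describe it.

```python
import struct

def pack_key(primary, secondary, tertiary):
    """Pack one collation element into 4 bytes: 2 + 1 + 1, big-endian."""
    return struct.pack(">HBB", primary, secondary, tertiary)

key = pack_key(0x1234, 2, 1)
print(key.hex())

# Byte order follows level order: a primary difference always wins,
# regardless of the secondary and tertiary weights.
print(pack_key(1, 255, 255) < pack_key(2, 0, 0))
```

A two-byte primary weight can distinguish 65,536 values on its own, which is why supporting about a million characters depends on being able to extend it.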
Performance Features
• Reduce the number of calls for checking canonical equivalence
   -- Call the normalization function only when
      1. the character set is Unicode, and
      2. the current character is an accent (combining character)
   Example:
   èdit (0 normalization calls)
   e’dit (1 normalization call)
   e’di’t (2 normalization calls)
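The idea of paying for normalization only when a combining character shows up can be sketched in Python. The sketch reads the slide's example as a precomposed è (no call) versus e followed by a combining accent (one call per accent); that reading, the `is_unicode` flag, and the function name are assumptions.

```python
import unicodedata

def normalize_lazily(text, is_unicode=True):
    """Compose accents, calling the normalizer only for combining characters."""
    out, calls = [], 0
    for ch in text:
        if is_unicode and unicodedata.combining(ch) and out:
            # Only here is normalization needed: merge with the previous character.
            out[-1] = unicodedata.normalize("NFC", out[-1] + ch)
            calls += 1
        else:
            out.append(ch)   # fast path: no normalization call
    return "".join(out), calls

print(normalize_lazily("\u00e8dit"))           # precomposed accent
print(normalize_lazily("e\u0300dit"))          # one combining accent
print(normalize_lazily("e\u0300di\u0301t"))    # two combining accents
```

This simplified loop assumes combining marks arrive in canonical order; a full implementation would also have to reorder marks before composing.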
Performance Features
• Special context-sensitive sort optimization for Japanese
   -- The collation key of the Japanese prolonged sound is determined by the previous character, i.e.
   -- Traditional design:
Performance Features
• On average, the prolonged sound makes up about 10% of normal Japanese text
• New design:
   -- Direct access collation key for all Katakana and Hiragana characters except
   -- Improved binary search to access context sensitive character table
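A minimal Python sketch of the context-sensitive rule: ordinary kana keep their own key (direct access), and only the prolonged sound mark (U+30FC) triggers a lookup based on the previous character. The table below is a hypothetical fragment covering one katakana row, and the string-based key stands in for real collation weights.

```python
PROLONGED = "\u30fc"   # katakana-hiragana prolonged sound mark

# Hypothetical fragment: the vowel each kana ends in (a, i, u, e, o).
VOWEL_OF = {
    "\u30a2": "\u30a2", "\u30a4": "\u30a4", "\u30a6": "\u30a6",
    "\u30a8": "\u30a8", "\u30aa": "\u30aa",
    "\u30ab": "\u30a2",   # ka -> a
    "\u30ad": "\u30a4",   # ki -> i
    "\u30af": "\u30a6",   # ku -> u
    "\u30b1": "\u30a8",   # ke -> e
    "\u30b3": "\u30aa",   # ko -> o
}

def kana_key(word):
    """Replace each prolonged sound mark by the vowel of the previous kana."""
    key = []
    for ch in word:
        if ch == PROLONGED and key:
            # Context-sensitive path: look at the previous character.
            key.append(VOWEL_OF.get(key[-1], ch))
        else:
            key.append(ch)   # direct path: the character is its own key
    return "".join(key)

words = ["\u30ad\u30fc", "\u30ab\u30a4", "\u30ab\u30fc"]   # ki-, kai, ka-
print(sorted(words, key=kana_key))
```

Since only the prolonged sound mark takes the slower path, the cost of context sensitivity is limited to the small share of text the slide mentions.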
Performance Features
• Building linguistic indexes to speed up sorting
   Example:
   CREATE INDEX idx_cust ON cust_tab (NLSSORT(cust_name, 'NLS_SORT=GENERIC_M'));
Flexibility and Extensibility
• Support for CJK Extension B
• Reserved key gaps for future changes
• Locale Builder: a GUI tool for locale customization
Locale Builder
Demo
Summary
• Full support for Unicode collation features
• Support for over 60 monolingual sorts and 12 multilingual sorts
• Extensible and flexible