Globalization Center of Competency – ICU, Unicode, Globalization, Locale Data
Unicode Overview
Markus Scherer, Mark Davis
11/25/2020 © 2005 IBM Corporation

What is Unicode?
Coded Character Set: Characters, codes, semantics

a    U+0061     Latin a
ä    U+00E4     a-umlaut
σ    U+03C3     Greek sigma
א    U+05D0     Hebrew alef
٣    U+0663     Arabic digit 3
カ   U+30AB     Katakana ka
退   U+9000     Han Ideograph
𡯁   U+21BC1    Ideograph used in HKSCS
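
The pairing of characters, codes, and semantics listed above can be inspected from Java. This is a minimal sketch, not part of the original deck: the class name `CodedCharacterSetDemo` and the helper `label` are made up for illustration, and the names printed come from the Unicode data bundled with whichever JDK runs it.

```java
class CodedCharacterSetDemo {
    /** Formats a code point in the U+XXXX notation used on the slide. */
    static String label(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // The eight sample code points from the slide.
        int[] samples = {0x0061, 0x00E4, 0x03C3, 0x05D0, 0x0663, 0x30AB, 0x9000, 0x21BC1};
        for (int cp : samples) {
            // Character.toChars yields one or two UTF-16 code units.
            String text = new String(Character.toChars(cp));
            System.out.println(text + "  " + label(cp) + "  " + Character.getName(cp));
        }
    }
}
```

Whether the characters display correctly depends on the console encoding and fonts; the code points and names do not.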

Unicode Makes Globalization Possible
§ Single server
§ Single build
§ Single install
§ Single instance
§ …serves all clients in all languages
English, Español, Ελληνικά, עברית, العربية, 日本語, 中文

Unicode Gives Characters Meaning and Behavior: Data
Ideographic: 不 与
Alphabetic: a ξ A Ξ
Uppercase: A Ξ
Quotation_Mark: " ' « » ‘ ’ 『 』
Numeric_Value: ٣ → 3 (digits in other scripts likewise map to 4, 5)
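
A rough way to query properties like these is through `java.lang.Character`. The sketch below is illustrative only: `CharacterPropertiesDemo` and `describe` are hypothetical names, and the JDK methods only approximate the Unicode properties named on the slide (for example, `isDigit` plus `getNumericValue` stands in for Numeric_Value here).

```java
class CharacterPropertiesDemo {
    /** Lists a few property-like facts that the JDK reports for one code point. */
    static String describe(int cp) {
        StringBuilder sb = new StringBuilder(String.format("U+%04X", cp));
        if (Character.isAlphabetic(cp)) sb.append(" Alphabetic");
        if (Character.isUpperCase(cp)) sb.append(" Uppercase");
        if (Character.isIdeographic(cp)) sb.append(" Ideographic");
        if (Character.isDigit(cp)) {
            sb.append(" Numeric_Value=").append(Character.getNumericValue(cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a, Greek capital xi, two Han ideographs, Arabic-Indic digit three.
        int[] samples = {0x0061, 0x039E, 0x4E0D, 0x4E0E, 0x0663};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```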

Unicode Gives Characters Meaning and Behavior: Algorithms
§ Case mapping
§ Case folding & Case-insensitive comparison
§ Collation
§ Bidi
§ Normalization
§ Line Breaking
§ …

Case Mapping
dz ↔ Dz ↔ DZ
Heiß → HEISS → heiss
όσος ↔ ΌΣΟΣ
topkapı istanbul ↔ TOPKAPI İSTANBUL (Turkish, tr)
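
These mappings can be tried with the locale-aware `String.toUpperCase(Locale)` and `String.toLowerCase(Locale)` methods. The following is a sketch under assumptions: the class name `CaseMappingDemo` is invented, and the dz example is taken to mean the single-character digraph U+01C6 with its titlecase form U+01C5, which the slide text does not confirm.

```java
import java.util.Locale;

class CaseMappingDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    public static void main(String[] args) {
        // One-to-many mapping: sharp s uppercases to two letters.
        System.out.println("Hei\u00DF".toUpperCase(Locale.ROOT));

        // Context-sensitive mapping: capital sigma lowercases to the final form at word end.
        System.out.println("\u038C\u03A3\u039F\u03A3".toLowerCase(Locale.ROOT));

        // Language-sensitive mapping: Turkish distinguishes dotted and dotless i.
        System.out.println("topkap\u0131 istanbul".toUpperCase(TURKISH));
        System.out.println("topkap\u0131 istanbul".toUpperCase(Locale.ROOT));

        // Titlecase differs from uppercase for the dz digraph character (assumed example).
        System.out.printf("U+%04X U+%04X%n",
                Character.toTitleCase(0x01C6), Character.toUpperCase(0x01C6));
    }
}
```

Passing an explicit locale matters: without one, the result depends on the default locale of the machine running the code.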

Forms of Text
ä (U+00E4) = a + ¨ (U+0061 + U+0308)
§ Equivalent text – equivalent behavior
§ Same display (for supported repertoire)
§ Normalization generates unique forms
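
The equivalence of the precomposed and decomposed forms can be checked with `java.text.Normalizer`. A minimal sketch follows; the class and method names (`NormalizationDemo`, `canonicallyEquivalent`) are illustrative, and comparing NFC forms is one reasonable way to test canonical equivalence, not the only one.

```java
import java.text.Normalizer;

class NormalizationDemo {
    /** Two strings are treated as equivalent here if their NFC forms are identical. */
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E4";   // a-umlaut as one code point
        String decomposed = "a\u0308";   // a followed by combining diaeresis

        System.out.println(precomposed.equals(decomposed));                 // binary comparison
        System.out.println(canonicallyEquivalent(precomposed, decomposed)); // after normalization
    }
}
```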

Right-To-Left and Bi-Directional Text
Sample display: ﺃﺒـﻞ ،(IBM). ﺇﻡ. ﺑﻲ. آﻲ ﻳﻭـﺖ ،(APPLE) Hewlett-) ﺑـﺎﻛـﺮﺩ ،(Packard ﻣﺎﻳﻜﺮﻭﺳﻮﻓﺖ ﺃﻮﺭﺍـﻞ ،(Microsoft) ﺻﻦ ،(Oracle) (Sun) … ISO ) ١٠٦٤٦ ﺇﻳﺰﻭ (10646
§ Text stored in logical order: no special consideration for processing, only for UI and for legacy encoding conversion
§ RTL text (mostly Arabic and Hebrew) flows from right to left
§ Embedded numbers and LTR text flow left to right
§ Line break preserves reading order
§ Selection: contiguous text ≠ contiguous display
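
How logically ordered text splits into directional runs can be explored with `java.text.Bidi`. This is a small sketch with an invented sample string (three Hebrew letters followed by "IBM") and a hypothetical class name `BidiDemo`; odd embedding levels mark right-to-left runs and even levels mark left-to-right runs.

```java
import java.text.Bidi;

class BidiDemo {
    /** Analyzes text stored in logical order; base direction follows the first strong character. */
    static Bidi analyze(String logicalText) {
        return new Bidi(logicalText, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    public static void main(String[] args) {
        String text = "\u05D0\u05D1\u05D2 IBM"; // Hebrew letters, space, Latin letters
        Bidi bidi = analyze(text);

        System.out.println("mixed directions: " + bidi.isMixed());
        for (int run = 0; run < bidi.getRunCount(); run++) {
            // Each run is a contiguous logical range that is displayed in one direction.
            System.out.printf("run %d: [%d, %d) level %d%n",
                    run, bidi.getRunStart(run), bidi.getRunLimit(run), bidi.getRunLevel(run));
        }
    }
}
```

The stored string never changes; only the display order derived from the levels does, which is why processing code can usually ignore directionality.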

Sorting, Searching, Matching
§ Binary order: A < C < Z < a < c < z < Ç
  – Code Point Order (same as UTF-8 binary comparison)
  – UTF-16 Order (Java String binary comparison)
  – Refinements, usually only for matching, not sorting
    • Case-insensitive
    • Matching equivalent forms of text
§ Language-sensitive collation: a < A < c < C < Ç < z < Z
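
The two binary orders differ only when supplementary characters meet BMP characters above the surrogate range. The sketch below illustrates this; the class name `BinaryOrderDemo`, its helper methods, and the choice of U+FF21 as the BMP sample are assumptions for illustration, and `Arrays.compareUnsigned` requires Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BinaryOrderDemo {
    /** UTF-16 code unit order, which is what String.compareTo implements. */
    static int utf16Order(String a, String b) {
        return a.compareTo(b);
    }

    /** Code point order, obtained by comparing the UTF-8 bytes as unsigned values. */
    static int codePointOrder(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
    }

    /** Sorts in binary (UTF-16) order and joins the result for display. */
    static String binarySorted(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return String.join(" < ", copy);
    }

    public static void main(String[] args) {
        System.out.println(binarySorted("a", "A", "c", "C", "\u00C7", "z", "Z"));

        String bmp = "\uFF21";                                        // fullwidth A, above the surrogates
        String supplementary = new String(Character.toChars(0x21BC1)); // stored as a surrogate pair
        System.out.println("UTF-16 order:     " + Integer.signum(utf16Order(bmp, supplementary)));
        System.out.println("code point order: " + Integer.signum(codePointOrder(bmp, supplementary)));
    }
}
```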

Collation: UCA + Language Tailorings
§ Context-sensitive, language-sensitive
  – china < China < chinas
  – æ ≅ a+e
  – c < d < … < k < ch < l
  – Adding/removing trailing character can change sorting considerably
§ String → Sequence of weights; not reversible
§ Attributes: Lowercase first, ignore case or punctuation, …
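
Language-sensitive ordering of this kind can be approximated with the JDK's `java.text.Collator`. This sketch is an illustration only: the JDK collator uses its own rule tables rather than ICU's UCA implementation, the class name `CollationDemo` and its helpers are invented, and the exact order may vary between JDK versions.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationDemo {
    /** An English collator that distinguishes accents and case (tertiary strength). */
    static Collator englishCollator() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    /** Sorts with the given collator and joins the result for display. */
    static String collated(Collator collator, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, collator);
        return String.join(" < ", copy);
    }

    public static void main(String[] args) {
        Collator collator = englishCollator();
        System.out.println(collated(collator, "Z", "z", "\u00C7", "C", "c", "A", "a"));
        System.out.println(collated(collator, "chinas", "China", "china"));

        // A collation key is the string's weight sequence; the text cannot be recovered from it.
        CollationKey key = collator.getCollationKey("China");
        System.out.println(key.toByteArray().length + " weight bytes");

        // Lowering the strength to PRIMARY ignores case and accents.
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.compare("china", "China"));
    }
}
```

Collation keys are worth precomputing when the same strings are compared many times, for example in a database index.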

Security: Spoofing with Look-Alikes
Olive – 01ive
ICU – 1CU
Ham – Harn
Paypal – Paypаl
§ Not new with Unicode, but more opportunities due to more characters
§ UTR #36: Unicode Security Considerations
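
One simple heuristic for the Paypal case is to flag strings that mix scripts. The sketch below is not the method described in UTR #36, only a rough first check; `MixedScriptDemo` and its methods are hypothetical names, and same-script look-alikes such as 01ive or Harn pass through it undetected.

```java
import java.util.EnumSet;
import java.util.Set;

class MixedScriptDemo {
    /** Collects the scripts used in a string, ignoring Common and Inherited characters. */
    static Set<Character.UnicodeScript> scriptsOf(String s) {
        EnumSet<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        s.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    /** Heuristic only: more than one script in a single identifier is suspicious. */
    static boolean isMixedScript(String s) {
        return scriptsOf(s).size() > 1;
    }

    public static void main(String[] args) {
        System.out.println(scriptsOf("Paypal"));
        System.out.println(scriptsOf("Payp\u0430l")); // U+0430 is Cyrillic small a
    }
}
```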

Unicode Text Encodings
UTF-16
§ In-memory strings, best for processing
§ Java, .NET, Windows, Mac OS X, JavaScript, inside browsers, …
String aa = "a\u00E4";
UTF-8
§ Storage & Protocols
§ .txt, .html, .xml, …
<?xml version="1.0" encoding="UTF-8"?>

Unicode Text Encoding Examples

Character   Code Point   UTF-16       UTF-8
a           U+0061       0061         61
ä           U+00E4       00E4         C3 A4
σ           U+03C3       03C3         CF 83
א           U+05D0       05D0         D7 90
٣           U+0663       0663         D9 A3
カ          U+30AB       30AB         E3 82 AB
退          U+9000       9000         E9 80 80
𡯁          U+21BC1      D846 DFC1    F0 A1 AF 81
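
The encoded forms in these examples can be reproduced from the code points alone. A minimal sketch, assuming nothing beyond the JDK; the class name `EncodingDemo` and its helper methods are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** UTF-8 bytes of one code point, as space-separated hex. */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to print the byte as unsigned
        }
        return sb.toString();
    }

    /** UTF-16 code units of one code point, as space-separated hex. */
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x0061, 0x00E4, 0x03C3, 0x05D0, 0x0663, 0x30AB, 0x9000, 0x21BC1};
        for (int cp : samples) {
            System.out.printf("U+%04X  %-10s %s%n", cp, utf16Hex(cp), utf8Hex(cp));
        }
    }
}
```

Characters in the Basic Multilingual Plane take one UTF-16 code unit; U+21BC1 lies above U+FFFF and therefore needs a surrogate pair in UTF-16 and four bytes in UTF-8.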

Common Locale Data Repository
§ CLDR: industry standard for locale data
§ Adoption brings consistency across the industry
§ Display names for languages, countries, currencies, etc.
§ Date/time/number formats and data for parsing
§ Language tailorings for collation and text segmentation
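
Locale data of this kind is what sits behind display-name and formatting APIs. The sketch below uses the JDK's own `Locale` and `NumberFormat` classes as an illustration; recent JDKs (9 and later) take their default locale data from CLDR, but the class name `LocaleDataDemo` and the sample values are assumptions, and other locales or formats may differ between JDK versions.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleDataDemo {
    /** Name of a language as shown to users of another locale. */
    static String languageName(Locale language, Locale displayLocale) {
        return language.getDisplayLanguage(displayLocale);
    }

    /** Formats a number with the grouping and decimal separators of the given locale. */
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(languageName(Locale.GERMAN, Locale.ENGLISH));
        System.out.println(languageName(Locale.GERMAN, Locale.FRENCH));
        System.out.println(formatNumber(1234567.89, Locale.US));
        System.out.println(formatNumber(1234567.89, Locale.GERMANY));
    }
}
```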

References
Unicode: http://www.unicode.org/
IBM software globalization: http://ibm.com/software/globalization
ICU docs & papers: http://icu.sourceforge.net/docs/
ICU: http://ibm.com/software/globalization/icu
ICU (IBM intranet): http://icu.sanjose.ibm.com/