ICU Overview: The Open-Source Unicode Library, v3.2
Markus Scherer, ICU Manager, IBM Globalization Center of Competency
27th Internationalization and Unicode Conference, Berlin, Germany, April 2005

Agenda
§ Background
§ What is ICU?
§ Architecture Overview
§ ICU Features and recent additions
§ References
§ Q and A

Why Globalization?

Unicode
§ All world languages
§ Efficient and effective processing
§ Lossless data exchange
§ Enables single-binary global software
§ But… all languages ⇒ large, complex standard
  – 1,400 pages + Annexes + additional standards
  – 90,000+ characters
  – Major update every 3 years
  – 70 character properties, many multi-valued
  – Affects many processes: display, line-break, regex, …

Locales
§ Features vary widely across languages & countries
  – Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
  – Performance is key: easy to do the right thing; hard to do it fast

What is ICU?
§ Globalization / Unicode / Locales
§ Mature, widely used set of C/C++ and Java libraries
  – Basis for Java 1.1 internationalization, but goes far beyond
  – "ICU4C": C/C++ libraries; "ICU4J": Java library
§ Very portable
  – Identical results on all platforms / programming languages
  – C/C++: 30+ platforms/compilers
  – Java: IBM & Sun JDK
§ Full threading model; customizable; modular
§ Open source, but not viral
§ ICU 3.2: 78 languages; 118 countries; 870 codepages

Who uses ICU? (Examples)
§ Products Within IBM
  – DB2, COBOL, InfoPrint Manager, Lotus Notes, Lotus Workplace, Tivoli Presentation Services, WebSphere, XML Parser, …
§ Other Companies and Organizations
  – Adobe, Apple (Mac OS X), BEA, CERN, Cognos, Debian, HP, Inktomi, JD Edwards, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, PayPal, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, webMethods, …

ICU Features
§ Unicode text handling
§ Charset conversions (870+)
§ Collation & Searching
§ Locales (170+)
§ Resource Bundles
§ Calendar & Time zones
§ Complex-text layout engine
§ Unicode Regular Expressions
§ Breaks: word, line, …
§ Formatting
  – Date & time
  – Messages
  – Numbers & currencies
§ Transforms
  – Normalization
  – Casing
  – Transliterations

Architecture Overview 1
§ Locale Based Services
  – Locale is an identifier, not a container
  – Keywords for variants: de@collation=phonebook
  – Recent addition: accept-language support
§ Resource inheritance: shared resources
[Diagram: inheritance tree with levels root; Language (en, de, zh); Script (Hant, Hans); Country (US, IE, DE, CH, TW, CN)]
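
Resource inheritance can be pictured as a lookup chain from the most specific locale ID down to root. The sketch below is a simplified illustration in plain Java, not ICU's actual lookup code. The class name, the method name, and the sample ID zh_Hant_TW are assumptions made for the example, and real lookup involves more steps than trimming the ID.

```java
import java.util.ArrayList;
import java.util.List;

class LocaleFallbackDemo {
    /** Returns the lookup order for a locale ID such as "zh_Hant_TW". */
    static List<String> fallbackChain(String localeId) {
        // Drop keywords such as "@collation=phonebook": in this sketch they
        // select a service variant, not a different resource bundle.
        int at = localeId.indexOf('@');
        String id = at >= 0 ? localeId.substring(0, at) : localeId;

        List<String> chain = new ArrayList<>();
        while (!id.isEmpty()) {
            chain.add(id);
            int cut = id.lastIndexOf('_');
            id = cut >= 0 ? id.substring(0, cut) : ""; // strip the last subtag
        }
        chain.add("root"); // shared resources that every locale inherits
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("zh_Hant_TW"));
        System.out.println(fallbackChain("de@collation=phonebook"));
    }
}
```

A resource missing from the first entry of the chain would be taken from the next one, so data shared by all German locales only needs to be stored once under de.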

Architecture Overview 2
§ Open and Close Service Model
  – Open a service object, use it many times, close it when done
  – Better performance by avoiding setup costs per operation
  – Warning: for maximum performance, reuse service objects instead of reopening them for each operation
§ ICU Threading Model
  – Multiple service objects in use simultaneously, with same or different attributes
  – Large resources shared in read-only cache
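
A minimal sketch of the open, use many times, close pattern. To stay self-contained it uses the JDK's java.text.Collator rather than ICU itself; ICU4J offers a class of the same name in com.ibm.icu.text, and in ICU4C the corresponding calls are ucol_open and ucol_close. The class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ReuseServiceDemo {
    static List<String> sortAll(List<String> words, Locale locale) {
        // "Open" the service object once; this is the expensive step.
        Collator collator = Collator.getInstance(locale);

        // Reuse the same object for every comparison made by the sort.
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);

        // Java has no explicit close; the object is garbage-collected.
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortAll(List.of("cherry", "apple", "banana"), Locale.ENGLISH));
    }
}
```

Creating the collator inside the comparison loop would still work, but it would pay the setup cost on every comparison, which is what the warning on the slide is about.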

Architecture Overview 3
§ Data Driven Services
  – Customize at build-time or run-time
  – Interchange with other platforms
    • Same results on each
  – Rule-based
    • Collation, Word-breaks, Transforms
  – Pattern-based
    • Formats, UnicodeSet
  – Table-based
    • Character Conversion

Architecture Overview – ICU4C
§ Simple Error Handling
  – C++ subset for portability
  – Support for multi-threaded environment
§ Version Management
  – Multiple versions at the same time
  – Data and library versioning
§ String Buffer Management
  – Preflighting and overflow protection
§ Misc: Load/Unload ICU
§ Recent Additions:
  – Runtime-settable memory allocation and mutex functions
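
Preflighting means calling a function once with a buffer that is too small (or empty) to learn the required length, then calling it again with a buffer of that size. In ICU4C the first call returns the needed length and reports U_BUFFER_OVERFLOW_ERROR. The sketch below only imitates that contract in Java; toUpper and toUpperPreflighted are hypothetical helpers, not ICU functions.

```java
import java.util.Locale;

class PreflightDemo {
    /**
     * Hypothetical helper: writes src in upper case into dest if it fits,
     * and always returns the number of chars the full result needs.
     * Nothing is written when dest is too small (overflow protection).
     */
    static int toUpper(String src, char[] dest) {
        String result = src.toUpperCase(Locale.ROOT);
        if (result.length() <= dest.length) {
            result.getChars(0, result.length(), dest, 0);
        }
        return result.length();
    }

    static String toUpperPreflighted(String src) {
        int needed = toUpper(src, new char[0]); // preflight: measure only
        char[] buffer = new char[needed];       // allocate exactly enough
        int written = toUpper(src, buffer);     // second call fills the buffer
        return new String(buffer, 0, written);
    }

    public static void main(String[] args) {
        // The result can be longer than the input, so the input length
        // is not a safe buffer size.
        System.out.println(toUpperPreflighted("stra\u00dfe"));
    }
}
```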

Architecture Overview – ICU4J
§ Supplement for Java
§ Core globalization (no character conversion or regular expressions, no GUI components)
  – We do supply complex text support for Sun
§ Modularized: products may add just needed functionality

ICU4J vs. JDK
§ CLDR 1.2 (Common Locale Data Repository)
§ Up-to-date globalization: standards-compliant; latest Unicode
  – Supplementary characters (GB 18030, JIS X 0213, HKSCS)
    • Java 5 adds handling of supplementary characters
  – Full properties (JDK has only a fraction)
  – Unicode Collation Algorithm
  – Local calendars (Thailand, Japan, …); ISO dates
  – Currencies, String Search, Int'l Domain Names
  – Transforms: Case, Scripts, Normalization
§ Much faster turn-around on bug fixes, enhancements

Unicode Text Handling
§ C
  – UChar*: null-terminated or with length
§ C++
  – UnicodeString: full featured string class
§ Java
  – Uses normal JDK String, adds utilities
§ All handle supplementary characters
  – Required for GB 18030/JIS X 0213/HKSCS repertoires
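
Supplementary characters occupy two 16-bit code units in a Java String (and in a UChar* array), so code that walks a string should step by code point, not by char. A small sketch using only JDK String and Character methods; the class and method names are made up for the example.

```java
class CodePointDemo {
    /** Counts Unicode code points, not 16-bit code units. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // reads a surrogate pair as one value
            i += Character.charCount(cp);  // advance by 1 or 2 code units
            count++;
        }
        return count;
    }

    /** Returns the code point that starts at the given code-unit index. */
    static int codePointAt(String s, int index) {
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD38b"; // 'a', U+1D538 (a supplementary character), 'b'
        System.out.println(s.length());          // code units
        System.out.println(countCodePoints(s));  // code points
        System.out.println(Integer.toHexString(codePointAt(s, 1)));
    }
}
```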

Unicode Text Handling 2
§ All Unicode 4.0.1 properties
  – Direct API
    • Values, names, enumerations
  – UnicodeSet
    • Fast, compact set operations
    • Pattern-based (both Perl & POSIX syntax for properties)
      – \p{greek} vs. [:greek:]
    • All properties:
      – [\p{lowercase}-[a-z]]
      – [\p{greek} & \p{uppercase}]
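
The two set expressions on the slide (a difference and an intersection of property sets) can be approximated with JDK regular-expression character classes. This is a stand-in, not UnicodeSet itself: java.util.regex writes intersection as && and has no set-difference operator, so the difference is expressed as an intersection with a negated class. The class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

class PropertySetDemo {
    // Intersection: Greek script AND Uppercase property.
    static final Pattern GREEK_UPPER =
            Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    // Difference: Lowercase property minus ASCII a-z.
    static final Pattern NON_ASCII_LOWER =
            Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    /** True if the whole string is one character contained in the set. */
    static boolean contains(Pattern set, String ch) {
        return set.matcher(ch).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(GREEK_UPPER, "\u03A9"));     // capital omega
        System.out.println(contains(GREEK_UPPER, "\u03C9"));     // small omega
        System.out.println(contains(NON_ASCII_LOWER, "\u00DF")); // sharp s
        System.out.println(contains(NON_ASCII_LOWER, "a"));
    }
}
```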

Data: Recent Additions
§ Conforms to CLDR 1.2
  – 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones
  – Added data for new languages: Malayalam, Oriya, Welsh
§ Reduced multiplatform install image size
§ Improved XLIFF-ICU conversion tools
§ Locale canonicalization spec defined and implemented (C+J)
  – Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support

Character Set Conversion
§ Precise alias information:
  – When you ask for "SJIS", you can request the precise definition by platform:
    • Windows, IBM, Solaris, …
§ Buffer management
  – Automatically handles characters that cross buffers
§ Customizations allowed for:
  – Illegal sequences
  – Undefined characters
§ Unicode Text Compression
  – SCSU, BOCU-1
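
The buffer-management point is easiest to see with input that arrives in chunks, where a multi-byte character may be split across two chunks. The sketch below shows the idea with the JDK's CharsetDecoder as a stand-in for an ICU converter (ICU4J itself does not include character conversion). The class name and the chunking scheme are assumptions for the example. The REPLACE actions play the role of the customizable handling for illegal sequences and undefined characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ChunkedDecodeDemo {
    /** Decodes UTF-8 input that is fed to the decoder chunkSize bytes at a time. */
    static String decodeInChunks(byte[] input, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Extra room for the few bytes of an incomplete character carried over.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 8);
        // UTF-8 never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(input.length + 1);

        int pos = 0;
        do {
            int n = Math.min(chunkSize, input.length - pos);
            in.put(input, pos, n);
            pos += n;
            in.flip();
            boolean endOfInput = (pos == input.length);
            decoder.decode(in, out, endOfInput);
            in.compact(); // keep undecoded trailing bytes for the next round
        } while (pos < input.length);

        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo \u20acuro".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeInChunks(bytes, 1).equals("h\u00e9llo \u20acuro"));
    }
}
```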

Collation and Searching
§ Fast international comparison and string search; fully UCA compliant
  – Compressed sort keys, optimized string comparison, sublinear string search
  – Incremental sortkeys for radix-sort
§ Precise binary sortkey stability over time
§ Fully data driven
§ API / rule customizations
  – Strength, normalization, upper vs. lowercase first, ignore punctuation, sort digits as numbers, …
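
A short sketch of two of the ideas above: the strength attribute and binary sort keys. It uses the JDK's java.text.Collator so that it runs without extra libraries; the JDK collator is not the UCA implementation described on the slide, and the helper names are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

class StrengthDemo {
    /** Compares two strings at the given Collator strength. */
    static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    /** Same comparison, but through precomputed binary sort keys. */
    static int compareBySortKey(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        CollationKey keyA = collator.getCollationKey(a);
        CollationKey keyB = collator.getCollationKey(b);
        return keyA.compareTo(keyB);
    }

    public static void main(String[] args) {
        // PRIMARY ignores case differences; TERTIARY does not.
        System.out.println(compare("abc", "ABC", Collator.PRIMARY));
        System.out.println(compare("abc", "ABC", Collator.TERTIARY));
    }
}
```

Sort keys pay off when the same strings are compared many times, for example in a database index, because the key is computed once and then compared byte by byte.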

Collation and Searching: Recent Additions
§ Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
  – e.g., filenames would sort "ab-2" < "ab-10"
  – Without material performance cost
  – With reduced sortkey length
§ Significantly improved sorting orders for many other languages
§ Data in separate tree, for easier modularization and maintenance
§ getFunctionalEquivalent API allows for better caching and UI support
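
To illustrate what numeric sorting changes, here is a hand-written comparator that compares runs of ASCII digits by value. It is only a sketch of the ordering, not ICU's implementation: in ICU the behaviour is an attribute of the collator (RuleBasedCollator.setNumericCollation(true) in ICU4J), and non-digit text is compared by collation rules rather than by raw char values as done here.

```java
import java.math.BigInteger;

class NumericOrderDemo {
    /** Compares strings so that digit runs order by numeric value. */
    static int compareNumeric(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isAsciiDigit(ca) && isAsciiDigit(cb)) {
                int iEnd = digitRunEnd(a, i), jEnd = digitRunEnd(b, j);
                int cmp = new BigInteger(a.substring(i, iEnd))
                        .compareTo(new BigInteger(b.substring(j, jEnd)));
                if (cmp != 0) return cmp;
                i = iEnd;
                j = jEnd;
            } else {
                if (ca != cb) return Character.compare(ca, cb);
                i++;
                j++;
            }
        }
        // The string with text left over sorts after the one that ended.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static int digitRunEnd(String s, int start) {
        int end = start;
        while (end < s.length() && isAsciiDigit(s.charAt(end))) end++;
        return end;
    }

    public static void main(String[] args) {
        System.out.println(compareNumeric("ab-2", "ab-10") < 0); // numeric order
        System.out.println("ab-2".compareTo("ab-10") < 0);       // plain order
    }
}
```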

Calendar & Time Zones
§ International Calendars
  – Arabic, Buddhist, Hebrew, Japanese
  – Required for correct presentation of dates in some countries
§ Olson timezone support, with localizations
§ Recent Additions:
  – RFC 822 time zone format support in DateFormat (C+J) for compatibility
  – "Universal Time" conversions for high-precision date/time computations
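
A small sketch of both ideas, Olson time zone IDs and a non-Gregorian calendar, using the JDK's java.time classes as a stand-in for the ICU Calendar and TimeZone services. The class and method names are made up for the example.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

class CalendarZoneDemo {
    /** UTC offset in effect in Berlin at the given local time. */
    static ZoneOffset offsetInBerlin(LocalDateTime local) {
        // "Europe/Berlin" is an Olson (tz database) identifier.
        return local.atZone(ZoneId.of("Europe/Berlin")).getOffset();
    }

    /** Year in the Thai Buddhist calendar for an ISO date. */
    static int buddhistYear(LocalDate isoDate) {
        return (int) ThaiBuddhistDate.from(isoDate).getLong(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        System.out.println(offsetInBerlin(LocalDateTime.of(2005, 4, 6, 12, 0)));
        System.out.println(offsetInBerlin(LocalDateTime.of(2005, 1, 6, 12, 0)));
        System.out.println(buddhistYear(LocalDate.of(2005, 4, 6)));
    }
}
```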

Formatting
§ Date & time: 8 formats per locale
§ Messages
  – Completely localizable, plural support
§ Numbers & currencies
  – Scientific notation, spelled-out (checks, etc.)
  – Full orthogonal currency support
    • INR in Hindi: [sample not preserved]
    • INR in English: Rs. 1,234.57
    • INR in German: Rs. 1.234,57
§ Recent Additions
  – POSIX migration library
  – Allows parsing multiple currencies with one formatter
  – Short and stand-alone month/day names
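
"Orthogonal" currency support means the currency is chosen independently of the locale that controls separators and symbols. A hedged sketch with the JDK's java.text.NumberFormat (ICU4J has a class of the same name in com.ibm.icu.text); the helper names are hypothetical, and the exact currency strings depend on the locale data installed, so none are shown.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

class NumberFormatDemo {
    /** Formats a plain number with the locale's separators. */
    static String formatNumber(Locale locale, double value) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    /** Formats an amount in any currency using the locale's conventions. */
    static String formatCurrency(Locale locale, String isoCode, double value) {
        NumberFormat format = NumberFormat.getCurrencyInstance(locale);
        format.setCurrency(Currency.getInstance(isoCode)); // independent of locale
        return format.format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(Locale.US, 1234.57));
        System.out.println(formatNumber(Locale.GERMANY, 1234.57));
        // Indian rupees shown with English and with German conventions.
        System.out.println(formatCurrency(Locale.US, "INR", 1234.57));
        System.out.println(formatCurrency(Locale.GERMANY, "INR", 1234.57));
    }
}
```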

Transforms
§ Unicode Normalization
  – Highly optimized for performance
  – Performance utilities: concatenation, detection, comparison
§ Casing (upper, lower, title, folding)
§ General Transforms
  – Script transliterations
  – Half-width/Full-width, Hex, etc.
  – Chain transforms together, filter source characters
  – Rule-based, customizable at runtime
§ IDNA: International Domain Names
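
Two of these transforms have direct JDK counterparts, which makes for a compact illustration: java.text.Normalizer for normalization and java.net.IDN for domain names. This is a sketch with JDK classes, not the ICU API; the class and method names and the sample domain are assumptions.

```java
import java.net.IDN;
import java.text.Normalizer;

class TransformDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Two strings are canonically equivalent if their NFD forms match. */
    static boolean canonicallyEqual(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    /** Converts an internationalized domain name to its ASCII form. */
    static String toAsciiDomain(String domain) {
        return IDN.toASCII(domain);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301"; // 'a' followed by a combining acute accent
        String composed = "\u00e1";    // precomposed a-acute
        System.out.println(canonicallyEqual(decomposed, composed));
        System.out.println(nfc(decomposed).equals(composed));
        System.out.println(toAsciiDomain("b\u00fccher.example"));
    }
}
```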

Segmentation: word, line & sentence
§ Fast state-table implementation
§ Customizable
  – Rule-based, customizable at runtime
  – Special customizations, e.g. Thai
§ Recent Additions:
  – Greatly improved performance when going backwards (common case when doing line break)
  – Java
    • The rules syntax has been extended. Rules can now return information about the types of characters they encountered.
    • Common compiled (binary) rule format with ICU4C
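
A minimal word-segmentation sketch. It uses the JDK's java.text.BreakIterator, whose API ICU4J mirrors in com.ibm.icu.text.BreakIterator. The filter that drops segments without letters or digits is an assumption of this example, made because a word iterator also reports boundaries around spaces and punctuation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakDemo {
    /** Splits text at word boundaries and keeps only real words. */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Skip segments that are only whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```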

Unicode Regular Expressions
§ Full Regex Implementation
  – ICU4C only: Java 1.4 has its own package (though not as powerful)
§ All Unicode 4.0.1 Properties
  – Supported through UnicodeSet
§ Good performance
  – Competitive with non-Unicode regex
§ Recent Additions
  – Now features a C API, instead of just C++

Complex-text layout engine
§ Glyph processing, positioning & adjustment
  – Ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
§ Support for:
  – Drawing
  – Caret Display
  – Hit Testing
  – Selection Highlighting
  – Caret Movement
  – Layout Metrics
  – Line Break
§ Recent addition: Canonical Equivalence: a + ´ or á

References
§ ICU main site:
  – http://ibm.com/software/globalization/icu
    • New URL
  – Links to
    • Download ICU
    • User Guide, Technical FAQ, Support, Bug Reports
§ Unicode Consortium
  – http://www.unicode.org
    • Unicode glossary, Unicode character database

Questions and Answers