ICU Overview: The Open Source Unicode Library
George Rhoten
IBM Globalization Center of Competency
28th Internationalization and Unicode Conference
© 2005 IBM Corporation

Agenda
§ Background Information
§ What is ICU?
§ Architecture Overview
  – Significant New ICU Features
§ References
§ Q and A
28th Internationalization and Unicode Conference, Orlando, Florida, September 2005

Why Globalization?

Unicode
§ Handles all modern world languages
§ Efficient and effective processing
§ Lossless data exchange
§ Enables single-binary global software
§ But… all languages ⇒ large, complex standard
  – 1,400 pages + Annexes + additional standards
  – 96,000+ characters
  – Major update every 3 years
  – Minor update about once a year
  – 70 character properties, many multi-valued
  – Affects many processes: display, line-break, regular expressions, …

Internationalization, Localization & Locales
§ Requirements vary widely across languages & countries
  – Sorting
  – Text searching
  – Line breaks
  – Date/time/number/currency formatting
  – Codepage conversion
  – …and so on
§ Performance is key
  – It is easy to do the right thing
  – It is hard to do it fast

What is ICU?
§ International Components for Unicode
§ Globalization / Unicode / Locales
§ Mature, widely used set of C/C++ and Java libraries
  – Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
§ Very portable – identical results on all platforms / programming languages
  – C/C++: 30+ platforms/compilers
  – Java: IBM & Sun JDK
  – You can use: C/C++ (ICU4C), Java (ICU4J), C/C++ with Java (ICU4JNI)
§ Full threading model
§ Customizable
§ Modular
§ Open source – but non-restrictive

Who uses ICU?
§ Products Within IBM
  – All 5 major software brands
  – Many other related software applications
  – Used on all IBM operating systems
§ Other Companies and Organizations
  – Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks, MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping LLC., Xerox, Yahoo! ... and many more

ICU Features
§ Unicode text handling
§ Charset conversions (700+)
§ Breaks: word, line, …
§ Collation & Searching
§ Locales from CLDR (250+)
§ Resource Bundles
§ Calendar & Time zones
§ Complex-text layout engine
§ Unicode Regular Expressions
§ Formatting
  – Date & time
  – Messages
  – Numbers & currencies
§ Transforms
  – Normalization
  – Casing
  – Transliterations

Architecture Overview 1
§ Locale Based Services
  – Locale is an identifier, not a container
  – Keywords for variants: de@collation=phonebook
§ Resource inheritance: shared resources
[Diagram: resource inheritance tree. Levels from top to bottom: root; Language (en, de, zh); Script (Hant, Hans); Region (US, IE, DE, CH, TW, CN)]
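The resource inheritance idea can be sketched in Java. This is a minimal illustration that assumes the JDK's own ResourceBundle lookup as a stand-in for ICU's resource bundles; the bundle base name "Messages" and the class name are hypothetical, and no bundle needs to exist for the lookup chain to be computed.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

final class LocaleFallbackSketch {
    /** Returns the locales consulted for a resource, most specific first. */
    static List<Locale> fallbackChain(Locale locale) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        // "Messages" is a hypothetical base name; it is not loaded here
        return control.getCandidateLocales("Messages", locale);
    }

    public static void main(String[] args) {
        // A region locale inherits from its language, which inherits from root
        for (Locale l : fallbackChain(Locale.forLanguageTag("de-CH"))) {
            System.out.println(l.equals(Locale.ROOT) ? "root" : l.toString());
        }
        // prints de_CH, de, root (one per line)
    }
}
```

Anything not overridden at the de_CH level is picked up from de, and then from root, which is why shared data only has to be stored once.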

Architecture Overview 2
§ Open and Close Service Model
  – Open a service object, use it many times, close it when done
  – Better performance by avoiding setup costs per operation
§ ICU Threading Model
  – Multiple service objects in use simultaneously with same or different attributes
  – Large resources shared in read-only cache
  – Compatible with Java threading model

Architecture Overview 3
§ Data Driven Services
  – Customize at build-time or run-time
  – Interchange with other platforms
    • same results on each
  – Rule-based
    • Collation, Word-breaks, Transforms
  – Pattern-based
    • Date/Time/Number/Message formatting
  – Table-based
    • Character Conversion

Architecture Overview – ICU4C
§ Simple Error Handling
  – Thread safe
  – Works in C and C++
§ C/C++ subset for portability
§ Version Management
  – Multiple versions of ICU4C in the same process memory space
  – Data and library versioning
§ String Buffer Management
  – Preflighting and overflow protection
§ Flexible
  – Allows Loading and Unloading ICU4C libraries
  – Runtime settable memory allocation and mutex functions

Architecture Overview – ICU4J
§ Supplement for Java
§ Core globalization (no character conversion or regular expressions)
  – We do supply complex text support for Sun
§ Modularized: products may add just needed functionality
§ Usually drop-in replacement for JDK functionality
  – Changing the import statements is usually all that is needed

ICU4J: Supplement for Java
§ CLDR (Common Locale Data Repository)
  – More fully supported locales than Java
§ Up-to-date globalization: standards-compliant; latest Unicode
  – Supplementary characters (GB 18030, JIS X 0213, HKSCS)
    • Java 5 adds handling of supplementary characters
  – Full properties – JDK has only a fraction
  – Unicode Collation Algorithm
  – Local calendars (Islamic, Japan, …); more time zone localizations
  – Currencies, String Search, Internationalized Domain Names
  – Transforms: Case, Scripts, Normalization
§ Much shorter release cycle and quicker support for Unicode standard

Unicode Text Handling
§ C (UTF-16)
  – UChar*: null-terminated or with length
§ C++ (UTF-16)
  – UnicodeString: full featured string class
§ Java (UTF-16 BE)
  – Uses java.lang.String and adds utilities
§ All handle supplementary characters
  – Required for GB 18030 and JIS X 0213 repertoire
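What "handling supplementary characters" means for UTF-16 strings can be shown with plain java.lang.String. This is a small sketch using only JDK methods (Java 8 or later assumed); the class name is illustrative.

```java
final class SupplementaryCharSketch {
    /** Counts code points rather than UTF-16 code units. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20000 lies outside the BMP, so UTF-16 stores it as a surrogate pair
        String s = new String(Character.toChars(0x20000));
        System.out.println(s.length());     // 2 (code units)
        System.out.println(codePoints(s));  // 1 (code point)
        // Iterate by code point so the pair is never split
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Code that indexes by `char` sees two units here, which is why APIs that work on code points are needed for the GB 18030 and JIS X 0213 repertoires.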

Unicode Text Handling 2
§ All Unicode 4.1 properties
  – direct API
    • values, names, enumerations
  – UnicodeSet
    • Fast, compact set operations (union, intersection, …)
    • Pattern-based (both Perl & POSIX syntax for properties)
      – \p{greek} vs. [:greek:]
    • All properties:
      – [\p{lowercase}-[a-z]]
      – [\p{greek} & \p{uppercase}]
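The two set patterns above can be approximated without ICU. The sketch below assumes java.util.regex character classes (Java 7 or later) as a stand-in for UnicodeSet; note that the JDK spells set difference and intersection with `&&` and `[^…]` rather than the `-` and `&` of the slide's pattern syntax. Class and constant names are illustrative.

```java
import java.util.regex.Pattern;

final class PropertySetSketch {
    // JDK analogue of [\p{lowercase}-[a-z]]: lowercase characters except ASCII a-z
    static final Pattern LOWER_NOT_ASCII = Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");
    // JDK analogue of [\p{greek} & \p{uppercase}]: Greek script AND uppercase
    static final Pattern GREEK_UPPER = Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    /** Tests whether a single code point belongs to the set described by the pattern. */
    static boolean contains(Pattern set, int codePoint) {
        return set.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(GREEK_UPPER, 0x03A9));     // GREEK CAPITAL LETTER OMEGA: true
        System.out.println(contains(GREEK_UPPER, 0x03C9));     // GREEK SMALL LETTER OMEGA: false
        System.out.println(contains(LOWER_NOT_ASCII, 0x00DF)); // LATIN SMALL LETTER SHARP S: true
        System.out.println(contains(LOWER_NOT_ASCII, 'a'));    // false
    }
}
```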

Recent Additions
§ Conforms to CLDR 1.3
  – Adds many translated terms for languages, scripts, regions, currencies, and time zones.
  – Access to more CLDR items
§ Support for Unicode interpretation of POSIX properties
§ Charset detection API (ICU4J only)
§ Better modularization for memory constrained environments (ICU4C only)

Character Set Conversion
§ Precise alias information:
  – When you ask for “Shift-JIS”, you can request the precise definition by platform (e.g. Windows, IBM, Java, …)
§ Buffer management
  – API automatically handles characters that cross buffers
  – Can provide offset mappings between byte buffer and UChar buffer
§ Runtime customizations allowed for:
  – illegal sequences
  – undefined characters
§ Unicode Text Compression
  – SCSU, BOCU-1
§ Consistent conversion results across platforms
§ You can use more character sets at runtime or build time
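The runtime customization for illegal sequences can be illustrated as follows. Because the ICU4J slide lists character conversion as out of scope, this sketch assumes the JDK's java.nio.charset decoder as a stand-in for the ICU4C converter callbacks; the helper name decodeUtf8 is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ConversionPolicySketch {
    /** Decodes UTF-8 bytes, applying the given policy to illegal or unmappable input. */
    static String decodeUtf8(byte[] bytes, CodingErrorAction onError) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(onError)
                    .onUnmappableCharacter(onError)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Only reachable with CodingErrorAction.REPORT
            throw new IllegalArgumentException("illegal byte sequence", e);
        }
    }

    public static void main(String[] args) {
        byte[] input = {0x41, (byte) 0xFF, 0x42}; // 0xFF can never appear in UTF-8
        System.out.println(decodeUtf8(input, CodingErrorAction.IGNORE));  // AB
        // REPLACE substitutes U+FFFD for the bad byte
        System.out.println(decodeUtf8(input, CodingErrorAction.REPLACE).equals("A\uFFFDB")); // true
        try {
            decodeUtf8(input, CodingErrorAction.REPORT);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The same input yields three different outcomes depending on the policy chosen at run time, which is the kind of choice the slide describes.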

Collation: Sorting, Searching and Matching
§ Fast international comparison for string search; fully UCA compliant
  – Compressed sort keys, optimized string comparison, sublinear string search
  – Incremental sortkeys used for radix sorting
§ Precise binary sortkey stability over time (library versioning)
§ Fully data driven
  – Many common rules provided
§ Runtime and build time rule customizations
  – strength, normalization, upper vs. lowercase first, ignore punctuation, numeric, …
  – Only delta from UCA is needed for rule customization
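The strength setting can be sketched with java.text.Collator, assumed here as a stand-in for the ICU collator (the deck describes ICU4J as usually a drop-in replacement once imports are changed). The collator is created once and reused, in the spirit of the open/use/close service model; the class and helper names are illustrative.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CollationSketch {
    /** Creates a collator for the locale with the given comparison strength. */
    static Collator collator(Locale locale, int strength) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);
        return c;
    }

    /** Returns a sorted copy; Collator is itself a Comparator. */
    static List<String> sorted(List<String> words, Collator c) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(c);
        return copy;
    }

    public static void main(String[] args) {
        Collator primary = collator(Locale.ENGLISH, Collator.PRIMARY);
        Collator tertiary = collator(Locale.ENGLISH, Collator.TERTIARY);
        System.out.println(primary.compare("abc", "ABC") == 0);  // true: case is ignored
        System.out.println(tertiary.compare("abc", "ABC") == 0); // false: case counts
        // Binary order would put "Zebra" first; linguistic order does not
        System.out.println(sorted(Arrays.asList("banana", "Zebra", "apple"), tertiary));
        // [apple, banana, Zebra]
    }
}
```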

Calendar & Time Zones
§ International Calendars
  – Islamic, Buddhist, Hebrew, Japanese
  – Required for correct presentation of dates in some countries
§ Olson timezone support with localizations
§ Recent Additions:
  – Many more time zone localizations
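To make the calendar point concrete, the sketch below converts one ISO date into two of the calendars named above and looks up an Olson time zone. It assumes the java.time API (Java 8 or later) rather than ICU's own Calendar classes, and it covers only the Buddhist and Japanese calendars; the class and method names are illustrative.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

final class CalendarSketch {
    /** Year in the Thai Buddhist calendar for an ISO date. */
    static int buddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    /** Year within the current imperial era for an ISO date. */
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    /** UTC offset in whole hours for an Olson zone ID at a local date-time. */
    static int utcOffsetHours(String olsonId, LocalDateTime local) {
        return ZoneId.of(olsonId).getRules().getOffset(local).getTotalSeconds() / 3600;
    }

    public static void main(String[] args) {
        LocalDate conference = LocalDate.of(2005, 9, 1);
        System.out.println(buddhistYear(conference));               // 2548
        System.out.println(JapaneseDate.from(conference).getEra()); // Heisei
        System.out.println(japaneseYearOfEra(conference));          // 17
        // Daylight saving time is in effect in New York in September
        System.out.println(utcOffsetHours("America/New_York", conference.atTime(12, 0))); // -4
    }
}
```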

Formatting
§ Date & time: 8 formats per locale by default
§ Messages
  – Completely localizable, plural support
§ Numbers & currencies
  – Scientific Notation, Spelled-out (checks, etc.)
  – Full Orthogonal Currency support
    • INR
      In Hindi: [sample not preserved in this transcript]
      In English: Rs. 1,234.57
      In German: Rs. 1.234,57
§ Recent Additions
  – List available currencies API
  – Short and stand-alone month/day names
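The English and German amounts above differ only in their separators, which pattern-based formatting handles by combining one pattern with locale-specific symbols. The sketch assumes the JDK's java.text classes as stand-ins for their ICU counterparts; it omits the currency symbol, and the message uses a JDK choice pattern as a simple form of plural handling. Names such as fileCount are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.MessageFormat;
import java.util.Locale;

final class FormattingSketch {
    /** Formats with one fixed pattern; only the locale's symbols change. */
    static String amount(double value, Locale locale) {
        DecimalFormat f = new DecimalFormat("#,##0.00", DecimalFormatSymbols.getInstance(locale));
        return f.format(value);
    }

    /** Picks a message variant by count using a choice sub-pattern. */
    static String fileCount(int n) {
        MessageFormat m = new MessageFormat(
                "{0,choice,0#No files|1#One file|1<{0,number,integer} files} found.",
                Locale.ENGLISH);
        return m.format(new Object[] {n});
    }

    public static void main(String[] args) {
        System.out.println(amount(1234.57, Locale.US));      // 1,234.57
        System.out.println(amount(1234.57, Locale.GERMANY)); // 1.234,57
        System.out.println(fileCount(0)); // No files found.
        System.out.println(fileCount(1)); // One file found.
        System.out.println(fileCount(3)); // 3 files found.
    }
}
```

Because the whole message is a single pattern string, a translator can reorder or reword it without any code change.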

Transforms
§ Unicode Normalization
  – Highly optimized for performance
  – performance utilities: concatenation, detection, comparison
§ Casing (upper, lower, title, folding)
§ General Transforms
  – Script transliterations
  – Half-width/Full-width, Hex, etc.
  – Chain transforms together, filter source characters
  – Rule-based, customizable at runtime.
§ String Prep: NFS, Internationalized Domain Names (IDN)
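Normalization and locale-sensitive casing can be tried with the JDK alone. The sketch below assumes java.text.Normalizer and String case mapping as stand-ins for the ICU transforms; transliteration has no JDK equivalent and is not shown. The class and helper names are illustrative.

```java
import java.text.Normalizer;
import java.util.Locale;

final class TransformSketch {
    /** Composes base letters and combining marks where a precomposed form exists. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compares two strings for canonical equivalence via their decomposed forms. */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301";  // a + COMBINING ACUTE ACCENT
        String precomposed = "\u00E1";  // LATIN SMALL LETTER A WITH ACUTE
        System.out.println(decomposed.equals(precomposed));           // false: different code points
        System.out.println(nfc(decomposed).equals(precomposed));      // true
        System.out.println(canonicallyEqual(decomposed, precomposed)); // true
        // Casing can change string length and depends on the locale
        System.out.println("stra\u00DFe".toUpperCase(Locale.GERMAN)); // STRASSE
        System.out.println("I".toLowerCase(Locale.forLanguageTag("tr")).equals("\u0131")); // true: dotless i
    }
}
```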

Segmentation: word, line & sentence
§ Fast state-table implementation
§ Customizable
  – Rule-based – customizable at runtime
  – Special customizations, e.g. Thai
§ Recent Additions:
  – Uses new UText API
    • Discontinuous text
    • Buffering
    • Usable with UTF-8, UTF-16 or UTF-32
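Word segmentation follows a simple iterate-over-boundaries pattern. The sketch assumes java.text.BreakIterator; per the deck's drop-in claim, importing com.ibm.icu.text.BreakIterator instead should usually be the only change needed for ICU4J, though that is not verified here. The filter that keeps only segments starting with a letter or digit is an illustrative choice, not part of either API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SegmentationSketch {
    /** Splits text at word boundaries and keeps segments that start with a letter or digit. */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Punctuation and whitespace are segments too; skip them
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```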

Unicode Regular Expressions
§ Full Regex Implementation
  – C/C++ only: Java 1.4 has own package (though not as powerful)
§ All Unicode 4.1 Properties
  – Supported through UnicodeSet
§ Good performance
  – Competitive with non-Unicode regex

Complex-text layout engine
§ Glyph processing, positioning & adjustment
  – Ligature substitution, contextual forms, kerning, accent placement, bidi scripts, etc.
§ Support for:
  – Information for drawing
  – Caret Display
  – Hit Testing
  – Selection Highlighting
  – Caret Movement
  – Layout Metrics
  – Line Break
  – Canonical Equivalence: a + ´ or á
§ Recent Additions:
  – Support for more complex scripts

References
§ ICU main site:
  – http://www.ibm.com/software/globalization/icu/
  – Links to
    • Download ICU
    • User Guide, Technical FAQ, Support, Bug Reports, Demonstrations
§ ICU support site:
  – http://icu.sourceforge.net/
§ Unicode Consortium
  – http://www.unicode.org/
    • Unicode glossary, Unicode character database

Questions and Answers