
Getting Started with ICU
Vladimir Weinstein, Eric Mader
26th Internationalization and Unicode Conference, San Jose, September 2004

Agenda
§ Getting & setting up ICU4C
§ Using conversion engine
§ Using break iterator engine
§ Getting & setting up ICU4J
§ Using collation engine
§ Using message formats

Getting ICU4C
§ http://oss.software.ibm.com/icu/
§ Get the latest release
§ Get the binary package
§ Source download for modifying build options
§ CVS for bleeding edge:
    :pserver:anoncvs@oss.software.ibm.com:/usr/cvs/icu

Setting up ICU4C
§ Unpack binaries
§ If you need to build from source
    – Windows:
        • MSVC .NET 2003 project,
        • Cygwin + MSVC 6,
        • just Cygwin
    – Unix: runConfigureICU
        • make install
        • make check

Testing ICU4C
§ Windows - run: cintltst, intltest, iotest
§ Unix - make check (again)
§ See it for yourself:

    #include <stdio.h>
    #include "unicode/utypes.h"
    #include "unicode/ures.h"

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;
        /* opening the root resource bundle proves that ICU can find its data */
        UResourceBundle *res = ures_open(NULL, "", &status);
        if(U_SUCCESS(status)) {
            printf("everything is OK\n");
        } else {
            printf("error %s opening resource\n", u_errorName(status));
        }
        ures_close(res);
        return 0;
    }

Conversion Engine - Opening
§ ICU4C uses open/use/close paradigm
§ Open a converter:

    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open(encoding, &status);
    if(U_FAILURE(status)) {
        /* process the error situation, die gracefully */
    }

§ Almost all APIs use UErrorCode for status
§ Check the error code!

What Converters are Available
§ ucnv_countAvailable() – get the number of available converters
§ ucnv_getAvailableName() – get the name of a particular converter
§ Many frameworks allow this kind of examination
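For comparison, the same enumerate-then-open pattern can be sketched in Java. This is only an illustration: it uses the JDK's java.nio.charset classes rather than ICU, and the class name ListConverters and the helper open() are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.SortedMap;

final class ListConverters {
    /** Opens a converter by name; returns null if the name is malformed or unknown. */
    static Charset open(String encoding) {
        try {
            return Charset.forName(encoding);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null; // plays the role of a failing UErrorCode in the C API
        }
    }

    public static void main(String[] args) {
        // Counterpart of ucnv_countAvailable() plus a loop over the names.
        SortedMap<String, Charset> available = Charset.availableCharsets();
        System.out.println(available.size() + " converters available");
        for (String name : available.keySet()) {
            System.out.println(name);
        }
        System.out.println(open("UTF-8") != null ? "opened UTF-8" : "UTF-8 missing");
    }
}
```

As in the C API, the caller has to check the result of the open step before using the converter.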

Converting Text Chunk by Chunk

    char buffer[DEFAULT_BUFFER_SIZE];
    char *bufP = buffer;
    int32_t len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                                  source, sourceLen, &status);
    if(U_FAILURE(status)) {
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            /* len now holds the required size: allocate and convert again */
            status = U_ZERO_ERROR;
            bufP = (char *)malloc((len + 1) * sizeof(char));
            len = ucnv_fromUChars(cnv, bufP, len + 1,
                                  source, sourceLen, &status);
        } else {
            /* other error, die gracefully */
        }
    }
    /* do interesting stuff with the converted text */
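The "try a fixed buffer, grow on overflow" idea can also be sketched in Java. This is an assumption-laden illustration using the JDK's CharsetEncoder, not ICU; the class ChunkEncode and its deliberately tiny DEFAULT_BUFFER_SIZE are hypothetical. Unlike ucnv_fromUChars, the JDK encoder does not report the required length, so the retry allocates a worst-case buffer instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ChunkEncode {
    static final int DEFAULT_BUFFER_SIZE = 8; // small on purpose so the retry path runs

    static byte[] encode(Charset charset, String text) {
        CharsetEncoder enc = charset.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(DEFAULT_BUFFER_SIZE);
        CoderResult result = encodeAll(enc, text, out);
        if (result.isOverflow()) {
            // Buffer too small: start over with a worst-case sized buffer.
            int needed = (int) Math.ceil(text.length() * (double) enc.maxBytesPerChar());
            out = ByteBuffer.allocate(needed);
            result = encodeAll(enc, text, out);
        }
        if (!result.isUnderflow()) {
            throw new IllegalArgumentException("cannot encode text: " + result);
        }
        out.flip();
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    /** Encodes the whole string in one call and flushes if that succeeded. */
    private static CoderResult encodeAll(CharsetEncoder enc, String text, ByteBuffer out) {
        enc.reset();
        CoderResult r = enc.encode(CharBuffer.wrap(text), out, true);
        return r.isUnderflow() ? enc.flush(out) : r;
    }

    public static void main(String[] args) {
        byte[] bytes = encode(StandardCharsets.UTF_8, "longer than eight bytes");
        System.out.println(bytes.length + " bytes");
    }
}
```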

Converting Text Character by Character

    UChar32 result;
    const char *source = start;
    const char *sourceLimit = start + len;
    while(source < sourceLimit) {
        /* each call consumes the bytes of one character and advances source */
        result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
        if(U_FAILURE(status)) {
            /* die gracefully */
        }
        /* do interesting stuff with the converted text */
    }

§ Works only from code page to Unicode

Converting Text Piece by Piece

    while((!feof(f)) &&
          ((count=fread(inBuf, 1, BUFFER_SIZE, f)) > 0)) {
        source = inBuf;
        sourceLimit = inBuf + count;
        do {
            target = uBuf;
            targetLimit = uBuf + uBufSize;
            ucnv_toUnicode(conv, &target, targetLimit,
                           &source, sourceLimit, NULL,
                           feof(f)?TRUE:FALSE, /* pass 'flush' when eof */
                           /* is true (when no more data will come) */
                           &status);
            if(status == U_BUFFER_OVERFLOW_ERROR) {
                // simply ran out of space – we'll reset the
                // target ptr the next time through the loop.
                status = U_ZERO_ERROR;
            } else {
                // Check other errors here and act appropriately
            }
            text.append(uBuf, target-uBuf);
            count += target-uBuf;
        } while (source < sourceLimit); // while simply out of space
    }
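The piece-by-piece loop has a close analogue in Java, sketched below under stated assumptions: it uses the JDK's CharsetDecoder instead of ICU, and the class name StreamDecode, the bufferSize parameter, and the choice to replace malformed input are all decisions made for this illustration. The endOfInput argument plays the role of the 'flush' flag, and compact() keeps the bytes of a character that was split across two reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StreamDecode {
    static String decode(InputStream in, Charset charset, int bufferSize) {
        if (bufferSize < 4) {
            throw new IllegalArgumentException("buffer must hold at least one character");
        }
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer inBuf = ByteBuffer.allocate(bufferSize);
        CharBuffer uBuf = CharBuffer.allocate(bufferSize);
        StringBuilder text = new StringBuilder();
        try {
            boolean eof = false;
            while (!eof) {
                int count = in.read(inBuf.array(), inBuf.position(), inBuf.remaining());
                if (count < 0) {
                    eof = true; // no more data will come
                } else {
                    inBuf.position(inBuf.position() + count);
                }
                inBuf.flip();
                CoderResult result;
                do {
                    result = decoder.decode(inBuf, uBuf, eof);
                    drain(uBuf, text);
                } while (result.isOverflow()); // simply ran out of target space
                inBuf.compact(); // keep an incomplete trailing character for the next read
            }
            while (decoder.flush(uBuf).isOverflow()) {
                drain(uBuf, text);
            }
            drain(uBuf, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** Moves the decoded characters into the result and empties the buffer. */
    private static void drain(CharBuffer buf, StringBuilder text) {
        buf.flip();
        text.append(buf);
        buf.clear();
    }

    public static void main(String[] args) {
        byte[] data = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(new ByteArrayInputStream(data), StandardCharsets.UTF_8, 4));
    }
}
```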

Clean up!
§ Whatever is opened needs to be closed
§ Converters use ucnv_close
§ Sample uses conversion to convert code page data from a file

Break Iteration - Introduction
§ Four types of boundaries:
    – Character, word, line, sentence
§ Points to a boundary between two characters
§ Index of character following the boundary
§ Use current() to get the boundary
§ Use first() to set iterator to start of text
§ Use last() to set iterator to end of text

Break Iteration - Navigation
§ Use next() to move to next boundary
§ Use previous() to move to previous boundary
§ Both return DONE if the iterator can't move any further

Break Iteration – Checking a position
§ Use isBoundary() to see if position is boundary
§ Use preceding() to find the boundary before a position
§ Use following() to find the boundary after a position

Break Iteration - Opening
§ Use the factory methods:

    Locale locale = …; // locale to use for break iterators
    UErrorCode status = U_ZERO_ERROR;

    BreakIterator *characterIterator =
        BreakIterator::createCharacterInstance(locale, status);
    BreakIterator *wordIterator =
        BreakIterator::createWordInstance(locale, status);
    BreakIterator *lineIterator =
        BreakIterator::createLineInstance(locale, status);
    BreakIterator *sentenceIterator =
        BreakIterator::createSentenceInstance(locale, status);

§ Don't forget to check the status!

Set the text
§ We need to tell the iterator what text to use:

    UnicodeString text;
    readFile(file, text);
    wordIterator->setText(text);

§ Reuse iterators by calling setText() again.
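The navigation and position-checking calls can be tried out in a few lines of Java. This is a sketch, not the presenters' sample: it uses the JDK's java.text.BreakIterator (which offers the same first/next/isBoundary/preceding/following calls) so that it runs without the ICU jar, and the class name BoundaryDemo is made up.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BoundaryDemo {
    /** Collects every boundary from first() until next() returns DONE. */
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text); // the same iterator can be reused with new text
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        System.out.println(boundaries(words, "Hello world")); // [0, 5, 6, 11]
        System.out.println(words.isBoundary(2)); // false: offset 2 is inside "Hello"
        System.out.println(words.preceding(2));  // 0
        System.out.println(words.following(2));  // 5
        System.out.println(words.following(5));  // 6: strictly after the offset
    }
}
```

Note that a boundary is reported on both sides of the space, so the space itself counts as a "word" segment; the word-counting slide deals with that.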

Break Iteration - Counting words in a file:

    int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
    {
        UErrorCode status = U_ZERO_ERROR;
        UnicodeString word;
        UnicodeSet letters(UnicodeString("[:letter:]"), status);
        int32_t wordCount = 0;
        int32_t start = wordIterator->first();

        for(int32_t end = wordIterator->next();
            end != BreakIterator::DONE;
            start = end, end = wordIterator->next())
        {
            text.extractBetween(start, end, word);
            /* spaces and punctuation are segments too; count only real words */
            if(letters.containsSome(word)) {
                wordCount += 1;
            }
        }
        return wordCount;
    }
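A Java rendering of the same word-counting loop might look like the following. It is a sketch that assumes the JDK's java.text.BreakIterator and uses Character.isLetter as a stand-in for the [:letter:] set; the class name WordCount is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class WordCount {
    static int countWords(BreakIterator wordIterator, String text) {
        wordIterator.setText(text);
        int wordCount = 0;
        int start = wordIterator.first();
        for (int end = wordIterator.next();
             end != BreakIterator.DONE;
             start = end, end = wordIterator.next()) {
            // Segments made only of spaces, digits, or punctuation are not counted.
            boolean hasLetter = text.substring(start, end)
                    .codePoints()
                    .anyMatch(Character::isLetter);
            if (hasLetter) {
                wordCount += 1;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        System.out.println(countWords(words, "Hello, world! 123 go")); // 3
    }
}
```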

Break Iteration – Breaking lines

    int32_t previousBreak(BreakIterator *breakIterator,
                          UnicodeString &text, int32_t location)
    {
        int32_t len = text.length();

        /* skip over trailing whitespace and control characters */
        while(location < len) {
            UChar c = text[location];
            if(!u_isWhitespace(c) && !u_iscntrl(c)) {
                break;
            }
            location += 1;
        }
        /* preceding() is strict, so ask for the boundary before location + 1 */
        return breakIterator->preceding(location + 1);
    }

Break Iteration – Cleaning up
§ Use delete to delete the iterators

    delete characterIterator;
    delete wordIterator;
    delete lineIterator;
    delete sentenceIterator;

Useful Links
§ Homepage: http://oss.software.ibm.com/icu/
§ API documents: http://oss.software.ibm.com/icu/apiref/index.html
§ User guide: http://oss.software.ibm.com/icu/userguide/

Getting ICU4J
§ Easiest – pick a .jar file from the download section on http://oss.software.ibm.com/icu4j
§ Use the latest version if possible
§ For sources, download the source .jar
§ For bleeding edge, use the latest CVS
    :pserver:anoncvs@oss.software.ibm.com:/usr/cvs/icu4j

Setting up ICU4J
§ Check that you have the appropriate JDK version
§ Try the test code (ICU4J 3.0 or later):

    import com.ibm.icu.util.ULocale;
    import com.ibm.icu.util.UResourceBundle;

    public class TestICU {
        public static void main(String[] args) {
            UResourceBundle resourceBundle =
                UResourceBundle.getBundleInstance(null, ULocale.getDefault());
        }
    }

§ Add ICU's jar to classpath on command line
§ Run the test suite

Building ICU4J
§ Need ant in addition to JDK
§ Use ant to build
§ We also like Eclipse

Collation Engine
§ More on collation in a couple of hours!
§ Used for comparing strings
§ Instantiation:

    ULocale locale = new ULocale("fr");
    Collator coll = Collator.getInstance(locale);
    // do useful things with the collator

§ Lives in com.ibm.icu.text.Collator

String Comparison
§ Works fast
§ You get the result as soon as it is ready
§ Use when you don't need to compare the same strings many times

    int compare(String source, String target);
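A minimal sketch of compare() in use follows. To stay runnable without the ICU jar it assumes the JDK's java.text.Collator, whose compare() has the same shape as the ICU4J method shown above; the class name CollatorSort and the sample names are invented for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CollatorSort {
    /** Sorts the words according to the collation rules of the given locale. */
    static List<String> sort(Locale locale, String... words) {
        Collator coll = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(Arrays.asList(words));
        sorted.sort(coll); // a Collator is a Comparator, so it plugs into sort directly
        return sorted;
    }

    public static void main(String[] args) {
        // Plain code point order would put the accented capital after "Zo\u00e9".
        System.out.println(sort(Locale.FRENCH, "Zo\u00e9", "\u00c9mile", "Anne"));
        Collator coll = Collator.getInstance(Locale.FRENCH);
        System.out.println(coll.compare("\u00c9mile", "Zo\u00e9") < 0);
    }
}
```

Each comparison here walks both strings again, which is why sort keys pay off when the same strings are compared many times.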

Sort Keys
§ Used when multiple comparisons are required
§ Indexes in databases
§ ICU4J has two classes
§ Compare only sort keys generated by the same type of collator

CollationKey class
§ JDK compatible
§ Saves the original string
§ Compare keys with compareTo method
§ Get the bytes with toByteArray method
§ We used CollationKey as a key for a TreeMap structure
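The TreeMap idea can be sketched as follows. This is not the presenters' sample: it assumes the JDK's java.text.CollationKey (the class the ICU4J one is compatible with), and the class name SortKeyIndex and the stored values are made up.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

final class SortKeyIndex {
    /** Builds a TreeMap keyed by sort keys and returns the names in map order. */
    static List<String> sortedNames(Locale locale, String... names) {
        Collator coll = Collator.getInstance(locale);
        // Each string is turned into a key once; the map then only compares keys.
        TreeMap<CollationKey, Integer> index = new TreeMap<>();
        for (String name : names) {
            index.put(coll.getCollationKey(name), name.length());
        }
        List<String> result = new ArrayList<>();
        for (CollationKey key : index.keySet()) {
            result.add(key.getSourceString()); // the key saves the original string
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortedNames(Locale.FRENCH, "Zo\u00e9", "\u00c9mile", "Anne"));
    }
}
```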

RawCollationKey class
§ Does not store the original string
§ Get it by using getRawCollationKey method
§ Mutable class, can be reused
§ Simple and lightweight

Message Format - Introduction
§ Assembles a user message from parts
§ Some parts fixed, some supplied at runtime
§ Order different for different languages:
    – English: My Aunt's pen is on the table.
    – French: The pen of my Aunt is on the table.
§ Pattern string defines how to assemble parts:
    – English: {0}''s {2} is {1}.
    – French: {2} of {0} is {1}.
§ Get pattern string from resource bundle
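To see the argument reordering in action, the two patterns from this slide can be fed to a MessageFormat directly. The sketch below assumes the JDK's java.text.MessageFormat as a stand-in for the ICU4J class and hard-codes the patterns instead of reading them from a resource bundle; the class name AuntPen is hypothetical.

```java
import java.text.MessageFormat;

final class AuntPen {
    /** Argument 0 is the person, 1 the place, 2 the thing, whatever the pattern order. */
    static String render(String pattern, String person, String place, String thing) {
        MessageFormat msgFmt = new MessageFormat(pattern);
        return msgFmt.format(new Object[] {person, place, thing});
    }

    public static void main(String[] args) {
        // The doubled apostrophe in the pattern produces one apostrophe in the output.
        System.out.println(render("{0}''s {2} is {1}.", "My Aunt", "on the table", "pen"));
        // prints: My Aunt's pen is on the table.
        System.out.println(render("{2} of {0} is {1}.", "My Aunt", "on the table", "pen"));
        // prints: pen of My Aunt is on the table.
    }
}
```

The calling code is identical for both languages; only the pattern string changes.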

Message Format - Example

    String person = …; // e.g. "My Aunt"
    String place = …;  // e.g. "on the table"
    String thing = …;  // e.g. "pen"

    String pattern = resourceBundle.getString("personPlaceThing");
    MessageFormat msgFmt = new MessageFormat(pattern);
    Object arguments[] = {person, place, thing};
    String message = msgFmt.format(arguments);

    System.out.println(message);

Message Format – Different data types
§ We can also format other data types, like dates
§ We do this by adding a format type:

    String pattern = "On {0, date} at {0, time} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };

    System.out.println(fmt.format(args));

§ This will output:

    On Jul 17, 2004 at 2:15:08 PM there was a power failure.

Message Format – Format styles
§ Add a format style:

    String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };

    System.out.println(fmt.format(args));

§ This will output:

    On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.

Message Format – Format style details

    Format Type   Format Style   Sample Output
    number        (none)         123,456.789
                  integer        123,457
                  currency       $123,456.79
                  percent        12%
    date          (none)         Jul 17, 2004
                  short          7/17/04
                  medium         Jul 17, 2004
                  long           July 17, 2004
                  full           Saturday, July 17, 2004
    time          (none)         2:15:08 PM
                  short          2:15 PM
                  medium         2:15:08 PM
                  long           2:15:08 PM PDT
                  full           2:15:08 PM PDT

Message Format – No format type
§ If no format type, data formatted like this:

    Data Type   Sample Output
    Number      123,456.789
    Date        7/17/04 2:15 PM
    String      on the table
    others      output of toString() method

Message Format – Counting files
§ Pattern to display number of files:

    There are {1, number, integer} files in {0}.

§ Code to use the pattern:

    String pattern = resourceBundle.getString("fileCount");
    MessageFormat fmt = new MessageFormat(pattern);
    String directoryName = … ;
    int fileCount = … ;
    Object args[] = {directoryName, new Integer(fileCount)};

    System.out.println(fmt.format(args));

§ This will output messages like:

    There are 1,234 files in myDirectory.

Message Format – Problems counting files
§ If there's only one file, we get:

    There are 1 files in myDirectory.

§ Could fix by testing for special case of one file
§ But, some languages need other special cases:
    – Dual forms
    – Different form for no files
    – Etc.

Message Format – Choice format
§ Choice format handles all of this
§ Use special format element:

    There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.

§ Using this pattern with the same code we get:

    There are no files in thisDirectory.
    There is one file in thatDirectory.
    There are 1,234 files in myDirectory.
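The choice pattern can be exercised with a short program. As a sketch it assumes the JDK's java.text.MessageFormat in place of the ICU4J class, embeds the pattern as a constant rather than loading it from a resource bundle, and pins the locale to Locale.US so the thousands separator is predictable; the class name FileCountMessage is made up.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class FileCountMessage {
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String describe(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        // Argument 1 is used twice: once to choose the phrase, once as the number.
        return fmt.format(new Object[] {directoryName, fileCount});
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));    // There are no files in thisDirectory.
        System.out.println(describe("thatDirectory", 1));    // There is one file in thatDirectory.
        System.out.println(describe("myDirectory", 1234));   // There are 1,234 files in myDirectory.
    }
}
```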

Message Format – Choice format patterns
§ Selects a string based on number
§ If string is a format element, process it
§ Splits real line into two or more ranges
§ Range specifiers separated by vertical bar ("|")
§ Lower limit, separator, string
§ Separator indicates type of lower limit:

    Separator   Lower Limit
    #           inclusive
    ≤           inclusive
    <           exclusive

Message Format – Choice pattern details
§ Here's our pattern again:

    There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.

§ First range is [0..1)
    – Really [-∞..1)
§ Second range is [1..1]
§ Third range is (1..∞]
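The three ranges can be probed directly with a ChoiceFormat. The sketch below assumes the JDK's java.text.ChoiceFormat, and it replaces the nested number element with the fixed text "are many files" so that the choice can be tested on its own; the class name ChoiceRanges is hypothetical.

```java
import java.text.ChoiceFormat;

final class ChoiceRanges {
    /** Returns the phrase selected for n by the three ranges of the pattern. */
    static String pick(double n) {
        ChoiceFormat files =
            new ChoiceFormat("0#are no files|1#is one file|1<are many files");
        return files.format(n);
    }

    public static void main(String[] args) {
        System.out.println(pick(-5));   // are no files: the first range extends down to -infinity
        System.out.println(pick(0.5));  // are no files: still below the inclusive limit 1
        System.out.println(pick(1));    // is one file: exactly the second range
        System.out.println(pick(1.5));  // are many files: above the exclusive limit 1
    }
}
```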

Message Format – Other details
§ Format style can be a pattern string
    – Format type number: use DecimalFormat pattern
    – Format type date, time: use SimpleDateFormat pattern
§ Quoting in patterns
    – Enclose special characters in single quotes
    – Use two consecutive single quotes to represent one

    The '{' character, the '#' character and the '' character.
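Both details can be checked with a small sketch. It assumes the JDK's java.text.MessageFormat rather than the ICU4J class, fixes the locale to Locale.US, and uses an invented "Total:" pattern to show a DecimalFormat pattern as the format style; the class name QuotingDemo is made up.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class QuotingDemo {
    /** Formats the quoting example from the slide; it takes no arguments. */
    static String quoted() {
        String pattern = "The '{' character, the '#' character and the '' character.";
        return new MessageFormat(pattern, Locale.US).format(new Object[0]);
    }

    /** Uses a DecimalFormat pattern as the style of a number element. */
    static String total(double amount) {
        MessageFormat fmt = new MessageFormat("Total: {0,number,#,##0.00}", Locale.US);
        return fmt.format(new Object[] {amount});
    }

    public static void main(String[] args) {
        System.out.println(quoted());       // The { character, the # character and the ' character.
        System.out.println(total(1234.5));  // Total: 1,234.50
    }
}
```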

Useful Links
§ Homepage: http://oss.software.ibm.com/icu4j/
§ API documents: http://oss.software.ibm.com/icu4j/doc/
§ User guide: http://oss.software.ibm.com/icu/userguide/