International Components for Unicode
StringPrep: Unicode in Network Protocols
Ram Viswanadha
Globalization Center of Competency, San José, IBM
29th Unicode Conference, March 2006
© 2006 IBM Corporation

Agenda
§ Problem
§ StringPrep
§ Profiles of StringPrep
§ IDNA
§ StringPrep in ICU
§ Demo
29th Unicode Conference, San Francisco, CA, March 2006 © 2005 IBM Corporation

Terminology
§ Domain Name
§ DNS: Domain Name System
§ URL: Uniform Resource Locator
§ NFKC: Normalization Form KC (compatibility decomposition followed by canonical composition), e.g.: ﬃ → ffi. The ﬃ ligature (U+FB03) is decomposed in NFKC, whereas it is left unchanged in NFC.
§ BiDi: Bi-Directional code points
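The NFKC/NFC difference described above can be checked with the JDK's own `java.text.Normalizer`. This is a minimal sketch, not part of the original deck; the class and method names are illustrative, and the JDK normalizes against its bundled Unicode version rather than Unicode 3.2.

```java
import java.text.Normalizer;

class LigatureNormalization {
    /** Canonical composition: leaves compatibility characters alone. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility composition: also replaces compatibility characters. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB03"; // LATIN SMALL LIGATURE FFI
        System.out.println(nfc(ligature).equals(ligature)); // ligature survives NFC
        System.out.println(nfkc(ligature));                 // three ASCII letters
    }
}
```

NFKC turns the single ligature code point into the three letters `ffi`, which is why a protocol that compares identifiers usually wants NFKC rather than NFC.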

Why Internationalize?
§ Users like to use their language/script in
  – domain names
  – URLs
  – e-mail
§ Not everyone can read/write English
§ How to internationalize?
  – Use Unicode

Domain Name: Examples
www.日本平.jp
www.ハンドボールサムズ.com
www.färgbolaget.nu
www.bücher.de
www.brændendekærlighed.com
理容ナカムラ.com
あーるいん.com

Domain Name: Parts
WWW . IBM . COM
[Diagram: in the name WWW.IBM.COM, each dot-delimited part is a Domain Label and each dot is a Label Separator.]

DNS Protocol Requirements
§ Minimal impact on the DNS protocol's interoperability
§ Minimum number of changes
§ Maximum backwards compatibility
§ Deterministic resolution of domain names
§ Single global namespace

Problems
§ Unicode contains a large number of
  – Visually identical characters, e.g.: і → i
  – Confusable characters, e.g.: O → 0 (letter O vs. digit zero)
  – Control codes, e.g.: U+0080 - U+009F
  – Non-ASCII space characters, e.g.: U+00A0
  – Invisible characters, e.g.: U+200B
  – Private Use Characters, e.g.: U+E000 - U+F8FF
  – Punctuation, e.g.: U+002E
  – Symbols, e.g.: U+2097

Example
www.arnaudléhors.com
www.arnaudléhors.com

Example: Contd.
www.arnaudl\u00e9hors.com
www.arnaudle\u0301hors.com
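The two spellings above look identical on screen but differ as code point sequences. A small sketch with the JDK's `java.text.Normalizer` (class and method names are illustrative, not from the deck) shows that a raw comparison fails while a comparison after normalization succeeds:

```java
import java.text.Normalizer;

class AccentEquivalence {
    /** Compares two strings after bringing both to NFKC. */
    static boolean sameAfterNfkc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKC));
    }

    public static void main(String[] args) {
        String composed = "www.arnaudl\u00e9hors.com";    // precomposed e-acute
        String decomposed = "www.arnaudle\u0301hors.com"; // e + combining acute
        System.out.println(composed.equals(decomposed));         // raw comparison
        System.out.println(sameAfterNfkc(composed, decomposed)); // normalized comparison
    }
}
```

Without a normalization step, a name server would treat the two spellings as different names, which conflicts with the requirement for deterministic resolution.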

StringPrep
§ Defined by RFC 3454
§ Framework for preparing Unicode strings
§ Based on Unicode Version 3.2
§ Specifies rules for handling
  – unassigned code points
  – visually similar sequences
  – prohibited code points
  – BiDi code points

StringPrep Tables
§ Unassigned Table
§ Mapping Tables
  – Case mapping, e.g.: \u0041, \u0061
  – Deletion, e.g.: \u00AD, \u200C, \u200D
§ Prohibited Tables, e.g.: LRM, RLM, etc.

StringPrep Algorithm
1. Map
2. Normalize
3. Prohibit
4. Check BiDi
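The four steps can be sketched as a small pipeline. This is a toy illustration under loud assumptions, not the RFC 3454 tables and not ICU's implementation: the deletion set and the prohibited set contain only the examples named on the tables slide, `Character.toLowerCase` stands in for the RFC's case-folding table, the JDK normalizer uses its bundled Unicode version rather than 3.2, and the unassigned-code-point check is omitted. All names are hypothetical.

```java
import java.text.Normalizer;

class ToyStringPrep {
    private static boolean isRightToLeft(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** Step 1, Map: delete some code points, lowercase the rest (toy tables). */
    static String map(String s) {
        StringBuilder out = new StringBuilder();
        for (int cp : s.codePoints().toArray()) {
            if (cp == 0x00AD || cp == 0x200C || cp == 0x200D) {
                continue; // mapped to nothing
            }
            out.appendCodePoint(Character.toLowerCase(cp)); // stand-in for case folding
        }
        return out.toString();
    }

    /** Step 2, Normalize: NFKC. */
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Step 3, Prohibit: reject LRM and RLM (toy prohibited table). */
    static void prohibit(String s) {
        for (int cp : s.codePoints().toArray()) {
            if (cp == 0x200E || cp == 0x200F) {
                throw new IllegalArgumentException(
                    String.format("prohibited code point U+%04X", cp));
            }
        }
    }

    /** Step 4, Check BiDi: a string with right-to-left characters must not
     *  contain left-to-right ones and must start and end right-to-left. */
    static void checkBidi(String s) {
        int[] cps = s.codePoints().toArray();
        boolean hasRtl = false;
        boolean hasLtr = false;
        for (int cp : cps) {
            hasRtl |= isRightToLeft(cp);
            hasLtr |= Character.getDirectionality(cp)
                    == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
        }
        if (!hasRtl) {
            return; // nothing to check
        }
        if (hasLtr || !isRightToLeft(cps[0]) || !isRightToLeft(cps[cps.length - 1])) {
            throw new IllegalArgumentException("BiDi check failed");
        }
    }

    /** Runs the four steps in order and returns the prepared string. */
    static String prepare(String s) {
        String prepared = normalize(map(s));
        prohibit(prepared);
        checkBidi(prepared);
        return prepared;
    }

    /** Convenience wrapper: true if prepare() does not reject the input. */
    static boolean accepts(String s) {
        try {
            prepare(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

For example, `ToyStringPrep.prepare("B\u00AD\u00DCcher")` drops the soft hyphen and lowercases the rest, while a string containing U+200E is rejected in step 3. The order matters: prohibition and the BiDi check run on the mapped and normalized output, not on the raw input.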

NamePrep
§ Defined by RFC 3491
§ Profile
1. Map: Include all code point mappings specified in StringPrep.
2. Normalize: Normalize the output of step 1 according to NFKC.
3. Prohibit: Prohibit, in the output of step 2, all code points specified as prohibited in StringPrep except for the space (U+0020) code point.
4. Check BiDi: Check for bidirectional code points and process according to the rules specified in StringPrep.

Punycode
§ Defined by RFC 3492
§ Algorithm to convert prepared Unicode strings to ASCII Compatible Encoding (ACE)
§ Complete
§ Unique
§ Reversible
§ Preserves case information
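As an illustration of the conversion, here is a compact sketch of the encoding direction, following the encoder pseudocode and parameter values in RFC 3492. It assumes the input label has already been prepared, omits the overflow checks the RFC requires of a production encoder, and does not add the `xn--` ACE prefix (that is IDNA's job). The class name is hypothetical.

```java
class PunycodeSketch {
    // Parameter values from RFC 3492, section 5.
    private static final int BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
    private static final int INITIAL_BIAS = 72, INITIAL_N = 128;

    /** Digit values 0..25 map to a..z, 26..35 map to 0..9. */
    private static char digit(int d) {
        return (char) (d < 26 ? 'a' + d : '0' + (d - 26));
    }

    /** Bias adaptation from RFC 3492, section 6.1. */
    private static int adapt(int delta, int numPoints, boolean first) {
        delta = first ? delta / DAMP : delta / 2;
        delta += delta / numPoints;
        int k = 0;
        while (delta > ((BASE - TMIN) * TMAX) / 2) {
            delta /= BASE - TMIN;
            k += BASE;
        }
        return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
    }

    /** Encodes one label; overflow handling is intentionally left out. */
    static String encode(String label) {
        int[] cps = label.codePoints().toArray();
        StringBuilder out = new StringBuilder();
        for (int cp : cps) {
            if (cp < 0x80) {
                out.append((char) cp); // basic code points are copied literally
            }
        }
        int basic = out.length();
        int handled = basic;
        if (basic > 0) {
            out.append('-'); // delimiter between literal part and encoded deltas
        }
        int n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
        while (handled < cps.length) {
            int m = Integer.MAX_VALUE; // smallest code point not yet handled
            for (int cp : cps) {
                if (cp >= n && cp < m) {
                    m = cp;
                }
            }
            delta += (m - n) * (handled + 1);
            n = m;
            for (int cp : cps) {
                if (cp < n) {
                    delta++;
                }
                if (cp == n) {
                    int q = delta; // emit delta as a variable-length integer
                    for (int k = BASE; ; k += BASE) {
                        int t = k <= bias ? TMIN : (k >= bias + TMAX ? TMAX : k - bias);
                        if (q < t) {
                            break;
                        }
                        out.append(digit(t + (q - t) % (BASE - t)));
                        q = (q - t) / (BASE - t);
                    }
                    out.append(digit(q));
                    bias = adapt(delta, handled + 1, handled == basic);
                    delta = 0;
                    handled++;
                }
            }
            delta++;
            n++;
        }
        return out.toString();
    }
}
```

`PunycodeSketch.encode("b\u00FCcher")` yields `bcher-kva`, matching the ACE form `xn--bcher-kva` of one of the example domains earlier in the deck.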

Internationalized Domain Names in Applications
§ Defined by RFC 3490
§ Prescribes algorithm for using Unicode in DNS
§ NamePrep: Profile of StringPrep for use in DNS
§ Punycode: Algorithm for converting prepared Unicode strings to ACE

IDNA: ToASCII
www.[non-ASCII label garbled in the source].com
ToASCII
www.xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd.com

IDNA: ToUnicode
www.xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd.com
ToUnicode
www.[non-ASCII label garbled in the source].com
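The ToASCII and ToUnicode operations can be tried without ICU, because the JDK ships an RFC 3490 implementation in `java.net.IDN`. The sketch below uses that class rather than the ICU API the talk is about, with one of the example domains from earlier in the deck; the wrapper class name is illustrative.

```java
import java.net.IDN;

class IdnaRoundTrip {
    /** ToASCII applied to every label of a domain name. */
    static String toAscii(String domain) {
        return IDN.toASCII(domain);
    }

    /** ToUnicode applied to every label of a domain name. */
    static String toUnicode(String domain) {
        return IDN.toUnicode(domain);
    }

    public static void main(String[] args) {
        String ace = toAscii("www.b\u00FCcher.de");
        System.out.println(ace);            // ACE form, safe for the DNS wire format
        System.out.println(toUnicode(ace)); // back to the Unicode form
    }
}
```

Only the label containing non-ASCII characters gets the `xn--` prefix; `www` and `de` pass through unchanged.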

IDNA: Details
§ ASCII Full Stop (U+002E)
  – Ideographic Full Stop (U+3002)
  – Full Width Full Stop (U+FF0E)
  – Half Width Ideographic Full Stop (U+FF61)
§ Unassigned code points
§ Letter-Digit-Hyphen (LDH) code points
§ STD3 ASCII Rules
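Two of these details are easy to observe with the JDK's `java.net.IDN` (again a stand-in for the ICU API, with an illustrative wrapper class): the alternative full stops are accepted as label separators, and the `USE_STD3_ASCII_RULES` flag rejects ASCII characters outside the letter-digit-hyphen set. The class also offers an `ALLOW_UNASSIGNED` flag for the unassigned-code-point case, which is not exercised here.

```java
import java.net.IDN;

class IdnaDetails {
    /** ToASCII with default options; all four separators become '.'. */
    static String toAscii(String domain) {
        return IDN.toASCII(domain);
    }

    /** True if the name converts cleanly with STD3 ASCII rules enforced. */
    static boolean passesStd3(String domain) {
        try {
            IDN.toASCII(domain, IDN.USE_STD3_ASCII_RULES);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // non-LDH character or misplaced hyphen
        }
    }

    public static void main(String[] args) {
        // Ideographic and full-width full stops used as label separators.
        System.out.println(toAscii("www\u3002b\u00FCcher\uFF0Ede"));
        System.out.println(passesStd3("my-host.example"));
        System.out.println(passesStd3("my_host.example"));
    }
}
```

Enforcing STD3 is a caller's choice in this API: without the flag, a label such as `my_host` is passed through unchanged.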

News article
[Image of a news article; not preserved in this transcript.]

Network File System (NFS) Version 4 Profiles
§ Defined by RFC 3530
§ nfs4_cs_prep Profile
  – Profile for file and path name strings
§ nfs4_cis_prep Profile
  – Profile for NFS server names
§ nfs4_mixed_prep Profile
  – Profile for strings in the Access Control Entries

XMPP Profiles
§ Defined by RFC 3920
§ ResourcePrep
  – Profile for resource identifiers within XMPP, e.g.: node@domain/resource
§ NodePrep
  – Profile for node identifiers within XMPP, e.g.: node@domain/resource

Other Profiles
§ SASLPrep
  – RFC 4013
  – Profile for usernames and passwords
§ MIB Profile
  – RFC 4011
  – Profile for Management Information Base
§ iSCSI Names
  – RFC 3722
  – Profile for internationalized iSCSI names

StringPrep Service in ICU
§ Data driven
§ Customizable
§ Portable
§ C & Java
§ Procedure for producing a StringPrep profile data file
1. Run filterRFC3454.pl
2. Run gensprep
3. Open the profile

Design Considerations
§ StringPrep profile characteristics:
  – Prescribe a fixed set of tables
  – Normalization On/Off
  – Check BiDi On/Off
  – StringPrep algorithm is fixed
  – Profiles, once defined, are fixed
§ Performance critical

Implementation
§ Accurate Unicode 3.2 Normalization algorithm
§ Access to Unicode 3.2 Character Properties
§ StringPrep algorithm

Sizes
Name      Size
IDNA      21K
NFSCIS    21K
NFSCSI    20K
NFSMXP    14K
NFSMXS    21K
NFSCSS    13K

C
#include "unicode/usprep.h"

/* src, srcLength, dest and destCapacity are supplied by the caller */
UErrorCode status = U_ZERO_ERROR;
UParseError parseError;

/* open the StringPrep profile */
UStringPrepProfile* nameprep = usprep_open("/usr/joe/mydata", "nfscsi", &status);
if(U_FAILURE(status)){
    /* handle the error */
}

/* prepare the string for use according to the rules specified in the profile */
int32_t retLen = usprep_prepare(nameprep, src, srcLength, dest, destCapacity,
                                USPREP_ALLOW_UNASSIGNED, &parseError, &status);

/* close the profile */
usprep_close(nameprep);

Java
// not final: the profile is loaded in the constructor
private static StringPrep nfscsi = null;

public NFSCSIStringPrep(){
    try{
        InputStream nfscsiFile = NFSCSIStringPrep.class.getResourceAsStream("nfscsi.spp");
        nfscsi = new StringPrep(nfscsiFile);
        nfscsiFile.close();
    }catch(IOException e){
        //handle the exception
    }
}

private static byte[] prepare(byte[] src, StringPrep prep)
        throws StringPrepParseException, UnsupportedEncodingException{
    String s = new String(src, "UTF-8");
    UCharacterIterator iter = UCharacterIterator.getInstance(s);
    // run the profile's StringPrep steps on the decoded string
    StringBuffer out = prep.prepare(iter, StringPrep.DEFAULT);
    return out.toString().getBytes("UTF-8");
}

UTrie – BMP Access Diagram
[Diagram: a BMP code point (bits 15..0) is split into an Upper part of UPPER_WIDTH bits and a Lower part of LOWER_WIDTH bits. The Upper part selects an entry in the Index, which points to a block in the Data Array; the Lower part, extracted with LOWER_MASK, selects the data within that block.]
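The lookup in the diagram is a two-stage table. The sketch below shows the general technique only; it is not ICU's UTrie layout. The block width of 5 bits, the class name, and the builder logic in `set` are assumptions made for illustration.

```java
import java.util.Arrays;

class TwoStageTable {
    static final int LOWER_WIDTH = 5;                     // assumed block width
    static final int LOWER_MASK = (1 << LOWER_WIDTH) - 1; // low bits within a block
    static final int BLOCK_SIZE = 1 << LOWER_WIDTH;

    // One index entry per block of BMP code points. 0 means "shared default block".
    private final int[] index = new int[0x10000 >> LOWER_WIDTH];
    // Data array; block 0 holds the default value (0) and is shared.
    private short[] data = new short[BLOCK_SIZE];

    /** Stores a value, giving the code point's block private storage on first write. */
    void set(int bmpCodePoint, short value) {
        int upper = bmpCodePoint >> LOWER_WIDTH;
        if (index[upper] == 0) {
            index[upper] = data.length; // new block starts at the current end
            data = Arrays.copyOf(data, data.length + BLOCK_SIZE);
        }
        data[index[upper] + (bmpCodePoint & LOWER_MASK)] = value;
    }

    /** Two array reads: upper bits pick the block, lower bits pick the entry. */
    short get(int bmpCodePoint) {
        return data[index[bmpCodePoint >> LOWER_WIDTH] + (bmpCodePoint & LOWER_MASK)];
    }
}
```

Because every untouched block points at the same default block, sparse per-code-point data stays small while a lookup remains two array reads, which suits a performance-critical service.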

UTrie – Supplementary Access Diagram
[Diagram, numbered steps 1 to 6: the lead surrogate (110110..) is looked up first and the trie checks whether it has data for the surrogate block. If not, the result is the same as for the surrogate block (no data). If so, the lead surrogate data is combined with the trail surrogate (110111..) into a pseudo code point, which goes through Index + Data to give the final data.]
§ BMP code points access same as with single-index

StringPrep Data Structure
§ Indexes: array that contains size info of trie & mapping table, options, version numbers, etc.
§ UTrie: 16-bit trie word
  – Bit 0: ON: the code point is prohibited
  – Bit 1: ON: the value in the next 14 bits is an index into the mapping table; OFF: the value in the next 14 bits is a delta value from the code point
  – Bits 2..15: the value in these bits is an index into the mapping table or the delta value from the code point
  – Values greater than 0xFFF0 specify the state of the code point
§ Mapping Table: contains the code point(s) that a single code point maps to

Demo
http://www.ibm.com/software/globalization/icu/demo/domain

Conclusion
§ Unicode can be used in network protocols
§ ASCII compatibility can be achieved
§ StringPrep is applicable to all network protocols
§ ICU provides StringPrep services


Q & A