Unicode
Mark Davis
Unicode Consortium President
IBM Chief SW Globalization Architect
2003-09-24

Universal Character Encoding
- Unique number for every character …
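The "unique number" is the character's code point. A minimal sketch, assuming Python 3 and its standard `unicodedata` module (the helper name `describe` is made up for this illustration), prints the code point and formal name of a few characters:

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# Characters from several scripts, all handled by the same two calls.
for ch in "AŽШΔか上각":
    print(ch, describe(ch))
```

The same lookup works for every script; no per-language tables or mode switches are involved.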

Unifies all Languages
- 96 thousand characters, so far
- All characters accessible at the same time, in the same document: A, Ž, Ш, Δ, ﺵ, …, か, 上, 각, …

Lingua Franca for Computers
- Developed & supported by industry leaders:
  - Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, …
- Required by modern standards:
  - XML, HTML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, Perl, etc.
- Implemented in:
  - All modern operating systems, browsers, and other products

International Domain Names
- Approved - Unicode-Based
- Examples:
  - http://Юникод.com
  - http://Βαλκανίων.com
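On the wire, an internationalized name travels through DNS in an ASCII-compatible form. A rough sketch, assuming Python 3's built-in `idna` codec (which implements the 2003 IDNA rules; the two function names are made up for this illustration), converts the first example in both directions:

```python
def to_ascii_hostname(hostname: str) -> bytes:
    """Encode a Unicode hostname to its ASCII-compatible form (hypothetical helper)."""
    return hostname.encode("idna")

def to_unicode_hostname(ace: bytes) -> str:
    """Decode an ASCII-compatible hostname back to Unicode (hypothetical helper)."""
    return ace.decode("idna")

ace = to_ascii_hostname("Юникод.com")
print(ace)                       # an ASCII label starting with "xn--", then ".com"
print(to_unicode_hostname(ace))  # the name comes back lowercased
```

Name preparation case-folds the label, so the round trip yields the lowercase spelling rather than the original capitalization.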

Standard Resources
- www.unicode.org
  - Online Standard
  - Technical Reports
  - FAQs
  - General Information
  - Discussion Forums, Conferences

Programming Resources
- System APIs:
  - Windows, Java, Unix, Oracle, DB2, Sybase, Mac, Linux, …
- Languages:
  - Java, JavaScript, C#, Perl 5.6.0, C, C++, SQL, …
- Cross-platform libraries:
  - ICU, Rosette, …

Stability
- Developers / other standards need absolute stability
- Characters are never moved or deleted
  - Ordering of characters is by collation, not binary order. See UTS #10: Unicode Collation Algorithm
  - Characters may be deprecated (discouraged)
- Characters never change names
  - Annotations are used to clarify usage
- See Unicode Policies
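Because code points never move, binary order is fixed forever, and it is not meant to match what readers expect from a sorted list. The sketch below, in Python 3, contrasts plain code point order with a deliberately crude sort key (`rough_key`, a made-up helper that strips accents and case). It is only a stand-in to show the difference; it is not the UTS #10 algorithm, for which a library such as ICU would be used.

```python
import unicodedata

def rough_key(word: str) -> str:
    """Crude sort key: drop combining marks and case. NOT the UCA (assumption for illustration)."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

words = ["zebra", "Apple", "éclair", "banana"]

print(sorted(words))                 # ['Apple', 'banana', 'zebra', 'éclair']
print(sorted(words, key=rough_key))  # ['Apple', 'banana', 'éclair', 'zebra']
```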

Indic Support in Unicode
- ISCII is the basis for characters and allocation
- Consortium actively engaged with the Indian Government, which is a member
- Welcomes addition of missing characters (e.g. Vedic), clarifications or corrections of usage

Structural Similarities with ISCII
- Within script, layout and contents nearly identical
- Independent + dependent vowels
- Halant model for representing conjuncts / half-forms
  - not directly encoded
  - represented by sequences instead
- Phonetic sequence: order in syllables
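The halant model and phonetic ordering can be seen directly in the stored code points. A small sketch, assuming Python 3 and Devanagari as the example script (the variable names are chosen for this illustration):

```python
import unicodedata

KA = "\u0915"           # DEVANAGARI LETTER KA
HALANT = "\u094D"       # DEVANAGARI SIGN VIRAMA (the halant)
SSA = "\u0937"          # DEVANAGARI LETTER SSA
VOWEL_SIGN_I = "\u093F" # dependent vowel; the independent form is U+0907

# A conjunct has no code point of its own: it is the sequence KA + halant + SSA.
conjunct = KA + HALANT + SSA

# Phonetic order: the consonant is stored first, even though the vowel sign
# is drawn to its left when rendered.
syllable = KA + VOWEL_SIGN_I

for label, text in (("conjunct", conjunct), ("syllable", syllable)):
    print(label, text, [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text])
```

Whether the conjunct shows as a ligature or a half-form is left to the font and layout engine; the stored text stays the same.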

Structural Differences with ISCII
- Unicode is stateless:
  - No shifting to get different scripts
  - Each character has a unique number
- Unicode is uniform:
  - No extension bytes necessary
  - All characters coded in the same space
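Since every script has its own range in one code space, the script of a character follows from its number alone, with no shift state to track. The sketch below, assuming Python 3, also shows the ISCII-derived parallel layout: the letter KA sits at the same offset in each block. The block starts are those of the Unicode code charts; `script_of` and the constant names are made up for this illustration.

```python
import unicodedata

# Start of each script's block in the Unicode code charts.
SCRIPT_BASES = {
    "DEVANAGARI": 0x0900,
    "BENGALI": 0x0980,
    "GURMUKHI": 0x0A00,
    "GUJARATI": 0x0A80,
    "TAMIL": 0x0B80,
}
KA_OFFSET = 0x15  # position of KA within each block

def script_of(ch: str) -> str:
    """Script taken from the first word of the character name (hypothetical helper)."""
    return unicodedata.name(ch).split()[0]

ka_letters = {script: chr(base + KA_OFFSET) for script, base in SCRIPT_BASES.items()}

# One string mixing five scripts; each character identifies itself.
mixed = "".join(ka_letters.values())
for ch in mixed:
    print(f"U+{ord(ch):04X}", script_of(ch), ch)
```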

Additional Characters
- Indian Government is developing proposals for:
  - Additions of missing characters: Vedic
  - Individual characters for certain scripts
  - Annotations and Descriptions

Global Applications now support languages of India
- Companies supporting Indic with Unicode
- OpenType fonts
  - Font support for Indic
- Microsoft Windows
- Java (IBM contributed ICU Indic Layout)
- Linux
- …

Benefits for India
- All documents, anywhere in the world, can have Indic text
- Allows seamless multilingual documents in India
  - including scriptures and minority languages
- Opens up software export market, beyond English
- Connects India to the world

How India Can Contribute
- Effective Communication with the Unicode Consortium
- Provide Resources for Development
  - Descriptions of Usage
  - Descriptions of Character Shaping
  - Transliteration Tables from Script to Script
  - Collation Information
  - OpenType fonts
  - …

What Developers Can Do
- Interwork with existing ISCII systems
- Move to Unicode for future developments
  - Java, Windows, Linux, …

The Future
- The world is moving rapidly to Unicode
- Unicode makes India open to the world
  - The world comes to you, and
  - You go to the world
- You can help

Q&A

Backup Slides

Multiple Forms
- UTF-8: maximal compatibility with 8-bit systems
- UTF-16: good storage, interoperability with Windows/Java
- UTF-32: simplest processing
- Fast, lossless conversion
- See Forms of Unicode
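The three forms encode the same code points in different byte layouts, and converting between them loses nothing. A minimal sketch, assuming Python 3's built-in codecs (the little-endian variants are used here only so that no byte order mark is added to the counts):

```python
# Three characters: ASCII, Devanagari KA, and a supplementary-plane symbol (U+1D11E).
text = "A\u0915\U0001D11E"

utf8 = text.encode("utf-8")       # 1 + 3 + 4 bytes
utf16 = text.encode("utf-16-le")  # 2 + 2 + 4 bytes (surrogate pair for U+1D11E)
utf32 = text.encode("utf-32-le")  # 4 bytes per character

print(len(utf8), len(utf16), len(utf32))  # 8 8 12

# Lossless conversion: every form decodes back to the same string.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
assert utf32.decode("utf-32-le") == text

# Direct conversion from one form to another goes through the code points.
assert utf8.decode("utf-8").encode("utf-16-le") == utf16
```

The ASCII letter stays a single byte in UTF-8, which is what keeps that form compatible with 8-bit systems.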
- Slides: 20