Basics of Unicode (based on a presentation by NRSI, SIL International)

By the Numbers How is text stored in a computer? As a sequence of numbers that represent characters

Encoding The Heart of Data Processing:

Unicode
● Universally agreed
● Every character has its own number
● 100,000+ already defined
● Space for 1,100,000+
● Everyone is using it: Windows, Mac, Linux, etc.
● Supported by current fonts and applications

Unicode Character Set
● Unicode code space: U+0000..U+10FFFF
● 17 planes of 64K (0x10000) code points each
● Plane 0: BMP (Basic Multilingual Plane), U+0000..U+FFFF
● Includes Latin, Ethiopic, Arabic, Hebrew & Greek
● Other planes rarely needed in Africa
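Because each plane holds 0x10000 code points, the plane number is simply the part of the code point above the low 16 bits. A minimal sketch (the helper name `plane_of` is made up for illustration):

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that contains a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Each plane is 0x10000 code points wide, so drop the low 16 bits.
    return code_point >> 16


# Latin A, Ethiopic HA, Arabic BEH, and one character outside the BMP.
for ch in ["A", "\u1200", "\u0628", "\U0001F600"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ord(ch))}")
```

The first three characters land in plane 0 (the BMP); the last one lies in plane 1.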

Unicode: BMP
[Figure: layout of the BMP, showing General Scripts, Symbols, CJK Misc., CJKV Ideographs, Yi, Hangul, the area reserved for surrogate code values, PUA, and Compatibility]

Unicode Design Principles
● Encode Characters – not glyphs, e.g. Arabic contextual forms:
  U+0628 ARABIC LETTER BEH: ﺏ ﺏ
  U+006A LATIN SMALL LETTER J: j j
● Encode Characters – not graphemes; multigraphs are encoded as sequences:
  “ch” ↔ < “c”, “h” >
  “ny” ↔ < “n”, “y” >

Unicode Design Principles: Character Semantics
● General Category: letter, punctuation, number, symbol
● Case: CAPITAL, small, Title
● A = U+0041, a = U+0061
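These properties can be looked up programmatically. A small sketch using Python's standard `unicodedata` module (the helper name `describe` is hypothetical); the two-letter codes are the standard General Category abbreviations, e.g. `Lu` for uppercase letter and `Nd` for decimal digit:

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point, character name, and General Category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"


# A capital, a small, and a titlecase letter, then a digit, punctuation, and a symbol.
for ch in ["A", "a", "\u01C5", "1", ",", "+"]:
    print(describe(ch))
```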

Unicode Design Principles: Unification
● Unify across languages within the same script:
  French é = U+00E9
  e + high tone: é = U+00E9
● Different scripts don't unify characters:
  Latin capital B: B = U+0042
  Cyrillic capital Ve: В = U+0412
● DO NOT MIX AND MATCH ACROSS SCRIPTS
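Look-alike letters from different scripts are different characters, even when a font draws them identically. The sketch below shows this for Latin B (U+0042) and Cyrillic Ve (U+0412). The helper `scripts_in` is a rough, made-up heuristic that reads the script from the first word of each character name; it is not an official script-property lookup.

```python
import unicodedata

latin_b = "\u0042"      # LATIN CAPITAL LETTER B
cyrillic_ve = "\u0412"  # CYRILLIC CAPITAL LETTER VE


def scripts_in(text: str) -> set:
    """Heuristic: use the first word of each letter's Unicode name as its script."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


print(latin_b == cyrillic_ve)          # the two strings are not equal
print(scripts_in(cyrillic_ve + "ob"))  # a word that mixes two scripts
```

A word whose letters come from more than one script is usually a sign that a look-alike was typed by mistake.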

Unicode Design Principles: Logical Order
● Store in reading order, not visual order
● Peace: P e a c e = 0050 0065 0061 0063 0065
● שלם: ש ל ם = 05E9 05DC 05DD
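Listing the stored code points makes the logical order visible: the first letter read is the first one stored, whichever direction the script is displayed in. A minimal sketch (the helper name `code_points` is made up):

```python
def code_points(text: str) -> str:
    """Return the stored code points of a string, in storage order."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(code_points("Peace"))
# Hebrew shin, lamed, final mem, written with escapes to avoid display reordering.
print(code_points("\u05E9\u05DC\u05DD"))
```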

Unicode Design Principles: Dynamic Composition
● Base characters + combining marks
● < “a”, “◌̈” > ⇒ “ä”
● < “n”, “◌ ” > ⇒ “n ”
● < “c”, “◌ ” > ⇒ “c ”
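Dynamic composition can be tried out with Python's `unicodedata` module. The sketch below attaches U+0308 COMBINING DIAERESIS to two base letters. The choice of U+0308 and of the base letter "n" is ours, for illustration only.

```python
import unicodedata

diaeresis = "\u0308"  # COMBINING DIAERESIS

a_seq = "a" + diaeresis  # base character followed by a combining mark
n_seq = "n" + diaeresis

# "a" + diaeresis has a precomposed equivalent (U+00E4), so NFC merges the pair.
print(len(a_seq), len(unicodedata.normalize("NFC", a_seq)))

# "n" + diaeresis has no precomposed equivalent, so it stays a two-character sequence.
print(len(n_seq), len(unicodedata.normalize("NFC", n_seq)))
```

The second case is the reason dynamic composition matters: any base letter can take any mark, whether or not a single precomposed code point exists.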

Unicode Design Principles: Multiple combining marks
● Any number of combining marks: < “u”, “◌ ” > ⇒ “u ”
● Each combining mark has its own relative order: < “u”, “◌ ” > ⇒ “u ”
● Usually stack vertically outward from base character: < “u”, “◌ ” > ⇒ “ü ”

Unicode Design Principles
● Unicode has a dedicated area for private use, called the PUA (Private Use Area)
● Unicode is extensible, but adding anything takes a long time
● Use what is there
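Private-use characters carry the General Category `Co`, so data can be checked for them before it is exchanged with others. A minimal sketch (the helper name `is_private_use` is made up):

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the character lies in a Private Use Area (General Category Co)."""
    return unicodedata.category(ch) == "Co"


sample = "abc\uE000"  # the last character is the first code point of the BMP PUA
print([f"U+{ord(ch):04X}" for ch in sample if is_private_use(ch)])
```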

Unicode Normalization: Canonical Equivalence
● Different ways of storing the same thing:
  é = < “é” > = U+00E9
  é = < “e”, “◌́” > = U+0065 U+0301
● U+00E9 ≡ U+0065 U+0301
● To be treated as identical in all respects
● Not all programs do this

Unicode Normalization: Normal Forms
● NFC (Normalization Form C, composed): é = < “é” > = U+00E9; sequences reduced to minimal length
● NFD (Normalization Form D, decomposed): é = < “e”, “◌́” > = U+0065 U+0301; sequences expanded to maximal length
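A plain string comparison sees the two spellings of é as different; normalizing both sides to the same form first makes them compare equal. A sketch using `unicodedata.normalize` (the helper name `canonically_equal` is made up, and NFD is an arbitrary choice; NFC would work equally well):

```python
import unicodedata

composed = "\u00E9"     # é as one code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(composed == decomposed)                  # raw comparison
print(canonically_equal(composed, decomposed)) # comparison after normalization
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```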

Unicode Normalization: Canonical Combining Order
● In which order should non-interacting marks be stored? < “u”, “◌ ”, “◌ ” > ⇒ “u ”
● Important for comparison and sorting
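Normalization settles the order by sorting adjacent marks on their canonical combining class. In the sketch below, the marks U+0301 (acute, above the letter) and U+0323 (dot below) are our own choice of example; they do not interact, so both typing orders normalize to the same sequence.

```python
import unicodedata

acute = "\u0301"      # COMBINING ACUTE ACCENT, attaches above
dot_below = "\u0323"  # COMBINING DOT BELOW, attaches below

typed_one_way = "u" + acute + dot_below
typed_other_way = "u" + dot_below + acute

# The canonical combining class decides which mark is stored first.
print(unicodedata.combining(acute), unicodedata.combining(dot_below))

nfd_one = unicodedata.normalize("NFD", typed_one_way)
nfd_other = unicodedata.normalize("NFD", typed_other_way)
print(typed_one_way == typed_other_way)  # raw strings differ
print(nfd_one == nfd_other)              # normalized strings agree
```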

Unicode Storage
● Ways of storing Unicode code points:
● UTF-32 – single 32-bit number
● UTF-16 – single 16-bit number for BMP, pairs for higher values
● UTF-8 – single byte for ASCII; sequences of 'upper ASCII' bytes (values 0x80 and above) for everything else
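The storage cost of one character in each form can be checked with Python's built-in codecs. A minimal sketch (the helper name `encoded_sizes` is made up; the big-endian codec variants are used so that no BOM is counted):

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


# ASCII, Latin-1 range, Ethiopic (BMP), and a character outside the BMP.
for ch in ["A", "\u00E9", "\u1200", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```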

Unicode Storage: BOM (Byte Order Mark)
● U+FEFF: Zero Width No-Break Space
● U+FFFE: Undefined (a noncharacter)
● Identifies encoding scheme:
  UTF-8: 0xEF 0xBB 0xBF ("ï»¿")
  UTF-16 BE: 0xFE 0xFF
  UTF-16 LE: 0xFF 0xFE
  UTF-32 BE: 0x00 0x00 0xFE 0xFF
  UTF-32 LE: 0xFF 0xFE 0x00 0x00
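The byte patterns above are enough to sniff the encoding scheme of a file that starts with a BOM. A minimal sketch using the BOM constants from Python's `codecs` module (the function name `sniff_bom` is made up). The UTF-32 patterns are tested first because the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.

```python
import codecs

# Order matters: check the 4-byte marks before the 2-byte marks.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]


def sniff_bom(data: bytes):
    """Return the encoding scheme named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"hello"))
```

A file without a BOM returns `None`; the encoding then has to come from elsewhere (metadata, convention, or guessing).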

Unicode Storage: UTF-8
● Used for file storage
● Plays well with old 8-bit applications
● Bit structures:
  0000-007F: 0xxxxxxx
  0080-07FF: 110xxxxx 10xxxxxx
  0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
  10000-10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
● BOM is redundant, but some files still get it.
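The bit structures translate directly into shifts and masks. The hand-written encoder below is only a teaching sketch (the function name `utf8_encode` is made up, and it does not reject surrogate values); in practice `str.encode("utf-8")` does this work.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by filling the x bits of each pattern."""
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("outside the Unicode code space")


# Cross-check against Python's built-in encoder.
for cp in [0x41, 0xE9, 0x1200, 0x1F600]:
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))
```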

Unicode Storage: UTF-16
● Used for pure Unicode storage
● Good average storage performance and speed
● Bit structures:
  0000-FFFF: xxxxxxxx xxxxxxxx
  10000-10FFFF: 110110xx xxxxxxxx 110111xx xxxxxxxx
  where the 20 x bits are USV – 0x10000 (USV = Unicode scalar value)
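For code points above U+FFFF, the pair of 16-bit values is computed by subtracting 0x10000 and splitting the remaining 20 bits in half. A minimal sketch (the function name `to_surrogates` is made up):

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000                # 20 significant bits remain
    high = 0xD800 | (offset >> 10)       # 110110xx xxxxxxxx: top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # 110111xx xxxxxxxx: bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")
# Cross-check against Python's built-in big-endian UTF-16 encoder.
print("\U0001F600".encode("utf-16-be").hex(" "))
```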

Unicode Storage
● UTF-16 BE: store most significant byte first
● UTF-16 LE: store least significant byte first
● Use BOM to resolve
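The same character produces mirrored bytes under the two byte orders, and a leading BOM tells a decoder which one it is reading. A short sketch with Python's built-in codecs (U+1200 is just an example character):

```python
import codecs

ch = "\u1200"  # ETHIOPIC SYLLABLE HA

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first
print(big.hex(" "), "|", little.hex(" "))

# With a BOM in front, the generic "utf-16" codec picks the right byte order itself.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(from_big == from_little == ch)
```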

Questions?