Globalisation & Computer systems
Week 5/6
- Character representation
- ASCII and code pages
- UNICODE

Representation
- bits and bytes
- characters
- code points
- glyphs
- fonts
- standardization

Representation
- What is a bit?
  - ‘a binary digit’, i.e. either 0 or 1
- What is a byte?
  - ‘the fixed no. of bits that can be treated as a unit by the computer hardware’
  - A byte can be used to express a character such as “A”

Representation
- ASCII:
  - American Standard Code for Information Interchange
  - A standard character encoding system
  - The bytes were originally 7-bit
  - Given this, how many bit patterns?
  - Each pattern maps onto a decimal code point, and that maps onto a character
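The question on this slide can be checked in a few lines. The sketch below uses Python purely for illustration (it is not part of the original slides). It counts the patterns a 7-bit unit allows and follows one character through the chain character -> decimal code point -> bit pattern.

```python
# Number of distinct bit patterns available in a 7-bit unit.
patterns_7bit = 2 ** 7

# Character -> decimal code point -> 7-bit pattern, and back again.
code_point = ord("A")
bit_pattern = format(code_point, "07b")
round_trip = chr(int(bit_pattern, 2))

print(patterns_7bit, code_point, bit_pattern, round_trip)
```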

Representation
- ASCII:
  - The problem with 7-bit bytes…
    - What about French? la tête
    - What about Greek? κεφαλη
  - Extend ASCII to 8-bit bytes
    - ISO (International Organization for Standardization)
    - Now 256 bit-patterns
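A minimal sketch of the problem, assuming Python's built-in "ascii" codec as a stand-in for the 7-bit standard. The helper name `fits_in_ascii` is made up for this example.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text has a 7-bit ASCII code point."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# An unaccented English word fits; the French and Greek examples do not.
for sample in ["head", "la tête", "κεφαλη"]:
    print(sample, fits_in_ascii(sample))
```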

Representation
- Extended ASCII:
  - With 8-bit bytes you get 256 bit-patterns
  - For consistency, the first 128 code points remain the same as in 7-bit ASCII
  - The next 128 are used for a range of languages
  - For each language, you need an interpretation of these 128 code points
  - The encoding is handled by a code page

Representation
- What about Chinese?
  - Thousands of characters – 256 bit-patterns clearly not enough
  - Make the units bigger…
  - 16 bits (two 8-bit bytes) per character gives 65536 bit-patterns
  - UNICODE

Representation
- Glyphs
  - the pictures used to represent a given character; many to one:
  - The character “A” -> A A A A A (each drawn in a different typeface)
- Fonts
  - the collection, or ‘picture gallery’, of glyphs

Representation
- Extended ASCII:
  - For code point 154:
    - CP_EASTEUROPE (code page 1250): š
    - CP_RUSSIAN (code page 1251): љ
  - What about code point 65 for these two code pages?
  - Now represent your names with your own orthographies in mind, using the code pages
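The exercise can be tried out with Python's built-in `cp1250` and `cp1251` codecs, which correspond to code pages 1250 and 1251. This is an illustrative sketch; the helper name `interpret` is an assumption of this example. The same 8-bit value is decoded under both code pages: values below 128 come out the same, values above 127 depend on the code page.

```python
def interpret(code_point: int, code_page: str) -> str:
    """Decode a single 8-bit value using the named code page."""
    return bytes([code_point]).decode(code_page)


for code_page in ("cp1250", "cp1251"):
    # 154 lies in the language-specific upper half; 65 lies in the shared lower half.
    print(code_page, interpret(154, code_page), interpret(65, code_page))
```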

Representation
- Code pages in VB

    Public Enum ValidCharsets
        ANSI_CHARSET = 0
        GREEK_CHARSET = 161
        THAI_CHARSET = 222
    End Enum

    Private Sub Form_Load()
        ' Build a font whose character set is Greek
        Dim X As New StdFont
        X.Charset = GREEK_CHARSET   ' 161
        X.Bold = True
        X.Size = 8
        X.Name = "Times New Roman"

        ' Apply the font to the form and its controls
        Set frmTest.Font = X
        Set frmTest.Label1.Font = X
        Set frmTest.Text1.Font = X

        ' Code points 181, 225 and 226, displayed using the Greek character set
        frmTest.Label1.Caption = Chr(181) + Chr(225) + Chr(226)
        frmTest.Text1.Text = Chr(181) + Chr(225) + Chr(226)
    End Sub

Representation and UNICODE
- What about Chinese?
  - Thousands of characters – 256 bit-patterns clearly not enough
  - Make the units bigger…
  - 16 bits (two 8-bit bytes) per character gives 65536 bit-patterns
  - UNICODE

UNICODE – design principles
- Reference: The Unicode Standard, Version 3.0. 2000. Online: http://www.unicode.org/unicode/uni2book/

UNICODE – design principles
- Principle 1: 16-bit character codes
  - For code pages, characters share 8-bit code points – which character is meant is determined by interpretation
  - For UNICODE, each character is assigned a unique code point
  - 65536 code values available
    - Byte 1: 256 values × Byte 2: 256 values
  - 63486 for character representation; the remaining 2048 are reserved for extended codes built from pairs of 16-bit values (32 bits in total)
    - This gives a further 1,048,544 code values to cover all languages
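The arithmetic behind this slide can be sketched as follows. Two details are not on the slide and come from the Unicode Standard: the reserved block is taken to be U+D800 to U+DFFF, split into two halves of 1024 values each, and U+1D11E is used as an example character beyond the 16-bit range. The helper name `utf16_code_values` is made up for this example. The raw product of the two halves is 1,048,576, slightly more than the 1,048,544 quoted on the slide, presumably because a small number of values are set aside.

```python
CODE_VALUES_16BIT = 2 ** 16               # 256 x 256
RESERVED_FOR_PAIRS = 0xDFFF - 0xD800 + 1  # assumed reserved range, see lead-in
PAIR_COMBINATIONS = 1024 * 1024           # first half x second half


def utf16_code_values(ch: str) -> list:
    """Return the 16-bit code values (as hex strings) that encode ch in UTF-16."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


print(CODE_VALUES_16BIT, RESERVED_FOR_PAIRS, PAIR_COMBINATIONS)
print(utf16_code_values("A"))           # fits in one 16-bit value
print(utf16_code_values("\U0001D11E"))  # needs a pair from the reserved block
```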

UNICODE – design principles
- Principle 2: allocation of code space
  - General scripts area: alphabetic
  - CJK Ideographs – 27484 ideographs
  - Hangul syllables – 11172 Korean Hangul syllables
  - 1st 128 code points for Latin
  - Punctuation symbols grouped together

UNICODE – design principles
- Principle 3: efficiency
  - All characters have equal status, i.e. no escape characters
  - Characters of a common script are grouped together as far as possible
  - Common punctuation is shared

Design principles
- Principle 4: logical and display order
  - Logical order: how the code is ordered in memory
    - follows the time sequence of input
    - …and ‘logically’ that is L-R
  - Dynamically composed characters: the base character is ordered ‘before’, i.e. to the left of, the modifying character

Design principles
- Principle 5: plain text and rich text
  - Unicode encodes unformatted plain text, where the rendering aim is legibility only
  - Formatting: extra data, giving rich text
  - To preserve plain text requirements?
    - Have layers of plain text representing characters and how they are formatted
    - Use mark-up languages: content + tags

Design principles
- Principle 6: unification
  - Share characters where you can:
    - Mixed writing systems
    - Ideographs common to CJK
    - Punctuation

Character semantics
- Character name
- Representative glyph
- Properties
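As an illustration, Python's standard `unicodedata` module exposes the character name and several properties. Note the assumption: the module follows whichever Unicode version ships with the interpreter, which is much newer than the Version 3.0 cited in these slides, although the values for the characters used here are stable.

```python
import unicodedata

ch = "A"
code_point = f"U+{ord(ch):04X}"       # code point in the usual U+ notation
name = unicodedata.name(ch)           # the character name
category = unicodedata.category(ch)   # one of its properties

print(code_point, name, category)
```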

Property 1: Case
- A letter in the alphabet has several variants
  - UPPERCASE variant
  - lowercase variant
- Five scripts which have case:
  - Latin, Greek, Cyrillic, Armenian, archaic Georgian
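A small sketch of the case property, using Python's `str.upper`. The sample letters are chosen for this example and are not from the slides. A CJK ideograph is included to show a script without case.

```python
# One lowercase letter from four of the cased scripts named on the slide.
samples = {"Latin": "a", "Greek": "σ", "Cyrillic": "д", "Armenian": "ա"}
upper = {script: ch.upper() for script, ch in samples.items()}

# A character from a script without case maps to itself.
ideograph = "中"
ideograph_upper = ideograph.upper()

print(upper, ideograph_upper)
```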

Property 2: Decomposition
- A character which is equivalent to one or more other characters
  - Š = S + ˇ
  - U+0160 (Latin Ext.-A) = U+0053 (Basic Latin) + U+030C (Combining Diacritical Marks)
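The Š example can be reproduced with Python's `unicodedata` module, shown here as an illustrative sketch. `decomposition` returns the mapping recorded for the character, and `normalize` applies it ("NFD") or reverses it ("NFC").

```python
import unicodedata

composed = "\u0160"  # Š as a single code point

mapping = unicodedata.decomposition(composed)          # code points it decomposes to
decomposed = unicodedata.normalize("NFD", composed)    # S followed by combining caron
recomposed = unicodedata.normalize("NFC", decomposed)  # back to the single code point

print(mapping, [f"U+{ord(c):04X}" for c in decomposed], recomposed == composed)
```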

Property 3: Combining class
- Base character
  - i.e. no special graphical combining behaviour when following another character
- Combining character
  - Some characters have shape-change or position behaviour when combining with other characters
  - Non-spacing combining character
    - Does not take up space, e.g. diacritics
  - Spacing combining character
    - Takes up space as though a base character

Property 3: Combining class
- Sequence is a convention:
  - Base character + combining character
  - Symbol: dotted circle, representing the space of the base character, with the combining character positioned relative to the circle
- Stacking of diacritics follows the convention:
  - Move from the base character outwards
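A hedged sketch of how these distinctions surface in Python's `unicodedata` module. The sample characters are chosen for this example, not taken from the slides. `combining` returns the numeric combining class, which is 0 for base characters. `category` returns `Mn` for non-spacing marks and `Mc` for spacing combining marks.

```python
import unicodedata

base = "S"
caron = "\u030c"      # COMBINING CARON, a non-spacing diacritic
dot_below = "\u0323"  # COMBINING DOT BELOW, another non-spacing diacritic
visarga = "\u0903"    # DEVANAGARI SIGN VISARGA, a spacing combining mark

# For each sample: (general category, numeric combining class).
info = {
    unicodedata.name(c): (unicodedata.category(c), unicodedata.combining(c))
    for c in (base, caron, dot_below, visarga)
}

# By convention the base comes first in the sequence, then the combining character.
sequence = base + caron

for name, (cat, ccc) in info.items():
    print(name, cat, ccc)
```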

Property 4: Directionality
- Two directionality types:
  - Left to Right
  - Right to Left (Arabic, Hebrew, Syriac, Thaana)
- Logical sequence: Left to Right
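An illustrative sketch using `unicodedata.bidirectional`. The sample characters and the Hebrew word are assumptions of this example. The module reports finer-grained classes than the two types on the slide: `L` is left to right, and both `R` and `AL` are right to left. The second part shows logical order: the first letter typed is the first one stored, even though it is displayed rightmost.

```python
import unicodedata

samples = ("A", "\u05d0", "\u0627")  # Latin A, Hebrew alef, Arabic alef
bidi = {unicodedata.name(c): unicodedata.bidirectional(c) for c in samples}

# Hebrew word typed in the order shin, lamed, vav, final mem.
word = "\u05e9\u05dc\u05d5\u05dd"
stored_order = [unicodedata.name(c) for c in word]

print(bidi)
print(stored_order)
```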

Property 5: General Category
- The full character space is partitioned into several major categories:
  - Letters
  - Punctuation
  - Symbols
  - Numbers
- Examples of general category codes:
  - Lu: letter, uppercase; Ll: letter, lowercase
  - Nd: number, decimal digit; No: number, other
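The category codes above can be looked up with `unicodedata.category`. This is a sketch with sample characters chosen for the example. The punctuation and symbol codes it prints (`Po`, `Sm`) are not listed on the slide.

```python
import unicodedata

samples = ("A", "a", "5", "\u00bd", "!", "+")  # \u00bd is the fraction one half
categories = {c: unicodedata.category(c) for c in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```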

Property 6: Numeric value
- For characters that represent numbers
  - Decimal digits
  - Fractions
  - Subscripts and superscripts
  - Currency numerators
  - Portion of the CJK ideographs: e.g. U+4E94
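A minimal sketch with `unicodedata.numeric`, which returns the numeric value as a float. Apart from U+4E94, which is the slide's own example, the sample characters are assumptions of this example. `unicodedata.decimal` is stricter and only accepts decimal digits, so it is called here with a default of `None`.

```python
import unicodedata

samples = {
    "decimal digit": "7",
    "fraction": "\u00bd",       # vulgar fraction one half
    "superscript": "\u00b2",    # superscript two
    "CJK ideograph": "\u4e94",  # U+4E94, the example on the slide
}
values = {label: unicodedata.numeric(ch) for label, ch in samples.items()}

# Only true decimal digits have a decimal value; others fall back to the default.
decimal_of_digit = unicodedata.decimal("7", None)
decimal_of_fraction = unicodedata.decimal("\u00bd", None)

print(values, decimal_of_digit, decimal_of_fraction)
```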

Property 7: Mirrored property
- For characters that have equivalent mirror image characters, e.g. ‘(’
- Important for directionality
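The mirrored property can be queried with `unicodedata.mirrored`, which returns 1 for characters that are mirrored in right-to-left text and 0 otherwise. This is an illustrative sketch; the sample characters other than ‘(’ are chosen for the example.

```python
import unicodedata

samples = ("(", ")", "[", "<", "A")
mirrored = {c: unicodedata.mirrored(c) for c in samples}

for ch, flag in mirrored.items():
    print(repr(ch), flag)
```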

Character properties
Summary
1. Case
2. Decomposition
3. Combining class
4. Directionality
5. General category
6. Numeric value
7. Mirrored property