UNICODE Character Sets and Coding Standards Han Unification

  • Slides: 23
UNICODE
• Character Sets and Coding Standards
• Han Unification and ISO 10646
• Encoding Evolution and Unicode
• Programming Unicode

Character Sets
• Character set: a complete group of characters for one or more writing systems.
• Coded character set: a mapping from a set of abstract characters to a set of integers.
• Character encoding scheme: a mapping from a coded character set to a set of octets.
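The three layers can be made concrete with a small C sketch. The helper names and the choice of the character é are illustrative assumptions, not part of the slides. One coded-character-set integer (0xE9, the value of é in both ISO-8859-1 and Unicode) is turned into different octet sequences by two different encoding schemes.

```c
#include <stddef.h>

/* Coded character set: abstract character -> integer.
   The abstract character LATIN SMALL LETTER E WITH ACUTE maps to
   0xE9 in both ISO-8859-1 and Unicode. */
enum { CODE_E_ACUTE = 0xE9 };

/* Encoding scheme 1 (hypothetical helper): one octet per code,
   as ISO-8859-1 does. Returns the number of octets written,
   or 0 if the code does not fit in 8 bits. */
size_t encode_one_octet(unsigned code, unsigned char *out)
{
    if (code > 0xFF) return 0;
    out[0] = (unsigned char)code;
    return 1;
}

/* Encoding scheme 2 (hypothetical helper): two octets per code,
   most significant octet first, as big-endian UCS-2 does.
   Returns the number of octets written, or 0 if the code does
   not fit in 16 bits. */
size_t encode_two_octets_be(unsigned code, unsigned char *out)
{
    if (code > 0xFFFF) return 0;
    out[0] = (unsigned char)(code >> 8);
    out[1] = (unsigned char)(code & 0xFF);
    return 2;
}
```

With these helpers, é becomes the single octet E9 under the first scheme and the two octets 00 E9 under the second, while its integer in the coded character set stays the same.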

Standards
• International standards:
  - RFC 2130
  - ISO
• RFCs (Requests for Comments) come out of the Internet Engineering Task Force (IETF).
• ISO is a worldwide federation of national standards bodies such as ANSI, JISC, KISI, CSA (GB) and ECMA.

Coded character sets (1)
• ASCII (American Standard Code for Information Interchange), or ISO 646, is a 7-bit code that defines:
  - the Latin alphabet;
  - Arabic numerals;
  - punctuation marks;
  - some computer control codes.

Coded character sets (2)
• ISO-8859 standards: ISO-8859-1 to ISO-8859-15.
• ISO-8859-1 (Latin 1) is the most widely used of these; it contains the characters needed for writing Western European and Scandinavian languages. It is also an ANSI standard and is known as the ANSI character set.
• The first 128 positions are identical to ASCII.
• All parts use an 8-bit byte.
• Windows uses a character set based on ISO-8859-1 (code page 1252, which adds printable characters in the 0x80-0x9F range).

Coded character sets (3)
• Asian language character sets
  Traditional Chinese (Taiwan):
  - Big 5: 8-16 bit, 94 x 157 matrix, 13002 characters;
  - CNS 11643 (X 5012): 16 bit, 48000 characters.
  Simplified Chinese (China):
  - GB 2312-80: 8-16 bit, 94 x 94 matrix, 6763 characters;
  - GBK (extended GB 2312-80): 8-16 bit, 94 x 94 matrix, 21003 characters.

Coded character sets (3), continued
  Japanese (Japan):
  - JIS X 0208: 16 bit, 94 x 94 matrix, 6879 characters;
  - JIS X 0212: 16 bit, 5801 supplemental kanji;
  - JIS X 0201: 7 bit, JIS-Roman plus half-width katakana.
  Korean (Korea):
  - KS X 1001 (KSC 5601): 8-16 bit, 94 x 94 matrix, 4888 hanja;
  - KSC 5636: 7 bit, Korean version of ASCII.

Coded character sets (4)
• EUC (Extended UNIX Code)
• ISO 2022 escape sequences
• Windows code pages: 8-16 bit
  - cp 1252 (English): Windows ANSI, based on ISO 8859-1 (Latin 1);
  - cp 1200: Unicode;
  - cp 932 (Japanese): Shift-JIS;
  - cp 936 (Simplified Chinese): close to GB 2312-80;
  - cp 949 (Korean): in KSC 5601-1992 order;
  - cp 950 (Traditional Chinese): the same as Big 5.

Han Unification
• The Han standards consist of JIS X 0208 (6349 kanji), GB 2312-80 (6763 hanzi), CNS 11643 (13951 hanzi) and KSC 5601 (4888 hanja). All characters in these standards must be included in ISO 10646.
• Without unification, more than 100,000 characters would be encoded separately.
• Characters from these standards are unified in ISO 10646; that is, identical characters from two or more of these standards may share the same code point in ISO 10646. Unification reduces the total to approximately 40,000 characters.

Coded character sets (5)
• Unicode: 16 bit, covers most languages, developed by the Unicode Consortium.
• Unicode 1.0 was followed by Unicode 2.0, which contains 38,885 characters. Unicode 3.0 is expected this year.
• Unicode is similar to ISO 10646, the UCS (Universal Character Set) encoding:
  - UCS-2: i.e. Unicode, 16 bit, a fixed-width 2-byte format;
  - UCS-4: also called ISO 10646, 32 bit, a 4-byte format; about 32,000 planes, each with a capacity of about 65,000 characters, for a total of about 2,080 million characters. The first plane is in use (that is Unicode);
  - UTF-7: 7-40 bit, UCS Transformation Format, a Unicode character encoding scheme using 7-bit units;
  - UTF-8: 7-48 bit, UCS Transformation Format, a Unicode character encoding scheme using 8-bit units;
  - UTF-16 (UCS-2E): 16-32 bit.
• Windows NT uses Unicode and therefore has a single character set.
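The "16-32 bit" range given for UTF-16 comes from surrogate pairs: code points in the first plane take one 16-bit unit, and code points in the other planes take two. The sketch below illustrates that rule; the function name utf16_encode is a hypothetical helper, not something taken from the slides.

```c
#include <stddef.h>

/* Encode one code point as UTF-16 code units (hypothetical helper).
   Returns 1 for a first-plane code point, 2 for a surrogate pair,
   and 0 if cp is a surrogate value or lies above 0x10FFFF. */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;                       /* reserved for surrogates */
    if (cp < 0x10000UL) {               /* first plane: one 16-bit unit */
        out[0] = (unsigned short)cp;
        return 1;
    }
    if (cp <= 0x10FFFFUL) {             /* other planes: two 16-bit units */
        unsigned long v = cp - 0x10000UL;            /* 20 significant bits */
        out[0] = (unsigned short)(0xD800UL + (v >> 10));     /* high 10 bits */
        out[1] = (unsigned short)(0xDC00UL + (v & 0x3FFUL)); /* low 10 bits  */
        return 2;
    }
    return 0;
}
```

For example, U+4E2D fits in the single unit 0x4E2D, while U+10000 becomes the pair 0xD800 0xDC00.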

UTF-8
• UCS Transformation Format;
• A variable-width, or multi-byte, encoding format;
• In UTF-8, the standard ASCII characters occupy only one byte; other Unicode characters occupy two or three bytes.

Table: The UTF-8 Encoding
Start Character   End Character   Required Data Bits   Binary Byte Sequence (x = data bits)
\u0000            \u007F          7                    0xxxxxxx
\u0080            \u07FF          11                   110xxxxx 10xxxxxx
\u0800            \uFFFF          16                   1110xxxx 10xxxxxx 10xxxxxx
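The three ranges of the UTF-8 table translate almost directly into code. The following is a minimal sketch, assuming a hypothetical function name utf8_encode_bmp and covering only the 16-bit range that the slide describes.

```c
#include <stddef.h>

/* Encode one code point in the range 0x0000..0xFFFF as UTF-8.
   Returns the number of bytes written to out (1, 2 or 3), or 0
   if cp is above 0xFFFF. This sketch does not reject surrogate
   values (0xD800..0xDFFF). */
size_t utf8_encode_bmp(unsigned long cp, unsigned char out[3])
{
    if (cp <= 0x007FUL) {           /* 7 data bits:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x07FFUL) {           /* 11 data bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFFUL) {           /* 16 data bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

For example, 'A' (0x41) stays the single byte 41, é (0xE9) becomes C3 A9, and the Han character 0x4E2D becomes E4 B8 AD.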

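Encoding Evolution
[Figure: encodings arranged by code width. Width labels: 7 bits, 8 bits, 16 bits. Encodings shown: ASCII, EBCDIC, JIS, GB, KSC, ISO 8859, ISO 2022, PC code pages, Apple code pages, EUC, Shift-JIS, Unicode, ISO 10646 (UCS-2).]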

Programming Unicode
• How does Win32 support it?
• How to write apps for Unicode
• How to be backward compatible

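Unicode in Win32
[Diagram: a Unicode Win32 app and a non-Unicode Win32 app sit on the client side of the Win32 client-server boundary. Calls from the non-Unicode app go through an ANSI to Unicode conversion. The Win32 server and the Windows NT base system are fully Unicode internally.]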

Unicode in Win32
• Separate Unicode data type
  - Wide character: Unicode
  - 8-bit char: ANSI, DBCS
• Parallel Unicode and ANSI APIs
  - Unicode and ANSI window classes
  - Implicit code conversion
• Resources always in Unicode
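The conversion from 8-bit strings to wide characters can be illustrated in portable C. This sketch uses the standard library function mbstowcs as a stand-in for the conversion Windows performs (Win32 code would normally call MultiByteToWideChar instead); the wrapper name narrow_to_wide is hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a narrow (8-bit) string to wide characters using the
   current C locale. cap is the capacity of dst in wide characters.
   Returns the number of wide characters written, not counting the
   terminator, or (size_t)-1 if the input is invalid or the result,
   including its terminator, does not fit in dst. */
size_t narrow_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    size_t n;

    if (cap == 0)
        return (size_t)-1;
    n = mbstowcs(dst, src, cap);
    if (n == (size_t)-1 || n == cap)   /* invalid input, or no room for terminator */
        return (size_t)-1;
    return n;
}
```

Note that the capacity is counted in characters, not bytes; mixing up the two is a common source of bugs when moving between 8-bit and wide strings.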

Programming Unicode
• Basic techniques
  - How to migrate an existing code base
• Special topics
  - Interaction with non-Unicode apps
  - Untyped file system
• Foundation for script/language-specific functionality
• Migration example: Win32

Windows Unicode Programming
[Diagram: a common source builds both a non-Unicode Win32 EXE and a Unicode Win32 EXE. Data exchange between the two goes through a conversion step.]

Generic data types in C
Generic data types (resolve to the Unicode or the ANSI type at compile time):
  TCHAR     wchar_t or char
  LPTSTR    wchar_t * or char *
Explicit data types:
  WCHAR     wchar_t
  CHAR      char
  LPWSTR    wchar_t *
  LPSTR     char *
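The generic types can be pictured as conditional typedefs. The sketch below uses the stand-in names MY_TCHAR, MY_LPTSTR and my_tcslen (all hypothetical) so that it builds without the Windows headers; the real headers define TCHAR and LPTSTR along similar lines.

```c
#include <stddef.h>
#include <wchar.h>

/* Stand-ins for TCHAR and LPTSTR (hypothetical names). */
#ifdef UNICODE
typedef wchar_t MY_TCHAR;     /* Unicode build: wide character */
#else
typedef char MY_TCHAR;        /* ANSI build: 8-bit character */
#endif
typedef MY_TCHAR *MY_LPTSTR;

/* Generic code written once against MY_TCHAR: counts characters,
   not bytes, in either build. */
size_t my_tcslen(const MY_TCHAR *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Compiling the same source with and without UNICODE defined yields the wide and the 8-bit variant of my_tcslen, which is the idea behind building two executables from a common source.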

Macros and literals
• String literals: TEXT("hello"); "hello"; L"hello";
• Numeric equivalence: 'A' = 0x41 and 0x0041 = L'A'
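A simplified picture of how a TEXT-style macro selects the literal form is sketched below. MY_TEXT and MY_TCHAR are hypothetical stand-ins; the real TEXT macro in the Windows headers is defined through an extra level of macro indirection, which this sketch omits.

```c
#include <stddef.h>
#include <wchar.h>

#ifdef UNICODE
typedef wchar_t MY_TCHAR;
#define MY_TEXT(s) L##s       /* "hello" becomes L"hello" */
#else
typedef char MY_TCHAR;
#define MY_TEXT(s) s          /* "hello" stays "hello" */
#endif

/* One generic literal; its element type follows the build mode. */
static const MY_TCHAR greeting[] = MY_TEXT("hello");

/* Size of the literal in bytes: 6 characters (5 letters plus the
   terminator) times the character size of the current build. */
size_t greeting_bytes(void)
{
    return sizeof greeting;
}

/* The numeric equivalence from the slide: the narrow and the wide
   'A' have the same value, only the storage width differs.
   Assumes an ASCII-based execution character set. */
int narrow_and_wide_A_match(void)
{
    return 'A' == 0x41 && L'A' == 0x0041;
}
```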

Dual function prototypes
SetWindowText (HWND, LPTSTR);
SetWindowTextA (HWND, LPSTR);
SetWindowTextW (HWND, LPWSTR);
• Generic API prototypes
• Resolved to explicit prototypes with #ifdef UNICODE
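The same pattern can be reproduced in a few lines of portable C. SetTitleA, SetTitleW and SetTitle are hypothetical functions invented for this sketch; they only mimic the naming scheme of SetWindowTextA and SetWindowTextW and are not part of any Windows API.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Explicit 8-bit ("A") version: copies src into dst.
   cap is the capacity of dst in characters.
   Returns 1 on success, 0 if src does not fit. */
int SetTitleA(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(src);
    if (n + 1 > cap)
        return 0;
    memcpy(dst, src, n + 1);
    return 1;
}

/* Explicit wide ("W") version with the same contract. */
int SetTitleW(wchar_t *dst, size_t cap, const wchar_t *src)
{
    size_t n = wcslen(src);
    if (n + 1 > cap)
        return 0;
    wmemcpy(dst, src, n + 1);
    return 1;
}

/* Generic name, resolved to one explicit function at compile time. */
#ifdef UNICODE
#define SetTitle SetTitleW
#else
#define SetTitle SetTitleA
#endif
```

Callers that use only the generic name and generic types compile unchanged in both modes, while code that must handle a specific encoding can still call the A or W version directly.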

Basic conversion steps
• Use generic data types (TCHAR, LPTSTR) for text
• Use explicit types for BYTE pointers (data buffers)
• Use the TEXT() macro for literal constants
• Adjust pointer arithmetic
• Use generic function prototypes
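"Adjust pointer arithmetic" mostly means keeping character counts and byte counts apart. The sketch below shows the usual idioms; MY_TCHAR, MY_COUNTOF and the two helper functions are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <wchar.h>

#ifdef UNICODE
typedef wchar_t MY_TCHAR;
#else
typedef char MY_TCHAR;
#endif

/* Capacity of a statically sized buffer in characters, not bytes. */
#define MY_COUNTOF(buf) (sizeof(buf) / sizeof((buf)[0]))

/* Byte offset of the character at a given index. Needed when text
   lives inside an untyped byte buffer. */
size_t byte_offset_of_char(size_t index)
{
    return index * sizeof(MY_TCHAR);
}

/* Typed pointer arithmetic: the compiler scales n by
   sizeof(MY_TCHAR), so this skips n characters in either build. */
const MY_TCHAR *skip_chars(const MY_TCHAR *s, size_t n)
{
    return s + n;
}
```

Code that adds a character count to a byte pointer, or passes sizeof(buffer) where a character count is expected, is what typically needs adjusting.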

Conversion metrics
• About 10% of source lines need to be modified by global replace:
  - LPSTR becomes LPTSTR
  - strstr() becomes wcswcs()
• About 2-5% need a small modification:
  - for example, converting between character counts and byte counts with lstrlen(s) and sizeof(TCHAR)
• Less than 1% need a revised algorithm:
  - OpenFile() becomes SearchPath(); CreateFile()
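A typical "small modification" is an allocation that used the string length as a byte count. The sketch below shows the corrected form, assuming hypothetical stand-ins MY_TCHAR, my_tcslen, my_string_bytes and my_tcsdup in place of the Win32 names.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#ifdef UNICODE
typedef wchar_t MY_TCHAR;
#define my_tcslen wcslen
#else
typedef char MY_TCHAR;
#define my_tcslen strlen
#endif

/* Bytes needed to store s, including its terminator. The length
   in characters is multiplied by the size of one character. */
size_t my_string_bytes(const MY_TCHAR *s)
{
    return (my_tcslen(s) + 1) * sizeof(MY_TCHAR);
}

/* Duplicate a string. Returns NULL if allocation fails; the caller
   releases the copy with free(). */
MY_TCHAR *my_tcsdup(const MY_TCHAR *s)
{
    size_t bytes = my_string_bytes(s);
    MY_TCHAR *copy = malloc(bytes);
    if (copy != NULL)
        memcpy(copy, s, bytes);
    return copy;
}
```

The pre-Unicode version of such code would typically allocate strlen(s) + 1 bytes, which is only correct while every character occupies exactly one byte.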

Summary: Migrating a Windows-based Program to Unicode
1. Modify your code to use generic data types.
2. Modify your code to use generic function prototypes.
3. Surround any character or string literal with the TEXT macro.
4. Create generic versions of your data structures.
5. Change your make process.
6. Adjust pointer arithmetic.
7. Check for any code that assumes a character is always 1 byte long.
8. Add Unicode-specific code if necessary.
9. Add code to support special Unicode characters.
10. Determine how using Unicode will affect file I/O.
11. Double-check the way in which you retrieve command line arguments.
12. Debug your port by enabling your compiler's type-checking.