ASCII, Unicode, UTF-8

ASCII: American Standard Code for Information Interchange
• Dates back to the 1960s.
• First character set used to exchange text between computers on the internet.
• 128 (2^7) characters in total.
• ‘A’ is represented by hex 41 (0100 0001).
• ‘a’ is represented by hex 61 (0110 0001).
• ‘ ’ (space) is represented by hex 20 (0010 0000).
• 33 control characters (codes 0-31 and 127) and 95 printable characters (including the space).
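The code values above can be checked with a few lines of Python. This is a minimal sketch; the split into control and printable codes assumes the usual convention that codes 0-31 and DEL (127) are control codes and that the space counts as printable.

```python
# Code values for the three characters named on the slide.
for ch in ["A", "a", " "]:
    code = ord(ch)
    print(repr(ch), f"hex {code:02X}", f"binary {code:08b}")

# Split the 128 ASCII codes into control and printable codes.
# Assumed convention: 0x00-0x1F and 0x7F (DEL) are control codes.
control = [c for c in range(128) if c < 0x20 or c == 0x7F]
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]
print(len(control), "control,", len(printable), "printable")
```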

ASCII Table in Dec and HEX

Demonstration Using Hex Editor Neo
• Use Hex Editor Neo to enter this text, with a TAB (hex 09) after ‘C’ and after ‘F’, and a RETURN (hex 0D 0A) after ‘I’:
• 41 42 43 09 44 45 46 09 47 48 49 0D 0A 4A 4B
• Save it as example.txt and look at it with Notepad or another text editor.
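If a hex editor is not at hand, the same file can be produced from Python. This is a sketch, not part of the original demonstration; it writes example.txt into the system temporary directory (an assumption made so the script runs anywhere) rather than the current folder.

```python
import tempfile
from pathlib import Path

# Hex bytes from the slide: "ABC", TAB, "DEF", TAB, "GHI", CR LF, "JK".
hex_text = "41 42 43 09 44 45 46 09 47 48 49 0D 0A 4A 4B"
data = bytes.fromhex(hex_text)  # fromhex skips the spaces between byte pairs

# Assumed location: the temp directory, so the sketch does not need write
# access to the current folder.
path = Path(tempfile.gettempdir()) / "example.txt"
path.write_bytes(data)

print(data)                               # raw bytes, with \t and \r\n visible
print(path.read_bytes().decode("ascii"))  # the text as an editor would show it
```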

Unicode
• ASCII was English-centric and not adequate for international use.
• In 1988 Joe Becker published a draft proposal for an "international/multilingual text character encoding system". He named it Unicode to signify a "unique, unified, universal encoding".
• Originally, Unicode was 16-bit based and limited to languages and scripts in living use.
• 1991 – Unicode Consortium formed.
• 1992 – Chinese characters encoded.
• 1996 – Unicode 2.0: no longer limited to 16 bits per character.

Unicode
• The Unicode character set started with 7,129 named characters in 1991 and had grown to 137,929 named characters by 2019. The majority of those characters are logograms from Chinese, Japanese, and Korean, such as "U+6728", which refers to "木". It also includes over 1,200 emoji symbols ("U+1F389" = "🎉").
• Note that Unicode is a character set, NOT an encoding scheme.
• "U+6728", which refers to "木", is a way to name the character; it is not a proposal to encode the character as hex 67 28.
• https://www.khanacademy.org/computing/ap-computer-science-principles/x2d2f703b37b450a3:digital-information/x2d2f703b37b450a3:storing-text-in-binary/a/storing-text-in-binary?modal=1
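The difference between a character's name (its code point) and its encoded bytes can be seen directly in Python. This is a small sketch; the UTF-16 line is included only as a contrast and is not discussed on the slide.

```python
ch = "木"

# The code point is the character's Unicode "name".
print(f"U+{ord(ch):04X}")                # U+6728

# The bytes stored on disk depend on the encoding scheme chosen.
print(ch.encode("utf-8").hex(" "))       # e6 9c a8
print(ch.encode("utf-16-be").hex(" "))   # 67 28

# The emoji example from the slide, built from its code point.
party = chr(0x1F389)
print(party, party.encode("utf-8").hex(" "))
```

The same code point yields different byte sequences under different encodings, which is why "U+6728" by itself says nothing about the bytes in a file.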

UTF-8
• UTF stands for Unicode Transformation Format. UTF-8 encodes each character in 1 to 4 blocks, with each block consisting of 8 bits.
• The Unicode Consortium, which develops the Unicode Standard, seeks to encode the existing character sets with its standard Unicode Transformation Format (UTF).
• UTF-8 is now used to encode more than 90% of all web pages.
• Use Firefox. Go to www.fidelity.com and click the menu at the top (3 bars) -> web developer -> page source.
• Also try www.Gordon.edu

UTF-8 At Work - A Demo
• http://www.computewithgrace.com/demo/emoji.html
The HTML code:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<center>
<p style="font-size: 200px">😀</p>
</center>
</body>
</html>

UTF-8 Can Encode Characters Using 1 – 4 Bytes
• ASCII characters have a 7-bit encoding.
• If encoding is limited to 2 bytes (16 bits), then at most 2^16 (= 64K = 65,536) characters can be coded.
• Chinese alone has more than 50K characters, which leaves little room for all other scripts.
• Using 1-4 bytes for encoding is therefore a necessity.

How UTF-8 (1993) Handles Varying-Length Encoding
• https://www.khanacademy.org/computing/ap-computer-science-principles/x2d2f703b37b450a3:digital-information/x2d2f703b37b450a3:storing-text-in-binary/a/storing-text-in-binary?modal=1

Broad Groupings Of UTF-8 Encoding
• 1-byte characters, beginning with 0 – ASCII
• 2-byte characters, beginning with 110 – Spanish, German, Hebrew, Greek, Arabic
• 3-byte characters, beginning with 1110 – Asian languages
• 4-byte characters, beginning with 11110 – emojis, rarely used historical scripts
• In a multi-byte character, every byte after the first (a continuation byte) begins with 10.
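The grouping by leading bits can be sketched as a small function that reads the first byte of a character and reports how many bytes the character occupies. The function name `sequence_length` is a hypothetical helper chosen for illustration, and the 4-byte prefix is taken to be 11110, as in the UTF-8 definition.

```python
def sequence_length(lead_byte: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte (hypothetical helper)."""
    if lead_byte >> 7 == 0b0:       # 0xxxxxxx: ASCII
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: start of a 2-byte character
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte character
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: start of a 4-byte character
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8.
    raise ValueError(f"{lead_byte:08b} cannot start a UTF-8 character")


for text in ["A", "α", "的", "😂"]:
    first = text.encode("utf-8")[0]
    print(text, f"{first:08b}", sequence_length(first))
```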

𓀀 Examples Of UTF-8 Encoding

| Character Name | Character | Alias | UTF-8 Encoding |
| Latin Capital Letter A | A | U+0041 | 41 (0100 0001) |
| Greek Small Letter Alpha | α | U+03B1 | CE B1 (1100 1110 1011 0001) |
| Chinese (https://www.compart.com/en/unicode/U+7684) | 的 | U+7684 | E7 9A 84 (1110 0111 1001 1010 1000 0100) |
| Face With Tears Of Joy (https://www.compart.com/en/unicode/U+1F602) | 😂 | U+1F602 | F0 9F 98 82 (1111 0000 1001 1111 1001 1000 1000 0010) |

Note that a character's alias does not indicate its UTF-8 encoding (except in the case of ASCII characters). That is, U+0041 is a name, like "R2D2".
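The encodings in these examples can be reproduced by hand-packing the code point's bits into the 1- to 4-byte patterns. The following is a minimal sketch, not a full implementation: `utf8_encode` is a hypothetical name, and the function does not reject the surrogate range U+D800 to U+DFFF as a strict encoder would. Each result is compared with Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack a code point into UTF-8 bytes (illustrative sketch, no surrogate check)."""
    if code_point < 0x80:                       # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # up to 16 bits: 1110xxxx + 2 continuation bytes
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # up to 21 bits: 11110xxx + 3 continuation bytes
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0x3F),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    raise ValueError("code point is outside the Unicode range")


for cp in (0x0041, 0x03B1, 0x7684, 0x1F602):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")   # agrees with the built-in encoder
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{cp:04X}", chr(cp), encoded.hex(" ").upper(), bits)
```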

More UTF-8 Examples
• Go to https://www.fileformat.info/unicode/utf8test.htm
• https://www.unicode.org/emoji/

How many characters can be coded in UTF-8?
• What is the maximum possible number of 1-byte UTF-8 characters? Give your answer as a power of 2, and in hex.
• What is the maximum possible number of 2-byte UTF-8 characters? Give your answer as a power of 2, and in hex.
• What is the maximum possible number of 3-byte UTF-8 characters? Give your answer as a power of 2, and in hex.
• What is the maximum possible number of 4-byte UTF-8 characters? Give your answer as a power of 2, and in hex.
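One way to approach these questions is to count the free bits (the x positions) in each byte pattern. The sketch below does exactly that and nothing more: it counts bit patterns only, so it ignores the refinements of the actual standard (overlong forms, the surrogate range, and the U+10FFFF upper limit), all of which lower the number of usable characters.

```python
# Bit layout of each UTF-8 sequence length; "x" marks a free payload bit.
patterns = {
    1: "0xxxxxxx",
    2: "110xxxxx 10xxxxxx",
    3: "1110xxxx 10xxxxxx 10xxxxxx",
    4: "11110xxx 10xxxxxx 10xxxxxx 10xxxxxx",
}

capacity = {}
for n_bytes, pattern in patterns.items():
    free_bits = pattern.count("x")
    capacity[n_bytes] = 2 ** free_bits
    print(f"{n_bytes}-byte: {free_bits} free bits -> 2^{free_bits} = {hex(capacity[n_bytes])}")
```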

In-Class Exercise: What English Word Is Represented?
• 0100 1000 0110 0101 0110 1100 0110 1100 0110 1111
• 0100 1000|0110 0101|0110 1100|0110 1100|0110 1111
• 48 65 6c 6c 6f
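The same steps (group the bits into bytes, convert each byte to hex, look up each ASCII code) can be sketched in Python. The bit string used here is the five-byte sequence that matches the hex line 48 65 6c 6c 6f.

```python
bits = "0100 1000 0110 0101 0110 1100 0110 1100 0110 1111"

# Step 1: pair up the 4-bit groups into 8-bit bytes.
nibbles = bits.split()
byte_strings = [nibbles[i] + nibbles[i + 1] for i in range(0, len(nibbles), 2)]

# Step 2: convert each byte from binary to a number, and show it in hex.
byte_values = [int(b, 2) for b in byte_strings]
print(" ".join(f"{v:02x}" for v in byte_values))   # 48 65 6c 6c 6f

# Step 3: interpret the bytes as ASCII text.
word = bytes(byte_values).decode("ascii")
print(word)
```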

In-Class Exercises
• How many UTF-8 characters are represented by each of the following? If there is an error in the encoding, which bit(s) is (are) in error?
• 01000001 11110000 01011111 010101001010 11010011 01011111
• 11101010 01111000 0101 01110011 01010000
• 11101010 01111100 01010111 10110011 01010000
• 01000000 1001
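A way to check answers to exercises of this kind is to walk the bytes from left to right: read the lead byte's prefix, then confirm that the expected number of continuation bytes (each starting with 10) follows. The sketch below uses a hypothetical helper, `count_utf8_characters`, and its own sample inputs rather than the exercise strings. It checks structure only and does not detect overlong forms or out-of-range code points.

```python
def count_utf8_characters(bit_string: str) -> int:
    """Count characters in space-separated 8-bit groups (structural check only).

    Raises ValueError naming the first group that breaks the UTF-8 layout.
    """
    groups = bit_string.split()
    count = 0
    i = 0
    while i < len(groups):
        lead = groups[i]
        if len(lead) != 8 or set(lead) - {"0", "1"}:
            raise ValueError(f"group {i} is not 8 bits: {lead}")
        if lead.startswith("0"):
            length = 1
        elif lead.startswith("110"):
            length = 2
        elif lead.startswith("1110"):
            length = 3
        elif lead.startswith("11110"):
            length = 4
        else:
            raise ValueError(f"group {i} is not a valid lead byte: {lead}")
        tail = groups[i + 1 : i + length]
        if len(tail) != length - 1:
            raise ValueError(f"sequence starting at group {i} is truncated")
        for offset, cont in enumerate(tail, start=1):
            if len(cont) != 8 or not cont.startswith("10"):
                raise ValueError(f"group {i + offset} should start with 10: {cont}")
        count += 1
        i += length
    return count


# Sample input (not from the exercise): 'A' followed by Greek alpha.
print(count_utf8_characters("01000001 11001110 10110001"))

# Sample with an error: the second byte of a 3-byte character starts with 01.
try:
    count_utf8_characters("11100111 01011010 10000100")
except ValueError as err:
    print("error:", err)
```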