Chapter 3 Data Representation Text Characters Representing Text

  • Slides: 13
Chapter 3 Data Representation Text Characters

Representing Text
  • To represent a text document in digital form, we need to be able to represent every possible character that may appear.
  • There is a finite number of characters to represent, so the general approach is to list them all and assign each a binary string.
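The "list them all and assign each a binary string" approach can be sketched in Python. This is only an illustration under stated assumptions: the five-letter alphabet and the helper names `build_character_set`, `encode` and `decode` are invented for the example and do not correspond to any real standard.

```python
def build_character_set(characters):
    """Assign each distinct character a fixed-width binary string."""
    unique = sorted(set(characters))
    # Smallest number of bits that gives every character its own code.
    width = max(1, (len(unique) - 1).bit_length())
    return {ch: format(i, f"0{width}b") for i, ch in enumerate(unique)}

def encode(text, charset):
    """Replace every character with its code from the character set."""
    return "".join(charset[ch] for ch in text)

def decode(bits, charset):
    """Cut the bit string into fixed-width codes and look each one up."""
    width = len(next(iter(charset.values())))
    reverse = {code: ch for ch, code in charset.items()}
    return "".join(reverse[bits[i:i + width]] for i in range(0, len(bits), width))

charset = build_character_set("abcde")   # hypothetical 5-character alphabet
bits = encode("bead", charset)
print(charset)
print(bits, decode(bits, charset))
```

Five characters need three bits each, so the four-letter word "bead" becomes a 12-bit string.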

Representing Text
  • A character set is a list of characters and the codes used to represent each one.
  • In 1960, a survey revealed 60 different character sets in use.
      • At IBM alone there were 9 different sets.
  • By agreeing to use ONE particular character set, computer manufacturers have made the processing of text data easier.

The ASCII Character Set
  • ASCII stands for American Standard Code for Information Interchange.
  • The ASCII character set originally used seven bits to represent each character, allowing for 128 unique characters.
  • Wikipedia has an excellent entry on ASCII.

The ASCII Character Set (7 bit)

The ASCII Character Set
  • Notice the organisation of the ASCII table.
  • The table divides in half according to the MSB (most significant bit).
      • Letters are all in the second half, so all codes for alphabetic characters start with 1.
  • This second half of the table divides in half again according to the next bit:
      • UPPERCASE letters start 10.
      • lowercase letters start 11.
  • The first half of the table also divides in half according to the next bit:
      • Control characters start 00.
      • Numerals and punctuation start 01.
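The four-way split of the table can be checked with a short Python sketch that looks only at the top two bits of a 7-bit code. The function name `ascii_group` and the group labels are assumptions made for this example. The two "letter" quarters are labelled as regions because they also hold a few punctuation marks, such as '@' and '{'.

```python
def ascii_group(ch):
    """Classify a 7-bit ASCII character by the top two bits of its code."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    top_two = code >> 5  # drop the low five bits, keeping the top two
    groups = {
        0b00: "control characters",
        0b01: "numerals and punctuation",
        0b10: "uppercase region",
        0b11: "lowercase region",
    }
    return groups[top_two]

for ch in ["\t", "7", "A", "j"]:
    print(repr(ch), format(ord(ch), "07b"), ascii_group(ch))
```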

The ASCII Character Set
  • Note that control characters (the first 32 in the ASCII character set) do not have simple character representations that you could print to the screen.
  • Many, however, perform actions with which you are familiar.
  • Some have their own keys; others need to be constructed.

The ASCII Character Set
Control Characters
Control sequences are created by holding the Ctrl (control) key and pressing a letter. This has the effect of subtracting 64 from the ASCII value of the letter pressed. For example:
  • ‘M’ has ASCII value 77 (1001101 in binary),
  • Ctrl-M has ASCII value 13 (0001101 in binary).
Alternatively, we can see this as “masking bit 6”: clearing the leftmost of the seven bits, which is worth 64 (bits are numbered from 0 on the right).
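The Ctrl-key arithmetic can be reproduced in a few lines of Python. This is a minimal sketch: the function name `ctrl` is made up for the example, and it models only the ASCII arithmetic, not how any particular keyboard driver or terminal behaves.

```python
def ctrl(key):
    """Return the ASCII value produced by Ctrl plus the given key."""
    code = ord(key.upper())
    # Only the 32 codes from '@' (64) to '_' (95) map onto control characters.
    if not 64 <= code <= 95:
        raise ValueError("key has no control character")
    return code & 0b0111111  # clear bit 6, the bit worth 64

print(ord("M"), format(ord("M"), "07b"))      # the letter itself
print(ctrl("M"), format(ctrl("M"), "07b"))    # Ctrl-M
```

Masking the bit and subtracting 64 give the same result here because every key in the accepted range has bit 6 set.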

The ASCII Character Set
Common Control Characters

Hex  Binary   Decimal  Name  Function
00   0000000  0        NUL   Null
07   0000111  7        BEL   Bell
08   0001000  8        BS    Backspace
09   0001001  9        HT    Horizontal Tab
0A   0001010  10       LF    Line Feed
0B   0001011  11       VT    Vertical Tab
0C   0001100  12       FF    Form Feed
0D   0001101  13       CR    Carriage Return
0E   0001110  14       SO    Shift Out
0F   0001111  15       SI    Shift In
1B   0011011  27       ESC   Escape
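Several of these control characters have escape sequences in Python string literals, which makes it possible to regenerate the hex, binary and decimal columns as a cross-check. The helper name `table_row` and the column layout are assumptions made for this sketch.

```python
# Control characters from the table that Python can write as escape sequences.
controls = [
    ("NUL", "\0"), ("BEL", "\a"), ("BS", "\b"), ("HT", "\t"),
    ("LF", "\n"), ("VT", "\v"), ("FF", "\f"), ("CR", "\r"),
    ("ESC", "\x1b"),
]

def table_row(name, ch):
    """Format one row as: hex, 7-bit binary, decimal, name."""
    code = ord(ch)
    return f"{code:02X}  {code:07b}  {code:>3}  {name}"

for name, ch in controls:
    print(table_row(name, ch))
```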

The ASCII Character Set
Coding letters in ASCII is easy. Let’s look at ‘j’ as an example:
  • Since ‘j’ is a letter, its code starts with a 1.
  • Since it’s lowercase, the next bit is also a 1.
  • Since it’s the tenth letter of the alphabet, the rest of the code is 01010.
The complete ASCII code for ‘j’ is 1101010.
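The same construction can be written as a small Python function and compared against the built-in `ord`. The name `letter_code` is an assumption for this sketch, which handles only the 52 unaccented English letters.

```python
def letter_code(letter):
    """Build the 7-bit ASCII code of a letter from its case and position."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    prefix = "11" if letter.islower() else "10"      # letter bit, then case bit
    position = ord(letter.lower()) - ord("a") + 1    # 'a' is 1, 'j' is 10
    return prefix + format(position, "05b")          # position in five bits

print(letter_code("j"))          # built from the rules
print(format(ord("j"), "07b"))   # what Python reports for comparison
```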

The ASCII Character Set
  • ASCII evolved so that eight bits are used.
  • The 7-bit codes were simply prefixed with another bit, giving another natural doubling.
      • The original 7-bit codes were padded with 0.
      • So the code for ‘j’ became 01101010.
  • 128 new characters were added.
      • The codes for this alternate character set start with 1.
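The effect of the extra leading bit can be seen in a brief Python sketch. The helper name `is_original_ascii` is invented for the example, and Latin-1 (ISO 8859-1) is used as one example of an 8-bit extension; that choice is an assumption, since the slides do not name a specific extended set.

```python
def is_original_ascii(byte):
    """True when the leading (eighth) bit of an 8-bit code is 0."""
    return (byte & 0b10000000) == 0

seven_bit = format(ord("j"), "07b")
eight_bit = format(ord("j"), "08b")   # same value, padded with a leading 0
print(seven_bit, eight_bit)

# 'é' is not in 7-bit ASCII; in Latin-1 its single-byte code starts with 1.
e_acute = "é".encode("latin-1")[0]
print(format(e_acute, "08b"), is_original_ascii(e_acute))
```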

The Unicode Character Set
  • Even the extended version of the ASCII character set is not enough for international use.
  • The Unicode character set was originally designed to use 16 bits per character.
  • With 16 bits, Unicode can represent 2^16, or over 65 thousand, characters. (The standard has since grown beyond 16 bits.)
  • Unicode was designed to be a superset of ASCII. That is, the first 256 characters in the Unicode character set correspond exactly to the extended ASCII character set (specifically ISO 8859-1, also called Latin-1).
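Python strings are sequences of Unicode code points, so the superset relationship can be checked directly. This sketch assumes that "extended ASCII" here means ISO 8859-1 (Latin-1), which is the 8-bit set whose 256 codes match the first 256 Unicode code points.

```python
# Number of distinct codes available with 16 bits.
sixteen_bit_codes = 2 ** 16
print(sixteen_bit_codes)

# An ASCII letter keeps the same numeric code in Unicode.
print(ord("j"), format(ord("j"), "016b"))

# Every one of the first 256 code points encodes to the same single
# byte value in Latin-1.
matches = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))
print(matches)

# A character beyond the 8-bit range that still fits in 16 bits.
euro = ord("€")
print(hex(euro), format(euro, "016b"))
```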

Examples of Unicode Characters
Figure 3.6 A few characters in the Unicode character set