Is ASCII the only way?

For computers to do anything (besides sit on a desk and collect dust), they need two things:
1. PROGRAMS
2. DATA
A program is a set of instructions that specify each step the computer should take to solve a problem or perform a task. Data is information that is stored and processed. Data may be numeric, alphabetical ("string" in computerese), or a combination of the two ("alphanumeric" - more computerese).

How can the computer store data? It has no eyes and hasn't been to school to learn the meaning of "4", or to understand that the symbols 4, four, and IV represent the same idea. Can computers think, anyway? (Well, that's a discussion for another day. Possibly a research topic for someone's final paper? First come, first served.) Since computers can only work with 2 states, on/off or 1/0, all data must be transmitted and stored as some combination of the 2 states. Electronically, we are talking about on/off. (That we will leave for the computer engineering majors.) In computer science, we abstract a bit and use the 1/0 representation.

Let's focus on fixed-length bit strings. "Bit" is more computerese, formed from parts of the phrase "binary digit", referring to 1 and 0. In a binary number system, only 2 digits are needed, 1 and 0, unlike our decimal system, which needs 10 (0, 1, 2, ..., 9).

For many years, the computer world agreed to use the American Standard Code for Information Interchange, referred to as ASCII (pronounced "as ski"). Part of the code is below:

ASCII (partial table)

Char  128  64  32  16   8   4   2   1   Dec
A       0   1   0   0   0   0   0   1    65
B       0   1   0   0   0   0   1   0    66
C       0   1   0   0   0   0   1   1    67
D       0   1   0   0   0   1   0   0    68
E       0   1   0   0   0   1   0   1    69
F       0   1   0   0   0   1   1   0    70
G       0   1   0   0   0   1   1   1    71
H       0   1   0   0   1   0   0   0    72

Each string of 8 bits is called a byte. (Remember, computers don't bite!)

Question: What is the maximum number of codes that can be created using 8 bits, from 00000000 to 11111111? That is, how many characters can be represented by 1 byte? (That sounds like mathematics. Oh no! Oh yes, there is a considerable amount of discrete mathematics, the fun type, in computer science.)
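One way to check an answer to the question above: each extra bit doubles the number of distinct patterns, so n bits give 2 to the power n codes. The function name code_count is made up for this sketch.

```python
def code_count(bits):
    """Number of distinct bit patterns that fit in the given number of bits."""
    return 2 ** bits

# 8 bits cover every pattern from 00000000 through 11111111.
print(code_count(8))  # 256
```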

Note: today, another coding system, called Unicode, is used to store and transmit data. ASCII is a subset of Unicode and is therefore still used. Unicode might be another research topic, but it is a narrow topic, which might take some creativity to make interesting and grade-worthy.

A fixed-length code, such as ASCII, wastes time and space. For example, consider the following sentence: THE QUICK GREY FOX JUMPED OVER THE LAZY BROWN COWS. We will use all capital letters and no punctuation for simplicity. Why is this a great sentence for studying codes? How many characters are in this sentence?

THE QUICK GREY FOX JUMPED OVER THE LAZY BROWN COWS

01010100 01001000 01000101 00100000
01010001 01010101 01001001 01000011 01001011 00100000
01000111 01010010 01000101 01011001 00100000
01000110 01001111 01011000 00100000
01001010 01010101 01001101 01010000 01000101 01000100 00100000
01001111 01010110 01000101 01010010 00100000
01010100 01001000 01000101 00100000
01001100 01000001 01011010 01011001 00100000
01000010 01010010 01001111 01010111 01001110 00100000
01000011 01001111 01010111 01010011
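The fixed-length encoding can be reproduced with a few lines of Python. This is only a sketch: the helper name to_ascii_bits is invented here, and it relies on Python's built-in ord() returning the ASCII code for these characters.

```python
SENTENCE = "THE QUICK GREY FOX JUMPED OVER THE LAZY BROWN COWS"

def to_ascii_bits(text):
    """Return one 8-bit string per character, using its ASCII code."""
    return [format(ord(ch), "08b") for ch in text]

codes = to_ascii_bits(SENTENCE)
print(" ".join(codes[:3]))                           # 01010100 01001000 01000101
print(len(codes), "bytes,", 8 * len(codes), "bits")  # 50 bytes, 400 bits
```

Because every character takes exactly 8 bits, the receiver can split the stream into characters just by counting.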

The code for our sentence takes 50 bytes, or 400 bits. We know where each new character starts by counting every 8 bits, since each character takes the same amount of space. This method is wasteful, and the message can be compressed significantly. One approach, commonly used for text, is to reduce the number of bits for the most frequent characters. In English, E is the most common letter, so it is reasonable to use the fewest bits for E, and to use a large number of bits for a seldom-used letter, such as Z. This produces a variable-length code that takes less space.

Data compression is important in many situations, such as sending data over the Internet, where transmission can take a long time, especially over a dial-up connection. In a sentence such as SUSIE SAYS IT IS EASY, E is not the most common character. In the Java language, the semicolon (;) appears most often. Therefore, to achieve optimal compression, we need to create a new code for each message. This is the method David Huffman published in 1952. Depending on the message, savings of 20% to 90% are achieved.

Here is Huffman's algorithm:

1. Make a table showing the frequency of each character in the message.
Ex (for the Susie sentence): A-2 E-2 I-3 S-6 T-1 U-1 Y-2 space-4 linefeed-1

2. Sort the table by frequency in descending order:
S-6 space-4 I-3 A-2 E-2 Y-2 T-1 U-1 linefeed-1

3. Give the shortest code to the character with the highest frequency. Starting with 2 bits, increase the number of bits as needed. One way to encode the Susie sentence is:
S-10 space-00 I-110 A-010 E-1111 Y-1110 T-0110 U-01111 linefeed-01110
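Steps 1 and 2 can be sketched in Python with collections.Counter. The names MESSAGE and frequency_table are made up for this example, and the alphabetical tie-breaking is an assumption, since the slides do not say how to order characters with equal frequency.

```python
from collections import Counter

MESSAGE = "SUSIE SAYS IT IS EASY\n"  # the trailing "\n" is the linefeed

def frequency_table(message):
    """Count each character, then sort by descending frequency."""
    counts = Counter(message)
    # Assumption: ties are broken alphabetically (not specified in the slides).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for ch, n in frequency_table(MESSAGE):
    print(repr(ch), n)
```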

Encode the Susie sentence using ASCII and encode the message again using the above Huffman code. What is the % savings in the number of bits?

Solution
ASCII: (8 bits) * (21 characters) = 168 bits (the linefeed is not counted)
Huffman: 10011111011011110010010111010001 1001100011010001111010101110 = 60 bits
Difference in bits = 168 - 60 = 108
% savings = (108/168) * 100 = 64%
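The arithmetic in the solution can be checked with a short script. This is a sketch under the same assumptions as the solution: the linefeed is left out of the message, and the code table is the one given for the Susie sentence. The names CODE, MESSAGE, and encode are made up here.

```python
CODE = {"S": "10", " ": "00", "I": "110", "A": "010", "E": "1111",
        "Y": "1110", "T": "0110", "U": "01111", "\n": "01110"}
MESSAGE = "SUSIE SAYS IT IS EASY"  # 21 characters, linefeed left out

def encode(message, code):
    """Replace each character by its variable-length code."""
    return "".join(code[ch] for ch in message)

bits = encode(MESSAGE, CODE)
ascii_bits = 8 * len(MESSAGE)
savings = 100 * (ascii_bits - len(bits)) / ascii_bits
print(len(bits), ascii_bits, round(savings))  # 60 168 64
```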

Question: How do we know where one character ends and another character begins? To decode a message, we use a device called a Huffman tree. Trees are data structures that are used in many ways in computer science. Trees are usually drawn upside down, with the root at the top.

Here is the Huffman Tree needed to decode the Susie message:

                  22
               /      \
             9          13
            / \        /  \
       4(sp)   5    6(S)    7
              / \          / \
          2(A)   3     3(I)   4
                / \          / \
            1(T)   2     2(Y)   2(E)
                  / \
             1(lf)   1(U)

The characters of the message appear in the terminal (or leaf) nodes of the tree. The number outside each terminal node is the frequency of that character. (The number outside each non-terminal node will be explained when we learn to create a Huffman tree.) In the coded message, a '0' means go left, and a '1' means go right. Use the tree to decode the Susie message. Each time a leaf is reached, record the character. Return to the root (top!) of the tree and begin following the next path.
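The decoding walk can be sketched in Python. The nested-tuple representation is an assumption made for this example (a leaf is a character, a non-terminal node is a (left, right) pair), and ENCODED is the 60-bit Huffman encoding of the Susie sentence without the linefeed.

```python
# Leaves are characters; non-terminal nodes are (left, right) pairs.
TREE = (
    (" ", ("A", ("T", ("\n", "U")))),  # left subtree, frequency 9
    ("S", ("I", ("Y", "E"))),          # right subtree, frequency 13
)

def decode(bits, tree):
    """Follow 0 = left, 1 = right; at each leaf, record it and restart at the root."""
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)

ENCODED = "100111110110111100100101110100011001100011010001111010101110"
print(decode(ENCODED, TREE))  # SUSIE SAYS IT IS EASY
```

No separators are needed between characters: reaching a leaf is what marks the end of a character.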

Since a Huffman code is different for each message, the tree for that code must be transmitted with the message. (Programmers create a tree with the programming language they use, but first they must create it with paper and pencil. That is how we will do it.)

Here is the algorithm to create a Huffman Tree:

1. Make a node for each character, and list them in increasing order.
Ex: lf(1) U(1) T(1) Y(2) E(2) A(2) I(3) sp(4) S(6)

2. Remove the 2 lowest frequency nodes and create a tree with them, labeling the root node with the sum of those frequencies.

       2
      / \
  lf(1)  U(1)

3. Re-insert the new tree in the list of nodes in the position that keeps the list ordered.

  T(1)     2      Y(2)  E(2)  A(2)  I(3)  sp(4)  S(6)
          / \
      lf(1)  U(1)

4. Repeat these steps until there is one large tree.

After the next repetition, T(1) and the tree with root 2 have been combined into a tree with root 3:

  Y(2)  E(2)  A(2)     3       I(3)  sp(4)  S(6)
                      / \
                  T(1)   2
                        / \
                    lf(1)  U(1)

Continue until you have the tree we saw earlier.
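The four steps can be sketched in Python. This is one possible implementation, not the only one: it keeps the nodes in a plain sorted list, and it assumes a new tree is re-inserted in front of any nodes of equal frequency, which is the placement the worked example shows. The names build_huffman_tree, code_table, and START are made up for this sketch.

```python
def build_huffman_tree(nodes):
    """nodes: list of (frequency, tree) pairs in increasing frequency order.
    A tree is a character (leaf) or a (left, right) pair."""
    nodes = list(nodes)
    while len(nodes) > 1:
        (w1, t1), (w2, t2) = nodes[0], nodes[1]  # the 2 lowest frequencies
        del nodes[:2]
        merged = (w1 + w2, (t1, t2))             # root labeled with the sum
        # Assumption: re-insert before the first node whose frequency is
        # greater than or equal to the new root's frequency.
        pos = 0
        while pos < len(nodes) and nodes[pos][0] < merged[0]:
            pos += 1
        nodes.insert(pos, merged)
    return nodes[0]

def code_table(tree, prefix=""):
    """Read the codes off the tree: 0 for a left branch, 1 for a right branch."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table

START = [(1, "\n"), (1, "U"), (1, "T"), (2, "Y"), (2, "E"), (2, "A"),
         (3, "I"), (4, " "), (6, "S")]
weight, tree = build_huffman_tree(START)
print(weight)  # 22
print(code_table(tree))
```

With this starting order and tie rule, the codes read off the finished tree match the code table used for the Susie sentence. A different tie rule can give a different, equally short code.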

Once a Huffman tree is complete, it can be searched for each character:
When "00" is read, go left, then left again - 'space' is found.
When "10" is read, go right, then left - 'S' is found.
When "010" is read, go left, then right, then left - 'A' is found.
And so on.

A Huffman tree is made for each message, but the tree is not necessarily unique. Some variations in the tree are possible for a given coding scheme. However, there is an important restriction on the coding scheme, because a Huffman code is a prefix code: no character's code may be the beginning of another character's code.

Suppose '01' was used for 'S' and '11' was used for 'space' instead of what we used. What problem would arise in attempting to decode the 'Susie' message? Since '01' is the beginning of the code for 'A' as well as the code for 'T', the system would not work: after reading '01', the decoder could not tell whether it had found an 'S' or only the start of another character.
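The prefix restriction can be tested mechanically. In this sketch, is_prefix_free, GOOD, and BAD are names invented for the example: GOOD is the code table used for the Susie sentence and BAD is the altered table described above.

```python
def is_prefix_free(code):
    """True if no codeword is the beginning of another codeword."""
    words = sorted(code.values())
    # After sorting, a word that is a prefix of any other word is also a
    # prefix of the word that comes right after it.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

GOOD = {"S": "10", " ": "00", "I": "110", "A": "010", "E": "1111",
        "Y": "1110", "T": "0110", "U": "01111", "\n": "01110"}
BAD = {**GOOD, "S": "01", " ": "11"}  # the altered codes for S and space

print(is_prefix_free(GOOD))  # True
print(is_prefix_free(BAD))   # False
```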

Assignment: Write a 4 or 5 word sentence on a sheet of paper. Code the sentence with ASCII on another sheet of paper. Code the sentence with a Huffman code on a third sheet of paper. Include the Huffman tree on the third sheet.