Plan for the Week: Understand Huffman Coding


Plan for the Week
  • Understand Huffman Coding
      - Data compression
      - Priority Queue
  • Bits and Bytes
  • Greedy Algorithms + Data Structures = Programs
      - What does this mean and who said it?
      - From greedy to Dijkstra’s algorithm
      - From divide-and-conquer to sorting to …
CPS 100, Spring 2010

Compression and Coding
  • What gets compressed?
      - Save on storage, why is this a good idea?
      - Save on data transmission, how and why?
  • What is information, how is it compressible?
      - Exploit redundancy, without that, hard to compress
  • Represent information: code (Morse cf. Huffman)
      - Dots and dashes or 0’s and 1’s
      - How to construct code?

Typoglycemia: Wikipedia (orphan)
  • In a puiltacibon of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty would hadrly be aftcfeed. My ansaylis did not come to much beucase thoery at the time was for shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang.
    Graham Rawlinson, Ph.D. thesis, 1976
  • Scrmabling, an anagram of scrambling, is the verb for rearranging the letters in a word, but leaving the first and last letters as is.

Huffman Coding
  • D. A. Huffman in early 1950’s: story of invention
      - Analyze and process data before compression
      - Not developed to compress data “on-the-fly”
  • Represent data using variable length codes
      - Each letter/chunk assigned a codeword/bitstring
      - Codeword for letter/chunk is produced by traversing the Huffman tree
      - Property: No codeword produced is the prefix of another
      - Frequent letters/chunks have short encodings, while those that appear rarely have longer ones
  • Huffman coding is an optimal per-character coding method

Coding/Compression/Concepts
  • For ASCII we use 8 bits, for Unicode 16 bits
      - Minimum number of bits to represent N values?
      - Representation of genomic data (a, c, g, t)?
      - What about noisy genomic data?
  • We can use a variable-length encoding, e.g., Huffman
      - How do we decide on lengths? How do we decode?
      - Values for Morse code encodings, why?
      - … --- …
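
One way to approach the "minimum number of bits" question: a fixed-length code for N distinct values needs ceil(log2 N) bits per value. The sketch below is illustrative only; the names `BitsNeeded` and `bitsFor` are assumptions and do not come from the course code.

```java
// Hypothetical helper (not from the slides): the fewest bits that give each
// of n distinct values its own fixed-length code, i.e. ceil(log2(n)).
final class BitsNeeded {
    static int bitsFor(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        // For n >= 1, 32 - numberOfLeadingZeros(n - 1) equals ceil(log2(n)).
        return 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        System.out.println(bitsFor(4));    // a, c, g, t
        System.out.println(bitsFor(5));    // a, c, g, t plus one "unknown" symbol
        System.out.println(bitsFor(256));  // every value of an 8-bit byte
    }
}
```

Under this view, clean genomic data fits in 2 bits per base, while adding even one extra "noisy" symbol pushes a fixed-length code to 3 bits, which is one motivation for variable-length codes.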

Mary Shaw
  • Software engineering and software architecture
      - Tools for constructing large software systems
      - Development is a small piece of total cost, maintenance is larger, depends on well-designed and developed techniques
  • Interested in computer science, programming, curricula, and canoeing, health-care costs
  • ACM Fellow, Alan Perlis Professor of Compsci at CMU

Huffman coding: go go gophers

    char   ASCII   binary    3 bits
    g      103     1100111   000
    o      111     1101111   001
    p      112     1110000   010
    h      104     1101000   011
    e      101     1100101   100
    r      114     1110010   101
    s      115     1110011   110
    sp.     32     0100000   111

    Huffman code for each character: ??

  • Choose two smallest weights
      - Combine nodes + weights
      - Repeat
      - Priority queue?
  • Encoding uses tree:
      - 0 left/1 right
      - How many bits?

[Figure: the eight weighted leaves (g 3, o 3, p 1, h 1, e 1, r 1, s 1, * 2, where * is the space) and the intermediate trees of weight 2, 2, 3, 4 and 6 formed by repeatedly combining the two smallest]

Huffman coding: go go gophers

    char   ASCII   binary    3 bits   Huffman
    g      103     1100111   000      00
    o      111     1101111   001      01
    p      112     1110000   010      1100
    h      104     1101000   011      1101
    e      101     1100101   100      1110
    r      114     1110010   101      1111
    s      115     1110011   110      100
    sp.     32     0100000   111      101

  • Encoding uses tree/trie:
      - 0 left/1 right
      - How many bits? 37!!
      - Savings? Worth it?

[Figure: the finished tree. Root 13 has children 6 (g 3, o 3) and 7; 7 has children 3 (s 1, * 2) and 4; 4 has children 2 (p 1, h 1) and 2 (e 1, r 1)]
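
To make the "How many bits?" comparison concrete, the sketch below encodes "go go gophers" with the Huffman codewords from the table and compares the length against 8-bit and 3-bit fixed-length encodings. The class and method names are assumptions made for illustration; only the codewords come from the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: count the bits needed for "go go gophers".
final class GophersBits {
    // Codewords copied from the Huffman column of the slide's table.
    static Map<Character, String> slideCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('g', "00");
        codes.put('o', "01");
        codes.put('p', "1100");
        codes.put('h', "1101");
        codes.put('e', "1110");
        codes.put('r', "1111");
        codes.put('s', "100");
        codes.put(' ', "101");
        return codes;
    }

    // Replace every character by its codeword; the result is a string of '0'/'1'.
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char ch : text.toCharArray()) {
            String code = codes.get(ch);
            if (code == null) {
                throw new IllegalArgumentException("no codeword for '" + ch + "'");
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String text = "go go gophers";
        System.out.println("8 bits per char: " + text.length() * 8);
        System.out.println("3 bits per char: " + text.length() * 3);
        System.out.println("Huffman:         " + encode(text, slideCodes()).length());
    }
}
```

With 13 characters, the fixed-length encodings need 104 and 39 bits, against 37 for the Huffman codewords. The saving over the 3-bit code is small here because the message is short and the counts are close to uniform, and it ignores the cost of storing the tree.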

Building a Huffman tree
  • Begin with a forest of single-node trees/tries (leaves)
      - Each node/tree/leaf is weighted with character count
      - Node stores two values: character and count
  • Repeat until there is only one node left: root of tree
      - Remove two minimally weighted trees from forest
      - Create new tree/internal node with minimal trees as children,
          · Weight is sum of children’s weight (no char)
  • How does process terminate? Finding minimum?
      - Remove minimal trees, hummm……

How do we create Huffman Tree/Trie?
  • Insert weighted values into priority queue
      - What are initial weights? Why?
  • Remove minimal nodes, weight by sums, re-insert
      - Total number of nodes?

    // requires: import java.util.PriorityQueue;
    PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>();
    // one leaf per character, weighted by its count (leaves have no children)
    for (int k = 0; k < freq.length; k++) {
        pq.add(new TreeNode(k, freq[k], null, null));
    }
    // combine the two lightest trees until a single tree remains
    while (pq.size() > 1) {
        TreeNode left = pq.remove();
        TreeNode right = pq.remove();
        pq.add(new TreeNode(0, left.weight + right.weight, left, right));
    }
    TreeNode root = pq.remove();
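
The priority-queue loop can be tried out with a small stand-in for `TreeNode`. The course's real `TreeNode` class is not shown in the slides, so the fields, constructor and `compareTo` below are assumptions modeled on the calls in the slide's code. Unlike the slide, this sketch skips characters whose count is zero, and it assumes 8-bit characters.

```java
import java.util.PriorityQueue;

// Illustrative sketch of building a Huffman tree from character counts.
final class HuffmanBuild {
    // Hypothetical stand-in for the course's TreeNode class.
    static final class TreeNode implements Comparable<TreeNode> {
        final int value;            // character code; unused in internal nodes
        final int weight;           // count for a leaf, sum of children otherwise
        final TreeNode left, right;

        TreeNode(int value, int weight, TreeNode left, TreeNode right) {
            this.value = value;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        @Override
        public int compareTo(TreeNode other) {
            return Integer.compare(weight, other.weight);  // lighter trees come out first
        }
    }

    // First pass over the data: count each (8-bit) character.
    static int[] countChars(String text) {
        int[] freq = new int[256];
        for (char ch : text.toCharArray()) {
            freq[ch & 0xff]++;
        }
        return freq;
    }

    // Greedy step: repeatedly merge the two minimally weighted trees.
    static TreeNode buildTree(int[] freq) {
        PriorityQueue<TreeNode> pq = new PriorityQueue<>();
        for (int k = 0; k < freq.length; k++) {
            if (freq[k] > 0) {
                pq.add(new TreeNode(k, freq[k], null, null));
            }
        }
        if (pq.isEmpty()) {
            throw new IllegalArgumentException("no characters to encode");
        }
        while (pq.size() > 1) {
            TreeNode left = pq.remove();
            TreeNode right = pq.remove();
            pq.add(new TreeNode(0, left.weight + right.weight, left, right));
        }
        return pq.remove();
    }

    // Encoded length in bits: each leaf contributes weight * depth.
    static int encodedBits(TreeNode node, int depth) {
        if (node.isLeaf()) {
            return node.weight * depth;
        }
        return encodedBits(node.left, depth + 1) + encodedBits(node.right, depth + 1);
    }
}
```

Ties between equal weights may be broken differently from the slide's picture, so individual codewords can differ, but any tree produced this way has the same total encoded length. With k distinct characters the tree has k leaves and k - 1 internal nodes.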

Creating compressed file
  • Once we have new encodings, read every character
      - Write encoding, not the character, to compressed file
      - Why does this save bits?
      - What other information needed in compressed file?
  • How do we uncompress?
      - How do we know foo.hf represents compressed file?
      - Is suffix sufficient? Alternatives?
  • Why is Huffman coding a two-pass method?
      - Alternatives?
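
The encodings themselves come from a traversal of the tree: append 0 when going left and 1 when going right, and record the path at each leaf. The sketch below does this recursively for the "go go gophers" tree, which is built by hand here so the example stands alone. The `Node` class and all method names are assumptions for illustration, not the course's classes.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: derive the table of codewords from a Huffman tree.
final class CodeTable {
    static final class Node {
        final char symbol;          // meaningful only when the node is a leaf
        final Node left, right;

        Node(char symbol) { this(symbol, null, null); }
        Node(Node left, Node right) { this('\0', left, right); }
        private Node(char symbol, Node left, Node right) {
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }
    }

    // Recursive traversal: the path from the root to a leaf is its codeword.
    // Assumes every internal node has two children, as in a Huffman tree.
    static void fill(Node node, String path, Map<Character, String> table) {
        if (node.isLeaf()) {
            table.put(node.symbol, path);
            return;
        }
        fill(node.left, path + "0", table);
        fill(node.right, path + "1", table);
    }

    static Map<Character, String> table(Node root) {
        Map<Character, String> result = new TreeMap<>();
        fill(root, "", result);
        return result;
    }

    // The tree for "go go gophers", assembled to match the slide's codewords.
    static Node slideTree() {
        Node go = new Node(new Node('g'), new Node('o'));
        Node sSpace = new Node(new Node('s'), new Node(' '));
        Node ph = new Node(new Node('p'), new Node('h'));
        Node er = new Node(new Node('e'), new Node('r'));
        return new Node(go, new Node(sSpace, new Node(ph, er)));
    }
}
```

Building the table needs the counts, and the counts need a full read of the input, which is one reason the method takes two passes: one to count, one to write the codewords.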

Uncompression with Huffman
  • We need the trie to uncompress
      - 00010011001101111
  • As we read a bit, what do we do?
      - Go left on 0, go right on 1
      - When do we stop? What to do?
  • How do we get the trie?
      - How did we get it originally? Store 256 int/counts
          · How do we read counts?
      - How do we store a trie? 20 Questions relevance?
          · Reading a trie? Leaf indicator? Node values?
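
The "go left on 0, go right on 1" walk can be sketched as a loop rather than a recursion: follow one edge per bit, and on reaching a leaf emit its character and return to the root. The example rebuilds the trie from the "go go gophers" codewords so it stands alone; a real decoder would reconstruct it from stored counts or a stored tree. All names are assumptions for illustration, and bits are held in a String of '0'/'1' characters for readability.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: decode a bit string by walking a trie of codewords.
final class HuffmanDecode {
    static final class Node {
        Node zero, one;     // children for bit 0 (left) and bit 1 (right)
        char symbol;        // meaningful only at a leaf
        boolean isLeaf() { return zero == null && one == null; }
    }

    // Codewords from the "go go gophers" example.
    static Map<Character, String> slideCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('g', "00");   codes.put('o', "01");
        codes.put('p', "1100"); codes.put('h', "1101");
        codes.put('e', "1110"); codes.put('r', "1111");
        codes.put('s', "100");  codes.put(' ', "101");
        return codes;
    }

    // Insert each codeword into the trie, creating nodes along its path.
    static Node trieFrom(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            Node cur = root;
            for (char bit : entry.getValue().toCharArray()) {
                if (bit == '0') {
                    if (cur.zero == null) cur.zero = new Node();
                    cur = cur.zero;
                } else {
                    if (cur.one == null) cur.one = new Node();
                    cur = cur.one;
                }
            }
            cur.symbol = entry.getKey();
        }
        return root;
    }

    // Iterative decode: one step down per bit, restart at the root after a leaf.
    static String decode(String bits, Node root) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur == null) {
                throw new IllegalArgumentException("bits do not follow any codeword");
            }
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root;
            }
        }
        if (cur != root) {
            throw new IllegalArgumentException("input ends in the middle of a codeword");
        }
        return out.toString();
    }
}
```

Because no codeword is a prefix of another, reaching a leaf is an unambiguous signal that one character is complete. What the loop cannot tell on its own is where the real data ends when the file is padded to a whole byte, which is the role of a marker such as PSEUDO_EOF.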

Other Huffman Issues
  • What do we need to decode?
      - How did we encode? How will we decode?
      - What information needed for decoding?
  • Reading and writing bits: chunks and stopping
      - Can you write 3 bits? Why not? Why?
      - PSEUDO_EOF
      - BitInputStream and BitOutputStream: API
  • What should happen when the file won’t compress?
      - Silently compress bigger? Warn user? Alternatives?

Huffman Complexities
  • How do we measure? Size of input file, size of alphabet
      - Which is typically bigger?
  • Accumulating character counts: ______
      - How can we do this in O(1) time, though not really
  • Building the heap/priority queue from counts ____
      - Initializing heap guaranteed
  • Building Huffman tree ____
      - Why?
  • Create table of encodings from tree ____
      - Why?
  • Write tree and compressed file _____

Good Compsci 100 Assignment?
  • Array of character/chunk counts, or is this a map?
      - Map character/chunk to count, why array?
  • Priority Queue for generating tree/trie
      - Do we need a heap implementation? Why?
  • Tree traversals for code generation, uncompression
      - One recursive, one not, why and which?
  • Deal with bits and chunks rather than ints and chars
      - The good, the bad, the ugly
  • Create a working compression program
      - How would we deploy it? Make it better?
  • Benchmark for analysis
      - What’s a corpus?

Other methods
  • Adaptive Huffman coding
  • Lempel-Ziv algorithms
      - Build the coding table on the fly while reading document
      - Coding table changes dynamically
      - Protocol between encoder and decoder so that everyone is always using the right coding scheme
      - Works well in practice (compress, gzip, etc.)
  • More complicated methods
      - Burrows-Wheeler (bunzip2)
      - PPM statistical methods

Robert Tarjan
  • Graph algorithms
      - LCA, Connected Components, Planarity, persistence
      - Turing Award, 1986, with Hopcroft; Nevanlinna Prize, 1983; Kanellakis Theory/Practice, 1999
  • Interaction is very important and having cooperative spirit. One of the greatest joys of being a professor is having graduate students… it’s very motivating because you’ve got people whose minds are fresh and open and eager to learn
    (Shasha, Out of Their Minds)

Views of programming
  • Writing code from the method/function view is pretty similar across languages
      - Organizing methods is different, organizing code is different, not all languages have classes
      - Loops, arrays, arithmetic, …
  • Program using abstractions and high level concepts
      - Do we need to understand 32-bit twos-complement storage to understand x = x+1?
      - Do we need to understand how arrays map to contiguous memory to use ArrayLists?

Bottom up meets Top down
  • We teach programming by teaching abstractions
      - What’s an array? What’s an ArrayList?
      - What’s an array in C?
      - What’s a map, Hashmap? Treemap?
  • How are programs stored and executed in the JVM?
      - Differences on different architectures?
      - Similarities between Java and C++ and Python and PHP?
  • What about how the CPU works? How memory works? Shouldn’t we study this too?

From bit to byte to char to int to long
  • Ultimately everything is stored as either a 0 or 1
      - Bit is binary digit, a byte is a binary term (8 bits)
      - We should be grateful we can deal with Strings rather than sequences of 0's and 1's.
      - We should be grateful we can deal with an int rather than the 32 bits that comprise an int
  • If R, G, and B each take a value from 0 to 255, how can we pack this into an int?
      - Why should we care, can’t we use one int per color?
      - How do we do the packing and unpacking?

More information on bit, int, long
  • int values are stored as two's complement numbers with 32 bits, for 64 bits use the type long, a char is 16 bits
      - Standard in Java, different in C/C++
      - Facilitates addition/subtraction for int values
      - We don't need to worry about this, except to note:
          · Infinity + 1 = -Infinity (see Integer.MAX_VALUE)
          · Math.abs(-Infinity) is still negative (see Integer.MIN_VALUE)
  • Java byte, int, long are signed values, char unsigned
      - What are values for 16-bit char? 8-bit byte?
      - Why will this matter in Burrows Wheeler?
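
The two "Infinity" remarks refer to the ends of the int range, and both can be checked directly. This is a small illustrative sketch; the class and method names are assumptions.

```java
// Illustrative sketch: what happens at the edges of 32-bit two's complement.
final class IntEdges {
    // Plain int addition; wraps around silently on overflow.
    static int next(int x) {
        return x + 1;
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;   //  2^31 - 1
        int min = Integer.MIN_VALUE;   // -2^31

        // Adding 1 to the largest int wraps around to the smallest.
        System.out.println(next(max) == min);

        // +2^31 does not fit in an int, so Math.abs returns its argument unchanged.
        System.out.println(Math.abs(min) == min);
        System.out.println(Math.abs(min) < 0);
    }
}
```

All three lines print true. When overflow should be an error rather than a silent wrap, `Math.addExact` throws an `ArithmeticException` instead.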

Signed, unsigned, and why we care
  • Some applications require attention to memory use
      - Differences: one-million bytes, chars, and ints
          · First requires a megabyte, last requires four megabytes
          · When do we care about these differences?
      - Memory is cheaper, faster, … But applications expand to use it
  • Java signed byte: -128..127, # bits?
      - What if we only want 0-255? (Huff, pixels, …)
      - Convert negative values or use char, trade-offs?
  • Java char unsigned: 0..65,535, # bits?
      - Why is char unsigned? Why not as in C++/C?
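
For the "what if we only want 0-255?" question, one common approach is to keep the data in a byte and widen it to an int with a mask whenever the unsigned value is needed. A minimal sketch, with assumed class and method names:

```java
// Illustrative sketch: reading a signed Java byte as a value in 0..255.
final class UnsignedBytes {
    // Widening to int sign-extends; masking with 0xff keeps only the low 8 bits.
    static int toUnsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte b = (byte) 200;                          // same 8 bits as 200, stored as -56
        System.out.println(b);                        // -56
        System.out.println(toUnsigned(b));            // 200
        System.out.println(Byte.toUnsignedInt(b));    // 200, library equivalent (Java 8+)
    }
}
```

The trade-off is space against convenience: a byte array uses one byte per value but needs the mask on every read, while a char or int array avoids the mask at two or four times the memory.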

More details about bits
  • How is 13 represented?
      - …  _0_  _1_  _1_  _0_  _1_
           2^4  2^3  2^2  2^1  2^0
      - Total is 8+4+1 = 13
  • What is bit representation of 32? Of 15? Of 1023?
      - What is bit-representation of 2^n - 1?
      - What is bit-representation of 0? Of -1?
          · Study later, but -1 is all 1’s, left-most bit determines < 0
  • Determining what bits are on? How many on?
      - Understanding, problem-solving
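
The "what bits are on, how many on" questions can be explored with a shift and a mask. The helper names below are assumptions for illustration; `Integer.toBinaryString` and `Integer.bitCount` are the standard library equivalents.

```java
// Illustrative sketch: inspecting the individual bits of an int.
final class BitInspect {
    // Is the bit at the given position (0 = least significant) set?
    static boolean isOn(int value, int position) {
        return ((value >>> position) & 1) == 1;
    }

    // Count the 1 bits by examining and discarding the lowest bit each time.
    static int countOn(int value) {
        int count = 0;
        while (value != 0) {
            count += value & 1;
            value >>>= 1;      // unsigned shift, so negative values also reach 0
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(13));    // 1101
        System.out.println(Integer.toBinaryString(1023));  // ten 1's: 2^10 - 1
        System.out.println(countOn(-1));                   // 32: -1 is all 1's
    }
}
```

The unsigned shift `>>>` matters here: with the signed shift `>>`, a negative value keeps filling with 1's from the left and the loop would never terminate.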

How are data stored?
  • To facilitate Huffman coding we need to read/write one bit
      - Why do we need to read one bit?
      - Why do we need to write one bit?
      - When do we read 8 bits at a time? 32 bits?
  • We can't actually write one bit-at-a-time. We can't really write one char at a time either.
      - Output and input are buffered, minimize memory accesses and disk accesses
      - Why do we care about this when we talk about data structures and algorithms?
          · Where does data come from?

How do we buffer char output?
  • Done for us as part of InputStream and Reader classes
      - InputStreams are for reading bytes
      - Readers are for reading char values
      - Why do we have both and how do they interact?

    Reader r = new InputStreamReader(System.in);

      - Do we need to flush our buffers?
  • In the past Java IO has been notoriously slow
      - Do we care about I? About O?
      - This is changing, and the java.nio classes help
          · Map a file to a region in memory in one operation

Buffer bit output
  • To buffer bits we store bits in a buffer (duh)
      - When the buffer is full, we write it.
      - The buffer might overflow, e.g., in process of writing 10 bits to 32-bit capacity buffer that has 29 bits in it
      - How do we access bits, add to buffer, etc.?
  • We need to use bit operations
      - Mask bits: access individual bits
      - Shift bits: to the left or to the right
      - Bitwise and/or/negate bits
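
A bit buffer along these lines can be sketched with one mask and one shift per bit. This is not the course's BitOutputStream, whose API is not shown in the slides; the class below is a hypothetical stand-in that uses an 8-bit buffer instead of the slide's 32-bit one and sidesteps the overflow case by adding a single bit at a time.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch: collect bits and write them out a full byte at a time.
final class BitBuffer {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int buffer = 0;   // bits collected so far, in the low-order positions
    private int count = 0;    // number of bits currently held (0..7)

    // Append the low numBits bits of value, most significant bit first.
    void writeBits(int numBits, int value) {
        for (int i = numBits - 1; i >= 0; i--) {
            int bit = (value >>> i) & 1;     // mask: pick out one bit
            buffer = (buffer << 1) | bit;    // shift: make room, then add it
            count++;
            if (count == 8) {                // buffer full: write it and start over
                out.write(buffer);
                buffer = 0;
                count = 0;
            }
        }
    }

    // Pad any final partial byte with 0 bits and return everything written.
    byte[] finish() {
        if (count > 0) {
            out.write(buffer << (8 - count));
            buffer = 0;
            count = 0;
        }
        return out.toByteArray();
    }
}
```

Writing 3 bits and then 10 bits gives 13 bits, stored as two bytes with three padding zeros at the end. That padding is why "can you write 3 bits?" has no direct answer at the file level, and why the decoder needs some way of knowing where the real data stops.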

Representing pixels
  • Pixel typically stores RGB and alpha/transparency values
      - Each RGB is a value in the range 0 to 255
      - The alpha value is also in range 0 to 255

    Pixel red = new Pixel(255, 0, 0, 0);
    Pixel white = new Pixel(255, 0);

  • A picture is simply an array of int values

    void process(int pixel){
        int blue = pixel & 0xff;
        int green = (pixel >> 8) & 0xff;
        int red = (pixel >> 16) & 0xff;
    }

Bit masks and shifts

    void process(int pixel){
        int blue = pixel & 0xff;
        int green = (pixel >> 8) & 0xff;
        int red = (pixel >> 16) & 0xff;
    }

  • Hexadecimal number: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f
      - f is 15, in binary this is 1111, one less than 10000
      - The hex number 0xff is an 8 bit number, all ones
  • Bitwise & operator creates an 8 bit value, 0 to 255
      - Must use an int/char, what happens with byte?
      - 1&1 == 1, otherwise we get 0 like logical and
      - Similarly we have |, bitwise or
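
Packing is the reverse of the unpacking in `process`: shift each component into its own 8-bit slot and combine the slots with bitwise or. The sketch below assumes the same layout as the slide's code (red in bits 16 to 23, green in bits 8 to 15, blue in bits 0 to 7) and ignores alpha; the class and method names are assumptions.

```java
// Illustrative sketch: pack three 0..255 components into one int and back.
final class PixelPack {
    static int pack(int red, int green, int blue) {
        // Mask first so an out-of-range component cannot spill into its neighbor.
        return ((red & 0xff) << 16) | ((green & 0xff) << 8) | (blue & 0xff);
    }

    static int red(int pixel)   { return (pixel >> 16) & 0xff; }
    static int green(int pixel) { return (pixel >> 8) & 0xff; }
    static int blue(int pixel)  { return pixel & 0xff; }

    public static void main(String[] args) {
        int pixel = pack(12, 34, 56);
        System.out.println(Integer.toHexString(pixel));   // 0c2238 without the leading zero
        System.out.println(red(pixel) + " " + green(pixel) + " " + blue(pixel));
    }
}
```

One int per pixel instead of one int per color cuts the memory for an image to a third, which is the practical answer to "why should we care".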