Data Compression

- Slides: 72
Data Compression (CPS 100)
- Compression is a high-profile application
  - .zip, .mp3, .jpg, .gif, .gz, …
  - What property of MP3 was a significant factor in what made Napster work? (Why did Napster ultimately fail?)
- Why do we care?
  - Secondary storage capacity doubles every year
  - Disk space fills up quickly on every computer system
  - More data to compress than ever before

More on Compression
- What’s the difference between compression techniques?
  - .mp3 files and .zip files?
  - .gif and .jpg?
  - Lossless and lossy
- Is it possible to compress (losslessly) every file? Why?
- Lossy methods
  - Good for pictures, video, and audio (JPEG, MPEG, etc.)
- Lossless methods
  - Run-length encoding, Huffman, LZW, …

[Figure: sequence of counts 11 3 5 3 2 6 5 3 5 3 10]

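Run-length encoding is the simplest of the lossless methods listed above, so a small sketch may help. The helper names `rleEncode` and `rleDecode` are assumptions made for this illustration, not part of the course code.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collapse each run of equal characters into (char, count).
std::vector<std::pair<char, int>> rleEncode(const std::string& s) {
    std::vector<std::pair<char, int>> runs;
    for (char c : s) {
        if (!runs.empty() && runs.back().first == c) {
            runs.back().second++;      // extend the current run
        } else {
            runs.push_back({c, 1});    // start a new run
        }
    }
    return runs;
}

// Hypothetical helper: expand the runs back into the original string (lossless).
std::string rleDecode(const std::vector<std::pair<char, int>>& runs) {
    std::string out;
    for (const auto& r : runs) out.append(r.second, r.first);
    return out;
}
```

For "aaabccdddd" this gives four runs: (a,3), (b,1), (c,2), (d,4). A string with no repeated characters produces one run per character, so the "compressed" form is larger than the input, which bears on the question of whether every file can be compressed.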
Priority Queue
- Compression motivates the study of the ADT priority queue
  - Supports two basic operations
    - insert: add an element to the priority queue
    - delete: remove the minimal element from the priority queue
  - Implementations may allow getmin separate from delete
    - Analogous to top/pop, front/dequeue in stacks, queues
- See pqdemo.cpp and usepq.cpp
  - The code below sorts; what is its complexity?

    string s;
    priority_queue pq;
    while (cin >> s) pq.insert(s);
    while (pq.size() > 0) {
        pq.deletemin(s);
        cout << s << endl;
    }

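The same idea can be sketched with the standard library's `std::priority_queue` instead of the course class. `pqSort` is a hypothetical name; `std::greater` is used because the STL queue serves the largest element first by default.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper: returns the words in ascending order by pushing them
// all into a min-priority queue and then removing the minimum repeatedly.
std::vector<std::string> pqSort(const std::vector<std::string>& words) {
    std::priority_queue<std::string, std::vector<std::string>,
                        std::greater<std::string>> pq;
    for (const auto& w : words) pq.push(w);   // n inserts
    std::vector<std::string> sorted;
    while (!pq.empty()) {                     // n delete-min operations
        sorted.push_back(pq.top());
        pq.pop();
    }
    return sorted;
}
```

With a heap behind the queue, each push and pop costs O(log n), so the whole sort is O(n log n).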
Priority Queue implementations
- Implementing priority queues: average and worst case

                    Insert     Getmin (delete)   Insert    Getmin (delete)
                    average    average           worst     worst
  Unsorted vector   O(1)       O(n)              O(1)      O(n)
  Search tree       log n      log n             O(n)      O(n)
  Balanced tree     log n      log n             log n     log n
  Heap              O(1)       log n             log n     log n

- Heap has O(1) find-min (no delete) and O(n) build heap

Class tpqueue<…>, see tpq.h
- Templated class like tstack, tqueue, tvector, tmap, …
  - If deletemin is supported, what properties must inserted objects have? E.g., can we insert string? double? struct?
  - Change what minimal means?
  - Implementation in tpq.h, tpq.cpp uses a heap
- If we use a compare function object for comparing entries we can make a min-heap act like a max-heap, see pqdemo.cpp
  - Notice that RevComp inherits from Comparer<Kind>
  - Where is the class Comparer declaration? How is it used?
- STL standard C++ class priority_queue
  - See stlpq.cpp; changing comparison requires a template

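A minimal sketch of changing what "minimal" means, using the STL `std::priority_queue` and its comparison template parameter. The names `ShorterThan` and `byLengthDescending` are invented for this example and do not come from stlpq.cpp.

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical comparison object: orders strings by length only.
// std::priority_queue keeps the element that compares "largest" on top,
// so with this comparator the longest string is served first.
struct ShorterThan {
    bool operator()(const std::string& a, const std::string& b) const {
        return a.size() < b.size();
    }
};

// Returns the strings from longest to shortest (ties in unspecified order).
std::vector<std::string> byLengthDescending(const std::vector<std::string>& words) {
    std::priority_queue<std::string, std::vector<std::string>, ShorterThan> pq;
    for (const auto& w : words) pq.push(w);
    std::vector<std::string> out;
    while (!pq.empty()) {
        out.push_back(pq.top());
        pq.pop();
    }
    return out;
}
```

The comparison object is part of the queue's type, which is why changing the ordering means changing a template argument rather than passing a value at run time.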
Sorting with tapestrypq.cpp, stlpq.cpp

    void sort(tvector<string>& v)
    // pre: v contains v.size() entries
    // post: v is sorted
    {
        tpqueue<string> pq;
        for(int k=0; k < v.size(); k++) pq.insert(v[k]);
        for(int k=0; k < v.size(); k++) pq.deletemin(v[k]);
    }

- How does this work, regardless of tpqueue implementation?
- What is the complexity of this method?
  - insert O(1), deletemin O(log n)? If insert O(log n)?
  - heapsort uses the vector itself as the priority queue rather than a separate pq
  - From a big-Oh perspective no difference: O(n log n)
    - Is there a difference? What’s hidden with O notation?

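To illustrate the remark that heapsort uses the vector itself as the priority queue, here is a sketch built on the standard `<algorithm>` heap functions. `heapSort` is a hypothetical name, and this is not the course's implementation.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// In-place heapsort sketch: no separate priority queue object is created.
void heapSort(std::vector<std::string>& v) {
    std::make_heap(v.begin(), v.end());   // O(n): rearrange v into a max-heap
    std::sort_heap(v.begin(), v.end());   // n removals of the max, O(log n) each
}
```

Both this and the queue-based sort are O(n log n); what the O notation hides here is the extra copy of the data and the extra passes over it.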
Priority Queue implementation
- The class tpqueue uses heaps, fast and reasonably simple
  - Why not use an inheritance hierarchy as was used with tmap?
  - Trade-offs when using HMap and BSTMap:
    - Time, space
    - Ordering properties, e.g., what does BSTMap support?
- Changing method of comparison when calculating priority?
  - Create a function that replaces operator <
    - We want to pass the function; the most general approach creates an object to hold the function
    - Also possible to pass function pointers, we avoid that
  - The function object replacing operator < must:
    - Compare two objects, so it has two parameters
    - Return -1, 0, +1 depending on <, ==, >

Creating Heaps
- Heap is an array-based implementation of a binary tree used for implementing priority queues, supports:
  - insert, findmin, deletemin: complexities?
- Using an array minimizes storage (no explicit pointers), faster too: children are located by index/position in array
- Heap is a binary tree with shape property, heap/value property
  - shape: tree filled at all levels (except perhaps last) and filled left-to-right (complete binary tree)
  - value: each node has a value no larger than either of its children

Array-based heap
- store “node values” in array beginning at index 1
- for node with index k
  - left child: index 2*k
  - right child: index 2*k+1
- why is this conducive for maintaining heap shape?
- what about heap property?
- is the heap a search tree?
- where is minimal node?
- where are nodes added? deleted?

  index:  0   1   2   3   4   5   6   7   8   9  10
  value:  -   6  10   7  17  13   9  21  19  25   -

[Figure: the same heap drawn as a binary tree with root 6]

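The index arithmetic above can be sketched directly. The function names are assumptions made for this illustration; index 0 is treated as an unused placeholder, as on the slide.

```cpp
#include <cstddef>
#include <vector>

// Index arithmetic for a heap stored from index 1 (index 0 is unused).
inline std::size_t leftChild(std::size_t k)  { return 2 * k; }
inline std::size_t rightChild(std::size_t k) { return 2 * k + 1; }
inline std::size_t parentOf(std::size_t k)   { return k / 2; }

// Hypothetical checker: true if no node is smaller than its parent.
bool isMinHeap(const std::vector<int>& a) {
    for (std::size_t k = 2; k < a.size(); ++k) {
        if (a[k] < a[parentOf(k)]) return false;
    }
    return true;
}
```

Because the values occupy indices 1..n with no gaps, the shape property holds automatically; only the value property needs checking. Note that the array is not sorted, so the heap is not a search tree.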
Thinking about heaps
- Where is minimal element?
  - Root, why?
- Where is maximal element?
  - Leaves, why?
- How many leaves are there in an N-node heap (big-Oh)?
  - O(n), but exact?
- What is complexity of find max in a minheap? Why?
  - O(n), but ½ N?
- Where is second smallest element? Why?
  - Near root?

[Figure: the heap 6 10 7 17 13 9 21 19 25 as a tree and as an array with indices 0 to 10]

Adding values to heap
- to maintain heap shape, must add new value in left-to-right order of last level
  - could violate heap property
  - move value “up” if too small
- change places with parent if heap property violated
  - stop when parent is smaller
  - stop when root is reached
- pull parent down, swapping isn’t necessary (optimization)

[Figure: insert 8 into the heap 6 10 7 17 13 9 21 19 25, then bubble 8 up past 13 and 10; resulting array: 6 8 7 17 10 9 21 19 25 13]

Adding values, details

    void pqueue::insert(int elt)
    {
        // add elt to heap in myList (index 0 is unused, root is at index 1)
        myList.push_back(elt);
        int loc = myList.size() - 1;   // index of the new last slot

        while (1 < loc && elt < myList[loc/2]) {
            myList[loc] = myList[loc/2];   // pull parent down
            loc /= 2;                      // go to parent
        }
        // what’s true here?
        myList[loc] = elt;
    }

[Figure: heap 6 10 7 17 13 9 21 19 25 stored in tvector myList at indices 1 to 9, before and after inserting 8]

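A self-contained version of the insertion loop, written against `std::vector` rather than `tvector`. The struct `MinHeap` and its member names are assumptions made for this sketch.

```cpp
#include <vector>

// Minimal int min-heap sketch; data[0] is an unused placeholder so that
// the root lives at data[1], matching the slide's indexing.
struct MinHeap {
    std::vector<int> data{0};

    int size() const { return static_cast<int>(data.size()) - 1; }

    void insert(int elt) {
        data.push_back(elt);            // open a hole at the next leaf position
        int loc = size();               // index of the hole
        while (loc > 1 && elt < data[loc / 2]) {
            data[loc] = data[loc / 2];  // pull the parent down into the hole
            loc /= 2;                   // the hole moves up to the parent
        }
        data[loc] = elt;                // parent of loc (if any) is <= elt
    }
};
```

Inserting 6, 10, 7, 17, 13, 9, 21, 19, 25 in that order reproduces the array on the slide, and a further insert of 8 moves 13 and 10 down one level each.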
Removing minimal element
- Where is minimal element?
  - If we remove it, what changes, shape/property?
- How can we maintain shape?
  - “last” element moves to root
  - What property is violated?
- After moving last element, subtrees of root are heaps, why?
  - Move root down (pull child up): does it matter where?
- When can we stop “re-heaping”?
  - Less than both children
  - Reach a leaf

[Figure: remove 6 from the heap 6 10 7 17 13 9 21 19 25; 25 moves to the root, then 7 and 9 are pulled up; resulting array: 7 10 9 17 13 25 21 19]

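The removal steps above can be sketched as follows. `deleteMin` is a hypothetical free function operating on a 1-indexed vector; it assumes the heap holds at least one value.

```cpp
#include <vector>

// Hypothetical helper: removes and returns the minimum of a 1-indexed
// min-heap (a[0] is an unused placeholder). Precondition: a.size() >= 2.
int deleteMin(std::vector<int>& a) {
    int minimum = a[1];
    int last = a.back();                      // value that must be re-placed
    a.pop_back();                             // shape property is preserved
    int n = static_cast<int>(a.size()) - 1;   // number of values left
    if (n == 0) return minimum;
    int loc = 1;                              // the hole starts at the root
    while (2 * loc <= n) {                    // while the hole has a child
        int child = 2 * loc;
        if (child < n && a[child + 1] < a[child]) child++;  // smaller child
        if (last <= a[child]) break;          // heap property restored
        a[loc] = a[child];                    // pull the smaller child up
        loc = child;
    }
    a[loc] = last;
    return minimum;
}
```

Pulling up the smaller child matters: pulling up the larger one would place it above its smaller sibling and break the value property.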
Text Compression
- Input: String S
- Output: String S′
  - Shorter
  - S can be reconstructed from S′

Text Compression: Examples
“abcde” in the different formats

Encodings
- ASCII: 8 bits/character
- Unicode: 16 bits/character

  Symbol   ASCII      Fixed length   Var. length
  a        01100001   000            000
  b        01100010   001            11
  c        01100011   010            01
  d        01100100   011            001
  e        01100101   100            10

- ASCII: 011000010110001001100011…
- Fixed: 000001010011100
- Var: 000110100110

[Figure: code trees for the fixed-length and variable-length codes]

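The two bit strings on this slide can be reproduced with a table lookup per character. The helper `encode` and the table names `kFixed` and `kVariable` are assumptions for this sketch; the codewords themselves are the ones in the table above.

```cpp
#include <map>
#include <string>

// Hypothetical helper: concatenates the codeword of each character in text.
// Assumes every character of text has an entry in table (at() throws otherwise).
std::string encode(const std::string& text,
                   const std::map<char, std::string>& table) {
    std::string bits;
    for (char c : text) bits += table.at(c);
    return bits;
}

// The two code tables from the slide.
const std::map<char, std::string> kFixed = {
    {'a', "000"}, {'b', "001"}, {'c', "010"}, {'d', "011"}, {'e', "100"}};
const std::map<char, std::string> kVariable = {
    {'a', "000"}, {'b', "11"}, {'c', "01"}, {'d', "001"}, {'e', "10"}};
```

For "abcde" the fixed-length code uses 15 bits and the variable-length code 12, compared with 40 bits at 8 bits per character.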
Huffman Coding
- D. A. Huffman in early 1950’s
- Before compressing data, analyze the input stream
- Represent data using variable length codes
- Variable length codes via prefix codes
  - Each letter is assigned a codeword
  - The codeword for a given letter is produced by traversing the Huffman tree
  - Property: No codeword produced is the prefix of another
  - Letters appearing frequently have short codewords, while those that appear rarely have longer ones
- Huffman coding is an optimal per-character coding method

Building a tree
- Initial case: every character is a leaf/tree with its respective character count
  - “the forest” of n trees
  - n is the size of your alphabet
- Base case: there is only one tree in the forest
- Reduction: take the two trees with the smallest counts and combine them into a tree whose count is equal to the sum of the two subtrees’ counts
  - n-1 trees in our forest

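The reduction step above maps naturally onto a priority queue of trees. The sketch below uses `std::priority_queue`; the names `Node`, `HeavierThan`, `buildHuffmanTree`, and `weightedPathLength` are assumptions for this illustration, and ties between equal counts are broken arbitrarily, so the tree shape may differ from the slides even though the total cost does not.

```cpp
#include <map>
#include <memory>
#include <queue>
#include <vector>

struct Node {
    int weight;
    char symbol;                        // meaningful only for leaves
    std::unique_ptr<Node> left, right;  // both null for a leaf
};

// Orders the queue so the tree with the smallest weight is on top.
struct HeavierThan {
    bool operator()(const Node* a, const Node* b) const {
        return a->weight > b->weight;
    }
};

// Repeatedly combines the two lightest trees until one tree remains.
std::unique_ptr<Node> buildHuffmanTree(const std::map<char, int>& counts) {
    std::priority_queue<Node*, std::vector<Node*>, HeavierThan> forest;
    for (const auto& entry : counts)
        forest.push(new Node{entry.second, entry.first, nullptr, nullptr});
    if (forest.empty()) return nullptr;
    while (forest.size() > 1) {
        Node* a = forest.top(); forest.pop();
        Node* b = forest.top(); forest.pop();
        forest.push(new Node{a->weight + b->weight, '\0',
                             std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
    }
    return std::unique_ptr<Node>(forest.top());
}

// Sum over leaves of weight * depth: the number of bits in the encoded text.
int weightedPathLength(const Node* t, int depth = 0) {
    if (!t) return 0;
    if (!t->left && !t->right) return t->weight * depth;
    return weightedPathLength(t->left.get(), depth + 1) +
           weightedPathLength(t->right.get(), depth + 1);
}
```

Each of the n-1 merges does two delete-min operations and one insert, so building the tree is O(n log n) in the alphabet size n.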
Building a tree: worked example
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

Character counts (60 characters in all, including 11 spaces):

  space 11   I 6   E 5   N 5   S 4   M 4   A 3   B 3   O 3
  T 3        G 2   D 2   L 2   R 2   U 2   P 1   F 1   C 1

Each slide of the animation combines the two smallest trees in the forest. A bare number is the weight of a tree built in an earlier step:

   1. C(1) + F(1) = 2
   2. P(1) + U(2) = 3
   3. L(2) + R(2) = 4
   4. G(2) + D(2) = 4
   5. O(3) + 2 = 5
   6. T(3) + 3 = 6
   7. A(3) + B(3) = 6
   8. S(4) + M(4) = 8
   9. 4 + 4 = 8
  10. E(5) + 5 = 10
  11. N(5) + 6 = 11
  12. I(6) + 6 = 12
  13. 8 + 8 = 16
  14. 10 + 11 = 21
  15. 11 + 12 = 23
  16. 16 + 21 = 37
  17. 37 + 23 = 60 (the finished tree)

Encoding
1. Count occurrence of various characters in string   O( )
2. Build priority queue   O( )
3. Build Huffman tree   O( )
4. Write Huffman tree and coded data to file   O( )

Properties of Huffman coding
- Want to minimize weighted path length L(T) of tree T
  - $L(T) = \sum_i w_i \cdot \operatorname{depth}(d_i)$
  - $w_i$ is the weight or count of each codeword i
  - $d_i$ is the leaf corresponding to codeword i
- How do we calculate character (codeword) frequencies?
- Huffman coding creates pretty full bushy trees?
  - When would it produce a “bad” tree?
- How do we produce coded compressed data from input efficiently?

Writing code out to file
- How do we go from characters to codewords?
  - Build a table as we build our tree
  - Keep links to leaf nodes and trace up the tree
- Need way of writing bits out to file
  - Platform dependent?
  - UNIX read and write
- See bitops.h
  - obstream and ibstream
  - Write bits from ints
- How can we differentiate between compressed files and random data from some file?
  - Store a magic number

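Files are written in whole bytes, so codeword bits have to be packed before output. The sketch below is not the interface of bitops.h; `packBits` is a hypothetical helper, and the most-significant-bit-first order and zero padding are assumptions chosen for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: packs a string of '0'/'1' characters into bytes,
// most significant bit first; the final byte is padded with 0 bits.
std::vector<std::uint8_t> packBits(const std::string& bits) {
    std::vector<std::uint8_t> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            bytes[i / 8] |= static_cast<std::uint8_t>(0x80u >> (i % 8));
        }
    }
    return bytes;
}
```

Because of the padding, a decoder cannot tell from the bytes alone where the real data ends; a real format would also record the bit count or reserve an end-of-data codeword.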
Decoding a message
- Bit string shown on the first slide: 011000001001101
- Start at the root (weight 60) of the tree built above; each bit read selects one child
- When a leaf is reached, output its letter and return to the root
- The animation consumes the bits one at a time; the decoded output grows G, GO, GOO, GOOD

[Figure: the Huffman tree, repeated on each of the 21 animation slides]

Decoding
1. Read in tree data   O( )
2. Decode bit string with tree   O( )

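Step 2 can be sketched as a walk down the tree, one bit per step. For the example to be self-contained the tree is rebuilt here from a codeword table rather than read from a file; `TrieNode`, `treeFromTable`, and `decode` are hypothetical names, and bit 0 is taken to mean "left".

```cpp
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    char symbol = '\0';
    std::unique_ptr<TrieNode> child[2];  // child[0]: bit 0 (left), child[1]: bit 1 (right)
    bool isLeaf() const { return !child[0] && !child[1]; }
};

// Hypothetical helper: rebuilds a decoding tree from a table of prefix codes.
std::unique_ptr<TrieNode> treeFromTable(const std::map<char, std::string>& table) {
    std::unique_ptr<TrieNode> root(new TrieNode());
    for (const auto& entry : table) {
        TrieNode* t = root.get();
        for (char bit : entry.second) {
            std::unique_ptr<TrieNode>& next = t->child[bit - '0'];
            if (!next) next.reset(new TrieNode());
            t = next.get();
        }
        t->symbol = entry.first;
    }
    return root;
}

// Walks from the root; at each leaf emits its symbol and restarts at the root.
std::string decode(const std::string& bits, const TrieNode* root) {
    std::string out;
    const TrieNode* t = root;
    for (char bit : bits) {
        t = t->child[bit - '0'].get();
        if (!t) break;                   // bits do not match any codeword
        if (t->isLeaf()) {
            out += t->symbol;
            t = root;
        }
    }
    return out;
}
```

Each bit is examined once, so decoding is linear in the length of the bit string. The prefix property is what makes this work: reaching a leaf is unambiguous because no codeword continues past it.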
Huffman coding: go go gophers

  char   ASCII   ASCII (binary)   3 bits   Huffman
  g      103     1100111          000      ??
  o      111     1101111          001      ??
  p      112     1110000          010      ??
  h      104     1101000          011      ??
  e      101     1100101          100      ??
  r      114     1110010          101      ??
  s      115     1110011          110      ??
  sp.    32      0100000          111      ??

- choose two smallest weights
  - combine nodes + weights
  - Repeat
  - Priority queue?
- Encoding uses tree:
  - 0 left/1 right
  - How many bits?

[Figure: forest of leaves g(3) o(3) p(1) h(1) e(1) r(1) s(1) *(2), where * is the space, and the first merges: p+h = 2, e+r = 2, s+* = 3, 2+2 = 4, g+o = 6]

Huffman coding: go go gophers

  char   ASCII   ASCII (binary)   3 bits   Huffman
  g      103     1100111          000      00
  o      111     1101111          001      01
  p      112     1110000          010      1100
  h      104     1101000          011      1101
  e      101     1100101          100      1110
  r      114     1110010          101      1111
  s      115     1110011          110      100
  sp.    32      0100000          111      101

- Encoding uses tree:
  - 0 left/1 right
  - How many bits? 37!!
  - Savings? Worth it?

[Figure: finished tree with root 13, whose children are 6 (g, o) and 7; the 7 has children 3 (s, *) and 4 (p, h, e, r), where * is the space]

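The count of 37 bits can be checked by multiplying each character's count by its codeword length, taking the counts g 3, o 3, space 2, and 1 each for p, h, e, r, s from the figure. The 13-character total and the comparison figures below follow from the table; they do not include the cost of storing the tree itself, which is part of the "Worth it?" question.

```latex
\begin{align*}
\text{Huffman:}\quad & \underbrace{3\cdot 2}_{g} + \underbrace{3\cdot 2}_{o}
  + \underbrace{4\,(1\cdot 4)}_{p,\,h,\,e,\,r}
  + \underbrace{1\cdot 3}_{s} + \underbrace{2\cdot 3}_{\text{space}} = 37 \text{ bits} \\
\text{3-bit fixed:}\quad & 13 \cdot 3 = 39 \text{ bits} \\
\text{7-bit ASCII:}\quad & 13 \cdot 7 = 91 \text{ bits}
\end{align*}
```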
Huffman Tree 2
- “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
  - E.g. “ A SIMPLE”: “10101101001010011100000”

[Figure: Huffman tree, shown step by step across nine slides]

Other methods
- Adaptive Huffman coding
- Lempel-Ziv algorithms
  - Build the coding table on the fly while reading document
  - Coding table changes dynamically
  - Cool protocol between encoder and decoder so that everyone is always using the right coding scheme
  - Works darn well (compress, gzip, etc.)
- More complicated methods
  - Burrows-Wheeler (bzip2/bunzip2)
  - PPM statistical methods