Fast Parallel Algorithms for Universal Lossless Source Coding

Fast Parallel Algorithms for Universal Lossless Source Coding
Dror Baron, CSL & ECE Department, UIUC
dbaron@uiuc.edu
Ph.D. Defense – February 18, 2003

Overview
• Motivation, applications, and goals
• Background:
  – Source models
  – Lossless source coding + universality
  – Semi-predictive methods
• An O(N) semi-predictive universal encoder
• Two-part codes
  – Rigorous analysis of their compression quality
  – Application to parallel compression of Bernoulli sequences
• Parallel semi-predictive (PSP) coding
  – Achieving a work-efficient algorithm
  – Theoretical results
• Summary

Motivation
• Lossless compression: text files, facsimiles, software executables, medical and financial data, etc.
• What do we want in a compression algorithm?
  – Universality: adaptive to a large class of sources
  – Good compression quality
  – Speed: low computational complexity
  – Simple implementation
  – Low memory use
  – Sequential vs. offline operation

Why Parallel Compression?
• Some applications require high data rates:
  – Compressed pages in virtual memory
  – Remote archiving + fast communication links
  – Real-time compression in storage systems
  – Power reduction for interconnects on a circuit board
• Serial compression is limited by the clock rate

Room for Improvement and Goals
• Previous art:
  – Serial universal source coding methods have reached the bounds on compression quality [Willems 1998, Rissanen 1999]
  – Parallel source coding algorithms have high complexity and/or poor compression quality
    • Naïve parallelization compresses poorly
    • Parallel dictionary compression [Franaszek et al. 1996]
    • Parallel context tree weighting [Stassen & Tjalkens 2001, Willems 2000]
• Research goals: a "good" parallel compression algorithm
  – Work-efficient: O(N/B) time with B computational units
  – Compression quality: as good as the best serial methods (almost!)

Main Contributions
• BWT-MDL (O(N) universal encoder):
  – An O(N) algorithm that achieves Rissanen's redundancy bounds on the best achievable compression
  – Combines efficient prefix tree construction with the semi-predictive approach to universal coding
• Fast suffix sorting (not in this talk):
  – Core algorithm is very simple (can be implemented in VLSI)
  – Worst-case complexity O(N log^0.5(N))
  – Competitive with other suffix sorting methods in practice
• Two-part codes:
  – Rigorous analysis of their compression quality
  – Application to distributed/parallel compression
  – Optimal two-part codes
• Parallel compression algorithm (not in this talk):
  – Work-efficient O(N/B) algorithm
  – Compression loss is roughly B·log(N/B) bits

Source Models
• Binary alphabet X = {0, 1}, sequence x ∈ X^N
• Bernoulli model:
  – i.i.d. model
  – p(xi = 1) = θ
• Order-K Markov model:
  – The previous K symbols are called the context
  – Context-dependent conditional probability for the next symbol
  – More flexible than Bernoulli
  – Exponentially many states

Context Tree Sources
✓ More flexible than Bernoulli
✓ More compact than Markov
✓ Particularly good for text
✓ Works for M-ary alphabets
[Figure: context tree with a root, an internal node, and leaves for the contexts 0, 01, and 11, labeled P(xn+1=1|0) = 0.8, P(xn+1=1|01) = 0.4, and P(xn+1=1|11) = 0.9]
• State = context + conditional probabilities
• Example: N = 11, x = 0101111

Review of Lossless Source Coding
• Stationary ergodic sources
• Entropy rate H = lim N→∞ H(x)/N
• Asymptotically, H is the lowest attainable per-symbol rate
• Arithmetic coding:
  – Probability assignment p(x)
  – Coding length l(x) = -log(p(x)) + O(1)
  – Can achieve entropy + O(1) bits
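As a concrete illustration of the coding-length relation above (my own minimal Python sketch, not from the talk), the ideal arithmetic-coding length of a Bernoulli sequence is -log2 of its assigned probability:

import math

def ideal_codelength(x, theta):
    # Ideal arithmetic-coding length -log2(p(x)) under an i.i.d. Bernoulli(theta) model
    n1 = sum(x)
    n0 = len(x) - n1
    return -(n1 * math.log2(theta) + n0 * math.log2(1 - theta))

x = [0, 1, 0, 1, 1, 1, 0, 1, 1, 1]
print(ideal_codelength(x, 0.7))  # close to N*H(0.7) when theta matches the source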

Universal Source Coding
• Source statistics are unknown
  ⇒ Need a probability assignment p(x)
  ⇒ Need to estimate the source model
  ⇒ Need to describe the estimated source (explicitly or implicitly)
• Redundancy: excess coding length above the entropy, ρ(x) = l(x) - N·H

Redundancy Bounds
• Rissanen's bound (K unknown parameters): E[ρ(x)] > (K/2)·[log(N) + O(1)]
• Worst-case redundancy for Bernoulli sequences (K=1): ρ(x*) = max over x ∈ X^N of ρ(x) ≈ 0.5·log(πN/2)
• Asymptotically, ρ(x)/N → 0
• In practice, e.g., in text, the number of parameters scales almost linearly with N
  ⇒ Low redundancy is still essential

Semi-Predictive Approach
• Semi-predictive methods describe x in two phases:
  – Phase I: find a "good" tree source structure S and describe it using codelength l_S
  – Phase II: encode x using S with probability assignment p_S(x)
[Diagram: x → MDL Estimator (Phase I) → S* → Encoder (Phase II) → y]
• Phase I: estimate the minimum description length (MDL) tree source model
  S* = argmin over S of {l_S - log(p_S(x))}

Semi-Predictive Approach – Phase II
• Sequential encoding of x given S*:
  – Determine which state s of S* generated symbol xi
  – Assign xi a conditional probability p(xi|s)
  – Arithmetic encoding
[Diagram: xi → Determine s (using S*) → Assign p(xi|s) → Arithmetic Encoder → y]
• p(xi|s) can be based on the previously processed portion of x, quantized probability estimates, etc.
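The slide leaves the choice of p(xi|s) open; one standard option in this line of work (an assumption on my part, not stated on the slide) is the Krichevsky-Trofimov estimator, which forms the conditional probability from the counts seen so far in state s:

def kt_probability(n0, n1, symbol):
    # Krichevsky-Trofimov estimate of p(symbol | n0 zeros and n1 ones seen so far in this state)
    if symbol == 1:
        return (n1 + 0.5) / (n0 + n1 + 1.0)
    return (n0 + 0.5) / (n0 + n1 + 1.0)

The add-1/2 bias keeps the probability bounded away from 0 and 1, which is what yields the 0.5·log(N) per-parameter redundancy behavior.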

Context Trees
• We will provide an O(N) semi-predictive algorithm by estimating S* using context trees
• Context trees arrange x in a tree:
  ✓ Each node corresponds to the sequence of appended arc labels on the path to the root node
  ✓ Internal nodes correspond to repeating contexts in x
  ✓ Leaves correspond to unique contexts
  ✓ The sentinel symbol x0 = $ makes sure all symbols have unique contexts
[Figure: context tree with a root node and nodes for the contexts 0, 01, and 11]

Context Tree Pruning ("To prune or not to prune…")
• The MDL structure for state s yields the shortest description for symbols generated by s
• When processing state s:
  – Estimate the MDL structures for states 0s and 1s
  – Decide whether to keep 0s and 1s or prune them into state s
  – Base the decision on coding lengths
[Figure: the MDL structure for state s is either the pruned state s alone, or the children 0s and 1s together with their own MDL substructures (e.g., 00s, 10s)]

Phase I with Atomic Context Trees
• Atomic context tree:
  ✓ Arc labels are atomic (single symbol)
  ✓ Internal nodes are not necessarily branching
  ✓ Has up to O(N²) nodes
[Figure: atomic context tree with single-symbol arc labels 0 and 1]
• The coding length minimization of Phase I processes each node of the context tree [Nohre 94]
• With atomic context trees, the worst-case complexity is at least O(N²) ☹

Compact Context Trees
• Compact context tree:
  ✓ Arc labels are not necessarily atomic
  ✓ Internal nodes are branching
  ✓ O(N) nodes
  ✓ Compact representation of the same tree
[Figure: the same tree in compact form, with multi-symbol arc labels such as 111]
• Depth-first traversal of the compact context tree provides O(N) complexity
• Theorem: Phase I of BWT-MDL requires O(N) operations performed with O(log(N)) bits of precision
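To make the atomic-vs-compact distinction concrete, here is a toy Python sketch (my own illustration; it ignores sentinel handling and per-node statistics) that builds an atomic context trie and then path-compresses non-branching chains into multi-symbol arc labels:

def atomic_context_trie(x, depth):
    # Insert the context (read backwards) preceding each position of x, up to `depth` symbols
    root = {}
    for i in range(1, len(x)):
        node = root
        for j in range(i - 1, max(i - 1 - depth, -1), -1):
            node = node.setdefault(x[j], {})
    return root

def compress(node):
    # Merge non-branching chains so internal nodes branch and arc labels may be strings
    out = {}
    for symbol, child in node.items():
        label = symbol
        while len(child) == 1:
            (nxt, grandchild), = child.items()
            label += nxt
            child = grandchild
        out[label] = compress(child)
    return out

print(compress(atomic_context_trie("$0101111", 3)))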

Phase II of BWT-MDL
[Diagram: xi → Determine s (using S*) → Assign p(xi|s) → Arithmetic Encoder → y]
• We determine the generator state using a novel algorithm that is based on properties of the Burrows-Wheeler transform (BWT)
• Theorem: The BWT-MDL encoder requires O(N) operations performed with O(log(N)) bits of precision
• Theorem [Willems et al. 2000]: the redundancy w.r.t. any tree source S is at most |S|·[0.5·log(N) + O(1)] bits

Distributed/Parallel Compression of Bernoulli Sequences
• A splitter partitions x into B blocks x(1), …, x(B)
• Encoder j ∈ {1, …, B} compresses x(j); it assigns probabilities p(xi(j)=1) = θ and p(xi(j)=0) = 1 - θ
  ⇒ The total probability assigned to x is identical to that in a serial compression system
[Diagram: x → Splitter → blocks x(1), …, x(B) → Arithmetic Encoders 1, …, B → outputs y(1), …, y(B)]
• This structure assumes that θ is known; our goal is to provide a universal parallel compression algorithm for Bernoulli sequences

Two-Part Codes
• Two-part codes use a semi-predictive approach to describe Bernoulli sequences:
  – First part of the code:
    • Determine the maximum likelihood (ML) parameter estimate θML(x) = n1/(n0+n1)
    • Quantize θML(x) to rk, one of K representation levels
    • Describe the bin index k with log(K) bits
  – Second part of the code encodes x using rk
[Diagram: x → Determine θML(x) → Quantizer → bin index k ∈ {1, …, K}, level rk → Encoder → y]
• In distributed systems:
  – Sequential compressors require O(N) internal communications
  – Two-part codes need only communicate {n0(j), n1(j)}, j ∈ {1, …, B}
  ⇒ Requires O(B·log(K)) internal communications

Jeffreys Two-Part Code
• Quantize θML(x):
  – Bin edges: bk = sin²(πk/(2K))
  – Representation levels: rk = sin²(π(2k-1)/(4K))
[Figure: the interval [0, 1] partitioned into bins with edges b0, b1, …, bK; representation level rk lies inside bin k, between bk-1 and bk]
• Use K ≈ 1.772·N^0.5 bins
• Source description:
  – log(K) bits for describing the bin index k
  – Need -n1·log(θML(x)) - n0·log(1 - θML(x)) bits for encoding x
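A minimal sketch of this quantizer and the resulting two-part codelength (my own code; base-2 logs and a natural rounding of K are assumptions):

import math

def jeffreys_quantizer(N):
    # K ~ 1.772*sqrt(N) bins with edges b_k = sin^2(pi*k/(2K)) and levels r_k = sin^2(pi*(2k-1)/(4K))
    K = max(1, round(1.772 * math.sqrt(N)))
    edges = [math.sin(math.pi * k / (2 * K)) ** 2 for k in range(K + 1)]
    levels = [math.sin(math.pi * (2 * k - 1) / (4 * K)) ** 2 for k in range(1, K + 1)]
    return K, edges, levels

def two_part_codelength(n0, n1):
    # First part: log2(K) bits for the bin index; second part: encode x with the level r_k
    N = n0 + n1
    K, edges, levels = jeffreys_quantizer(N)
    theta_ml = n1 / N
    k = next(i for i in range(1, K + 1) if theta_ml <= edges[i])  # bin with b_{k-1} <= theta_ml <= b_k
    r = levels[k - 1]
    return math.log2(K) - n1 * math.log2(r) - n0 * math.log2(1 - r)

print(two_part_codelength(300, 700))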

Redundancy of the Jeffreys Code for Bernoulli Sequences
• Redundancy:
  ✓ log(K) bits for describing k
  ✓ N·D(θML(x)||rk) bits for encoding x using the imprecise model
• D(a||b) is the Kullback-Leibler divergence
• In bin k, l(x) = -θML(x)·log(rk) - [1 - θML(x)]·log(1 - rk)
  ⇒ l(θML(x)) is a poly-line
• Redundancy = log(K) + l(θML(x)) - N·H(θML(x)) ≤ log(K) + L∞
  ⇒ Use quantizers that have a small L∞ distance between the entropy function and the induced poly-line fit
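The divergence penalty above is easy to check numerically; a small sketch (illustrative values of my own, base-2 logs):

import math

def kl_bernoulli(a, b):
    # Kullback-Leibler divergence D(a||b) in bits between Bernoulli(a) and Bernoulli(b)
    def term(p, q):
        return 0.0 if p == 0.0 else p * math.log2(p / q)
    return term(a, b) + term(1.0 - a, 1.0 - b)

# Redundancy inside bin k: log2(K) + N * D(theta_ml || r_k)
print(1000 * kl_bernoulli(0.70, 0.7043))  # penalty for encoding with the bin level instead of theta_ml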

Redundancy Properties
• For x such that θML(x) is quantized to rk, the worst-case redundancy is log(K) + N·max{D(bk||rk), D(bk-1||rk)}
• D(bk||rk) and D(bk-1||rk) are:
  – Largest in the initial and end bins
  – Similar in the middle bins
  – The difference is reduced over a wider range of k for larger N (larger K)
  ⇒ Can construct a near-optimal quantizer by modifying the initial and end bins of the Jeffreys quantizer

Redundancy Results
• Theorem: The worst-case redundancy of the Jeffreys code is 1.221 + O(1/N) bits above Rissanen's bound
• Theorem: The worst-case redundancy of the optimal two-part code is 1.047 + O(1/N) bits above Rissanen's bound

Parallel Universal Compression of Bernoulli Sequences
[Diagram – Phase I: blocks x(1), …, x(B) → per-block counts (n0(j), n1(j)) → coordinating unit forms θML(x) = [Σj n1(j)] / [Σj n0(j) + Σj n1(j)] → Quantizer → bin index k and level rk; Phase II: Encoders 1, …, B → outputs y(1), …, y(B)]
• Phase I:
  – Parallel units (PUs) compute symbol counts for the B blocks
  – The coordinating unit (CU) computes and quantizes the ML parameter estimate θML(x) and describes k
• Phase II: B PUs encode the B blocks based on rk
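A sequential emulation of this pipeline (hypothetical code of my own; the PU/CU split is simulated with plain loops) shows that aggregating per-block counts reproduces exactly the ML estimate a serial encoder would form:

def phase1_counts(block):
    # Each parallel unit reports (n0, n1) for its block
    n1 = sum(block)
    return len(block) - n1, n1

def coordinate(counts):
    # The coordinating unit aggregates counts into the global ML estimate
    n0 = sum(c[0] for c in counts)
    n1 = sum(c[1] for c in counts)
    return n1 / (n0 + n1)

blocks = [[0, 1, 1, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
print(coordinate([phase1_counts(b) for b in blocks]))  # same as n1/N over the whole sequence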

Why Do We Need Parallel Semi-Predictive Coding?
• Naïve parallelization:
  – Partition x into B blocks
  – Compress the blocks independently
[Diagram: x → Splitter → blocks x(1), …, x(B) → Compressors 1, …, B → outputs y(1), …, y(B)]
• The redundancy for a length-N/B block is O(log(N/B))
  ⇒ The total redundancy is O(B·log(N/B))
• Rissanen's bound is O(log(N))
  ⇒ The redundancy with naïve parallelization is excessive!
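A quick numeric check of this gap (my own example values, N = 2^20 and B = 64):

import math

N, B = 2 ** 20, 64
print(B * math.log2(N / B))  # naive parallelization: 896 bits of model redundancy per parameter
print(math.log2(N))          # serial (Rissanen-scale) redundancy: 20 bits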

Parallel Semi-Predictive (PSP) Concept
[Diagram – Phase I: blocks x(1), …, x(B) → Statistics Accumulators 1, …, B → symbol counts → Coordinating Unit → S*; Phase II: blocks x(1), …, x(B) + S* → Compressors 1, …, B → outputs y(1), …, y(B)]
• Phase I:
  – B parallel units (PUs) accumulate statistics (symbol counts) on the B blocks
  – The coordinating unit (CU) computes the MDL tree source estimate S*
• Phase II: B PUs compress the B blocks based on S*

Source Description in PSP
[Diagram: the CU describes the S* structure and the bin indices {ks}, s ∈ S*; parallel unit b determines the state s for xi(b), assigns p(xi(b)|s), and arithmetic-encodes into y(b)]
• p(xi(b)|s) = 1{xi(b)=1}·rks + 1{xi(b)=0}·(1 - rks)
• Phase I: the CU describes the structure of S* and the quantized ML parameter estimates {ks}, s ∈ S*
• Phase II: each of the B PUs compresses block x(b) just like Phase II of the (serial) semi-predictive approach

Complexity of Phase I
• Phase I processes each node of the context tree [Nohre 94]
• The CU processes the states of a full atomic context tree of depth Dmax, where Dmax ≈ log(N/B)
• Processing a node:
  – Internal node: requires O(1) time
  – Leaf: the CU adds up block symbol counts to compute each symbol count, i.e., nsα = Σb nsα(b), where α ∈ {0, 1}
  ⇒ The CU processes a leaf node in O(B) time
  ⇒ With 2^Dmax = O(N/B) leaves, the aggregate complexity is O(N), which is excessive

Phase I in O(N/B) Time
• We want to compute nsα = Σb nsα(b) faster
• An adder tree incurs O(log(B)) delay for adding up B block symbol counts
• Pipelining enables us to generate a result every O(1) time
  ⇒ O(N/B) nodes, each requiring O(1) time
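A software sketch of the adder-tree reduction (real PSP would pipeline this in hardware; this only shows the O(log B)-depth summation pattern):

def adder_tree(values):
    # Sum B values with a binary tree of pairwise adds: O(log B) levels of depth
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-width levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

print(adder_tree([3, 1, 4, 1, 5, 9, 2, 6]))  # 31

With the tree pipelined, a new set of B counts can enter every cycle, so each aggregated count emerges in O(1) amortized time.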

Phase II in O(N/B) Time
[Diagram – parallel unit b: xi(b) → Determine s (using S*) → Determine p(xi(b)|s) (using {ks}, s ∈ S*) → Arithmetic Encoder → y(b)]
• The challenging part of Phase II is determining s:
  – Define the context index for the length-Dmax context preceding xi(b) as the binary number that represents that context
  – The length-2^Dmax generator table g satisfies gj = s ∈ S* if s is a suffix of the context whose context index is j
  – We can construct g in O(N/B) time (far from trivial!)
  – Compute the context indices for all symbols of x(b) and determine the generating states via the generator table g
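A toy version of the generator-table lookup (my own simplification; the O(N/B)-time construction in the talk is more involved): g maps each Dmax-bit context index to the unique state of S* that is a suffix of that context:

def build_generator_table(states, d_max):
    # g[j] = the state s in S* that is a suffix of the d_max-symbol context with index j
    g = []
    for j in range(2 ** d_max):
        ctx = format(j, "0{}b".format(d_max))
        g.append(next(s for s in states if ctx.endswith(s)))
    return g

# States of the example tree source from earlier slides: contexts 0, 01, 11
print(build_generator_table(["0", "01", "11"], 3))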

Decoder
[Diagram: input bus y → DEMUX → Reconstruct S*, {ks}, s ∈ S* → Decoding Units 1, …, B → outputs x(1), …, x(B)]
• An input bus is demultiplexed to multiple units
• The MDL source and the quantized ML parameters are reconstructed
• The B compressed blocks y(1), …, y(B) are decompressed on B decoding units

Theoretical Results
• Theorem: With computations performed with 2·log(N) bits of precision defined as O(1) time:
  – Phase I of PSP approximates the MDL coding length to within O(1) of the true optimum
  – The PSP algorithm requires O(N/B) time
• Theorem: The PSP algorithm uses a total of O(N) words of memory = a total of O(N·log(N)) bits
• Theorem: The pointwise redundancy of PSP w.r.t. S* is
  ρ(x) < B·[log(N/B) + O(1)] + |S*|·[log(N)/2 + O(1)]
  where the first term is the parallelization overhead

Main Contributions
• BWT-MDL (O(N) universal encoder):
  ✓ An O(N) algorithm that achieves Rissanen's redundancy bounds on the best achievable compression
  ✓ Combines efficient prefix tree construction with the semi-predictive approach to universal coding
• Fast suffix sorting (not in this talk):
  – Core algorithm is very simple (can be implemented in VLSI)
  – Worst-case complexity O(N log^0.5(N))
  – Competitive with other suffix sorting methods in practice
• Two-part codes:
  ✓ Rigorous analysis of their compression quality
  ✓ Application to distributed/parallel compression
  – Optimal two-part codes
• Parallel compression algorithm (not in this talk):
  ✓ Work-efficient O(N/B) algorithm
  ✓ Compression loss is roughly B·log(N/B) bits

More…
• Results have been extended to the |X|-ary alphabet
• Future research can concentrate on:
  – Processing broader classes of tree sources
  – Problems in statistical inference:
    • Universal classification
    • Channel decoding
    • Prediction
  – Characterizing the design space for parallel compression algorithms

Generic Phase I

if (s is a leaf) {
    Count symbol appearances ns0 and ns1
    MDLs ← length(ns0, ns1)
} else {    /* s is an internal node */
    Recursively compute MDL lengths and counts for 0s and 1s
    ns0 ← n0s0 + n1s0,  ns1 ← n0s1 + n1s1
    MDLs ← length(ns0, ns1)
    if (MDLs > MDL0s + MDL1s) {
        Keep 0s and 1s    /* the children describe the data more cheaply */
    } else {
        Prune 0s and 1s, keep s
    }
}