Programming by Sketching for Bit-Streaming Programs
Armando Solar-Lezama, Ras Bodik (UC Berkeley); Rodric Rabbah (MIT); Kemal Ebcioglu (IBM)
Verification, synthesis, sketching
• Verification: does your program implement the spec?
  – user responsible for low-level implementation details
  – redundancy: implementation restates aspects of spec
• Synthesis: produce a program that implements the spec
  – say what, not how
  – say it only once
• Sketching: synthesis + partially described implementation
  – spec is executable → easy to debug
  – programmer sketches the implementation
The sketching experience
[Figure: sketch; program = completed sketch]
Case study domain: bit-stream programs
• Manipulate a stream at the bit level
  – crypto: DES, Serpent, AES, …, code breaking
  – coding: error correction
• Implementation gap
  – easy to describe, difficult to implement
• Operate under strict constraints
  – performance is very important
    • up to 95% of server cycles spent in security-related processing
  – correctness is crucial
    • a subtle bug in a Blowfish implementation allowed cracking over half the keys in less than 10 minutes (http://www.schneier.com/blowfish-bug.txt)
Running example
• DropThird: “Drop every third bit in the bit stream.”
• exhibits many features of complicated permutations
  – implementation gap
  – number of implementations: exponential in word size
• In StreamBit, a fast implementation can be sketched
[Figure: SLOW O(w) functionality + sketch → FAST O(log w) implementation]
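As a point of reference for the running example, the slow O(w) functionality can be written bit by bit. This is a minimal Python sketch rather than StreamBit code; the function name `drop_third` and the list-of-bits representation are assumptions made for illustration.

```python
def drop_third(bits):
    """Return the stream with every third bit removed.

    Positions are counted from 1, so the bits at 3, 6, 9, ... are dropped
    (list indices 2, 5, 8, ...). One step per input bit: O(w) for a w-bit word.
    """
    return [b for i, b in enumerate(bits) if i % 3 != 2]


stream = [1, 0, 1, 1, 1, 0, 0, 1, 1]
print(drop_third(stream))  # [1, 0, 1, 1, 0, 1]
```

This version is easy to check by eye, which is the role the executable spec plays; the sketch is what lets the compiler reach the fast word-level form.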
Two sketches needed for DropThird
The log-shifter:
  Decompose( [shift(1:16 by 0 || 1)], [shift(1:16 by 0 || 2)], [shift(1:16 by 0 || 4)] )
Smart packing of input stream to machine words:
  Decompose( [shift(1:2 by 0), shift(17:18 by 0), shift(33:34 by 0)], [shift(1:16 by ?), shift(17:32 by ?), shift(33:48 by ?)] )
Two sketches synthesize a high-quality implementation:
  – 32 bit on a Pentium IV: 1.83-fold speedup
  – 64 bit on an Itanium II: 3.33-fold speedup
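The log-shifter idea can be illustrated in Python: every kept bit of a 16-bit word moves by 0 or 1, then by 0 or 2, then by 0 or 4 positions. The code below is a sketch under stated assumptions, not StreamBit output: stream position i is stored in integer bit i (so compaction is a right shift), and the helper names `stage_masks`, `drop_third_word` and `reference_word` are hypothetical.

```python
W = 16  # word size assumed for this illustration


def stage_masks(w=W):
    """For each stage (shift by 1, 2, 4, ...), compute the mask of bits that move."""
    # A kept bit at index i must travel i // 3 positions (dropped bits before it).
    dist = {i: i // 3 for i in range(w) if i % 3 != 2}
    pos = {i: i for i in dist}  # current position of every kept bit
    masks = []
    for k in range(max(dist.values()).bit_length()):
        step = 1 << k
        mask = 0
        for i in dist:
            if dist[i] & step:  # this bit moves in this stage
                mask |= 1 << pos[i]
                pos[i] -= step
        masks.append((step, mask))
    return masks


def drop_third_word(x, w=W):
    """Compact the kept bits of x toward bit 0 in O(log w) mask-and-shift stages."""
    keep = sum(1 << i for i in range(w) if i % 3 != 2)
    x &= keep  # clear the bits that are dropped
    for step, mask in stage_masks(w):
        x = (x & ~mask) | ((x & mask) >> step)
    return x


def reference_word(x, w=W):
    """Bit-by-bit version of the same function, used only for comparison."""
    kept = [(x >> i) & 1 for i in range(w) if i % 3 != 2]
    return sum(b << j for j, b in enumerate(kept))


print(bin(drop_third_word(0b1011)))  # 0b111
```

Applying the smallest shift first keeps the bits from colliding in intermediate stages, which is why three stages are enough for a 16-bit word.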
… compare with Fortran
StreamBit sketch (size: 13 lines):
  WSIZE=16;
  subsequence = Unroll[WSIZE](subsequence);
  subsequence = PermutFactor[
    [shift(1:2 by 0), shift(17:18 by 0), shift(33:34 by 0)],
    [shift(1:16 by ?), shift(17:32 by ?), shift(33:48 by ?)]
  ] ( subsequence );
  subsequence_1 = DiagSplit[WSIZE](subsequence);
  for(i=0; i<3; ++i) {
    subsequence_1.filter(i) = PermutFactor[
      [shift(1:16 by 0 || 1)],
      [shift(1:16 by 0 || 2)],
      [shift(1:16 by 0 || 4)]
    ]( subsequence_1.filter(i) );
  }
Fortran (100+ lines), excerpt:
        DIMENSION MASKB1(INC), MASKB2(INC), MASKB3(INC), MASKB4(INC)
        DATA MASKB1 /Z'F81F03E07C0F81F0', Z'3E07C0F81F03E07C',
       $ Z'0F81F03E07C0F81F', Z'03E07C0F81F03E07', Z'C0F81F03E07C0F81', Z'F03E07C0F81F03E0',
       $ Z'7C0F81F03E07C0F8', Z'1F03E07C0F81F03E', Z'07C0F81F03E07C0F', Z'81F03E07C0F81F03',
       $ Z'E07C0F81F03E07C0'/
        DATA MASKB2 /Z'FFC003FF000FFC00', Z'3FF000FFC003FF00',
       $ Z'0FFC003FF000FFC0', Z'03FF000FFC003FC0', Z'FE001FF8007FE001', Z'FF8007FE001FF800',
       $ Z'7FE001FF8007FE00', Z'1FF8007FE001FF80', Z'07FE001FF8007FC0', Z'FC003FF000FFC003',
       $ Z'FF000FFC003FF000'/
        …
  c     Move word 1 into position
        TC = ISHFT(CBUF(1, I), SLC(1))
  c     Move first part of word 2 into position
        TC = TC + ISHFT(CBUF(2, I), SRC(2))
  c     Move word 3 into position and output 1st word of output
        C(1, I+K) = TC + ISHFT(CBUF(3, I), SRC(3))
  c     Move last part of word 3 into position
        TC = ISHFT(CBUF(3, I), SLC(3))
  c     Move word 4 into position
        TC = TC + ISHFT(CBUF(4, I), SRC(4))
        …
Exploring different implementations
A permutation from the DES cipher (64 bits → 64 bits)
[Figure label: 32 bits]
Sketch:
  shift(1:64 by 0 || 33 || -33),
  shift(1:2:31 by -33),
  shift(34:2:64 by 33),
  [] // unspecified, synthesized
• trial and error with different implementations
How sketching works
The specification
• An executable specification
  – a StreamIt program (synchronous dataflow)
    • to learn more about StreamIt, see their LCTES and PPoPP talks
  – filters represented internally as matrices

  bit->bit filter DropThird {
    work push 2 pop 3 {
      push(pop());
      push(pop());
      pop();
    }
  }

  $\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix}$

Each execution consumes a 3-bit chunk of input and produces 2 bits of output.
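The matrix view of the filter can be made concrete with a few lines of Python. This is an illustrative sketch only: the list-of-rows matrix representation and the names `drop_third_matrix` and `apply_matrix` are assumptions, not part of StreamIt or StreamBit.

```python
def drop_third_matrix(copies=1):
    """Boolean matrix (list of rows) for `copies` back-to-back executions of DropThird.

    Row r has a single 1 in the column of the input bit that becomes output bit r.
    """
    n_in = 3 * copies
    kept = [i for i in range(n_in) if i % 3 != 2]
    return [[1 if col == src else 0 for col in range(n_in)] for src in kept]


def apply_matrix(matrix, bits):
    """Multiply a Boolean matrix by a bit vector (arithmetic over GF(2))."""
    return [sum(m & b for m, b in zip(row, bits)) % 2 for row in matrix]


M = drop_third_matrix()
print(M)                           # [[1, 0, 0], [0, 1, 0]]
print(apply_matrix(M, [1, 0, 1]))  # [1, 0]
```

Passing `copies=4` gives the 8 × 12 matrix of the filter unrolled four times, the form that word-level decomposition starts from.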
The sketched synthesis problem
• spec: [figure]
• sketch of implementation: [figure]
• problem:
  – synthesize target-machine code such that it follows the sketched implementation steps
Easier problem 1: Base Compilation
• spec: [figure]
• sketch of implementation: none
• problem:
  – compile the spec for the target machine
Implementation expressible in StreamIt
• relevant hardware instructions are dataflow filters:
  filter(in) {          // in  = wxyz
    t2 = in SHIFTL 1    // t2  = xyz0
    t1 = in AND 1100    // t1  = wx00
    t3 = t2 AND 0010    // t3  = 00z0
    out = t1 OR t3      // out = wxz0
    return out
  }
[Figure: dataflow graph in which in is duplicated, passed through the SHIFTL and AND filters (each shown as a Boolean matrix), and joined by OR into out = wxz0]
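The same four steps can be run directly in Python to confirm that the word wxyz becomes wxz0. The function name `drop_third_4bit` and the convention that w is the most significant bit of a 4-bit integer are assumptions made for this illustration.

```python
def drop_third_4bit(word):
    """Map the 4-bit word wxyz (w most significant) to wxz0 using word-level ops."""
    t1 = word & 0b1100          # keep w and x in place:       wx00
    t2 = (word << 1) & 0b1111   # shift left within 4 bits:    xyz0
    t3 = t2 & 0b0010            # isolate z at its new place:  00z0
    return t1 | t3              # combine:                     wxz0


print(format(drop_third_4bit(0b1011), "04b"))  # 1010
```

The mask after the shift stands in for the fixed 4-bit word width of the slide; on a real machine word the overflow bit would simply fall off.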
Base compiler
[Figure: space of all programs, from the spec to implementations that are more decomposed (into low-level steps)]
How It Works
• Example: Drop Third Bit (word size W = 4 bits)
  – unroll filter
  – decompose into filters operating on W = 4 bits of input
  – decompose into filters producing W = 4 bits of output
  – resulting word-level steps: t1 = in AND 1100; t2 = in SHIFTL 1; t3 = t2 AND 0010; out = t1 OR t3
[Figure: the 12-bit input opqrstuvwxyz is split into the words opqr, stuv, wxyz; dropping every third bit leaves op, rs, uv, xy, which are packed into 4-bit output words]
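The unroll-and-decompose steps can be mimicked on explicit Boolean matrices. The following Python sketch assumes a list-of-rows matrix representation and hypothetical helper names (`unrolled_matrix`, `column_blocks`, `run_decomposed`); it illustrates the idea and is not the StreamBit compiler's algorithm.

```python
W = 4  # word size used on the slide


def unrolled_matrix(n_in):
    """Matrix of DropThird unrolled to n_in input bits (n_in a multiple of 3)."""
    kept = [i for i in range(n_in) if i % 3 != 2]
    return [[int(col == src) for col in range(n_in)] for src in kept]


def matvec(matrix, bits):
    """Boolean matrix-vector product: each output bit is an OR of ANDs."""
    return [int(any(m & b for m, b in zip(row, bits))) for row in matrix]


def column_blocks(matrix, w=W):
    """Split the matrix into blocks that each read one w-bit input word."""
    return [[row[c:c + w] for row in matrix] for c in range(0, len(matrix[0]), w)]


def run_decomposed(matrix, bits, w=W):
    """Apply each input-word block separately, OR the partial results,
    then cut the output into w-bit words."""
    out = [0] * len(matrix)
    for k, block in enumerate(column_blocks(matrix, w)):
        partial = matvec(block, bits[k * w:(k + 1) * w])
        out = [o | p for o, p in zip(out, partial)]
    return [out[r:r + w] for r in range(0, len(out), w)]


bits = [1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1]
print(run_decomposed(unrolled_matrix(12), bits))  # [[1, 0, 1, 1], [0, 1, 0, 0]]
```

Each column block corresponds to a filter that reads one 4-bit input word, and each output slice to a filter that produces one 4-bit output word.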
Easier problem 2: adopting an implementation
• spec: [figure]
• fragment of implementation (not a sketch): [figure]
• problem:
  – synthesize target-machine code such that it uses the provided implementation steps
Adopting provided implementation
[Figure: spec and implementations]
Adopting provided implementation (DropThird)
Finally, the sketched synthesis problem
• spec: [figure]
• sketch of implementation: [figure]
Adopting provided implementation
[Figure: spec and implementations]
Nudging the base compiler
[Figure: sketch with unspecified (?) entries]
Sketch resolved using Boolean matrix algebra
[Figure: the function to implement, M, next to the completed sketch [shift(1:16 by 0 || 1)] × [shift(1:16 by 0 || 2)] × [shift(1:16 by 0 || 4)], with factors M3 × M2 × M1]
– filter-to-pipeline decomposition = matrix factorization: $M = M_3 \times M_2 \times M_1$ (guarantees correctness)
– sketch gives constraints on factors
– factorization approach:
  • constraint solving followed by search
  • use constraints to narrow the search space
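A toy analogue of resolving the log-shifter sketch is shown below in Python. It is not the Boolean matrix factorization StreamBit performs. It only illustrates how constraints shrink the search: each kept bit must pick 0 or the offered shift at every stage so that the shifts add up to its required distance, and the stages must never make two bits collide. The names `required_distance` and `resolve_sketch`, and the indexing of stream positions from 0, are assumptions.

```python
from itertools import product

W = 16
STAGES = (1, 2, 4)  # shift amounts offered by the three sketched stages


def required_distance(i):
    """Positions a kept bit must move: the number of dropped bits before it."""
    return i // 3


def resolve_sketch(w=W, stages=STAGES):
    """Pick, for every kept bit, which stages shift it (1) and which do not (0)."""
    kept = [i for i in range(w) if i % 3 != 2]
    choice = {}
    for i in kept:
        # Constraint: the chosen shifts must add up to the required distance.
        candidates = [c for c in product((0, 1), repeat=len(stages))
                      if sum(s * b for s, b in zip(stages, c)) == required_distance(i)]
        if not candidates:
            raise ValueError(f"sketch cannot move bit {i} far enough")
        choice[i] = candidates[0]
    # Validation: replay the stages and check that no two bits ever collide.
    pos = {i: i for i in kept}
    for k, s in enumerate(stages):
        for i in kept:
            pos[i] -= s * choice[i][k]
        if len(set(pos.values())) != len(kept):
            raise ValueError(f"stage {k} makes two bits collide")
    return choice


print(resolve_sketch()[15])  # (1, 0, 1): shift by 1, not by 2, then by 4
```

Without the per-bit constraint there would be 2^33 combinations to try for the 11 kept bits; with it, each bit has at most 8 candidates to filter.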
Evaluation of Approach
Evaluation Goals
1. Time to first solution
   – how quickly can we develop a first reference solution?
2. Performance of base compilation
   – can base compiled code compete with handwritten C?
3. Benefits of sketching
   – how good is the quality of code produced through sketching?
4. Sketching vs. expert tuning
   – can sketching compete with professionally tuned code?
1) Time to develop the spec
• How quickly can the specification (first solution) be developed?
[Figure: time to develop the spec, with average (Avg)]
2) Performance of Base Compilation
• Can base compiled code compete with handwritten C?
[Figure: StreamBit submissions vs. C submissions (each line = one programmer); axis: time in hours]
3) Benefits of Sketching
• How much performance can sketching get beyond baseline compilation?
[Figure: sketch-based implementation, expert-tuned C implementation, and C programmers; axis: time (hours)]
4) Sketching vs. expert tuning
• Can sketching compete with professionally tuned code?
• Can we match the best DES implementation (libDES)?
  – only 17% slower on 32-bit machines (we can fix this)
  – faster on 64-bit machines

  processor     sketched vs. libDES
  Pentium 4     0.91
  Pentium III   0.83
  Sparc         0.91
  IA64          1.06
  IBM SP        1.08
Related Approaches and Conclusion
Can we do without the sketch?
[Figure: functionality + sketch → FAST implementation]
• Search-based optimization (à la ATLAS)
  – synthesize and evaluate all implementations
  – unconstrained by the sketch, the search space becomes huge
• Classical optimization
  – hard-code the log-shifter as a typical optimization phase
  – more work than writing the sketch, needs compiler
Sketches allow global control over compilation
[Figure: spec, log-shifting sketch, pack-within-words sketch]
Conclusion
• Spec + sketch
  – separation of aspects: correctness vs. performance
  – separation of roles: domain experts (crypto) vs. system experts (perf. programmer)
• Sketching worked in this domain because:
  – large gap between specification and implementation
  – algebra of program transformations
• Other domains?
  – graphics, scientific kernels, media codecs