Post Placement C-Slow Retiming for Xilinx Virtex FPGAs
UC Berkeley BRASS Group
Nicholas Weaver, Yury Markovskiy, Yatish Patel, John Wawrzynek
UC Berkeley Reconfigurable Architectures, Systems, and Software (BRASS) Group
ACM Symposium on Field Programmable Gate Arrays (FPGA), February 2x, 2003
http://www.cs.berkeley.edu/~nweaver/cslow.html
Automatic C-Slow Retiming for Virtex FPGAs

Outline
• “Automatically Double Your Throughput”
  – “You paid for those registers, here’s how to use them”
• Retiming and C-slow Retiming
  – The transformation
• C-slow Retiming and the Virtex FPGA
  – The target
• Retiming 3 Benchmarks
  – The tests

Retiming and Repipelining
• Retiming
  – Automatically moving registers to minimize the clock period
  – Benefits limited by the number of registers
  – Algorithm developed by Leiserson et al.
• Repipelining
  – Adding registers to the front or back
  – Let retiming then move them around
• But What About Feedback Loops?
  – Retiming and repipelining are of limited benefit when you have feedback loops
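Why feedback loops limit retiming can be seen with a little arithmetic. Retiming moves registers but never changes how many registers sit on a cycle, so the logic delay around a loop has to be shared among the registers already there. The sketch below computes that lower bound; the function name and the delay values are made up for illustration and are not taken from the talk.

```python
def loop_period_bound(logic_delays, registers_in_loop):
    """Lower bound on the clock period imposed by one feedback loop.

    Retiming preserves the register count on every cycle, so at least one
    register-to-register segment carries an equal share of the loop delay.
    """
    return sum(logic_delays) / registers_in_loop


# Hypothetical loop: three logic blocks and a single register.
print(loop_period_bound([2.0, 3.0, 1.0], 1))  # 6.0, however the register is moved
```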

C-Slow Retiming
• Replace every register with a sequence of C registers
  – With more registers, retiming can break the design into finer pieces
  – Again proposed by Leiserson et al., to meet systolic slowdown
• Semantic-altering transformation
  – But resulting semantics are predictable and useful
• Ideal: C-slow in synthesis, retime after placement
• Our prototype: C-slow and retime after placement

Design Semantics After C-Slowing
• Design operates on C independent data streams
  – Data streams are externally interleaved on a round-robin basis
• Semantics apply to designs with Task Level Parallelism
  – Encryption
    • Counter (CTR) mode works on independent blocks
  – Sequence matching
    • Compare sequence vs database
• C-slowing improves throughput but adds latency and registers
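The stream semantics can be checked on a toy design. The sketch below models a one-register accumulator, then replaces that register with a chain of C registers and feeds it round-robin interleaved inputs. The accumulator and all names are illustrative assumptions, not one of the talk's benchmarks, and the model leaves the C registers in one place rather than retiming them (retiming would change where they sit, not which stream each cycle belongs to).

```python
from collections import deque


def run_accumulator(inputs):
    """Original design: one register holding a running sum."""
    acc, out = 0, []
    for x in inputs:
        acc += x
        out.append(acc)
    return out


def run_c_slowed(interleaved, c):
    """Same design with the register replaced by a chain of c registers."""
    regs = deque([0] * c)
    out = []
    for x in interleaved:
        value = regs.pop() + x   # oldest register in the chain feeds the adder
        regs.appendleft(value)   # new sum enters the front of the chain
        out.append(value)
    return out


streams = [[1, 2, 3], [10, 20, 30]]
interleaved = [x for group in zip(*streams) for x in group]  # round robin
out = run_c_slowed(interleaved, c=2)

# De-interleaving recovers what the original design computes per stream.
assert out[0::2] == run_accumulator(streams[0])
assert out[1::2] == run_accumulator(streams[1])
```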

C-slowing, Retiming, and the Virtex FPGA
• Every 4-LUT has an associated register
• LUTs can act as clocked shift registers (SRL16s)
  – Used in our AES hand-benchmark
  – Not used in our tool
  – Register can, almost always, be used independently of the LUT
• Many designs have low register utilization
  – Excess of registers available in unoptimized designs
• Retiming best performed with/after placement
  – Xilinx placement operates on mapped slices
  – Need net delay information for better results
[Figure: 4-LUT with inputs F1 to F4 and its register; signals X, XB, BX]

Sketch of Tool’s Operation
1. Convert .ncd to .xdl after placement
2. Load design into graph representation
3. Replace registers with edge annotations to represent registers
4. Replace every single register with C registers
5. Compute costs based on delay model
6. Retime
7. Convert edge annotations back to instance registers
8. Write out .xdl, convert to .ncd
9. Route
[Figure: tool flow between Placer, .xdl files, and Router, with an example graph annotated with delays]
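The graph steps of this flow (registers as edge annotations, multiplying by C, costing with a delay model, retiming) can be sketched on a three-node loop. This is a minimal illustration under stated assumptions, not the tool's implementation: the node names and delays are hypothetical, and an exhaustive search over small lag values stands in for a real retiming algorithm such as Leiserson and Saxe's.

```python
from itertools import product


def c_slow(edges, c):
    """Replace every register (edge annotation) with c registers."""
    return {edge: regs * c for edge, regs in edges.items()}


def retime(edges, lag):
    """Apply per-node lags: regs'(u, v) = regs(u, v) + lag[v] - lag[u]."""
    return {(u, v): regs + lag[v] - lag[u] for (u, v), regs in edges.items()}


def clock_period(edges, delay):
    """Longest delay along any register-free path, or None if illegal.

    Assumes every cycle carries at least one register, so the
    register-free edges form no loop once all counts are non-negative.
    """
    if any(regs < 0 for regs in edges.values()):
        return None
    succ = {node: [] for node in delay}
    for (u, v), regs in edges.items():
        if regs == 0:
            succ[u].append(v)
    memo = {}

    def longest(node):
        if node not in memo:
            tail = max((longest(n) for n in succ[node]), default=0.0)
            memo[node] = delay[node] + tail
        return memo[node]

    return max(longest(node) for node in delay)


def best_retiming(edges, delay, max_lag):
    """Exhaustive stand-in for a retiming solver (tiny graphs only)."""
    nodes = sorted(delay)
    best = (clock_period(edges, delay), dict.fromkeys(nodes, 0))
    for lags in product(range(max_lag + 1), repeat=len(nodes)):
        lag = dict(zip(nodes, lags))
        period = clock_period(retime(edges, lag), delay)
        if period is not None and period < best[0]:
            best = (period, lag)
    return best


# Hypothetical single-register feedback loop a -> b -> c -> a.
delay = {"a": 2.0, "b": 3.0, "c": 1.0}
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}

print(best_retiming(edges, delay, 1)[0])             # 6.0: retiming alone cannot help
print(best_retiming(c_slow(edges, 2), delay, 2)[0])  # 3.0: 2-slowing halves the period
```

In this toy loop the period drops from 6.0 to 3.0 after 2-slowing, which is the "double your throughput" effect in miniature; each individual stream still sees one result every two cycles.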

Experiment 1: How Good is the Tool?
• Tool is a simple prototype
  – Manhattan distance delay estimate
  – No attempt to minimize flip-flops
  – Basic flip-flop allocation
• Two benchmarks: AES and Smith/Waterman
  – Hand mapped
  – (optionally) hand placed
  – (optionally) hand C-slowed and retimed
• Our best hand AES implementation
  – 1.3 Gb/s
  – <800 Slices, 10 BlockRAMs
  – $10 part, Spartan II-100
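A Manhattan distance delay estimate can be as simple as the sketch below. The per-hop and base delay constants are placeholders chosen for illustration; the talk does not give the prototype's actual coefficients.

```python
def manhattan_delay(src, dst, delay_per_hop=1.0, base_delay=0.0):
    """Estimate net delay from the placed (x, y) positions of its endpoints.

    delay_per_hop and base_delay are hypothetical tuning constants.
    """
    (x0, y0), (x1, y1) = src, dst
    return base_delay + delay_per_hop * (abs(x1 - x0) + abs(y1 - y0))


print(manhattan_delay((0, 0), (3, 4)))  # 7.0
```

An estimate like this needs only placement coordinates, which is why it is available after placement but before routing.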

Experiment 1: AES, Automatically Placed
Version                 Clock Rate (Throughput)   Stream Clock Rate (1 / Latency)
Initial Design          48 MHz
5-Slow by hand          105 MHz                   21 MHz
Retimed Automatically   47 MHz
2-Slow Automatically    64 MHz                    32 MHz
3-Slow Automatically    75 MHz                    25 MHz
4-Slow Automatically    87 MHz                    21 MHz
5-Slow Automatically    88 MHz                    18 MHz
• Just retiming is of no benefit
• Automatic C-slowing very effective
  – But could do even better
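The two rate columns are related by the C-slow factor: each stream gets every C-th cycle, so the stream clock rate is the overall clock rate divided by C. The short sketch below recomputes that for the automatic AES rows; the helper name is an assumption, and the table itself lists whole-MHz values.

```python
def stream_clock_mhz(clock_mhz, c):
    """Per-stream clock rate of a C-slowed design."""
    return clock_mhz / c


# Clock rates of the automatically C-slowed AES designs, keyed by C.
aes_auto = {2: 64, 3: 75, 4: 87, 5: 88}
for c, clock in aes_auto.items():
    per_stream = stream_clock_mhz(clock, c)
    print(f"{c}-slow: {clock} MHz overall, {per_stream:.2f} MHz per stream")
```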

Experiment 1: Smith/Waterman, Automatically Placed
Version                 Clock Rate (Throughput)   Stream Clock Rate (1 / Latency)
Initial Design          43 MHz
4-Slow by hand          90 MHz                    22 MHz
Retimed Automatically   40 MHz
2-Slow Automatically    69 MHz                    34 MHz
3-Slow Automatically    84 MHz                    28 MHz
4-Slow Automatically    76 MHz                    25 MHz
• Again, just retiming is of no benefit
• C-slowing highly effective
  – Within 7% of hand-built implementation

Experiment 1: Comments
• Just retiming is of no benefit
  – Both designs limited by single-cycle feedback loops
• C-slowing very effective
  – Able to automatically nearly double throughput
• Hand implementations more than doubled throughput
  – Reasonable numbers of additional registers
• Limitations of prototype tool:
  – Flip-flop allocation routines could be better
  – Some AES hand benchmarks used SRL16 delay chains
• Simple is pretty good
  – Relatively simplistic implementation gets reasonably close to hand-mapped performance

Experiment 2: Retiming LEON
• Can we automatically C-slow a large, synthesized design?
• Leon 1: A synthesized, GPLed SPARC-compatible microprocessor core [1]
  – 5-stage pipeline, integer only
  – Modify register file to use BlockRAMs
    • BlockRAMs are used as negative-edge devices
  – Remove caches, I/O, etc.
  – Synthesize, using Synplify with CEs disabled
  – Edit EDIF to replace Sets/Resets
• Retime and C-slow with prototype tool
  – Prototype tool converts BlockRAMs to positive edge
• C-slow a microprocessor core...
  – Get an interleaved multithreaded architecture
[1] Leon 1, by Jiri Gaisler, http://www.gaisler.com/leonmain.html

Experiment 2: Results
Version                 Clock Rate (Throughput)   Thread Clock Rate (Latency)   LUT-Associated Flip-Flops   LUT-Independent Flip-Flops
Initial Design          23 MHz                                                  1611                        NA
Retimed Automatically   25 MHz                                                  2398                        194
2-Slow Automatically    46 MHz                    23 MHz                        2150                        388
3-Slow Automatically    47 MHz                    16 MHz                        2438                        3713
6132 LUTs for all designs
• Retiming alone worked surprisingly well
• 2-slowing very effective
• 3-slowing hit diminishing returns

Experiment 2: Comments
• Retiming alone worked surprisingly well
  – Tool automatically converted BlockRAMs to positive-edge clocking and rebalanced the pipeline
• 2-slowing very effective
  – Effectively doubled the initial throughput
    • NO slowdown in latency over initial design, because retiming was effective without C-slowing
  – Used many more registers, but fewer registers than LUTs
• 3-slowing hit diminishing returns
  – Too many registers required, combined with poor register allocation, led to poor performance

Conclusions:
• C-slow retiming is very effective
  – "Automatically double your throughput"
    • Benefits: More throughput
    • Costs: More flip-flops, worse latency
• Post-placement retiming appropriate
  – Independent flip-flop usage critical
  – Have delay model for interconnect as well as logic
• Some room for improvement
  – Faster/better implementation
    • Minimize flip-flop usage as well as delay
    • Use SRL16s
    • Better placement of flip-flops
  – Experience suggests more flip-flops per LUT would be useful

Backup Slide: Why Not Use (Current) Synthesis Tools?
• Many synthesis tools support retiming, but with caveats:
  – ONLY works for synthesized items
    • AES and Smith/Waterman didn't use synthesis
  – Can't automatically C-slow
  – Can't retime through memory blocks
  – Can't accurately guesstimate interconnect delay before placement
    • >½ of the delay is the interconnect
  – Can't effectively scavenge unused flip-flops before placement
    • Xilinx placement operates on slices, not LUTs

Backup Slide: Why the limitations on total speedup?
• Absolute maximum
  – Interconnect + LUT + Flip-Flop
• Practical maximums
  – Too many flip-flops to allocate
    • “Only” one flip-flop per LUT available
  – Flip-flop allocation poor
    • Quick and dirty greedy heuristic
      – Works well for mild C-slowing
      – Fails with highly aggressive C-slowing
  – Tool doesn’t minimize flip-flops
  – Critical path is defined by the single worst path
  – Tool uses “Cheap and dirty” interconnect delay model

(Backup Slide): Design Restrictions to Enable C-slowing
• Resets and Clock Enables
  – Convert to explicit logic
• Memories
  – Increase by a factor of C
    • Add high bits of addr to provide round-robin access
    • Every stream sees an independent memory
• Global Set/Reset
  – Convert to individual resets
  – Still highly restrictive
• Interleave/deinterleave IO
  – Requires external logic
• No asynchronous sets/resets
[Figure: memory block with Addr, Din, WE, and Dout ports and a Thread Counter]
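The memory rule can be illustrated with a small behavioral model: the memory grows by a factor of C, and a thread counter that advances every clock supplies the high address bits, so each stream only ever touches its own copy. The class and method names are hypothetical, the read-before-write port behavior is an assumption, and `thread * depth + addr` matches "high bits" literally only when the depth is a power of two.

```python
class CSlowedMemory:
    """Behavioral model of a memory enlarged by a factor of c."""

    def __init__(self, depth, c):
        self.depth = depth
        self.c = c
        self.cells = [0] * (depth * c)
        self.thread = 0  # thread counter, advances once per clock

    def clock(self, addr, din=0, we=False):
        """One clock cycle: access the current thread's bank, then advance."""
        full_addr = self.thread * self.depth + addr  # thread id as high address bits
        dout = self.cells[full_addr]                 # assumed read-before-write
        if we:
            self.cells[full_addr] = din
        self.thread = (self.thread + 1) % self.c
        return dout


mem = CSlowedMemory(depth=4, c=2)
mem.clock(1, din=7, we=True)   # stream 0 writes address 1
seen_by_stream_1 = mem.clock(1)  # stream 1 reads its own, untouched copy
seen_by_stream_0 = mem.clock(1)  # stream 0 reads back what it wrote
print(seen_by_stream_1, seen_by_stream_0)  # 0 7
```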
