Power Optimal Dual-Vdd Buffered Tree Considering Buffer Stations and Blockages

King Ho Tam and Lei He
Electrical Engineering Department, University of California, Los Angeles
Sponsors: NSF CAREER, UC MICRO (Fujitsu, Intel and Mindspeed), and IBM Faculty Partner Award.

Motivation
- Increasing interconnect power
  - 35% of cells are buffers at 65 nm technology [Saxena, TCAD 04]
- Previous work
  - Power-optimal single Vdd buffer insertion [Lillis, JSSC 96]
  - Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02]
- No existing algorithm considers dual Vdd for buffer insertion or buffered tree generation

Major Contributions
- First in-depth study of dual Vdd buffer insertion and buffered tree generation
  - Large power saving over single Vdd buffering
- Efficient algorithms for power optimality
  - 17x faster than [Lillis, JSSC 96] when single Vdd is considered

Outline
- Dual Vdd buffer insertion and sizing (DVB)
  - Problem formulation
  - Sampling for speedup
  - Experimental results
- Dual Vdd buffered tree generation (D-Tree)
  - Problem formulation
  - Improved augmented orthogonal search tree
  - Experimental results

Delay, Slew and Power Modeling
- Delay: Elmore delay, for both wires and buffers
- Slew: Bakoglu's slew metric (ln 9 · Elmore delay)
- Power = energy per switch
  - Wire energy
  - Lumped buffer dynamic/short-circuit power
  - Can be easily extended to leakage power
    - Low Vdd (VL) reduces leakage
    - Need to assume a clock rate and switching activity
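The wire and buffer expressions on this slide did not survive extraction, so the sketch below uses the textbook forms as stand-ins. It assumes a uniform wire with per-unit-length resistance `r` and capacitance `c`, a buffer modeled by an intrinsic delay plus output resistance times load, and a wire energy of 0.5·C·Vdd² per switch. All function and parameter names are hypothetical; only the ln 9 slew factor is taken from the slide.

```python
import math


def wire_elmore_delay(r, c, length, c_load):
    """Elmore delay of a uniform wire (r, c per unit length) driving c_load."""
    return r * length * (c * length / 2.0 + c_load)


def buffer_delay(d_intrinsic, r_out, c_load):
    """Assumed buffer model: intrinsic delay plus output resistance times load."""
    return d_intrinsic + r_out * c_load


def slew(elmore_delay):
    """Bakoglu's slew metric as stated on the slide: ln 9 times the Elmore delay."""
    return math.log(9) * elmore_delay


def wire_energy(c, length, vdd):
    """Assumed energy per switch of a wire: 0.5 * C_wire * Vdd^2."""
    return 0.5 * c * length * vdd ** 2
```

Because the energy term is quadratic in `vdd`, lowering the supply on a wire segment reduces its energy faster than linearly, which is what the dual Vdd slides exploit.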

Introducing Dual Vdd Buffering
- Achieves power saving since power ∝ Vdd²
- Suffers no loss of delay optimality
- VL => VH requires a level converter (LC)
  - Restores the voltage level and reduces leakage
  - Ext-CVS
    - LC for logic [Srivastava, ISLPED 04]: delay and power overhead amortized
[Figure: a VL signal driving a VH gate, annotated with reduced noise margin and leakage current I]
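The quadratic dependence can be made concrete with a short worked ratio. It assumes dynamic switching energy proportional to C·Vdd², with the same capacitance and switching activity at both supply levels. The value VL = 0.8·VH is purely illustrative and is not taken from the slides.

```latex
\frac{E_{VL}}{E_{VH}}
  = \frac{C\,V_L^{2}}{C\,V_H^{2}}
  = \left(\frac{V_L}{V_H}\right)^{2},
\qquad
V_L = 0.8\,V_H \;\Rightarrow\; \frac{E_{VL}}{E_{VH}} = 0.64
```

Under that assumption, every stage moved from VH to VL would save 36% of its dynamic energy, before accounting for any delay penalty.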

Key Observation in Dual Vdd Buffering
- Disallowing VL => VH will not affect optimality
  - Optimality empirically illustrated (@ 65 nm):
    - (a) has an LC and VH drives Cl; power (a) > (b)
    - Delay (b) > (a) only if Cl > 0.5 pF (~9 mm wire)
[Figure: configurations (a) and (b) with VH and VL buffers]

DVB Formulation
- Dual Vdd Buffer Insertion (DVB)
  - Given an interconnect tree
  - Find buffer placement, Vdd assignment for buffers, and buffer sizes
    - VH buffers driving VL buffers within the tree
    - Level converters at VH sinks driven by VL buffers
  - Minimize power subject to
    - Arrival time requirement at the source (RAT)
    - Slew rate constraint at buffer inputs and sinks

DVB Algorithm
- Based on [Lillis, JSSC 96]
  - Dynamic programming with partial solution (option) pruning
  - Options must now record downstream Vdd levels for buffering
    - To prevent VL => VH, which also removes unnecessary search of the solution space
  - Still quite slow for large nets
- Challenge
  - Considering power causes super-linear growth in the number of options (w.r.t. tree size)
  - Dual Vdd buffers => 2x options at each node
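The option bookkeeping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: an option is assumed to be a tuple of capacitance, power, RAT and a flag recording whether the nearest downstream buffer is VH, and the names `Option`, `Buffer`, `insert_buffer`, `dominates` and `prune` are hypothetical. The quadratic-time `prune` stands in for the pruning data structure of [Lillis, JSSC 96].

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Option:
    cap: float      # capacitance seen looking downstream
    power: float    # energy per switch accumulated downstream
    rat: float      # required arrival time at this node
    vh_below: bool  # True if the nearest downstream buffer is VH


@dataclass(frozen=True)
class Buffer:
    cin: float
    rout: float
    delay: float    # intrinsic delay
    power: float
    is_vh: bool


def insert_buffer(opt: Option, buf: Buffer) -> Optional[Option]:
    """Place buf upstream of opt; return None if that would create VL => VH."""
    if not buf.is_vh and opt.vh_below:
        return None
    rat = opt.rat - buf.delay - buf.rout * opt.cap
    return Option(buf.cin, opt.power + buf.power, rat, buf.is_vh)


def dominates(a: Option, b: Option) -> bool:
    """a is no worse in cap, power and RAT, and no more restrictive in Vdd."""
    return (a.cap <= b.cap and a.power <= b.power and a.rat >= b.rat
            and (not a.vh_below or b.vh_below))


def prune(options: List[Option]) -> List[Option]:
    """Keep only non-dominated options (simple O(n^2) filter)."""
    kept: List[Option] = []
    for o in options:
        if any(dominates(k, o) for k in kept):
            continue
        kept = [k for k in kept if not dominates(o, k)]
        kept.append(o)
    return kept
```

The Vdd flag takes part in dominance because an option with no VH buffer below it can accept either buffer type upstream, so it must not be discarded in favor of an otherwise equal option that only accepts VH.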

Speed-up Technique
- Approximate by power-delay sampling
- Sampling under each distinct cap value
  - Uniformly pick options from the entire RAT-power trade-off curve
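One way to read the sampling step is sketched below. It assumes options are `(cap, power, rat)` tuples, lays a uniform grid over the power and RAT ranges of each distinct cap value, and keeps one option per occupied grid cell. The rule for choosing the survivor within a cell (largest RAT, then smallest power) and the names `sample_options` and `cell_index` are assumptions made for illustration.

```python
from collections import defaultdict


def cell_index(x, lo, hi, grid):
    """Map x in [lo, hi] to one of `grid` uniform cells."""
    if hi == lo:
        return 0
    return min(grid - 1, int((x - lo) / (hi - lo) * grid))


def sample_options(options, grid=20):
    """Keep at most grid * grid options for each distinct cap value."""
    by_cap = defaultdict(list)
    for opt in options:
        by_cap[opt[0]].append(opt)

    sampled = []
    for group in by_cap.values():
        p_lo = min(o[1] for o in group)
        p_hi = max(o[1] for o in group)
        q_lo = min(o[2] for o in group)
        q_hi = max(o[2] for o in group)
        cells = {}
        for o in group:
            key = (cell_index(o[1], p_lo, p_hi, grid),
                   cell_index(o[2], q_lo, q_hi, grid))
            best = cells.get(key)
            # Assumed tie-break: prefer larger RAT, then smaller power.
            if best is None or (o[2], -o[1]) > (best[2], -best[1]):
                cells[key] = o
        sampled.extend(cells.values())
    return sampled
```

With a fixed grid, the number of options kept per cap value is bounded by a constant, which is what curbs the super-linear option growth at the cost of exactness.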

Experimental Settings for DVB
- Test cases: randomly generated Steiner trees
  - 20 to 800 terminals in a 1 cm x 1 cm routing area
  - Buffer sizes: 16x, 32x, 64x
- Sampling grid set to 20 x 20
- Comparison
  - Exact power-optimal algorithm (PB) [Lillis, JSSC 96]
  - Our algorithm with single (SVB) and dual (DVB) Vdd buffers

Sampling Preserves Optimality
- Sampling has little impact on optimality
  - SVB follows PB closely
  - Still optimal delay, 1.7% more power than PB

Dual Vdd Reduces Power
- Dual Vdd shifts the power-delay curve to the left

Experimental Results for DVB

Power at optimal RAT (fJ)
Net   # nodes   # sinks   SVB     DVB
S5    375       199       18699   13808 [-26%]
S6    515       299       23443   17239 [-26%]
S7    784       499       33552   23804 [-29%]
S8    1054      699       38351   25799 [-33%]
S9    1188      799       40228   26646 [-34%]
avg                               [-23%]

- DVB saves 23% power over SVB
  - More power saving in larger nets
  - Power saving becomes larger with delay slack
    - e.g. when delay is relaxed by 5%, the saving becomes 26%

Runtime

Runtime (s)
Net   # nodes   # sinks   PB        SVB     DVB
S5    375       199       719       86      212
S6    515       299       2121      139     371
S7    784       499       33419     393     635
S8    1054      699       > 1 day   598     1072
S9    1188      799       > 1 day   853     1859
avg                       1x        1/17x   1/7x

- SVB scales much better for larger test cases
  - Achieved 17x speedup over PB [Lillis, JSSC 96]
  - DVB takes ~2.5x more runtime than SVB

Outline
- Dual-Vdd buffer insertion and sizing (DVB)
  - Problem formulation
  - "Sampling" speed-up technique
  - Experimental results
- Dual-Vdd buffered tree generation (D-Tree)
  - Problem formulation
  - Improved augmented orthogonal search tree
  - Experimental results

D-Tree Formulation
- Dual Vdd Buffered Tree (D-Tree)
  - Given locations of terminals, buffer stations and blockages
  - Find a rectilinear Steiner tree (RST) and a buffer placement/size/Vdd assignment
    - VH buffers driving VL buffers only
    - Level converters at VH sinks driven by VL buffers
  - Minimize power subject to
    - Arrival time requirement at the source (RAT)
    - Slew rate constraint at buffer inputs and sinks
- D-Tree is NP-hard
  - Finding a minimum RST alone is NP-complete

Buffered Tree Construction
- Delay optimization only [Cong, DAC 00], by
  1. Building a Hanan graph with buffer insertion nodes according to the locations of buffer stations
  2. Path search on the grid by option propagation
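Step 1 can be illustrated with a small grid builder. It assumes the Hanan graph is formed from the x and y coordinates of the terminals, and that buffer station coordinates are simply added to the same coordinate sets. Blockage handling and the option propagation of step 2 are omitted, and the function name `hanan_graph` is hypothetical.

```python
from itertools import product


def hanan_graph(terminals, buffer_stations=()):
    """Return (nodes, edges) of the grid induced by the given points.

    terminals, buffer_stations: iterables of (x, y) coordinates.
    Edges connect horizontally or vertically adjacent grid nodes.
    """
    points = list(terminals) + list(buffer_stations)
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})

    nodes = set(product(xs, ys))
    edges = set()
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):
                edges.add(((x, y), (xs[i + 1], y)))   # horizontal edge
            if j + 1 < len(ys):
                edges.add(((x, y), (x, ys[j + 1])))   # vertical edge
    return nodes, edges
```

For example, two terminals at (0, 0) and (2, 3) plus one buffer station at (1, 1) give a 3 x 3 grid with 9 nodes and 12 edges.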

D-Tree Algorithm Overview
- Challenges
  - Growth of options is exponential
    - An artifact of D-Tree's NP-hardness
  - Considering power worsens option growth
- Solution: sampling + efficient prune tree

Prune Tree in [Lillis, JSSC 96]
- Options are inserted in sorted capacitance order
  - Never need to clear options out of the tree
    - If each new option is checked against the tree
    - Automatically avoids redundant options in the tree
    - e.g. Φnew = (c = 20, p = 100, q = 600)
[Figure: prune tree for p = 100 with nodes (c=10, q=500), (c=8, q=400), (c=7, q=380), (c=15, q=550), (c=12, q=520), (c=20, q=600)]
- Not applicable to the D-Tree problem
  - Order of new options is not known a priori

Our Improvement on Prune Tree
- Indexing by capacitance results in fewer trees
  - # capacitance values < # power values
- Efficient "tree cleaning"
  - Enables out-of-order option insertion
  - Guarantees no redundancy in the tree

Tree Cleaning
- To add an option Φnew in O(|c|·log(|T|)) time
  1. Check whether Φnew is dominated by any option in the data structure
  2. If not, remove the options in the tree dominated by Φnew in two downward tree traversals
  - e.g. Φnew = (c = 10, p = 70, q = 410, …)
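The two steps can be sketched with a simplified stand-in for the prune tree. Here each capacitance value indexes a sorted list of `(power, rat)` pairs instead of a balanced search tree, so the dominance check is logarithmic per capacitance value but deletion is not. This shows the out-of-order insertion logic only and does not reproduce the stated time bound. The class name `PruneStore` and its methods are hypothetical.

```python
import bisect
from collections import defaultdict


class PruneStore:
    """Non-dominated (cap, power, rat) options with out-of-order insertion."""

    def __init__(self):
        # cap -> list of (power, rat) sorted by power; rat rises with power
        self.frontiers = defaultdict(list)

    def is_dominated(self, cap, power, rat):
        """Step 1: is there an option with cap' <= cap, p' <= p, q' >= q?"""
        for c, front in self.frontiers.items():
            if c > cap or not front:
                continue
            # The rightmost entry with power' <= power has the largest rat.
            i = bisect.bisect_right(front, (power, float("inf")))
            if i > 0 and front[i - 1][1] >= rat:
                return True
        return False

    def insert(self, cap, power, rat):
        if self.is_dominated(cap, power, rat):
            return False
        # Step 2 ("cleaning"): drop options with cap' >= cap, p' >= p, q' <= q.
        for c, front in self.frontiers.items():
            if c < cap:
                continue
            lo = bisect.bisect_left(front, (power, float("-inf")))
            hi = lo
            while hi < len(front) and front[hi][1] <= rat:
                hi += 1
            del front[lo:hi]
        bisect.insort(self.frontiers[cap], (power, rat))
        return True

    def options(self):
        return sorted((c, p, q) for c, f in self.frontiers.items() for p, q in f)
```

Because every insertion first checks for dominance and then cleans, the stored set stays free of redundant options regardless of the order in which options arrive.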

Experimental Settings for D-Tree
- Random test cases
  - All based on a random floorplan of 1 cm x 1 cm
  - Blockages ~30%, buffer stations ~1 mm apart
- Comparison
  - Delay-optimal tree (RMP) [Cong, DAC 00]
  - Ours with single (S-Tree) and dual (D-Tree) Vdd buffers

Experimental Results for D-Tree

Power at optimal RAT (pJ)
Net   # nodes   # sinks   RMP   S-Tree       D-Tree
T3    137       4         3.9   3.5 [-10%]   2.9 [-23%]
T4    261       5         4.9   4.4 [-13%]   3.1 [-37%]
T5    235       6         4.2   3.8 [-10%]   3.4 [-18%]
avg                             -7%          -18%

- Significant power saving over RMP
  - S-Tree: 7%, D-Tree: 18%
  - Larger saving for large test cases (e.g. T4)
- Handles up to 6-sink nets (T5 takes 23 mins)
  - Similar capability compared with delay-optimal approaches [Cong, DAC 00; Chen, ASP-DAC 02]

Conclusion
- Formulated dual Vdd buffer insertion/tree generation without level converters
- Proposed 2 speedup techniques
  - "Sampling" with negligible loss of optimality
  - "Improved prune tree" for solution pruning
- Applied to single-Vdd buffer insertion, 17x faster than existing work
- Large power saving over single Vdd buffering
  - 23% in buffer insertion: dual Vdd vs single Vdd
  - 18% in buffered tree: dual Vdd vs delay optimal

Future Work
- Speed up tree construction
- Slack allocation for more power reduction
  - Path-based buffer insertion [Sze, DAC 05]
    - Allocates slack along one interconnect path
    - Considers single Vdd buffers only
  - Chip-level FPGA dual Vdd assignment [Lin, DAC 05]
    - Fixed buffer locations, assign Vdd levels
    - Considers multiple critical paths
    - Solved as a linear programming problem