Min-Register Retiming Under Simultaneous Timing and Initial State Constraints

Min-Register Retiming Under Simultaneous Timing and Initial State Constraints
Aaron Hurst
Dec. 2007

Introduction
• Retiming is the structural relocation of registers such that output functionality is preserved
• Transformation with many means and many ends
  • Minimizing worst-case delay
  • Minimizing number of registers
  • Either of the above under constraints
  • Optimally or heuristically
  • Other…?
• In an industrial setting?
  1. Diminishing returns in combinational optimization
  2. Coming of age of sequential equivalence checking

Motivation
• Register minimization is uniquely valuable
  • Area and power reduction
    • Clock network: dynamic power, design effort
  • Testability: scan chain depth
  • Verification: state representation
• Must satisfy several constraints
  • Timing, initializability, congestion, electrical...
• Constrained minimum-register retiming is hard
  • Current solutions not scalable

Outline
1. Core problem
  • Unconstrained register minimization
2. Constraints: Timing
3. Constraints: Initializability
4. Other constraints

Flow-Based Register Minimization
A New Approach to Unconstrained Register Minimization

Background
• Register minimization an “original” problem in retiming
  • r(v): retiming lag
  • w_i(u, v): initial register count
• An instance of minimum-cost network flow [Goldberg 97]
• But we can do better...
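
The formula on this slide did not survive extraction. As a point of reference, the classical min-register retiming objective from the retiming literature can be written in the slide's notation as below. The symbol w_r for the retimed register count and the sign convention for r are assumptions, and fanout sharing is ignored here.

```latex
\begin{aligned}
\min_{r} \quad & \sum_{(u,v) \in E} w_r(u,v) \\
\text{where} \quad & w_r(u,v) = w_i(u,v) + r(v) - r(u) \\
\text{subject to} \quad & w_r(u,v) \ge 0 \quad \forall (u,v) \in E
\end{aligned}
```

Every constraint involves a difference of two lags, which is the structure that lets the problem be solved as a minimum-cost network flow.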

Orientation
• Consider one combinational frame of the circuit
  • A single directed acyclic graph of combinational logic
  • Nodes: logic gates
  • Edges: pair-wise net connections
  • Inputs: register outputs, primary inputs
  • Outputs: register inputs, primary outputs
[Figure labels: primary inputs, register outputs]

Cuts in a Frame
• Consider circuit w/o primary IOs and their transitive fan-in/out
  • Retiming = a complete cut of the DAG
  • Number of registers = [formula lost in extraction]
• Problem consists of finding minimum cut

Max-Flow Formulation
• Min-cut/Max-flow Duality
  • Edges in graph are assigned a capacity
  • Min-cut width = Max-flow through graph
  • Min-cut derived from residual flow
• Partition graph into {S, R} by source reachability
  • S = nodes with an augmenting path from source s
  • R = nodes with no augmenting path from source s
• Min-cut is not unique
  • Selects one with min movement of registers
[Figure labels: source, sink]
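
A minimal sketch of the residual-reachability step, assuming a tiny hand-made graph. The node numbering, capacities, and the function name `max_flow_min_cut` are illustrative and not taken from the slides, and the search is a plain Edmonds-Karp loop rather than whichever solver the author used.

```python
from collections import deque

def max_flow_min_cut(n, edges, s, t):
    """Edmonds-Karp max-flow on nodes 0..n-1; edges are (u, v, capacity)."""
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    flow = 0
    while True:
        # Breadth-first search over arcs with remaining residual capacity.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break  # no augmenting path left, so the flow is maximum
        # Walk back from the sink to collect the augmenting path.
        path = []
        v = t
        while v != s:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push
    # The last (failed) search marked exactly the nodes still reachable from s.
    S = {v for v in range(n) if parent[v] != -1}
    cut = [(u, v) for u, v, _ in edges if u in S and v not in S]
    return flow, S, cut

# Hypothetical frame: source 0 feeds gates 1 and 2, which merge at gate 3.
edges = [(0, 1, 1), (0, 2, 1), (1, 3, 1), (2, 3, 1), (3, 4, 1)]
flow, S, cut = max_flow_min_cut(5, edges, s=0, t=4)
print(flow, sorted(S), cut)
```

In this toy frame the two arcs leaving the source collapse into the single cut arc (3, 4), which is the kind of register saving the formulation looks for.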

Constraint Type #1
• What is the effect of unconstrained edges?
  • Never saturated; always present in residual graph
  • Destination node v is always reachable from the edge's source node u
  • Minimum cut will never lie between u and v
• A useful tool for constraining solution...
[Figure labels: u, v]

A Necessary Modification
• Min-cut guarantees every path will be cut at least once
• Retiming requires that every path is cut exactly once
[Figure labels: R, S, R1, R2, R3, R2’, R3’]
• Observation: to be cut more than once, a path must cross back over the cut from R → S
• Solution: Use unconstrained flow to prevent reverse edges
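
A small sketch of why unconstrained reverse arcs force every path to be cut exactly once. The four-node graph, its capacities, and the helper names are hypothetical, and a large integer `INF` stands in for an unconstrained capacity. Without the reverse arcs, the minimum cut crosses the path S → A → B → T twice; with them, it crosses it once.

```python
from collections import deque

INF = 10**9  # stands in for an "unconstrained" (never saturated) capacity

def source_side_of_min_cut(n, edges, s, t):
    """Run augmenting-path max-flow, then return the nodes reachable from s."""
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return {v for v in range(n) if parent[v] != -1}
        path = []
        v = t
        while v != s:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push

def crossings(path, side):
    """Count how often a path steps from the source side to the sink side."""
    return sum(1 for u, v in zip(path, path[1:]) if u in side and v not in side)

S, A, B, T = range(4)
forward = [(S, A, 1), (A, B, 10), (B, T, 1), (S, B, 10), (A, T, 10)]
# One unconstrained arc in the opposite direction of every forward arc.
reverse = [(v, u, INF) for u, v, _ in forward]

plain = source_side_of_min_cut(4, forward, S, T)
closed = source_side_of_min_cut(4, forward + reverse, S, T)
path = [S, A, B, T]
print(crossings(path, plain), crossings(path, closed))
```

The reverse arc B → A makes any cut with B on the source side and A on the sink side infinitely expensive, so the source side becomes closed under predecessors and no path can re-enter it.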

Fanout Sharing
• Nets were decomposed into flow arcs
  • False model of register count
  • Reality: one register per net / hyper-edge
  • “Fanout sharing”
• Introduce a structure to simulate fanout-sharing

Single Iteration
• What is the final flow graph?
  1. Reverse edges
  2. Fanout-sharing
• Unitary Flow Simplification
  • Binary marking scheme
  • Flow computed on original netlist

Multiple Frames
• Globally minimum solution requires moving registers beyond one frame
  • Corresponding min-cut may stretch across multiple combinational frames
  • Solution: Repeat over single frame
    • Terminate when no further change
    • Then, consider backward direction
• Final result is provably optimal
[Figure label: unrolled circuit]

Overall Algorithm
Flowchart (reconstructed from labels):
1. Start
2. Forward retiming: block fan-out cone of PIs; compute max-flow; if improvement, implement min-cut and repeat
3. Backward retiming: block fan-in cone of POs; compute max-flow; if improvement, implement min-cut and repeat
4. Done
• Forward retiming is preferred due to initial state computation

Asymptotic Analysis
• Single iteration runtime limited by maximum flow solver [Goldberg 95]
• Or, using unitary flow simplification…
• Total number of iterations is bounded by |R|

Experimental Results
• Applied to {ISCAS, ITC, OpenCores, Altera} benchmarks...
[Figure: Register Savings per Iteration]
• The number of iterations is quite small
• Register count is monotonically decreasing
  • Runtime can be bounded
• Runtime is 5x faster than minimum-cost formulation
  • <0.01 s for 70% of benchmarks

Summary
Key points:
1. Optimal
2. Minimum register movement
3. Fast... both absolutely and relatively
4. Scalable: early termination with improvement

Timing Constraints

Background
• Timing constraints make problem much harder
  • D(u, v): path delay
  • W(u, v): path register count
• Complexity: pair-wise path delay constraints
  • Enumeration alone is O(n^3)
• Simplification: prune unnecessary constraints
  • Minaret: use skew-equivalence to find ASAP and ALAP register positions [Sapatnekar 99]
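
To make the O(n^3) enumeration concrete, here is a sketch of the classical all-pairs computation of W and D from the retiming literature, followed by the period constraints r(u) - r(v) <= W(u, v) - 1 for every pair with D(u, v) above the target period. The slide's own formula was lost in extraction, so the constraint form, the function names, and the three-node ring are assumptions for illustration.

```python
INF = float("inf")

def path_matrices(n, edges, delay):
    """Return (W, D): fewest registers per node pair, and the largest delay
    among the paths with that register count (both endpoints included).

    edges: (u, v, registers) triples; delay[v]: delay of node v.
    Assumes every cycle holds at least one register.
    """
    # Each entry is (registers, -delay of the nodes before the endpoint), so a
    # lexicographic minimum prefers fewer registers, then more delay.
    best = [[(INF, 0)] * n for _ in range(n)]
    for v in range(n):
        best[v][v] = (0, 0)
    for u, v, regs in edges:
        best[u][v] = min(best[u][v], (regs, -delay[u]))
    for k in range(n):  # Floyd-Warshall: three nested loops over the nodes
        for i in range(n):
            if best[i][k][0] == INF:
                continue
            for j in range(n):
                if best[k][j][0] == INF:
                    continue
                via = (best[i][k][0] + best[k][j][0],
                       best[i][k][1] + best[k][j][1])
                if via < best[i][j]:
                    best[i][j] = via
    W = [[best[u][v][0] for v in range(n)] for u in range(n)]
    D = [[delay[v] - best[u][v][1] if W[u][v] < INF else None
          for v in range(n)] for u in range(n)]
    return W, D

def period_constraints(W, D, period):
    """List (u, v, bound), each meaning r(u) - r(v) <= bound."""
    n = len(W)
    return [(u, v, W[u][v] - 1) for u in range(n) for v in range(n)
            if D[u][v] is not None and D[u][v] > period]

# Hypothetical ring 0 -> 1 -> 2 -> 0 with two registers on the closing edge.
W, D = path_matrices(3, [(0, 1, 0), (1, 2, 0), (2, 0, 2)], delay=[1, 2, 3])
constraints = period_constraints(W, D, period=5)
print(constraints)
```

Even on this ring, three of the six node pairs produce a constraint, which hints at why pruning unnecessary constraints matters for larger designs.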

Conservative Constraints
• Consider retiming a register
  • Two timing constraints made potentially critical in each direction
[Figure labels: max arrival, min arrival]
• If other end of timing constraints is fixed...
  • Bound on absolute positions of register
  • All such constraints can be computed with two-pass STA
    • Linear time
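
A sketch of a two-pass static timing analysis over one combinational frame, assuming the nodes are already in topological order. The function name, the dictionary-based netlist, and the delays are hypothetical, and only the max (long-path) quantities are shown; a min-arrival pass would have the same shape. How the source turns these numbers into register position bounds is not shown on the slide.

```python
def two_pass_sta(order, fanins, delay):
    """Longest-path timing on a DAG in time linear in nodes plus edges.

    order: nodes in topological order; fanins[v]: predecessors of v.
    Returns (arrival, departure): arrival[v] is the longest delay from any
    frame input up to and including v, and departure[v] is the longest
    delay strictly after v to any frame output.
    """
    arrival = {}
    for v in order:  # forward pass
        arrival[v] = delay[v] + max((arrival[u] for u in fanins[v]), default=0)
    departure = {v: 0 for v in order}
    for v in reversed(order):  # backward pass
        for u in fanins[v]:
            departure[u] = max(departure[u], delay[v] + departure[v])
    return arrival, departure

# Hypothetical three-gate frame: a feeds b and c, b feeds c.
order = ["a", "b", "c"]
fanins = {"a": [], "b": ["a"], "c": ["a", "b"]}
delay = {"a": 1, "b": 2, "c": 1}
arrival, departure = two_pass_sta(order, fanins, delay)
print(arrival, departure)
```

With the far end of each path held fixed, arrival and departure at a node bound the delay seen by a register placed there, which is one way to read "bound on absolute positions".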

Exact Constraints
• Fixing other end of timing path is conservative
  • May also move in the same direction, relaxing constraint
• If other end of timing constraints is not fixed...
  • Conditional constraints: “Can retime R2 past v2 only if R1 is retimed past v1”
[Figure labels: R1, v1, R2, v2; unit delay, max delay 3]
• Computed from edge of bounded transitive fan-in/out cone
  - delay
  - register count

Constraint Implementation
• Conservative Constraints:
  • Indicate nodes to be removed from problem
• Exact Constraints:
  • Implemented as unconstrained edges
  • Cut can only move beyond v2 if it moves beyond v1
[Figure labels: v1, v2]

Refinement
• All timing constraints met by initial circuit
  • Guarantees flow from source to sink remains finite
• Iteratively tighten conservative constraints into exact ones
• Simplification: Only constraints limiting area improvement
[Flowchart, labels as extracted: C_cons = all, C_exact = [lost]; compute cut with C_cons; compute cut w/o C_cons; any? (y: convert c_cons between two cuts into c_exact; n)]
[Figure legend: min exact+cons, Conservative Constraint]

Building Intuition
[Figure: unit delay, max delay 3; legend: No Constraint, Conservative Constraint, Exact Constraint, min exact+cons]
• Constraints impose relations across multiple clock cycles

Building Intuition
[Figure: unit delay, max delay 2; legend: No Constraint, Conservative Constraint, Exact Constraint, min exact+cons]
• Cycles in constraints lock retiming moves to be ‘in-step’

Experimental Results
• Max path delay ≤ initial period, min path delay ≥ 0
• Average number of exact constraints = 1.1% of design size

Summary
Key points:
1. Inherits benefits of flow-based retiming
  1. Optimal
  2. Fast
  3. Monotonic improvement
2. Problem reduced using both timing and area criticality
3. Advanced timing model and constraints
4. Scalable: intermediate solutions are timing-feasible

Initializability Constraints

Problem
• Retimed circuit must preserve initialization behavior
• Accomplished by:
  1. Additional combinational logic
  2. Identifying an equivalent initial state
    • forward retiming: simulation
    • backward retiming: SAT
• Backward retiming jeopardizes initializability

Background
• How to transform an uninitializable circuit into an initializable one?
  • ‘Prayer’: minimizing register movement maximizes initializability [Pan 99]
  • ‘Slash and Burn’: incrementally tighten bound on negative retiming lag [Stok 95]
  • ‘Brute Force’: mixed-integer linear program [Sapatnekar 97]

Feasibility Constraints
• SAT problem with variables [symbols lost in extraction]
• Feasibility Constraint [symbol lost in extraction]:
  • Partial cut of unrolled circuit
  • Sufficient to imply infeasibility
    • At least one element must be removed
  • UNSAT with only variables in TFO( )
• Built incrementally as retiming progresses
  • Variables switched off in SAT with additional per-clause variables z

Feasibility Constraints
• Topologically order circuit
  • Fast / Faster: binary search on v [one of the two descriptions was lost in extraction]
[Flowchart labels: SAT? (y / n)]
• Incremental SAT
• UNSAT core can be used to localize conflict
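
A sketch of the binary search idea, under the assumption that the search runs over a position v in the topological order and that infeasibility is monotone in v. The callback `is_sat` is a hypothetical stand-in for an incremental SAT query; no real solver API is used here, and the slide does not spell out exactly what is searched.

```python
def first_infeasible_prefix(num_nodes, is_sat):
    """Smallest k in 1..num_nodes for which is_sat(k) is False.

    Assumes the empty prefix is satisfiable, is_sat(num_nodes) is False,
    and once a prefix is unsatisfiable every longer prefix is too.
    """
    lo, hi = 0, num_nodes  # invariant: prefix lo is SAT, prefix hi is UNSAT
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_sat(mid):
            lo = mid
        else:
            hi = mid
    return hi

# Toy oracle: pretend the conflict first appears at the 7th node.
calls = []
def toy_is_sat(k):
    calls.append(k)
    return k < 7

boundary = first_infeasible_prefix(10, toy_is_sat)
print(boundary, calls)
```

The toy run needs three oracle calls instead of up to ten for a linear scan, which is the appeal when each call is a SAT query.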

Constraint Type #2: Penalty Structure
• Adds exactly one unit flow path
  • Delayed insertion until [symbol lost in extraction] comes into frame
• Biases against cuts below feasibility constraint
  • Closest cut of width +1 returned
  • Cut is squeezed forward
• New cut of width +1 is closest and therefore most initializable
  • If it isn’t... additional penalty needed
[Figure labels: source, sink]

Algorithm
[Flowchart, labels as extracted: C_feas = [lost]; compute min-cut constrained by C_feas; initializable? (y / n); binary search on v; SAT (y / n); C_feas = [lost]]
• Complexity: already NP-complete from test for initializability

Experimental Results
• Equivalent init. state after min-reg retiming for most designs
  • Only one design was not initializable: s400
  • Can be easily lost if backward retiming invoked multiple times
• A harder problem: randomized initial states

Name         Orig. Gates  Orig. Regs  Min-Register Regs (Infeasible)  Feasible  Avg. | |  Runtime
s400         0.3k         21          18                              +1        8.0       0.08 s
oc_aes_core  16.6k        402         395                             +3        2.0       2.55 s
oc_vga_lcd   17.1k        1108        1087                            +1                  1.09 s
nut_003      6.6k         484         450                             +3        1.0       1.41 s
radar12      71.1k        3875        3771                            +27       2.3       108.3 s
oc_wb_dma    29.2k        1775        1757                            +2        3.5       5.70 s
oc_minirisc  3.9k         289         271                             +2        1.0       0.49 s

Summary
Key points:
1. Optimal
2. Compatible with timing constraints
3. Worst-case bound non-polynomial, but fast in practice

Additional Applications
• Physical constraints
  • Placement congestion: penalty structures
• Electrical constraints
  • Capacitive load on clock network drivers: penalty structures
• Others?

Contribution
• New formulation of register minimization problem
• Constraints of different forms can be added to problem
  1. Timing
  2. Initializability
  3. Other
  • Improves upon best practices within each sub-problem
• Unified solution to synthesis-ready retiming
  • Scalable