Technology Mapping with Choices, Priority Cuts, and Placement-Aware Heuristics Alan Mishchenko UC Berkeley

Overview
(1) Introduction
(2) Technology mapping
(3) Priority cuts
(4) Structural choices
(5) Tuning mapping for placement
(6) Other applications

(1) Introduction
- Terminology
- And-Inverter Graphs
- Technology mapping in a nutshell

Terminology
- Logic network
  - Primary inputs/outputs (PIs/POs)
  - Logic nodes
  - Fanins/fanouts
  - Transitive fanin/fanout cone (TFI/TFO)
- Structural cut of a node
  - A cut is a boundary in the network separating the node from the PIs
  - Boundary nodes are the leaves
  - The node is the root
  - A K-feasible cut has K or fewer leaves
  - The function of the cut is the function of the root in terms of the leaves
[Figure: a logic network showing primary inputs and outputs, the TFI and TFO of a node, its fanins and fanouts, and a cut with its root and leaves]

AIG Definition and Examples
An AIG is a Boolean network composed of two-input ANDs and inverters.
- F(a, b, c, d) = ab + d(ac' + bc): 6 nodes, 4 levels
- F(a, b, c, d) = ac'(b'd')' + bc(a'd')' = ac'(b + d) + bc(a + d): 7 nodes, 3 levels
[Figure: Karnaugh map of F over (a, b) and (c, d), and the two AIG structures for the expressions above]
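The two expressions for F describe the same function with different AND/inverter structures. The sketch below checks this by exhaustive simulation. The helper names (`and2`, `inv`, `or2`, `form1`, `form2`, `equivalent`) are illustrative and not part of the slides; OR is expressed through AND and inversion, as it would be in an AIG.

```python
from itertools import product

def and2(x, y):
    return x and y

def inv(x):
    return not x

def or2(x, y):
    # x + y = (x'y')', so only ANDs and inverters are used
    return inv(and2(inv(x), inv(y)))

def form1(a, b, c, d):
    # F = ab + d(ac' + bc)
    return or2(and2(a, b), and2(d, or2(and2(a, inv(c)), and2(b, c))))

def form2(a, b, c, d):
    # F = ac'(b'd')' + bc(a'd')'
    left = and2(and2(a, inv(c)), inv(and2(inv(b), inv(d))))
    right = and2(and2(b, c), inv(and2(inv(a), inv(d))))
    return or2(left, right)

def equivalent(f, g, num_inputs=4):
    # Compare the two functions on every input assignment
    return all(bool(f(*v)) == bool(g(*v))
               for v in product([False, True], repeat=num_inputs))

same = equivalent(form1, form2)
```

Exhaustive simulation is only practical for small input counts; it is used here because F has four inputs.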

Mapping in a Nutshell
- AIGs represent logic functions
  - A good subject graph for mapping
- Technology mapping expresses logic functions to be implemented
  - Uses a description of a technology
- Technology
  - Primitives with delay, area, etc.
- Structural mapping
  - Computes a cover of the AIG using primitives of the technology
- Cut-based structural mapping
  - Computes cuts for each AIG node
  - Associates each cut with a primitive
  - Selects a cover with a minimum cost
- Structural bias
  - Good mapping cannot be found because of the poor AIG structure
- Overcoming structural bias
  - Need to map over a number of AIG structures (leads to choice nodes)
[Figure: an AIG over primary inputs a to e with primary output f, the mapped network of LUTs, and a choice node]

(2) Technology Mapping
- Traditional LUT mapping
  - Delay-optimal mapping
  - Area recovery
- Drawbacks of the traditional mapping
  - Excessive memory and runtime
  - Structural bias
- Ways to mitigate the drawbacks
  - Priority cuts
  - Structural choices

Traditional LUT Mapping Algorithm
Input: And-Inverter Graph
1. Compute K-feasible cuts for each node
2. Compute best arrival time at each node
   - In topological order (from PI to PO)
   - Compute the depth of all cuts and choose the best one
3. Perform area recovery
   - Using area flow
   - Using exact local area
4. Choose the best cover
   - In reverse topological order (from PO to PI)
Output: Mapped Netlist

Delay-Optimal Mapping
- Input:
  - AIG and K-cuts computed for all nodes
- Algorithm:
  - For all nodes in a topological order
    - Compute arrival time of each cut using fanin arrival times
    - Select one cut with min arrival time
    - Set the arrival time of the node to be the arrival time of this cut
- Output:
  - Delay-optimal mapping for all nodes
Example (cut size K = 3): cut {pqr} of node f has arrival time 3; cut {stu} of node f has arrival time 2.
[Figure: node f over inputs a to f with two alternative cuts, {pqr} and {stu}, annotated with arrival times]
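The arrival-time pass above can be sketched as follows, assuming a unit delay per LUT and assuming the cuts are already available. The function name, the dictionary layout, and the small network (p = AND(a, b), q = AND(b, c), n = AND(p, q)) are illustrative choices, not taken from the slides.

```python
def delay_optimal_mapping(pis, cuts, lut_delay=1):
    """Pick a minimum-arrival cut at every node.

    pis:  iterable of primary-input names (arrival time 0).
    cuts: dict mapping each internal node to a list of its non-trivial
          cuts (sets of leaves); dict order must be topological.
    """
    arrival = {pi: 0 for pi in pis}
    best = {}
    for node, node_cuts in cuts.items():
        # Arrival time of a cut is determined by its latest leaf
        best_cut = min(node_cuts, key=lambda cut: max(arrival[l] for l in cut))
        best[node] = best_cut
        arrival[node] = lut_delay + max(arrival[l] for l in best_cut)
    return arrival, best

cuts = {
    "p": [frozenset("ab")],
    "q": [frozenset("bc")],
    "n": [frozenset("pq"), frozenset("pbc"), frozenset("abq"), frozenset("abc")],
}
arrival, best = delay_optimal_mapping("abc", cuts)
```

In this network the cut {a, b, c} gives node n an arrival time of one LUT delay, while every cut that uses p or q as a leaf costs two.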

Area Recovery During Mapping
- Delay-optimal mapping is performed first
  - Best match is assigned at each node
  - Some nodes are used in the mapping; others are not used
- Arrival and required times are computed for all AIG nodes
  - Required time for all used nodes is determined
  - If a node is not used, its required time is set to +infinity
  - Slack is the difference between required time and arrival time
- If a node has positive slack, its current best match can be updated to reduce the total area of mapping
  - This process is called area recovery
- Exact area recovery is exponential in the circuit size
  - A number of area recovery heuristics can be used
- Heuristic area recovery is iterative
  - Typically involves 3-5 iterations
- Next, we discuss cost functions used during area recovery
  - They are used to decide what is the best match at each node
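A minimal sketch of the required-time and slack computation described above, assuming a unit LUT delay and assuming every primary output is required at the latest output arrival time. All names and the small example network are hypothetical.

```python
import math

def required_times(order, best, pos, arrival, lut_delay=1):
    """Propagate required times from the POs through the selected cuts.

    order:   all nodes (PIs and internal) in topological order.
    best:    internal node -> currently selected cut (set of leaves).
    pos:     nodes driving primary outputs.
    arrival: node -> arrival time from the delay-optimal pass.
    """
    target = max(arrival[po] for po in pos)
    required = {n: math.inf for n in order}  # unused nodes stay at +infinity
    for po in pos:
        required[po] = target
    for node in reversed(order):
        if required[node] == math.inf or node not in best:
            continue  # node is not used in the mapping, or is a PI
        for leaf in best[node]:
            required[leaf] = min(required[leaf], required[node] - lut_delay)
    slack = {n: required[n] - arrival[n] for n in order}
    return required, slack

order = ["a", "b", "c", "d", "p", "q", "s", "r"]
best = {"p": {"a", "b"}, "q": {"b", "c"}, "s": {"c", "d"}, "r": {"p", "c"}}
arrival = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 1, "q": 1, "s": 1, "r": 2}
required, slack = required_times(order, best, ["r", "s"], arrival)
```

Here node s has a slack of 1, so its match could be changed during area recovery without hurting delay, while q is unused and keeps a required time of +infinity.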

How to Measure Area?
Suppose we use the naïve definition (assuming that each LUT has one unit of area):
  area(cut) = 1 + [ Σ area(fanin) ]
- Area of cut {pcd} = 1 + [1 + 0] = 2
- Area of cut {abq} = 1 + [0 + 1] = 2
The naïve definition says both cuts are equally good in area.
The naïve definition ignores sharing due to multiple fanouts.
[Figure: network with outputs x and y, internal nodes p, q, r, and inputs a to f, showing cuts {pcd} and {abq} of node x]

Area-flow
  area-flow(cut) = 1 + [ Σ ( area-flow(fanin) / fanout_num(fanin) ) ]
- Area-flow of cut {pcd} = 1 + [1 + 0] = 2
- Area-flow of cut {abq} = 1 + [0/1 + 1/2] = 1.5
Area-flow recognizes that cut {abq} is better.
Area-flow “correctly” accounts for sharing (Cong ’99, Manohararajah ’04).
[Figure: the same network with outputs x and y, where node q is shared by x and y]
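The two area-flow values above can be reproduced with a few lines of code. The per-node values are assumptions read off the slide's arithmetic: primary inputs have area-flow 0, p and q each have area-flow 1, p has one fanout, and q has two fanouts.

```python
def area_flow(cut, node_area_flow, fanout_num):
    # One LUT for the root, plus each leaf's area-flow shared among its fanouts
    return 1 + sum(node_area_flow[leaf] / fanout_num[leaf] for leaf in cut)

# Assumed values for the example network (PI fanout counts do not matter
# here because the area-flow of a PI is 0)
node_area_flow = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 1, "q": 1}
fanout_num = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 2}

af_pcd = area_flow({"p", "c", "d"}, node_area_flow, fanout_num)
af_abq = area_flow({"a", "b", "q"}, node_area_flow, fanout_num)
```

Dividing by the fanout count is what lets the cut through the shared node q come out cheaper than the cut through p.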

Exact Local Area
  exact-local-area(cut) = 1 + [ Σ exact-local-area(fanin with no other fanout) ]
- Cut {pef}
  - Area flow = 1 + [(.25 + .25 + 3)/2] = 2.75
  - Exact area = 1 + 0 (p is used elsewhere)
  - Exact area will choose this cut.
- Cut {stq}
  - Area flow = 1 + [.25 + .25 + 1] = 2.5
  - Exact area = 1 + 1 = 2 (due to q)
  - Area flow will choose this cut.
[Figure: root f with two alternative cuts, {pef} and {stq}, over inputs a to f, with nodes annotated by their area]
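One common way to evaluate exact local area is to reference the cut, count the LUTs that become newly used, and then dereference it to restore the counters. The sketch below follows that idea; it is not taken from the slides, and the network, names, and reference counts are hypothetical, chosen to mirror the "1 + 0" and "1 + 1" outcomes above.

```python
def exact_local_area(cut, best, refs):
    """Number of LUTs added to the mapping if `cut` is selected at a root.

    best: internal node -> currently selected cut (PIs are absent).
    refs: node -> how many selected LUTs currently use the node.
    The root is assumed to be dereferenced before the call.
    """
    def ref(leaves):
        area = 1  # the LUT implementing this cut
        for leaf in leaves:
            if refs[leaf] == 0 and leaf in best:
                area += ref(best[leaf])  # leaf LUT becomes newly used
            refs[leaf] += 1
        return area

    def deref(leaves):
        for leaf in leaves:
            refs[leaf] -= 1
            if refs[leaf] == 0 and leaf in best:
                deref(best[leaf])

    area = ref(cut)
    deref(cut)  # restore the reference counters
    return area

best = {"p": ("a", "b"), "s": ("a", "c"), "t": ("b", "d"), "q": ("c", "d")}
# p, s, t are already used by other LUTs; q is not used anywhere
refs = {"a": 2, "b": 2, "c": 1, "d": 1, "e": 0, "g": 0,
        "p": 1, "s": 1, "t": 1, "q": 0}
before = dict(refs)
area_peg = exact_local_area(("p", "e", "g"), best, refs)
area_stq = exact_local_area(("s", "t", "q"), best, refs)
```

The cut through p costs only the root LUT because p is already paid for, while the cut through q must also pay for q's LUT.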

Area Recovery Summary
- Area recovery heuristics
  - Area-flow (global view)
    - Chooses cuts with better logic sharing
  - Exact local area (local view)
    - Minimizes the number of LUTs by looking one node at a time
- The results of area recovery depend on
  - The order of processing nodes
  - The order of applying the two passes
  - The number of iterations
  - Implementation details
- This scheme works for the constant-delay model
  - Any change off the critical path does not affect the critical path

Drawbacks of Traditional Mapping
- Excessive memory and runtime requirements
  - Exhaustive cut enumeration leads to many cuts (especially for larger K, around 6 and above)
- Structural bias
  - The structure of the subject graph does not allow a good mapping to be found

Excessive Memory and Runtime
- For large designs, there may be too many K-feasible cuts
  - A 1M-node AIG has ~50M 6-cuts
  - Requires ~2 GB of storage memory and takes ~30 sec to compute
- Past ways of tackling the problem
  - Detect and remove dominated cuts
    - Does not help much
  - Perform cut pruning (store N cuts/node)
    - Throws away useful cuts even if N = 1000
  - Store only cuts on the frontier
    - Reduces memory but increases runtime

  K    Average number of cuts per node
  4    6
  5    25
  6    50
  7    120
  8    250

Structural Bias
- Consider mapping a 4:1 MUX into 4-LUTs
  - The naïve approach results in 3 LUTs
  - After logic structuring, a mapping with 2 LUTs can be found

Ways to Mitigate the Drawbacks
- Excessive memory and runtime requirements
  - Compute only a small number of “useful” cuts
    - Leads to mapping with priority cuts
- Structural bias
  - Perform mapping over multiple circuit structures
    - Leads to mapping with structural choices

(3) Priority Cuts
- Structural cuts
- Exhaustive cut enumeration
- Prioritizing cuts
- Implementation tricks

Structural Cuts in AIG
A cut of a node n is a set of nodes in its transitive fanin such that every path from the node to the PIs is blocked by nodes in the cut.
A k-feasible cut has no more than k leaves.
The set {pbc} is a 3-feasible cut of node n. (It is also a 4-feasible cut.)
k-feasible cuts are important in LUT mapping because the logic between root n and the cut leaves {pbc} can be replaced by a k-LUT.
[Figure: node n with fanins p and q over inputs a, b, c]

Exhaustive Cut Enumeration
Computation is done bottom-up:
- Cuts of a, b, c: {a}, {b}, {c}
- Cuts of p: { p, ab }
- Cuts of q: { q, bc }
- Cuts of n: { n, pq, pbc, abq, abc }
The set of cuts of a node is a ‘cross product’ of the sets of cuts of its children. Any cut of size greater than k is discarded.
(P. Pan et al, FPGA ’98; J. Cong et al, FPGA ’99)
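The bottom-up cross-product computation can be sketched as below for the three-node example (p = AND(a, b), q = AND(b, c), n = AND(p, q)). The function name and data layout are illustrative; dominated-cut filtering is left out to keep the sketch short.

```python
def enumerate_cuts(pis, fanins, k):
    """Enumerate all k-feasible cuts of every node.

    pis:    iterable of primary-input names.
    fanins: dict mapping each AND node to its (left, right) fanins;
            dict order must be topological.
    """
    cuts = {pi: [frozenset([pi])] for pi in pis}
    for node, (left, right) in fanins.items():
        merged = {frozenset([node])}  # the trivial cut of the node itself
        for cut_left in cuts[left]:
            for cut_right in cuts[right]:
                cut = cut_left | cut_right
                if len(cut) <= k:  # discard cuts with more than k leaves
                    merged.add(cut)
        cuts[node] = sorted(merged, key=lambda c: (len(c), sorted(c)))
    return cuts

fanins = {"p": ("a", "b"), "q": ("b", "c"), "n": ("p", "q")}
cuts = enumerate_cuts("abc", fanins, k=3)
```

For node n this produces the five cuts listed above: {n}, {pq}, {pbc}, {abq}, and {abc}.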

Cut Filtering
Bottom-up cut computation in the presence of re-convergence might produce dominated cuts.
Cut {a, b, c} dominates cut {a, d, b, c}.
- The “good” cut {abc} is present (so not a quality issue)
- But the “bad” cut {adbc} may be propagated further (so a run-time issue)
- It is important to discard dominated cuts quickly
[Figure: re-convergent network over inputs a, b, c with internal nodes d, e, f and root x; the cut sets shown are {.. {adbc} .. {abc} ..} and {.. {dbc} .. {abc} ..}]

Signature-Based Cut Filtering
Problem: Given two cuts, how to quickly determine whether one can be a subset of another.
Solution: The signature of a cut c is a 32-bit integer defined as
  sig(c) = Σ_{n ∈ c} 2^(ID(n) mod 32)   (Σ means bit-wise OR)
where ID(n) is the integer id of node n.
Observation: If cut c1 dominates cut c2, then sig(c1) OR sig(c2) = sig(c2).
Signature checking is a quick test for the most common case, when a cut does not dominate another. Only if the signatures cannot rule out dominance is an actual comparison performed.

Example
- Let the node IDs be a = 1, b = 2, c = 3, d = 4
- Let c1 = {a, b, c} and c2 = {a, d, b, c}
- sig(c1) = 2^1 OR 2^2 OR 2^3 = 0001 OR 0010 OR 0100 = 0111
- sig(c2) = 2^1 OR 2^4 OR 2^2 OR 2^3 = 0001 OR 1000 OR 0010 OR 0100 = 1111
- As sig(c1) OR sig(c2) ≠ sig(c1), c2 does not dominate c1
- But sig(c1) OR sig(c2) = sig(c2), so c1 may dominate c2
(The bit strings show only the bits for IDs 1 to 4, with the bit for ID 1 rightmost.)
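A small sketch of the signature test, using the node IDs from the example. The function names are illustrative. The extra node with ID 33 is an assumption added to show why a passing signature test still needs the exact subset comparison: IDs 1 and 33 map to the same bit.

```python
def signature(cut, node_id):
    # Bit-wise OR of one bit per leaf, selected by the leaf ID modulo 32
    sig = 0
    for n in cut:
        sig |= 1 << (node_id[n] % 32)
    return sig

def may_dominate(sig1, sig2):
    # Necessary (but not sufficient) condition for cut 1 to dominate cut 2
    return (sig1 | sig2) == sig2

def dominates(c1, c2, node_id):
    if not may_dominate(signature(c1, node_id), signature(c2, node_id)):
        return False  # fast path: signatures rule out dominance
    return set(c1) <= set(c2)  # exact check only when signatures agree

node_id = {"a": 1, "b": 2, "c": 3, "d": 4, "z": 33}
c1, c2 = {"a", "b", "c"}, {"a", "d", "b", "c"}

c1_dominates_c2 = dominates(c1, c2, node_id)
c2_dominates_c1 = dominates(c2, c1, node_id)
aliased = may_dominate(signature({"z"}, node_id), signature({"a"}, node_id))
```

Most cut pairs fail the signature test and are rejected with two bit-wise operations, which is why the filter pays off.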

Experiment with K-Cut Computation
C/N is the number of cuts per node; T is time in seconds; L/N is the ratio of nodes with the number of cuts exceeding the limit (N = 1000); for K < 8, the number of cuts did not exceed 1000.

Computing Priority Cuts
- Consider nodes in a topological order
  - At each node, merge two sets of fanin cuts (each containing C cuts), resulting in (C+1) * (C+1) + 1 cuts
  - Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts
  - Select one best cut, and use it to map the node
- Sorting criteria
  - The tie-breaking criterion denoted “fanin refs” means “prefer cuts with larger average fanin reference counters”.
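The merge, sort, and truncate loop can be sketched as follows. The cost function used here (smaller depth first, then fewer leaves) is only an assumed stand-in for the sorting criteria, and all names are illustrative. Each node stores at most C ranked cuts plus its trivial cut, so merging two fanins yields at most (C+1) * (C+1) candidates.

```python
def priority_cuts(pis, fanins, k, c):
    """Keep at most c ranked cuts (plus the trivial cut) at every node.

    fanins: dict mapping each AND node to its (left, right) fanins;
            dict order must be topological.
    """
    depth = {pi: 0 for pi in pis}
    cuts = {pi: [frozenset([pi])] for pi in pis}
    best = {}
    for node, (left, right) in fanins.items():
        candidates = set()
        for cut_left in cuts[left]:
            for cut_right in cuts[right]:
                cut = cut_left | cut_right
                if len(cut) <= k:
                    candidates.add(cut)

        def cost(cut):
            # Assumed criteria: depth first, then cut size, then names
            return (1 + max(depth[l] for l in cut), len(cut), sorted(cut))

        ranked = sorted(candidates, key=cost)[:c]  # select the c best cuts
        best[node] = ranked[0]                     # the cut used for mapping
        depth[node] = 1 + max(depth[l] for l in ranked[0])
        cuts[node] = ranked + [frozenset([node])]  # passed on to the fanouts
    return cuts, best, depth

fanins = {"p": ("a", "b"), "q": ("b", "c"), "n": ("p", "q")}
cuts, best, depth = priority_cuts("abc", fanins, k=3, c=2)
```

With C = 2, node n keeps only {abc} and {pq} out of its four non-trivial cuts, and it is mapped with {abc}.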

Priority Cuts: A Bag of Tricks
- Compute and use priority cuts (a subset of all cuts)
- Dynamically update the cuts in each mapping pass
- Use different sorting criteria in each mapping pass
- Include the best cut from the previous pass into the set of candidate cuts of the current pass
- Consider several depth-oriented mappings to get a good starting point for area recovery
- Use complementary heuristics for area recovery
- Perform cut expansion as part of area recovery
- Use efficient memory management

Priority-Cut-Based Mapping
Input: And-Inverter Graph
1. Compute K-feasible cuts for each node
2. Compute arrival time at each node
   - In topological order (from PI to PO)
   - Compute the depth of all cuts and choose the best one
   - Compute at most C good cuts and choose the best one
3. Perform area recovery
   - Using area flow
   - Using exact local area
   - In each iteration, re-compute at most C good cuts and choose the best one
4. Choose the best cover
   - In reverse topological order (from PO to PI)
Output: Mapped Netlist

Complexity Analysis
- The worst-case complexity of traditional mapping
  - FlowMap: O(Kmn) (J. Cong et al, TCAD ’94)
  - CutMap: O(2Kmn^K) (J. Cong et al, FPGA ’95)
  - DAOmap: O(Kn^K) (J. Cong et al, ICCAD ’04)
- Mapping with priority cuts
  - O(KC^2 n)
- Notation: K is max cut size, C is max number of cuts, n is number of nodes, m is number of edges

(4) Structural Choices
- Structural bias
- Ways to overcome structural bias
  - Need some form of (re)synthesis to get multiple circuit structures
    - Computing and using several synthesis snapshots
    - Running several scripts and combining the resulting networks
    - Performing Boolean decomposition during mapping
  - Multiple circuit structures = structural choices
- Questions:
  - How to efficiently detect and store structural choices?
  - How to perform technology mapping with structural choices?

Structural Bias
The mapped netlist very closely resembles the subject graph.
Every input of every LUT in the mapped netlist must be present in the subject graph; otherwise technology mapping will not find the match.
[Figure: subject graph with nodes p and m over inputs a to e, and the LUT netlist produced from it by technology mapping]

Example of Structural Bias
A better match may not be found.
Since the point q is not present in the subject graph, the match on the right is not found.
[Figure: subject graph with nodes p and m over inputs a to e, its LUT mapping, and a better LUT mapping that uses point q]

Example of Structural Bias
The better match can be found with a different subject graph.
[Figure: synthesis transforms the subject graph into one that contains point q, which allows the better LUT match over inputs a to e]

Synthesis for Structural Choices
- Traditional synthesis produces one “optimized” network
- Synthesis with choices produces several networks
  - These can be different snapshots of the same synthesis flow
  - These can be results of synthesizing the design with different options
    - For example, area-oriented and delay-oriented scripts
[Figure: traditional synthesis as a chain D1, D2, D3, D4; synthesis with structural choices collects D1 to D4 into an HAIG]

Mapping with Structural Choices
- Two questions have to be answered
  - How to store multiple circuit structures?
  - How to perform mapping with multiple circuit structures?
- Both questions can be solved due to the following:
  - The subject graph is an AIG
    - Structural hashing quickly merges isomorphic circuit structures
  - There are powerful equivalence checking methods
    - They can be used to prove equivalence
  - Cut computation can be extended to work with structural choices
    - The modification is straightforward

Detecting Choices
Given two Boolean networks, create a network with choices.
- Network 1: x = (a + b)c, y = bcd
- Network 2: x = ac + bc, y = bcd
Step 1: Make And-Inverter decomposition of networks
[Figure: AIGs of the two networks over inputs a, b, c, d with outputs x and y]

Detecting Choices
Step 2: Use combinational equivalence to detect functionally equivalent nodes up to complementation (A. Kuehlmann, TCAD ’02)
- Random simulation to detect possibly equivalent nodes
- SAT-based decision procedure to prove equivalence
[Figure: AIGs of Network 1 (x = (a + b)c, y = bcd) and Network 2 (x = ac + bc, y = bcd) with equivalent nodes marked]

Detecting Choices
Step 3: Merge equivalent nodes with choice edges
x now represents a class of nodes that are functionally equivalent up to complementation.
[Figure: the merged AIG over inputs a, b, c, d, with the two structures of x joined by a choice edge and a single shared y]

Cut Computation with Choices
Cuts are now computed for equivalence classes of nodes.
- Cuts(x1) = { x1, pr, pbc, acr, abc }
- Cuts(x2) = { x2, qc, abc }
- Cuts(x) = Cuts(x1) ∪ Cuts(x2) = { x1, pr, pbc, acr, abc, x2, qc }
[Figure: equivalence class x with members x1 (fanins p, r) and x2 (fanin q), and node y, over inputs a, b, c, d]
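The union over the members of an equivalence class can be sketched as a duplicate-free merge of the members' cut lists. The function name is illustrative; the cut sets are the ones from the example above.

```python
def class_cuts(member_cuts):
    """Union of the cut sets of all nodes in one equivalence class.

    member_cuts: list with one list of cuts per class member.
    Order of first appearance is kept; duplicates are dropped.
    """
    seen = set()
    result = []
    for cuts in member_cuts:
        for cut in cuts:
            key = frozenset(cut)
            if key not in seen:
                seen.add(key)
                result.append(key)
    return result

cuts_x1 = [{"x1"}, {"p", "r"}, {"p", "b", "c"}, {"a", "c", "r"}, {"a", "b", "c"}]
cuts_x2 = [{"x2"}, {"q", "c"}, {"a", "b", "c"}]
cuts_x = class_cuts([cuts_x1, cuts_x2])
```

The shared cut {abc} appears once in the result, which gives the seven cuts of the class x.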

Mapping Algorithm with Choices
Only Step 1 has to be changed.
Input: And-Inverter Graph with choices
1. Compute K-feasible cuts with choices
2. Compute best arrival time at each node
   - In topological order (from PI to PO)
   - Compute the depth of all cuts and choose the best one
3. Perform area recovery
   - Using area flow
   - Using exact local area
4. Choose the best cover
   - In reverse topological order (from PO to PI)
Output: Mapped Netlist

(5) Tuning Mapping for Placement
- Placement-aware cost function for priority-cut computation
  - The total number of edges in a mapped network
- Advantages
  - Correlates with the total wire-length after placement
  - Easy to take into account during area recovery
- Treat “edges” as “area” resulting in
  - Edge flow (similar to area flow)
  - Exact local edges (similar to exact local area)
- WireMap
  - New placement-aware mapping algorithm
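The slides describe edge flow only as "similar to area flow". One plausible reading, stated here as an assumption and not as the WireMap definition, replaces the unit LUT area by the number of edges entering the LUT (its leaf count) and shares leaf costs among fanouts in the same way. The names and example values below are hypothetical.

```python
def edge_flow(cut, node_edge_flow, fanout_num):
    # Assumed definition: edges into this LUT, plus each leaf's edge flow
    # divided among the leaf's fanouts (by analogy with area flow)
    return len(cut) + sum(node_edge_flow[leaf] / fanout_num[leaf] for leaf in cut)

# Example network: p and q are 2-input LUTs over PIs; q has two fanouts
node_edge_flow = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 2, "q": 2}
fanout_num = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 2}

ef_pcd = edge_flow({"p", "c", "d"}, node_edge_flow, fanout_num)
ef_abq = edge_flow({"a", "b", "q"}, node_edge_flow, fanout_num)
```

Under this reading the cut through the shared node q again scores lower, so the edge-based cost ranks these two cuts the same way area flow does.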

Modified Cut Prioritization Heuristics in WireMap
- Consider nodes in a topological order
  - At each node, merge two sets of fanin cuts (each containing C cuts), getting (C+1) * (C+1) + 1 cuts
  - Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts
  - Select one best cut, and use it to map the node
- Sorting criteria

WireMap Algorithm
Input: And-Inverter Graph
1. Compute K-feasible cuts for each node
2. Compute best arrival time at each node
   - In topological order (from PI to PO)
   - Compute the depth of all cuts and choose the best one
3. Perform area recovery
   - Using area flow and edge flow
   - Using exact local area and exact local edge
4. Choose the best cover
   - In reverse topological order (from PO to PI)
Output: Mapped Netlist

Experimental Results
- Experimental comparison
  - WireMap vs. the same mapper w/o edge heuristics
- WireMap leads to the average edge reduction
  - 9.3% (while maintaining depth and LUT count)
- Place-and-route after WireMap leads to
  - 8.5% reduction in the total wire length
  - 6.0% reduction in minimum channel width
  - 2.3% reduction in critical path delay
- Changes in the LUT size distribution
  - The ratio of 5- and 6-LUTs in a typical design is reduced
  - The ratio of 2-, 3-, and 4-LUTs is increased
- Changes after LUT merging
  - 9.4% reduction in dual-output LUTs

(6) Other Applications of Priority-Cut-Based Mapping
- Sequential mapping (mapping + retiming)
- Speeding up SAT solving
- Cut sweeping
- Delay-oriented resynthesis for sequential circuits

Sequential Mapping
- That is, combinational mapping and retiming combined
  - Minimizes clock period in the combined solution space
- Previous work:
  - Pan et al, FPGA ’98
  - Cong et al, TCAD ’98
- Our contribution: dividing sequential mapping into steps
  1. Finding the best clock period via sequential arrival time computation (Pan et al, FPGA ’98)
  2. Running combinational mapping with the resulting arrival/required times of the register outputs/inputs
  3. Performing final retiming to bring the circuit to the best clock period computed in Step 1

Sequential Mapping (continued)
- Advantages
  - Uses priority cuts (L=1) for computing sequential arrival times
    - Very fast due to not computing sequential cuts (cuts crossing register boundary)
  - Reuses efficient area recovery available in combinational mapping
    - Almost no degradation in LUT count and register count
  - Greatly simplifies implementation
    - Due to using transparent register boundary without sequential cuts
- Quality of results
  - Leads to ~15% better quality compared to comb. mapping + retiming
    - Due to searching the combined search space
  - Achieves almost the same (-1%) clock period as the general sequential mapping with sequential cuts

Speeding Up SAT Solving
- Perform technology mapping into K-LUTs for area
  - Define area as the number of CNF clauses needed to represent the Boolean function of the cut
  - Run several iterations of area recovery
- Reduces the number of CNF clauses by ~50%
  - Compared to a good circuit-to-CNF translation (M. Velev)
- Improves SAT solver runtime by 3-10x
  - Experimental results are in the SAT ’07 paper

Cut Sweeping
- Reduce the circuit by detecting and merging shallow equivalences (proposed by Niklas Een)
  - By “shallow” equivalences, we mean equivalent points, A and B, for which there exists a K-cut C (K < 16) such that FA(C) = FB(C)
- A subset of “good” K-cuts can be computed
  - The cost function is the average fanout count of cut leaves
    - The more fanouts, the more likely the cut is common for two nodes
- Cut sweeping quickly reduces the circuit
  - Typically ~50% gain of SAT sweeping (fraiging)
- Cut sweeping is much faster than SAT sweeping
  - Typically 10-100x, for large designs
- Can be used as a fast preprocessing to (or a low-cost substitute for) SAT sweeping

Sequential Resynthesis for Delay
- Restructure logic along the tightest sequential loops to reduce delay after retiming (Soviani/Edwards, TCAD ’07)
  - Similar to sequential mapping
  - Computes seq. arrival times for the circuit
  - Uses the current logic structure, as well as the logic structure transformed using Shannon expansion w.r.t. the latest variables
  - Accepts transforms leading to delay reduction
  - In the end, retimes to the best clock period
- The improvement is 7-60% in delay with 1-12% area degradation (ISCAS circuits)
- This algorithm could benefit from the use of priority cuts

Summary
- Reviewed traditional and novel LUT mapping
- Presented the current mapping solution
  - Starts with an optimized AIG (with choices)
  - Performs exhaustive (or priority) cut computation
  - Performs heuristic area recovery
  - Uses placement-aware heuristics
- Experimental results are promising
- Future work
  - Area- and delay-oriented resynthesis for mapped networks
  - Using delay information from preliminary placement

Backup Slides on WireMap
- Virtex-5 dual-output LUT
- Comparison of LUT distribution
- Comparison of area flow and edge flow mapping (K = 6)
- Wirelength, channel width, and critical path delay comparison

Virtex-5 Dual-Output LUT
[Figure: a dual-output LUT6 with inputs A1 to A6, built from two LUT5 blocks, with outputs D6 and D5]

Comparison of LUT Distribution

Comparison of Area Flow and Edge Flow Mapping (K = 6)

Wirelength, Channel Width, and Critical Path Delay Comparison
twl = total wire length, mcw = minimum channel width required to route in VPR, cpd = critical path delay with min channel width across the three implementations