Improvements in FPGA Technology Mapping
Satrajit Chatterjee, Alan Mishchenko and Robert Brayton
U.C. Berkeley

Outline
1. Review of Technology Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Area Recovery

Technology Mapping
Input: A Boolean network f
Output: A netlist of k-LUTs implementing the Boolean network, optimizing some cost function
[Figure: the subject graph for f over inputs a, b, c, d, e, and the mapped netlist produced from it by technology mapping]

Basic Mapping Algorithm
Cut-based mapping using dynamic programming on a DAG for delay optimality
Input: And-Inverter Graph
1. Compute k-feasible cuts for each node
2. Compute best arrival time at each node
• In topological order (from PI to PO)
• Assuming that each cut maps to a k-LUT
• Assuming that each k-LUT has unit delay
3. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
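Steps 2 and 3 can be illustrated with a minimal Python sketch. It assumes the cuts are already available as a dictionary of frozensets; that input format and the function name `map_for_delay` are hypothetical. The example data is a small three-input graph in which r has fan-ins p and q. The backward pass uses a work list started from the outputs, which visits nodes from PO to PI.

```python
def map_for_delay(cuts, inputs, outputs, order):
    """Return (arrival, cover) for a unit-delay k-LUT mapping.

    cuts:  node -> list of frozenset cuts (hypothetical input format)
    order: all nodes in topological order, primary inputs first
    """
    arrival = {pi: 0 for pi in inputs}
    best = {}
    for n in order:                      # forward pass: PI -> PO
        if n in inputs:
            continue
        # The trivial cut {n} cannot implement n, so it is skipped.
        candidates = [c for c in cuts[n] if c != frozenset([n])]
        best[n] = min(candidates, key=lambda c: max(arrival[x] for x in c))
        arrival[n] = 1 + max(arrival[x] for x in best[n])

    cover = {}
    todo = list(outputs)                 # backward pass: PO -> PI
    while todo:
        n = todo.pop()
        if n in inputs or n in cover:
            continue
        cover[n] = best[n]               # n becomes one LUT fed by its best cut
        todo.extend(best[n])
    return arrival, cover


inputs = {"a", "b", "c"}
order = ["a", "b", "c", "p", "q", "r"]
cuts = {
    "p": [frozenset("p"), frozenset("ab")],
    "q": [frozenset("q"), frozenset("bc")],
    "r": [frozenset("r"), frozenset("pq"), frozenset("pbc"),
          frozenset("abq"), frozenset("abc")],
}
arrival, cover = map_for_delay(cuts, inputs, ["r"], order)
```

With k = 3 the cut {a, b, c} gives r an arrival time of 1, so the cover consists of a single LUT.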

k-feasible Cuts (rough definitions)
A cut of a node n is a set of nodes in its transitive fan-in such that assigning values to those nodes fixes the value of n.
A k-feasible cut is a cut of size k or less.
Example (figure: node r above nodes p, q and inputs a, b, c): the set {p, b, c} is a 3-feasible cut of node r. (It is also a 5-feasible cut.)
k-feasible cuts are important in FPGA mapping since the logic between a node and the nodes in its cut can be replaced by a k-LUT.

k-feasible Cut Computation
The set of cuts of a node is a ‘cross product’ of the sets of cuts of its children.
Computation is done bottom-up.
Any cut of size greater than k is discarded (Pan ’98, Cong ’99).
Example:
Cuts(a) = { {a} }, Cuts(b) = { {b} }, Cuts(c) = { {c} }
Cuts(p) = { {p}, {a, b} }
Cuts(q) = { {q}, {b, c} }
Cuts(r) = { {r}, {p, q}, {p, b, c}, {a, b, q}, {a, b, c} }
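The cross product can be written out in a few lines of Python. This is a minimal sketch, not the implementation from the talk: the function name `enumerate_cuts` and the dictionary-based graph encoding are assumptions made for illustration, and the fan-in structure is the one implied by the cut sets in the example.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Bottom-up k-feasible cut enumeration.

    fanins: node -> tuple of fan-in nodes (empty for primary inputs)
    order:  nodes in topological order
    """
    cuts = {}
    for n in order:
        result = [frozenset([n])]                    # the trivial cut {n}
        if fanins[n]:
            # One cut from each child, merged by set union.
            for combo in product(*(cuts[f] for f in fanins[n])):
                merged = frozenset().union(*combo)
                if len(merged) <= k and merged not in result:
                    result.append(merged)            # keep only k-feasible cuts
        cuts[n] = result
    return cuts


fanins = {"a": (), "b": (), "c": (),
          "p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
order = ["a", "b", "c", "p", "q", "r"]
cuts = enumerate_cuts(fanins, order, k=3)
```

For k = 3 this reproduces the five cuts of r listed on the slide; with k = 2 only {r} and {p, q} survive.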

Outline
1. Review of Technology Mapping
2. More Efficient Cut Computation
   1. Cut Dropping
   2. Cut Dominance
3. Lossless Synthesis
4. Area Recovery

Cut Dropping
During bottom-up computation of cuts, the set of cuts of a node can be freed once all its fan-outs have been processed.
Example (bottom-up computation):
Cuts(p) = { {p}, {a, b} }
Cuts(q) = { {q}, {b, c} } (can delete these cuts once node r is done)
Cuts(r) = { {r}, {p, q}, {p, b, c}, {a, b, q}, {a, b, c} }
• Once the cuts of node r are computed, the cuts of q are no longer needed
• But the cuts of node p cannot be discarded, since not all fan-outs of p have been processed
• Dramatically reduces peak memory consumption on large designs
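A minimal Python sketch of cut dropping, assuming a per-node counter of unprocessed fan-outs. The function name, the graph encoding, and the extra node s (a second fan-out of p, added so that p outlives r) are hypothetical. A real mapper would presumably record whatever it needs from a node's cuts before freeing them; this sketch only tracks how many cuts are held at once.

```python
from itertools import product


def enumerate_with_dropping(fanins, order, k):
    """Enumerate cuts bottom-up, freeing a node's cuts after its last fan-out."""
    remaining = {n: 0 for n in order}        # fan-outs not yet processed
    for n in order:
        for f in fanins[n]:
            remaining[f] += 1

    cuts, peak, dropped_after = {}, 0, {}
    for n in order:
        result = {frozenset([n])}
        if fanins[n]:
            for combo in product(*(cuts[f] for f in fanins[n])):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    result.add(merged)
        cuts[n] = result
        peak = max(peak, sum(len(c) for c in cuts.values()))

        dropped_after[n] = []
        for f in fanins[n]:
            remaining[f] -= 1
            if remaining[f] == 0:            # n was the last fan-out of f
                del cuts[f]
                dropped_after[n].append(f)
    return cuts, peak, dropped_after


fanins = {"a": (), "b": (), "c": (),
          "p": ("a", "b"), "q": ("b", "c"),
          "r": ("p", "q"), "s": ("p", "c")}   # s is a hypothetical second fan-out of p
order = ["a", "b", "c", "p", "q", "r", "s"]
cuts, peak, dropped_after = enumerate_with_dropping(fanins, order, k=3)
```

After r is processed only q is freed; p is kept until s has been processed. Only the cut sets of nodes without fan-outs remain at the end.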

Cuts Behaving Badly
Bottom-up cut computation in the presence of re-convergence might produce dominated cuts.
Example: x = ~a + a.b + ~b.c
Cut set at x: { ..., {a, d, b, c}, ..., {a, b, c}, ... }
Cut set at a node below x: { ..., {d, b, c}, ..., {a, b, c}, ... }
Cut {a, b, c} dominates cut {a, d, b, c}
• The “good” cut {a, b, c} is there: so not a quality issue
• But the “bad” cut {a, d, b, c} may be propagated further: so a run-time issue
• Want to discard dominated cuts quickly

Signature-based Dominance
Problem: Given two cuts, how to quickly determine whether one is a subset of the other
Define the signature of a cut c:
sig(c) = Σ_{n ∈ c} 2^(ID(n) mod 32)
(Σ means bit-wise OR), where ID(n) is the integer id of node n
Observation: If cut c1 dominates cut c2, then sig(c1) OR sig(c2) = sig(c2)
This is a cheap test for the common case that a cut does not dominate another. Only if the signature test does not rule out dominance is an actual comparison made.

Example
• Let the node ids be a = 1, b = 2, c = 3, d = 4
• Let c1 = {a, b, c} and c2 = {a, d, b, c}
• sig(c1) = 2^1 OR 2^2 OR 2^3 = 00010 OR 00100 OR 01000 = 01110
• sig(c2) = 2^1 OR 2^4 OR 2^2 OR 2^3 = 00010 OR 10000 OR 00100 OR 01000 = 11110
• As sig(c1) OR sig(c2) ≠ sig(c1), c2 does not dominate c1
• But sig(c1) OR sig(c2) = sig(c2), so c1 may dominate c2
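The signature test is easy to reproduce in Python. This is a sketch under the definitions above; the function names and the extra node e with id 33 are hypothetical, and e is included only to show a signature collision that the exact subset check has to resolve.

```python
def signature(cut, node_id):
    """Bit-wise OR of 2^(ID(n) mod 32) over the nodes of the cut."""
    sig = 0
    for n in cut:
        sig |= 1 << (node_id[n] % 32)
    return sig


def dominates(c1, c2, node_id):
    """True if c1 is a subset of c2, using the signature as a quick filter."""
    s1, s2 = signature(c1, node_id), signature(c2, node_id)
    if (s1 | s2) != s2:
        return False          # quick negative: c1 has a bit that c2 lacks
    return c1 <= c2           # signatures are inconclusive: compare the sets


node_id = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 33}   # e collides with a (33 mod 32 = 1)
c1 = frozenset("abc")
c2 = frozenset("adbc")
```

Here `signature(c1, node_id)` is 0b01110 and `signature(c2, node_id)` is 0b11110, so `dominates(c1, c2, node_id)` is true while `dominates(c2, c1, node_id)` is rejected by the signature test alone. The cut {e} passes the signature filter against {a, b} but fails the exact comparison.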

Other Uses of Signatures
• Signatures can be used as quick negative tests for equality of cuts and for k-feasibility

Run-time of k-feasible cut computation

Peak Memory in Mb with Cut Dropping

Outline
1. Review of Technology Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Area Recovery

Structural Bias
The mapped netlist very closely resembles the subject graph.
[Figure: subject graph for f with internal nodes p and m over inputs a, b, c, d, e, and the mapped netlist obtained from it by technology mapping]
Every input of every LUT in the mapped netlist must be present in the subject graph; otherwise technology mapping will not find the match.

The Problem of Structural Bias
A better match may not be found.
[Figure: subject graph for f with internal nodes p and m over inputs a, b, c, d, e, next to a mapping that uses a point q; this match is not found]
Since the point q is not present in the subject graph, the match on the extreme right will not be found.

The Problem of Structural Bias
The match would be found with a different subject graph.
[Figure: an alternative subject graph for f that contains the point q, over inputs a, b, c, d, e]

Traditional Synthesis
Only the network at the end of technology-independent synthesis is used for mapping.
Flow: Boolean Network, then technology-independent synthesis (script: sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify), then Technology Mapping, then Mapped Netlist.
No guarantee of optimality since each synthesis step is heuristic. But structural bias means the mapped netlist depends heavily on the final network.

Lossless Synthesis
Idea: Merge intermediate networks into a single network with choices, which is used for mapping.
Flow: Boolean Network, then technology-independent synthesis (script: sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify) with the intermediate networks combined by a choice operator, then Technology Mapping, then Mapped Netlist.
Technology mapping is not any harder with choices (Lehman-Watanabe ’95, Chen and Cong ’01).

Lossless Synthesis
Can combine the results of different technology-independent optimization scripts.
• Script that optimizes area: sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify
• Script that optimizes delay: speed up, reduce depth
Flow: Boolean Network, then both scripts, then Technology Mapping, then Mapped Netlist.

Mapping with Choices
Flow: Boolean Network, then synthesis script (sweep, eliminate, resub, simplify, fx, resub, sweep, eliminate, sweep, full simplify), then Technology Mapping, then Mapped Netlist.
Question 1: How to implement an efficient choice operator?
Question 2: How to map quickly with choices?

Detecting Choices
Task: Given two Boolean networks, we need to create a network with choices.
Network 1: x = (a + b).c, y = b.c.d
Network 2: x = a.c + b.c, y = b.c.d
Step 1: Make And-Inverter decomposition of the networks
[Figure: And-Inverter graphs of the two networks, each with outputs x, y over inputs a, b, c, d]

Detecting Choices
Step 2: Use combinational equivalence to detect functionally equivalent nodes up to complementation (Kuehlmann ’04, …)
– Random simulation to detect possibly equivalent nodes
– SAT-based decision procedure to prove equivalence
Network 1: x = (a + b).c, y = b.c.d
Network 2: x = a.c + b.c, y = b.c.d
[Figure: And-Inverter graphs of the two networks, each with outputs x, y over inputs a, b, c, d]
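The random-simulation half of Step 2 can be sketched in Python as follows. The encoding of the And-Inverter graph (each AND node lists its fan-ins as `(name, inverted)` pairs) and all function and node names are assumptions made for this illustration. The SAT-based proof step is not shown; simulation only proposes candidates.

```python
import random
from collections import defaultdict

MASK = (1 << 64) - 1          # 64 random input patterns simulated in parallel


def simulate(inputs, gates, seed=1):
    """Simulate every AND node on 64 random patterns packed into one integer."""
    rng = random.Random(seed)
    value = {pi: rng.getrandbits(64) for pi in inputs}
    for node, fanins in gates.items():        # gates are listed in topological order
        word = MASK
        for name, inverted in fanins:
            word &= value[name] ^ (MASK if inverted else 0)
        value[node] = word
    return value


def candidate_classes(value, nodes):
    """Group nodes whose simulation words are equal or complementary."""
    classes = defaultdict(list)
    for n in nodes:
        key = min(value[n], value[n] ^ MASK)  # same key for a word and its complement
        classes[key].append(n)
    return [group for group in classes.values() if len(group) > 1]


inputs = ["a", "b", "c"]
gates = {
    "t1": [("a", True), ("b", True)],      # ~a & ~b
    "x1": [("t1", True), ("c", False)],    # Network 1: x = (a + b).c
    "u1": [("a", False), ("c", False)],    # a.c
    "u2": [("b", False), ("c", False)],    # b.c
    "u3": [("u1", True), ("u2", True)],    # ~(a.c + b.c); Network 2: x = ~u3
}
classes = candidate_classes(simulate(inputs, gates), gates)
```

Nodes x1 and u3 always land in the same class because u3 is the complement of x1, which is the "up to complementation" case. Nodes that merely happen to agree on the sampled patterns would also be grouped, which is why a proof step is still needed.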

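Detecting Choices
Step 3: Merge equivalent nodes with choice edges
[Figure: the two And-Inverter graphs merged over the shared inputs a, b, c, d, with the two implementations of x joined by a choice edge]
x now represents a class of nodes that are functionally equivalent up to complementation.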

Mapping with Choices
Only Step 1 requires modification
Input: And-Inverter Graph with Choices
1. Compute k-feasible cuts with choices
2. Compute best arrival time at each node
• In topological order (from PI to PO)
• Assuming that each cut maps to a k-LUT
• Assuming that each k-LUT has unit delay
3. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist

Cut Computation with Choices
Cuts are now computed for equivalence classes of nodes.
Example (x is the class containing x1 and x2):
Cuts(x1) = { {x1}, {p, r}, {p, b, c}, {a, c, r}, {a, b, c} }
Cuts(x2) = { {x2}, {q, c}, {a, b, c} }
Cuts(x) = Cuts(x1) ∪ Cuts(x2) = { {x1}, {p, r}, {p, b, c}, {a, c, r}, {a, b, c}, {x2}, {q, c} }
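In code, the cut set of an equivalence class is the union of its members' cut sets with duplicates removed. The Python sketch below uses the cut sets from the slide; the function name `class_cuts` and the two-character node names are assumptions for illustration.

```python
def class_cuts(members, cuts):
    """Union of the cut sets of all nodes in one equivalence class."""
    merged = []
    for m in members:
        for c in cuts[m]:
            if c not in merged:          # {a, b, c} appears in both sets, keep it once
                merged.append(c)
    return merged


cuts = {
    "x1": [frozenset({"x1"}), frozenset({"p", "r"}), frozenset({"p", "b", "c"}),
           frozenset({"a", "c", "r"}), frozenset({"a", "b", "c"})],
    "x2": [frozenset({"x2"}), frozenset({"q", "c"}), frozenset({"a", "b", "c"})],
}
cuts_x = class_cuts(["x1", "x2"], cuts)
```

The result has seven cuts, matching the slide: five from x1 and two new ones from x2.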

Mapping with Choices
After Step 1 everything else remains the same
Input: And-Inverter Graph with Choices
1. Compute k-feasible cuts with choices
2. Compute best arrival time at each node
• In topological order (from PI to PO)
• Assuming that each cut maps to a k-LUT
• Assuming that each k-LUT has unit delay
3. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist

Outline
1. Review of Technology Mapping
2. More Efficient Cut Computation
3. Lossless Synthesis
4. Area Recovery
   1. Area-flow
   2. Exact Area

Overview of Area Recovery
• Initial mapping is delay oriented
– Gets best delay for all paths
– Area-based tie-breaking
• Not all paths are critical
– Area recovery tries to slow down non-critical paths to reduce area
– Each node with positive slack: choose a different cut that reduces area
– Done as subsequent passes after delay-oriented mapping
• Question: how to measure area?

How to Measure Area?
Naïve definition: Area(cut) = 1 + [ Σ area(fan-in) ]
[Figure: outputs x and y over internal nodes p, q, r and inputs a, b, c, d, e, f, shown once per cut]
Area of cut {p, c, d} = 1 + [1 + 0] = 2
Area of cut {a, b, q} = 1 + [0 + 1] = 2
The naïve definition says both cuts are equally good in area.
The naïve definition ignores sharing due to multiple fan-outs.

Area-flow
Area-flow(cut) = 1 + [ Σ ( area-flow(fan-in) / fan-out(fan-in) ) ]
[Figure: outputs x and y over internal nodes p, q, r and inputs a, b, c, d, e, f, shown once per cut]
Area-flow of cut {p, c, d} = 1 + [1 + 0] = 2
Area-flow of cut {a, b, q} = 1 + [0/1 + ½] = 1.5
Area-flow recognizes that cut {a, b, q} is better.
Area-flow “correctly” accounts for sharing (Cong ’99, Manohararajah ’04).
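The area-flow formula translates directly into Python. In this sketch the area-flow and fan-out values of the leaves are taken from the slide's arithmetic (p has area-flow 1 and one fan-out, q has area-flow 1 and two fan-outs, primary inputs have area-flow 0). The fan-out counts given to the primary inputs are placeholders that do not affect the result, and the function name is hypothetical.

```python
def area_flow(cut, flow, fanout):
    """Area-flow of a cut: 1 for its own LUT plus each leaf's flow shared over its fan-outs."""
    return 1 + sum(flow[n] / fanout[n] for n in cut)


# Leaf data assumed from the slide's example.
flow = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 1, "q": 1}
fanout = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 2}

candidates = [frozenset("pcd"), frozenset("abq")]
best = min(candidates, key=lambda cut: area_flow(cut, flow, fanout))
```

Cut {p, c, d} evaluates to 2.0 and cut {a, b, q} to 1.5, so `best` is {a, b, q}: the cost of q is split between its two fan-outs.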

Area Recovery with Area-flow
1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area-flow
– Done in topological order from PI to PO
– Among all the cuts which do not exceed the slack budget, choose the cut with the smallest area-flow
– Fan-out of a node is estimated from the delay-optimal mapping
– We only do one pass
• Saw only marginal improvement on subsequent passes

Exact Area
Exact-area(cut) = 1 + [ Σ exact-area(fan-in with no other fan-out) ]
[Figure: node f over internal nodes p, s, q, t and inputs a, b, c, d, e, f, shown once per cut]
Cut {p, e, f}:
– Area flow = 1 + [(.25 + 3)/2] = 2.75
– Exact area = 1 + 0 (p is used elsewhere)
– Exact area will choose this cut.
Cut {s, t, q}:
– Area flow = 1 + [.25 + .25 + 1] = 2.5
– Exact area = 1 + 1 = 2 (due to q)
– Area flow will choose this cut.
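One way to evaluate exact area is with reference counts on the current mapping: temporarily reference the leaves of a cut, count every LUT that becomes newly used, then undo the references. The Python sketch below assumes that approach; the slides do not state how exact area is implemented. The network data is hypothetical, chosen so that p, s and t are already used elsewhere and q is not, as in the slide's example.

```python
def ref_cut(cut, best_cut, refs, inputs):
    """Reference the leaves of `cut`; return the number of LUTs this adds."""
    area = 1                                   # the LUT for the cut itself
    for leaf in cut:
        if leaf in inputs:
            continue
        refs[leaf] += 1
        if refs[leaf] == 1:                    # leaf was unused, so its LUT is new
            area += ref_cut(best_cut[leaf], best_cut, refs, inputs)
    return area


def deref_cut(cut, best_cut, refs, inputs):
    """Undo ref_cut; return the number of LUTs that become unused."""
    area = 1
    for leaf in cut:
        if leaf in inputs:
            continue
        refs[leaf] -= 1
        if refs[leaf] == 0:
            area += deref_cut(best_cut[leaf], best_cut, refs, inputs)
    return area


def exact_area(cut, best_cut, refs, inputs):
    area = ref_cut(cut, best_cut, refs, inputs)
    deref_cut(cut, best_cut, refs, inputs)     # restore the reference counts
    return area


inputs = set("abcdef")
# Hypothetical current mapping: the cut chosen for each internal node.
best_cut = {"p": frozenset("st"), "s": frozenset("ab"),
            "t": frozenset("bc"), "q": frozenset("cd")}
refs = {"p": 1, "s": 1, "t": 1, "q": 0}        # q is not used by any other LUT
```

With this data `exact_area(frozenset("pef"), ...)` is 1 and `exact_area(frozenset("stq"), ...)` is 2, matching the 1 + 0 and 1 + 1 figures on the slide, and `refs` is unchanged afterwards.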

Area Recovery with Exact-area
1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area-flow
4. Do area recovery with exact-area
– Done in topological order from PI to PO
– Among all the cuts which do not exceed the slack budget, choose the cut with the smallest exact-area
– Note: Unlike area-flow, no estimation involved
– We only do one pass
• Saw only marginal improvement on subsequent passes

Area Recovery Summary
• Two-step area recovery
• Area-flow has global view
• Exact area has local view
– Ensures local minimum is reached
• Order in which nodes are processed for both steps is important
• Order of the two passes is important

Experimental Comparison
• Compare area recovery with state-of-the-art academic mapper DAOmap
– DAOmap uses many (~10) different area recovery heuristics
– Some more effective than others
• Just the two heuristics of area-flow and exact-area give better results on their benchmarks
• Also separate comparison with choices obtained from lossless synthesis flow
– Six snapshots of MVSIS script.rugged
– Not the best FPGA optimization script
– Improves both area and delay

Comparison with DAOmap

Summary
• Improvements to cut computation
– Cut dropping
– Signature-based dominance check
• Lossless Synthesis
– Map over multiple synthesis snapshots
• Simpler, faster and better area recovery
– Global area-flow
– Local exact area
– Order of application is important
• Implemented in the abc system
– Google: “abc berkeley logic synthesis”