Synthesizing Datapath Circuits for FPGAs With Emphasis on Area Minimization
Andy Ye, David Lewis, Jonathan Rose
Department of Electrical and Computer Engineering, University of Toronto
{yeandy, lewis, jayar}@eecg.utoronto.ca

Motivation: Datapath Regularity
• Larger FPGAs
  – Larger applications on FPGAs
  – More datapath logic in larger applications
  – Datapath logic is highly regular
• Utilize regularity to improve logic density

Utilizing Datapath Regularity
• A new datapath-oriented FPGA
• New CAD tools supporting the new FPGA
  – Synthesis
  – Packing
  – Placement
  – Routing
• This talk focuses on synthesis

Background: Datapath-oriented FPGA
• Architected to utilize datapath regularity
• Architectural features
  – Capture regularity using special logic blocks
  – Increase logic density by coarse grain routing

Background: FPGA Overview
[Figure: FPGA overview showing logic clusters (L), switch boxes (S), routing channels, coarse grain routing tracks, and fine grain routing tracks]

Background: Logic Cluster
[Figure: a logic cluster with Subclusters 1 to 4; a subcluster with BLEs and a local routing network; a Basic Logic Element (BLE) with a LUT, a DFF, and a MUX (M)]

Background: Coarse Grain Routing Tracks
[Figure: labels are Logic Cluster, Subcluster, Switch Box, Fine Grain Routing, Coarse Grain Routing, and M]

Datapath Synthesis
• Synthesis
  – The first step in a fully automated CAD flow
  – Transforms high level descriptions into logic
• Conventional synthesis (flat synthesis)
  – Minimizes area and delay metrics
  – Destroys datapath regularity
• Datapath synthesis
  – Preserves datapath regularity
  – Supports downstream CAD tools

Datapath Representation
• Datapath circuits are represented by netlists of datapath components (VHDL or Verilog)
• Datapath component library
  – Multiplexers
  – Adders/subtracters
  – Shifters
  – Comparators
  – Registers
• Each component consists of identical bit-slices

Hard Boundary Hierarchical Synthesis
• Optimize within the boundaries of bit-slices
• Keep identical bit-slices identical
• Optimized 15 datapath circuits from the Pico-java processor using Synopsys [sun]
  – Good regularity
  – Poor area: 38% area inflation
• FPGA architecture
  – Aims to increase logic density
  – Needs a better synthesis tool

Causes of Area Inflation
• Examined circuits to determine the causes
• Constraint of preserving bit-slice boundaries
  – Common sub-expressions exist across bit-slices
  – Harder to discover in datapath synthesis
• Constraint of preserving datapath regularity
  – Identical bit-slices have different external connections
  – Some bit-slices have more optimization opportunities
  – These opportunities are missed if all bit-slices must be kept identical

Enhanced Module Compaction
[Figure: flow diagram. Netlist of Datapath Components; Word-level Optimization (manual operation); Module Compaction; Bit-slice I/O Optimization; Flat Synthesis & Optimization Within Bit-slice Boundaries; Netlist of Synthesized Bit-slices]

Word-level Optimization
• Currently done manually; will be automated
• Optimizes across bit-slice boundaries
• Uses the functionality of each datapath component to create optimization opportunities
• Two optimizations are performed
  – Multiplexer tree collapsing
  – Operation reordering
• More in the future

Multiplexer Tree Collapsing
• Datapath circuits contain multiplexers in a tree topology
• Collapses several multiplexers in a multiplexer tree into a single multiplexer
• Collapsing operation creates common subexpressions
• Extracts common expressions out of multiple bit-slices to save area

An Example
[Figure: multiplexer tree collapsing example. Before: labels A, S1, S2, R, mux 1, mux 2, and FF. After: labels A, S1, S2, rl, and FF. rl = random logic]
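
As a minimal sketch of the collapsing step, assuming a two-level tree built from 2:1 multiplexers (the function names and the input ordering are illustrative and do not come from the slides), the following Python checks exhaustively that the tree and a single 4:1 multiplexer compute the same function:

```python
from itertools import product

def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0

def mux_tree(s1, s2, a, b, c, d):
    """Two-level tree: both first-level muxes use s1, the second level uses s2."""
    return mux2(s2, mux2(s1, a, b), mux2(s1, c, d))

def mux4(s1, s2, a, b, c, d):
    """Collapsed form: a single 4:1 multiplexer indexed by the combined select."""
    return (a, b, c, d)[(s2 << 1) | s1]

def tree_equals_collapsed():
    """Compare both forms on every combination of the six 1-bit inputs."""
    return all(mux_tree(*bits) == mux4(*bits)
               for bits in product((0, 1), repeat=6))

print(tree_equals_collapsed())
```

In the collapsed form the select decode depends only on the control signals, not on the data bits, which is plausibly the kind of expression that can be shared across bit-slices.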

Operation Reordering
• Transforms result selection into operand selection
• Accepts the transformation only if it results in smaller area

An Example
[Figure: operation reordering example with adders (+) and multiplexers (mux), shown at the word level and for bit-slice 0 (signals sum, carry, cin0, cout0)]
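
The idea can be sketched in Python under stated assumptions: the operation is addition, the operands a, b, c, d and the select s are hypothetical names, and a 3-bit word is used only to keep the exhaustive check small. Result selection computes both sums and then selects one; operand selection selects the operands first and uses a single adder.

```python
from itertools import product

WIDTH = 3                  # hypothetical word width for the exhaustive check
MASK = (1 << WIDTH) - 1

def result_selection(s, a, b, c, d):
    """Two adders compute both sums; one multiplexer picks the result."""
    return ((c + d) & MASK) if s else ((a + b) & MASK)

def operand_selection(s, a, b, c, d):
    """Two multiplexers pick the operands; one adder computes the sum."""
    x = c if s else a
    y = d if s else b
    return (x + y) & MASK

def reordering_is_equivalent():
    """Check both forms on every select value and every operand combination."""
    words = range(1 << WIDTH)
    return all(result_selection(s, a, b, c, d) == operand_selection(s, a, b, c, d)
               for s in (0, 1)
               for a, b, c, d in product(words, repeat=4))

def accept_transformation(area_before, area_after):
    """Keep the reordered circuit only when it is strictly smaller."""
    return area_after < area_before

print(reordering_is_equivalent())
```

The area figures passed to `accept_transformation` would come from synthesizing both versions; no cost model is assumed here.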

Module Compaction
• Merges bit-slices into larger bit-slices
• Based on connectivity between datapath components
• Larger bit-slices have more optimization opportunities for flat synthesis
• Avoids merging based on carry chains
• Similar to the algorithm proposed by Koch

An Example
[Figure: module compaction example with full adders FA0 to FA4 and multiplexers mux0 to mux3]
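
One way to picture connectivity-based merging is as grouping bit-slices that are joined by data connections while ignoring carry connections. The Python below is a rough illustration using a union-find structure. It is not the authors' algorithm or Koch's, and the edge list (which adder feeds which multiplexer) is an assumption, since the wiring of the figure is not recoverable from the text.

```python
def compact(slices, edges):
    """Group bit-slices joined by non-carry edges into larger bit-slices."""
    parent = {s: s for s in slices}

    def find(s):
        # Follow parent links to the group representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for src, dst, kind in edges:
        if kind == "carry":
            continue  # carry chains link different bit positions, so skip them
        parent[find(src)] = find(dst)

    groups = {}
    for s in slices:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical wiring: each full adder feeds the multiplexer of the same bit,
# and the adders are chained through their carries.
slices = ["FA0", "FA1", "mux0", "mux1"]
edges = [("FA0", "mux0", "data"),
         ("FA1", "mux1", "data"),
         ("FA0", "FA1", "carry")]
print(compact(slices, edges))
```

With this wiring, each adder bit is merged with the multiplexer bit it feeds, and the two bit positions stay separate because the only link between them is a carry.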

Bit-slice I/O Optimization
• Granularity of bit-slice I/O optimization, m
• Breaks datapath components into m-bit wide chunks
• m bit-slices are kept identical to each other
• Allows some bit-slices in a datapath component to be optimized more than others

Bit-slice I/O Optimization
• Converts bit-slice I/O signals into internal signals if all m bit-slices meet an optimization criterion
• More optimization opportunities for flat synthesis
• Four types of I/O optimizations
  – Constant absorption
  – Feedback absorption
  – Duplicated input absorption
  – Unused output absorption
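
A minimal sketch of the chunk-level decision for constant absorption, in Python. The encoding is hypothetical: each bit-slice input is described as `None` when driven by a real signal, or as 0 or 1 when tied to a constant. The sketch also assumes that every bit-slice in a chunk must see the same constant for the chunk to qualify, which is one reading of "all m bit-slices meet an optimization criterion".

```python
def chunks(items, m):
    """Split a list of per-bit entries into consecutive m-bit wide chunks."""
    return [items[i:i + m] for i in range(0, len(items), m)]

def constant_absorption(input_ties, m):
    """For each m-bit chunk, return the constant to absorb, or None."""
    decisions = []
    for chunk in chunks(input_ties, m):
        first = chunk[0]
        # Absorb only if every bit-slice in the chunk is tied to the same constant.
        absorb = first is not None and all(t == first for t in chunk)
        decisions.append(first if absorb else None)
    return decisions

# Hypothetical 8-bit component: the low four bits of one input are tied to 0.
ties = [0, 0, 0, 0, None, None, 0, None]
print(constant_absorption(ties, 4))
print(constant_absorption(ties, 2))
```

With m = 4 only the low chunk qualifies. With m = 2 the same four low bits qualify as two chunks, and bit 6 still cannot be absorbed because its chunk partner is a real signal. Only m = 1 would absorb it, at the cost of no longer keeping neighbouring bit-slices identical.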

Experimental Results
• Fifteen benchmark circuits
  – From the Pico-java processor
  – Synthesized into 4-LUTs and DFFs
• Experiments
  – Area
  – Regularity
  – Area against m (the granularity of bit-slice I/O optimization)

Area
• m (granularity of bit-slice I/O optimization) = 4
• Compare datapath synthesis with flat synthesis

Post-synthesis Area (LUT Count)

Circuit        Flat Synthesis Area   Datapath Synthesis Area   Area Inflation
icu_dpath      3120                  3235                      3.7%
ex_dpath       2530                  2553                      0.91%
multmod_dp     1558                  1634                      4.9%
ucode_dat      1243                  1304                      4.9%
imdr_dpath     1182                  1219                      3.1%
dcu_dpath      960                   966                       0.63%
mantissa_dp    846                   878                       3.8%
incmod_dp      779                   865                       11%
smu_dpath      490                   493                       0.61%
exponent_dp    477                   501                       5.0%
pipe_dpath     443                   471                       6.3%
prils_dp       377                   388                       2.9%
rsadd_dp       346                   305                       -12%
code_seq_dp    218                   223                       2.3%
ucode_reg      78                    82                        5.1%
Total          14647                 15117                     3.2%
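
The inflation column can be reproduced from the two LUT counts. The short Python sketch below assumes the usual definition, the relative increase of the datapath synthesis area over the flat synthesis area. The slides do not state the formula, but it matches the reported percentages.

```python
def area_inflation(flat_luts, datapath_luts):
    """Percentage by which the datapath synthesis LUT count exceeds the flat one."""
    return 100.0 * (datapath_luts - flat_luts) / flat_luts

# LUT counts copied from the slide: (flat synthesis, datapath synthesis)
samples = {
    "icu_dpath": (3120, 3235),
    "rsadd_dp": (346, 305),
    "Total": (14647, 15117),
}

for name, (flat, datapath) in samples.items():
    print(f"{name}: {area_inflation(flat, datapath):.1f}%")
```

For rsadd_dp the value is negative: datapath synthesis produced fewer LUTs than flat synthesis for that circuit.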

Regularity
• m (granularity of bit-slice I/O optimization) = 4
• Two terminal connections captured by
  – 4-bit wide buses
  – 4-bit wide control groups

Regularity
[Figure: a 4-bit wide bus and a 4-bit wide control group, each drawn across bit-slices S1 to S4]

Regularity Results

Circuit        Two Terminal Connections
dcu_dpath      2232
ex_dpath       6547
icu_dpath      8047
imdr_dpath     3100
pipe_dpath     1049
smu_dpath      1167
ucode_data     3143
ucode_reg      194
code_seq_dp    799
exponent_dp    1362
incmod_dp      2013
mantissa_dp    2533
multmod_dp     3380
prils_dp       864
rsadd_dp       722
Total          37152

4-bit Wide Buses (14 values recovered; the last is the total): 49%, 52%, 47%, 50%, 48%, 52%, 72%, 58%, 32%, 47%, 39%, 41%, 52%, 48%
4-bit Wide Control Groups (14 values recovered; the last is the total): 43%, 39%, 36%, 42%, 25%, 41%, 21%, 18%, 23%, 36%, 25%, 32%, 27%, 35%

• 94% of LUTs remain in regular datapath components

Granularity (m) Vs. Area
• Higher m (the granularity of bit-slice I/O optimization)
  – Keeps more bit-slices identical
  – Preserves more regularity
  – Higher area cost

Granularity Vs. Area Inflation
[Figure: area inflation plotted against granularity (m)]

Conclusion
• Presented a datapath-oriented FPGA architecture
• Presented an enhanced module compaction algorithm
• Empirically demonstrated the area efficiency of the algorithm
  – 3%-8% area inflation
• Good regularity
  – 48% of two terminal connections are in 4-bit wide buses
  – 35% of two terminal connections are in 4-bit wide control groups