Synthesizing Datapath Circuits for FPGAs With Emphasis on Area Minimization
Andy Ye, David Lewis, Jonathan Rose
Department of Electrical and Computer Engineering, University of Toronto
{yeandy, lewis, jayar}@eecg.utoronto.ca

Motivation: Datapath Regularity
• Larger FPGAs
  – Larger applications on FPGAs
  – More datapath logic in larger applications
  – Datapath logic is highly regular
• Utilize regularity to improve logic density

Utilizing Datapath Regularity
• A new datapath-oriented FPGA
• New CAD tools supporting the new FPGA
  – Synthesis
  – Packing
  – Placement
  – Routing
• This talk focuses on synthesis

Background: Datapath-oriented FPGA
• Architected to utilize datapath regularity
• Architectural features
  – Capture regularity using special logic blocks
  – Increase logic density by coarse grain routing

Background: FPGA Overview
[Figure: array of logic clusters (L) and switch boxes (S) joined by routing channels that carry coarse grain and fine grain routing tracks]

Background: Logic Cluster
[Figure: a logic cluster with four subclusters (Subcluster 1 to Subcluster 4). Panel "A Subcluster": BLEs and a local routing network. Panel "A Basic Logic Element (BLE)": LUT, DFF, and MUX.]

Background: FPGA Overview
[Figure: array of logic clusters (L) and switch boxes (S) joined by routing channels that carry coarse grain and fine grain routing tracks]

Background: Coarse Grain Routing Tracks
[Figure: subclusters of a logic cluster connected to a switch box through fine grain routing and coarse grain routing tracks]

Datapath Synthesis
• Synthesis
  – The first step in a fully automated CAD flow
  – Transforms high level descriptions into logic
• Conventional synthesis (flat synthesis)
  – Minimizes area and delay metrics
  – Destroys datapath regularity
• Datapath synthesis
  – Preserves datapath regularity
  – Supports downstream CAD tools

Datapath Representation
• Datapath circuits are represented by netlists of datapath components (VHDL or Verilog)
• Datapath component library
  – Multiplexers
  – Adders/subtracters
  – Shifters
  – Comparators
  – Registers
• Each component consists of identical bit-slices
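A minimal sketch of this representation, assuming a hypothetical in-memory model rather than the VHDL or Verilog netlists the talk uses. The class names `BitSlice` and `DatapathComponent` are illustrative and do not come from the slides. The point is only that a component is a list of bit-slices that differ in nothing but their bit index.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BitSlice:
    """One bit position of a component (hypothetical model)."""
    kind: str   # e.g. "mux2", "full_adder", "dff"
    bit: int    # bit position within the component


@dataclass
class DatapathComponent:
    """A datapath component built from identical bit-slices."""
    name: str
    kind: str
    width: int
    slices: list = field(default_factory=list)

    def __post_init__(self):
        # Every slice has the same kind; only the bit index differs.
        self.slices = [BitSlice(self.kind, b) for b in range(self.width)]

    def is_regular(self) -> bool:
        # Regular means all slices share a single kind.
        return len({s.kind for s in self.slices}) == 1


adder = DatapathComponent("add0", "full_adder", width=8)
```

Here `adder` holds eight `full_adder` slices, and `adder.is_regular()` stays true as long as no slice is optimized differently from its siblings.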

Hard Boundary Hierarchical Synthesis
• Optimize within the boundaries of bit-slices
• Keep identical bit-slices identical
• Optimized 15 datapath circuits from the picoJava processor using Synopsys [sun]
  – Good regularity
  – Bad area: 38% area inflation
• The FPGA architecture aims to increase logic density
  – Need a better synthesis tool

Causes of Area Inflation
• Examined circuits to determine the causes
• Constraint of preserving bit-slice boundaries
  – Common sub-expressions exist across bit-slices
  – Harder to discover in datapath synthesis
• Constraint of preserving datapath regularity
  – Identical bit-slices have different external connections
  – Some bit-slices have more optimization opportunities
  – Optimization opportunities are missed if all bit-slices must be kept identical

Enhanced Module Compaction
[Figure: synthesis flow. Netlist of Datapath Components -> Word-level Optimization (manual operation) -> Module Compaction -> Bit-slice I/O Optimization -> Flat Synthesis & Optimization Within Bit-slice Boundaries -> Netlist of Synthesized Bit-slices]

Word-level Optimization
• Currently done manually; will be automated
• Optimizes across bit-slice boundaries
• Uses the functionality of each datapath component to create optimization opportunities
• Two optimizations are performed
  – Multiplexer tree collapsing
  – Operation reordering
• More in the future

Multiplexer Tree Collapsing
• Datapath circuits contain multiplexers in a tree topology
• Collapses several multiplexers in a multiplexer tree into a single multiplexer
• Collapsing operation creates common subexpressions
• Extracts common expressions out of multiple bit-slices to save area
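The collapsing step can be illustrated with a small sketch. It assumes a hypothetical tuple encoding of a tree of 2:1 multiplexers and made-up signal names (`D0`, `D1`, `D2`, `S1`, `S2`); it is not the tool's actual data structure. The tree is flattened into one wide multiplexer, described by a list of select signals and a table from select values to the chosen data input. The select decoding in that table is the same for every bit-slice, which is where the shared logic comes from.

```python
from itertools import product


# A tree node is either a signal name (leaf) or ("mux", select, in0, in1).
def eval_tree(node, env):
    """Evaluate the original mux tree under a signal assignment."""
    if isinstance(node, str):
        return env[node]
    _, sel, in0, in1 = node
    return eval_tree(in1 if env[sel] else in0, env)


def selects(node):
    """Select signals used in the tree, in first-seen order."""
    if isinstance(node, str):
        return []
    _, sel, in0, in1 = node
    out = [sel]
    for s in selects(in0) + selects(in1):
        if s not in out:
            out.append(s)
    return out


def collapse(tree):
    """Collapse a mux tree into one mux: (selects, select values -> data input)."""
    sels = selects(tree)
    table = {}
    for values in product((0, 1), repeat=len(sels)):
        env = dict(zip(sels, values))
        node = tree
        while not isinstance(node, str):   # walk down to the chosen leaf
            _, sel, in0, in1 = node
            node = in1 if env[sel] else in0
        table[values] = node
    return sels, table


# Two-level tree: the inner mux picks D0 or D1, the outer mux picks that or D2.
tree = ("mux", "S2", ("mux", "S1", "D0", "D1"), "D2")
sels, table = collapse(tree)
```

For this tree, `sels` is `["S2", "S1"]` and `table` maps `(0, 0)` to `D0`, `(0, 1)` to `D1`, and both `(1, 0)` and `(1, 1)` to `D2`.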

An Example
[Figure: multiplexer tree collapsing. Before: signals A, S1, S2, R with mux1, mux2, and an FF. After: signals A, S1, S2 with rl and an FF (rl: random logic).]

Operation Reordering
• Transforms result selection into operand selection
• Accepts the transformation if it results in smaller area
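As an illustration of the transformation, the sketch below assumes the selected operation is addition and uses hypothetical operand names (`a`, `b`, `c`, `d`, select `s`). It checks exhaustively, for a small word width, that selecting between two sums equals summing two selected operands, and it models the acceptance rule as a plain area comparison. The area figures would come from synthesis; none are assumed here.

```python
from itertools import product

WIDTH = 3                      # small width so the exhaustive check stays cheap
MASK = (1 << WIDTH) - 1


def result_selection(s, a, b, c, d):
    # Two adders, then a multiplexer picks one result.
    return ((c + d) if s else (a + b)) & MASK


def operand_selection(s, a, b, c, d):
    # Two multiplexers pick the operands, then a single adder.
    x = c if s else a
    y = d if s else b
    return (x + y) & MASK


def accept(area_before, area_after):
    # Keep the transformation only when it shrinks the area.
    return area_after < area_before


equivalent = all(
    result_selection(s, a, b, c, d) == operand_selection(s, a, b, c, d)
    for s in (0, 1)
    for a, b, c, d in product(range(MASK + 1), repeat=4)
)
```

The reordered form trades one adder for one extra multiplexer, so whether `accept` returns true depends on the relative cost of the two in the target logic.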

An Example
[Figure: operation reordering example with adders (+), multiplexers (mux), operands a to e, and select s; the bit-slice detail shows sum, carry, cin, and cout signals]

Module Compaction
• Merges bit-slices into larger bit-slices
• Based on connectivity between datapath components
• Larger bit-slices have more optimization opportunities for flat synthesis
• Avoids merging based on carry chains
• Similar to the algorithm proposed by Koch
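A rough sketch of connectivity-based merging, under stated assumptions: it uses a simple union-find and is not Koch's algorithm or the authors' implementation. The netlist below (full adders `FA0` to `FA3` each feeding a multiplexer, linked by a carry chain) is an assumed example, and the function name `compact` is hypothetical. Slices joined by ordinary data nets are merged; carry nets are skipped so that a carry chain never pulls different bit positions into one slice.

```python
def compact(slices, nets):
    """Group bit-slices joined by non-carry nets (union-find)."""
    parent = {s: s for s in slices}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s

    for src, dst, is_carry in nets:
        if not is_carry:                    # never merge along a carry chain
            parent[find(src)] = find(dst)

    groups = {}
    for s in slices:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


slices = [f"FA{i}" for i in range(4)] + [f"mux{i}" for i in range(4)]
nets = [(f"FA{i}", f"mux{i}", False) for i in range(4)]      # assumed data nets
nets += [(f"FA{i}", f"FA{i + 1}", True) for i in range(3)]   # assumed carry chain
merged = compact(slices, nets)
```

With these nets, each adder is merged with the multiplexer it drives, giving four two-element slices, one per bit position.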

An Example
[Figure: module compaction example with full adders FA0 to FA4 and multiplexers mux0 to mux3]

Bit-slice I/O Optimization
• Granularity of bit-slice I/O optimization, m
• Breaks datapath components into m-bit wide chunks
• m bit-slices are kept identical to each other
• Allows some bit-slices in a datapath component to be optimized more than others

Bit-slice I/O Optimization
• Converts bit-slice I/O signals into internal signals if all m bit-slices meet an optimization criterion
• More optimization opportunities for flat synthesis
• Four types of I/O optimizations
  – Constant absorption
  – Feedback absorption
  – Duplicated input absorption
  – Unused output absorption
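Constant absorption can be sketched as follows. The exact criterion is not spelled out on the slides, so this assumes that an input is absorbed only when every bit-slice in an m-bit chunk sees the same constant on that port, which keeps the m slices identical. The function name `absorbable_constants` and the port and signal names are hypothetical.

```python
def absorbable_constants(component_inputs, m):
    """Find ports that can become internal constants, per m-bit chunk.

    component_inputs: one dict per bit-slice, mapping port -> driver.
    Drivers "0" and "1" denote constants; anything else is a signal name.
    """
    result = []
    for start in range(0, len(component_inputs), m):
        chunk = component_inputs[start:start + m]
        ports = {}
        for port in chunk[0]:
            drivers = {bs[port] for bs in chunk}
            # Absorb only if every slice in the chunk sees the same constant,
            # so the chunk's slices stay identical after optimization.
            if len(drivers) == 1 and drivers <= {"0", "1"}:
                ports[port] = drivers.pop()
        result.append(ports)
    return result


# 8-bit component: port "b" is tied to 0 on the low four bits only.
inputs = [{"a": f"x{i}", "b": "0" if i < 4 else f"y{i}"} for i in range(8)]
fine = absorbable_constants(inputs, 4)
coarse = absorbable_constants(inputs, 8)
```

With m = 4 the low chunk absorbs the constant on `b` while the high chunk does not; with m = 8 nothing is absorbed, since all eight slices would have to agree.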

Experimental Results
• Fifteen benchmark circuits
  – From the picoJava processor
  – Synthesized into 4-LUTs and DFFs
• Experiments
  – Area
  – Regularity
  – Area against m (the granularity of bit-slice I/O optimization)

Area
• m (granularity of bit-slice I/O optimization) = 4
• Compare datapath synthesis with flat synthesis

Post-synthesis Area (LUT Count)

Circuit        Flat Synthesis Area   Datapath Synthesis Area   Area Inflation
icu_dpath      3120                  3235                      3.7%
ex_dpath       2530                  2553                      0.91%
multmod_dp     1558                  1634                      4.9%
ucode_dat      1243                  1304                      4.9%
imdr_dpath     1182                  1219                      3.1%
dcu_dpath      960                   966                       0.63%
mantissa_dp    846                   878                       3.8%
incmod_dp      779                   865                       11%
smu_dpath      490                   493                       0.61%
exponent_dp    477                   501                       5.0%
pipe_dpath     443                   471                       6.3%
prils_dp       377                   388                       2.9%
rsadd_dp       346                   305                       -12%
code_seq_dp    218                   223                       2.3%
ucode_reg      78                    82                        5.1%
Total Area     14647                 15117                     3.2%
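The inflation column is consistent with the relative increase in LUT count of datapath synthesis over flat synthesis. A short sketch of that arithmetic, using two rows from the table (the function name `area_inflation` is illustrative):

```python
def area_inflation(flat_luts, datapath_luts):
    """Relative LUT-count increase of datapath over flat synthesis, in percent."""
    return 100.0 * (datapath_luts - flat_luts) / flat_luts


total = area_inflation(14647, 15117)   # Total Area row
rsadd = area_inflation(346, 305)       # rsadd_dp, where datapath synthesis is smaller
```

Rounded, `total` is 3.2 and `rsadd` is -12, matching the table.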

Regularity
• m (granularity of bit-slice I/O optimization) = 4
• Two terminal connections captured by
  – 4-bit wide buses
  – 4-bit wide control groups

Regularity
[Figure: a 4-bit wide bus and a 4-bit wide control group, each drawn across bit-slices S1 to S4]

Regularity Results

Circuit        Two Terminal Connections
dcu_dpath      2232
ex_dpath       6547
icu_dpath      8047
imdr_dpath     3100
pipe_dpath     1049
smu_dpath      1167
ucode_data     3143
ucode_reg      194
code_seq_dp    799
exponent_dp    1362
incmod_dp      2013
mantissa_dp    2533
multmod_dp     3380
prils_dp       864
rsadd_dp       722
Total          37152

4-bit Wide Buses: 49% 52% 47% 50% 48% 52% 72% 58% 32% 47% 39% 41% 52% 48%
4-bit Wide Control Groups: 43% 39% 36% 42% 25% 41% 21% 18% 23% 36% 25% 32% 27% 35%

• 94% of LUTs remain in regular datapath components

Granularity (m) Vs. Area
• Higher m (the granularity of bit-slice I/O optimization)
  – Keeps more bit-slices identical
  – Preserves more regularity
  – Higher area cost

Granularity Vs. Area Inflation
[Figure: plot of area inflation versus granularity (m)]

Conclusion
• Presented a datapath-oriented FPGA architecture
• Presented an enhanced module compaction algorithm
• Empirically demonstrated the area efficiency of the algorithm
  – 3%-8% area inflation
• Good regularity
  – 48% of two terminal connections are in 4-bit wide buses
  – 35% of two terminal connections are in 4-bit wide control groups