Physical Design Challenges of Reconfigurable Computing Systems
Majid Sarrafzadeh
NuCAD, Department of ECE, Northwestern University
Ryan Kastner, Todd Haverkos, Kia Bazargan, Seda Ogrenci, Eli Bozorgzadeh, Candice McGrew
Sponsored by: DARPA, Motorola, AT&T, NSF
Berkeley: Sept 15, 1999
Faculty Position
• In VLSI Design & CAD (1-2 openings)
• VLSI Design & CAD: one of the six focused research areas in the department
• Assistant/Associate/Full Professor
  – Northwestern rank: top 10
  – ECE: top 20 (top 10 in 5 years)
• Contact: majid@ece.nwu.edu
Field Programmable Gate Array: FPGA
FPGA (Xilinx)
[Figure: Degraded Image and Restored Image]
[Figure: three system configurations, labeled System A, System B, and System C. Labels: host processor (image is stored here) with an FPGA chip; on-board memory, where the image is stored, with an FPGA chip; image stored in on-chip memory, with the circuit to process the image residing on the rest of the chip]
The Architecture of a Reconfigurable System
[Figure: block diagram with blocks Data Memory, Control, RFU, CPU, and Instruction Memory (Program); connections labeled Data, RFUOPs, and CPU instructions]
Field Programmable Gate Array: FPGA
• SRAM cells used in configuration
  – Reconfigurable (runtime)
  – Static vs. dynamic configuration
• Hardware functions implemented as rectangular areas on the FPGA
[Figure: FPGA fabric with labels Programmable logic, RFU, SRAM cells, Programmable connections]
System Components
[Figure: block diagram. Labels as they appear: Data; CPU instructions; Configuration Memory; Config. Bits; Data Memory; Data; RFU; Program Manager; Instruction Mem. (Prog.); RFUOPs; Cache; Control; Prefetch/Branch; Manager; Prediction Unit; Placement Engine; RFU Manager]
System Behavior
• Two kinds of instructions
  – CPU instructions => always run on CPU
    • Assume known runtime
  – RFUOPs: might be performed on CPU if there is not enough room on RFU
    • Assume known runtime and reconfiguration time
• Runtime profiles and RFU status are used to decide between CPU and RFU
PD Challenges
• Problem: Given RFUOPs to be performed on the RFU and DFG constraints, schedule them in time and assign them physical locations.
• Must be very fast (mtools achieve 1000 cells per minute). Existing tools/techniques are very slow. Quality is less important.
• New PD algorithms/paradigms are needed.
• In this presentation:
  – placement,
  – routing,
  – an application on reconfigurable systems.
Firm Macros
• Not hard (too rigid), not soft (takes too much time to utilize the flexibility)
• Each unit is 80%-100% pre-designed: can “break” the macros in limited ways
• We have defined a network algebra for combining circuits (based on parameterization using VHDL generics), e.g., combining a fast and a slow adder in multiple ways
Execution of a Sample Program
Code:
    …
    x = 3*a - b;
    c = RFUOP1(x, 5);
    y = 4*x - c;
    for (i=0; i<3; i++){
        ++y;
        x += RFUOP2(y);
    }
    z = RFUOP1(x, 3);
    a = z - y;
    b = RFUOP3(a, b);
    c = a - b;
    …
Annotations on the slide: (on CPU), (on RFU), (in parallel). No room on RFU to run all in parallel ==> run in sequence.
[Figure: DFG of the code and RFU occupancy with axes t, x, y]
Placement
• On-line placement
  – RFU calls need to be executed as the program proceeds
• Off-line placement
  – Have a complete or partial profile of the operations
Online Placement
• When a new RFUOP arrives
  – Is there enough space to place the RFUOP?
  – If yes, which location is best to place it?
• Decision 1: Managing the empty space
  – Fast but sub-optimal
    • Keep only O(n) empty rectangles
      – Shorter Seg. (SSEG), Square Empty Rects. (SQR), ...
  – Efficient use of RFU real estate
    • KAMER: Keep all O(n^2) maximal empty rectangles
• Decision 2: Packing rule
  – Best Fit, Bottom Left, First Fit
Keeping All Empty Rectangles vs. Keeping O(n) Empty Rectangles - SSEG
[Figure: two panels comparing the two schemes; one module is annotated "Cannot fit this"]
Heuristics for Choosing an Empty Rectangle
[Figure: current placement with two empty rectangles A and B, whose bottom-left corners are P1 and P2, and the new module to be inserted]
• FF (First Fit): Any of A or B could be chosen for placing the new module.
• BF (Best Fit): Places the new module in the empty rectangle which causes less wasted space. In the example, the wasted area in A is smaller than in B => choose A.
• BL (Bottom Left): Places the new module in the rectangle with the lower bottom-left corner, breaking ties by picking the leftmost one. In the example, y(P2) < y(P1) => choose B.
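The three packing rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: empty rectangles are represented as hypothetical `(x, y, width, height)` tuples with `(x, y)` the bottom-left corner, "wasted space" is taken to be the rectangle area minus the module area, and `choose_rect` is a name invented here, not part of any tool from the talk.

```python
from typing import List, Optional, Tuple

# (x, y, width, height); (x, y) is the bottom-left corner (assumed representation)
Rect = Tuple[int, int, int, int]


def fits(rect: Rect, w: int, h: int) -> bool:
    """True if a w x h module fits inside the empty rectangle."""
    return rect[2] >= w and rect[3] >= h


def choose_rect(empty: List[Rect], w: int, h: int, rule: str) -> Optional[Rect]:
    """Pick an empty rectangle for a w x h module using FF, BF, or BL."""
    candidates = [r for r in empty if fits(r, w, h)]
    if not candidates:
        return None  # no room on the RFU for this module
    if rule == "FF":  # First Fit: any fitting rectangle will do; take the first
        return candidates[0]
    if rule == "BF":  # Best Fit: least wasted area (assumed: rect area - module area)
        return min(candidates, key=lambda r: r[2] * r[3] - w * h)
    if rule == "BL":  # Bottom Left: lowest corner, ties broken by leftmost
        return min(candidates, key=lambda r: (r[1], r[0]))
    raise ValueError(f"unknown rule: {rule}")


if __name__ == "__main__":
    empty_rects = [(0, 5, 6, 4), (4, 0, 3, 3)]
    for rule in ("FF", "BF", "BL"):
        print(rule, choose_rect(empty_rects, 2, 2, rule))
```

The cost of each rule is dominated by the scan over the empty rectangles, which is why the number of rectangles kept (O(n) versus O(n^2)) matters for speed.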
Heuristics for Choosing a Segment
[Figure: the empty space next to a placed module can be split by segment S1 into rectangles A and B, or by segment S2 into rectangles C and D]
• SSEG (Shorter Seg): Chooses the shorter of the two segments. Example: S1 < S2.
• LSEG (Longer Seg): Chooses the longer of the two segments. Example: S1 < S2.
• BER (Balanced Empty Rects): Chooses the segment which creates less area difference. Example: Area(B) - Area(A) > Area(D) - Area(C).
• LSQR (Larger Rect Square): Chooses the segment which creates the larger rectangle closer to square. Example: AspectRatio(B) > AspectRatio(D).
• LER (Large Empty Rects): Chooses the segment which creates the larger empty rectangle. Example: Area(B) > Area(D).
• SQR (Square Rects): Chooses the segment which creates empty rectangles closer to squares. Example: Max{AR(A), AR(B)} < Max{AR(C), AR(D)}, where AR = AspectRatio.
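A minimal sketch of the segment choice, under assumptions that are not on the slide: the module of size w x h is placed in the bottom-left corner of a W x H empty rectangle, the "vertical" segment extends the module's right edge upward and the "horizontal" segment extends its top edge to the right, and aspect ratio is defined as longer side over shorter side (so 1.0 is a square). LSQR is left out because the slide does not pin down its comparison direction. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Size = Tuple[int, int]  # (width, height)


def splits(W: int, H: int, w: int, h: int) -> Dict[str, Tuple[int, List[Size]]]:
    """Two ways to cut the L-shaped free space into two rectangles.

    Returns, per segment, its length and the two empty rectangles it creates.
    """
    return {
        "vertical": (H - h, [(w, H - h), (W - w, H)]),
        "horizontal": (W - w, [(W, H - h), (W - w, h)]),
    }


def area(s: Size) -> int:
    return s[0] * s[1]


def aspect_ratio(s: Size) -> float:
    """Longer side over shorter side; degenerate rectangles count as infinitely thin."""
    lo, hi = min(s), max(s)
    return float("inf") if lo == 0 else hi / lo


def choose_segment(W: int, H: int, w: int, h: int, rule: str) -> str:
    """Return 'vertical' or 'horizontal' according to the named heuristic."""
    options = splits(W, H, w, h)

    def score(name: str) -> float:  # lower score wins
        seg_len, rects = options[name]
        areas = [area(r) for r in rects]
        if rule == "SSEG":
            return seg_len
        if rule == "LSEG":
            return -seg_len
        if rule == "LER":
            return -max(areas)
        if rule == "BER":
            return abs(areas[0] - areas[1])
        if rule == "SQR":
            return max(aspect_ratio(r) for r in rects)
        raise ValueError(f"unknown rule: {rule}")

    return min(options, key=score)
```

Because only one segment is kept per placed module, the number of empty rectangles grows linearly with the number of modules, at the price of sometimes missing a space that the full set of maximal empty rectangles would expose.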
Online Placement Results
Table 1. Percentage of accepted modules using different bin-packing and empty space partitioning rules
Online Placement Results
[Figure: plot of the volume that does not fit; one entry is marked BEST]
Online Placement Results (cont.)
Off-line placement: 3-D Floorplanning
[Figure: a DFG, its schedule on CPU and RFU, and the RFUOPs drawn as boxes over the RFU area (axes x, y) and time (axis t)]
3-D Floorplanning
[Figure: the same DFG, schedule, and 3-D view, with the annotation "By deleting this RFUOP (CPU performs the operation)..."]
3-D Floorplanning
[Figure: the DFG, schedule, and 3-D view after the change]
Our 3-D Floorplanner: No change in the schedule
• Pure annealing
  – Move set
    • Move operation from CPU set to RFU set
    • Move operation from RFU set to CPU set
    • Displace an already placed RFUOP on the RFU
  – Cost function: Volume
  – Very poor results
• Start with an ASAP schedule, use on-line placement to get an initial solution, then apply low-temperature annealing
Offline Placement Results
Algorithm       Data set   Offline penalty   Online penalty   Ratio
LTSA, X=100%    T50        147287            213153           69.10%
LTSA, X=100%    T100       253566            307879           82.36%
LTSA, X=100%    S100       464049            508923           91.18%
LTSA, X=100%    S200       539435            612623           88.05%
LTSA, X=100%    A1024      427761            456627           93.68%
LTSA, X=20%     T50        148975            213153           69.89%
LTSA, X=20%     T100       225603            307879           73.28%
LTSA, X=20%     S100       287153            508923           56.42%
LTSA, X=20%     S200       359980            612623           58.76%
LTSA, X=20%     A1024      213036            456627           46.65%
X: place X% of the largest-volume modules using on-line placement. Ratio = offline penalty / online penalty.
Flexibility of the Modules
• The module library has different implementations for each RFUOP
  – Experimental results with our online algorithms show about 60% reduction in penalty.
• 3-4 implementations are enough
Faster Routing: mostly offline
VPR CAD flow
• Inputs: technology-mapped netlist, architecture description file
• VPR: place circuit or read in existing placement
• VPR: perform either global or combined global/detailed routing
• Outputs: placement and routing output files
Routing Algorithm (VPR)
Call VPR’s router with an arbitrary channel width
• Based on the PathFinder negotiated congestion algorithm
Step 1: Each net is routed by the shortest path that can be found (regardless of any overuse of wiring segments).
Step 2: Sequentially rip up and re-route every net in the circuit (by the lowest-cost path found).
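The two steps can be illustrated with a heavily simplified negotiated-congestion loop. This is a sketch and not VPR's implementation: it assumes every routing node has capacity 1, uses a node cost of (1 + history) * (1 + present overuse), ignores overuse in the first pass by setting the present-congestion factor to zero, and assumes every sink is reachable. The names `cheapest_path` and `negotiate` are invented for this example.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, List[str]]  # adjacency list over routing nodes


def cheapest_path(graph: Graph, cost: Callable[[str], float],
                  src: str, dst: str) -> List[str]:
    """Dijkstra with a cost charged for entering each node (dst assumed reachable)."""
    dist = {src: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def negotiate(graph: Graph, nets: Dict[str, Tuple[str, str]],
              max_iters: int = 20) -> Dict[str, List[str]]:
    """Rip up and re-route all nets until no node is used by more than one net."""
    occ = {n: 0 for n in graph}       # nets currently using each node
    hist = {n: 0.0 for n in graph}    # accumulated congestion history
    routes: Dict[str, List[str]] = {}
    for it in range(max_iters):
        pres_fac = 0.0 if it == 0 else 1.0  # step 1 ignores overuse
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):  # rip up the old route
                occ[n] -= 1

            def cost(n: str) -> float:
                overuse = max(0, occ[n] + 1 - 1)  # usage after adding this net, capacity 1
                return (1.0 + hist[n]) * (1.0 + pres_fac * overuse)

            path = cheapest_path(graph, cost, src, dst)
            for n in path:
                occ[n] += 1
            routes[name] = path
        overused = [n for n in graph if occ[n] > 1]
        if not overused:
            return routes
        for n in overused:  # make congested nodes more expensive next time
            hist[n] += occ[n] - 1
    return routes
```

In a small graph where two nets both prefer a shared node `m` and only one of them has a detour, the first pass overuses `m`, and the second pass pushes the net with the detour onto it.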
Fast Pattern Routing
• Maze-based routing gives good results but is very slow.
• So, speed up the router by partially using pattern routing: if an arbitrary net is picked and routed differently, the result does not change significantly.
Independent subset of nets
[Figure: two geometrically independent sets of nets, marked Class 1 and Class 2]
Routing Patterns
Cost = L + const / Flexibility
[Figure: 2-terminal net patterns; multi-terminal net patterns (MST & RSTs)]
Implementation of Algorithm
• First choose the 2-terminal nets to route
  - More than 50% of the nets are 2-terminal nets.
  - To get the maximum independent sets, sort the 2-terminal nets by their bounding boxes.
  - Classify the 2-terminal nets into geometrically independent classes.
  - Route the classes sequentially by pattern routing.
• Next choose the multi-terminal nets (low fan-out)
  - Route them in their corresponding RST patterns.
• Finally, let the rest of the nets be routed by the traditional router
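The classification step for 2-terminal nets might look like the sketch below. It rests on assumptions the slides do not spell out: two nets are treated as geometrically independent when their bounding boxes do not overlap, nets are sorted by bounding-box half-perimeter, and classes are filled greedily. The pattern generator only produces the two L-shaped routes; all names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Net = Tuple[Point, Point]          # a 2-terminal net
Box = Tuple[int, int, int, int]    # (xmin, ymin, xmax, ymax)


def bbox(net: Net) -> Box:
    (x1, y1), (x2, y2) = net
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one grid point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def half_perimeter(net: Net) -> int:
    x0, y0, x1, y1 = bbox(net)
    return (x1 - x0) + (y1 - y0)


def independent_classes(nets: List[Net]) -> List[List[Net]]:
    """Greedily group nets so that bounding boxes within a class never overlap."""
    classes: List[List[Net]] = []
    for net in sorted(nets, key=half_perimeter):  # small boxes first (assumed order)
        for cls in classes:
            if not any(overlaps(bbox(net), bbox(other)) for other in cls):
                cls.append(net)
                break
        else:
            classes.append([net])  # conflicts with every existing class
    return classes


def l_patterns(net: Net) -> List[List[Point]]:
    """The two L-shaped routes of a 2-terminal net, as corner-point lists."""
    (x1, y1), (x2, y2) = net
    return [[(x1, y1), (x2, y1), (x2, y2)],
            [(x1, y1), (x1, y2), (x2, y2)]]
```

Nets inside one class cannot compete for the same routing resources when each is kept inside its own bounding box, so the order in which they are pattern-routed within the class does not matter.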
Experimental Results
Image Restoration
[Figure: 3 x 3 weight mask d, with r0 at the center and r1 at each of the eight neighbors]
The value of the center pixel in the next iteration:
$x_{k+1} = \beta\, y + x_k - \beta\,(d ** x_k)$
• $y$: the pixel value from the original degraded image
• $x_k$: the pixel value from the previous iteration
• $d ** x_k$ denotes the weighted sum r1 * (eight neighbor pixels) + r0 * center pixel
• $\beta$: scalar weighting coefficient
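A pure-Python sketch of one way to run this iteration on a whole image. Several choices here are assumptions rather than details from the slides: the coefficient that multiplies y and the weighted sum is called `beta`, pixels outside the image are treated as zero, and the iteration starts from the degraded image itself. The function names are hypothetical.

```python
from typing import List

Image = List[List[float]]


def weighted_sum(x: Image, r0: float, r1: float, i: int, j: int) -> float:
    """r0 * center pixel + r1 * (sum of the eight neighbors); outside pixels count as 0."""
    rows, cols = len(x), len(x[0])
    total = r0 * x[i][j]
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di or dj) and 0 <= i + di < rows and 0 <= j + dj < cols:
                total += r1 * x[i + di][j + dj]
    return total


def restore_step(x: Image, y: Image, r0: float, r1: float, beta: float) -> Image:
    """One iteration: x_next = beta*y + x - beta*(weighted sum around x)."""
    return [
        [beta * y[i][j] + x[i][j] - beta * weighted_sum(x, r0, r1, i, j)
         for j in range(len(x[0]))]
        for i in range(len(x))
    ]


def restore(y: Image, r0: float, r1: float, beta: float, iterations: int) -> Image:
    """Run a fixed number of iterations, starting from the degraded image (assumed)."""
    x = [row[:] for row in y]
    for _ in range(iterations):
        x = restore_step(x, y, r0, r1, beta)
    return x
```

Each output pixel depends only on a 3 x 3 neighborhood of the previous iterate, which is what makes the computation a natural fit for parallel hardware.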
Incentive: Processing of large images using FPGAs with limited resources
1. Segmentation of the image into smaller images suitable for the FPGA
Segments of size m x n are surrounded by an overlap of o.
[Figure: an image segment with dimensions m and n and overlap width o]
2. Pixels of individual segments are restored in parallel by hardware. Restored segments are written back after the overlap is discarded.
[Figure: a segment of size m x n with overlap o, transferred between MEMORY and the RFU]
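The segment, restore, discard-overlap, write-back cycle can be sketched as follows. This is an illustration only: segments are processed one after another rather than in parallel hardware, m is taken as the segment height and n as its width, the overlap is clipped at the image border, and `restore_segment` stands in for whatever the RFU computes. All names are hypothetical.

```python
from typing import Callable, List

Image = List[List[float]]


def process_in_segments(image: Image, m: int, n: int, o: int,
                        restore_segment: Callable[[Image], Image]) -> Image:
    """Restore an image tile by tile, using an overlap of o pixels around each tile."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for top in range(0, rows, m):
        for left in range(0, cols, n):
            # Segment plus overlap, clipped to the image border.
            row_lo, row_hi = max(0, top - o), min(rows, top + m + o)
            col_lo, col_hi = max(0, left - o), min(cols, left + n + o)
            padded = [row[col_lo:col_hi] for row in image[row_lo:row_hi]]
            restored = restore_segment(padded)
            # Write back only the m x n core; the overlap is discarded.
            for i in range(top, min(rows, top + m)):
                for j in range(left, min(cols, left + n)):
                    out[i][j] = restored[i - row_lo][j - col_lo]
    return out
```

Only the core of each restored segment is kept because pixels near a segment border are computed without their true neighbors; the overlap absorbs that error.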
How bad is the segmentation?
• Theorem: The error introduced is about $w^{o}$; example: $(1/16)^2 = 1/256$
• Proof: By induction
[Figure: a segment with dimensions m, n and overlap o]
[Figure: Degraded Image and Restored Image]
[Figure: three system configurations, labeled System A, System B, and System C. Labels: host processor (image is stored here) with an FPGA chip; on-board memory, where the image is stored, with an FPGA chip; image stored in on-chip memory, with the circuit to process the image residing on the rest of the chip]
Running Times of the Application on Software and on Different Systems (ignoring reconfiguration)
Image       Software Running Time (sec)   Running Time for System A (msec)   Running Time for System C (msec)
cameraman   4.772                         9.157                              91.960
moon        2.812                         5.725                              54.494
circle      2.987                         4.254                              42.722
animals     6.761                         8.826                              88.628
fish        7.029                         14.026                             140.850
barbara     21.741                        36.630                             367.840
yacht       12.367                        34.079                             342.227
soccer      12.360                        34.079                             342.227
announcer   13.462                        34.079                             342.227
bluegirl    10.158                        34.079                             342.227
cablecar    12.354                        34.079                             342.227
cornfield   13.458                        34.079                             342.227
Conclusions
• Need a radical departure (new algorithms, etc.) from traditional PD algorithms.
• Fast (and lower quality) place & route tools
• Do as much as possible (building complex libraries, hierarchical routing, …) before compilation
• All of the above (and more) is needed to make reconfigurable computing a reality.