
Self-Improving Computer Chips – Warp Processing
Frank Vahid, Dept. of CS&E, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Contributing Ph.D. students: Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona), Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville), Scotty Sirowy (current), David Sheldon (current)
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx

FPGA Coprocessing Entering Mainstream
- Xilinx, Altera, …
- Cray, SGI
- Mitrionics
- AMD Opteron
- Intel QuickAssist
- IBM Cell (research)
[Photos: Xilinx Virtex II Pro and Virtex V (source: Xilinx); AMD Opteron socket plug-ins; SGI Altix supercomputer (UCR's: 64 Itaniums plus 2 FPGA RASCs)]

Circuits on FPGAs Can Execute Fast
- Large speedups on many important applications
- Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, …

Background: System Synthesis, Hardware/Software Partitioning
- SpecSyn, 1989-1994 (Gajski et al., UC Irvine)
  - Synthesized executable specifications like VHDL or SpecCharts (now SpecC) to microprocessors and custom ASIC circuits
  - FPGAs were just invented and had very little capacity
- ~2000: dynamic software optimization/translation
  - e.g., HP's Dynamo; Java JIT compilers; Transmeta Crusoe "code morphing"
  [Figure: x86 binary → binary translation → VLIW binary on a VLIW µP, for higher performance]

Circuits on FPGAs are Software
[Figure: Microprocessor binaries (instructions, e.g., 01110100...) are bits loaded into program memory; FPGA "binaries" (circuits, e.g., 001010010...), more commonly known as bitstreams, are bits loaded into LUTs and switch matrices (SMs). Both are "software"; the processor and the FPGA are the "hardware".]

Circuits on FPGAs are Software
- 1958 article: "Today the 'software' comprising the carefully planned interpretive routines, compilers, and other aspects of automative programming are at least as important to the modern electronic calculator as its 'hardware' of tubes, transistors, wires, tapes, and the like."
- "Circuits" are often called "hardware", but it was not always so
- "Software" does not equal "instructions": software is simply the bits, and bits may represent instructions, circuits, …

Circuits on FPGAs are Software
[Image: Sep. 2007 IEEE Computer article]

The New Software – Circuits on FPGAs – May Be Worth Paying Attention To
- Multi-billion-dollar, growing industry
- Increasingly found in embedded system products: medical devices, base stations, set-top boxes, etc.
- Recent announcements (e.g., Intel): FPGAs about to "take off"??
- History repeats itself? "… 1876; there was a lot of love in the air, but it was for the telephone, not for Bell or his patent. There were many more applications for telephone-like devices, and most claimed Bell's original application was for an object that wouldn't work as described. Bell and his partners weathered these, but at such a great cost that they tried to sell the patent rights to Western Union, the giant telegraph company, in late 1876 for $100,000. But Western Union refused, because at the time they thought the telephone would never amount to anything. After all, why would anyone want a telephone? They could already communicate long-distance through the telegraph, and early phones had poor transmission quality and were limited in range. …" (http://www.telcomhistory.org/)

JIT Compilers / Dynamic Translation
- e.g., Java JIT compilers; Transmeta Crusoe "code morphing"; extensive binary translation in modern microprocessors
  [Figure: x86 binary → binary translation → VLIW binary on a VLIW µP; likewise, µP binary → JIT compiler / binary "translation" → FPGA circuit]
- Inspired by the binary translators of the early 2000s, we began the "warp processing" project in 2002: dynamically translate a binary to circuits on FPGAs

Warp Processing, Step 1: Initially, the software binary is loaded into instruction memory
[Architecture: profiler, µP with I-Mem and D$, FPGA, and on-chip CAD]
Software binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg6
  Add reg3, 1
  Beq reg3, 10, -5
  Ret reg4

Warp Processing, Step 2: The microprocessor executes the instructions in the software binary
[Same architecture and software binary as step 1]

Warp Processing, Step 3: The profiler monitors instructions (e.g., beq/add execution frequencies) and detects critical regions in the binary; here, the critical loop is detected
[Same architecture and software binary, with the loop body highlighted]

Warp Processing, Step 4: The on-chip CAD reads in the critical region
[Same architecture and software binary]

Warp Processing, Step 5: The on-chip CAD (the dynamic partitioning module, DPM) decompiles the critical region into a control/data flow graph (CDFG)
- Decompilation is surprisingly effective at recovering high-level program structures (loops, arrays, subroutines, etc.), which are needed to synthesize good circuits
- Stitt et al. ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07
Decompiled critical region:
  reg3 := 0
  reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
  ret reg4

Warp Processing, Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit
[Figure: the loop's memory loads and additions mapped onto a parallel adder tree]

Warp Processing, Step 7: The on-chip CAD maps the circuit onto the FPGA
[Figure: the adder-tree circuit placed onto the FPGA's CLBs and switch matrices (SMs)]
- Lean place & route plus a CAD-oriented FPGA give 10x faster CAD (Lysecky et al. DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06)
- On multi-core chips: use one powerful core for the CAD

Warp Processing, Step 8: The on-chip CAD replaces instructions in the binary with instructions that interact with the FPGA, causing performance and energy to "warp" by an order of magnitude or more
- >10x speedups for some apps (software-only vs. "warped" execution)
- "Warp speed, Scotty"

Warp Processing Challenges
- Two key challenges:
  - Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  - Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
[Tool flow: binary → profiling & partitioning → decompilation → synthesis → standard circuit binary → JIT FPGA compilation → FPGA binary; the binary updater merges the microprocessor binary and FPGA binary into the updated binary. Platform: profiler, µP, FPGA, I$, D$, on-chip CAD]

Decompilation
- Solution: recover high-level information from the binary (branches, loops, arrays, subroutines, …) via decompilation
- Adapted extensive previous work (done for other purposes); developed new methods (e.g., "rerolling" loops)
- Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville); numerous publications: http://www.cs.ucr.edu/~vahid/pubs

Original C code:
  long f( short a[10] ) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding assembly:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg6
  Add reg3, 1
  Beq reg3, 10, -5
  Ret reg4

Data flow analysis and control/data flow graph creation:
  reg3 := 0
  reg4 := 0
  loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
  ret reg4

Control structure and function recovery:
  long f( long reg2 ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += mem[reg2 + (reg3 << 1)];
    }
    return reg4;
  }

Array recovery:
  long f( short array[10] ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += array[reg3];
    }
    return reg4;
  }

The recovered code and the original are almost identical representations.

Decompilation Results vs. C
- Competitive with synthesis from C

Decompilation Results on Optimized H.264: In-Depth Study with Freescale
- Again, competitive with synthesis from C

Decompilation is Effective Even with High Compiler-Optimization Levels
- Do compiler optimizations generate binaries that are harder to decompile effectively?
- (Surprisingly) we found the opposite: optimized code decompiled even better
[Chart: average speedup of 10 examples]

Warp Processing Challenges (recap)
- Two key challenges: decompiling binaries to recover enough high-level constructs for fast circuits (addressed above), and JIT compiling to FPGAs using limited on-chip compute resources (addressed next)

Challenge: JIT Compile to FPGA
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed a CAD-oriented FPGA
- e.g., our router (ROCR) is 10x faster and uses 20x less memory, at the cost of a 30% longer critical path; similar results for synthesis & placement
- Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona); numerous publications: http://www.cs.ucr.edu/~vahid/pubs; EDAA Outstanding Dissertation Award
- Tool comparison (DAC'04): Xilinx ISE: 9.1 s, 60 MB; Riverside JIT FPGA tools: 0.2 s, 3.6 MB; Riverside JIT FPGA tools on a 75 MHz ARM7: 1.4 s, 3.6 MB

Warp Processing Results
- Performance speedup (most frequent kernel only): average kernel speedup of 41x vs. 200 MHz ARM-only execution
- Overall application speedup averages 7.4x

Recent Work: Thread Warping (CODES/ISSS Oct. 2007, Austria; Best Paper nominee)
- Multi-core platforms → multithreaded apps, e.g.:
    for (i = 0; i < 10; i++) {
      thread_create( f, i );
    }
- The OS schedules threads onto the available µPs; remaining threads are added to a queue
- The OS invokes the on-chip CAD tools to create accelerators for f() (kept in an accelerator library), then schedules threads onto the accelerators (possibly dozens) in addition to the µPs
- Thread warping: use one core to create accelerators for the waiting threads
- Very large speedups possible: parallelism at the bit, arithmetic, and now thread level too
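A minimal sketch of the threaded pattern the slide assumes, using the POSIX pthread API the framework builds on (the slide's thread_create(f, i) is shorthand; pthread_create and the argument passing below are the standard equivalents, and the body of f is a placeholder):

  #include <pthread.h>
  #include <stdio.h>

  #define N 10

  /* Thread function: in thread warping, functions like f are what the
     on-chip CAD tools synthesize into FPGA accelerators. */
  void *f(void *arg) {
      int i = *(int *)arg;
      printf("thread %d running\n", i);
      return NULL;
  }

  int main(void) {
      pthread_t threads[N];
      int ids[N];
      /* The OS schedules these onto available uPs; with thread warping,
         queued threads can instead run on synthesized accelerators. */
      for (int i = 0; i < N; i++) {
          ids[i] = i;
          pthread_create(&threads[i], NULL, f, &ids[i]);
      }
      for (int i = 0; i < N; i++)
          pthread_join(threads[i], NULL);
      return 0;
  }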

Thread Warping Tools
- Developed a framework on the µP + FPGA + on-chip CAD platform
- Uses the pthread library (POSIX), with mutexes/semaphores for synchronization
[Tool flow: thread queue → queue analysis (thread counts, thread functions) → if an accelerator is not already in the accelerator library, synthesize it: decompilation → hw/sw partitioning → high-level synthesis → netlist → place & route → bitfile; accelerator instantiation yields a schedulable resource list and thread group table; memory access synchronization and the binary updater produce the updated binary]

Memory Access Synchronization (MAS)
- Must deal with the widely known memory bottleneck problem: FPGAs are great, but often can't get data to them fast enough
    for (i = 0; i < 10; i++) {
      thread_create( thread_function, a, i );
    }
- Data for dozens of threads can create a bottleneck [Figure: RAM → DMA → FPGA feeding threads a(), b(), …]
- Threaded programs exhibit a unique feature: multiple threads often access the same data, e.g.:
    void f( int a[], int val ) {
      int result = 0;
      for (i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      . . .
    }
  (every thread reads the same array a)
- Solution: fetch data once, broadcast to multiple threads (MAS)

Memory Access Synchronization (MAS)
1) Identify thread groups: loops that create threads
2) Identify constant memory addresses in the thread function, via def-use analysis of the parameters to the thread function
3) Synthesis creates a "combined" memory access, with execution synchronized by the OS
Example:
  Thread group:
    for (i = 0; i < 100; i++) {
      thread_create( f, a, i );
    }
  Def-use analysis shows a is constant for all threads, so the addresses of a[0-9] are constant for the thread group
  The data is fetched once (RAM → DMA → A[0-9]) and delivered to the entire group of f() instances, enabled by the OS
  Before MAS: 1000 memory accesses; after MAS: 100 memory accesses
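A software-level sketch of the fetch-once-and-broadcast idea (illustrative only: in warp processing the fetch is a DMA transfer and the broadcast happens in hardware under OS control, not in C; the access counter simply makes the savings visible). In this idealized single-group version the data is read from RAM just 10 times; the slide's count of 100 presumably reflects the OS synchronizing threads in groups:

  #include <stdio.h>

  #define THREADS 100
  #define LEN 10

  static int ram_accesses = 0;   /* counts reads that go to RAM */

  /* Without MAS: every thread fetches a[0..9] from RAM itself. */
  static int f_no_mas(const int a[], int val) {
      int result = 0;
      for (int i = 0; i < LEN; i++) {
          ram_accesses++;                 /* one RAM read per element */
          result += a[i] * val;
      }
      return result;
  }

  /* With MAS: threads read a broadcast copy fetched once. */
  static int f_mas(const int buf[], int val) {
      int result = 0;
      for (int i = 0; i < LEN; i++)
          result += buf[i] * val;
      return result;
  }

  int main(void) {
      int a[LEN] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

      for (int t = 0; t < THREADS; t++) f_no_mas(a, t);
      printf("without MAS: %d RAM accesses\n", ram_accesses);  /* 1000 */

      ram_accesses = 0;
      int buf[LEN];
      for (int i = 0; i < LEN; i++) { ram_accesses++; buf[i] = a[i]; }
      for (int t = 0; t < THREADS; t++) f_mas(buf, t);
      printf("with MAS: %d RAM accesses\n", ram_accesses);     /* 10 */
      return 0;
  }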

Memory Access Synchronization (MAS)
- Also detects overlapping memory regions: "windows"
- Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA'04] that caches reused data and delivers windows to threads
Example:
    for (i = 0; i < 100; i++) {
      thread_create( thread_function, a, i );
    }
    void f( int a[], int i ) {
      int result = a[i] + a[i+1] + a[i+2] + a[i+3];
      . . .
    }
  Each thread accesses different addresses, but the addresses may overlap
  Data is streamed to the smart buffer once (RAM → DMA → A[0-103]); the buffer delivers each thread's window (A[0-3], A[1-4], …, A[6-9], …)
  Without the smart buffer: 400 memory accesses; with the smart buffer: 104 memory accesses
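A rough software analogue of the smart buffer, under the same caveat (the real buffer is synthesized hardware fed by DMA). Streaming the 103 distinct elements in once serves all 100 overlapping 4-element windows; the slide's 104 presumably counts a fetch of A[0..103]:

  #include <stdio.h>

  #define THREADS 100
  #define WIN 4
  #define LEN (THREADS + WIN - 1)   /* a[0..102]: 103 elements */

  int main(void) {
      int a[LEN];
      for (int i = 0; i < LEN; i++) a[i] = i;

      /* Stream the whole region into the smart buffer exactly once. */
      int ram_accesses = 0;
      int buf[LEN];
      for (int i = 0; i < LEN; i++) {
          ram_accesses++;
          buf[i] = a[i];
      }

      /* Each thread i reads its overlapping window buf[i..i+3]. */
      long total = 0;
      for (int i = 0; i < THREADS; i++)
          total += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];

      printf("with smart buffer: %d RAM accesses (total %ld)\n",
             ram_accesses, total);                         /* 103 */
      printf("without: %d RAM accesses\n", THREADS * WIN); /* 400 */
      return 0;
  }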

Speedups from Thread Warping
- Chose benchmarks with extensive parallelism
- Compared to a 4-ARM device: average 130x speedup
- But the FPGA uses additional area, so we also compared to systems with 8 to 64 ARM11 µPs (the FPGA's size ≈ 36 ARM11s): 11x faster than the 64-core system
- The simulation is pessimistic; actual results are likely better

Warp Scenarios: warping takes time, so when is it useful?
- Long-running applications (scientific computing, etc.): the speedup is realized within a single execution [timeline: µP alone vs. µP + on-chip CAD + FPGA]
- Recurring applications (save FPGA configurations): common in embedded systems; the first execution might be viewed as a (long) boot phase, with full speedup on later executions

Why Dynamic?
- Static is good, but hiding the FPGA opens the technique to all software platforms: standard languages, tools, and binaries
  [Static compiling to FPGAs: specialized language → specialized compiler → binary + netlist. Dynamic compiling to FPGAs: any language → any compiler → binary → µP + FPGA with on-chip CAD]
- Can adapt to changing workloads: smaller and more accelerators, fewer and larger accelerators, …
- Can add an FPGA without changing binaries, like expanding memory or adding processors to a multiprocessor
- Custom interconnections, tuned processors, …

Dynamic Enables Expandable Logic: Concept
- Expandable RAM: the system detects the amount of RAM during start-up and invisibly adapts the application to use less or more RAM
- Expandable logic: the warp tools detect the amount of FPGA and invisibly improve performance
[Platform: µPs, cache, DMA, RAM, profiler, warp tools; expandable RAM and expandable logic; performance grows as FPGA is added]

Dynamic Enables Expandable Logic
- Large speedups: 14x to 400x (on scientific apps)
- Different apps require different amounts of FPGA
- Expandable logic allows customization of a single platform: the user selects the required amount of FPGA, with no need to recompile/synthesize
- Recent (Spring 2008) results vs. a 3.2 GHz Intel Xeon: 2x-8x speedups
  - Nallatech H101-PCIXM FPGA accelerator board with a Virtex IV LX100 FPGA; I/O memories are 8 MB SRAMs; the board connects to the host processor over a PCI-X bus

Ongoing Work: Dynamic Coprocessor Management (CASES'08)
- Multiple possible applications a1, a2, …, each with a pre-designed FPGA coprocessor c1, c2, …, whose (optional) use provides speedup [app runtime = reconfiguration time + runtime with the coprocessor, vs. runtime on the CPU alone]
- The size of the FPGA is limited: how should the coprocessors be managed?
- e.g., loading c2 would require removing c1 or c3; is it worthwhile? That depends on the pattern of future instances of a1, a2, a3
- Must make an "online" decision

The Ski-Rental Problem
- Greedy ("always load") doesn't consider past apps, which may predict the future
- Solution idea from the "ski-rental problem" (a popular online-algorithms technique):
  - You decide to take up skiing: should you rent skis each trip, or buy?
  - Popular online-algorithm solution: rent until the cumulative rental cost equals the cost of buying, then buy
  - Guarantees you never pay more than 2x the cost of buying
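A minimal sketch of the rent-until-break-even rule with made-up costs (rent 50, buy 300), just to show the mechanics behind the 2x guarantee:

  #include <stdio.h>

  int main(void) {
      const int rent = 50, buy = 300;
      int spent_renting = 0, bought = 0, total = 0;

      for (int trip = 1; trip <= 10; trip++) {
          if (!bought && spent_renting + rent >= buy) {
              total += buy;            /* break-even reached: buy */
              bought = 1;
          } else if (!bought) {
              spent_renting += rent;   /* keep renting */
              total += rent;
          }                            /* after buying, trips cost nothing */
          printf("trip %d: total cost %d\n", trip, total);
      }
      /* Worst case here: 250 rented + 300 bought = 550 <= 2 * 300. */
      return 0;
  }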

Cumulative Benefit Heuristic
- Maintain a cumulative time benefit for each coprocessor
  - Benefit of coprocessor i: tp_i - tc_i (runtime on the CPU alone minus runtime with the coprocessor)
  - cbenefit(i) = cbenefit(i) + (tp_i - tc_i): the time coprocessor i would have saved up to this point had it always been used for app i
- Only consider loading coprocessor i if cbenefit(i) > loading_time(i)
  - Resists loading coprocessors that run infrequently or give little speedup
- Example (loading time 200 for all coprocessors; tp = 200/100/50 and tc = 10/20/25 for a1/a2/a3, so the per-instance benefits are 190/80/25):
  Q = <a1, a3, a2, a1, a3>
  The cumulative-benefit table evolves as c1: 190, c3: 25, c2: 80, c1: 380, c3: 50; only c1's second instance exceeds the loading time (380 > 200), so
  Loads = <--, --, --, c1, --> (already loaded thereafter)
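A sketch of the bookkeeping just described, reproducing the slide's example (tp = 200/100/50, tc = 10/20/25, loading time 200); FPGA capacity and the replacement policy of the next slide are omitted:

  #include <stdio.h>

  typedef struct {
      double tp;        /* app runtime on the CPU alone */
      double tc;        /* app runtime using its coprocessor */
      double load_time; /* reconfiguration time for the coprocessor */
      double cbenefit;  /* cumulative time benefit */
      int    loaded;
  } Coproc;

  /* Called on each app instance: credit the benefit, then load the
     coprocessor only once the accumulated benefit exceeds the cost. */
  static void on_app_instance(Coproc *c) {
      c->cbenefit += c->tp - c->tc;
      if (!c->loaded && c->cbenefit > c->load_time)
          c->loaded = 1;   /* in the full heuristic, may first have to
                              evict lower-benefit coprocessors */
  }

  int main(void) {
      Coproc c[3] = {
          {200, 10, 200, 0, 0},   /* a1/c1: benefit 190 per instance */
          {100, 20, 200, 0, 0},   /* a2/c2: benefit  80 */
          { 50, 25, 200, 0, 0},   /* a3/c3: benefit  25 */
      };
      int q[] = {0, 2, 1, 0, 2};  /* Q = <a1, a3, a2, a1, a3> */
      for (int k = 0; k < 5; k++) {
          on_app_instance(&c[q[k]]);
          printf("a%d: cbenefit=%.0f loaded=%d\n",
                 q[k] + 1, c[q[k]].cbenefit, c[q[k]].loaded);
      }
      /* c1 is loaded at its second instance (380 > 200), as on the slide. */
      return 0;
  }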

Cumulative Benefit Heuristic – Replacing Coprocessors
- Replacement policy: replace a subset CP of resident coprocessors only if cbenefit(i) - loading_time(i) > cbenefit(CP)
  - Intuition: don't replace higher-benefit coprocessors with lower-benefit ones
- Greedy heuristic; maintains a sorted cumulative-benefit list; time complexity is O(n)
- Example (loading time 200): the FPGA holds c1 and c2; the cumulative-benefit table reads c1: 950, c2: 320, c3: 225. Since 225 > 200, we can consider loading c3, but is it good enough to replace c1 or c2? Since 225 - 200 !> 320, DON'T load.

Adjustment for temporal locality
- Real application sequences exhibit temporal locality
- Extend the heuristic to "fade" cumulative benefit values: multiply by f at each step, 0 <= f <= 1
- Define f proportional to reconfiguration time: a small reconfig time means we can reconfigure more freely and pay less attention to the past, i.e., a small f
- Example (f = 0.8, per-instance benefits 190/80/25 as before): Q = <a1, …, a1, a2, a3, a2, a3, …>
  [Table: the benefits fade each step, e.g., c1: 950 → 760 → …, c3: 200 → 160 → …, while the current app's benefit (e.g., +80 for a2) is added]
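The fading extension changes only the update step of the sketch above (reusing its Coproc struct): decay every counter before crediting the current app.

  /* f near 1 remembers the past; a small f (cheap reconfiguration)
     lets the heuristic chase the recent workload, e.g., f = 0.8. */
  static void on_app_instance_fading(Coproc *all, int n, int i, double f) {
      for (int j = 0; j < n; j++)
          all[j].cbenefit *= f;                 /* fade old benefit */
      all[i].cbenefit += all[i].tp - all[i].tc; /* credit current app */
      if (!all[i].loaded && all[i].cbenefit > all[i].load_time)
          all[i].loaded = 1;
  }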

Experiments
- Setup: average FPGA speedup 10x; average coprocessor gate count 48,000; FPGA size set to 60,000 gates
- Charts plot app-sequence total runtime vs. FPGA reconfiguration time, for RAW, random, biased, and periodic sequences
- Our online ACBenefit algorithm gets better results than previous online algorithms

More Dynamic Configuration: Configurable Cache [Zhang/Vahid/Najjar, ISCA 2003, ISVLSI 2003, TECS 2005]
- One physical cache can be dynamically reconfigured into 18 different caches; 40% average savings
- Way concatenation: the four-way set-associative base cache (ways W1-W4) can be configured as a two-way set-associative cache or a direct-mapped cache
- Way shutdown: shut down two ways, via gated-Vdd control on the bitlines
- Line concatenation: four physical lines are filled when the line size is 32 bytes, over a 16-byte bus to off-chip memory (with a counter)
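A conceptual sketch of how way concatenation changes address mapping (not the actual circuit; the geometry here, 4 banks, 64 sets per bank, 32-byte lines, is made up for illustration). Lowering associativity turns spare bank-select choices into extra index bits, keeping total capacity constant:

  #include <stdio.h>
  #include <stdint.h>

  #define LINE  32   /* bytes per line */
  #define BANKS 4    /* physical ways in the base cache */
  #define SETS  64   /* sets per physical bank */

  /* Map an address under a given associativity (4, 2, or 1). */
  static void map_address(uint32_t addr, int assoc) {
      uint32_t line  = addr / LINE;
      int groups     = BANKS / assoc;   /* bank groups act as extra sets */
      uint32_t set   = line % SETS;
      uint32_t group = (line / SETS) % groups;
      printf("assoc=%d: set %u, bank group %u, tag 0x%x\n",
             assoc, set, group, line / (SETS * groups));
  }

  int main(void) {
      uint32_t addr = 0x0001a344;
      map_address(addr, 4);  /* 4-way: one group spanning all banks */
      map_address(addr, 2);  /* 2-way: two groups, one extra index bit */
      map_address(addr, 1);  /* direct-mapped: four groups, two extra bits */
      return 0;
  }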

Highly-Configurable Platforms
- Dynamic tuning applies to other configurable components too: the microprocessor (voltage/frequency, register-file size, branch predictor), L1 caches (total size, associativity, line size), the L2 cache (total size, associativity, line size, encoding schemes), and memory (encoding schemes)
- Dynamically tuning the configurable components to match the currently executing application can significantly reduce power (and even improve performance)

Summary
- Software is no longer just "instructions": it now includes both microprocessor instructions and FPGA circuits; the software elephant has a (new) tail
- Warp processing potentially brings massive FPGA speedups to all of computing (desktop, embedded, scientific, …)
- Patent granted Oct. 2007; licensed by Intel, IBM, and Freescale (via SRC)
- Extensive future work: online CAD algorithms, online architectures and algorithms, …

Warp Processors: CAD-Oriented FPGA
- Solution: develop a custom CAD-oriented FPGA
  - Careful simultaneous design of the FPGA and the CAD
  - FPGA features evaluated for their impact on CAD
  - Architecture features added for software kernels
- Enables development of fast, lean JIT FPGA compilation tools
[Tool flow: binary → partitioning → decompilation (<1 s, 0.5 MB) → RT synthesis (1 s, 1 MB) → JIT FPGA compilation (10 s, 3.6 MB) → hardware binary (bitstream); the binary updater produces the updated binary. Platform: profiler, µP, WCLA, I$, D$, DPM]

Warp Processors: Warp Configurable Logic Architecture (WCLA)
- Need a fast, efficient coprocessor interface; analyzed digital signal processors (DSPs) and existing coprocessors
- Data address generators (DADG) and loop control hardware (LCH): provide fast loop execution and support memory accesses with regular access patterns
- Integrated 32-bit multiplier-accumulator (MAC): frequently found within critical software kernels; fast, single-cycle multipliers are large and require many interconnections
[WCLA: DADG & LCH, Reg0-Reg2, 32-bit MAC, and a configurable logic fabric; platform: profiler, ARM, WCLA, I$, D$, DPM]
(A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

Warp Processors - WCLA: Configurable Logic Fabric
- Hundreds of commercial and research FPGA fabrics exist, most designed to balance circuit density and speed
- We analyzed FPGA features to determine their impact on CAD, and designed our configurable logic fabric (CLF) in conjunction with the JIT FPGA compilation tools
- The CLF is an array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs); each CLB is directly connected to an SM, which, along with the SM design, allows for lean JIT routing
[Fabric: DADG & LCH, 32-bit MAC, and a grid of CLBs and SMs]
(A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

Warp Processors - WCLA: Combinational Logic Block
- Commercial FPGAs favor flexibility/density: large CLBs with various internal routing resources; the WCLA favors simplicity: limited internal routing, reducing on-chip CAD complexity
- Each CLB incorporates two 3-input, 2-output LUTs (inputs a-f, outputs o1-o4), equivalent to four 3-input LUTs with fixed internal routing: good circuit quality with reduced JIT technology-mapping complexity
- Routing resources between adjacent CLBs support carry chains and reduce the number of nets to route
(A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)
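To make the LUT terminology concrete, a tiny sketch of how any 3-input LUT evaluates: the configuration byte is just an 8-entry truth table indexed by the inputs (illustrative only; the WCLA's paired 2-output LUTs add fixed internal routing on top of this):

  #include <stdio.h>
  #include <stdint.h>

  /* Evaluate a 3-input LUT: bit (a,b,c) of the config byte is the output. */
  static int lut3(uint8_t config, int a, int b, int c) {
      int index = (a << 2) | (b << 1) | c;
      return (config >> index) & 1;
  }

  int main(void) {
      uint8_t and3 = 0x80;   /* output is 1 only for a=b=c=1 (index 7) */
      printf("and3(1,1,1)=%d and3(1,0,1)=%d\n",
             lut3(and3, 1, 1, 1), lut3(and3, 1, 0, 1));
      return 0;
  }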

Warp Processors - WCLA: Switch Matrix
- Commercial FPGAs favor flexibility/speed: large routing resources with various routing options; the WCLA favors simplicity, allowing the design of a fast, lean routing algorithm
- All nets are routed using only a single pair of channels throughout the configurable logic fabric; each short channel is associated with a single long channel (wires 0-3 and 0L-3L in the switch matrix)
- Designed for fast, lean JIT FPGA routing
(A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

Warp Processors: JIT FPGA Compilation
[Tool flow: binary → partitioning → decompilation → RT synthesis → JIT FPGA compilation, which comprises logic synthesis → technology mapping/packing → placement → routing → hardware binary (bitstream); the binary updater produces the updated binary. Platform: profiler, µP, WCLA, I$, D$, DPM (CAD)]

Warp Processors: ROCM – Riverside On-Chip Minimizer
- Two-level logic minimization tool
- Uses a combination of approaches from Espresso-II [Brayton et al., 1984][Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979]
  - Uses a single expand phase instead of multiple iterations
  - Eliminates the need to compute the off-set, reducing memory usage
  - On average only 2% larger than the optimal solution
[Flow: on-set and dc-set (no off-set) → expand → reduce → irredundant]
(On-Chip Logic Minimization, DAC'03; A Codesigned On-Chip Logic Minimizer, CODES+ISSS'03)
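A toy illustration of the off-set-free expand step on a 3-variable, single-output function (cubes as mask/value bit pairs; brute-force minterm enumeration stands in for ROCM's real data structures): a literal is dropped whenever the enlarged cube still lies entirely inside the on-set plus dc-set.

  #include <stdio.h>

  #define NVARS 3
  #define SIZE  (1 << NVARS)

  typedef struct { int mask, val; } Cube;  /* mask bit=1: literal present */

  /* Does cube c cover minterm m? */
  static int covers(Cube c, int m) {
      return (m & c.mask) == (c.val & c.mask);
  }

  /* Single expand pass: drop a literal if every minterm the bigger cube
     covers is in the care set (on-set | dc-set) -- no off-set needed. */
  static Cube expand(Cube c, unsigned care) {
      for (int v = 0; v < NVARS; v++) {
          if (!(c.mask & (1 << v))) continue;
          Cube bigger = c;
          bigger.mask &= ~(1 << v);
          int ok = 1;
          for (int m = 0; m < SIZE; m++)
              if (covers(bigger, m) && !((care >> m) & 1)) ok = 0;
          if (ok) c = bigger;
      }
      return c;
  }

  int main(void) {
      /* f(a,b,c) with on-set {a'bc, abc} (= b AND c), empty dc-set. */
      unsigned care = (1u << 3) | (1u << 7);
      Cube c = { 0x7, 0x3 };            /* start from minterm a'bc */
      Cube e = expand(c, care);
      printf("mask=0x%x val=0x%x\n", e.mask, e.val);  /* 0x3 0x3: cube bc */
      return 0;
  }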

Warp Processors - Results: Execution Time and Memory Requirements
[Chart: desktop CAD stages (logic synthesis, technology mapping, placement, routing) take roughly 1 min to 1-2 min to 2-30 min each, with 10-60 MB of memory; on-chip logic synthesis (ROCM) runs in ~1 s using ~1 MB]
(On-Chip Logic Minimization, DAC'03; A Codesigned On-Chip Logic Minimizer, CODES+ISSS'03)

Warp Processors: ROCTM – Riverside On-Chip Technology Mapper
- Technology mapping/packing: decompose the hardware circuit into a DAG whose nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.)
- Hierarchical bottom-up graph clustering algorithm: breadth-first traversal, combining nodes to form single-output LUTs
- Combine LUTs with common inputs to form the final 2-output LUTs; pack pairs of LUTs in which the output of one LUT is an input to the other
(Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

Warp Processors - Results: Execution Time and Memory Requirements
[Chart, continued: on-chip technology mapping/packing added, at <1 s and ~0.5 MB, vs. minutes and tens of MB for the desktop stages]
(Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

Warp Processors: ROCPLACE – Riverside On-Chip Placer
- Dependency-based positional placement algorithm
  - Identify the critical path and place critical nodes in the center of the CLF
  - Use dependencies between the remaining CLBs to determine their placement
  - Attempt to use adjacent-CLB routing whenever possible
(Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

Warp Processors - Results: Execution Time and Memory Requirements
[Chart, continued: on-chip placement added; the on-chip stages stay at about 1 s / <1 s and 1 MB / 0.5 MB, vs. minutes and tens of MB for the desktop stages]
(Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

Warp Processors: Routing
- FPGA routing: find a path within the FPGA connecting the source and sinks of each net in the hardware circuit
- Pathfinder [Ebeling et al., 1995] introduced negotiated congestion: during each routing iteration, route nets using the shortest path, allowing overuse (congestion) of resources; if congestion exists (illegal routing), update the cost of the congested resources, rip up all routes, and reroute all nets
- VPR [Betz et al., 1997]: increased performance over Pathfinder; routability-driven (use the fewest tracks possible) and timing-driven (optimize circuit speed) modes; many of its techniques are used in commercial FPGA CAD tools
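A toy sketch of negotiated congestion with two nets competing for two wires of capacity 1 (real routers search a full routing-resource graph; the cost constants here are arbitrary). Both nets first grab the cheap wire; the resulting congestion raises that wire's history cost, and rerouting under a growing present-congestion factor separates them:

  #include <stdio.h>

  #define NETS  2
  #define WIRES 2

  int main(void) {
      double base[WIRES] = {1.0, 1.5};  /* wire 1 is a longer detour */
      double hist[WIRES] = {0.0, 0.0};  /* accumulated congestion cost */
      double pres_fac = 0.0;            /* grows each iteration */
      int choice[NETS];

      for (int iter = 1; iter <= 10; iter++) {
          int use[WIRES] = {0, 0};
          for (int n = 0; n < NETS; n++) {
              /* Route this net on the currently cheapest wire. */
              int best = 0;
              double bestc = 1e9;
              for (int w = 0; w < WIRES; w++) {
                  double cost = (base[w] + hist[w]) * (1.0 + pres_fac * use[w]);
                  if (cost < bestc) { bestc = cost; best = w; }
              }
              choice[n] = best;
              use[best]++;
          }
          int legal = 1;
          for (int w = 0; w < WIRES; w++)
              if (use[w] > 1) { legal = 0; hist[w] += 1.0; }  /* negotiate */
          printf("iter %d: net0->wire%d net1->wire%d %s\n", iter,
                 choice[0], choice[1], legal ? "legal" : "congested");
          if (legal) break;             /* otherwise rip up and reroute */
          pres_fac += 0.5;
      }
      return 0;
  }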

Warp Processors: ROCR – Riverside On-Chip Router
- Resource graph: nodes correspond to SMs, edges correspond to channels between SMs, and an edge's capacity equals the number of wires within the channel
- Requires much less memory than VPR, because the resource graph is smaller
- Produces circuits with a critical path 10% shorter than VPR (routability-driven)
- Routing loop: route; if the routing is illegal, rip up and reroute; otherwise done
(Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC'04)