Warp Processors
Frank Vahid (Task Leader)
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 – June 2008
Ph.D. students: Greg Stitt (Ph.D. June 2007, now Asst. Prof. at Univ. of Florida, Gainesville); Ann Gordon-Ross (Ph.D. June 2007, now Asst. Prof. at Univ. of Florida, Gainesville); David Sheldon (Ph.D. expected 2009); Scott Sirowy (Ph.D. expected 2010)
Industrial liaisons: Brian W. Einloth, Motorola; Dave Clark, Darshan Patra, Intel; Jeff Welser, Scott Lekuch, IBM
Frank Vahid, UCR

Task Description
- Warp processing background
  - Idea: invisibly move binary regions from microprocessor to FPGA
  - 10x speedups or more, energy gains too
- Task: mature warp technology
  - Years 1/2:
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3:
    - Reduce memory bottleneck by using a smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)

Background
- Motivated by commercial dynamic binary translation of the early 2000s, e.g., Transmeta Crusoe "code morphing": an x86 binary is translated to a VLIW binary and executed on a VLIW µP for performance.
- Warp processing (Lysecky/Stitt/Vahid, 2003-2007) performs the analogous "translation" dynamically: a µP binary is translated to circuits on an FPGA.

Warp Processing Background 1
Initially, the software binary is loaded into instruction memory. (The chip contains a µP with I-mem, D$, and profiler, plus an FPGA and on-chip CAD.)

Software binary:
Mov reg3, 0
Mov reg4, 0
loop: Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg6
Add reg3, 1
Beq reg3, 10, -5
Ret reg4

Warp Processing Background 2
The microprocessor executes the instructions in the software binary (the same binary as above).

Warp Processing Background 3
The profiler monitors executed instructions (the beq/add stream) and detects critical regions in the binary – here, the critical loop is detected.

Warp Processing Background 4
The on-chip CAD reads in the critical region.

Warp Processing Background 5
The on-chip CAD (Dynamic Partitioning Module, DPM) decompiles the critical region into a control/data flow graph (CDFG):

reg3 := 0
reg4 := 0
loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]
      reg3 := reg3 + 1
      if (reg3 < 10) goto loop
ret reg4

Decompilation is surprisingly effective at recovering high-level program structures – loops, arrays, subroutines, etc. – needed to synthesize good circuits (Stitt et al., ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07).
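The loop recovered above can be sketched in C. This is a hypothetical rendering, not the tool's actual output: reg2 is taken to hold the base address of a 16-bit array (hence the shift-left-by-1 address computation), reg3 the loop counter, and reg4 the running sum; the function name and types are illustrative assumptions.

```c
#include <stdint.h>

/* Hypothetical high-level form of the decompiled loop. */
int16_t sum16(const int16_t a[10]) {
    int16_t sum = 0;                 /* reg4 := 0                    */
    for (int i = 0; i < 10; i++)     /* reg3 := 0; reg3 < 10         */
        sum += a[i];                 /* reg4 += mem[reg2 + (i << 1)] */
    return sum;                      /* ret reg4                     */
}
```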

Warp Processing Background 6
The on-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit – in the slide's figure, the loop body is mapped onto a tree of adders.

Warp Processing Background 7
The on-chip CAD maps the circuit onto the FPGA (CLBs connected by switch matrices). Lean place & route makes the FPGA CAD about 10x faster (Lysecky et al., DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06). On multi-core chips, one powerful core can be used for the CAD.

Warp Processing Background 8
The on-chip CAD replaces instructions in the binary with instructions that interact with the FPGA, causing performance and energy to "warp" by an order of magnitude or more – >10x speedups for some apps ("warped" vs. software-only).

Warp Scenarios
Warping takes time – when is it useful?
- Long-running applications (scientific computing, etc.): the on-chip CAD runs early in execution, so there is a speedup even within a single execution.
- Recurring applications (save FPGA configurations): common in embedded systems; the first execution runs the on-chip CAD – which might be viewed as a (long) boot phase – and later executions are sped up from the start.
Possible platforms: Xilinx Virtex-II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ...

Thread Warping - Overview
Multi-core platforms run multithreaded apps:

for (i = 0; i < 10; i++) {
    thread_create( f, i );
}

The compiler produces a binary, and the OS schedules threads onto available µPs; remaining threads are added to a queue. Thread warping: the OS invokes the on-chip CAD on one core to create accelerators for the waiting threads (kept in an accelerator library), then schedules threads onto accelerators (possibly dozens) in addition to µPs. Very large speedups are possible – parallelism at the bit, arithmetic, and now thread level too.
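The thread-creation pattern above can be sketched with the POSIX pthread API the framework builds on. This is a minimal sketch: f(), its squaring "work", and run_threads() are hypothetical stand-ins for whatever thread function the warp tools would profile and accelerate.

```c
#include <pthread.h>

#define NTHREADS 10

static int results[NTHREADS];

/* Hypothetical thread function; in warp processing this is the code
 * the on-chip CAD would turn into an FPGA accelerator if it gets hot. */
static void *f(void *arg) {
    int i = (int)(long)arg;
    results[i] = i * i;              /* placeholder per-thread work */
    return NULL;
}

/* The slide's loop: one thread per iteration; the OS schedules them
 * onto available cores (or, once warped, onto accelerators). */
static void run_threads(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, f, (void *)(long)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);  /* wait for the whole group */
}
```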

Thread Warping Tools
- Invoked by the OS; use the pthread library (POSIX), with mutexes/semaphores for synchronization.
- Defined the methods/algorithms of a thread warping framework:
  - Queue analysis: examines the thread queue and thread counts to select thread functions.
  - Accelerator synthesis (when accelerators are not already synthesized and not in the accelerator library): decompilation → hw/sw partitioning → high-level synthesis with memory access synchronization → netlist → place & route → bitfile; a sw binary updater produces the updated binary and thread group table.
  - Accelerator instantiation: builds the schedulable resource list and loads accelerators into the FPGA.

Memory Access Synchronization (MAS)
- Must deal with the widely known memory bottleneck problem: FPGAs are great, but often can't get data to them fast enough.

for (i = 0; i < 10; i++) {
    thread_create( thread_function, a, i );
}

void f( int a[], int val ) {
    int result;
    for (i = 0; i < 10; i++) {
        result += a[i] * val;
    }
    . . .
}

- DMA traffic from RAM for dozens of thread accelerators can create a bottleneck.
- Threaded programs exhibit a unique feature: multiple threads often access the same data – here, the same array a[].
- Solution: fetch the data once and broadcast it to multiple threads (MAS).

Memory Access Synchronization (MAS)
1) Identify thread groups – loops that create threads.
2) Identify constant memory addresses in the thread function, via def-use analysis of the parameters to the thread function.
3) Synthesis creates a "combined" memory access; execution is synchronized by the OS.

Thread group:
for (i = 0; i < 100; i++) {
    thread_create( f, a, i );
}

void f( int a[], int val ) {
    int result;
    for (i = 0; i < 10; i++) {
        result += a[i] * val;
    }
    . . .
}

Def-use analysis shows a is constant for all threads, so the addresses of a[0-9] are constant for the thread group. The data is fetched once by DMA and delivered to the entire group of f() accelerators (enabled by the OS). Before MAS: 1000 memory accesses; after MAS: 100 memory accesses.
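The fetch-once, broadcast-to-all idea can be illustrated with a toy counter model. All names here are illustrative assumptions, and the counter tallies individual RAM word reads, so the single combined fetch registers 10 reads – a coarser accounting than the slide's "after MAS: 100" figure, but the same orders-of-magnitude point.

```c
#define NTHREADS 100   /* threads in the group         */
#define N 10           /* words of a[] each thread reads */

static int ram[N] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
static int ram_reads;  /* counts word fetches from RAM */

static int ram_read(int addr) { ram_reads++; return ram[addr]; }

/* Without MAS: every thread fetches a[0..9] from RAM itself. */
static int reads_without_mas(void) {
    ram_reads = 0;
    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < N; i++)
            (void)ram_read(i);
    return ram_reads;              /* 100 threads * 10 words = 1000 */
}

/* With MAS: one combined fetch fills a buffer that is broadcast to
 * the whole thread group; threads then read the buffer, not RAM. */
static int reads_with_mas(void) {
    int buffer[N];
    int sum = 0;
    ram_reads = 0;
    for (int i = 0; i < N; i++)
        buffer[i] = ram_read(i);   /* single combined fetch */
    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < N; i++)
            sum += buffer[i];      /* served from the buffer */
    (void)sum;
    return ram_reads;              /* 10 word fetches total */
}
```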

Memory Access Synchronization (MAS)
- Also detects overlapping memory regions – "windows".
- Synthesis creates an extended "smart buffer" [Guo/Najjar, FPGA'04] that caches reused data and delivers windows to threads.

for (i = 0; i < 100; i++) {
    thread_create( thread_function, a, i );
}

void f( int a[], int i ) {
    int result;
    result += a[i]+a[i+1]+a[i+2]+a[i+3];
    . . .
}

Each thread accesses different addresses – but the addresses overlap. The data is DMA-streamed to the smart buffer, which delivers a window (a[0-3], a[1-4], ..., a[6-9], ...) to each thread's f() accelerator. Without the smart buffer: 400 memory accesses; with the smart buffer: 104 memory accesses.
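The window arithmetic can be checked with a short sketch, assuming exactly 100 threads where thread t reads the four-element window a[t..t+3] (names hypothetical). Under that exact model the buffer fetches 103 distinct elements, a[0]..a[102]; the slide streams a[0-103] and counts 104, presumably including one extra prefetched element, but the roughly 4x reduction is the same.

```c
#include <stdbool.h>

#define NTHREADS 100   /* thread t reads a[t..t+3] */
#define WINDOW 4

/* Without a smart buffer, every thread fetches its whole window. */
static int window_reads_without_buffer(void) {
    return NTHREADS * WINDOW;                 /* 100 * 4 = 400 */
}

/* With a smart buffer, each distinct element is fetched from RAM
 * once, then reused for every overlapping window. */
static int window_reads_with_buffer(void) {
    bool fetched[NTHREADS + WINDOW - 1] = { false };
    int ram_reads = 0;
    for (int t = 0; t < NTHREADS; t++)
        for (int j = t; j < t + WINDOW; j++)
            if (!fetched[j]) { fetched[j] = true; ram_reads++; }
    return ram_reads;   /* 103 distinct elements: a[0]..a[102] */
}
```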

Framework
- Also developed initial algorithms for:
  - Queue analysis
  - Accelerator instantiation
  - OS scheduling of threads to accelerators and cores
(These plug into the framework flow shown earlier: thread queue and thread counts feed queue analysis; accelerator synthesis produces netlists, place & route produces bitfiles; instantiation updates the schedulable resource list, thread group table, and binary.)

Thread Warping Example

int main( ) {
    . . .
    for (i=0; i < 50; i++) {
        thread_create( filter, a, b, i );
    }
    . . .
}

void filter( int a[53], int b[50], int i ) {
    b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] );
}

filter() threads execute on the available cores; remaining threads are added to the queue. The OS invokes the CAD (due to queue size, or periodically), and queue analysis identifies filter() for synthesis.
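A sequential C rendering of this example can be run directly. Assumptions: avg() is taken to be a four-point integer average (the slides do not define it), and the 50 thread_create() calls are replaced by a plain loop; in the slides each call becomes a thread, and eight accelerators compute eight of them per cycle.

```c
/* Assumed definition of the slide's avg(): four-point integer mean. */
static int avg(int w, int x, int y, int z) {
    return (w + x + y + z) / 4;
}

static void filter(const int a[53], int b[50], int i) {
    b[i] = avg(a[i], a[i + 1], a[i + 2], a[i + 3]);
}

/* Sequential stand-in for the 50 thread_create(filter, a, b, i) calls. */
static void run_filter(const int a[53], int b[50]) {
    for (int i = 0; i < 50; i++)
        filter(a, b, i);
}
```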

Example (cont.)
MAS detects the thread group – the loop creating the 50 filter() threads – and the overlapping windows in filter(). The CAD reads the filter() binary, decompiles it to a CDFG, and applies memory access synchronization.

Example (cont.)
High-level synthesis creates a pipelined accelerator for the filter() CDFG – in the figure, adders feeding a shift-right-by-2, i.e., the four-way average – fed by a smart buffer. Eight filter() accelerators are loaded into the FPGA for the thread group, and the result is stored in the accelerator library for future use.

Example (cont.)
The OS schedules threads to the accelerators. The smart buffer streams a[0-52] from RAM; after the buffer fills, it delivers a window (a[2-5], ..., a[9-12]) to all eight accelerators at once.

Example (cont.)
Each cycle, the smart buffer delivers eight more windows (a[10-13], ..., a[17-20]) – the pipeline remains full.

Example (cont.)
The accelerators produce 8 outputs (b[2-9]) after the pipeline latency passes.

Example (cont.)
An additional 8 outputs (b[10-17]) appear each cycle. Thread warping: 8 pixel outputs per cycle; software: 1 pixel output every ~9 cycles – a 72x cycle-count improvement.
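The 72x figure follows directly from the two throughputs above: software emits 1 pixel every ~9 cycles while the eight pipelined accelerators together emit 8 pixels per cycle, so the improvement is 8 × 9 = 72.

```c
/* Cycle-count improvement implied by the slide's two throughputs. */
static int cycle_count_improvement(void) {
    int sw_cycles_per_pixel = 9;   /* software: ~9 cycles per output */
    int hw_pixels_per_cycle = 8;   /* warped: 8 outputs per cycle    */
    return hw_pixels_per_cycle * sw_cycles_per_pixel;
}
```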

Experiments to Determine Thread Warping Performance: Simulator Setup
- Parallel Execution Graph (PEG) represents thread-level parallelism: nodes are sequential execution blocks (SEBs), edges are pthread calls.
Simulation summary:
1) Generate the PEG using pthread wrappers.
2) Determine SEB performances – sw: SimpleScalar; hw: synthesis/simulation (Xilinx).
3) Event-driven simulation – use the defined algorithms to change the architecture dynamically.
4) Complete when all SEBs are simulated; observe total cycles.
Optimistic for sw execution (no memory contention); pessimistic for warped execution (accelerators/microprocessors execute exclusively).

Experiments
- Benchmarks: image processing, DSP, scientific computing – highly parallel examples to illustrate thread warping potential; we created multithreaded versions.
- Base architecture: 4 ARM cores; focus on recurring applications (embedded).
- Compared: a multi-core system (4 ARM11s at 400 MHz) vs. thread warping (4 ARM11s at 400 MHz plus an FPGA with on-chip CAD, the FPGA running at whatever frequency synthesis determines).

Speedup from Thread Warping
- Average 130x speedup.
- But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs (the FPGA's area ≈ 36 ARM11s): thread warping is 11x faster than even the 64-core system.
- The simulation is pessimistic; actual results are likely better.

Why Dynamic?
- Static is good, but hiding the FPGA opens the technique to all sw platforms – standard languages/tools/binaries.
  - Static compiling to FPGAs: specialized language → specialized compiler → binary + netlist.
  - Dynamic compiling to FPGAs: any language → any compiler → binary; the µP plus on-chip CAD handles the FPGA.
- Can adapt to changing workloads: smaller & more accelerators, fewer & larger accelerators, ...
- Can add an FPGA without changing binaries – like expanding memory, or adding processors to a multiprocessor.
- Custom interconnections, tuned processors, ...

Warp Processing Enables Expandable Logic
- Concept: like expandable RAM, the system detects the amount of FPGA during start, and the warp tools invisibly adapt the application to use less or more hardware – performance improves invisibly.
- (System: µPs, profiler, cache, DMA, warp tools; RAM plus expandable RAM; FPGA as expandable logic.)
- Planning a MICRO submission.

Expandable Logic
- Used our simulation framework.
- Large speedups: 14x to 400x (on scientific apps).
- Different apps require different amounts of FPGA; expandable logic allows customization of a single platform – the user selects the required amount of FPGA, with no need to recompile/synthesize.

Dynamic Enables Custom Communication
- NoC – a network on a chip – provides communication between multiple cores.
- Problem: the best topology is application dependent (App1 performs best on a bus, App2 on a mesh).

Dynamic Enables Custom Communication
- NoC – a network on a chip – provides communication between multiple cores.
- Problem: the best topology is application dependent (App1 performs best on a bus, App2 on a mesh).
- With the communication implemented in FPGA, warp processing can dynamically choose the topology.

Industrial Interactions, Year 2/3
- Freescale
  - Research visit: F. Vahid to Freescale, Chicago, Spring '06 – talk and full-day research discussion with several engineers.
  - Internships: Scott Sirowy, summer 2006 in Austin (also 2005).
- Intel
  - Chip prototype: participated in Intel's Research Shuttle to build a prototype warp FPGA fabric – continued bi-weekly phone meetings with Intel engineers, a visit to Intel by PI Vahid and R. Lysecky (now prof. at UofA), and a several-day visit to Intel by Lysecky to simulate the design, ready for tapeout. June '06: Intel cancelled the entire shuttle program as part of larger cutbacks.
  - Research discussions via email with liaison Darshan Patra (Oregon).
- IBM
  - Internships: Ryan Mannion, summer and fall 2006 in Yorktown Heights; Caleb Leak, summer/fall 2007.
  - Platform: IBM's Scott Lekuch and Kai Schleupen made a 2-day visit to UCR to set up a Cell development platform having FPGAs.
  - Technical discussion: numerous ongoing email and phone interactions with S. Lekuch regarding our research on the Cell/FPGA platform.
- Several interactions with Xilinx also.

Patents
- "Warp Processor" patent
  - Filed with the USPTO summer 2004
  - Granted winter 2007
  - SRC has a non-exclusive royalty-free license

Year 1/2 Publications
- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale)
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue – Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.

Year 2/3 Publications
- Warp Processing: Dynamic Translation of Binaries to FPGA Circuits. F. Vahid, G. Stitt, R. Lysecky. IEEE Computer, 2008 (to appear).
- C is for Circuits: Capturing FPGA Circuits as Sequential Code for Portability. S. Sirowy, G. Stitt, F. Vahid. Int. Symp. on FPGAs, 2008.
- Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators. G. Stitt, F. Vahid. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2007, pp. 93-98.
- A Self-Tuning Configurable Cache. A. Gordon-Ross, F. Vahid. Design Automation Conference (DAC), 2007.
- Binary Synthesis. G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), Aug. 2007.
- Integrated Coupling and Clock Frequency Assignment. S. Sirowy, F. Vahid. International Embedded Systems Symposium (IESS), 2007.
- Soft-Core Processor Customization Using the Design of Experiments Paradigm. D. Sheldon, F. Vahid, S. Lonardi. Design Automation and Test in Europe (DATE), 2007.
- A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar. Design Automation and Test in Europe (DATE), 2007.
- Two Level Microprocessor-Accelerator Partitioning. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid. Design Automation and Test in Europe (DATE), 2007.
- Clock-Frequency Partitioning for Multiple Clock Domain Systems-on-a-Chip. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid.
- Conjoining Soft-Core FPGA Processors. D. Sheldon, R. Kumar, F. Vahid, D. M. Tullsen, R. Lysecky. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- Application-Specific Customization of Parameterized FPGA Soft-Core Processors. D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, D. M. Tullsen. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- Warp Processors. R. Lysecky, G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp. 659-681.
- Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE/ACM Design Automation Conference (DAC), July 2006.
- Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, W. Najjar. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.