A Code Refinement Methodology for Performance-Improved Synthesis from C
Greg Stitt, Frank Vahid*, Walid Najjar
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This research is supported in part by the National Science Foundation and the Semiconductor Research Corporation

Introduction
- Previous work: In-depth hw/sw partitioning study of H.264 decoder
  - Collaboration with Freescale
[Figure: The H.264 functions (motionComp(), filterLuma(), filterChroma(), deblocking(), ...) are compiled for the microprocessor (uP). Selected critical regions (deblocking(), motionComp()) are synthesized to the FPGA.]

Introduction (continued)
- Large gap between ideal and actual speedup
- Obtained 2.5x speedup

Introduction (continued)
- Noticed coding constructs/practices limited hw speed
  - Identified problematic coding constructs
- Developed simple coding guidelines
  - Dozens of lines of code
  - Minutes per guideline
- Refined critical regions using guidelines
[Figure: Applying the guidelines turns motionComp() and deblocking() into motionComp'() and deblocking'(); filterLuma() and filterChroma() are unchanged. The refined code is then passed to hw/sw partitioning.]

Introduction (continued)
- Simple guidelines increased speedup to 6.5x
- Can simple coding guidelines show similar improvements on other applications?

Coding Guidelines
- Analyzed dozens of benchmarks
  - Identified common problems related to synthesis
- Developed 10 guidelines to fix problems
  - Although some are well known, analysis shows they are rarely applied
  - Automation unlikely or impossible in many cases
- The 10 guidelines:
  - Conversion to Constants (CC)
  - Conversion to Explicit Data Flow (CEDF)
  - Conversion to Fixed Point (CF)
  - Conversion to Explicit Memory Accesses (CEMA)
  - Constant Input Enumeration (CIE)
  - Loop Rerolling (LR)
  - Conversion to Explicit Control Flow (CECF)
  - Function Specialization (FS)
  - Algorithmic Specialization (AS)
  - Pass-By-Value Return (PVR)

Fast Refinement
- Apply guidelines to only the critical regions
  - Only several performance-critical regions
  - Several dozen lines of code provide most performance improvement
- Refining takes minutes/hours
[Figure: Profiling results for a sample application with functions Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), ...]

Conversion to Constants (CC)
- Problem: Arrays of constants commonly not specified as constants
  - Initialized at runtime
- Guideline: Use constant wrapper function
  - Specifies array constant for all future functions
- Automation
  - Difficult, requires global def-use/alias analysis
- Can also enable constant folding

Original code:
    int coef[100];
    void initCoef() {
      // initialize coef
    }
    void fir() {
      // fir filter using coef
    }
    void f() {
      initCoef();
      fir();
      // other code
    }

Refined code:
    int coef[100];
    void initCoef() {
      // initialize coef
    }
    void fir(const int array[100]) {
      // fir filter using const array
    }
    void constWrapper(const int array[100]) {
      prefetchArray(array);
      // misc code...
      fir(array);
    }
    void f() {
      initCoef();
      constWrapper(coef);
      // other code
    }

Note: Inside the wrapper the array can't change, so prefetching won't violate dependencies.
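The constant wrapper guideline can be sketched as compilable C. This is only an illustration under stated assumptions: the 4-tap size, the coefficient values, and the names `init_coef`, `fir_const_wrapper`, and `run_filter` are hypothetical, and the slide's `prefetchArray()` call is represented by a comment because it is a synthesis-tool hook rather than standard C.

```c
#include <stddef.h>

#define NUM_TAPS 4  /* assumption: the slide uses 100 entries */

/* Coefficients are filled in at run time, so they cannot be declared const. */
static int coef[NUM_TAPS];

void init_coef(void) {
    for (size_t i = 0; i < NUM_TAPS; i++) {
        coef[i] = (int)(i + 1);  /* illustrative values: 1, 2, 3, 4 */
    }
}

/* The const qualifier promises that fir() never writes through taps. */
int fir(const int taps[NUM_TAPS], const int samples[NUM_TAPS]) {
    int acc = 0;
    for (size_t i = 0; i < NUM_TAPS; i++) {
        acc += taps[i] * samples[i];
    }
    return acc;
}

/* Constant wrapper: every function reached from here sees the array as
   read-only. A synthesis tool could prefetch the taps at this point. */
int fir_const_wrapper(const int taps[NUM_TAPS], const int samples[NUM_TAPS]) {
    return fir(taps, samples);
}

/* Initialization stays outside the wrapper; filtering happens inside it. */
int run_filter(const int samples[NUM_TAPS]) {
    init_coef();
    return fir_const_wrapper(coef, samples);
}
```

The point of the wrapper is that the writable global is only touched before the call, so the read-only view holds for the whole call tree below it.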

Conversion to Explicit Data Flow (CEDF)
- Problem: Global variables make determination of parallelism difficult
  - Requires global def-use/alias analysis
- Guideline: Replace globals with extra parameters
  - Makes data flow explicit
  - Simpler analysis may expose parallelism
- Automation
  - Been proposed [Lee 01]
  - But, difficult because of aliases

Original code:
    int array[100];
    void a() {
      for (i=0; i < 100; i++)
        array[i] = ...
    }
    void b() {
      for (i=0; i < 100; i++)
        array[i] = array[i]+f(i);
    }
    int c() {
      for (i=0; i < 100; i++)
        temp += array[i];
    }
    void d() {
      for (...) {
        a(); b(); c();
      }
    }

Note: a(), b(), c() must execute sequentially because of global array dependencies.

Refined code:
    void a(int array[100]) {
      for (i=0; i < 100; i++)
        array[i] = ...
    }
    void b(int array1[100], int array2[100]) {
      for (i=0; i < 100; i++)
        array2[i] = array1[i]+f(i);
    }
    int c(int array[100]) {
      for (i=0; i < 100; i++)
        temp += array[i];
    }
    void d() {
      int array1[100], array2[100];
      for (...) {
        a(array1);
        b(array1, array2);
        c(array2);
      }
    }

Note: a() and c() can execute in parallel after the 1st iteration.
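A compilable sketch of the refined CEDF structure follows. The array length of 8, the body of `f()`, the values written by `a()`, and the return value of `d()` are assumptions made only so the example can run; the slide elides them.

```c
#define N 8  /* assumption: the slide uses 100 elements */

/* Hypothetical stand-in for the slide's f(i). */
static int f(int i) { return 2 * i; }

/* Producer: writes only its output parameter. */
void a(int out[N]) {
    for (int i = 0; i < N; i++) {
        out[i] = i;  /* illustrative values */
    }
}

/* Transformer: reads one array, writes a different one. */
void b(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++) {
        out[i] = in[i] + f(i);
    }
}

/* Consumer: only reads its input. */
int c(const int in[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++) {
        sum += in[i];
    }
    return sum;
}

/* Because a() touches only array1 and c() touches only array2, the
   signatures alone show that they do not conflict with each other. */
int d(int iterations) {
    int array1[N], array2[N];
    int total = 0;
    for (int it = 0; it < iterations; it++) {
        a(array1);
        b(array1, array2);
        total += c(array2);
    }
    return total;
}
```

With no globals left, the dependence between the three stages is visible from the parameter lists, which is what lets a simpler analysis overlap a() of one iteration with c() of the previous one.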

Constant Input Enumeration (CIE)
- Problem: Function parameters may limit parallelism
- Guideline: Create enum for possible values
  - Synthesis can create specialized functions
- Automation
  - In some cases, def-use analysis may identify all inputs
  - In general, difficult due to aliases

Original code:
    void f(int a, int b) {
      ...
      for (i=0; i < a; i++) {
        for (j=0; j < b; j++) {
          c[i][j]=i+j;
        }
      }
    }

Note: Bounds not known, hard to unroll. Hardware computes one iteration at a time.

Refined code:
    enum PRM { VAL1=2, VAL2=4 };
    void f(enum PRM a, enum PRM b) {
      ...
      for (i=0; i < a; i++) {
        for (j=0; j < b; j++) {
          c[i][j]=i+j;
        }
      }
    }

Specialized versions: f(2,2), f(2,4), f(4,2), f(4,4)
Note: Iterations can be parallelized in each version.
[Figure: Datapath for the original code, with a single adder computing c[i][j] one iteration at a time, next to a specialized version with parallel adders producing c[0][0], c[0][1], c[0][2], ...]
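To make the specialization idea concrete, the sketch below pairs the enum-parameter version with one hand-written specialized version of the kind a synthesis tool could derive. The names `fill` and `fill_2x2`, the explicit array parameter, and `MAX_DIM` are hypothetical additions for the example.

```c
enum PRM { VAL1 = 2, VAL2 = 4 };

#define MAX_DIM 4  /* largest value in enum PRM */

/* General version: the enum tells a tool that each bound is 2 or 4. */
void fill(enum PRM a, enum PRM b, int c[MAX_DIM][MAX_DIM]) {
    for (int i = 0; i < (int)a; i++) {
        for (int j = 0; j < (int)b; j++) {
            c[i][j] = i + j;
        }
    }
}

/* Specialized version for fill(VAL1, VAL1): with both bounds known the
   loops unroll completely, and the four independent assignments could
   execute in parallel in hardware. */
void fill_2x2(int c[MAX_DIM][MAX_DIM]) {
    c[0][0] = 0;
    c[0][1] = 1;
    c[1][0] = 1;
    c[1][1] = 2;
}
```

With two enum values per parameter there are only four combinations, so generating one unrolled version per combination stays cheap.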

Conversion to Explicit Control Flow (CECF)
- Problem: Function pointers may prevent static control flow analysis
- Guideline: Replace function pointer with if-else, static calls
  - Makes possible targets explicit
- Automation
  - In general, is impossible
  - Equivalent to halting problem

Original code:
    void f( int (*fp) (int) ) {
      ...
      for (i=0; i < 10; i++) {
        a[i] = fp(i);
      }
    }

Note: Synthesis unlikely to determine possible targets of function pointer, so the synthesized hardware is unknown.

Refined code:
    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
      ...
      for (i=0; i < 10; i++) {
        if (fp == FUNC1)
          a[i] = f1(i);
        else if (fp == FUNC2)
          a[i] = f2(i);
        else
          a[i] = f3(i);
      }
    }

[Figure: Synthesized hardware for the refined code: f1(i), f2(i), and f3(i) feed a 3x1 multiplexer selected by fp, whose output is a[i].]
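A compilable sketch of the refined CECF form is shown here. The bodies of `f1`, `f2`, and `f3`, the function name `apply`, and the explicit array parameter are hypothetical; the slide only shows the dispatch structure.

```c
enum Target { FUNC1, FUNC2, FUNC3 };

#define LEN 10

/* Hypothetical targets; the slide does not define f1, f2, f3. */
static int f1(int i) { return i + 1; }
static int f2(int i) { return i * 2; }
static int f3(int i) { return i * i; }

/* The enum replaces the function pointer, so every possible callee is
   named by a static call that a tool can see and instantiate. */
void apply(enum Target target, int a[LEN]) {
    for (int i = 0; i < LEN; i++) {
        if (target == FUNC1) {
            a[i] = f1(i);
        } else if (target == FUNC2) {
            a[i] = f2(i);
        } else {
            a[i] = f3(i);
        }
    }
}
```

The if-else chain corresponds to the multiplexer on the slide: all three results can be computed side by side and the enum value selects one.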

Algorithmic Specialization (AS)
- Algorithms targeting sw may not be fast in hw
  - Sequential vs. parallel
  - C code generally uses sw algorithms
- Guideline: Specialize critical functions with hw algorithms
- Automation
  - Requires higher level specification
  - Intrinsics

Original code (binary search):
    int search(int a[], int k, int l, int r) {
      while (l <= r) {
        mid = (l+r)/2;
        if (k > a[mid])
          l = mid+1;
        else if (k < a[mid])
          r = mid-1;
        else
          return mid;
      }
      return -1;
    }

Refined code (linear search):
    int search(int a[], int k, const int s) {
      for (i=0; i < s; i++) {
        if (a[i] == k)
          return i;
      }
      return -1;
    }

Note: The linear search can be parallelized in hardware.
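The two search variants can be placed side by side as compilable C. The function names `search_binary` and `search_linear` are hypothetical, chosen so both can live in one file; the example assumes a sorted array with distinct elements, in which case both return the same index.

```c
/* Software-oriented version: binary search over a[l..r], sorted ascending.
   Each step depends on the previous comparison, so it is inherently
   sequential. */
int search_binary(const int a[], int k, int l, int r) {
    while (l <= r) {
        int mid = l + (r - l) / 2;  /* same midpoint, avoids overflow of l+r */
        if (k > a[mid]) {
            l = mid + 1;
        } else if (k < a[mid]) {
            r = mid - 1;
        } else {
            return mid;
        }
    }
    return -1;
}

/* Hardware-oriented version: linear search over the first s elements.
   The comparisons are independent of one another, so hardware can
   evaluate many of them at once. */
int search_linear(const int a[], int k, int s) {
    for (int i = 0; i < s; i++) {
        if (a[i] == k) {
            return i;
        }
    }
    return -1;
}
```

In software the linear version does more work per lookup, which is the kind of overhead the slides weigh against the hardware gain.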

Pass-By-Value Return (PVR)
- Problem: Array parameters cannot be prefetched due to potential aliases
  - Designer may know aliases don't exist
- Guideline: Use pass-by-value-return
- Automation
  - Requires global alias analysis

Original code:
    void f(int *a, int *b, int array[16]) {
      ... // unrelated computation
      g(array);
      ... // unrelated computation
    }
    int g(int array[16]) {
      // computation done on array
    }

Note: Can't prefetch array for g(), may be aliased.

Refined code:
    void f(int *a, int *b, int array[16]) {
      int localArray[16];
      memcpy(localArray, array, 16*sizeof(int));
      ... // unrelated computation
      g(localArray);
      ... // unrelated computation
      memcpy(array, localArray, 16*sizeof(int));
    }
    int g(int array[16]) {
      // computation done on array
    }

Note: Local array can't be aliased, can prefetch.
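A compilable sketch of pass-by-value-return follows. The body of `g()`, the increments standing in for the "unrelated computation", and the return value of `f()` are assumptions added so the behavior can be checked; the slide leaves them unspecified.

```c
#include <string.h>

#define LEN 16

/* Hypothetical computation: doubles each element and returns the sum. */
int g(int array[LEN]) {
    int sum = 0;
    for (int i = 0; i < LEN; i++) {
        array[i] *= 2;
        sum += array[i];
    }
    return sum;
}

/* Copy in, compute on a local array, copy back. The local array cannot be
   aliased by a or b, so a tool is free to prefetch it for g(). */
int f(int *a, int *b, int array[LEN]) {
    int local_array[LEN];
    memcpy(local_array, array, sizeof local_array);

    *a += 1;  /* stand-in for unrelated computation */
    int result = g(local_array);
    *b += 1;  /* stand-in for unrelated computation */

    memcpy(array, local_array, sizeof local_array);
    return result;
}
```

The rewrite matches the original only when `a` and `b` really do not point into `array`, which is the designer knowledge the guideline relies on. The two `memcpy` calls are the software overhead the methodology asks the designer to weigh.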

Why Synthesis From C?
- Why not use HDL?
  - HDL may yield better results
- C is mainstream language
  - Acceptable performance in many cases
  - Learning HDL is large overhead
- Approaches are orthogonal
  - This work focuses on improving mainstream
  - Guidelines common for HDL
    - Can also be applied to algorithmic HDL

Software Overhead
- Refined regions may not be partitioned to hardware
  - Partitioner may select non-refined regions
  - OS may select software or hardware implementation
    - Based on state of FPGA
- Coding guidelines have potential software overhead
- Problem: Refined code mapped to software
[Figure: After hw/sw partitioning of motionComp'(), filterLuma(), filterChroma(), deblocking'(), ..., refined and non-refined functions are distributed between the uP and the FPGA, so a refined function can end up running in software.]

Refinement Methodology
- Considerations
  - Reduce software overhead
  - Reduce refinement time
- Methodology
  - Profile
  - Iterative-improvement
    - Determine critical region
    - Apply all except PVR/AS
      - Minimal overhead
    - Apply PVR if overhead acceptable
    - Apply AS if known algorithm and overhead acceptable
  - Repeat until performance acceptable
[Flowchart: Profile -> Determine Critical Region -> Apply CC, CF, CEMA, CIE, CEDF, CECF, FS, LR -> "Is overhead of copying array acceptable?" (yes: Apply PVR) -> "Does suitable hw algorithm exist and have acceptable sw performance?" (yes: Apply AS) -> back to Profile]

Experimental Setup
- Benchmark suite
  - MediaBench, Powerstone
- Manually applied guidelines
  - 1-2 hours
  - 23 additional lines/benchmark, on average
- Target architecture
  - Xilinx Virtex II FPGA with ARM9 uP
- Hardware/software partitioning
  - Selects critical regions for hardware
- Synthesis
  - High-level synthesis tool
    - ~30,000 lines of C code
    - Outputs register-transfer level (RTL) VHDL
  - RTL synthesis using Xilinx ISE
- Compilation
  - Gcc with -O1 optimizations
[Figure: Tool flow: Benchmarks -> Manual Refinement -> Refined Code -> Hw/Sw Partitioning. Sw path: Compilation -> ARM9. Hw path: Synthesis -> Bitfile -> Virtex II.]

Speedups from Guidelines
[Figure: Speedup chart for one example, annotated as follows.]
- No guidelines: speedup 2x
- Conversion to Constants: speedup 3.6x, total time 5 minutes
- Explicit Dataflow + Algorithmic Specialization: speedup 16.4x, total time 15 minutes

Speedups from Guidelines
[Figure: Speedup chart for a second example, annotated as follows.]
- No guidelines: speedup 8.6x
- Conversion to Constants: speedup 14.4x, total time 10 minutes
- Input Enumeration: speedup 16.7x, total time 20 minutes
- Algorithmic Specialization: speedup 19x, time 30 minutes, sw overhead 6000%

Speedups from Guidelines
- Original code
  - Speedups range from 1x (no speedup) to 573x
  - Average: 2.6x (excludes brev)
- Refined code with guidelines
  - Average: 8.4x (excludes brev)
  - 3.5x average improvement compared to original code

Speedups from Guidelines
- Guidelines move speedups closer to ideal
  - Almost identical for mpeg2, fir
- Several examples still far from ideal
  - May imply new guidelines needed

Guideline SW Overhead/Improvement
[Figure: Per-benchmark chart with "Overhead" and "Improvement" regions.]
- Average sw performance overhead: -15.7% (improvement)
  - -1.1% excluding brev
  - 3 examples improved
- Average sw size overhead (lines of C code)
  - 8.4% excluding brev

Summary
- Simple coding guidelines significantly improve synthesis from C
  - 3.5x speedup compared to hw/sw synthesized from unrefined code
- Major rewrites may not be necessary
  - Between 1-2 hours
- Refinement methodology
  - Reduces software size/performance overhead
    - In some cases, improvement
- Future work
  - Test on commercial synthesis tools
  - New guidelines for different domains