Data Parallel FPGA Workloads: Software Versus Hardware
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
FPL 2009

FPGA Systems and Soft Processors

A digital system's computation can be implemented on an FPGA in two ways:
  Weeks:  Software + Compiler -> Soft Processor (easier, configurable; used in 25% of designs [source: Altera, 2009])
  Months: HDL + CAD -> Custom HW (faster, smaller, less power)
The two compete. Goal: simplify FPGA design by customizing the soft processor architecture, targeting data-level parallelism with vector processors.

Vector Processing Primer

```c
// C code
for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];
```

```asm
// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c
```

Each vector instruction holds many units of independent operations: with 1 vector lane, the vadd steps through vr2[i] = vr0[i] + vr1[i] for all 16 elements.
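The example above assumes the whole array fits in one vector. As a minimal C sketch (our illustration, not VESPA's toolchain), strip-mining shows how longer arrays are handled when N exceeds the hardware's maximum vector length (MVL); vec_add() stands in for the set vl / vload / vadd / vstore sequence:

```c
#include <stddef.h>

#define MVL 16  /* assumed maximum vector length */

/* Stand-in for one vector operation executed with vector length vl;
 * real hardware performs these element operations in parallel. */
static void vec_add(const int *a, const int *b, int *c, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/* Process n elements MVL at a time, shrinking vl on the final strip. */
void strip_mined_add(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i += MVL) {
        size_t vl = (n - i < MVL) ? (n - i) : MVL;
        vec_add(a + i, b + i, c + i, vl);
    }
}
```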

Vector Processing Primer (16 Lanes)

With 16 vector lanes, the same vadd executes all 16 element operations vr2[i] = vr0[i] + vr1[i] in parallel: a 16x speedup.

Previous work on soft vector processors demonstrated:
1. Scalability
2. Flexibility
3. Portability

Soft Vector Processors vs HW

  Weeks:  Software + Compiler + Vectorizer -> Soft Vector Processor (easier; scalable, fine-tunable, customizable; 1-16 vector lanes)
  Months: HDL + CAD -> Custom HW (faster, smaller, less power; but by how much?)

Question: what is the gap between soft vector processors and FPGA custom HW (and versus a scalar soft processor)?

Measuring the Gap

The EEMBC benchmarks are implemented on a scalar soft processor, on a soft vector processor, and as custom HW circuits. Each implementation is evaluated for speed and area, the results are compared, and conclusions are drawn.

VESPA Architecture Design (Vector Extended Soft Processor Architecture)

[Block diagram] A 3-stage scalar pipeline and a 3-stage vector control pipeline feed a 6-stage vector pipeline; all share one instruction cache and one data cache. The vector pipeline decodes into vector control (VC), vector scalar (VS), and vector (VR) register files, performs hazard checking, and replicates operations across 32-bit lanes (Lane 1: ALU and memory unit; Lane 2: ALU, memory unit, and multiplier). Supports integer and fixed-point operations [VIRAM].

VESPA Parameters

Category                      Description                  Symbol   Values
Compute Architecture          Number of Lanes              L        1, 2, 4, 8, ...
Compute Architecture          Memory Crossbar Lanes        M        1, 2, ..., L
Compute Architecture          Multiplier Lanes             X        1, 2, ..., L
Instruction Set Architecture  Maximum Vector Length        MVL      2, 4, 8, ...
Instruction Set Architecture  Width of Lanes (in bits)     W        1-32
Instruction Set Architecture  Instruction Enable (each)    -        on/off
Memory Hierarchy              Data Cache Capacity          DD       any
Memory Hierarchy              Data Cache Line Size         DW       any
Memory Hierarchy              Data Prefetch Size           DPK      < DD
Memory Hierarchy              Vector Data Prefetch Size    DPV      < DD/MVL
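For illustration only, a hypothetical C mirror of this parameter space (the real design is configured in Verilog; the size of the per-instruction enable array is our assumption):

```c
#include <stdbool.h>

struct vespa_config {
    unsigned lanes;            /* L: 1, 2, 4, 8, ...                    */
    unsigned mem_xbar_lanes;   /* M: 1 .. L                             */
    unsigned mul_lanes;        /* X: 1 .. L                             */
    unsigned mvl;              /* MVL: maximum vector length            */
    unsigned lane_width_bits;  /* W: 1 .. 32                            */
    bool     insn_enable[64];  /* per-instruction on/off (size assumed) */
    unsigned dcache_bytes;     /* DD: data cache capacity               */
    unsigned dcache_line;      /* DW: data cache line size              */
    unsigned prefetch_k;       /* DPK: prefetch size, < DD              */
    unsigned prefetch_vl;      /* DPV: vector prefetch, < DD/MVL        */
};
```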

VESPA Evaluation Infrastructure

Software flow: the EEMBC C benchmarks, with vectorized assembly subroutines, are built with GCC and the GNU assembler into an ELF binary.
Hardware flow: the Verilog design (scalar microprocessor plus vector processing unit) goes through RTL simulation for cycle counts, verified against an instruction set simulator and on the TM4 hardware, and through Altera Quartus II v8.1 for area and clock frequency.
The result is a realistic and detailed evaluation.

Designing HW Circuits (with simplifying assumptions)

Each benchmark's HW core datapath is assumed to be fed at full DDR bandwidth, with idealized control and memory requests. Altera Quartus II v8.1 provides area and clock frequency; the cycle count is modelled, and execution time is calculated from the data size. These are optimistic HW implementations compared against real processors.
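A sketch of such an optimistic execution-time model (the names and the exact cycle formula are our assumptions; the paper models cycle counts per benchmark):

```c
double hw_exec_time_s(double data_bytes,
                      double ddr_bytes_per_cycle, /* full DDR bandwidth   */
                      double compute_cycles,      /* modelled datapath    */
                      double clock_hz)            /* from Quartus II      */
{
    /* The core is assumed never starved: total cycles are the slower of
     * streaming the data and computing on it. */
    double mem_cycles = data_bytes / ddr_bytes_per_cycle;
    double cycles = (mem_cycles > compute_cycles) ? mem_cycles
                                                  : compute_cycles;
    return cycles / clock_hz;
}
```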

Benchmarks Converted to HW

On a Stratix III 3S200C2, the EEMBC/VIRAM HW circuits clock at 275-475 MHz while VESPA clocks at 120-140 MHz, giving the HW a roughly 3x clock frequency advantage.

Performance/Area Space (vs HW)

Plotting HW speed advantage against HW area advantage, with the optimistic HW at point (1, 1):
  Scalar soft processor: 432x slower and 7x larger than HW
  Fastest VESPA: 17x slower and 64x larger than HW
Soft vector processors can significantly close the performance gap.

Area-Delay Product

The area-delay product is commonly used to measure efficiency in silicon:
  - It considers both performance and area
  - It is the inverse of performance-per-area
Calculated as: (Area) x (Wall Clock Execution Time)
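As a trivial sketch of the metric (units are whatever the evaluation uses, e.g. equivalent LEs and seconds; only ratios between designs matter):

```c
/* Area-delay product: lower is better. */
double area_delay(double area, double exec_time_s)
{
    return area * exec_time_s;
}

/* Relative silicon efficiency of design A versus design B. */
double ad_ratio(double area_a, double t_a, double area_b, double t_b)
{
    return area_delay(area_a, t_a) / area_delay(area_b, t_b);
}
```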

Area-Delay Space (vs HW)

In area-delay terms the scalar soft processor is 2900x worse than HW; VESPA achieves up to 3x better silicon usage than the scalar processor.

Reducing the Performance Gap

Previously, VESPA was 50x slower than HW. Two fronts closed the gap:
  Reducing loop overhead: decoupled pipelines (+7% speed)
  Improving data delivery: a parameterized cache (2x speed, 2x area) and data prefetching (+42% speed)
These enhancements were key parts of reducing the gap, combining for a 3x performance improvement.

Wider Cache Line Size

A vld.w on a 16-lane VESPA loads 16 sequential 32-bit words through the vector memory crossbar. With the original 4 KB data cache and 16 B lines, one vector load requires several cache accesses.

Wider Cache Line Size (cont.)

Moving to a 16 KB data cache with 64 B lines lets a single line feed all 16 lanes (4x fewer cache accesses), yielding 2x speed for 2x area, from the reduced cache accesses plus some incidental prefetching.
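A back-of-envelope check of that access reduction (our arithmetic, assuming the vector load starts line-aligned):

```c
#include <stdio.h>

/* Cache lines touched by one vector load of `lanes` sequential words. */
unsigned accesses_per_vload(unsigned lanes, unsigned word_bytes,
                            unsigned line_bytes)
{
    unsigned total_bytes = lanes * word_bytes;            /* 16 * 4 = 64 B */
    return (total_bytes + line_bytes - 1) / line_bytes;   /* ceil division */
}

int main(void)
{
    printf("16B line: %u accesses\n", accesses_per_vload(16, 4, 16)); /* 4 */
    printf("64B line: %u accesses\n", accesses_per_vload(16, 4, 64)); /* 1 */
    return 0;
}
```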

Hardware Prefetching Example

Without prefetching, every vld.w that misses in the data cache pays the full miss penalty (10 cycles to DDR). Prefetching 3 blocks on a miss turns subsequent misses into hits, and the reduced miss cycles yield a 42% speed improvement.
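A toy model of that effect (our construction; it assumes a purely sequential stream and that prefetches complete before the data is needed):

```c
/* Miss cycles for streaming `blocks_streamed` blocks when each miss
 * fetches the missed block plus (degree - 1) following blocks, so only
 * every degree-th block misses. */
unsigned miss_cycles(unsigned blocks_streamed, unsigned degree,
                     unsigned miss_penalty)
{
    unsigned misses = (blocks_streamed + degree - 1) / degree;
    return misses * miss_penalty;
}
/* miss_cycles(n, 1, 10) vs miss_cycles(n, 3, 10): 3x fewer miss cycles. */
```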

Reducing the Area Gap (by Customizing the Instruction Set)

FPGAs can be reconfigured between applications. Observations: not all applications (1) operate on 32-bit data types or (2) use the entire vector instruction set. Unused hardware can therefore be eliminated.

VESPA Parameters (revisited)

Of the parameters listed earlier, two are customized per application: the width of the lanes (W, 1-32 bits) is reduced to match the application's data types, and the per-instruction enables subset the instruction set to only what the application uses.

Customized VESPA vs HW

In the speed/area space versus HW, up to 45% of VESPA's area is saved through width reduction and instruction subsetting.

Summary

VESPA is more competitive with HW design:
  - The fastest VESPA is only 17x slower than HW; the scalar soft processor was 432x slower
  - Attacking loop overhead and data delivery was key: decoupled pipelines, cache tuning, data prefetching
  - Further enhancements can reduce the gap more

VESPA improves the efficiency of silicon usage:
  - 900x worse area-delay than HW, versus 2900x for the scalar soft processor
  - Subsetting and width reduction can further reduce this to 561x

Together these enable software implementation of non-critical data-parallel computation.

Thank You!

Stay tuned for the public release of:
1. The GNU assembler ported for VIRAM (integer only)
2. The VESPA hardware design (DE3-ready)

Breaking Down Performance

For a loop (Loop: <work>; goto Loop), execution time breaks into three components:
  a) Iteration-level parallelism
  b) Cycles per iteration
  c) Clock period
The HW advantage is measured in each of these components.
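Written out (our reconstruction from the three components above), execution time is a product, so the per-component HW advantages multiply to the total:

```latex
t_{\mathrm{exec}} \;=\;
\underbrace{\frac{\text{iterations}}{\text{iteration-level parallelism}}}_{(a)}
\times \underbrace{\frac{\text{cycles}}{\text{iteration}}}_{(b)}
\times \underbrace{\text{clock period}}_{(c)}
```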

Breakdown of Performance Loss (16-lane VESPA vs HW)

Benchmark     Clock Frequency   Iteration-Level Parallelism   Cycles Per Iteration
autcor        2.6x              1x                            9.1x
conven        3.9x              1x                            6.1x
rgbcmyk       3.7x              0.375x                        13.8x
rgbyiq        2.2x              0.375x                        19.0x
ip_checksum   3.7x              0.5x                          4.8x
imgblend      3.6x              1x                            4.4x
GEOMEAN       3.2x              0.64x                         8.2x

The geomean components multiply to the 17x total HW advantage (3.2 x 0.64 x 8.2 ≈ 17). Cycles per iteration is the largest factor; it was previously worse and was recently improved.

1-Lane VESPA vs Scalar

Sources of VESPA's advantage over the scalar soft processor, even with 1 lane:
1. Efficient pipeline execution
2. A large vector register file for storage
3. Amortization of loop control instructions
4. A more powerful ISA (VIRAM vs MIPS):
   1. Support for fixed-point operations
   2. Predication
   3. Built-in min/max/absolute instructions
5. Execution in both the scalar core and the vector coprocessor
6. Manual vectorization in assembly versus scalar GCC

Measuring the Gap (details)

The EEMBC C benchmarks are implemented three ways and compared pairwise:
  Scalar: MIPS soft processor, programmed in C (complete & real)
  VESPA: VIRAM soft vector processor, programmed in assembly (complete & real)
  HW: a custom Verilog circuit for each benchmark (simplified & idealized)

Reporting Comparison Results

Three implementations are compared: (1) Scalar (C) vs HW (Verilog), (2) VESPA (vector assembly) vs HW (Verilog), and (3) the HW (Verilog) itself.

Performance (wall clock time):
  HW Speed Advantage = (Execution Time of Processor) / (Execution Time of Hardware)
Area (actual silicon area):
  HW Area Advantage = (Area of Processor) / (Area of Hardware)

Cache Design Space: Performance (Wall Clock Time)

Across the cache configurations (clock frequencies of 122-129 MHz), the best cache design almost doubles the performance of the original VESPA. Cache line size matters more than cache depth, since the benchmarks do a lot of streaming. More pipelining/retiming could reduce the clock frequency penalty.

Vector Length Prefetching: Performance

Prefetching in multiples of the vector length gives speedups of 21% to a 29% peak (up to 2.2x on one benchmark) with no cache pollution, though some benchmarks are not receptive. 1*VL prefetching provides good speedup without tuning; 8*VL is best.
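A sketch of the idea (all names are ours): on a vector-load miss, fetch the next k*VL elements' worth of cache lines, so the prefetch distance tracks the software-set vector length rather than a fixed size:

```c
#include <stdio.h>

static void fetch_line(unsigned long addr)  /* stub for the memory system */
{
    printf("prefetch line at 0x%lx\n", addr);
}

static void on_vector_load_miss(unsigned long miss_addr,
                                unsigned vl,         /* current vector length */
                                unsigned k,          /* tuning: 1*VL .. 8*VL  */
                                unsigned elem_bytes,
                                unsigned line_bytes)
{
    unsigned bytes  = k * vl * elem_bytes;
    unsigned nlines = (bytes + line_bytes - 1) / line_bytes;  /* ceil */
    for (unsigned i = 1; i <= nlines; i++)
        fetch_line(miss_addr + (unsigned long)i * line_bytes);
}
```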

Overall Memory System Performance

With 16 lanes, wider cache lines plus prefetching reduce memory unit stall cycles significantly: miss cycles fall from 67% through 48% and 31% down to just 4% as the cache grows from 4 KB to 16 KB and prefetching is added. All but 4% of miss cycles are eliminated.