Soft Vector Processors with Streaming Pipelines Aaron Severance

Motivation �Data parallel problems on FPGAs ◦ ESL? ◦ Overlays? ◦ Processors? 2

Example: N-Body Problem �O(N 2) force calculation ◦ Streaming Pipeline (custom vector instruction) �O(N)

Vector. Blox MXP � 1 to 128 parallel vector lanes (4 shown) 5

Custom Vector Instructions (CVIs) �Simple CVI parallel scalar CIs 7

CVI Complications (1) �CVIs can be big ◦ e. g. square root, floating point

CVI Complications (2) �CVIs can be deep ◦ e. g. FP addition >> depth

Multiple Operand CVIs � 2 D N-body problem: 3 inputs, 2 outputs 11

4 Input, 2 Output CVI Option 1: Spatially Interleaved �Easy for interleaved (Array-of-Struct) data

4 Input, 2 Output CVI Option 2: Time Interleaved �Alternate operands every cycle ◦

4 Input, 2 Output CVI Option 2 with Funnel Adapters �Multiplex 2 CVI lanes

Building CVIs �We created CVIs via 3 methods: 1. RTL 2. Altera’s DSP Builder

Altera’s DSP Builder �Fixed or Floating-Point Pipelines ◦ Automatic pipelining given target �Adapters provided

Synthesis From C (using LLVM) for( int glane = 0 ; glane < CVI_LANES

Performance/Area SVP Configuration V 32, 16 physical pipelines MXP Speedup/ALM Relative to Nios II/f

Conclusions �CVIs can incorporate streaming pipelines ◦ SVP handles control, light data processing ◦

Slides: 21

Download presentation

Soft Vector Processors with Streaming Pipelines Aaron Severance Joe Edwards Hossein Omidian Guy G. F. Lemieux

Motivation �Data parallel problems on FPGAs ◦ ESL? ◦ Overlays? ◦ Processors? 2

Example: N-Body Problem �O(N 2) force calculation ◦ Streaming Pipeline (custom vector instruction) �O(N) housekeeping ◦ Overlay (soft vector processor) �O(1) control ◦ Processor (ARM or soft-core) 3

Soft Vector Processor (SVP) 4

Vector. Blox MXP � 1 to 128 parallel vector lanes (4 shown) 5

MXP Datapath 6

Custom Vector Instructions (CVIs) �Simple CVI parallel scalar CIs 7

CVI Complications (1) �CVIs can be big ◦ e. g. square root, floating point ◦ Bigger than entire integer ALU �Make them cheaper ◦ Don’t replicate for every lane ◦ Reuse existing alignment networks �No additional costs, buffering 8

Cheap Heterogeneous Lanes 9

CVI Complications (2) �CVIs can be deep ◦ e. g. FP addition >> depth than MXP pipeline �Execute stage is 3 cycles, stall-free �CVI pipeline must ‘warm up’ ◦ Don’t writeback until valid data appears ◦ Best if vector length >> CVI depth 10

Multiple Operand CVIs � 2 D N-body problem: 3 inputs, 2 outputs 11

4 Input, 2 Output CVI Option 1: Spatially Interleaved �Easy for interleaved (Array-of-Struct) data ◦ But vector data is normally contiguous (So. A) 12

4 Input, 2 Output CVI Option 2: Time Interleaved �Alternate operands every cycle ◦ Data is valid every 2 cycles 13

4 Input, 2 Output CVI Option 2 with Funnel Adapters �Multiplex 2 CVI lanes to one pipeline ◦ Use existing 2 D/3 D instructions to dispatch 14

Building CVIs �We created CVIs via 3 methods: 1. RTL 2. Altera’s DSP Builder 3. Synthesis from C (custom LLVM solution) 15

Altera’s DSP Builder �Fixed or Floating-Point Pipelines ◦ Automatic pipelining given target �Adapters provided to MXP CVI interface 16

Synthesis From C (using LLVM) for( int glane = 0 ; glane < CVI_LANES ; glane++ ) { f 16_t gmm = f 16_mul( ref_gm, m[glane] ); f 16_t dx = f 16_sub( ref_px, px[glane] ); f 16_t dy = f 16_sub( ref_py, py[glane] ); f 16_t dx 2 = f 16_mul(dx, dx); f 16_t dy 2 = f 16_mul(dy, dy); f 16_t r 2 = f 16_add(dx 2, dy 2); f 16_t CVI_LANES r = f 16_sqrt(r 2); #define 8 /* number of physical lanes */ f 16_t int 32_t rr = f 16_div(F 16(1. 0), r); typedef f 16_t gmm_rr = f 16_mul(rr, gmm_68); f 16_t ref_px, ref_py, ref_gm; f 16_t gmm_rr 2 = f 16_mul(rr, gmm_rr); f 16_t px[CVI_LANES], py[CVI_LANES], m[CVI_LANES]; f 16_t gmm_rr 3 = f 16_mul(rr, gmm_rr 2); f 16_t result_x[CVI_LANES], result_y[CVI_LANES]; f 16_t dfx = f 16_mul(dx, gmm_rr 3); f 16_t dfy = f 16_mul(dy, gmm_rr 3); void force_calc() { f 16_t result_x = f 16_add(result_x[glane], dfx); f 16_t for( int result_y glane = 0=; f 16_add(result_y[glane], dfy); glane < CVI_LANES ; glane++ ) { result_x[glane] //CVI code here= result_x; } result_y[glane] = result_y; }} �CVI templates provided �Restricted C subset Verilog ◦ Can run on scalar core for easy debugging 17

N-Body Performance 18

Performance/Area SVP Configuration V 32, 16 physical pipelines MXP Speedup/ALM Relative to Nios II/f 1. 1 MXP + DIV/SQRT 19. 7 MXP + N-Body (floating-point) 68. 7 MXP + N-Body (fixed-point) 116. 0 19

Conclusions �CVIs can incorporate streaming pipelines ◦ SVP handles control, light data processing ◦ Deep pipelines exploit FPGA strengths �Efficient, lightweight interfaces ◦ Including multiple input & output operands �Multiple ways to build and integrate 20

Thank You 21