Soft Vector Processors with Streaming Pipelines Aaron Severance
- Slides: 21
Soft Vector Processors with Streaming Pipelines Aaron Severance Joe Edwards Hossein Omidian Guy G. F. Lemieux
Motivation �Data parallel problems on FPGAs ◦ ESL? ◦ Overlays? ◦ Processors? 2
Example: N-Body Problem �O(N 2) force calculation ◦ Streaming Pipeline (custom vector instruction) �O(N) housekeeping ◦ Overlay (soft vector processor) �O(1) control ◦ Processor (ARM or soft-core) 3
Soft Vector Processor (SVP) 4
Vector. Blox MXP � 1 to 128 parallel vector lanes (4 shown) 5
MXP Datapath 6
Custom Vector Instructions (CVIs) �Simple CVI parallel scalar CIs 7
CVI Complications (1) �CVIs can be big ◦ e. g. square root, floating point ◦ Bigger than entire integer ALU �Make them cheaper ◦ Don’t replicate for every lane ◦ Reuse existing alignment networks �No additional costs, buffering 8
Cheap Heterogeneous Lanes 9
CVI Complications (2) �CVIs can be deep ◦ e. g. FP addition >> depth than MXP pipeline �Execute stage is 3 cycles, stall-free �CVI pipeline must ‘warm up’ ◦ Don’t writeback until valid data appears ◦ Best if vector length >> CVI depth 10
Multiple Operand CVIs � 2 D N-body problem: 3 inputs, 2 outputs 11
4 Input, 2 Output CVI Option 1: Spatially Interleaved �Easy for interleaved (Array-of-Struct) data ◦ But vector data is normally contiguous (So. A) 12
4 Input, 2 Output CVI Option 2: Time Interleaved �Alternate operands every cycle ◦ Data is valid every 2 cycles 13
4 Input, 2 Output CVI Option 2 with Funnel Adapters �Multiplex 2 CVI lanes to one pipeline ◦ Use existing 2 D/3 D instructions to dispatch 14
Building CVIs �We created CVIs via 3 methods: 1. RTL 2. Altera’s DSP Builder 3. Synthesis from C (custom LLVM solution) 15
Altera’s DSP Builder �Fixed or Floating-Point Pipelines ◦ Automatic pipelining given target �Adapters provided to MXP CVI interface 16
Synthesis From C (using LLVM) for( int glane = 0 ; glane < CVI_LANES ; glane++ ) { f 16_t gmm = f 16_mul( ref_gm, m[glane] ); f 16_t dx = f 16_sub( ref_px, px[glane] ); f 16_t dy = f 16_sub( ref_py, py[glane] ); f 16_t dx 2 = f 16_mul(dx, dx); f 16_t dy 2 = f 16_mul(dy, dy); f 16_t r 2 = f 16_add(dx 2, dy 2); f 16_t CVI_LANES r = f 16_sqrt(r 2); #define 8 /* number of physical lanes */ f 16_t int 32_t rr = f 16_div(F 16(1. 0), r); typedef f 16_t gmm_rr = f 16_mul(rr, gmm_68); f 16_t ref_px, ref_py, ref_gm; f 16_t gmm_rr 2 = f 16_mul(rr, gmm_rr); f 16_t px[CVI_LANES], py[CVI_LANES], m[CVI_LANES]; f 16_t gmm_rr 3 = f 16_mul(rr, gmm_rr 2); f 16_t result_x[CVI_LANES], result_y[CVI_LANES]; f 16_t dfx = f 16_mul(dx, gmm_rr 3); f 16_t dfy = f 16_mul(dy, gmm_rr 3); void force_calc() { f 16_t result_x = f 16_add(result_x[glane], dfx); f 16_t for( int result_y glane = 0=; f 16_add(result_y[glane], dfy); glane < CVI_LANES ; glane++ ) { result_x[glane] //CVI code here= result_x; } result_y[glane] = result_y; }} �CVI templates provided �Restricted C subset Verilog ◦ Can run on scalar core for easy debugging 17
N-Body Performance 18
Performance/Area SVP Configuration V 32, 16 physical pipelines MXP Speedup/ALM Relative to Nios II/f 1. 1 MXP + DIV/SQRT 19. 7 MXP + N-Body (floating-point) 68. 7 MXP + N-Body (fixed-point) 116. 0 19
Conclusions �CVIs can incorporate streaming pipelines ◦ SVP handles control, light data processing ◦ Deep pipelines exploit FPGA strengths �Efficient, lightweight interfaces ◦ Including multiple input & output operands �Multiple ways to build and integrate 20
Thank You 21
- Severance streaming
- Updm pipelines
- Westwood pipelines
- Edge to core to cloud
- Ammonia pipelines
- Pipelines
- Questar pipelines
- Ece 526
- Comparison of word processors
- Language and processors for requirement
- Non linear pipeline processor
- David kirk nvidia
- Digital camera processors
- Macro call is handled in this pass of macro processor
- Characterization of query processors
- Networks and telecommunications presentation
- Interrupt handling in arm processors
- Gas processors association
- Advantages and disadvantages of intel processors
- Difference between superscalar and vliw
- Parallel processors from client to cloud
- Programming massively parallel processors, kirk et al.