Polymorphic Processors: How to Expose Arbitrary Hardware Functionality

Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers
Stamatis Vassiliadis
Computer Engineering, EEMCS, TU Delft
http://ce.et.tudelft.nl
Member of HiPEAC
PACT ’04, Antibes, France

PZE and Amdahl’s law
• 50% program … 20%: max speedup = 2.0
• Excluding start-up, 5 cycles reduced to 3: speedup 1.6, 83% efficiency
• Techniques: ILP, pipeline, technology
• Potential Zero Execution (PZE): introduced in 1987-88 and published in the IBM Journal of R&D in 1994. Timewise we execute two instructions (50% code elimination).
• The limitation: 2X at 0.5, 10X at 0.9; ASIC: very large
Why polymorphic? We can ride Amdahl’s curve more easily and faster.
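The limits quoted on this slide follow from Amdahl's law. As a short worked check (the symbols f for the accelerated fraction and s for its local speedup are notation assumed here, not taken from the slide):

```latex
\[
S(f, s) = \frac{1}{(1 - f) + \dfrac{f}{s}},
\qquad
S_{\max}(f) = \lim_{s \to \infty} S(f, s) = \frac{1}{1 - f}
\]
\[
S_{\max}(0.5) = 2, \qquad S_{\max}(0.9) = 10
\]
\[
\frac{5 \text{ cycles}}{3 \text{ cycles}} \approx 1.67,
\qquad
\frac{1.67}{2.0} \approx 0.83
\]
```

The last line is one reading of the "83% efficiency" figure: the measured 5-to-3 cycle reduction compared against the maximum of 2.0 for a 50% fraction.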

Motivating example: Paeth coding
Encoding: original image → predictive coding → filtered image → compression (ZIP) → bitstream
Transmission: bitstream
Decoding: bitstream → UNZIP → filtered image → decoding → original image
• Goal: get an image with more 0’s
• Is it possible? Yes: spatial redundancy (adjacent pixels often have the same values, so many differences between them are 0)
Research questions:
• What does Paeth mean in terms of computations?
• Can I put it in hardware?
• What is my gain?

Motivating example
Pixel neighbourhood (d is the current pixel):
    c b
    a d
Paeth(d) = the one of a, b, c that is closest to the initial prediction p = a+b-c
Filtered = Original - Paeth
Worked example: c=3, b=3, a=4, d=4; p = 4+3-3 = 4; Paeth(d) = a = 4; Filtered = 4-4 = 0
Datapath: p = a+b-c; pa = |p-a|, pb = |p-b|, pc = |p-c|; comparisons pa<=pb?, pa<=pc?, pb<=pc?
Paeth area: 6 8-bit adders
[Figure: Original, Paeth and Filtered pixel matrices]
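The predictor on this slide can be sketched in a few lines of C. This is a minimal sketch, assuming the 2x2 layout shown (a = left, b = above, c = upper-left) and the tie-breaking order a, then b, then c that the three comparisons suggest and that the PNG format uses; the function names are illustrative only.

```c
#include <stdlib.h>   /* abs */

/* Paeth predictor for pixel d, using the slide's naming:
 * a = left neighbour, b = neighbour above, c = upper-left neighbour. */
unsigned char paeth_predict(unsigned char a, unsigned char b, unsigned char c)
{
    int p  = (int)a + (int)b - (int)c;  /* initial prediction p = a+b-c */
    int pa = abs(p - a);                /* distance of p to each neighbour */
    int pb = abs(p - b);
    int pc = abs(p - c);

    if (pa <= pb && pa <= pc) return a; /* ties go to a first */
    if (pb <= pc) return b;             /* then to b */
    return c;
}

/* Filtered = Original - Paeth, wrapping modulo 256 for 8-bit pixels. */
unsigned char paeth_filter(unsigned char d, unsigned char a,
                           unsigned char b, unsigned char c)
{
    return (unsigned char)(d - paeth_predict(a, b, c));
}
```

With the slide's values (a=4, b=3, c=3, d=4) the prediction is 4 and the filtered pixel is 0.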

Example: Paeth Prediction (PNG)

C code:
    bptr = prev_row+1;
    dptr = curr_row+1;
    predptr = predict_row+1;
    for(i=1; i < length; i++){
        c = *(bptr-1);
        b = *bptr;
        . .
        if(. . .) *predptr = a;
        else if (. .)
        else *predptr = c;
        . . .
        bptr++;
    }

What it does: initialize; load (c's, a's, b's); unpack; process (a+b, …); pack; store; looping

Altivec code:
    li r5, 0
    …                               (6 instructions in total)
    loop:
    lvx vr03, r1
    lvx vr04, r2
    vsldoi vr05, vr01, vr03, 1
    vmrghb vr07, vr03, vr00
    vmrglb vr08, vr03, vr00
    …                               (6 instructions in total)
    #Compute
    vadduhs vr15, vr09, vr11
    vadduhs vr16, vr10, vr12
    vsubshs vr15, vr07
    vsubshs vr16, vr08
    . .                             (76 instructions in total)
    #Pack:
    vpkshus vr28, 29
    #Store:
    stvx vr28, r3, 0
    #Loop control
    addi r1, 16
    …
    bneq r7, r0, loop

CSI code:
    li r5, 1
    csi_mt_scr r1, SCR1, 0
    csi_mt_scr r5, SCR1, 1
    . .                             (20 instructions in total)
    csi_paeth predptr, bptr, dptr   (ONE INSTRUCTION for all loop iterations)

• Altivec iteration: 95 instructions per 16 pixels.
• CSI code: 1 instruction for all iterations (+20 setup instructions).

CSI instruction design (EUROMICRO 99):
    latency:     5 cycles
    throughput:  16 pixels/1 cycle
    area:        24 32-bit adders
    Cycle = 1 ALU operation
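The two instruction counts quoted on this slide can be turned into a simple model of dynamic instructions per row. This is a rough sketch under stated assumptions: the Altivec initialization instructions are ignored, a partial final iteration is charged as a full one, and the function names are hypothetical.

```c
/* Per-slide figures: 95 Altivec instructions per 16-pixel iteration,
 * versus 20 setup instructions plus a single csi_paeth instruction. */
enum { PIXELS_PER_ITER = 16, ALTIVEC_PER_ITER = 95, CSI_SETUP = 20 };

/* Altivec: one 95-instruction iteration per (possibly partial) block of 16. */
unsigned long altivec_instructions(unsigned long pixels)
{
    unsigned long iters = (pixels + PIXELS_PER_ITER - 1) / PIXELS_PER_ITER;
    return iters * ALTIVEC_PER_ITER;
}

/* CSI: setup plus one instruction, independent of the row length. */
unsigned long csi_instructions(unsigned long pixels)
{
    (void)pixels;
    return CSI_SETUP + 1;
}
```

For a 132-pixel row this model gives 9 iterations, that is 855 Altivec instructions against 21 for CSI; it says nothing about cycle counts, which depend on the issue width and the CSI unit's throughput.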

Results: Instruction count and execution time reduction
Bench: Paeth kernel, 132-element vectors (132 pixels in a row)
• Dynamic instruction counts, normalised to non-CSI counts
• Execution time: on a 4-issue CPU with a 32-byte-wide CSI unit, normalised to non-CSI execution

Research Questions
Motivating example, obvious observations:
• There is NO way I can do this on fixed hardware.
• I can do this if the hardware changes functionality at my wish.
EASIER SAID THAN DONE! I have to answer the following:
• How can I identify the code for hardware implementation? → New kind of tools
• How can I implement “arbitrary” code? → Microarchitecture
• Is the hardwired code substituted by new instructions? → Processor architecture (behavior + logical structure)
• How can I substitute this code with SW/HW descriptions, say at the source level? → Programming paradigm (HW and SW descriptions coexisting in a program)
• How can I automatically generate the “transformed” program? → Compilation

Outline
[Figure: Program P’, A, DATA, GPP, RH, MEM, FPGA]
What to do:
– Identify the “ ” code
– Show hardware feasibility of “ ” in FPGA
– Map “ ” into reconfigurable hardware (RH)
– Eliminate the identified code
– Add code to have “equivalent” behavior
– Compile new program
– Execute
Introduce reconfigurable microcode (ρμ-code). Specific code in hardware is left to the programmer/hardware designer.
RESULTS: Tools, Microarchitecture, Architecture, Programming Paradigm, Compiler
MOLEN:
• One time 8 new instructions for any ISA
• Co-processor paradigm (e.g. vector)
• New register file for parameter passing
• Sequential consistency
• Split-join parallelism
• Function-like code

Tool Chain
Example code:
    int fact(int n) {
        if (n < 1) return 1;
        else return (n * fact(n - 1));
    }
[Figure: tool chain. Elements: Code; Human Directives; C2C; New Program where hardware/software descriptions co-exist; Architecture; Retargeted Compiler; f(.); Binary Code with call f(.); decision "Critical?" (YES/NO); HDL (hand coded); Xilinx Virtex-II Pro FPGA; IBM PowerPC]

The MOLEN ISA (SAMOS ’03)
• Divide RC into two logical phases: “SET address” and “EXECUTE address”
• “Function” independent: no new op-codes
• Implementation and ISA independent
• Reconfigurable design: two instructions
• Execute on reconfigurable: one instruction
• Parameter passing: two new instructions + register file; arbitrary number of parameters
• Parallel execution: split via a Molen instruction and join via a GPP instruction or one special instruction
• Speeding up reconfiguration and execution: two instructions for prefetching
• Modularity: by implementing at least the minimal MOLEN instruction set and by reconfiguring to it
Total: 8 new instructions

Instruction Set Partitioning
8 instructions grouped in 6 instruction categories:
• SET <address>: partial SET (P-SET) and complete SET (C-SET)
• EXECUTE <address>
• SET PREFETCH <address>
• EXECUTE PREFETCH <address>
• BREAK
• MOVTX and MOVFX
Instruction set levels: Minimal, Preferred, Complete


Sequence Control Example

C code:
    #pragma call_fpga op1
    int f(int x, int y) { … }
    #pragma call_fpga op2
    int g(int x) { … }
    int h(int a, int b, int c) {
        int m, n, . . . ;
        m = f(a, b);        (no data dependency)
        n = g(c);
        ……
    }

Generated code:
    h:  mov a -> r1
        movtx r1 -> XR2
        mov b -> r2
        movtx r2 -> XR3
        mov c -> r3
        movtx r3 -> XR4
        set address_set_op1
        set address_set_op2
        ldc 2 -> r4
        movtx r4 -> XR0
        ldc 4 -> r5
        movtx r5 -> XR1
        execute address_ex_op2      (In parallel)
        movfx XR2 -> r6
        mov r6 -> m
        movfx XR4 -> r7
        mov r7 -> n

Reconfigurable Microcode Storage (IEEE MICRO ’03)
• Fixed on-chip storage for frequently used microcode (permanently stored)
• Pageable on-chip storage for less frequently used microcode (loaded from memory)
[Figure: microprogram storage, FIXED and PAGEABLE on-chip storage]

The ρμ-code unit
[Figure: the ρμ-code unit. Elements: instruction fields R/P and CS-α/α; Residence Table; hash H; CS-α (fixed) or CS-α, if present (pageable); SEQUENCER, which determines the next microinstruction; ρCSAR; ρ-CONTROL STORE with SET and EXECUTE sections, each with FIXED and PAGEABLE parts; MIR, which passes the microinstruction to the execution hardware = reconfigurable unit (CCU); feedback from the execution hardware to the sequencer]

More on Architectural support
Instruction format: OPC | R/P | address
• R/P: Resident (0); Pageable (1)
• address: Control Store address (CS-α) or memory address (α)
Microprogram:
• located in memory starting at address α
• the address points to the first microinstruction
• terminated by an end_op
An example microprogram:
    00: load values into adder
    01: shift_ins
    02: add_ins
    03: shift 2_ins
    04: SKIP
    05: BACK
    06: store
    07: end_op
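The instruction format above (OPC, R/P bit, address) could be packed and unpacked as follows. This is only a sketch: the slide gives no field widths, so the 32-bit word with a 6-bit opcode, 1-bit R/P flag and 25-bit address is a hypothetical layout, and the type and function names are illustrative.

```c
#include <stdint.h>

/* Hypothetical layout: bits 31..26 OPC, bit 25 R/P, bits 24..0 address. */
#define OPC_SHIFT  26u
#define RP_SHIFT   25u
#define ADDR_MASK  0x01FFFFFFu

typedef struct {
    uint8_t  opc;       /* opcode field (OPC) */
    uint8_t  pageable;  /* R/P bit: 0 = resident, 1 = pageable */
    uint32_t address;   /* control store address if resident,
                           memory address if pageable */
} molen_instr;

/* Pack the three fields into one instruction word. */
uint32_t molen_encode(molen_instr in)
{
    return ((uint32_t)(in.opc & 0x3Fu) << OPC_SHIFT)
         | ((uint32_t)(in.pageable & 1u) << RP_SHIFT)
         | (in.address & ADDR_MASK);
}

/* Split an instruction word back into its fields. */
molen_instr molen_decode(uint32_t word)
{
    molen_instr out;
    out.opc      = (uint8_t)(word >> OPC_SHIFT);
    out.pageable = (uint8_t)((word >> RP_SHIFT) & 1u);
    out.address  = word & ADDR_MASK;
    return out;
}
```

A decoder along these lines would use the R/P bit to decide whether the address indexes the control store directly or names a microprogram in memory that must first be looked up or paged in.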

The MOLEN ρμ-coded processor (FPL ’01)
• The arbiter arbitrates (redirects) instructions between GPP and RP
• The arbiter also controls the loading of microcode
• The CCU has direct access to the data memory
• X registers exchange parameters between GPP and RU
• The ρμ-code unit controls the CCU by microinstructions

The Molen Prototype
[Figures: Molen machine organization; Molen prototype implemented on Virtex II Pro]

The Prototype Features (FCCM 04)
A VHDL model has been synthesized for Virtex II Pro technology:
• 64 KBytes data and 64 KBytes instruction (on-chip) memories;
• 64-bit data memory bus;
• 64-bit instruction memory bus;
• 64-bit microcode word length;
• 32 MBytes memory segment for microprograms;
• 8K x 64-bit ρ-control store using Dual Port Block RAMs (BRAM);
• 512 x 32-bit XREGs implemented in BRAMs.
Three clock domains:
• PPC clock: 250 MHz;
• MEM clock: 83 MHz;
• User clock: external.
Trivial HW costs. Utilization of FPGA resources (no CCU), device xc2vp20-5:

                     Reconf.     Arbiter   Total incl.   Available    %
                     Processor             XREGs         resources
    # slices         71          84        156           10304        1
    # flip-flops     84          69        147           20608        1
    # LUT4           171         150       322           20608        1
    # BRAM           4           N.A.      5             112          3
    Max. Freq. [MHz] 130         143       130           N.A.

Compiling for the Molen (FCCM)
[Figure: C application (MAIN.c, File_n.c) → Compiler: SUIF frontend, Machine SUIF backend framework with alpha and x86 backends, MOLEN extension]

The Molen Compiler (FPL 03-04)
• IBM PowerPC 405 GPP in Virtex II Pro
• Register file extension (XRs)
• ISA extension
Compiler structure: SUIF + Machine SUIF, with the Molen extension:
• PowerPC backend
• ISA extension (SET/EXEC)
• Register extension (XRs)

Code for a “function”
Example C code: res = alpha(param1, param2);
    movtx XR1 ← param1              Send parameters
    movtx XR2 ← param2
    set <address_alpha_set>         HW reconfiguration
    exec <address_alpha_exec>       HW execution
    movfx res ← XR3                 Return result
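The five-step call sequence above can be mimicked in plain software to make the parameter-passing convention concrete. This is a toy model, not the Molen hardware interface: the XREG file is an array, "set" stores a function pointer standing in for a loaded configuration, and alpha_hw (which simply adds its two parameters) is a hypothetical operation chosen for illustration.

```c
#include <stdint.h>

#define XREG_COUNT 512              /* the prototype lists 512 x 32-bit XREGs */

typedef void (*ccu_fn)(int32_t *xr);

static int32_t xreg[XREG_COUNT];    /* exchange registers (XRs) */
static ccu_fn  configured;          /* operation currently "configured" */

static void    movtx(unsigned xr, int32_t v) { xreg[xr] = v; }     /* GPP -> XR */
static int32_t movfx(unsigned xr)            { return xreg[xr]; }  /* XR -> GPP */
static void    set_op(ccu_fn op)             { configured = op; }  /* reconfigure */
static void    exec_op(void)                 { if (configured) configured(xreg); }

/* Hypothetical "hardware" alpha: reads XR1 and XR2, writes XR3. */
static void alpha_hw(int32_t *xr) { xr[3] = xr[1] + xr[2]; }

/* res = alpha(param1, param2), expanded as on the slide. */
int32_t call_alpha(int32_t param1, int32_t param2)
{
    movtx(1, param1);       /* send parameters */
    movtx(2, param2);
    set_op(alpha_hw);       /* HW reconfiguration */
    exec_op();              /* HW execution */
    return movfx(3);        /* return result */
}
```

Calling call_alpha(2, 3) walks through the same send, set, execute, return order as the generated code and yields 5 in this model.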

Sequence Control Example: code generation

C code:
    #pragma call_fpga op1
    int f(int a, int b){
        int c, i;
        c=0;
        for(i=0; i<b; i++)
            c = c + a<<i + i;
        c = c>>b;
        return c;
    }
    void main(){
        int x, z;
        z=5;
        x= f(z, 7);
    }

Original code:
    main:
        mrk 2, 13
        ldc $vr0.s32 <- 5
        mov main.z <- $vr0.s32
        mrk 2, 14
        ldc $vr2.s32 <- 7
        cal $vr1.s32 <- f(main.z, $vr2.s32)
        mov main.x <- $vr1.s32
        mrk 2, 15
        ldc $vr3.s32 <- 0
        ret $vr3.s32
        .text_end main

Modified code (replaces the call at mrk 2, 14):
        mrk 2, 14
        mov $vr2.s32 <- main.z
        movtx $vr1.s32(XR) <- $vr2.s32
        ldc $vr4.s32 <- 7
        movtx $vr3.s32(XR) <- $vr4.s32
        set address_op1_SET
        ldc $vr6.s32(XR) <- 0
        movtx $vr7.s32(XR) <- $vr6.s32
        exec address_op1_EXEC
        movfx $vr8.s32 <- $vr5.s32(XR)
        mov main.x <- $vr8.s32

The Experiment (hand tuned HW)

Step 1. Obtain MPEG-2 profiling data on a PowerPC system

                                      MPEG-2 encoder                                  MPEG-2 decoder
    sequence    #frames@Resolution    SAD (16x16)   DCT (8x8)   IDCT (8x8)   Total    IDCT (8x8)
    carphone    96@176x144            51.1 %        12.5 %      1.3 %        64.9 %   50.4 %
    claire      168@360x288           53.8 %        11.8 %      1.0 %        66.6 %   37.6 %
    container   300@352x288           56.2 %        10.7 %      1.0 %        67.9 %   40.4 %
    tennis      112@352x240           60.0 %        9.5 %       0.8 %        70.3 %   40.5 %

Step 2. Measure the kernel speedups on the prototype

                SAD 16    SAD 128   SAD 256
    carphone    6.5       18.9      22.2
    claire      8.3       23.9      28.2
    container   12.2      35.2      41.5
    tennis      12.1      35.0      41.2
    DCT, IDCT: 302.3  302.2  302.1  24.4  32.3

Step 3. Overall speedup per kernel

                MPEG-2 encoder                                   MPEG-2 decoder
                SAD 16    SAD 128   SAD 256   DCT     IDCT       IDCT
    carphone    1.76      1.94      1.95      1.14    1.01       1.94
    claire      1.90      2.06      2.08      1.13               1.56
    container   2.07      2.20      2.21      1.12               1.63
    tennis      2.22      2.40      2.41      1.10               1.65
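Step 3 appears to follow from Steps 1 and 2 through Amdahl's law: a kernel that takes a fraction f of the run time and is accelerated by a factor s yields an overall speedup of 1 / ((1 - f) + f / s). A minimal sketch, with the function name chosen here for illustration:

```c
/* Overall application speedup when a kernel occupying fraction f of the
 * execution time (0 <= f <= 1) runs s times faster (s > 0). */
double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For carphone with SAD 16, f = 0.511 and s = 6.5 give about 1.76, and for tennis, f = 0.600 and s = 12.1 give about 2.22, both in line with the Step 3 figures.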

Real vs. Theoretical Speedups

Step 4. Application speedup

                MPEG-2 Encoder                    MPEG-2 Decoder
    Speedup     Prototype   theory   %Smax        Prototype   theory   %Smax
    carphone    2.64        2.85     93           1.94        2.02     96
    claire      2.80        2.99     94           1.56        1.60     98
    container   2.96        3.12     95           1.63        1.68     97
    tennis      3.18        3.37     94           1.65        1.68     98

The MOLEN prototype speeds up the MPEG-2 codec to between 93% and 98% of the theoretically attainable maximum speedups.
[Figure: performance gain. Execution time with the part a (SAD, TSE, DCT) implemented in reconfigurable hardware; recall Smax.]
[Figure: MPEG-2 encoder speedup versus a, theoretically attainable maximum against experimentally measured values.]
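The "theory" column is consistent with letting the accelerated kernels take zero time. As a worked check for the carphone encoder, assuming a is the total accelerated fraction from the Step 1 profile (64.9%):

```latex
\[
S_{\max} = \frac{1}{1 - a}
\]
\[
a = 0.649: \quad S_{\max} = \frac{1}{0.351} \approx 2.85,
\qquad
\frac{S_{\text{prototype}}}{S_{\max}} = \frac{2.64}{2.85} \approx 0.93
\]
```

The same calculation for the carphone decoder (a = 0.504) gives a maximum of about 2.02 and a ratio of 1.94 / 2.02, about 0.96.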

mpeg2enc Instruction Counts
[Figure: instruction counts; labels: 137 million, 46 million, 33.7, 35.1, 54, 91.5]

M-JPEG (HW automatically generated) (FPL 04)
• M-JPEG multimedia benchmark
• DCT* hardware implementation
• Molen prototype

Performance MJPEG: 2.5 speedup

                           Execution cycles
    SW DCT (%)             66 %
    SW DCT                 1,242,017
    HW DCT                 4,125
    conv                   102,589
    Prototype speedup      2.5x
    Theoretical speedup    2.96x
    Efficiency             84 %

Conclusions
• We have shown a new:
  • microarchitecture
  • processor architecture
  • programming paradigm
  • compilation
• We have shown that it is easier and faster to ride Amdahl’s curve with polymorphic processors!

Contact information
Computer Engineering Laboratory: http://ce.et.tudelft.nl
MOLEN homepage: http://ce.et.tudelft.nl/MOLEN
Personal homepage: http://ce.et.tudelft.nl/~stamatis
Overview paper: The Molen Polymorphic Processor, IEEE Transactions on Computers, Nov. 2004