Experience and results porting HPEC Benchmarks to MONARCH

Lloyd Lewins & Kenneth Prager
Raytheon Company, 2000 E. El Segundo Blvd, El Segundo, CA 90245
llewins@raytheon.com, keprager@raytheon.com
High Performance Embedded Computing (HPEC) Workshop, 23-25 September 2008
(A) Approved for public release; distribution is unlimited.

Overview of HPEC Benchmarks
- Provides a means to quantitatively evaluate high performance embedded computing (HPEC) systems
- Addresses important operations across a broad range of DoD signal and image processing applications
  - Finite Impulse Response (FIR) Filter
  - QR Factorization
  - Singular Value Decomposition
  - Pattern Matching
  - Corner turn, etc.
- Documentation, uniprocessor C code, MATLAB, sizes
- http://www.ll.mit.edu/HPECchallenge/index.html
9/23/08 Page 2

Overview of MONARCH
- 6 RISC processors
- 12 MBytes on-chip DRAM
- 2 DDR2 external memory interfaces (8 GB/s BW)
- Flash port (32 MB)
- 2 Serial RapidIO ports (1.25 GB/s each)
- 16 IFL ports (2.6 GB/s each)
- On-chip ring: 40 GB/s
- Reconfigurable array: FPCA (64 GFLOPS)
[Figure: MONARCH chip block diagram; block labels include P, ED, R, CM, DIFLs, RIO, Memory Interface, ROM Port and DI/DO]

Benchmark Selection
- Transpose (corner turn)
  - 50 x 5000 and 750 x 5000
  - Transpose to/from external DRAM
- Constant False Alarm Rate (CFAR) detection
  - 16 x 64 x 24, 48 x 3500 x 128, 48 x 1909 x 64 and 16 x 9900 x 16
  - Few ops per data word, so bandwidth limited
  - Larger datasets held in external DRAM, smaller ones in EDRAM
- QR Factorization
  - 500 x 100, 180 x 60, and 150 x 150
  - Givens rotation (more complex)
  - Many 2 x 2 matrix multiplies (but simple)
- Note: results for FIR and FFT previously reported

MONARCH Mapping Issues
- Bandwidth limitations
  - External DRAM (DDR2)
    - 4.7 Gbyte/s peak per port (64 bits @ 333 MHz DDR + overhead)
    - Only one port populated on test board
  - Implementation issues
    - EDRAM bank conflict bug: no simultaneous read/write
    - PBuf to Node-Bus arbitration: unload one word every 3 clocks (cuts 10.6 Gbyte/s PIRX bandwidth down to 3.6 Gbyte/s)
    - DDR2 latency versus MMBT pipeline depth: limits reads to 3.8 Gbyte/s
- Partitioning
- Algorithm selection
  - "Fast" Givens versus "regular" Givens
- Reciprocal/square root
  - Synthesize using Newton-Raphson
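
The DDR2 figure above can be checked with a little arithmetic. The sketch below assumes that "64 bits @ 333 MHz DDR" means a 64-bit bus transferring on both edges of a 333 MHz clock, and that the quoted 4.7 Gbyte/s is what remains after overhead; the variable names are illustrative only.

```python
# Raw DDR2 port bandwidth, assuming a 64-bit bus and two transfers per clock.
BUS_BITS = 64
CLOCK_HZ = 333e6
TRANSFERS_PER_CLOCK = 2  # double data rate

raw_bytes_per_s = BUS_BITS / 8 * CLOCK_HZ * TRANSFERS_PER_CLOCK

# Peak figure quoted on the slide (assumed here to include overhead).
quoted_peak_bytes_per_s = 4.7e9
efficiency = quoted_peak_bytes_per_s / raw_bytes_per_s

print(f"raw: {raw_bytes_per_s / 1e9:.2f} Gbyte/s")
print(f"quoted peak / raw: {efficiency:.2f}")
```

Under these assumptions the raw rate is a little over 5.3 Gbyte/s, so the quoted peak corresponds to roughly 88% of the raw bus rate.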

Corner Turn Benchmark
- Hierarchical block transpose
  - FPCA handles 32 x 8 inner block (uses 16 MEM elements)
  - EDRAM contains 32 x 2528 blocks
  - ANBI streams into 32 x 8 blocks
  - MMBT transfers 32 x 2528 blocks to/from DDR2
- Alignment issues
  - MMBT/DDR2 interactions require transferring 32 words for peak performance
  - Total transpose was 768 x 5056 (3.5% larger)
- Performance issues
  - Single FPCA transpose engine limits bandwidth to 1.3 Gbyte/s
  - Elimination of the bank conflict bug and two DDR2 ports would allow three transpose engines (3.6 Gbyte/s), limited by PBuf/Node-Bus arbitration
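
As a software analogy for the block transpose and padding described above, here is a minimal Python sketch. It is not the MONARCH mapping (which runs on FPCA, ANBI and MMBT hardware); `pad_up` and `blocked_transpose` are hypothetical helper names. The padding rule is an assumption: 768 and 5056 happen to be the smallest multiples of 32 and 64 at or above 750 and 5000, which is consistent with two 2528-word blocks per row, each a multiple of 32 words.

```python
def pad_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def blocked_transpose(src, rows, cols, block_rows, block_cols):
    """Transpose a row-major rows x cols matrix one block at a time.

    rows and cols must be multiples of block_rows and block_cols.
    """
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            # Inner block: analogous to the small block handled by one engine.
            for r in range(r0, r0 + block_rows):
                for c in range(c0, c0 + block_cols):
                    dst[c * rows + r] = src[r * cols + c]
    return dst


# Padding that reproduces the sizes quoted on the slide (assumed rule).
padded_rows = pad_up(750, 32)
padded_cols = pad_up(5000, 64)
growth = padded_rows * padded_cols / (750 * 5000) - 1.0
print(padded_rows, padded_cols, f"{growth:.1%}")
```

The block loop order only changes the order in which elements are visited, not the result, which is why the blocking can be chosen to suit the memory system.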

Corner Turn Implementation
[Figure: block diagram with FPCA, ANBI and MMBT blocks]
FPCA: Field Programmable Computer Array; ANBI: Array Node Bus Interface; MMBT: Memory Block Transfer; DDR_A: Double Data Rate DRAM interface A; EDRAM: Embedded DRAM

Corner Turn Results
- Measured performance, and predicted performance if a second DDR2 bank were available: [table not recovered]
- Predicted performance in the absence of the bank conflict bug: [table not recovered]
- Note: this is end-to-end bandwidth; the achieved memory bandwidth is 2x this
- Bandwidth is in bytes per second
DDR2: Double Data Rate DRAM interface; BCB: Bank Conflict Bug (EDRAM)

Constant False Alarm Rate Benchmark
- Multiple CFAR engines implemented in FPCA
  - Limited by number of EDRAMs to feed them
- Smaller datasets stored in EDRAM
  - Six CFAR engines
  - 14 GFLOPS
- Larger datasets stored in DDR2
  - Three CFAR engines (because of bank conflict bug)
  - Further limited by DDR2 bandwidth
  - 6.2 GFLOPS (12.4 GFLOPS with two DDR ports)
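
To show why CFAR has few operations per data word, here is a minimal cell-averaging CFAR sketch along a single range vector. The slides do not spell out the detector, so the cell-averaging form is an assumption, and the function name `ca_cfar` and its parameters (`ncfar`, `guard`, `mu`) are hypothetical.

```python
def ca_cfar(power, ncfar, guard, mu):
    """Return indices of cells whose power exceeds mu times the local average.

    power: per-cell power values along range
    ncfar: number of averaging cells on each side of the cell under test
    guard: number of guard cells skipped on each side
    mu:    detection threshold multiplier
    """
    n = len(power)
    hits = []
    for i in range(n):
        total = 0.0
        count = 0
        # Average the cells beyond the guard band on both sides.
        for d in range(guard + 1, guard + ncfar + 1):
            for j in (i - d, i + d):
                if 0 <= j < n:
                    total += power[j]
                    count += 1
        if count and power[i] > mu * total / count:
            hits.append(i)
    return hits


# Flat noise floor with one strong target.
cells = [1.0] * 32
cells[10] = 50.0
print(ca_cfar(cells, ncfar=4, guard=1, mu=10.0))
```

Each output needs only a handful of adds, one multiply and one compare, so the work per cell is small compared with the cost of streaming the data in and out.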

CFAR Implementation
[Figure: block diagram with ANBI, FPCA and MMBT blocks]
FPCA: Field Programmable Computer Array; ANBI: Array Node Bus Interface; MMBT: Memory Block Transfer; DDR_A: Double Data Rate DRAM interface A; EDRAM: Embedded DRAM

Constant False Alarm Rate Results
- Measured performance, and predicted performance if a second DDR2 bank were available: [table not recovered]
- Predicted performance in the absence of the bank conflict bug: [table not recovered]
DDR2: Double Data Rate DRAM interface; BCB: Bank Conflict Bug (EDRAM)

QR Factorization Benchmark
- Single QR engine implemented in FPCA
  - Uses high percentage of resources
  - Multiple streams to/from memory
- Performance limited by bandwidth to EDRAM
- Classic "Fast Givens" requires even more streams to/from EDRAM
  - Issue is not FLOPS, but bandwidth
- Calculating a Givens rotation requires square root and reciprocal
  - Implemented in FPCA using Newton-Raphson
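
For readers unfamiliar with the method, a textbook Givens-rotation QR (computing R only) can be sketched as follows. This is a plain software illustration, not the FPCA engine; `givens_r` is a hypothetical name. It shows where the square root and reciprocal arise (forming c and s) and why most of the work is 2 x 2 multiplies applied along pairs of rows.

```python
import math


def givens_r(a):
    """Return the upper-triangular factor R of a (a list of rows)."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    for col in range(n):
        # Zero the column from the bottom up, rotating adjacent row pairs.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            h = math.hypot(x, y)  # needs a square root
            if h == 0.0:
                continue
            c, s = x / h, y / h  # needs a reciprocal
            # Apply the 2 x 2 rotation to every remaining column pair.
            for k in range(col, n):
                u, v = r[row - 1][k], r[row][k]
                r[row - 1][k] = c * u + s * v
                r[row][k] = -s * u + c * v
    return r


R = givens_r([[3.0, 1.0], [4.0, 2.0], [0.0, 5.0]])
for row in R:
    print(["%.4f" % v for v in row])
```

Each rotation is computed once per row pair but applied across the whole row, so the rotation setup runs at a low rate while the 2 x 2 multiplies dominate the data traffic.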

QR Factorization Implementation
[Figure: block diagram with ANBI and FPCA blocks; annotations read "Low rate, more ops" and "2 x 2 multiply"]
FPCA: Field Programmable Computer Array; ANBI: Array Node Bus Interface; EDRAM: Embedded DRAM

QR Factorization Results
- Measured performance: [table not recovered]
- Predicted performance in the absence of the bank conflict bug: [table not recovered]
BCB: Bank Conflict Bug (EDRAM)

Reciprocal/Square Root
- FPCA doesn't support division or square root directly
- A number of approaches were considered, including CORDIC
- Newton-Raphson works surprisingly well, even for floating point numbers
  - Use a few small lookup tables
  - Integer arithmetic to extract exponent and mantissa
  - Floating point arithmetic to iterate estimate
  - Fully pipelined
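
The slides do not show the square-root datapath. One common Newton-Raphson formulation, sketched here as an assumption, iterates on the reciprocal square root (which needs only multiplies and subtracts) and then multiplies by the input. The name `nr_sqrt`, the 256-entry table and the use of `math.frexp` for the exponent/mantissa split are all illustrative choices, not the authors' design.

```python
import math

# Lookup table of 1/sqrt(m) at bucket midpoints for m in [0.5, 2).
# A hardware table would be precomputed offline; math.sqrt is used only to build it.
RSQRT_LUT = [1.0 / math.sqrt(0.5 + (i + 0.5) * 1.5 / 256.0) for i in range(256)]


def nr_sqrt(y):
    """Approximate sqrt(y) for y > 0 using a table lookup and two Newton-Raphson steps."""
    if y <= 0.0:
        raise ValueError("y must be positive")
    m, e = math.frexp(y)  # y = m * 2**e with 0.5 <= m < 1
    if e % 2:
        m *= 2.0  # make the exponent even; m is now in [1, 2)
        e -= 1
    index = min(255, int((m - 0.5) / 1.5 * 256.0))
    x = RSQRT_LUT[index]  # initial estimate of 1/sqrt(m)
    for _ in range(2):
        x = 0.5 * x * (3.0 - m * x * x)  # Newton-Raphson step for 1/sqrt(m)
    rsqrt = math.ldexp(x, -(e // 2))  # 1/sqrt(y)
    return y * rsqrt


print(nr_sqrt(2.0), math.sqrt(2.0))
```

Because each Newton-Raphson step roughly squares the relative error, two steps from a table estimate good to about 8 bits are enough to exceed single-precision accuracy.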

Reciprocal Calculation (Math)
- Newton-Raphson: to solve for 1/y, given an estimate $x_i$ of 1/y, a better estimate $x_{i+1}$ is given by: $x_{i+1} = x_i (2 - y x_i)$
- Split the number into exponent (plus sign) and mantissa. Use one LUT to calculate the reciprocal of the exponent, and a second LUT to estimate the reciprocal of the mantissa. Apply Newton-Raphson twice to refine the reciprocal of the mantissa (giving more than 23 bits), and finally multiply the resulting mantissa by the exponent+sign term.
[Figure: the 9-bit exponent+sign field indexes a 1/X LUT (512 entries); the 8 most significant mantissa bits index a 1/X LUT (256 entries); the mantissa estimate passes through Newton-Raphson and is multiplied by the exponent term]
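
The scheme above can be sketched in Python for IEEE-754 single-precision inputs. This is an illustration under assumptions, not the FPCA implementation: the names `EXP_LUT`, `MANT_LUT`, `reciprocal` and `to_f32` are hypothetical, the mantissa table holds bucket-midpoint values, zeros, denormals, infinities and NaNs are not handled, and the iteration runs in double precision for simplicity.

```python
import math
import struct


def to_f32(y):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", y))[0]


# 512-entry table indexed by the 9-bit sign+exponent field:
# reciprocal of the signed power of two (normal numbers only).
EXP_LUT = []
for idx in range(512):
    sign = -1.0 if idx >> 8 else 1.0
    biased = idx & 0xFF
    if biased in (0, 255):
        EXP_LUT.append(math.nan)  # zero/denormal/inf/NaN not handled in this sketch
    else:
        EXP_LUT.append(sign * math.ldexp(1.0, 127 - biased))

# 256-entry table indexed by the top 8 mantissa bits: estimate of 1/m, m in [1, 2).
MANT_LUT = [1.0 / (1.0 + (i + 0.5) / 256.0) for i in range(256)]


def reciprocal(y):
    """Approximate 1/y for a normal single-precision y."""
    bits = struct.unpack("<I", struct.pack("<f", y))[0]  # integer view of the float
    exp_index = bits >> 23  # sign + exponent (9 bits)
    frac = bits & 0x7FFFFF  # 23-bit fraction
    mant = 1.0 + frac / float(1 << 23)  # full mantissa in [1, 2)
    x = MANT_LUT[frac >> 15]  # initial estimate from the top 8 bits
    for _ in range(2):
        x = x * (2.0 - mant * x)  # Newton-Raphson step for 1/mant
    return EXP_LUT[exp_index] * x


print(reciprocal(3.0), 1.0 / 3.0)
```

The table estimate is good to roughly 9 bits, and each Newton-Raphson step about doubles the number of correct bits, which is why two steps suffice for a 23-bit mantissa.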

Reciprocal Calculation (Implementation)
[Figure: FPCA datapath for the reciprocal, drawn with LUT, ALU, MALU, MUL, MEM, SSC and delay (DDE) elements; signal labels include exponent_s, exponentLUT, mantissa_s, mantissaFull, mantissaF, x0, x1, x2 and recip]

Comparison to other Architectures

Architecture          Cornerturn (GBytes/s)
PPC-G4                0.3
Xeon                  0.4
RAW-16                1.2
RAW-64                1.2
MONARCH 1 DDR +BCB    0.9
MONARCH 2 DDR -BCB    2.3

Architecture          CFAR (GFLOP/s)   QR (GFLOP/s)
PPC-G4                0.2              0.6
Xeon                  1.1              4.2
RAW-16                0.8 (only one value survives; its column is unclear)
RAW-64                3.1              9.0
MONARCH 1 DDR +BCB    7.5              5.5
MONARCH 2 DDR -BCB    11.0             10.9

RAW-64 performance projected

Conclusion
- Several interesting HPEC benchmarks successfully implemented on MONARCH
- Performance very competitive with other published HPEC results
- Benchmarks all bandwidth limited
  - Partitioning focuses on optimizing data movement
  - Buffer data in EDRAM to ensure sequential DDR accesses
  - Select an algorithm which is "bandwidth friendly"
  - "It's the data movement, stupid!"
- Reciprocal/square root readily synthesized from existing FPCA resources
  - Demonstrates flexibility of FPCA
- Working around errata of current chip added challenge!
This work was supported by the NRO