High Performance Embedded Computing Software Initiative (HPEC-SI)
Dr. Jeremy Kepner, MIT Lincoln Laboratory
www.hpec-si.org
This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Introduction (DoD need, program structure)
• Software Standards
• Parallel VSIPL++
• Future Challenges
• Summary
Overview: High Performance Embedded Computing (HPEC) Initiative
Challenge: transition advanced software technology and practices into major defense acquisition programs.
[Diagram: the HPEC Software Initiative carries DARPA applied research through development into demonstration programs such as the Common Imagery Processor (CIP), the Enhanced Tactical Radar Correlator (ETRAC), and ASARS-2, on hardware ranging from shared-memory servers to embedded multiprocessors.]
Why Is DoD Concerned with Embedded Software?
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B). Source: "HPEC Market Study," March 2001.]
• COTS acquisition practices have shifted the burden from "point design" hardware to "point design" software.
• Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards.
Issues with Current HPEC Development: Inadequacy of Software Practices & Standards
• High Performance Embedded Computing is pervasive throughout DoD applications: Predator, U-2, Global Hawk, MK-48 Torpedo, JSTARS, MSAT-Air, Rivet Joint, Standard Missile, F-16, NSSN, AEGIS, P-3/APS-137.
– Airborne radar insertion program: 85% software rewrite for each hardware platform.
– Missile common processor: processor board costs < $100K, while software development costs > $100M.
– Torpedo upgrade: two software rewrites required after changes in hardware design.
[Timeline: system development/acquisition proceeds in roughly four-year stages, from technology development through field demonstration and engineering/manufacturing development to insertion in a military asset, while the signal processor evolves through six generations.]
Today, embedded software is:
• Not portable
• Not scalable
• Difficult to develop
• Expensive to maintain
Evolution of Software Support Towards "Write Once, Run Anywhere/Anysize"
• Application software has traditionally been tied to the hardware.
• Many acquisition programs are developing stove-piped middleware "standards."
• Open software standards can provide portability, performance, and productivity benefits.
• Support "Write Once, Run Anywhere/Anysize."
[Diagram: DoD software development tracks COTS development; from 1990 through 2000 to 2005, monolithic vendor software gives way to application/middleware layers built on embedded software standards.]
Overall Initiative Goals & Impact
Program goals:
• Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance.
• Engage the acquisition community to promote technology insertion.
• Deliver quantifiable benefits.
Metrics:
• Portability (3x goal): reduction in lines-of-code changed to port/scale to a new system.
• Productivity (3x goal): reduction in overall lines-of-code.
• Performance (1.5x goal): computation and communication benchmarks.
[Diagram: the HPEC Software Initiative advances from demonstration through development to prototypes, moving from open standards to object-oriented, interoperable, and scalable software.]
HPEC-SI Path to Success
The HPEC Software Initiative builds on proven technology, business models, and better software practices.
Benefit to DoD programs:
• Reduces software cost & schedule
• Enables rapid COTS insertion
• Improves cross-program interoperability
• Basis for improved capabilities
Benefit to DoD contractors:
• Reduces software complexity & risk
• Easier comparisons / more competition
• Increased functionality
Benefit to embedded vendors:
• Lower software barrier to entry
• Reduced software maintenance costs
• Evolution of open standards
Organization
Technical Advisory Board: Dr. Rich Linderman (AFRL), Dr. Richard Games (MITRE), Mr. John Grosh (OSD), Mr. Bob Graybill (DARPA/ITO), Dr. Keith Bromley (SPAWAR), Dr. Mark Richards (GTRI), Dr. Jeremy Kepner (MIT/LL)
Executive Committee: Dr. Charles Holland (PADUSD(S&T)), RADM Paul Sullivan (N77)
Government Lead: Dr. Rich Linderman (AFRL)
Demonstration: Dr. Keith Bromley (SPAWAR), Mr. Brian Sroka (MITRE), Mr. Ron Williams (MITRE), ...
Development: Dr. Richard Games (MITRE), Dr. James Lebak (MIT/LL), Dr. Mark Richards (GTRI), Mr. Dan Campbell (GTRI), Mr. Ken Cain (Mercury), Mr. Randy Judd (SPAWAR), ...
Applied Research: Dr. Jeremy Kepner (MIT/LL), Mr. Bob Bond (MIT/LL), Mr. Ken Flowers (Mercury), Dr. Spaanenburg (Pentum), Mr. Dennis Cottel (SPAWAR), Capt. Bergmann (AFRL), Dr. Tony Skjellum (MPI Software Technology), ...
Advanced Research: Mr. Bob Graybill (DARPA)
• Partnership with ODUSD(S&T), government labs, FFRDCs, universities, contractors, vendors, and DoD programs
• Over 100 participants from over 20 organizations
Outline
• Introduction
• Software Standards (standards overview, future standards)
• Parallel VSIPL++
• Future Challenges
• Summary
Emergence of Component Standards
[Diagram: a parallel embedded processor (system controller, node controllers, processors P0-P3) connected to consoles and other computers.]
• Data communication: MPI, MPI/RT, DRI
• Control communication: CORBA, HP-CORBA
• Computation: VSIPL++, ||VSIPL++
The HPEC Initiative builds on completed research and existing standards and libraries.
Definitions:
• VSIPL = Vector, Signal, and Image Processing Library
• ||VSIPL++ = Parallel Object-Oriented VSIPL
• MPI = Message Passing Interface
• MPI/RT = MPI Real-Time
• DRI = Data Reorganization Interface
• CORBA = Common Object Request Broker Architecture
• HP-CORBA = High Performance CORBA
The Path to Parallel VSIPL++ (world's first parallel object-oriented standard)
• Phase 1: demonstrate existing standards (VSIPL, MPI); develop object-oriented standards; applied research on a unified computation/communication library. Goals: demonstrate insertions into fielded systems (e.g., CIP) and demonstrate 3x portability.
• Phase 2: demonstrate object-oriented standards (VSIPL++); develop the unified computation/communication library; applied research on fault tolerance. Goals: high-level code abstraction and 3x reduction in code size.
• Phase 3: demonstrate the unified computation/communication library (Parallel VSIPL++); develop fault tolerance; applied research on self-optimization. Goal: a unified embedded computation/communication standard with demonstrated scalability.
Status:
• First demo successfully completed
• VSIPL++ v0.5 spec completed
• VSIPL++ v0.1 code available
• Parallel VSIPL++ spec in progress
• High performance C++ demonstrated
Working Group Technical Scope
Development: VSIPL++
• Mapping (data parallelism)
• Early binding (computations)
• Compatibility (backward/forward)
• Local knowledge (accessing local data)
• Extensibility (adding new functions)
• Remote procedure calls (CORBA)
• C++ compiler support
• Test suite
• Adoption incentives (vendor, integrator)
Applied Research: Parallel VSIPL++
• Mapping (task/pipeline parallelism)
• Reconfiguration (for fault tolerance)
• Threads
• Reliability/availability
• Data permutation (DRI functionality)
• Tools (profilers, timers, ...)
• Quality of service
Overall Technical Tasks and Schedule
[Gantt chart, FY01 through FY08, with near-, mid-, and long-term phases moving from applied research through development to demonstration:]
• VSIPL (Vector, Signal, and Image Processing Library): CIP demonstrations in the near term
• MPI (Message Passing Interface): CIP demonstrations in the near term; Demo 2
• VSIPL++ (object oriented): v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code; Demo 3
• Parallel VSIPL++: v0.1 spec, v0.1 code, v0.5 spec & code, v1.0 spec & code; Demos 4, 5, 6
• Fault-tolerant / self-optimizing software: long term
HPEC-SI Goals: 1st Demo Achievements
• Portability (goal 3x, achieved 10x+): zero code changes required
• Productivity (goal 3x, achieved 6x*): DRI code 6x smaller vs MPI (*estimated)
• Performance (goal 1.5x, achieved 2x): 3x reduced cost or form factor
[Diagram: the same demonstrate/develop/prototype progression from open standards to object-oriented, interoperable, and scalable software.]
Outline
• Introduction
• Software Standards
• Parallel VSIPL++ (technical basis, examples)
• Future Challenges
• Summary
Parallel Pipeline Signal Processing
Algorithm stages:
• Filter: XOUT = FIR(XIN)
• Beamform: XOUT = w * XIN
• Detect: XOUT = |XIN| > c
The stages are mapped onto a parallel computer: data parallel within stages, task/pipeline parallel across stages. (A scalar sketch of the three stages follows.)
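As an illustration of what each stage computes, here is a minimal single-threaded sketch assuming plain std::vector storage; actual HPEC-SI code would use VSIPL++/PVL distributed objects and maps rather than these hand-rolled loops.

#include <cmath>
#include <complex>
#include <vector>

using cvec = std::vector<std::complex<float>>;

// Filter: XOUT = FIR(XIN), a direct-form FIR over the input samples
cvec fir(const cvec& x, const cvec& taps) {
    cvec y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < taps.size() && k <= n; ++k)
            y[n] += taps[k] * x[n - k];
    return y;
}

// Beamform: XOUT = w * XIN, an element-wise weighting
cvec beamform(const cvec& x, const cvec& w) {
    cvec y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n)
        y[n] = w[n] * x[n];
    return y;
}

// Detect: XOUT = |XIN| > c, a magnitude threshold
std::vector<bool> detect(const cvec& x, float c) {
    std::vector<bool> hits(x.size());
    for (std::size_t n = 0; n < x.size(); ++n)
        hits[n] = std::abs(x[n]) > c;
    return hits;
}

In the parallel version the stage logic is unchanged; the vectors simply carry maps that distribute their data within a stage and place different stages on different processor sets.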
Types of Parallelism
[Diagram: an input stream feeds a scheduler (task parallelism); FIR filters form a pipeline whose output is distributed round-robin to data-parallel pairs of beamformers (Beamformer 1, 2) and detectors (Detector 1, 2).]
Current Approach to Parallel Code
Algorithm + mapping: Stage 1 on processors 1-2; Stage 2 on processors 3-4, later scaled to processors 3-6.

// Stage 2 on two processors
while (!done) {
  if (rank() == 1 || rank() == 2)
    stage1();
  else if (rank() == 3 || rank() == 4)
    stage2();
}

// Scaling Stage 2 to four processors requires rewriting the code
while (!done) {
  if (rank() == 1 || rank() == 2)
    stage1();
  else if (rank() == 3 || rank() == 4 || rank() == 5 || rank() == 6)
    stage2();
}

• Algorithm and hardware mapping are linked
• Resulting code is non-scalable and non-portable
Scalable Approach
Single-processor and multi-processor mappings of A = B + C run the same code:

#include <Vector.h>
#include <AddPvl.h>

void addVectors(const Map& aMap, const Map& bMap, const Map& cMap) {
  Vector< Complex<Float> > a('a', aMap, LENGTH);
  Vector< Complex<Float> > b('b', bMap, LENGTH);
  Vector< Complex<Float> > c('c', cMap, LENGTH);

  b = 1;
  c = 2;
  a = b + c;   // A = B + C, distributed according to the maps
}

Parallel Vector Library (PVL):
• Single-processor and multi-processor code are the same
• Maps can be changed without changing software (a hypothetical usage sketch follows)
• High-level code is compact
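The map arguments are what carry the parallelism. As a usage sketch (the Map constructor shown here is an assumption for illustration, not the actual PVL API), the same routine could be driven serially or in parallel:

// Hypothetical maps; real PVL maps also describe how data is distributed
Map serialMap({0});             // all data on processor 0
Map parallelMap({0, 1, 2, 3});  // data spread over processors 0-3

addVectors(serialMap, serialMap, serialMap);        // single-processor run
addVectors(parallelMap, parallelMap, parallelMap);  // four-processor run, same code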
C++ Expression Templates and PETE
How an expression such as A = B + C is evaluated with expression templates:
1. B and C references are passed to operator+.
2. operator+ creates an expression parse tree (a "+" node holding B& and C&) instead of computing a temporary vector.
3. The expression parse tree is returned; its type encodes the whole computation, e.g. BinaryNode<OpAssign, Vector, BinaryNode<OpAdd, Vector, BinaryNode<OpMultiply, Vector, Vector>>> for A = B + C * D.
4. A reference to the expression tree is passed to operator=.
5. operator= calculates the result and performs the assignment.
Parse trees, not vectors, are created and copied.
• Expression templates enhance performance by allowing temporary variables to be avoided. (A minimal sketch of the technique follows.)
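A minimal sketch of the expression-template technique, assuming a simple double-valued vector; PETE's real node types, operator set, and traits machinery are far richer:

#include <cstddef>
#include <vector>

// CRTP base marks types that participate in vector expressions
template <typename E>
struct VecExpr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Parse-tree node: holds references to operands, evaluates lazily
template <typename L, typename R>
struct AddNode : VecExpr<AddNode<L, R>> {
    const L& l; const R& r;
    AddNode(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec : VecExpr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // operator= walks the parse tree once: one loop, no temporaries
    template <typename E>
    Vec& operator=(const VecExpr<E>& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];
        return *this;
    }
};

// operator+ builds a parse-tree node instead of computing a result
template <typename L, typename R>
AddNode<L, R> operator+(const VecExpr<L>& a, const VecExpr<R>& b) {
    return AddNode<L, R>(a.self(), b.self());
}

With these definitions, a = b + c + d compiles down to a single element-wise loop: each operator+ builds a node type at compile time, and operator= walks the tree once, so no intermediate vector is ever allocated.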
PETE Linux Cluster Experiments
[Plots: relative execution time vs vector length (8 to ~131,072) for expressions of increasing complexity, including A = B + C * D and A = B + C * D / E + fft(F).]
• PVL with VSIPL has a small overhead
• PVL with PETE can surpass VSIPL
PowerPC AltiVec Experiments
Expressions tested: A = B + C * D + E * F and A = B + C * D + E / F.
Results:
• The hand-coded loop achieves good performance, but is problem-specific and low-level.
• Optimized VSIPL performs well for simple expressions, worse for more complex expressions.
• PETE-style array operators perform almost as well as the hand-coded loop, and are general, composable, and high-level.
Software technologies compared (the two coding styles are contrasted in the sketch below):
• AltiVec loop: C for-loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment.
• VSIPL (vendor optimized): C; AltiVec-aware VSIPro Core Lite (www.mpi-softtech.com); no multiply-add; cannot assume unit stride; cannot assume vector alignment.
• PETE with AltiVec: C++ PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment.
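To make the comparison concrete, here is the shape of the two approaches for A = B + C * D + E * F (plain C arrays shown in place of explicit AltiVec intrinsics; the benchmarked loop used the vector extensions directly):

// Hand-coded loop: fast, but low-level and rewritten per expression
for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i] * d[i] + e[i] * f[i];

// PETE-style operators generate an equivalent fused loop from the
// single high-level statement:
//     A = B + C * D + E * F;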
Outline
• Introduction
• Software Standards
• Parallel VSIPL++ (technical basis, examples)
• Future Challenges
• Summary
Consider the expression A = sin(A) + 2 * B;
• Generated code (no temporaries):

for (index i = 0; i < A.size(); ++i)
  A.put(i, sin(A.get(i)) + 2 * B.get(i));

• Apply inlining to transform to:

for (index i = 0; i < A.size(); ++i)
  Ablock[i] = sin(Ablock[i]) + 2 * Bblock[i];

• Apply more inlining to transform to:

T* Bp = &(Bblock[0]);
T* Aend = &(Ablock[A.size()]);
for (T* Ap = &(Ablock[0]); Ap < Aend; ++Ap, ++Bp)
  *Ap = fmadd(2, *Bp, sin(*Ap));

• Or apply PowerPC AltiVec extensions.
• Each step can be automatically generated.
• Optimization level: whatever the vendor desires.
BLAS zherk Routine
• BLAS = Basic Linear Algebra Subprograms
• Hermitian matrix M: conjug(M) = M^t
• zherk performs a rank-k update of Hermitian matrix C: C <- a * A * conjug(A)^t + b * C

VSIPL code:

A = vsip_cmcreate_d(10, 15, VSIP_ROW, MEM_NONE);
C = vsip_cmcreate_d(10, 10, VSIP_ROW, MEM_NONE);
tmp = vsip_cmcreate_d(10, 10, VSIP_ROW, MEM_NONE);
vsip_cmprodh_d(A, A, tmp);        /* A*conjug(A)t */
vsip_rscmmul_d(alpha, tmp, tmp);  /* a*A*conjug(A)t */
vsip_rscmmul_d(beta, C, C);       /* b*C */
vsip_cmadd_d(tmp, C, C);          /* a*A*conjug(A)t + b*C */
vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
vsip_cblockdestroy(vsip_cmdestroy_d(C));
vsip_cblockdestroy(vsip_cmdestroy_d(A));

VSIPL++ code (also parallel):

Matrix<complex<double> > A(10, 15);
Matrix<complex<double> > C(10, 10);
C = alpha * prodh(A, A) + beta * C;
Simple Filtering Application

int main ()
{
  using namespace vsip;
  const length ROWS = 64;
  const length COLS = 4096;
  vsipl v;
  FFT<Matrix, complex<double>, FORWARD, 0, MULTIPLE, alg_hint ()>
    forward_fft (Domain<2>(ROWS, COLS), 1.0);
  FFT<Matrix, complex<double>, INVERSE, 0, MULTIPLE, alg_hint ()>
    inverse_fft (Domain<2>(ROWS, COLS), 1.0);
  const Matrix<complex<double> > weights (load_weights (ROWS, COLS));
  try {
    while (1)
      output (inverse_fft (forward_fft (input ()) * weights));
  }
  catch (std::runtime_error) {
    // Successfully caught access outside domain.
  }
}
Explicit Parallel Filter

#include <vsiplpp.h>
using namespace VSIPL;
const int ROWS = 64;
const int COLS = 4096;

int main (int argc, char **argv)
{
  Matrix<Complex<Float>> W (ROWS, COLS, "WMap");  // weights matrix
  Matrix<Complex<Float>> X (ROWS, COLS, "XMap");  // input matrix
  Matrix<Complex<Float>> Y (ROWS, COLS, "YMap");  // output matrix (declaration and map names assumed)
  load_weights (W);
  try {
    while (1) {
      input (X);                      // some input function
      Y = IFFT (mul (FFT (X), W));
      output (Y);                     // some output function
    }
  }
  catch (Exception &e) { cerr << e << endl; }
}
Multi-Stage Filter (main)

using namespace vsip;
const length ROWS = 64;
const length COLS = 4096;

int main (int argc, char **argv)
{
  sample_low_pass_filter<complex<float> > LPF;
  sample_beamform<complex<float> > BF;
  sample_matched_filter<complex<float> > MF;
  try {
    while (1)
      output (MF (BF (LPF (input ()))));
  }
  catch (std::runtime_error) {
    // Successfully caught access outside domain.
  }
}
Multi-Stage Filter (low pass filter)

template<typename T>
class sample_low_pass_filter {
public:
  sample_low_pass_filter()
    : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
      FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH) {}

  Matrix<T> operator () (const Matrix<T>& Input)
  {
    Matrix<T> output (ROWS, COLS);
    for (index row = 0; row < ROWS; row++)
      output.row (row) = FIR2_(FIR1_(Input.row (row))).second;
    return output;
  }
private:
  FIR<T, SYMMETRIC_ODD, FIR1_DECIMATION, CONTINUOUS, alg_hint()> FIR1_;
  FIR<T, SYMMETRIC_ODD, FIR2_DECIMATION, CONTINUOUS, alg_hint()> FIR2_;
};
Multi-Stage Filter (beamformer)

template<typename T>
class sample_beamform {
public:
  sample_beamform() : W3_(load_w3 (ROWS, COLS)) {}

  Matrix<T> operator () (const Matrix<T>& Input) const
  {
    return W3_ * Input;
  }
private:
  const Matrix<T> W3_;
};
Multi-Stage Filter (matched filter)

template<typename T>
class sample_matched_filter {
public:
  sample_matched_filter()
    : W4_(load_w4 (ROWS, COLS)),
      forward_fft_ (Domain<2>(ROWS, COLS), 1.0),
      inverse_fft_ (Domain<2>(ROWS, COLS), 1.0) {}

  Matrix<T> operator () (const Matrix<T>& Input) const
  {
    return inverse_fft_ (forward_fft_ (Input) * W4_);
  }
private:
  const Matrix<T> W4_;
  FFT<Matrix<T>, complex<double>, FORWARD, 0, MULTIPLE, alg_hint()> forward_fft_;
  FFT<Matrix<T>, complex<double>, INVERSE, 0, MULTIPLE, alg_hint()> inverse_fft_;
};
Outline
• Introduction
• Software Standards
• Parallel VSIPL++
• Future Challenges (fault tolerance, self-optimization, high level languages)
• Summary
Dynamic Mapping for Fault Tolerance
[Diagram: on a parallel processor with a spare node, the input task holds XIN under Map0 (nodes 0, 1) and the output task holds XOUT under Map1 (nodes 0, 2); after a node failure, XOUT is rebound to Map2 (nodes 1, 3), which includes the spare.]
• Switching processors is accomplished by switching maps
• No change to the algorithm is required
(A schematic sketch of map switching follows.)
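A schematic sketch of the idea; the Map and DistVector types below are stand-ins for PVL/VSIPL++ objects, and failure detection is assumed to happen elsewhere:

#include <vector>

struct Map { std::vector<int> nodes; };  // which nodes hold the data

struct DistVector {
    Map map;                             // current mapping
    void remap(const Map& m) { map = m; /* ...redistribute data... */ }
};

int main() {
    Map map1{{0, 2}};  // normal mapping for the output task
    Map map2{{1, 3}};  // fallback mapping that uses the spare node

    DistVector xout;
    xout.remap(map1);

    bool node_failed = true;  // stand-in for a real failure detector
    if (node_failed)
        xout.remap(map2);     // switch maps; the algorithm code is unchanged
    return 0;
}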
Dynamic Mapping Performance Results
[Plot: relative time vs data size.]
• Good dynamic mapping performance is possible
Optimal Mapping of Complex Algorithms
Application: input XIN -> low pass filter (FIR1 with weights W1, FIR2 with weights W2) -> beamform (multiply by W3) -> matched filter (FFT, multiply by W4, IFFT) -> output XOUT.
Hardware: workstation, embedded board, embedded multicomputer, Intel cluster, PowerPC cluster; each has a different optimal map.
• Need to automate the process of mapping the algorithm to the hardware
Self-optimizing Software for Signal Processing (S3P)
[Plots: latency (seconds) and throughput (frames/sec) vs number of CPUs (4 to 8) for small (48 x 4K) and large (48 x 128K) problem sizes, across candidate stage mappings such as 1-1-1-1, 1-1-2-1, ..., 1-3-2-2.]
• Find Min(latency | #CPU) and Max(throughput | #CPU)
• S3P selects the correct optimal mapping
• Excellent agreement between S3P-predicted and achieved latencies and throughputs
(A brute-force sketch of this mapping search follows.)
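Conceptually, S3P searches the space of per-stage processor allocations under a CPU budget. A brute-force sketch of that selection (the latency model below is a toy assumption; S3P derives its predictions from measured benchmark timings):

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Toy model: pipeline latency is set by the slowest stage, and stage
// time shrinks roughly linearly with the CPUs assigned to it.
double predicted_latency(const std::vector<int>& cpus,
                         const std::vector<double>& stage_work) {
    double worst = 0.0;
    for (std::size_t s = 0; s < cpus.size(); ++s)
        worst = std::max(worst, stage_work[s] / cpus[s]);
    return worst;
}

// Enumerate all allocations of exactly `budget` CPUs over four stages
// and keep the mapping with minimum predicted latency.
std::vector<int> best_mapping(int budget, const std::vector<double>& work) {
    std::vector<int> best;
    double best_lat = std::numeric_limits<double>::max();
    for (int a = 1; a < budget; ++a)
      for (int b = 1; a + b < budget; ++b)
        for (int c = 1; a + b + c < budget; ++c) {
            int d = budget - a - b - c;           // d >= 1 by construction
            std::vector<int> m = {a, b, c, d};
            double lat = predicted_latency(m, work);
            if (lat < best_lat) { best_lat = lat; best = m; }
        }
    return best;  // e.g. a "1-1-2-1"-style mapping, as in the plots above
}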
High Level Languages
• The need for a parallel Matlab has been identified: HPCMO (OSU)
• The required user interface has been demonstrated: Matlab*P (MIT/LCS), PVL (MIT/LL)
• The required hardware interface has been demonstrated: MatlabMPI (MIT/LL)
[Diagram: high performance Matlab applications (DoD sensor processing, DoD mission planning, scientific simulation, commercial applications) sit above a user interface, the Parallel Matlab Toolbox, a hardware interface, and parallel computing hardware.]
• The Parallel Matlab Toolbox can now be realized
MatlabMPI Deployment (speedup)
• Maui: image filtering benchmark (300x on 304 CPUs)
• Lincoln: signal processing (7.8x on 8 CPUs); radar simulations (7.5x on 8 CPUs); hyperspectral (2.9x on 3 CPUs)
• MIT: LCS Beowulf (11x Gflops on 9 duals); AI Lab face recognition (10x on 8 duals)
• Other: Ohio St. EM simulations; ARL SAR image enhancement; Wash. U. hearing aid simulations; So. Ill. benchmarking; JHU digital beamforming; ISL radar simulation; URI heart modeling
[Plot: performance (Gigaflops) vs number of processors for image filtering on the IBM SP at the Maui computing center.]
• The rapidly growing MatlabMPI user base demonstrates the need for a parallel Matlab
• Demonstrated scaling to 300 processors
Summary
• HPEC-SI expected benefit: open software libraries, programming models, and standards that provide portability (3x), productivity (3x), and performance (1.5x) benefits to multiple DoD programs.
• Invitation to participate:
– DoD program offices with signal/image processing needs
– Academic and government researchers interested in high performance embedded computing
– Contact: KEPNER@LL.MIT.EDU
Links
• High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
• High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
• Vector, Signal, and Image Processing Library: http://www.vsipl.org/
• MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
• Data Reorganization Initiative: http://www.data-re.org/
• CodeSourcery, LLC: http://www.codesourcery.com/
• MatlabMPI: http://www.ll.mit.edu/MatlabMPI