High Performance Embedded Computing Software Initiative (HPEC-SI)
Dr. Jeremy Kepner / Lincoln Laboratory
MITRE / MIT Lincoln Laboratory / AFRL — www.hpec-si.org
This work is sponsored by the Department of Defense under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Introduction
  – Goals
  – Program Structure
• Demonstration
• Development
• Applied Research
• Future Challenges
• Summary
Overview – High Performance Embedded Computing (HPEC) Software Initiative
Challenge: Transition advanced software technology and practices into major defense acquisition programs.
[Diagram: DARPA-funded Applied Research feeds Development, which feeds Demonstration programs such as the Common Imagery Processor (CIP) for ASARS-2 and the Enhanced Tactical Radar Correlator (ETRAC), spanning shared-memory servers and embedded multiprocessors]
Why Is DoD Concerned with Embedded Software?
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B); source: "HPEC Market Study," March 2001]
• COTS acquisition practices have shifted the burden from "point design" hardware to "point design" software
• Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards
Issues with Current HPEC Development: Inadequacy of Software Practices & Standards
• High Performance Embedded Computing is pervasive throughout DoD applications: Predator, U-2, Global Hawk, MK-48 Torpedo, JSTARS, MSAT-Air, Rivet Joint, Standard Missile, F-16, NSSN, AEGIS, P-3/APS-137
• Airborne radar insertion program: 85% software rewrite for each hardware platform
• Missile common processor: processor board costs < $100K, while software development costs > $100M
• Torpedo upgrade: two software rewrites required after changes in hardware design
• System development/acquisition proceeds in roughly 4-year stages (system technology development, system field demonstration, engineering/manufacturing development, insertion to military asset), while the signal processor evolves through many generations (1st through 6th gen.)
Today, embedded software is:
• Not portable
• Not scalable
• Difficult to develop
• Expensive to maintain
Evolution of Software Support Towards "Write Once, Run Anywhere/Anysize"
[Timeline 1990 → 2000 → 2005, contrasting COTS development with DoD software development: applications move from sitting directly on vendor software to middleware layered on embedded standards]
• Application software has traditionally been tied to the hardware
• Many acquisition programs are developing stove-piped middleware "standards"
• Open software standards can provide portability, performance, and productivity benefits
• Support "Write Once, Run Anywhere/Anysize"
Quantitative Goals & Impact
Program goals:
• Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
• Engage the acquisition community to promote technology insertion
• Deliver quantifiable benefits
[Diagram: the HPEC Software Initiative cycle of Demonstrate → Develop → Prototype, moving from object-oriented technology to open standards toward interoperable & scalable systems, with goals of Portability (3x), Productivity (3x), and Performance (1.5x)]
Definitions:
• Portability: reduction in lines-of-code changed to port/scale to a new system
• Productivity: reduction in overall lines-of-code
• Performance: computation and communication benchmarks
Organization
Technical Advisory Board: Dr. Rich Linderman (AFRL), Dr. Richard Games (MITRE), Mr. John Grosh (OSD), Mr. Bob Graybill (DARPA/ITO), Dr. Keith Bromley (SPAWAR), Dr. Mark Richards (GTRI), Dr. Jeremy Kepner (MIT/LL)
Executive Committee: Dr. Charles Holland, PADUSD(S&T), ...
Government Lead: Dr. Rich Linderman (AFRL)
Demonstration: Dr. Keith Bromley (SPAWAR), Mr. Brian Sroka (MITRE), Mr. Ron Williams (MITRE), ...
Development: Dr. Richard Games (MITRE), Dr. James Lebak (MIT/LL), Dr. Mark Richards (GTRI), Mr. Dan Campbell (GTRI), Mr. Ken Cain (Mercury), Mr. Randy Judd (SPAWAR), ...
Applied Research: Dr. Jeremy Kepner (MIT/LL), Mr. Bob Bond (MIT/LL), Mr. Ken Flowers (Mercury), Dr. Spaanenburg (Pentum), Mr. Dennis Cottel (SPAWAR), Capt. Bergmann (AFRL), Dr. Tony Skjellum (MPI Software Technology), ...
Advanced Research: Mr. Bob Graybill (DARPA)
• Partnership with ODUSD(S&T), government labs, FFRDCs, universities, contractors, vendors, and DoD programs
• Over 100 participants from over 20 organizations
HPEC-SI Capability Phases (functionality vs. time)
• Phase 1 — Demonstration: Existing standards (VSIPL, MPI); Development: Object-oriented standards (VSIPL++); Applied Research: Unified computation/communication library (prototype)
  – Demonstrate insertions into fielded systems (CIP); high-level code abstraction (AEGIS); reduce code size 3x
• Phase 2 — Demonstration: Object-oriented standards (VSIPL++); Development: Unified computation/communication library (Parallel VSIPL++); Applied Research: Fault tolerance (prototype)
  – Demonstrate scalability
• Phase 3 — Demonstration: Unified computation/communication library (Parallel VSIPL++, the unified embedded computation/communication standard); Development: Fault tolerance; Applied Research: Hybrid architectures
  – Demonstrate 3x portability
Status:
• First demo successfully completed
• Second demo selected
• VSIPL++ v0.8 spec completed
• VSIPL++ v0.2 code available
• Parallel VSIPL++ v0.1 spec completed
• High performance C++ demonstrated
Outline
• Introduction
• Demonstration
  – Common Imagery Processor
  – AEGIS BMD (planned)
• Development
• Applied Research
• Future Challenges
• Summary
Common Imagery Processor – Demonstration Overview
• The Common Imagery Processor (CIP) is a cross-service component
• Sample list of CIP modes:
  – U-2 (ASARS-2, SYERS)
  – F/A-18 ATARS (EO/IR/APG-73)
  – LO HAE UAV (EO, SAR)
[Diagram: the CIP (38.5" rack) and its system manager serving TEG and TES, ETRAC, JSIPS-N and TIS, and JSIPS and CARS; CIP picture courtesy of Northrop Grumman Corporation]
Common Imagery Processor – Demonstration Overview
• Demonstrate standards-based, platform-independent CIP processing (ASARS-2)
• Assess performance of current COTS portability standards (MPI, VSIPL)
• Validate software development productivity of the emerging Data Reorganization Interface (DRI)
• Team: MITRE and Northrop Grumman
[Diagram: a single APG-73 SAR IF code base targeting shared-memory servers, embedded multicomputers, commodity clusters, and massively parallel processors]
• A single code base optimized for all high-performance architectures provides future flexibility
Software Ports
Embedded multicomputers:
• CSPI – 500 MHz PPC 7410 (vendor loan)
• Mercury – 500 MHz PPC 7410 (vendor loan)
• Sky – 333 MHz PPC 7400 (vendor loan)
• Sky – 500 MHz PPC 7410 (vendor loan)
Mainstream servers:
• HP/Compaq ES40LP – 833 MHz Alpha EV6 (CIP hardware)
• HP/Compaq ES40 – 500 MHz Alpha EV6 (CIP hardware)
• SGI Origin 2000 – 250 MHz R10k (CIP hardware)
• SGI Origin 3800 – 400 MHz R12k (ARL MSRC)
• IBM – 1.3 GHz Power4 (ARL MSRC)
• Generic Linux cluster
Portability: SLOC Comparison
[Chart: source lines of code for the sequential VSIPL, shared-memory VSIPL, and DRI VSIPL versions; roughly a 1% increase for the shared-memory version and a 5% increase for the DRI version relative to the sequential baseline]
Shared Memory / CIP Server versus Distributed Memory / Embedded Vendor
[Chart: latency versus number of processors, showing the latency requirement and the shared-memory scaling limit of this Alpha server]
• The application can now exploit many more processors, including embedded processors (3x form-factor advantage) and Linux clusters (3x cost advantage)
Form Factor Improvements
• Current configuration: IOP in a 6U VME chassis (9 slots potentially available) plus IFP1 and IFP2 on HP/Compaq ES40LP servers
• Possible configuration: the IOP could support 2 G4 IFPs – a 2x form-factor reduction
• A 6U VME chassis can support 5 G4 IFPs – a 2.5x increase in processing capability
HPEC-SI Goals vs. 1st Demo Achievements
• Portability (goal 3x): achieved – zero code changes required
• Productivity (goal 3x): achieved* – DRI code 6x smaller than the MPI version (*estimated)
• Performance (goal 1.5x): achieved – 2x reduction in cost or form factor
[Diagram: the HPEC Software Initiative Demonstrate → Develop → Prototype cycle, from object-oriented technology to open standards toward interoperable & scalable systems]
Definitions:
• Portability: reduction in lines-of-code changed to port/scale to a new system
• Productivity: reduction in overall lines-of-code
• Performance: computation and communication benchmarks
Outline
• Introduction
• Demonstration
• Development
  – Object Oriented (VSIPL++)
  – Parallel (||VSIPL++)
• Applied Research
• Future Challenges
• Summary
Emergence of Component Standards
[Diagram: a parallel embedded processor with a system controller, node controllers, processors P0–P3, consoles, and other computers]
• Control communication: CORBA, HP-CORBA
• Data communication: MPI, MPI/RT, DRI
• Computation: VSIPL++, ||VSIPL++
• The HPEC Software Initiative builds on completed research and existing standards and libraries
Definitions:
• VSIPL = Vector, Signal, and Image Processing Library
• ||VSIPL++ = Parallel Object-Oriented VSIPL
• MPI = Message Passing Interface; MPI/RT = MPI Real-Time
• DRI = Data Reorganization Interface
• CORBA = Common Object Request Broker Architecture; HP-CORBA = High Performance CORBA
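A minimal sketch of the data-communication layer these standards occupy: the generic MPI point-to-point exchange below (standard MPI calls only, not code from the CIP demonstration) is the kind of node-to-node data movement that MPI, MPI/RT, and DRI standardize beneath the computation libraries.

// Minimal MPI point-to-point exchange (generic illustration, C API from C++).
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<float> buf(1024);
    if (rank == 0 && size > 1) {
        // Rank 0 produces a block of samples and sends it to rank 1.
        MPI_Send(buf.data(), (int)buf.size(), MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Rank 1 receives the block: the layer the data-communication standards cover.
        MPI_Recv(buf.data(), (int)buf.size(), MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %zu samples\n", buf.size());
    }
    MPI_Finalize();
    return 0;
}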
VSIPL++ Productivity Examples
BLAS zherk routine:
• BLAS = Basic Linear Algebra Subprograms
• Hermitian matrix M: conjug(M) = M^t
• zherk performs a rank-k update of Hermitian matrix C: C ← a·A·conjug(A)^t + b·C
VSIPL (C) code:
A = vsip_cmcreate_d(10, 15, VSIP_ROW, MEM_NONE);
C = vsip_cmcreate_d(10, 10, VSIP_ROW, MEM_NONE);
tmp = vsip_cmcreate_d(10, 10, VSIP_ROW, MEM_NONE);
vsip_cmprodh_d(A, A, tmp);        /* A*conjug(A)t */
vsip_rscmmul_d(alpha, tmp, tmp);  /* a*A*conjug(A)t */
vsip_rscmmul_d(beta, C, C);       /* b*C */
vsip_cmadd_d(tmp, C, C);          /* a*A*conjug(A)t + b*C */
vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
vsip_cblockdestroy(vsip_cmdestroy_d(C));
vsip_cblockdestroy(vsip_cmdestroy_d(A));
VSIPL++ code (also parallel):
Matrix<complex<double> > A(10, 15);
Matrix<complex<double> > C(10, 10);
C = alpha * prodh(A, A) + beta * C;
Sonar example:
• K-W beamformer
• Converted C VSIPL to VSIPL++
• 2.5x fewer SLOCs
PVL PowerPC AltiVec Experiments: Results
• Hand-coded loop achieves good performance, but is problem-specific and low level
• Optimized VSIPL performs well for simple expressions, worse for more complex expressions
• PETE-style array operators perform almost as well as the hand-coded loop and are general, composable, and high-level
[Benchmark expressions: A = B + C*D + E*F and A = B + C*D + E/F]
Software technologies compared:
• AltiVec loop: C for loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment
• VSIPL (vendor-optimized): C; AltiVec-aware VSIPro Core Lite (www.mpi-softtech.com); no multiply-add; cannot assume unit stride; cannot assume vector alignment
• PETE with AltiVec: C++ PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment
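A minimal sketch of the expression-template technique behind PETE (Portable Expression Template Engine); this is illustrative code, not the PETE or VSIPL++ source. Operator overloads build lightweight proxy objects instead of temporaries, and the whole expression is evaluated in one fused loop at assignment time, which is why such operators can approach hand-coded loop performance.

// Minimal expression-template sketch (illustrative only).
#include <cstddef>
#include <vector>

struct Add { static float apply(float a, float b) { return a + b; } };
struct Mul { static float apply(float a, float b) { return a * b; } };

template<class L, class Op, class R>
struct Expr {                       // proxy node: evaluates element i on demand
    const L& l; const R& r;
    Expr(const L& l_, const R& r_) : l(l_), r(r_) {}
    float operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.f) : data(n, v) {}
    float  operator[](std::size_t i) const { return data[i]; }
    float& operator[](std::size_t i)       { return data[i]; }
    template<class E>
    Vec& operator=(const E& e) {    // single loop: no temporaries per operator
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

template<class L, class R> Expr<L, Add, R> operator+(const L& l, const R& r) { return Expr<L, Add, R>(l, r); }
template<class L, class R> Expr<L, Mul, R> operator*(const L& l, const R& r) { return Expr<L, Mul, R>(l, r); }

int main() {
    Vec A(1024), B(1024, 1.f), C(1024, 2.f), D(1024, 3.f);
    A = B + C * D;                  // builds an Expr tree, evaluated in one pass
    return 0;
}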
Parallel Pipeline Mapping
Signal processing algorithm stages:
• Filter: XOUT = FIR(XIN)
• Beamform: XOUT = w*XIN
• Detect: XOUT = |XIN| > c
Mapping onto the parallel computer:
• Data parallel within stages
• Task/pipeline parallel across stages
Scalable Approach
#include <Vector.h>
#include <AddPvl.h>

void addVectors(const Map& aMap, const Map& bMap, const Map& cMap) {
  Vector< Complex<Float> > a('a', aMap, LENGTH);
  Vector< Complex<Float> > b('b', bMap, LENGTH);
  Vector< Complex<Float> > c('c', cMap, LENGTH);

  b = 1;
  c = 2;
  a = b + c;   // A = B + C, under a single-processor or multi-processor mapping
}

Lincoln Parallel Vector Library (PVL):
• Single-processor and multi-processor code are the same
• Maps can be changed without changing the software
• High-level code is compact
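A self-contained sketch of the "change the map, not the code" idea (not PVL itself; the Map struct and localRange helper are invented for illustration): the algorithm only ever sees the local piece of data that a map assigns to it, so the same function runs serially or in parallel depending on the map it is handed.

// Illustrative map-independent computation (hypothetical Map, not the PVL class).
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Map {                        // which ranks own the vector, block-distributed
    std::vector<int> procs;
};

// Returns the [begin, end) range of global indices owned by `rank` under `map`.
static std::pair<std::size_t, std::size_t>
localRange(const Map& map, int rank, std::size_t globalLen) {
    std::size_t p = map.procs.size();
    for (std::size_t i = 0; i < p; ++i) {
        if (map.procs[i] == rank) {
            std::size_t chunk = (globalLen + p - 1) / p;
            std::size_t begin = i * chunk;
            return {begin, std::min(globalLen, begin + chunk)};
        }
    }
    return {0, 0};                  // this rank owns nothing under this map
}

// The "algorithm": identical for serial and parallel runs; only `map` changes.
void addVectors(const Map& map, int myRank, std::size_t len) {
    auto [b, e] = localRange(map, myRank, len);
    std::vector<float> a(e - b), x(e - b, 1.f), y(e - b, 2.f);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = x[i] + y[i];
    std::printf("rank %d computed %zu elements\n", myRank, a.size());
}

int main() {
    Map serial{{0}};                // everything on rank 0
    Map parallel{{0, 1, 2, 3}};     // block-distributed over 4 ranks
    addVectors(serial, 0, 1024);    // serial mapping
    addVectors(parallel, 0, 1024);  // same code, parallel mapping (rank 0's share)
    return 0;
}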
Outline
• Introduction
• Demonstration
• Development
• Applied Research
  – Fault Tolerance
  – Parallel Specification
  – Hybrid Architectures (see SBR)
• Future Challenges
• Summary
Dynamic Mapping for Fault Tolerance
[Diagram: an input task and an output task on a parallel processor with a spare node; on failure, the computation moves from Map 1 (nodes 0, 2) to Map 2 (nodes 1, 3), while Map 0 (nodes 0, 1) carries the input]
• Switching processors is accomplished by switching maps
• No change to the algorithm is required
• Developing requirements for ||VSIPL++
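A minimal sketch of the map-switching idea, assuming a hypothetical Map/Task pair (this is not the ||VSIPL++ API, whose requirements were still being developed): the processing loop never changes; on a detected node failure the active map is simply replaced by one that uses the spare node.

// Illustrative fault tolerance via dynamic maps (hypothetical types).
#include <cstdio>
#include <vector>

struct Map { std::vector<int> nodes; };            // nodes a task runs on

struct Task {
    const Map* active;                             // current mapping
    void process(int frame) const {
        // Placeholder for the real computation; prints which nodes would run it.
        std::printf("frame %d on nodes", frame);
        for (int n : active->nodes) std::printf(" %d", n);
        std::printf("\n");
    }
};

int main() {
    Map map1{{0, 2}};                              // primary mapping
    Map map2{{1, 3}};                              // alternate mapping using the spare node
    Task output{&map1};

    for (int frame = 0; frame < 4; ++frame) {
        bool nodeFailed = (frame == 2);            // simulated failure detection
        if (nodeFailed) output.active = &map2;     // switch maps, not the algorithm
        output.process(frame);
    }
    return 0;
}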
Parallel Specification: Clutter Calculation (Linux Cluster)
[Plot: parallel performance – speedup versus number of processors]
pMatlab code:
% Initialize
pMATLAB_Init; Ncpus=comm_vars.comm_size;

% Map X to first half of the processors and Y to the second half.
mapX=map([1 Ncpus/2], {}, [1:Ncpus/2]);
mapY=map([Ncpus/2 1], {}, [Ncpus/2+1:Ncpus]);

% Create arrays.
X = complex(rand(N, M, mapX), rand(N, M, mapX));
Y = complex(zeros(N, M, mapY));

% Initialize coefficients
coefs = ...
weights = ...

% Parallel filter + corner turn.
Y(:,:) = conv2(coefs, X);
% Parallel matrix multiply.
Y(:,:) = weights*Y;

% Finalize pMATLAB and exit.
pMATLAB_Finalize; exit;

• MATLAB is the main specification language for signal processing
• pMatlab allows parallel specifications using the same mapping constructs being developed for ||VSIPL++
Outline
• Introduction
• Demonstration
• Development
• Applied Research
• Future Challenges
• Summary
Optimal Mapping of Complex Algorithms
[Diagram: an application pipeline – Input → Low-Pass Filter (XIN → FIR1 (W1) → FIR2 (W2) → XOUT) → Beamform (XIN → mult (W3) → XOUT) → Matched Filter (XIN → FFT → mult (W4) → IFFT → XOUT) – mapped onto different hardware (workstation, Intel cluster, PowerPC cluster, embedded board, embedded multicomputer); each target has a different optimal map]
• Need to automate the process of mapping the algorithm to the hardware
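A toy sketch of what automated mapping involves, under assumptions of my own (the three-stage workloads, the cost model, and the exhaustive search are illustrative, not an HPEC-SI tool): enumerate candidate processor allocations across the pipeline stages, score each with a performance model, and keep the best. Real map optimization would use measured computation and communication benchmarks in place of this toy model.

// Illustrative automated map selection for a three-stage pipeline.
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy cost model: stage time = work / processors + per-processor communication overhead.
static double stageTime(double work, int procs, double overhead) {
    return work / procs + overhead * procs;
}

int main() {
    const std::vector<double> work = {4.0, 1.0, 8.0};  // filter, beamform, matched filter (assumed)
    const double overhead = 0.05;                      // assumed communication cost per processor
    const int totalProcs = 8;

    double bestTime = 1e300;
    std::vector<int> bestAlloc;

    // Enumerate all ways to split totalProcs across the three pipeline stages.
    for (int p0 = 1; p0 <= totalProcs - 2; ++p0)
        for (int p1 = 1; p1 <= totalProcs - p0 - 1; ++p1) {
            int p2 = totalProcs - p0 - p1;
            // Pipeline throughput is limited by the slowest stage.
            double t = std::max({stageTime(work[0], p0, overhead),
                                 stageTime(work[1], p1, overhead),
                                 stageTime(work[2], p2, overhead)});
            if (t < bestTime) { bestTime = t; bestAlloc = {p0, p1, p2}; }
        }

    std::printf("best allocation: %d/%d/%d processors, stage time %.3f\n",
                bestAlloc[0], bestAlloc[1], bestAlloc[2], bestTime);
    return 0;
}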
HPEC-SI Future Challenges (functionality vs. time, beyond the end of the 5-year plan)
• Phase 3 — Demonstration: Unified comp/comm library (Parallel VSIPL++, the unified comp/comm standard); Development: Fault Tolerant VSIPL (prototype); Applied Research: Hybrid architectures
  – Demonstrate scalability
• Phase 4 — Demonstration: Fault tolerance (FT VSIPL); Development: Hybrid architectures (prototype Hybrid VSIPL); Applied Research: PCA / self-optimization
  – Demonstrate fault tolerance; increased reliability
• Phase 5 — Demonstration: Hybrid architectures (Hybrid VSIPL); Applied Research: Higher-level languages (Java?)
  – Portability across architectures; RISC/FPGA transparency; self-optimization
Summary
• The HPEC-SI program is on track toward changing software practice in DoD HPEC signal and image processing
  – Outside funding obtained for DoD program-specific activities (on top of the core HPEC-SI effort)
  – 1st demo completed; 2nd selected
  – World's first parallel, object-oriented standard
  – Applied research into task/pipeline parallelism, fault tolerance, and parallel specification
• Keys to success
  – Program office support: a 5-year time horizon is a better match to DoD program development
  – Quantitative goals for portability, productivity, and performance
  – Engineering community support
Web Links
High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
Vector, Signal, and Image Processing Library: http://www.vsipl.org/
MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
Data Reorganization Initiative: http://www.data-re.org/
CodeSourcery, LLC: http://www.codesourcery.com/
MatlabMPI: http://www.ll.mit.edu/MatlabMPI