A Look at Library Software for Linear Algebra

A Look at Library Software for Linear Algebra: Past, Present, and Future
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

1960's
- 1961
  – IBM Stretch delivered to LANL
- 1962
  – Virtual memory from U Manchester, T. Kilburn
- 1964
  – AEC urges manufacturers to look at "radical new" machine structures
    » This leads to the CDC Star-100, TI ASC, and Illiac IV
  – CDC 6600; S. Cray's design
    » Functional parallelism, leading to RISC
    » 3 times faster than the IBM Stretch
- 1965
  – DEC ships the first PDP-8 & IBM ships the 360
  – First CS Ph.D., U of Penn: Richard Wexelblat
  – Wilkinson's The Algebraic Eigenvalue Problem published
- 1966
  – DARPA contract with U of I to build the ILLIAC IV
  – Fortran 66
- 1967
  – Forsythe & Moler's Computer Solution of Linear Algebraic Systems published
    » Fortran, Algol and PL/I
- 1969
  – ARPANET work begins
    » 4 computers connected: UC-SB, UCLA, SRI and U of Utah
  – CDC 7600
    » Pipelined architecture; 3.1 Mflop/s

Wilkinson-Reinsch Handbook
- "In general we have aimed to include only algorithms that provide something approaching an optimal solution; this may be from the point of view of generality, elegance, speed, or economy of storage."
  – Part 1: Linear Systems
  – Part 2: The Algebraic Eigenvalue Problem
- Before the Handbook was published
  – Virginia Klema and others at ANL began translation of the Algol procedures into Fortran subroutines

1970's
- 1970
  – NATS Project conceived
    » Concept of certified math software and the process involved with its production
  – Purdue Math Software Symposium
  – NAG project begins
  – IMSL founded
- 1971
  – Handbook for Automatic Computation, Vol. II
    » Landmark in the development of numerical algorithms and software
    » Basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL, and the F chapters of NAG
  – IBM 370/195
    » Pipelined architecture; out-of-order execution; 2.5 Mflop/s
  – Intel 4004; 60 Kop/s
- 1972
  – Cray Research founded
  – Intel 8008
  – 1/4-size Illiac IV installed at NASA Ames
    » 15 Mflop/s achieved; 64 processors
  – Paper by S. Reddaway on massive bit-level parallelism
  – EISPACK available
    » 150 installations; EISPACK Users' Guide
    » 5 versions (IBM, CDC, Univac, Honeywell and PDP), distributed free via the Argonne Code Center
  – M. Flynn publishes paper on architectural taxonomy
  – ARPANet
    » 37 computers connected
- 1973
  – BLAS report in SIGNUM
    » Lawson, Hanson, & Krogh

NATS Project
- National Activity for Testing Software (NSF, Argonne, Texas and Stanford)
- Project to explore the problems of testing, certifying, disseminating and maintaining quality math software
  – First EISPACK, later FUNPACK
    » Influenced other "PACK"s: ELLPACK, FISHPACK, ITPACK, MINPACK, PDEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK...
- Key attributes of math software
  – reliability, robustness, structure, usability, ...

EISPACK
- Algol versions of the algorithms rewritten in Fortran
  – Restructured to avoid underflow
  – Check user's claims, such as whether a matrix is positive definite
  – Programs formatted in a unified fashion
    » Burt Garbow
  – Field test sites
- 1971 U of Michigan Summer Conference
  – JHW algorithms & CBM software
- 1972 Software released via the Argonne Code Center
  – 5 versions
- Software certified in the sense that reports of poor or incorrect performance "would gain the immediate attention from the developers"

EISPACK
- EISPAC Control Program, Boyle
  – One interface allowed access to the whole package on IBM machines
- Argonne's interactive RESCUE system
  – Allowed easy manipulation of tens of thousands of lines of code
- Generalizer/Selector
  – Generalizer converts the IBM version to a general form
  – Selector extracts the appropriate version
- 1976 Extensions to the package ready
  – EISPACK II

1970's continued
- 1974
  – Intel 8080
  – Level 1 BLAS activity started by the community; Purdue, 5/74
  – LINPACK meeting in the summer at ANL
- 1975
  – First issue of TOMS
  – Second LINPACK meeting at ANL
    » Laid the groundwork and hammered out what was and was not to be included in the package; proposal submitted to the NSF
- 1976
  – LINPACK work during the summer at ANL
  – LINPACK test software developed & sent to test sites
  – EISPACK second release; second edition of the Users' Guide
  – Cray 1: the model for vector computing
    » 4 Mflop/s in '79; 12 Mflop/s in '83; 27 Mflop/s in '89
- 1977
  – DEC VAX 11/780; super mini
    » 0.14 Mflop/s; 4.3 GB virtual memory
  – IEEE Arithmetic standard meetings
    » Paper by Palmer on the Intel standard for floating point
- 1978
  – Fortran 77
  – LINPACK software released
    » Sent to NESC and IMSL for distribution
- 1979
  – John Cocke designs the 801
  – ICL DAP delivered to QMC, London
  – Level 1 BLAS published/released
  – LINPACK Users' Guide

Basic Linear Algebra Subprograms
- BLAS effort, 1973-1977
- Consensus on:
  – Names
  – Calling sequences
  – Functional descriptions
  – Low-level linear algebra operations
- Success results from:
  – Extensive public involvement
  – Careful consideration
- A design tool for software in numerical linear algebra
- Improve readability and aid documentation
- Aid modularity and maintenance, and improve robustness of software calling the BLAS
- Improve portability without sacrificing efficiency
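As a concrete illustration of the low-level operations the BLAS standardized, here is a minimal sketch of AXPY (y := a*x + y), written in C for brevity; the actual BLAS are Fortran subroutines (SAXPY/DAXPY) that also take increment (stride) arguments, which this sketch omits.

```c
#include <stddef.h>

/* Sketch of a Level-1 BLAS operation: y := a*x + y.
   Unit stride assumed; the real DAXPY takes INCX/INCY arguments. */
static void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];   /* one multiply-add per element */
}
```

Standardizing names and calling sequences for kernels this small is what let higher-level packages call machine-tuned versions without changing their own source.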

CRAY and VAX
- DEC's VAXination of groups and departments
- Cray's introduction of vector computing
- Both had a significant impact on scientific computing

LINPACK
- June 1974, ANL
  – Jim Pool's meeting
- February 1975, ANL
  – Groundwork for the project
- January 1976
  – Funded by NSF and DOE: ANL/UNM/UM/UCSD
- Summer 1976
- Fall 1976 - Summer 1977
  – Software to 26 test sites
- December 1978
  – Software released via NESC and IMSL

LINPACK
- Research into the mechanics of software production
- Provide a yardstick against which future software would be measured
- Produce a library used by people, including those who wished to modify/extend the software to handle special problems
- Hope it would be used in the classroom
- Machine independent and efficient
  – No mixed-mode arithmetic

LINPACK Efficiency
- Condition estimator
- Inner loops via the BLAS
- Column access
- Unrolled loops (done in the BLAS)
- BABE algorithm for tridiagonal matrices
- TAMPR system allowed easy generation of versions
- LINPACK Benchmark
  – Today reports machines from the Cray T90 to the Palm Pilot
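The loop unrolling mentioned above can be sketched as follows. The original Fortran Level-1 BLAS unrolled the AXPY inner loop to a depth of 4 to cut loop-control overhead on scalar machines; this C version (function name illustrative) mirrors that structure.

```c
#include <stddef.h>

/* AXPY unrolled by 4, in the style of the original Fortran BLAS:
   fewer branch/increment operations per element on scalar machines. */
static void daxpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {       /* main loop, unrolled by 4 */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)                 /* clean-up loop for the remainder */
        y[i] += a * x[i];
}
```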

1980's Vector Computing (Parallel Processing)
- 1980
  – Illiac IV decommissioned
  – CDC introduces the Cyber 205
  – Total computers in use in the US exceeds 1 M
- 1981
  – IBM introduces the PC
    » Intel 8088/DOS
  – FPS delivers the FPS-164
    » Start of the mini-supercomputer market
  – Loop unrolling at the outer level for data locality and parallelism
    » Amounts to matrix-vector operations
  – Cuppen's method for the symmetric eigenvalue D&C published
- 1982
  – Steve Chen's group at Cray produces the X-MP
  – First Denelcor HEP installed (0.21 Mflop/s)
  – Sun Microsystems, Convex and Alliant founded
  – BBN Butterfly delivered
  – SGI founded by Jim Clark and others
- 1983
  – Total computers in use in the US exceed 10 M
  – DARPA starts the Strategic Computing Initiative
    » Helps fund Thinking Machines, BBN, WARP
  – Cosmic Cube hypercube running at Caltech
    » John Palmer, after seeing the Caltech machine, leaves Intel to found Ncube
  – Encore, Sequent, TMC, SCS, Myrias founded

1980's continued
- 1984
  – NSFNET; 5000 computers; 56 Kb/s lines
  – MathWorks founded
  – EISPACK third release
  – Netlib begins 1/3/84
  – Level 2 BLAS activity started
    » Gatlinburg (Waterloo), Purdue, SIAM
  – Intel Scientific Computers started by J. Rattner
    » Produces a commercial hypercube
  – Cray X-MP, 1 processor: 21 Mflop/s Linpack
  – Multiflow founded by J. Fisher; VLIW architecture
  – Apple introduces the Mac & IBM introduces the PC AT
  – IJK paper
- 1985
  – IEEE Standard 754 for floating point
  – IBM delivers 3090 vector; 16 Mflop/s Linpack, 138 peak
  – TMC demos the CM-1 to DARPA
  – Intel produces the first iPSC/1 hypercube
    » 80286s connected via Ethernet controllers
  – Fujitsu VP-400; NEC SX-2; Cray 2; Convex C1
  – Ncube/10; 0.1 Mflop/s Linpack on 1 processor
  – FPS-264; 5.9 Mflop/s Linpack, 38 peak
  – IBM begins the RP3 project
  – Stellar (Poduska), Ardent (Michels) and Supertek Computers founded
  – Denelcor closes doors

1980's (Lost Decade for Parallel Software)
- 1986
  – # of computers in the US exceeds 30 M
  – TMC ships the CM-1; 64 K 1-bit processors
  – Cray X-MP
  – IBM and MIPS release first RISC workstations
- 1987
  – ETA Systems family of supercomputers
  – Sun Microsystems introduces its first RISC workstation
  – AMT delivers the first re-engineered DAP
  – Intel produces the iPSC/2
  – First NA-DIGEST
  – Level 3 BLAS work begun
- 1988
  – Stellar and Ardent begin delivering single-user graphics workstations
  – Level 2 BLAS paper published
  – IBM invests in Steve Chen's SSI
  – Cray Y-MP
- 1989
  – # of computers in the US > 50 M
  – Stellar and Ardent merge, forming Stardent
  – S. Cray leaves Cray Research to form Cray Computer
  – Ncube 2nd-generation machine
  – ETA out of business
  – Intel 80486 and i860; 1 M transistors

EISPACK 3 and BLAS 2 & 3
- Machine independence for EISPACK
- Reduce the possibility of overflow and underflow
- Mods to the Algol from S. Hammarling
- Rewrite reductions to tridiagonal form to involve sequential access to memory
- "Official" double precision version
- Inverse iteration routines modified to reduce the size of ...
- BLAS (Level 1) vector operations provide for too much data movement
- Community effort to define extensions
  – Matrix-vector ops (Level 2)
  – Matrix-matrix ops (Level 3)

Netlib - Mathematical Software and Data
- Began in 1985
  – JD and Eric Grosse, AT&T Bell Labs
- Motivated by the need for cost-effective, timely distribution of high-quality mathematical software to the community
- Designed to send requested items by return electronic mail
- Automatic mechanism for electronic dissemination of freely available software
- Moderated collection / distributed maintenance
- Still in use and growing
  – Mirrored at 9 sites around the world

Netlib Growth
- Just over 6,000 requests in 1985
- Over 29,000 total
- Over 9 million hits in 1997
- 5.4 million so far in 1998
- LAPACK the best seller: 1.6 M hits

1990's
- 1990
  – Internet; World Wide Web
  – Motorola introduces the 68040
  – NEC ships the SX-3; first Japanese parallel vector supercomputer
  – IBM announces the RS/6000 family
    » Has an FMA instruction
  – Intel hypercube based on the 860 chip
    » 128 processors
  – Alliant delivers the FX/2800, based on the i860
  – Fujitsu VP-2600
  – PVM project started
  – Level 3 BLAS published
- 1991
  – Stardent to sell business and close
  – Cray C-90
  – Kendall Square Research delivers the 32-processor KSR-1
  – TMC produces the CM-200 and announces the CM-5 MIMD computer
  – DEC announces the Alpha
  – TMC produces the first CM-5
  – Fortran 90
  – Workshop to consider a message passing standard; beginnings of MPI
    » Community effort
  – Xnetlib running


Parallel Processing Comes of Age
- "There are three rules for programming parallel computers. We just don't know what they are yet." -- Gary Montry
- "Embarrassingly Parallel" - Cleve Moler
- "Humiliatingly Parallel"

Memory Hierarchy and LAPACK
- ijk implementations: better to keep the memory hierarchy in mind
  – Affects the order in which data are referenced; some orderings are better at allowing data to stay in the higher levels of the hierarchy
- Applies to matrix multiply and reductions to condensed form
  – May do slightly more flops
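The "ijk" point can be sketched as follows: the three nested loops of C = C + A*B can be ordered six ways, all computing the same result but with different memory reference patterns. With column-major (Fortran-style) storage, the jki ordering runs down columns with stride one. The matrix size and indexing macro below are illustrative.

```c
#define N 3
#define IDX(i, j) ((i) + (j) * N)   /* column-major indexing */

/* ijk ordering: inner loop forms a dot product, striding across a row of A */
static void matmul_ijk(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[IDX(i, j)] += A[IDX(i, k)] * B[IDX(k, j)];
}

/* jki ordering: inner loop is an AXPY down columns, stride-1 access */
static void matmul_jki(const double *A, const double *B, double *C)
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                C[IDX(i, j)] += B[IDX(k, j)] * A[IDX(i, k)];
}
```

On cached or paged memory systems the stride-1 variant touches memory far more favorably, which is exactly the observation that shaped LINPACK's column-oriented code and, later, LAPACK's blocked algorithms.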

Why Higher Level BLAS?
- Can only do arithmetic on data at the top of the hierarchy
- Higher level BLAS let us do this
  – Level 3 BLAS
  – Level 2 BLAS
  – Level 1 BLAS
- Development of blocked algorithms important for performance
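A minimal sketch of the blocking idea behind the Level 3 BLAS: partition the matrices into NB x NB blocks so that each block, once brought to the top of the hierarchy, is reused many times before being evicted. The matrix size and block size here are illustrative; real libraries choose NB per machine.

```c
#define M 4     /* matrix order (illustrative) */
#define NB 2    /* block size (illustrative; divides M here for simplicity) */

/* Blocked C = C + A*B: the three outer loops walk block pairs, the three
   inner loops multiply one NB x NB block pair. Each block of A and B is
   reused NB times from fast memory before moving on. */
static void matmul_blocked(double A[M][M], double B[M][M], double C[M][M])
{
    for (int ii = 0; ii < M; ii += NB)
        for (int jj = 0; jj < M; jj += NB)
            for (int kk = 0; kk < M; kk += NB)
                for (int i = ii; i < ii + NB; i++)
                    for (int j = jj; j < jj + NB; j++)
                        for (int k = kk; k < kk + NB; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

The payoff is the surface-to-volume effect: an NB x NB block multiply does O(NB^3) flops on O(NB^2) data, so the flop-to-memory-traffic ratio grows with NB.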

History of Block Partitioned Algorithms
- Ideas not new
- Early algorithms involved use of a small main memory with tapes as secondary storage
- Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and "out of core" memory

LAPACK
- Linear algebra library in Fortran 77
  – Solution of systems of equations
  – Solution of eigenvalue problems
- Combines algorithms from LINPACK and EISPACK into a single package
- Block algorithms
  – Efficient on a wide range of computers
    » RISC, vector, SMPs
- User interface similar to LINPACK
  – Single, Double, Complex, Double Complex
- Built on the Level 1, 2, and 3 BLAS
- Runs on everything from the HP-48G to the Cray T-90


1990's continued
- 1993
  – Intel Pentium systems start to ship
  – ScaLAPACK prototype software released
    » First portable library for distributed memory machines
    » Intel, TMC, and workstations using PVM
  – PVM 3.0 available
- 1994
  – MPI-1 finished
- 1995
  – Templates project
- 1996
  – Internet: 34 M users
  – Nintendo 64
    » More computing power than a ...
- 1997
  – MPI-2 finished
  – Fortran 95
- 1998
  – Issues of parallel and numerical stability
  – Divide time
  – DSM architectures
  – "New" algorithms
    » Chaotic iteration
    » Sparse LU w/o pivoting
    » Pipelined HQR
    » Graph partitioning
    » Algorithmic ...

Templates Project
- Iterative methods for large sparse systems
  – Communicate "state of the art" algorithms to the HPC community
  – Subtle algorithmic issues addressed, e.g. convergence, preconditioners, data structures
  – Performance and parallelism considerations
  – Gave computational scientists algorithms in the form they wanted
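In the spirit of the Templates book, here is one of the simplest stationary iterative methods (Jacobi), written as a short, adaptable template. A dense 2x2 system is used purely for illustration; the actual Templates algorithms target large sparse systems and leave the matrix data structure to the reader.

```c
#include <math.h>

/* Jacobi iteration for a 2x2 system Ax = b: each new component uses only
   the previous iterate. Returns the iteration count on convergence
   (change in x below tol), or -1 if maxit is exhausted. Converges for
   diagonally dominant A. */
static int jacobi2(double A[2][2], const double b[2],
                   double x[2], double tol, int maxit)
{
    for (int it = 0; it < maxit; it++) {
        double xn[2];
        xn[0] = (b[0] - A[0][1] * x[1]) / A[0][0];
        xn[1] = (b[1] - A[1][0] * x[0]) / A[1][1];
        double diff = fabs(xn[0] - x[0]) + fabs(xn[1] - x[1]);
        x[0] = xn[0];
        x[1] = xn[1];
        if (diff < tol)
            return it + 1;
    }
    return -1;   /* no convergence within maxit */
}
```

The "template" style is visible even at this scale: the convergence test, the stopping tolerance, and the update rule are all separated, so a reader can swap in their own matrix representation or preconditioner.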

ScaLAPACK
- Library of software for dense & banded problems
  » Sparse direct solvers being developed
- Distributed memory; message passing
  – PVM and MPI
- MIMD computers, networks of workstations, and clumps of SMPs
- SPMD Fortran 77 with an object-based design
- Built on various modules
  – PBLAS (BLACS and BLAS)

High-Performance Computing Directions
- Move toward shared memory
  – SMPs and distributed shared memory
  – Shared address space with a deep memory hierarchy
- Clustering of shared memory machines for scalability
- Emergence of PC commodity systems
  » Pentium based, NT or Linux driven
- At UTK, a cluster of 14 (dual) Pentium-based nodes: 7.2 Gflop/s
- Efficiency of message passing and data parallel programming
  – Helped by standards efforts such as PVM, MPI and HPF

Heterogeneous Computing
- Heterogeneity introduces new bugs in parallel code
- Slightly different floating point can make data-dependent branches go different ways when we expect identical behavior
- A "correct" algorithm on a network of identical workstations may fail if a slightly different machine is introduced
- Some bugs are easy to fix (compare s < tol on one processor and broadcast the result)
- Some are hard to fix (handling denorms; getting ...)
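The branch-divergence hazard can be demonstrated even on one machine: the same reduction summed in two different orders can land on opposite sides of a convergence test, so a data-dependent branch diverges between processes. The values and tolerance below are contrived for illustration; the fix sketched on the slide is to evaluate the test on one designated process and broadcast the decision rather than let each process recompute it.

```c
/* Two versions of "is the sum below tol?" that differ only in the
   association order of the floating-point additions. */
static int sum_then_test_lr(double a, double b, double c, double tol)
{
    return ((a + b) + c) < tol;   /* left-to-right: (a+b) first */
}

static int sum_then_test_rl(double a, double b, double c, double tol)
{
    return (a + (b + c)) < tol;   /* right-to-left: (b+c) first */
}
```

With a = 1e16, b = -1e16, c = 1.0: left-to-right gives (0) + 1.0 = 1.0, but right-to-left absorbs the 1.0 into -1e16 and gives 0.0, so the two "identical" convergence tests disagree.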

Java - For Numerical Computations?
- Java likely to be a dominant language
- Provides for machine-independent code
- C++-like language
- No pointers, gotos, overloading of arithmetic operators, or memory deallocation
- Portability achieved via an abstract machine

Network Enabled Servers
- Allow networked resources to be integrated into the desktop
- Many hosts, co-existing in a loose confederation tied together with high-speed links
- Users have the illusion of a very powerful computer on the desk
- Locate and "deliver" software or solutions to the user in a directly usable and "conventional" form
- Part of the motivation: software maintenance

Future: Petaflops (10^15 fl pt ops/s)
- A Pflop/s for 1 second ≈ a typical workstation computing for 1 year
- May be feasible and "affordable" by the year 2010
- From an algorithmic standpoint:
  – concurrency
  – data locality
  – latency & synchronization
  – floating point accuracy
  – dynamic redistribution of workload
  – new languages and constructs
  – role of numerical libraries
  – algorithm adaptation

Summary
- As a community we have a lot to be proud of in terms of the algorithms and software we have produced
  – generality, elegance, speed, or economy of storage
- Software still being used, in many cases 30 years later
