Introduction to ScaLAPACK

ScaLAPACK
• ScaLAPACK is for solving dense linear systems and computing eigenvalues for dense matrices

ScaLAPACK
• Scalable Linear Algebra PACKage
• Developing team from University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc.
• Support in commercial packages
  – NAG Parallel Library
  – IBM PESSL
  – CRAY Scientific Library
  – VNI IMSL
  – Fujitsu, HP/Convex, Hitachi, NEC

Description
• A Scalable Linear Algebra Package (ScaLAPACK)
• A library of linear algebra routines for MPP and NOW machines
• Language: Fortran
• Dense matrix problem solvers
  – Linear Equations
  – Least Squares
  – Eigenvalue

Technical Information
• ScaLAPACK web page
  – http://www.netlib.org/scalapack
• ScaLAPACK User's Guide
• Technical Working Notes
  – http://www.netlib.org/lapack/lawn

Packages
• LAPACK
  – Routines are single-PE linear algebra solvers, utilities, etc.
• BLAS
  – Single-PE workhorse routines (vector, vector-matrix & matrix-matrix)
• PBLAS
  – Parallel counterparts of the BLAS. Relies on the BLACS for blocking and moving data.

Packages (cont.)
• BLACS
  – Constructs 1-D & 2-D processor grids for matrix decomposition and nearest-neighbor communication patterns. Uses message-passing libraries (MPI, PVM, MPL, NX, etc.) for data movement.

Dependencies & Parallelism of Component Packages
(Diagram: shows which of the component packages above are serial and which are parallel.)

Data Partitioning
• Initialize BLACS routines
  – Set up a 2-D mesh (mp, np)
  – Set column- or row-major ordering (CorR)
  – The call returns a context handle (icon)
  – Get this PE's row & column through gridinfo (myrow, mycol)
• blacs_gridinit( icon, CorR, mp, np )
• blacs_gridinfo( icon, mp, np, myrow, mycol )
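
A minimal initialization sketch, mirroring the grid setup used in the full example program later in this deck (the 2 x 3 grid size and the program name are assumptions for illustration):

      PROGRAM grid_demo
      IMPLICIT NONE
      INCLUDE "mpif.h"
      INTEGER :: ictxt, myrow, mycol, mp, np, ierr
      CALL MPI_INIT(ierr)
      mp = 2                                      ! process-grid rows
      np = 3                                      ! process-grid columns
      CALL BLACS_GET( 0, 0, ictxt )               ! obtain a default system context
      CALL BLACS_GRIDINIT( ictxt, 'R', mp, np )   ! build the mp x np grid, row-major
      CALL BLACS_GRIDINFO( ictxt, mp, np, myrow, mycol )  ! my coordinates on the grid
      PRINT *, 'this PE sits at grid position (', myrow, ',', mycol, ')'
      CALL BLACS_GRIDEXIT( ictxt )                ! release the grid
      CALL BLACS_EXIT( 1 )                        ! keep MPI alive; we finalize it below
      CALL MPI_FINALIZE(ierr)
      END PROGRAM grid_demo

Run it on exactly mp*np MPI processes; each process prints its (myrow, mycol) grid coordinates.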

Data Partitioning (cont.)
• Block-cyclic array distribution
  – Block: groups of sequential elements
  – Cyclic: round-robin assignment of blocks to PEs
• Example 1-D partitioning (see the sketch below):
  – A global 9-element array distributed over a 3-PE grid with 2-element blocks: block(2)-cyclic(3)
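
A small sketch (plain Fortran, no library calls) that prints which PE owns each element of the block(2)-cyclic(3) example above; the owner formula is the same one used later in the SETARRAY subroutine:

      PROGRAM owner_1d
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 9, nb = 2, npes = 3
      INTEGER :: i, pe
      DO i = 1, n
         pe = MOD( (i-1)/nb, npes )      ! 0-based PE that owns global element i
         PRINT *, 'global element', i, 'is stored on PE', pe
      END DO
      END PROGRAM owner_1d

For the 9-element array this prints: elements 1-2 on PE 0, 3-4 on PE 1, 5-6 on PE 2, 7-8 on PE 0 again, and 9 on PE 1.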

Block-Cyclic Partitioning
(Figure: block-cyclic layout of a global array over the process grid.)

2-D Partitioning
• Example 2-D partitioning
  – A(9,9) mapped onto a PE(2,3) grid with block(2,2)
  – Use the 1-D example to distribute the column elements in each row: a block(2)-cyclic(3) distribution.
  – Assign rows to the first PE dimension by a block(2)-cyclic(2) distribution.

Global Matrix (9x9): entry "ij" denotes A(i,j). Block size = 2x2, block-cyclic on a 2x3 PE grid.
Each PE stores only its own local piece:

PE(0,0): global rows 1,2,5,6,9 and columns 1,2,7,8
  11 12 17 18
  21 22 27 28
  51 52 57 58
  61 62 67 68
  91 92 97 98

PE(0,1): global rows 1,2,5,6,9 and columns 3,4,9
  13 14 19
  23 24 29
  53 54 59
  63 64 69
  93 94 99

PE(0,2): global rows 1,2,5,6,9 and columns 5,6
  15 16
  25 26
  55 56
  65 66
  95 96

PE(1,0): global rows 3,4,7,8 and columns 1,2,7,8
  31 32 37 38
  41 42 47 48
  71 72 77 78
  81 82 87 88

PE(1,1): global rows 3,4,7,8 and columns 3,4,9
  33 34 39
  43 44 49
  73 74 79
  83 84 89

PE(1,2): global rows 3,4,7,8 and columns 5,6
  35 36
  45 46
  75 76
  85 86
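
The layout above can be reproduced with the block-cyclic owner formulas that the SETARRAY subroutine uses later in this deck; a minimal sketch (assumed program name, same 2x2 blocks and 2x3 grid):

      PROGRAM owner_2d
      IMPLICIT NONE
      INTEGER, PARAMETER :: m = 9, n = 9, mb = 2, nb = 2
      INTEGER, PARAMETER :: nprow = 2, npcol = 3, rsrc = 0, csrc = 0
      INTEGER :: i, j, pr, pc
      DO i = 1, m
         DO j = 1, n
            pr = MOD( rsrc + (i-1)/mb, nprow )   ! process row owning A(i,j)
            pc = MOD( csrc + (j-1)/nb, npcol )   ! process column owning A(i,j)
            PRINT *, 'A(', i, ',', j, ') lives on PE(', pr, ',', pc, ')'
         END DO
      END DO
      END PROGRAM owner_2d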

Syntax for descinit( idesc, m, n, mb, nb, i, j, icon, mla, ier )

  arg    In/Out   Description
  idesc  OUT      Descriptor
  m      IN       Row size (global)
  n      IN       Column size (global)
  mb     IN       Row block size
  nb     IN       Column block size
  i      IN       Starting grid row
  j      IN       Starting grid column
  icon   IN       BLACS context
  mla    IN       Leading dimension of the local matrix
  ier    OUT      Error number

The starting grid location is usually (i=0, j=0).
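
As an illustration, a descriptor for the 9x9 matrix of the earlier example (2x2 blocks, 2x3 grid) could be built as follows. This is a fragment, not a complete program: it assumes ictxt, myrow and nprow were already set by the BLACS grid calls shown before.

      INTEGER :: desc_a(9), info, lda_x
      INTEGER, EXTERNAL :: NUMROC
      ! number of rows of A stored locally on this process row
      lda_x = MAX( 1, NUMROC( 9, 2, myrow, 0, nprow ) )
      ! global 9x9 matrix, 2x2 blocks, distributed from grid position (0,0)
      CALL DESCINIT( desc_a, 9, 9, 2, 2, 0, 0, ictxt, lda_x, info )
      IF (info /= 0) PRINT *, 'descinit failed: info =', info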

Application Program Interfaces (API)
• Drivers
  – Solve a complete problem
• Computational components
  – Perform tasks: LU factorization, etc.
• Auxiliary routines
  – Scaling, matrix norm, etc.
• Matrix redistribution/copy routine
  – Matrix on PE grid 1 -> matrix on PE grid 2

API (cont.)
• Prepend the equivalent LAPACK name with P:  PXYYZZZ
  – X   = data type
  – YY  = matrix type
  – ZZZ = computation performed
• Data type (X): real = S, double = D, complex = C, double complex = Z
• Example: PSGESV = Parallel, Single-precision real, GEneral matrix, LU SolVer (used in the example program below).

Matrix Types (YY)   PXYYZZZ
  DB  General Band (diagonally dominant-like)
  DT  General Tridiagonal (diagonally dominant-like)
  GB  General Band
  GE  GEneral matrices (e.g., unsymmetric, rectangular, etc.)
  GG  General matrices, Generalized problem
  HE  Complex HErmitian
  OR  ORthogonal Real
  PB  Positive definite Banded (symmetric or Hermitian)
  PO  POsitive definite (symmetric or Hermitian)
  PT  Positive definite Tridiagonal (symmetric or Hermitian)
  ST  Symmetric Tridiagonal Real
  SY  SYmmetric
  TR  TRiangular (possibly quasi-triangular)
  TZ  TrapeZoidal
  UN  UNitary complex

Drivers (ZZZ)   PXYYZZZ
  SL   Linear Equations (SVX*)
  SV   LU Solver
  VD   Singular Value
  EV   Eigenvalue (EVX**)
  GVX  Generalized Eigenvalue

  *  Estimates condition numbers, provides error bounds, and equilibrates the system.
  ** Selected eigenvalues.

Rules of Thumb
• Use a square processor grid: mp = np
• Use a block size of 32 (Cray) or 64 for efficient matrix multiply
• Number of PEs to use = M x N / 10**6 (e.g., about 100 PEs for a 10000 x 10000 matrix)
• Bandwidth per node (MB/sec) > peak rate (MFLOPS) / 10
• Latency < 500 microseconds
• Use the vendor's BLAS

Example program: Linear Equation Solution Ax=b (page 1)

      program Sample_ScaLAPACK_program
      implicit none
      include "mpif.h"
      REAL :: A(1000,1000)
      REAL, DIMENSION(:,:), ALLOCATABLE :: A_local
      INTEGER, PARAMETER :: M_A=1000, N_A=1000
      INTEGER :: DESC_A(1:9)
      REAL, DIMENSION(:,:), ALLOCATABLE :: b_local
      REAL :: b(1000,1)
      INTEGER, PARAMETER :: M_B=1000, N_B=1
      INTEGER :: DESC_b(1:9), i, j
      INTEGER :: ldb_y
      INTEGER :: ipiv(1:1000), info
      INTEGER :: myrow, mycol         ! row and column # of this PE on the PE grid
      INTEGER :: ictxt                ! context handle ... like an MPI communicator
      INTEGER :: lda_y, lda_x         ! leading dimensions of the local array A_local
      ! row and column # of the PE grid
      INTEGER, PARAMETER :: nprow=2, npcol=2
      !
      ! RSRC ... the PE row that owns the first row of A
      ! CSRC ... the PE column that owns the first column of A

Example program (page 2)

      INTEGER, PARAMETER :: RSRC = 0, CSRC = 0
      INTEGER, PARAMETER :: MB = 32, NB = 32
      !
      ! iA ... the global row index of A, which points to the beginning
      !        of the submatrix that will be operated on.
      !
      INTEGER, PARAMETER :: iA = 1, jA = 1
      INTEGER, PARAMETER :: iB = 1, jB = 1
      REAL :: starttime, endtime
      INTEGER :: taskid, numtask, ierr
      !*****************
      ! NOTE: ON THE NPACI MACHINES YOU DO NOT NEED TO CALL INITBUFF;
      ! ON OTHER MACHINES YOU MAY NEED TO CALL INITBUFF
      !*****************
      !
      ! This is where you should define any other variables
      !
      INTEGER, EXTERNAL :: NUMROC
      !
      ! Initialization of the processor grid. This routine must be called.
      !
      ! start MPI initialization routines

Example program (page 3)

      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numtask, ierr)
      ! start timing call on the (0,0) processor
      IF (taskid==0) THEN
         starttime = MPI_WTIME()
      END IF
      CALL BLACS_GET(0, 0, ictxt)
      CALL BLACS_GRIDINIT(ictxt, 'r', nprow, npcol)
      !
      ! This call returns processor grid information that will be
      ! used to distribute the global array
      !
      CALL BLACS_GRIDINFO(ictxt, nprow, npcol, myrow, mycol)
      !
      ! define lda_x and lda_y with NUMROC
      ! lda_x ... the leading dimension of the local array storing the
      !           local blocks of the distributed matrix A
      !
      lda_x = NUMROC(M_A, MB, myrow, 0, nprow)
      lda_y = NUMROC(N_A, NB, mycol, 0, npcol)
      !
      ! resizing A so we don't waste space
      !

Example program (page 4)

      ALLOCATE( A_local(lda_x, lda_y) )
      if (N_B < NB) then
         ALLOCATE( b_local(lda_x, N_B) )
      else
         ldb_y = NUMROC(N_B, NB, mycol, 0, npcol)
         ALLOCATE( b_local(lda_x, ldb_y) )
      end if
      !
      ! defining the global matrix A
      !
      do i = 1, 1000
         do j = 1, 1000
            if (i==j) then
               A(i,j) = 1000.0*i*j
            else
               A(i,j) = 10.0*(i+j)
            end if
         end do
      end do
      ! defining the global array b
      do i = 1, 1000
         b(i,1) = 1.0
      end do
      !

Example program (page 5)

      !
      ! subroutine SETARRAY maps the global arrays to the local arrays
      !
      ! YOU MUST HAVE DEFINED THE GLOBAL MATRIX BEFORE THIS POINT
      !
      CALL SETARRAY(A, A_local, b, b_local, myrow, mycol, lda_x, lda_y)
      !
      ! initialize the descriptor vectors for ScaLAPACK
      !
      CALL DESCINIT(DESC_A, M_A, N_A, MB, NB, RSRC, CSRC, ictxt, lda_x, info)
      CALL DESCINIT(DESC_b, M_B, N_B, MB, N_B, RSRC, CSRC, ictxt, lda_x, info)
      !
      ! call the ScaLAPACK routine
      !
      CALL PSGESV(N_A, 1, A_local, iA, jA, DESC_A, ipiv, b_local, iB, jB, DESC_b, info)
      !
      ! check that ScaLAPACK returned ok
      !

Example program (page 6)

      if (info .ne. 0) then
         print *, 'unsuccessful return ... something is wrong! info=', info
      else
         print *, 'Successful return'
      end if
      CALL BLACS_GRIDEXIT(ictxt)
      CALL BLACS_EXIT(0)
      ! call timing routine on processor (0,0) to get endtime
      IF (taskid==0) THEN
         endtime = MPI_WTIME()
         print *, 'wall clock time in seconds = ', endtime - starttime
      END IF
      ! end MPI
      CALL MPI_FINALIZE(ierr)
      !
      ! end of main
      !
      END
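
Build and run details are site-specific, but in general the program above is linked against the ScaLAPACK, PBLAS, BLACS and BLAS libraries plus MPI, and is launched on nprow*npcol = 4 MPI processes (for example, mpirun -np 4 ./a.out).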

Example program (page 7)

      SUBROUTINE SETARRAY(AA, A, BB, B, myrow, mycol, lda_x, lda_y)
      ! distributes the global arrays AA and BB
      ! into the local arrays A and B.
      implicit none
      REAL :: AA(1000,1000)
      REAL :: BB(1000,1)
      INTEGER :: lda_x, lda_y
      REAL :: A(lda_x, lda_y)
      REAL :: ll, mm, Cr, Cc
      INTEGER :: ii, jj, I, J, myrow, mycol, pr, pc, h, g
      INTEGER, PARAMETER :: nprow = 2, npcol = 2
      INTEGER, PARAMETER :: N=1000, M=1000, NB=32, MB=32, RSRC=0, CSRC=0
      REAL :: B(lda_x, 1)
      INTEGER, PARAMETER :: N_B = 1

Example program (page 8)

      do I = 1, M
         do J = 1, N
            ! find out which PE gets this (I,J) element
            Cr = real( (I-1)/MB )
            h = RSRC + aint(Cr)
            pr = mod(h, nprow)
            Cc = real( (J-1)/NB )
            g = CSRC + aint(Cc)
            pc = mod(g, npcol)
            !
            ! check if it is on this PE and then set A
            !
            if (myrow==pr .and. mycol==pc) then
               ! (ii,jj) are the coordinates of the local array element:
               !   ii = x + l*MB
               !   jj = y + m*NB
               !
               ll = real( (I-1)/(nprow*MB) )
               mm = real( (J-1)/(npcol*NB) )
               ii = mod(I-1, MB) + 1 + aint(ll)*MB
               jj = mod(J-1, NB) + 1 + aint(mm)*NB
               A(ii,jj) = AA(I,J)
            end if
         end do
      end do

Example Program (page 9)

      ! read in and distribute the B vector
      !
      do I = 1, M
         J = 1
         ! find out which PE gets this (I,J) element
         Cr = real( (I-1)/MB )
         h = RSRC + aint(Cr)
         pr = mod(h, nprow)
         Cc = real( (J-1)/NB )
         g = CSRC + aint(Cc)
         pc = mod(g, npcol)
         ! check if it is on this PE and then set B
         if (myrow==pr .and. mycol==pc) then
            ll = real( (I-1)/(nprow*MB) )
            mm = real( (J-1)/(npcol*NB) )
            jj = mod(J-1, NB) + 1 + aint(mm)*NB
            ii = mod(I-1, MB) + 1 + aint(ll)*MB
            B(ii,jj) = BB(I,J)
         end if
      end do
      end subroutine
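
Note that the index arithmetic in SETARRAY is the same block-cyclic owner/offset computation sketched earlier for the 9x9 example, just with MB = NB = 32 and a 2x2 process grid; ScaLAPACK itself never sees the global arrays, only the local pieces plus their descriptors.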

PxGEMM (Mflops)
(Performance chart: PxGEMM on the Cray T3E, IBM SP2, and the Berkeley NOW for N = 2000, 4000, 6000, 8000, 10000 on 2x2, 4x4, and 8x8 process grids.)

PxGESV (Mflops)
(Performance chart: PxGESV on the Cray T3E, IBM SP2, and the Berkeley NOW for N from 2000 to 15000 on 1x4, 2x8, and 4x16 process grids.)