Parallel Sparse LU Factorization on Second-class Message Passing Platforms

Kai Shen, University of Rochester
ICS 2005, June 22, 2005

Preliminary: Parallel Sparse LU Factorization

• LU factorization with partial pivoting: used for solving a linear system Ax = b (PA = LU); illustrated below
• Applications:
  • Device/circuit simulation, fluid dynamics, ...
  • The inner linear solves in Newton's method for non-linear systems
• Challenges for parallel sparse LU factorization:
  • Runtime data structure variation
  • Non-uniform computation/communication patterns
  ⇒ An irregular computation
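As a concrete toy illustration of the factor-then-solve idea, here is a minimal sketch using SciPy's sparse LU. The matrix and right-hand side are made up for the example, and note that SciPy's splu permutes both rows and columns for sparsity rather than doing pure partial pivoting:

    import numpy as np
    from scipy.sparse import csc_matrix
    from scipy.sparse.linalg import splu

    # Toy sparse system Ax = b (values are arbitrary for the example)
    A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                             [1.0, 3.0, 1.0],
                             [0.0, 1.0, 2.0]]))
    b = np.array([1.0, 2.0, 3.0])

    lu = splu(A)                   # factor once: Pr * A * Pc = L * U
    x = lu.solve(b)                # each right-hand side is then a cheap solve
    print(np.allclose(A @ x, b))   # True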

Existing Solvers and Their Portability

• Shared memory solvers:
  • SuperLU [Li, Demmel et al. 1999], WSMP [Gupta 2000], PARDISO [Schenk & Gärtner 2004]
• Message passing solvers:
  • S+ [Shen et al. 2000], MUMPS [Amestoy et al. 2001], SuperLU_DIST [Li & Demmel 2004]
  • Mostly designed for parallel computers with fast interconnects
• Existing message passing solvers are portable, but perform poorly on platforms with slow message passing
• Performance portability is desirable:
  • Large variation in the characteristics of available platforms

Example Message Passing Platforms

• Three platforms running MPI: Regatta-shmem, Regatta-TCP/IP, and a PC cluster
• Per-CPU peak BLAS-3 performance: 971 MFLOPS on the Regatta, 1382 MFLOPS on a PC

Parallel Sparse LU Factorization on the Three Platforms

[Figure: performance of S+ [Shen et al. 2000] on the three platforms]

• We investigate communication-reduction techniques to improve performance on the platforms with slow communication

Data Structure and Computation Steps

[Figure: the sparse matrix data structure, with row block K and column block K highlighted]

    for each column block K (1 → N)
        Perform Factor(K);
        Perform SwapScale(K);
        Perform Update(K);
    endfor

• Processor mapping (a runnable sketch follows):
  • 1-D cyclic
  • 2-D cyclic (more scalable)
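A runnable sketch of this driver loop with a 2-D cyclic block-to-processor mapping; the owner function, the Pr x Pc grid parameters, and the printed step names are illustrative stand-ins, not S+ internals:

    # 2-D cyclic mapping: block (I, J) lives on processor (I mod Pr, J mod Pc)
    # of a Pr x Pc grid; 1-D cyclic is the special case Pr = 1.
    def owner(I, J, Pr, Pc):
        return (I % Pr) * Pc + (J % Pc)      # linear rank of the owning processor

    def lu_driver(N, rank, Pr, Pc):
        for K in range(N):                   # for each column block K (1 -> N)
            if owner(K, K, Pr, Pc) == rank:
                print(f"P{rank}: Factor({K})")    # factor diagonal block, pick pivots
            print(f"P{rank}: SwapScale({K})")     # apply pivot row swaps, scale column K
            print(f"P{rank}: Update({K})")        # update the trailing submatrix

    lu_driver(N=4, rank=0, Pr=2, Pc=2)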

Large Diagonal Batch Pivoting

• Batch pivoting to reduce communication
• Large diagonal batch pivoting (sketched below):
  • Locate the largest elements for all columns in a block using one round of communication
  • Use them as the pivoting elements
  • Previous approaches [Duff and Koster 1999, 2001; Li & Demmel 2004] use it in iterative methods
• May be numerically unstable:
  • We check the error and fall back to the original pivoting if necessary
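A hedged sketch of the one-round pivot search using mpi4py; the function name, the block layout, and the allgather-based max-location are illustrative choices, not the S+ implementation:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    def batch_pivot_search(my_block):
        """Pick pivots for ALL columns of the block with a single
        collective, instead of one communication per column."""
        # my_block: this process's locally owned rows of the column block
        local_best = np.abs(my_block).max(axis=0)        # per-column local max |a_ij|
        all_best = np.array(comm.allgather(local_best))  # one round of communication
        # Afterwards the caller factors with these pivots, tests the numerical
        # error, and falls back to column-by-column pivoting if the test fails.
        return all_best.argmax(axis=0)   # rank whose candidate becomes each pivot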

Speculative Batch Pivoting

• Large diagonal batch pivoting fails the numerical stability test frequently
• Speculative batch pivoting (sketched below):
  • Collect the candidate pivot rows for all columns in a block at one processor, using one gather communication
  • Perform factorization at that processor to determine the pivots
  • Check the error and fall back to the original pivoting if necessary
• Both batch pivoting strategies:
  • Require additional computation
  • May slightly weaken the numerical stability
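A sketch of the speculative scheme in the same mpi4py style; gathering at rank 0, the dense elimination over the candidate rows, and the final broadcast are illustrative assumptions, not the paper's exact protocol:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    def speculative_batch_pivots(my_candidate_rows):
        """One gather collects every process's candidate pivot rows;
        rank 0 factorizes them to choose pivots for the whole block."""
        gathered = comm.gather(my_candidate_rows, root=0)   # one gather
        pivots = None
        if comm.rank == 0:
            C = np.vstack(gathered).astype(float)
            pivots = []
            for j in range(C.shape[1]):     # Gaussian elimination w/ partial pivoting
                p = j + np.abs(C[j:, j]).argmax()
                C[[j, p]] = C[[p, j]]       # speculative row swap
                C[j+1:, j:] -= np.outer(C[j+1:, j] / C[j, j], C[j, j:])
                pivots.append(p)
            # The real solver re-checks numerical stability here and falls
            # back to the original pivoting when the speculation fails.
        return comm.bcast(pivots, root=0)   # tell everyone the chosen pivots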

Performance on Regatta-shmem

• LD: large diagonal batch pivoting; SBP: speculative batch pivoting; TP: threshold pivoting [Duff et al. 1986]
• Virtually no performance benefit on this platform

Performance on Platforms with Slower Message Passing

• PC cluster: SBP improves performance by 28-292% on a set of eight test matrices
• Regatta-TCP/IP: the improvement is up to 48%

Application Adaptation

• Communication-reduction techniques:
  • Effective on platforms with relatively slow message passing
  • Ineffective on first-class platforms, where their by-products (e.g., additional computation) may not be worthwhile
• Sampling-based adaptation (see the sketch below):
  • Collect application statistics in a sampling phase
  • Combine them with platform characteristics to adaptively determine whether each candidate technique should be employed
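One simple way to realize such a decision is to time a small sample under each candidate version and keep the fastest; the function names and the timing-only criterion below are invented for illustration and are not the paper's exact rule:

    import time

    def choose_version(factor_sample, versions):
        """Sampling phase: factor a few column blocks under each candidate
        configuration and keep whichever is fastest on this platform."""
        timings = {}
        for name, configure in versions.items():
            configure()                                 # e.g. toggle batch pivoting
            start = time.perf_counter()
            factor_sample()                             # small representative sample
            timings[name] = time.perf_counter() - start
        return min(timings, key=timings.get)            # version used for the rest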

Adaptation on Regatta-shmem

The "Adaptive" version:
• Disables the communication-reduction techniques for most matrices
• Achieves numerical stability similar to that of the "Original" version

Adaptation on the PC Cluster

The "Adaptive" version:
• Employs the communication-reduction techniques for all matrices
• Performs close to the TP+SBP version

Conclusion

• Contributions:
  • Communication-reduction techniques that improve LU factorization performance on platforms with relatively slow message passing
  • Runtime sampling-based adaptation to automatically choose the appropriate version of the application
• Software: http://www.cs.rochester.edu/~kshen/research/s+/