Highly Latency Tolerant Gaussian Elimination
Toshio ENDO, Kenjiro TAURA
University of Tokyo
Grid 05 workshop: Highly Latency Tolerant GE / T. Endo

Background
- Demands for large-scale computing are increasing
- Grid computing is attractive for improving cost performance
- Grids have been successful for applications with little communication
  - Master-worker, parameter sweeping, …@home projects
- Evaluations of applications with frequent communication on Grids are still rare
  - Matrix operations, PDE solvers
  - MPICH-G2, Cactus-G

Obstacles to Running Apps with Frequent Comm.
- Volatility and heterogeneity of computing nodes will be solved by progress in programming tools
- Low WAN bandwidth will be solved by improvements to WANs
- Large WAN latencies
  - More than 10 ms on a Grid >> a few us on supercomputers
Large latencies will remain an obstacle! Algorithms that tolerate latencies are important.

Target Computation
Gaussian elimination (GE) of dense matrices for solving linear equations
- Same as LU decomposition, Linpack
- Used for
  - Fluid simulations, structural analysis
  - Top 500 ranking
Difficult to achieve good performance with large latencies
- Partial pivoting (PP) introduces frequent synchronous communications

Overview of This Work
- A Gaussian elimination algorithm that tolerates large latencies is presented
- An alternative pivoting method, named batched pivoting (BP), is proposed
  - More latency tolerant than PP
  - BP can greatly reduce the frequency of synchronous communications

Outline
- Gaussian elimination with partial pivoting
- Batched pivoting
- Evaluation
  - Latency tolerance
  - Numerical accuracy
- Summary

Gaussian Elimination with Partial Pivoting
GE of an n×n matrix A:
  for k = 1 to n
    Pivoting: find the element with the largest absolute value (the pivot) in the k-th column
    Row exchange
    Update
Since the pivot is used as a divisor, its absolute value should be large.
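The loop on this slide can be written out as a small single-node sketch. This is only an illustration in plain Python: the function name `solve_ge_pp` and the list-of-lists matrix layout are assumptions made here, not something taken from the talk or from HPL.

```python
def solve_ge_pp(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    a: list of n rows (lists of floats), b: list of n floats.
    Inputs are copied, so the caller's data is not modified.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Pivoting: row with the largest |a[i][k]| among rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        # Row exchange.
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Update: eliminate column k from the rows below the pivot row.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x
```

In the parallel setting discussed next, the `max` over column k is the step that needs a synchronization among the nodes holding that column.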

Problem of GE with PP
- Well-known distribution: 2D block cyclic distribution
  - Good: communication amount is small, O(n^2 √p)
  - In the figure, number of nodes p = 6 (= 2 x 3), matrix size n, block size sb
- Each column is partitioned among nodes
- Each pivot selection requires synchronization
- With large latencies, synchronization costs become the bottleneck

Performance of GE/PP with Large Latencies
- We emulated large artificial latencies on a Linux cluster
  - Identical latencies are inserted between all pairs of nodes
  - base, +2 ms, +5 ms, +10 ms
- High Performance Linpack (HPL) is measured
  - GE with PP
  - Matrix size n = 32768
  - 64 (= 8 x 8) nodes
- With +10 ms latency, it gets 6 times slower!
GE with PP is weak against large latencies.

How about Other Pivoting Methods?
Pivoting methods, ordered from strict to relaxed:
- Complete
- Rook [Neal 92]
- Partial
- Threshold [Malard 91] etc.
- Batched (ours)
- Pairwise [Sorensen 85] etc.
- No pivoting
[Figure annotations: "Not latency tolerant" toward the strict end, "numerically unstable" toward the relaxed end]

The Aim of Batched Pivoting (BP)
- BP reduces the frequency of synchronous communications
- We batch the pivot selections of several contiguous steps
  - The batch size d is determined in advance
  - Synchronous communications occur only every d steps
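To get a feel for what "only every d steps" means, the following sketch counts synchronizations for PP (one per step) and BP (one per batch). The waiting-time estimate assumes each synchronization costs exactly one network latency, which is a deliberately optimistic simplification made here for illustration; the function names are hypothetical.

```python
import math


def sync_counts(n, d):
    """Number of pivot synchronizations for PP (one per step) and for
    BP (one per batch of d steps) on an n x n matrix."""
    return n, math.ceil(n / d)


def sync_wait_seconds(num_syncs, latency_s):
    """Waiting time under the assumption that every synchronization
    costs one network latency (an optimistic simplification)."""
    return num_syncs * latency_s


pp_syncs, bp_syncs = sync_counts(32768, 64)
print(pp_syncs, sync_wait_seconds(pp_syncs, 0.010))
print(bp_syncs, sync_wait_seconds(bp_syncs, 0.010))
```

With the matrix size n = 32768 and the +10 ms latency used in the experiments, this model charges PP 32768 latencies but BP with d = 64 only 512.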

Batched Pivoting Algorithm (1)
Algorithm that selects d contiguous pivots
- In the figure, d columns are partitioned between P1 and P2
- Each node duplicates the columns and makes a sub-matrix
- Each node locally and speculatively performs GE with PP
  - Each node obtains d pivot candidates
[Figure: matrix distributed over nodes P1 to P4 with block size sb; a panel of d columns is held by P1 and P2]

Batched Pivoting Algorithm (2)
- Sets of pivot candidates are gathered
- The 'best' set is selected
  - We try to avoid bad pivots
- Contents of the best set are broadcast as final pivots
Example in the figure (d = 4):
  P1: "I recommend 4.8 on the 50th row, -2.5 on the 241st row, 4.3 on the 285th row, -3.6 on the 36th row"
  P2: "I recommend -9.2 on the 310th row, 6.8 on the 121st row, 0.8 on the 170th row, -5.9 on the 146th row"
  The two sets are compared and one is adopted.
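The two slides above can be sketched as a single-process simulation in which each "node" is just a list of row indices. The slides do not say how the 'best' set is chosen, so the criterion below (prefer the set whose smallest absolute pivot is largest) is an assumption made for illustration, as are all function and parameter names.

```python
def local_pivot_candidates(rows, panel, d):
    """Speculative GE with partial pivoting on one node's copy of the panel.

    rows: global row indices owned by this node.
    panel: dict mapping global row index -> list of d panel entries.
    Returns a list of d (global_row, pivot_value) pairs, or None on failure.
    """
    sub = {r: panel[r][:] for r in rows}  # private copy; the panel is untouched
    remaining = list(rows)
    picks = []
    for k in range(d):
        if not remaining:
            return None  # fewer than d rows on this node
        p = max(remaining, key=lambda r: abs(sub[r][k]))
        pivot = sub[p][k]
        if pivot == 0.0:
            return None  # local GE failed on this node
        picks.append((p, pivot))
        remaining.remove(p)
        for r in remaining:
            m = sub[r][k] / pivot
            for j in range(k, d):
                sub[r][j] -= m * sub[p][j]
    return picks


def select_batched_pivots(node_rows, panel, d):
    """Gather every node's candidate set and adopt one whole set.

    Assumed criterion (not from the slides): prefer the set whose
    smallest |pivot| is largest. Returns (node_index, picks).
    """
    candidates = []
    for node, rows in enumerate(node_rows):
        picks = local_pivot_candidates(rows, panel, d)
        if picks is not None:
            candidates.append((node, picks))
    if not candidates:
        raise RuntimeError("local GE failed on all nodes")
    return max(candidates, key=lambda c: min(abs(v) for _, v in c[1]))
```

In a distributed implementation the loop over nodes would run in parallel, and the gather plus the broadcast of the adopted set would be the only synchronous communication for these d steps.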

Comparison with PP
- Selected pivots
  - PP selects the pivot of each step independently
  - BP takes the pivots of d contiguous steps from a single node
  - Pivots may be worse than with PP
- Computation costs
  - PP: [formula not preserved]
  - BP: [formula not preserved], including an extra term for local GE
  - The difference is small if d << n

Comparison with Other Pivoting Methods
- Threshold pivoting [Malard 91] etc.
  - It may not select the 'best' pivot. An element may become the pivot if [condition not preserved] holds
  - Good: it can reduce communication for row exchange
  - Bad: not latency tolerant
- Pairwise pivoting [Sorensen 85] etc.
  - It repeatedly takes two adjacent rows and eliminates one of the two (cf. bubble sort)
  - Good: it enables pipelining of pivot selections
  - Bad: numerically unstable

Environment for Parallel Experiments
- 192-node Linux cluster
  - Dual Xeon 2.4/2.8 GHz (1 CPU per node is used)
  - Gigabit Ethernet
  - Latencies: 55 to 75 us
- BP is implemented by modifying HPL
  - mpich 1.2.6
  - BLAS library by Kazushige Goto

Network Structure of Cluster
[Figure: switch (SW) topology of the ISTBS cluster at U. Tokyo, with groups of 14 or 20 nodes per switch, an FS, and 1 Gbps, 2 Gbps, and 4 Gbps links]

Basic Parallel Performance
- Comparing speeds of PP and BP (d = 4, 16, 64)
  - 32 to 160 nodes
  - n = 32768, sb = 256
  - No emulated latency
- BP shows similar scalability to that of PP
- BP suffers from overheads of additional computation
  - 7 to 15% with d = 64

Performance with Large Latencies
With latencies, BP is much faster
- Large latencies are added
  - +2 ms, +5 ms, +10 ms between all pairs of nodes
  - 64 (= 8 x 8) nodes
  - n = 32768, sb = 256
- BP is stable against large latencies!
  - When d is larger, it gets more tolerant of latencies

Evaluation Method of Numerical Accuracy
We conducted experiments to evaluate numerical accuracy
- Partial, Batched, Threshold, Pairwise, and No pivoting are compared
  - Done on a single node
  - In BP, blocks of size 64 are regarded as nodes
- 100 random matrices for each condition
  - Matrix sizes are 128 to 2048
- Normalized residuals are evaluated
  - [formula not preserved], where x is the computed solution and ε is the machine epsilon
- Next slide shows the average residuals

Numerical Accuracy
- PP achieves the best accuracy
- No pivoting and Pairwise are numerically unstable
- BP and Threshold achieve accuracy comparable to PP
  - Average residuals of BP (d = 4) are 1.1 to 1.6 times those of PP
  - The sizes of residuals depend on d
- Tradeoff between accuracy and latency tolerance

Summary
- A GE algorithm that tolerates the large latencies of Grids
  - Batched pivoting greatly reduces the number of synchronous communications
  - BP achieves numerical accuracy comparable to PP

Future Work
- Performance evaluation on an actual Grid
- Improvement of accuracy
  - Combining batched pivoting with complete or rook pivoting
- Theoretical error analysis
  - cf. average case analysis by Trefethen et al.

Another Approach: Column Distribution
- When each column is placed on a single node, synchronization is not necessary
  - It is latency tolerant, but…
- Slower because of the increase in communication amount

Why PP Is Fragile to Large Latencies
- Batching several steps is a well-known technique for row exchange and update
- Then, can we reduce synchronizations for pivoting?
- No. Pivoting cannot be batched or pipelined
  - Each pivot selection depends on the pivoting of preceding steps!
  - In total, n synchronizations are required
- If latencies are too large, synchronization costs become the bottleneck

Bandwidth Requirements
- Estimation of limit speed when bisection bandwidth is given
  - Depends on n
  - Very optimistic
- Our experiments require 250 Mbps
- Bluegene/L (Jun. 2005) requires 5 Gbps
[Figure legend: Bluegene/L, Our experiments]

Effects of Latencies
- Estimation of limit speed when latency is given
  - Depends on n
  - Very optimistic
- With >7 ms latencies, we can never obtain the performance of Bluegene/L
[Figure legend: Bluegene/L, Our experiments]

Displayed Residual of HPL Differs from Actual Value
- Displayed results

  ======================================
  T/V        N    NB   P   Q   Time   Gflops
  --------------------------------------
  W21L2L4    1024 256  1   1   0.30   2.357e+00
  --------------------------------------
  ||Ax-b||_oo / ( eps * ||A||_1 * N )            = 0.0307237 ...... PASSED
  ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )      = 0.0135777 ...... PASSED
  ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )    = 0.0035758 ...... PASSED

  The division by N is not shown in the third displayed formula.

- Source code (HPL_pdtest.c)

  resid1 = resid0 / ( TEST->epsil * Anorm1 * (double)(N) );
  resid2 = resid0 / ( TEST->epsil * Anorm1 * Xnorm1 );
  resid3 = resid0 / ( TEST->epsil * AnormI * XnormI * (double)(N) );

  The third residual is actually divided by N.

Numerical Accuracy (2)
Compares PP and BP with large matrices
- Matrices are generated by HPL
- On 64 (= 8 x 8) nodes

Detail of Accuracy (Small Matrices)
[Figures: worst case in 100 trials; standard deviation]

Detail of Accuracy (Large Matrices)
- From left to right:
  ||Ax-b||_oo / ( eps * ||A||_1 * N )
  ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
  ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo * N )
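The three normalized residuals listed here can be computed as in the following plain-Python sketch. It follows the formulas as written on the slides, with N in the third denominator. The function name is hypothetical, and `sys.float_info.epsilon` is used for eps as an assumption; the value HPL itself uses may differ.

```python
import sys


def hpl_style_residuals(a, x, b):
    """Return the three normalized residuals for a computed solution x
    of A x = b (a: list of rows, x and b: lists of floats)."""
    n = len(a)
    eps = sys.float_info.epsilon  # assumption: HPL's own eps may differ
    # Infinity norm of the residual vector A x - b.
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    resid0 = max(abs(v) for v in r)
    # ||A||_1 is the largest column sum, ||A||_oo the largest row sum.
    anorm1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    anorm_inf = max(sum(abs(v) for v in row) for row in a)
    xnorm1 = sum(abs(v) for v in x)
    xnorm_inf = max(abs(v) for v in x)
    resid1 = resid0 / (eps * anorm1 * n)
    resid2 = resid0 / (eps * anorm1 * xnorm1)
    resid3 = resid0 / (eps * anorm_inf * xnorm_inf * n)
    return resid1, resid2, resid3
```

Dividing by eps and the norms makes the values roughly independent of the scale of A and x, so residuals from different matrix sizes and pivoting methods can be compared on one plot.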

Notes
- Simple batched pivoting may fail if the local GEs fail on all nodes
  - This situation may occur with (nearly) sparse matrices
- We can recover by restarting from the failed column
  - The number of synchronizations then gets closer to that of partial pivoting