- Slides: 35
Highly Latency Tolerant Gaussian Elimination
Toshio ENDO, Kenjiro TAURA
University of Tokyo
Grid 05 workshop, Highly Latency Tolerant GE/T. Endo
Background
- Demands for large-scale computing are increasing
- Grid computing is attractive for improving cost performance
- Grid has been successful for applications with little communication
  - Master-worker, parameter sweeping, …@home projects
- Evaluations of applications with frequent communication on Grid are still rare
  - Matrix operations, PDE solvers
  - MPICH-G2, Cactus-G
Obstacles to Running Apps with Frequent Comm.
- Volatility and heterogeneity of computing nodes
  - Will be solved by progress in programming tools
- Low bandwidth of WAN
  - Will be solved by improvements to WAN
- Large latencies of WAN
  - More than 10 ms on Grid >> a few µs on supercomputers
- Large latencies will remain an obstacle!
- Algorithms that tolerate latencies are important
Target Computation
- Gaussian elimination (GE) of dense matrices for solving linear equations
  - Same as LU decomposition, Linpack
- Used for
  - Fluid simulations, structural analysis
  - Top 500 ranking
- Difficult to achieve good performance with large latencies
  - Partial pivoting (PP) introduces frequent synchronous communications
Overview of This Work
- A Gaussian elimination algorithm that tolerates large latencies is presented
- An alternative pivoting method, named batched pivoting (BP), is proposed
  - More latency tolerant than PP
  - BP can greatly reduce the frequency of synchronous communications
Outline
- Gaussian elimination with partial pivoting
- Batched pivoting
- Evaluation
  - Latency tolerance
  - Numerical accuracy
- Summary
Gaussian Elimination with Partial Pivoting
- GE of an n×n matrix A: for k = 1 to n
  - Pivoting: find the element with the largest absolute value (the pivot) in the k-th column, on or below the diagonal
  - Row exchange
  - Update
- Since the pivot is used as a divisor, its absolute value should be large
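The three steps of each iteration (pivoting, row exchange, update) can be sketched in plain Python. This is a minimal sequential illustration, not the HPL implementation; the function names `ge_partial_pivoting` and `solve` are hypothetical.

```python
def ge_partial_pivoting(a):
    """Factor a square matrix (list of lists) in place; return the row permutation."""
    n = len(a)
    perm = list(range(n))
    for k in range(n):
        # Pivoting: row with the largest |a[i][k]| among rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        # Row exchange.
        a[k], a[p] = a[p], a[k]
        perm[k], perm[p] = perm[p], perm[k]
        # Update: keep multipliers below the diagonal, eliminate the rest.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            m = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    return perm


def solve(a, b):
    """Solve Ax = b using the factorization above (a and b are not modified)."""
    lu = [row[:] for row in a]
    perm = ge_partial_pivoting(lu)
    n = len(lu)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):              # forward substitution (unit lower triangle)
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):    # back substitution (upper triangle)
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y
```

In a distributed setting, the `max` over column k is the step that needs all nodes holding part of that column to agree, which is why it becomes a synchronization point.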
Problem of GE with PP
- Well-known distribution: 2D block cyclic distribution
  - Figure: n×n matrix with block size sb, number of nodes p = 6 (= 2 × 3)
  - Good: communication amount is small, O(n^2 √p)
- Each column is partitioned among nodes
  - Each pivot selection requires synchronization
- With large latencies, synchronization costs become the bottleneck
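To see why every column ends up split across nodes, the following sketch maps a matrix element to its owner under a 2D block cyclic distribution. The helper names and the 0-based indexing are assumptions made for illustration; `pr` and `pc` stand for the two dimensions of the process grid (2 and 3 in the slide's figure).

```python
def owner(i, j, sb, pr, pc):
    """Return (process row, process column) owning element (i, j), 0-based."""
    return ((i // sb) % pr, (j // sb) % pc)


def nodes_sharing_column(j, n, sb, pr, pc):
    """All nodes that hold some part of column j of an n x n matrix."""
    return sorted({owner(i, j, sb, pr, pc) for i in range(n)})
```

For example, with `sb=2` on a 2 × 3 grid, `nodes_sharing_column(0, 8, 2, 2, 3)` contains both process rows, so selecting the pivot of that column needs communication between them.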
Performance of GE/PP with Large Latencies
- We emulated large artificial latencies on a Linux cluster
  - Identical latencies are inserted between all pairs of nodes
  - base, +2 ms, +5 ms, +10 ms
- High Performance Linpack (HPL) is measured
  - GE with PP
  - Matrix size n = 32768
  - 64 (= 8 × 8) nodes
- With +10 ms latency, it gets 6 times slower!
- GE with PP is weak against large latencies
How about Other Pivoting Methods?
- Pivoting methods, from strict to relaxed:
  - Complete
  - Rook [Neal 92]
  - Partial
  - Threshold [Malard 91] etc.
  - Batched (ours)
  - Pairwise [Sorensen 85] etc.
  - No pivoting
- Methods toward the strict end are not latency tolerant
- Methods toward the relaxed end are numerically unstable
The Aim of Batched Pivoting (BP)
- BP reduces the frequency of synchronous communications
- We batch the pivot selections of several contiguous steps
  - The batch size d is determined in advance
  - Synchronous communications occur only every d steps
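A rough count shows what batching buys. The sketch below assumes, as a simplification not stated in the slides, that each synchronous pivot selection costs exactly one network latency; a real collective operation may cost several. The function names are hypothetical.

```python
import math


def pivot_syncs(n, d=1):
    """Number of synchronous pivot selections: one per batch of d steps."""
    return math.ceil(n / d)


def sync_time_seconds(n, latency_s, d=1):
    """Estimated time spent waiting, assuming one latency per synchronization."""
    return pivot_syncs(n, d) * latency_s
```

With n = 32768 and 10 ms of latency, this model gives about 328 s of waiting for PP (d = 1) and about 5 s for BP with d = 64.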
Batched Pivoting Algorithm (1)
- Algorithm that selects d contiguous pivots
  - In the figure, d columns are partitioned between P1 and P2
  - Each node duplicates the columns and makes a sub-matrix
  - Each node locally and speculatively performs GE with PP
    - Each node obtains d pivot candidates
- Figure: blocks of size sb assigned to nodes P1 to P4, with a panel of d columns
Batched Pivoting Algorithm (2)
- Sets of pivot candidates are gathered
- The 'best' set is selected
  - We try to avoid bad pivots
- Contents of the best set are broadcast as the final pivots
- Example (d = 4 in the figure):
  - P1: "I recommend 4.8 on the 50th row, -2.5 on the 241st row, 4.3 on the 285th row, -3.6 on the 36th row"
  - P2: "I recommend -9.2 on the 310th row, 6.8 on the 121st row, 0.8 on the 170th row, -5.9 on the 146th row"
  - The sets are compared and the best one is adopted
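The two phases can be sketched as follows: each node runs GE with PP on a private copy of its rows of the d panel columns, and one whole candidate set is then chosen. The slides do not state how the 'best' set is judged, so the criterion used here (the set whose smallest absolute pivot is largest) is an assumption, and the names `local_candidates` and `select_batch` are hypothetical.

```python
def local_candidates(rows, row_ids, d):
    """Speculative local GE with PP on a copy of one node's rows.

    rows: list of length-d lists (this node's part of the d panel columns).
    row_ids: global row numbers matching rows.
    Returns d (row id, pivot value) pairs, or None if the local GE fails.
    """
    sub = [r[:] for r in rows]      # work on a copy; the real data is untouched
    ids = list(row_ids)
    if len(sub) < d:
        return None
    picks = []
    for k in range(d):
        p = max(range(k, len(sub)), key=lambda i: abs(sub[i][k]))
        if sub[p][k] == 0.0:
            return None             # no usable pivot on this node
        sub[k], sub[p] = sub[p], sub[k]
        ids[k], ids[p] = ids[p], ids[k]
        picks.append((ids[k], sub[k][k]))
        for i in range(k + 1, len(sub)):
            m = sub[i][k] / sub[k][k]
            for j in range(k, d):
                sub[i][j] -= m * sub[k][j]
    return picks


def select_batch(candidate_sets):
    """Adopt one node's whole set. Assumed criterion: largest minimum |pivot|."""
    valid = [s for s in candidate_sets if s is not None]
    if not valid:
        raise ValueError("local GE failed on all nodes")
    return max(valid, key=lambda s: min(abs(v) for _, v in s))
```

Applied to the two sets quoted on the slide, this assumed criterion would pick P1's set, because its smallest pivot magnitude (2.5) exceeds P2's (0.8). Only the gather and the broadcast of the chosen set need communication, once per d steps.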
Comparison with PP
- Selected pivots
  - PP selects the pivot of each step independently
  - BP takes the pivots of d contiguous steps from a single node
  - Pivots may be worse than those of PP
- Computation costs
  - BP adds the cost of the local GE to the cost of PP
  - The difference is small if d << n
Comparison with other pivoting methods
- Threshold pivoting [Malard 91] etc.
  - It may not select the 'best' pivot; an element may become the pivot if it satisfies the threshold condition
  - Good: it can reduce communication for row exchange
  - Bad: not latency tolerant
- Pairwise pivoting [Sorensen 85] etc.
  - It repeatedly takes two adjacent rows and eliminates one of the two (cf. bubble sort)
  - Good: it enables pipelining of pivot selections
  - Bad: numerically unstable
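For reference, threshold pivoting is commonly described as accepting any entry whose magnitude is at least a fraction τ of the column maximum. The sketch below uses that textbook form; the parameter `tau`, the `preferred` row argument, and the function name are assumptions for illustration and are not taken from the slides.

```python
def threshold_pivot(col, k, tau, preferred=None):
    """Choose a pivot row index among col[k:].

    An entry is acceptable if |col[i]| >= tau * max|col[k:]|, with 0 < tau <= 1.
    If the preferred row (assumed to be >= k) is acceptable it is kept, for
    example to avoid exchanging rows with another node; otherwise the largest
    entry is used, as in partial pivoting.
    """
    best = max(range(k, len(col)), key=lambda i: abs(col[i]))
    if preferred is not None and abs(col[preferred]) >= tau * abs(col[best]):
        return preferred
    return best
```

Note that the column maximum still has to be known before the test can be applied, which is consistent with the slide's remark that the method is not latency tolerant.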
Environment for Parallel Experiments
- 192-node Linux cluster
  - Dual Xeon 2.4/2.8 GHz (1 CPU per node is used)
  - Gigabit Ethernet
  - Latencies: 55 to 75 µs
- BP is implemented by modifying HPL
  - mpich 1.2.6
  - BLAS library by Kazushige Goto
Network Structure of Cluster
- Figure: ISTBS cluster at U. Tokyo. Labels: SW, FS, 14, 20; link speeds 1 Gbps, 2 Gbps, 4 Gbps
Basic Parallel Performance
- Comparing speeds of PP and BP (d = 4, 16, 64)
  - 32 to 160 nodes
  - n = 32768, sb = 256
  - No emulated latency
- BP shows similar scalability to that of PP
- BP suffers from the overhead of additional computation
  - 7 to 15% with d = 64
Performance with Large Latencies
- Large latencies are added
  - +2 ms, +5 ms, +10 ms between all pairs of nodes
  - 64 (= 8 × 8) nodes
  - n = 32768, sb = 256
- With latencies, BP is much faster
- BP is stable against large latencies!
  - When d is larger, it gets more tolerant of latencies
Evaluation Method of Numerical Accuracy
- We conducted experiments to evaluate numerical accuracy
- Partial, batched, threshold, pairwise, and no pivoting are compared
  - Done on a single node
  - In BP, blocks of size 64 are regarded as nodes
- 100 random matrices for each condition
  - Matrix sizes are 128 to 2048
- Normalized residuals are evaluated
  - Defined using the computed solution and the machine epsilon ε
  - The next slide shows the average residuals
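As an illustration of a normalized residual, the sketch below computes ||Ax-b||_oo / ( eps * ||A||_1 * N ), one of the HPL-style scalings listed later in the deck. Which scaling this slide used is an assumption, as is the use of Python's `sys.float_info.epsilon` for the machine epsilon (HPL may follow a different convention). The function name is hypothetical.

```python
import sys


def normalized_residual(a, x, b):
    """Return ||Ax - b||_inf / (eps * ||A||_1 * n) for a square matrix a."""
    n = len(a)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    resid_inf = max(abs(v) for v in r)                    # infinity norm of Ax - b
    a_norm1 = max(sum(abs(a[i][j]) for i in range(n))     # 1-norm: max column sum
                  for j in range(n))
    eps = sys.float_info.epsilon                          # assumption, see lead-in
    return resid_inf / (eps * a_norm1 * n)
```

A value of order 1 means the residual is about as small as rounding allows for that matrix size; unstable pivoting shows up as values many orders of magnitude larger.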
Numerical Accuracy
- PP achieves the best accuracy
- No pivoting and pairwise pivoting are numerically unstable
- BP and threshold pivoting achieve accuracy comparable to PP
  - Average residuals of BP (d = 4) are 1.1 to 1.6 times those of PP
  - The sizes of residuals depend on d
- Tradeoff between accuracy and latency tolerance
Summary
- A GE algorithm that tolerates the large latencies of Grid
  - Batched pivoting greatly reduces the number of synchronous communications
  - BP achieves numerical accuracy comparable to PP
Future Work
- Performance evaluation on an actual Grid
- Improvement of accuracy
  - Combining batched pivoting with complete or rook pivoting
- Theoretical error analysis
  - cf. average case analysis by Trefethen et al.
Another Approach: Column Distribution
- When each column is placed on a single node, synchronization is not necessary
  - It is latency tolerant, but…
- Slower because of the increase in communication amount
Why PP is fragile to large latencies
- Batching several steps is a well-known technique for row exchange and update
- Then, can we reduce synchronizations for pivoting?
- No. Pivoting cannot be batched or pipelined
  - Each pivot selection depends on the pivoting of the preceding steps!
  - In total, n synchronizations are required
- If latencies are too large, synchronization costs become the bottleneck
Bandwidth Requirements
- Estimation of the limit speed when the bisection bandwidth is given
  - Depends on n
  - Very optimistic
  - Plot markers: Bluegene/L, our experiments
- Our experiments require 250 Mbps
- Bluegene/L (Jun. 2005) requires 5 Gbps
Effects of latencies
- Estimation of the limit speed when the latency is given
  - Depends on n
  - Very optimistic
  - Plot markers: Bluegene/L, our experiments
- With latencies above 7 ms, we can never obtain the performance of Bluegene/L
Displayed residual of HPL differs from actual value
- Displayed results:
    ======================================
    T/V        N     NB    P    Q    Time    Gflops
    --------------------------------------
    W21L2L4    1024  256   1    1    0.30    2.357e+00
    --------------------------------------
    ||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.0307237 ... PASSED
    ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0135777 ... PASSED
    ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035758 ... PASSED
  - The division by N is not shown for the third residual
- Source code (HPL_pdtest.c):
    resid1 = resid0 / ( TEST->epsil * Anorm1 * (double)(N) );
    resid2 = resid0 / ( TEST->epsil * Anorm1 * Xnorm1 );
    resid3 = resid0 / ( TEST->epsil * AnormI * XnormI * (double)(N) );
  - In the code, the third residual is actually divided by N
Numerical Accuracy (2)
- Compares PP and BP with large matrices
  - Matrices are generated by HPL
  - On 64 (= 8 × 8) nodes
Detail of accuracy (Small matrices)
- Figure panels: worst case in 100 trials; standard deviation
Detail of accuracy (Large matrices)
- From left to right:
  - ||Ax-b||_oo / ( eps * ||A||_1 * N )
  - ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
  - ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo * N )
Notes
- Simple batched pivoting may fail if the local GEs fail on all nodes
  - This situation may occur with (nearly) sparse matrices
- We can recover by restarting from the failed column
  - The number of synchronizations then gets closer to that of partial pivoting