Processes Distribution of Homogeneous Parallel Linear Algebra Routines
Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters

Javier Cuenca, Luis Pedro García, Domingo Giménez
Scientific Computation Research Group, University of Murcia, Spain

Antonio Javier Cuenca Muñoz
Dpto. Ingeniería y Tecnología de Computadores

Jack Dongarra
Innovative Computing Laboratory, University of Tennessee, USA
Introduction

Automatically Optimised Linear Algebra Software
- Objective:
  - Software capable of tuning itself according to the execution environment
- Motivation:
  - Non-expert users take decisions about the computation
  - Software should adapt to the continuous evolution of hardware
  - Developing efficient code by hand consumes a large quantity of resources
  - System computation capabilities are very variable
- Some examples of auto-tuning software:
  - ATLAS, LFC, FFTW, I-LIB, FIBER, mpC, BeBOP, FLAME, ...
Automatic Optimisation on Heterogeneous Parallel Systems
- Two possibilities on heterogeneous systems:
  - HoHe: heterogeneous algorithms (heterogeneous distribution of data)
  - HeHo: homogeneous algorithms with a heterogeneous assignation of processes:
    - A variable number of processes is assigned to each processor, depending on the relative speeds
    - The mapping of processes to processors must be made without a large execution time in the decision taking
    - Theoretical models: parameters which represent the characteristics of the system
    - The general assignation problem is NP-hard, hence the use of heuristic approximations
Our previous HoHo methodology
- Routine model:
  - n: problem size
  - SP: system parameters
    - Computation and communication characteristics of the system
  - AP: algorithm parameters
    - Block size, number of processors to use, logical configurations of the processors, ... (with one process per processor)
    - Values are chosen when the routine begins to run
Our HeHo methodology
- Modifications in the routine model:
  - New AP:
    - Number of processes to generate
    - Mapping of processes to processors
  - Changes in the SP values:
    - More than one process per processor: each SPi in processor i is considered di times higher, with di the number of processes assigned to processor i
    - Implicit synchronization: the global value of each SPi is taken as the maximum value over all the processors
      - The slowest process forces the other ones to reduce their speed, waiting for it at the different synchronization points of the routine
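The two SP adjustments above can be sketched as follows (a minimal illustration; the function name and the numeric values are ours, not measurements from the paper):

```python
# Sketch of how an SP value changes in the HeHo model (hypothetical values).
# base_sp[i]: a system parameter (e.g. cost per flop) measured on processor i
# with a single process; d[i]: number of processes assigned to processor i.

def effective_sp(base_sp, d):
    # Running d[i] processes on processor i makes each of them ~d[i] times slower.
    per_proc = [sp * di for sp, di in zip(base_sp, d) if di > 0]
    # Implicit synchronization: the slowest process sets the pace for all,
    # so the global SP value is the maximum over the processors in use.
    return max(per_proc)

base_sp = [0.01, 0.01, 0.02]     # hypothetical per-flop costs of 3 processors
d = [2, 1, 1]                    # 2 processes on processor 0, 1 on the others
print(effective_sp(base_sp, d))  # → 0.02: processor 0 (0.01*2) ties processor 2
```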
Our HeHo methodology: an example of routine model
- LU factorisation, parallel version. Model:
  - SP: system parameters
    - k3_DGEMM, k3_DTRSM, k2_DGETF2
    - ts, tw
  - AP: algorithm parameters
    - b: block size
    - P: number of processors
    - p: number of processes
    - Mapping of the p processes on the P processors
    - p = r × c: logical configuration of the processes as a 2D mesh
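How such a model combines the SP and AP values can be sketched as an arithmetic-plus-communication sum (an illustrative stand-in only: the term structure, the function name and all numeric values are our assumptions, not the paper's actual LU model):

```python
# Hypothetical EET sketch for parallel blocked LU: k3 is a per-flop cost,
# ts/tw the message start-up and per-word times, b the block size, r x c
# the logical process mesh. Illustrative, not the model from the paper.

def eet_lu(n, b, r, c, k3, ts, tw):
    p = r * c                            # total processes in the r x c mesh
    arith = k3 * (2.0 * n**3 / 3.0) / p  # dominant O(n^3) work shared by p
    steps = n // b                       # one block column per step
    # Rough per-step broadcasts along process rows and process columns:
    comm = steps * ((ts + tw * n * b / r) + (ts + tw * n * b / c))
    return arith + comm

# Compare two logical topologies for the same 8 processes (made-up SPs):
for r, c in [(2, 4), (1, 8)]:
    print((r, c), eet_lu(2048, 32, r, c, k3=1e-9, ts=1e-4, tw=1e-7))
```

Even this crude sketch reproduces the qualitative effect seen in the tables below: for the same number of processes, the squarer mesh pays less for communication.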
Our HeHo methodology: an example of routine model
- Platforms:
  - SUNEt:
    - Five SUN Ultra 1
    - One SUN Ultra 5
    - Interconnection network: Ethernet
  - TORC (Innovative Computing Laboratory):
    - 21 nodes of different types: dual and single processors; Pentium II, III and 4; AMD Athlon
    - Interconnection networks: FastEthernet, Giganet, Myrinet
Our HeHo methodology: an example of routine model
- Theoretical vs. experimental time on SUNEt, n = 2048
- Mapping of 8 processes on the 6 processors:

        Mapping       Logical topology  Block size
  AP1   (1, 1, 1, 3)  2×4               32
  AP2   (2, 1, 1, 2)  2×4               32
  AP3   (2, 2, 1, 1)  2×4               32
  AP4   (1, 1, 1, 3)  2×4               64
  AP5   (2, 1, 1, 2)  2×4               64
  AP6   (2, 2, 1, 1)  2×4               64
  AP7   (1, 1, 1, 3)  1×8               32
  AP8   (2, 1, 1, 2)  1×8               32
  AP9   (2, 2, 1, 1)  1×8               32
  AP10  (1, 1, 1, 3)  1×8               64
  AP11  (2, 1, 1, 2)  1×8               64
  AP12  (2, 2, 1, 1)  1×8               64
Our HeHo methodology: an example of routine model
- Theoretical vs. experimental time on TORC, n = 4096
- Mapping of 8 processes on 19 processors:

        Mapping                      Logical topology  Block size
  AP1   (1, 0, 1, 0, 0, 0, 0)        4×2               32
  AP2   (1, 0, 1, 0, 0, 0, 0)        8×1               32
  AP3   (1, 0, 1, 0, 0, 2, 0)        4×2               32
  AP4   (1, 0, 1, 0, 0, 2, 0)        8×1               32
  AP5   (1, 0, 0, 0, 0, 2, 2, 1)     4×2               32
  AP6   (1, 0, 0, 0, 0, 2, 2, 1)     8×1               32
  AP7   (1, 0, 1, 0, 0, 0, 2, 0, 0)  4×2               32
  AP8   (1, 0, 1, 0, 0, 0, 2, 0, 0)  8×1               32
Our HeHo methodology
- Our approach: an assignment tree
  - Each level down assigns one more process to one of the P processors; the children of a node are the processors with an index not lower than the last one assigned, down to p processes (figure: assignment tree over P processors)
- A limit on the height of the tree (number of processes) is necessary
- Each node represents a possible solution: a mapping of processes to processors
- The other APs (block size, logical topology) are chosen at each node
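The nodes of such a tree can be enumerated compactly: restricting children to non-decreasing processor indices makes each node a combination with repetition (a sketch; the function name is ours):

```python
from itertools import combinations_with_replacement

# Each node of the assignment tree at depth p is a non-decreasing tuple of
# processor indices, one entry per process. Limiting the depth (max_processes)
# bounds the number of processes considered, as the slide requires.
def assignment_nodes(P, max_processes):
    for p in range(1, max_processes + 1):
        yield from combinations_with_replacement(range(P), p)

# With P = 3 processors and at most 2 processes the tree holds 9 nodes:
print(list(assignment_nodes(3, 2)))
# [(0,), (1,), (2,), (0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]
```

In the real search, of course, the tree is pruned rather than enumerated exhaustively.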
Our HeHo methodology
- For each node:
  - EET(node): Estimated Execution Time
    - Optimization problem: finding the node with the lowest EET
  - LET(node): Lowest Execution Time
  - GET(node): Greatest Execution Time
    - LET and GET are lower and upper bounds on the optimum solution of the subtree below the node
  - LET and GET are used to limit the number of nodes evaluated:
    - MEET = min over the evaluated nodes of GET(node)
    - If LET(node) > MEET, do not work below this node
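The pruning rule above is a standard branch-and-bound scheme, sketched generically below (the function is ours; the real EET/LET/GET computations are those of Methods 1-3 on the next slides):

```python
# Generic branch-and-bound over the assignment tree (illustrative only).
# eet/let/get are callables giving the estimated time and its lower/upper
# bounds for a node; children(node) expands the node by one more process.

def search(root, children, eet, let, get):
    best_node, best_eet = None, float("inf")
    meet = float("inf")              # MEET: min of GET over evaluated nodes
    stack = [root]
    while stack:
        node = stack.pop()
        meet = min(meet, get(node))  # tighten the global upper bound
        if let(node) > meet:         # the optimum below node cannot beat MEET
            continue                 # prune the whole subtree
        if eet(node) < best_eet:
            best_node, best_eet = node, eet(node)
        stack.extend(children(node))
    return best_node, best_eet
```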
Our HeHo methodology
- Automatic searching strategies in the assignment tree:
  - Method 1: backtracking; GET = EET
  - Method 2: backtracking; GET obtained with a greedy approach
  - Method 3: backtracking; GET and LET obtained with a greedy approach
  - Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
  - Method 5: greedy method on a permutational tree with repetitions
Our HeHo methodology
- Method 1:
  - Backtracking
  - GET = EET
  - LET = LETari + LETcom
    - LETari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
    - LETcom: assuming the best logical topology of processes that can be obtained from this node
Our HeHo methodology
- Method 2:
  - Backtracking
  - GET obtained with a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution
  - LET = LETari + LETcom
    - LETari: the sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
    - LETcom: assuming the best logical topology of processes that can be obtained from this node
  - Fewer nodes are analyzed, but the evaluation cost per node increases
Our HeHo methodology
- Method 3:
  - Backtracking
  - GET obtained with a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution
  - LET = LETari + LETcom
    - LETari obtained with a greedy approach: for each node, the child that least increases the cost of the arithmetic operations is included in the solution, to obtain the lower bound
    - LETcom: assuming the best logical topology of processes that can be obtained from this node
  - It is possible that a branch leading to an optimal solution will be discarded
Our HeHo methodology
- Method 4:
  - Greedy method on the current assignment tree (a combinatorial tree with repetitions)
- Method 5:
  - Greedy method on a permutational tree with repetitions
- In both methods 4 and 5, to obtain better logical topologies of the processes, the traversal continues (through the best child of each node) until the established maximum level is reached
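The greedy descent shared by methods 4 and 5 can be sketched as follows (a sketch with an illustrative EET stand-in; the function name is ours):

```python
def greedy_descent(root, children, eet, max_level):
    # Follow the best child at each level; keep descending down to max_level
    # even after EET stops improving, so that deeper nodes with better
    # logical topologies can still be reached. Return the best node seen.
    node, best_node, best_eet = root, root, eet(root)
    for _ in range(max_level):
        kids = children(node)
        if not kids:
            break
        node = min(kids, key=eet)    # best child by estimated execution time
        if eet(node) < best_eet:
            best_node, best_eet = node, eet(node)
    return best_node, best_eet
```

Only one child is expanded per level, which is why the t.t.t. of methods 4 and 5 in the result tables is orders of magnitude below that of the backtracking methods.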
Experimental Results
- Human searching strategies in the assignment tree:
  - Greedy User (GU): ALL the available processors, one process per processor
  - Conservative User (CU): HALF of the available processors, one process per processor
  - Expert User (EU): 1 processor, HALF or ALL the processors depending on the problem size; one process per processor
Experimental Results
- Automatic decisions vs. users, on SUNEt (n = 7680):

  Method  Processes mapping   b    Logical topology  Solution  t.t.t.  Level
  1       (1, 1, 1, 1)        64   2×3               718.94    0.02    25
  2       (1, 1, 1, 1)        64   2×3               718.94    0.04    25
  3       (1, 1, 1, 1)        64   2×3               718.94    0.02    25
  4       (1, 1, 0, 0, 0, 1)  128  1×3               887.85    0.0001  25
  5       (1, 1, 0, 0, 0, 1)  128  1×3               887.85    0.0005  25
  CU      (1, 1, 0, 0, 0, 1)  128  1×3               1047.13
  GU      (1, 1, 1, 1)        64   2×3               887.85
  EU      (1, 1, 1, 1)        64   2×3               887.85
Experimental Results
- Automatic decisions vs. users, on TORC (n = 2048):

  Method  Processes mapping            b   Logical topology  Solution  t.t.t.  Level
  1       (1, 1, 1, 1, 0, 0)           64  3×5               17.91     3.08    15
  2       (1, 1, 1, 1, 0, 0)           64  3×5               17.91     3.08    15
  3       (1, 1, 1, 1, 0, 0)           64  4×4               15.27     0.06    25
  4       (0, 0, 0, 0, 0, 1, 0)        64  1×1               43.16     0.0012  30
  5       (1, 1, 1, 1, 1, 1, 0, 0)     64  4×4               15.27     0.01    30
  CU      (1, 1, 1, 0, 0, 0, 1, 1, 1)  64  3×3               23.77
  GU      (1, 1, 1, 1, 1, 1)           32  1×19              33.57
  EU      (1, 1, 1, 0, 0, 0, 1, 1, 1)  64  3×3               23.77
Simulations
- Virtual platforms: variations and/or increases of the real platforms:
  - mTORC-01:
    - the quantity of 17P4 is increased to 11
    - Number of processors: 29. Types of processors: 4
  - mTORC-02:
    - the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10 and 20 respectively
    - Number of processors: 50. Types of processors: 4
  - mTORC-03:
    - the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10, respectively
    - additional processors have been included
    - Number of processors: 100. Types of processors: 10
Simulations
- Automatic decisions vs. users, on the virtual platform mTORC-01 (n = 20000)
  - the quantity of 17P4 is increased to 11; 29 processors, 4 types of processors

            Met.1   Met.2   Met.3   Met.4    Met.5    CU  GU  EU
  Solution  666.44  818.82  666.44  1322.23  1145.09
  t.t.t.    20.39   59.45   0.68    0.0007   0.0122
  Level     15      15      20      25       25
Simulations
- Automatic decisions vs. users, on the virtual platform mTORC-02 (n = 20000)
  - the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10 and 20 respectively; 50 processors, 4 types of processors

            Met.1    Met.2    Met.3    Met.4    Met.5    CU       GU       EU
  Solution  3721.98  3791.98  2439.43  1958.43  1500.24  2249.70  2748.36  2249.70
  t.t.t.    259.44   792.32   7.46     0.01     0.07
  Level     15       15       25       30       30
Simulations
- Automatic decisions vs. users, on the virtual platform mTORC-03 (n = 20000)
  - the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10, respectively; additional processors have been included; 100 processors, 10 types of processors

            Met.1     Met.2     Met.3     Met.4    Met.5
  Solution  10712.55  14532.45  10712.55  4333.23
  t.t.t.    109.24    169.72    1274.34   0.08     2.34
  Level     10        10        5         25       40

  CU, GU, EU solutions: 7405.34, 5422.87
Conclusions
- Extension of our previous self-optimisation methodology for homogeneous systems
- On heterogeneous systems, new decisions:
  - Number of processes
  - Mapping of processes to processors
- Good results with parallel LU factorisation
- The same methodology could be applied to other linear algebra routines