Matrix Multiplication on Two Interconnected Processors
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin
HeteroPar'06, Barcelona, Sept. 28, 2006

Outline
● Motivation and Goals
● Introduction: ‘Straight-Line’ Partitionings
● The ‘Square-Corner’ Partitioning
  - Minimizing the Total Volume of Communication
● MPI Experiments / Results
● Conclusion / Future Work

Motivation and Goals
● Partitioning algorithms for matrix-matrix multiplication (MMM) designed for n processors produce partitionings that are not always optimal on a small number of processors.
● We seek to lower the Total Volume of Communication by using a new partitioning strategy.
● Our ultimate interest is to determine whether the Square-Corner Partitioning is a viable technique for deployment on two interconnected clusters.

Background: Straight-Line Partitioning
● The Total Volume of Inter-Processor Communication (TVC) is proportional to the Sum of Half-Perimeters (S) of the partitions.
● The Lower Bound (L) of S is reached when all partitions are square.
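
The relationship between S and L can be illustrated with a minimal sketch. It assumes rectangular partitions given as hypothetical (width, height) pairs on a unit square, and uses the fact that a rectangle of area a has a half-perimeter of at least 2·sqrt(a), with equality only for a square. The function names are illustrative and do not come from the talk.

```python
import math

def sum_half_perimeters(rects):
    """Sum of (width + height) over a list of (width, height) rectangles."""
    return sum(w + h for w, h in rects)

def lower_bound(areas):
    """Lower bound on the sum of half-perimeters: every partition is a square."""
    return sum(2.0 * math.sqrt(a) for a in areas)

# Hypothetical example: a unit square split by one straight vertical line
# into two rectangles with areas 0.75 and 0.25.
rects = [(0.75, 1.0), (0.25, 1.0)]
areas = [w * h for w, h in rects]

S = sum_half_perimeters(rects)  # (0.75 + 1) + (0.25 + 1)
L = lower_bound(areas)          # 2*sqrt(0.75) + 2*sqrt(0.25)
ratio = S / L                   # always >= 1 for rectangular partitions
print(f"S = {S:.3f}, L = {L:.3f}, S/L = {ratio:.3f}")
```

In this example the two rectangles are not squares, so S stays strictly above L.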

Straight-Line Partitioning
[Figure: average and minimum values of the ratio S/L for two million randomly generated areas]
From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol. 12, No. 10, pp. 1033-1051.

Background: Straight-Line Partitioning, 2 Processors
The Straight-Line Partitioning cannot meet the lower bound, L

Background: Straight-Line Partitioning, 2 Processors
Total Volume of Inter-Processor Communication (TVC) = $N^2$

Introduction: Square-Corner Partitioning

Square-Corner Partitioning
The Square-Corner Partitioning can meet the lower bound, L

Square-Corner Partitioning
[Figure: average and minimum values of the ratio S/L for two million randomly generated areas; Power Ratio > 3:1]
Adapted from: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol. 12, No. 10, pp. 1033-1051.

Square-Corner Partitioning: Minimizing the TVC
Theorem: The Total Volume of Communication is minimized when the slower processor’s partition is a square.
Theorem: The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the processor power ratio is greater than 3:1.
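
The 3:1 threshold can be checked numerically with a rough sketch. It assumes that A, B and C are partitioned identically, that the slower processor owns an s x s corner square whose area matches its share of the total power, and that each processor must receive every element of A and B it needs but does not own. Under those assumptions the slower processor receives 2·s·(N − s) elements and the faster one receives 2·s², which sums to 2·N·s. The straight-line value N² is the one stated earlier in the talk. The function names are hypothetical, and s is treated as a real number rather than rounded to an integer.

```python
import math

def tvc_straight_line(n):
    """Elements exchanged when an n x n problem is split by one straight line."""
    return n * n

def tvc_square_corner(n, ratio):
    """Elements exchanged when the slower processor owns an s x s corner square.

    ratio is the power ratio fast:slow, so the slower processor's share of
    the n*n elements is assumed to be 1 / (ratio + 1).
    """
    s = n * math.sqrt(1.0 / (ratio + 1.0))  # side of the slower processor's square
    # Slower processor receives 2*s*(n - s) elements; faster receives 2*s*s.
    return 2.0 * s * (n - s) + 2.0 * s * s  # simplifies to 2*n*s

n = 6500
for ratio in (1, 2, 3, 4, 5, 10):
    sl = tvc_straight_line(n)
    sc = tvc_square_corner(n, ratio)
    print(f"{ratio}:1  straight-line={sl:.3e}  square-corner={sc:.3e}  SC/SL={sc / sl:.3f}")
```

With these assumptions the two volumes coincide at a ratio of exactly 3:1, and the square-corner volume is smaller for any larger ratio.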

Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 80 Mb/s
● Average Reduction in Communication Time = 45%
● Average Reduction in Execution Time = 14%
Lower TVC → Lower Communication Time → Lower Execution Time

Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 380 Mb/s
● Average Reduction in Communication Time = 44%
● Average Reduction in Execution Time = 10%
Lower TVC → Lower Communication Time → Lower Execution Time

Square-Corner Partitioning: Overlapping Communication and Computation
A sub-partition of Processor 1’s C partition is immediately calculable
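
One way to see which entries are immediately calculable is the brute-force sketch below. The layout is an assumption made for illustration: the slower processor owns the s x s square in the bottom-right corner of A, B and C, and Processor 1 owns everything else. An entry C[i][j] needs all of row i of A and all of column j of B, so it is ready before any communication only if Processor 1 owns both in full.

```python
def immediately_calculable(n, s):
    """Return the (i, j) entries of C that Processor 1 can compute with no communication.

    Assumed layout (hypothetical): the slower processor owns the s x s square
    in the bottom-right corner of A, B and C; Processor 1 owns everything else.
    """
    def owned_by_p1(i, j):
        return not (i >= n - s and j >= n - s)

    ready = []
    for i in range(n):
        for j in range(n):
            if not owned_by_p1(i, j):
                continue  # this entry of C belongs to the slower processor
            # C[i][j] needs row i of A and column j of B in full.
            row_local = all(owned_by_p1(i, k) for k in range(n))
            col_local = all(owned_by_p1(k, j) for k in range(n))
            if row_local and col_local:
                ready.append((i, j))
    return ready

n, s = 6, 2
ready = immediately_calculable(n, s)
print(len(ready), "of", n * n - s * s, "entries owned by Processor 1 need no communication")
```

Under this layout the ready entries form the (N − s) x (N − s) block in the opposite corner, which Processor 1 could compute while the remaining data is in transit.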

Square-Corner Partitioning: Overlapping Communication and Computation
MM Multiplication, N=4500, Bandwidth=100 Mb/s, Ratio=5:1
Algorithm                         Execution Time   Speedup
Straight-Line                     83 s             0.94
Square-Corner (No Overlapping)    69 s             1.13
Square-Corner (Overlapping)       51 s             1.53
Sequential                        78 s             N/A
Overlapping more than doubled the advantage of the Square-Corner algorithm.
● No Overlapping → 17% faster than the Straight-Line algorithm.
● Overlapping → 39% faster than the Straight-Line algorithm.
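
The speedup and percentage figures on this slide can be reproduced from the reported execution times. The short sketch below assumes speedup is measured against the sequential time and that "faster" means the relative reduction in execution time versus the Straight-Line algorithm. The helper names are illustrative.

```python
sequential = 78.0  # seconds, sequential execution time from the slide above
times = {
    "Straight-Line": 83.0,
    "Square-Corner (No Overlapping)": 69.0,
    "Square-Corner (Overlapping)": 51.0,
}

def speedup(t):
    """Speedup relative to the sequential execution time."""
    return sequential / t

def reduction_vs(baseline, t):
    """Percentage reduction in execution time relative to a baseline."""
    return 100.0 * (baseline - t) / baseline

baseline = times["Straight-Line"]
for name, t in times.items():
    print(f"{name}: speedup={speedup(t):.2f}, "
          f"reduction vs Straight-Line={reduction_vs(baseline, t):.0f}%")
```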

Square-Corner Partitioning: Two-Cluster Architecture
Total of 20 Homogeneous Nodes in 2 Clusters

Square-Corner Partitioning: Two Clusters
MM Multiplication, N=9000, Bandwidth=100 Mb/s
All machines are homogeneous. One cluster of 4, one cluster of 16.
Algorithm        Execution Time   Speedup
Straight-Line    123 s            1.04
Square-Corner    115 s            1.11
Sequential       128 s            N/A

Conclusions
● The Square-Corner Partitioning reduces the Total Volume of Communication, provided the processor power ratio is greater than 3:1.
● Overlapping Communication and Computation can bring further reductions in Execution Time.
● The Square-Corner Partitioning is viable on Two Clusters.

Current and Future Work
● We have successfully extended the Square-Corner Partitioning to Three Processors.
To do:
● Experiment on more Two-Cluster architectures
● Overlap Communication and Computation on Two Clusters
● Extend the Three-Processor Algorithm to Three Clusters

Acknowledgements This work was supported by: