Matrix Multiplication in Hadoop
Siddharth Saraph

Matrix Multiplication
• Matrices are very practical: sciences, engineering, statistics, etc.
• Multiplication is a fundamental nontrivial matrix operation.
• Simpler than something like matrix inversion (although the time complexity is the same).
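
For reference, the standard entrywise definition of the product (added here for context, not from the slides): for A an m x n matrix and B an n x p matrix, C = AB is the m x p matrix with

$$C_{ik} = \sum_{j=1}^{n} A_{ij} B_{jk}.$$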

Matrix Multiplication
• Problem: Some people want to use enormous matrices. Cannot be handled on one machine.
• Take advantage of map-reduce parallelism to approach this problem.
• Heuristics:
  – 10,000 x 10,000: 100,000,000 entries
  – 100,000 x 100,000: 10,000,000,000 entries
• In practice: sparse matrices.

First Step: Matrix Representation
• How to represent a matrix for input to a map-reduce job?
• Convenient for sparse matrices: “coordinate list” format.
• (row index, col index, value). (Board.)
• Omit entries with value 0.
• Entries can be in an arbitrary order.
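
As a concrete illustration (a hypothetical sketch, not the author's code): the 2 x 2 matrix [[5, 0], [0, 3]] written in coordinate list format, one "row col value" triple per nonzero entry. The whitespace delimiter is an assumption, since the slides do not fix a concrete file layout.

```java
public class CooExample {
    public static void main(String[] args) {
        double[][] a = { { 5.0, 0.0 }, { 0.0, 3.0 } };
        // Emit one "row col value" line per nonzero entry; zeros are omitted
        // and the lines may appear in any order.
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                if (a[i][j] != 0.0) {
                    System.out.println(i + " " + j + " " + a[i][j]);
                }
            }
        }
        // Output:
        // 0 0 5.0
        // 1 1 3.0
    }
}
```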

Second Step: Map Reduce Algorithm
• Simple “entrywise” method (see the sketch below).
• Various related block methods: matrices are partitioned into smaller blocks and logically processed as blocks. There is an excess of notation and indices to keep track of, and it is easy to get lost.
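
The slides work the algorithm out on the chalkboard; the block below is a minimal sketch of the common one-pass entrywise MapReduce formulation, not the author's actual code. It assumes each input line is tagged, in the form "A i j value" or "B j k value", and that the dimensions needed for replication are passed through the job Configuration under the hypothetical keys matmul.a.rows and matmul.b.cols.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EntrywiseMatMul {

    // Map: a[i][j] contributes to every output cell (i, k), and b[j][k] to
    // every output cell (i, k), so each entry is replicated to all of the
    // (i, k) keys it participates in.
    public static class EntryMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] t = line.toString().split("\\s+");   // "A i j v" or "B j k v"
            int aRows = context.getConfiguration().getInt("matmul.a.rows", 0);
            int bCols = context.getConfiguration().getInt("matmul.b.cols", 0);
            if (t[0].equals("A")) {
                int i = Integer.parseInt(t[1]), j = Integer.parseInt(t[2]);
                for (int k = 0; k < bCols; k++) {
                    context.write(new Text(i + "," + k), new Text("A," + j + "," + t[3]));
                }
            } else {
                int j = Integer.parseInt(t[1]), k = Integer.parseInt(t[2]);
                for (int i = 0; i < aRows; i++) {
                    context.write(new Text(i + "," + k), new Text("B," + j + "," + t[3]));
                }
            }
        }
    }

    // Reduce: for one output cell (i, k), pair a[i][j] with b[j][k] on the
    // shared index j and sum the products.
    public static class EntryReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text cell, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Map<Integer, Double> fromA = new HashMap<>();
            Map<Integer, Double> fromB = new HashMap<>();
            for (Text v : values) {
                String[] t = v.toString().split(",");
                (t[0].equals("A") ? fromA : fromB)
                        .put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
            }
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : fromA.entrySet()) {
                Double b = fromB.get(e.getKey());
                if (b != null) {
                    sum += e.getValue() * b;
                }
            }
            if (sum != 0.0) {
                context.write(cell, new Text(Double.toString(sum)));
            }
        }
    }
}
```

Replicating every A entry p times and every B entry m times is what makes the simple method expensive, which is one motivation for the block methods mentioned above.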

Second Step: Map Reduce Algorithm
• Chalkboard.

Implementation
• Java, not Hadoop streaming. Why?
  – Seemed like a more complex project that would require more control.
  – Custom Key and Value classes.
  – Custom Partitioner class for the block method, for distributing keys to reducers (a sketch follows this list).
  – Learn Java.
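
The custom classes themselves are not shown in the slides; the following is a hypothetical sketch of what a block-oriented Partitioner could look like. It reuses the simple "i,k" Text keys from the earlier sketch rather than the author's custom Key class, and BLOCK_SIZE is an illustrative value.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route all intermediate keys that fall in the same output block to the
// same reducer, so one reducer assembles one block of the result.
public class BlockPartitioner extends Partitioner<Text, Text> {
    private static final int BLOCK_SIZE = 1000; // illustrative block side length

    @Override
    public int getPartition(Text key, Text value, int numReducers) {
        String[] t = key.toString().split(",");        // key assumed to be "i,k"
        int blockRow = Integer.parseInt(t[0]) / BLOCK_SIZE;
        int blockCol = Integer.parseInt(t[1]) / BLOCK_SIZE;
        int h = 31 * blockRow + blockCol;              // mix the block coordinates
        return (h & Integer.MAX_VALUE) % numReducers;  // fold into [0, numReducers)
    }
}
```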

Performance
• Random matrix generator: row dimension, column dimension, density. Doubles in (-1000, 1000).
• Many parameters to vary: matrix dimensions, double max, number of splits, number of reducers, density of matrix.
• Sparse 1000 x 1000, 0.1 density, 6 splits, 12 reducers, 2.9 MB:
  – 5 minutes
• Sparse 5000 x 5000, 0.1 density, 20 splits, 20 reducers, 73 MB:
  – > 1 hour
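
A minimal sketch of a generator matching that description (the author's version is not shown in the slides): each entry is kept with probability density and, if kept, emitted in coordinate list format with a value drawn uniformly from (-1000, 1000).

```java
import java.util.Random;

public class RandomCooMatrix {
    public static void main(String[] args) {
        int rows = Integer.parseInt(args[0]);
        int cols = Integer.parseInt(args[1]);
        double density = Double.parseDouble(args[2]);
        Random rng = new Random();
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (rng.nextDouble() < density) {
                    // Uniform double in (-1000, 1000), written as "row col value".
                    double value = (rng.nextDouble() * 2.0 - 1.0) * 1000.0;
                    System.out.println(i + " " + j + " " + value);
                }
            }
        }
    }
}
```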

MATLAB Performance
• Windows 7, MATLAB 2015a, 64-bit.
• Engineering Library cluster, 4 GB RAM:
  – 13,000 x 13,000: about the largest that could fit in memory. Full random matrices of doubles. Multiplication time: about 2 minutes.
• LaFortune cluster, 16 GB RAM:
  – 20,000 x 20,000, 0.1 density, sparse matrix. Multiplication time: about 2 minutes 30 seconds.

Improvements?
• Different matrix representation? Maybe there are better ways to represent sparse matrices than coordinate list format.
• Strassen's algorithm? O(n^2.8), benefits of about 10% with matrix dimensions of a few thousand (its seven-product recursion is shown below).
• Use a different algorithm?
• Use a different platform? Spark?
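
For context (standard material, not from the slides): Strassen's method multiplies 2 x 2 block matrices with seven block products instead of eight, which gives the O(n^{log_2 7}) ≈ O(n^2.81) bound the slide cites.

$$
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & M_2 &= (A_{21}+A_{22})B_{11} \\
M_3 &= A_{11}(B_{12}-B_{22}) & M_4 &= A_{22}(B_{21}-B_{11}) \\
M_5 &= (A_{11}+A_{12})B_{22} & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}) \\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}) & & \\
C_{11} &= M_1+M_4-M_5+M_7 & C_{12} &= M_3+M_5 \\
C_{21} &= M_2+M_4 & C_{22} &= M_1-M_2+M_3+M_6
\end{aligned}
$$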

Conclusion
• What happened to the enormous matrices?
• From my project, I do not think Hadoop is a practical choice for implementing matrix multiplication.
• I did not find any implementations of matrix multiplication in Hadoop that provide significant benefit over local machines.