Transforming Big Data with D4M
Jeremy Kepner
MIT Lincoln Laboratory
3 October 2012

This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

Acknowledgements

• Nicholas Arcolano, Michelle Beard, Bob Bond, Josh Haines, Matthew Schmidt, Ben Miller, Benjamin O’Gwynn, Tamara Yu, Bill Arcand, Bill Bergeron
• David Bestor, Chansup Byun, Matt Hubbell, Pete Michaleas, Julie Mullen, Andy Prout, Albert Reuther, Tony Rosa, Charles Yee, Dylan Hutchinson

Outline

• Introduction
• Theory
• Results
• Summary

Example Applications of Graph Analytics

ISR
• Graphs represent entities and relationships detected through multi-INT sources
• 1,000s – 1,000s tracks and locations
• GOAL: Identify anomalous patterns of life

Social
• Graphs represent relationships between individuals or documents
• 10,000s – 10,000s individuals and interactions
• GOAL: Identify hidden social networks

Cyber
• Graphs represent communication patterns of computers on a network
• 1,000s – 1,000,000s network events
• GOAL: Detect cyber attacks or malicious software

• Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets

Four Ecosystems Dominate Cloud Computing

Enterprise: Interactive, On-demand, Elastic
Big Compute: High performance, Parallel languages, Scientific computing
Big Data: Java, Map/Reduce, Easy admin
DBMS: Indexing, Search, Security

• Each ecosystem is at the center of a multi-$B market
• Pros/cons of each are numerous; diverging hardware/software
• Some missions can exist wholly in one ecosystem; some can’t

Four Ecosystems Dominate Cloud Computing

[Figure: the four ecosystems (Enterprise, Big Compute, Big Data, DBMS) with LLGrid MapReduce marked]

• LLGrid MapReduce provides a map/reduce interface in a big compute environment
• D4M provides an interactive parallel scientific computing environment to databases

Big Data + Big Compute Stack

Stack layers (in slide order):
• Novel Analytics for: Text, Cyber, Bio
• Weak Signatures, Noisy Data, Dynamics
• High Level Composable API: D4M (“Databases for Matlab”)
• Array Algebra
• Distributed Database: Accumulo (triple store)
• High Performance Computing: LLGrid + Hadoop

Side labels: Distributed Database / Distributed File System; Interactive Supercomputing

• Combining Big Compute and Big Data enables entirely new domains

High Level Language: D4M
http://www.mit.edu/~kepner/D4M

D4M: Dynamic Distributed Dimensional Data Model

[Figure: a distributed database and a numerical computing environment linked by D4M associative arrays; example query over Alice, Bob, Cathy, David, Earl]

• A D4M query returns a sparse matrix or a graph…
• …for statistical signal processing or graph analysis in MATLAB

D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization

Outline

• Introduction
• Theory
  – Associative Arrays
  – Incidence Matrix
• Results
• Summary

What are Spreadsheets and Big Tables?

• Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?)
• Big Tables (Google, Amazon, Facebook, …) store most of the analyzed data in the world (Exabytes?)
• Simultaneous diverse data: strings, dates, integers, reals, …
• Simultaneous diverse uses: matrices, functions, hash tables, databases, …
• No formal mathematical basis; Zero papers in AMA or SIAM

D4M Key Concept: Associative Arrays Unify Four Abstractions

• Extends associative arrays to 2D and mixed data types
    A('alice ','bob ') = 'cited '   or   A('alice ','bob ') = 47.0
• Key innovation: 2D is 1-to-1 with triple store
    ('alice ','bob ','cited ')   or   ('alice ','bob ',47.0)

[Figure: alice, bob, and carl linked by 'cited' entries; labels include A^T x]

Composable Associative Arrays

• Key innovation: mathematical closure
  – All associative array operations return associative arrays
• Enables composable mathematical operations
    A + B    A - B    A & B    A|B    A*B
• Enables composable query operations via array indexing
    A('alice bob ',:)    A('alice ',:)    A('al* ',:)    A('alice : bob ',:)    A(1:2,:)    A == 47.0
• Simple to implement in a library (~2000 lines) in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
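The closure property can be illustrated with a small sketch. This is plain Python, not the D4M library: the `Assoc` class, its `rows` method, and the sample values are hypothetical stand-ins chosen only to show that every operation hands back the same kind of object, so operations can be chained.

```python
# Hypothetical stand-in for an associative array: a dict keyed by (row, col).
# This is NOT the D4M API, only an illustration of mathematical closure.
class Assoc:
    def __init__(self, triples=()):
        self.data = {(r, c): v for r, c, v in triples}

    def __add__(self, other):
        # Sum values where both arrays have an entry; keep the rest unchanged.
        out = Assoc()
        out.data = dict(self.data)
        for key, v in other.data.items():
            out.data[key] = out.data[key] + v if key in out.data else v
        return out

    def rows(self, *names):
        # Row selection, loosely analogous to A('alice bob ',:) on the slide.
        out = Assoc()
        out.data = {k: v for k, v in self.data.items() if k[0] in names}
        return out

    def triples(self):
        # The same content viewed as (row, col, value) triples.
        return sorted((r, c, v) for (r, c), v in self.data.items())


A = Assoc([("alice", "bob", 47.0), ("carl", "bob", 1.0)])
B = Assoc([("alice", "bob", 3.0), ("alice", "carl", 2.0)])

# Each step returns another Assoc, so a sum can be queried directly.
C = (A + B).rows("alice")
print(C.triples())
```

Because `A + B` is itself an `Assoc`, the row query applies to it without any conversion step; that is the composability the slide refers to.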

Associative Array Definitions

• Keys and values are from the infinite strict totally ordered set $S$
• Associative array $A(\mathbf{k}) : S^d \to S$, $\mathbf{k} = (k_1, \ldots, k_d)$, is a partial function from $d$ keys (typically 2) to 1 value, where $A(\mathbf{k}_i) = v_i$ and $\emptyset$ otherwise
• Binary operations on associative arrays $A_3 = A_1 \oplus A_2$, where $\oplus = \cup_{f()}$ or $\cap_{f()}$, have the properties
  – If $A_1(\mathbf{k}_i) = v_1$ and $A_2(\mathbf{k}_i) = v_2$, then $A_3(\mathbf{k}_i)$ is $v_1 \cup_{f()} v_2 = f(v_1, v_2)$ or $v_1 \cap_{f()} v_2 = f(v_1, v_2)$
  – If $A_1(\mathbf{k}_i) = v$ or $\emptyset$ and $A_2(\mathbf{k}_i) = \emptyset$ or $v$, then $A_3(\mathbf{k}_i)$ is $v \cup_{f()} \emptyset = v$ or $v \cap_{f()} \emptyset = \emptyset$
• High level usage dictated by these definitions
• Deeper algebraic properties set by the collision function $f()$
• Frequent switching between “algebras” (how spreadsheets are used)
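A minimal Python sketch of the two binary operations, assuming associative arrays are stored as plain dicts from key tuples to values. The function names `union` and `intersection` and the sample data are illustrative assumptions, not D4M functions; the collision function `f` is passed in, which is what makes switching "algebras" a one-argument change.

```python
def union(a1, a2, f):
    # Keys from either array; f resolves collisions, lone values pass through.
    out = dict(a1)
    for k, v in a2.items():
        out[k] = f(out[k], v) if k in out else v
    return out


def intersection(a1, a2, f):
    # Only keys present in both arrays survive; f combines the two values.
    return {k: f(a1[k], a2[k]) for k in a1 if k in a2}


A1 = {("alice", "bob"): 1, ("alice", "carl"): 5}
A2 = {("alice", "bob"): 2, ("bob", "carl"): 7}

# Same data, different collision functions ("algebras").
summed = union(A1, A2, lambda x, y: x + y)
largest = union(A1, A2, max)
smallest = intersection(A1, A2, min)
print(summed, largest, smallest)
```

An absent key plays the role of the empty value: under `union` the lone value is kept, under `intersection` the entry disappears.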

Theory Questions

• Associative arrays can be constructed from a few definitions
• Similar to linear algebra, but applicable to a wider range of data
• Key questions
  – Which linear algebra properties do apply to associative arrays (intuitive)
  – Which linear algebra properties do not apply to associative arrays (watch out)
  – Which associative array properties do not apply to linear algebra (new)

[Figure: overlapping regions for Associative Arrays and Linear Algebra, labeled new, intuitive, and watch out]

References

• Book: “Graph Algorithms in the Language of Linear Algebra”
• Editors: Kepner (MIT-LL) and Gilbert (UCSB)
• Contributors:
  – Bader (Ga Tech)
  – Bliss (MIT-LL)
  – Bond (MIT-LL)
  – Dunlavy (Sandia)
  – Faloutsos (CMU)
  – Fineman (CMU)
  – Gilbert (UCSB)
  – Heitsch (Ga Tech)
  – Hendrickson (Sandia)
  – Kegelmeyer (Sandia)
  – Kepner (MIT-LL)
  – Kolda (Sandia)
  – Leskovec (CMU)
  – Madduri (Ga Tech)
  – Mohindra (MIT-LL)
  – Nguyen (MIT)
  – Radar (MIT-LL)
  – Reinhardt (Microsoft)
  – Robinson (MIT-LL)
  – Shah (UCSB)

Outline

• Introduction
• Theory
  – Associative Arrays
  – Incidence Matrix
• Results
• Summary

Digraphs are Black & White

The World is Color
Artist: Ann Pibal; Painting: “XCRS”

5 Edge Colors
Blue, Silver, Green, Orange, Pink
Artist: Ann Pibal; Painting: “XCRS”

20 Vertices
[Figure: the painting with its vertices labeled V1 through V20]
Artist: Ann Pibal; Painting: “XCRS”

1 Isolated Standard Edge
P4
Artist: Ann Pibal; Painting: “XCRS”

12 Multi Edges
• B1, S1, G1, O1, O2, P1
• B2, S2, G2, O3, O4, P2
Artist: Ann Pibal; Painting: “XCRS”

18 Hyper Edges
• B1, S1, G1, O1, O2, P1
• B2, S2, G2, O3, O4, P2
• O5, P3, P5, P6, P7, P8
Artist: Ann Pibal; Painting: “XCRS”

27 Edge Orderings
• O5 < P3, P6, P7, P8
• O5 < B1, S1, G1, O2, P1
• O5 < B2, S2, G2, O3, O4, P2 < P7, P8
Artist: Ann Pibal; Painting: “XCRS”

52 Standard Multi Edges
• O5 x5
• P3 x3
• P5 x2
• P6 x2
• P7 x2
• P8 x2
• (B1, S1, G1, O1, O2, P1) x2
• (B2, S2, G2, O3, O4, P2) x4
Artist: Ann Pibal; Painting: “XCRS”

Summary Observations

• Standard edge representation fragments hyper edges
  – Information is lost
• Digraph representation compresses multi-edges
  – Information is lost
• Standard graph representation drops edge order
  – Information is lost
• Matrix representation drops edge labels
  – Information is lost
• Need edge representation that preserves information

Artist: Ann Pibal; Painting: “XCRS”

Solution: Incidence Matrix

Each row is an edge. The columns are Color, Order, and one column per vertex (V01 … V20), holding a 1 wherever the edge touches that vertex. The individual vertex entries are not reproduced here.

Edge  Color   Order
B1    Blue    2
S1    Silver  2
G1    Green   2
O1    Orange  2
O2    Orange  2
P1    Pink    2
B2    Blue    2
S2    Silver  2
G2    Green   2
O3    Orange  2
O4    Orange  2
P2    Pink    2
O5    Orange  1
P3    Pink    2
P4    Pink    2
P5    Pink    2
P6    Pink    2
P7    Pink    3
P8    Pink    3

Artist: Ann Pibal; Painting: “XCRS”
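To see why an incidence matrix keeps the information the other representations lose, here is a small Python sketch. The edge names, colors, and orders follow the slide, but the vertex assignments are made up for illustration, and the dict-of-rows layout is an assumption rather than the D4M representation.

```python
# One row per edge. Vertex assignments here are hypothetical; only the edge
# names, colors, and orders are taken from the slide.
E = {
    "B1": {"Color": "Blue",   "Order": 2, "V01": 1, "V02": 1, "V03": 1},
    "S1": {"Color": "Silver", "Order": 2, "V01": 1, "V02": 1, "V03": 1},
    "O5": {"Color": "Orange", "Order": 1, "V04": 1, "V05": 1},
}


def vertices(row):
    # Vertex columns are the ones whose name starts with "V".
    return sorted(k for k in row if k.startswith("V"))


# Hyper edge: a single row touching more than two vertices stays one row.
hyper = [e for e, row in E.items() if len(vertices(row)) > 2]

# Multi-edge: rows touching the same vertices remain distinct, labeled rows.
multi = [e for e in E
         if any(o != e and vertices(E[o]) == vertices(E[e]) for o in E)]

# Edge order and label are ordinary columns, so they are never dropped.
first = min(E, key=lambda e: E[e]["Order"])
print(hyper, multi, first)
```

Hyper edges, multi-edges, labels, and ordering all survive because each edge owns a full row instead of being flattened into a vertex-by-vertex cell.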

Outline

• Introduction
• Theory
• Results
  – Network monitoring example
  – Bioinformatics example
• Summary

Graph Construction Using D4M: Explode Schema

Pipeline stages: Raw Data, CSV Files, Distributed Database, Assoc. Arrays

Dense Table (use log_id as row indices)

log_id  src_ip       server_ip
001     128.0.0.1    208.29.69.138
002     192.168.1.2  157.166.255.18
003     128.0.0.1    74.125.224.72

Create columns for each unique type/value pair

Exploded Table

            src_ip|128.0.0.1  src_ip|192.168.1.2  server_ip|157.166.255.18  server_ip|208.29.69.138  server_ip|74.125.224.72
log_id|001  1                 0                   0                         1                        0
log_id|002  0                 1                   1                         0                        0
log_id|003  1                 0                   0                         0                        1
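The explode step can be sketched in a few lines of Python. This is an illustration of the schema only, not the D4M ingest code: the `explode` helper is a hypothetical name, and the CSV text simply reproduces the dense table from the slide.

```python
import csv
import io

# The dense table from the slide, as CSV text.
raw = """log_id,src_ip,server_ip
001,128.0.0.1,208.29.69.138
002,192.168.1.2,157.166.255.18
003,128.0.0.1,74.125.224.72
"""


def explode(csv_text, key_field):
    # Emit one (row, "field|value", 1) triple per non-key cell.
    triples = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        row = f"{key_field}|{rec[key_field]}"
        for field, value in rec.items():
            if field != key_field:
                triples.append((row, f"{field}|{value}", 1))
    return triples


exploded = explode(raw, "log_id")
for t in exploded:
    print(t)
```

Every distinct type/value pair becomes its own column key, so the zeros of the exploded table are never stored; only the six triples with value 1 exist.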

Graph Construction Using D4M: Construct Associative Arrays

Pipeline stages: Raw Data, CSV Files, Distributed Database, Assoc. Arrays

D4M Query #1
    keys = T(:,'time_stamp|10/May/2011:00:00',:,...
             'time_stamp|13/May/2011:23:59',);

Result triples:
    ('log_id|001','time_stamp|11/May/2011:09:52:53',1)
    ('log_id|002','time_stamp|12/May/2011:13:24:11',1)
    ('log_id|003','time_stamp|13/May/2011:05:12',1)
    ...

Graph Construction Using D4M: Construct Associative Arrays

Pipeline stages: Raw Data, CSV Files, Distributed Database, Assoc. Arrays

D4M Query #1
    keys = T(:,'time_stamp|10/May/2011:00:00',:,...
             'time_stamp|13/May/2011:23:59',);

D4M Query #2
    data = T(Row(keys),:);

Associative Array Algebra
    G = data(:,'src_ip|*').' * data(:,'server_ip|*');

Result triples:
    ('src_ip|128.0.0.1','server_ip|208.29.69.138',1)
    ('src_ip|128.0.0.1','server_ip|74.125.224.72',1)
    ('src_ip|192.168.1.2','server_ip|157.166.255.18',1)
    ...

Graph Construction Using D4M: Construct Associative Arrays

Pipeline stages: Raw Data, CSV Files, Distributed Database, Assoc. Arrays

D4M Query #1
    keys = T(:,'time_stamp|10/May/2011:00:00',:,...
             'time_stamp|13/May/2011:23:59',);

D4M Query #2
    data = T(Row(keys),:);

Associative Array Algebra
    G = data(:,'src_ip|*').' * data(:,'server_ip|*');
    Adj(G);

• Graphs can be constructed with minimal effort using D4M queries and associative array algebra
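The algebra step multiplies the transposed source columns by the server columns, so each shared log row contributes one source-to-server edge. The following Python sketch reproduces that product on the three example log entries; `select` and `transpose_times` are hypothetical helper names standing in for the D4M column query and the `.' *` operator.

```python
from collections import defaultdict

# Exploded triples for the three example log entries.
data = [
    ("log_id|001", "src_ip|128.0.0.1", 1),
    ("log_id|001", "server_ip|208.29.69.138", 1),
    ("log_id|002", "src_ip|192.168.1.2", 1),
    ("log_id|002", "server_ip|157.166.255.18", 1),
    ("log_id|003", "src_ip|128.0.0.1", 1),
    ("log_id|003", "server_ip|74.125.224.72", 1),
]


def select(triples, prefix):
    # Column query by prefix, loosely analogous to data(:,'src_ip|*').
    return [(r, c, v) for r, c, v in triples if c.startswith(prefix)]


def transpose_times(left, right):
    # (left.' * right)[a, b] = sum over shared rows of left[row, a] * right[row, b]
    by_row = defaultdict(list)
    for r, c, v in right:
        by_row[r].append((c, v))
    out = defaultdict(int)
    for r, a, va in left:
        for b, vb in by_row[r]:
            out[(a, b)] += va * vb
    return dict(out)


G = transpose_times(select(data, "src_ip|"), select(data, "server_ip|"))
for (src, dst), weight in sorted(G.items()):
    print(src, dst, weight)
```

The result is the source-to-server adjacency array: the log IDs are summed away, and an entry greater than 1 would indicate repeated connections between the same pair.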

Accumulo Ingestion Scalability Study
LLGrid MapReduce with a Python Application

• Accumulo database: 1 master + 7 tablet servers
• Data #1: 5 GB of 200 files
• Data #2: 30 GB of 1000 files

[Figure: ingest rate plot; marked value: 4 Mil e/s]

Outline

• Introduction
• Theory
• Results
  – Network monitoring example
  – Bioinformatics example
• Summary

Relative Cost per DNA Sequence

[Figure: plot with annotations Big Data, Energy Efficient, High Volume Sequencer, Portable Sequencer]

Source: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012.

Example Disease Outbreak
May-July 2011: Virulent E. coli Outbreak, Germany

[Figure: outbreak timeline (www.rki.de, EHEC final report); labels: diarrhea, kidney, Deaths; events: Outbreak identified, Spanish cucumbers implicated, DNA sequence released, Sprouts identified]

Conclusions:
• Identification of E. coli source too late to have substantial impact on illnesses
• Publishing sequence data allowed for broad community to fully characterize pathogen
• Sequencing and crowd source analysis showed promising potential -> Still too slow

Sequence Matching Graph Sparse Matrix Multiply in D4M

[Figure: Collected Sample (unknown bacteria) gives A1, indexed by unknown sequence ID and sequence word (10mer); RNA Reference Set (reference bacteria) gives A2, indexed by reference sequence ID and sequence word (10mer); the product A1 A2' is indexed by unknown sequence ID and reference sequence ID]

• Associative arrays provide a natural framework for sequence matching
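A small Python sketch of the idea behind the sparse product A1 A2': each array maps a sequence ID to the set of 10-mers it contains, and the product counts the 10-mers an unknown sequence shares with each reference. The helper names and the toy sequences are hypothetical, and real data would use far longer sequences and a database-backed array.

```python
from collections import defaultdict


def kmers(seq, k=10):
    # All distinct length-k words in a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_array(seqs, k=10):
    # Sparse "sequence ID x 10-mer" array stored as {seq_id: set of words}.
    return {sid: kmers(s, k) for sid, s in seqs.items()}


def match(A1, A2):
    # Entry (u, r) of A1 * A2' counts the 10-mers shared by u and r.
    by_word = defaultdict(set)
    for rid, words in A2.items():
        for w in words:
            by_word[w].add(rid)
    out = defaultdict(int)
    for uid, words in A1.items():
        for w in words:
            for rid in by_word[w]:
                out[(uid, rid)] += 1
    return dict(out)


# Toy sequences, invented for this example.
reference = {"ref1": "ACGTACGTTAGCCGAT", "ref2": "TTTTTGGGGGCCCCCAAAAA"}
unknown = {"u1": "GGACGTACGTTAGCC"}

scores = match(kmer_array(unknown), kmer_array(reference))
print(scores)
```

Indexing the reference by word first mirrors the "trade computations for lookups" approach: only words that actually occur in a reference are ever touched.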

Database Automatically Computes Reference 10mer Distribution

[Figure: reference 10mer distribution; marked levels: 0.5%, 5%, 50%]

• Using 10mer distribution can quickly select reference 10mers that maximally differentiate sample sequences and eliminate most 10mers

Leveraging “Big Data” Technologies for High Speed Sequence Matching

[Figure: run time (seconds) versus code volume (lines) for BLAST, D4M, and D4M + Triple Store; annotations: 100x faster, 100x smaller]

• High performance triple store database trades computations for lookups
• Used Apache Accumulo database to accelerate comparison by 100x
• Used Lincoln D4M software to reduce code size by 100x

Summary

• Big data is found across a wide range of areas
  – Document analysis
  – Computer network analysis
  – DNA Sequencing
• Currently there is a gap in big data analysis tools for algorithm developers
• D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation

Assignment (1/3)

(1) Install Matlab or GNU Octave on your computer or log in to a computer with that software already installed.
(2) Download D4M and follow the installation instructions.
(3) Cd to /d4m_api/examples to your LLGrid folder.
(4) Start Matlab (or GNU Octave) and type:
    • help D4M
    • This should display the D4M function list. If it does not, then you should e-mail pmatlab@ll.mit.edu for help on setting up

Assignment (2/3)

(5) cd to the second example in your example directory
    • cd examples/1Intro/2EdgeArt
(6) Run each of the examples:
    • EA1_GraphTEST
    • EA2_SubsrefTEST
    • EA3_SubGraphTEST

Assignment (3/3)

(7) Select a picture with a small number of edges and vertices (e.g., from pibalart.com).
    • Label the edges and vertices
    • Create the incidence matrix E
    • Compute adjacency matrix from the incidence matrix using the formula: A = transpose(E)*E

You can do this assignment by hand or by modifying the examples.
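For checking a hand calculation, the formula A = transpose(E)*E can be sketched in plain Python. The three-edge picture below is invented for illustration, and the `adjacency` helper is a hypothetical stand-in for the matrix product; the D4M examples in the assignment do this with associative arrays instead.

```python
from collections import defaultdict

# Hypothetical incidence matrix: rows are edges, columns are vertices.
E = {
    "e1": {"v1": 1, "v2": 1},
    "e2": {"v2": 1, "v3": 1},
    "e3": {"v1": 1, "v2": 1, "v3": 1},  # a hyper edge touching three vertices
}


def adjacency(E):
    # A[vi, vj] = sum over edges of E[edge, vi] * E[edge, vj]
    A = defaultdict(int)
    for row in E.values():
        for vi, xi in row.items():
            for vj, xj in row.items():
                A[(vi, vj)] += xi * xj
    return dict(A)


A = adjacency(E)
for key in sorted(A):
    print(key, A[key])
```

Off-diagonal entries count the edges two vertices share, and each diagonal entry counts the edges touching that vertex, which is a quick sanity check on a hand-built matrix.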