Distributed Structured Storage Systems Mark Feltner Big Data
(Distributed) (Structured) Storage Systems Mark Feltner
Big Data
• 2.5 Petabytes/day: Wal-Mart's transaction database
• 40 Terabytes/second: CERN
• 1 Terabyte/day: NYSE trading data
• 10 billion: Facebook photos
Overview
• Theory
• Algorithms
• Implementations & Technology
Relational databases
ACID
Atomicity
• All-or-nothing
Consistency
• Data is always in a valid state
Isolation
• Concurrently executed transactions result in the same state as if they had been executed serially
Durability
• COMMIT means the transaction is permanent and visible to all clients
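The all-or-nothing property can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical and only serve the example; the CHECK constraint stands in for the "valid state" rule.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside a single transaction."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK constraint, so the earlier credit
        # was rolled back together with it.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer credits bob first and only then fails on the debit, so the unchanged balance for bob afterwards shows that the half-finished work was undone.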
Non-relational databases
Key-value
Document-oriented
Graphs
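As a rough illustration of how the three non-relational models differ, the sketch below stores the same songs three ways using plain Python structures. It is not the API of any particular product; the key format, field names, edge labels, and helper functions are assumptions made for the example.

```python
# Key-value: the store only knows "get/put by key"; the value is opaque.
kv_store = {}
kv_store["song:aces-high"] = "Aces High|Iron Maiden|Powerslave|1984"

# Document-oriented: each record is self-describing and can be
# queried by any of its fields.
documents = [
    {"title": "Aces High", "artist": "Iron Maiden",
     "album": "Powerslave", "year": 1984},
    {"title": "Raining Blood", "artist": "Slayer",
     "album": "Reign in Blood", "year": 1986},
]


def find(docs, **criteria):
    """Return the documents whose fields match every criterion."""
    return [d for d in docs
            if all(d.get(k) == v for k, v in criteria.items())]


# Graph: entities are nodes, relationships are labelled edges.
edges = [
    ("Aces High", "PERFORMED_BY", "Iron Maiden"),
    ("Aces High", "APPEARS_ON", "Powerslave"),
    ("Raining Blood", "PERFORMED_BY", "Slayer"),
    ("Raining Blood", "APPEARS_ON", "Reign in Blood"),
]


def neighbours(graph_edges, node, label):
    """Follow edges with the given label out of `node`."""
    return [dst for src, lbl, dst in graph_edges
            if src == node and lbl == label]


from_kv = kv_store["song:aces-high"]       # caller must parse the value
from_docs = find(documents, year=1986)      # query by field
from_graph = neighbours(edges, "Aces High", "PERFORMED_BY")  # traverse
```

The key-value lookup is the simplest but leaves parsing to the caller, the document query can filter on any field, and the graph form makes relationship traversal the primary operation.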
Distributed Systems
Fallacies of Distributed Computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
CAP Theorem
Consistency
“…there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” (Gilbert, Lynch)
• Eventual consistency
Availability
“For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” (Gilbert, Lynch)
Partition Tolerance
“In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” (Gilbert, Lynch)
(CA || CP || AP) ?
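The trade-off during a partition can be sketched with a toy two-replica cluster. This is a simplified illustration, not how any real system is implemented: the Cluster and Replica classes are hypothetical, the shared logical clock is a shortcut, and last-write-wins reconciliation is just one assumed strategy for reaching eventual consistency.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)


class Cluster:
    """Two replicas that either stay consistent (CP) or stay available (AP)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False
        self.clock = 0  # shared logical clock: a simplification

    def write(self, replica, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable: peer unreachable")
        self.clock += 1
        replica.data[key] = (self.clock, value)
        if not self.partitioned:
            peer = self.b if replica is self.a else self.a
            peer.data[key] = (self.clock, value)

    def read(self, replica, key):
        return replica.data.get(key, (0, None))[1]

    def heal(self):
        """End the partition and reconcile with last-write-wins."""
        self.partitioned = False
        for key in set(self.a.data) | set(self.b.data):
            newest = max(
                self.a.data.get(key, (0, None)),
                self.b.data.get(key, (0, None)),
                key=lambda entry: entry[0],
            )
            self.a.data[key] = self.b.data[key] = newest


# CP: consistent, but writes are rejected while partitioned.
cp = Cluster("CP")
cp.write(cp.a, "x", "1")
cp.partitioned = True
try:
    cp.write(cp.a, "x", "2")
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False
cp_read = cp.read(cp.b, "x")

# AP: always answers, but a replica may return stale data until healed.
ap = Cluster("AP")
ap.write(ap.a, "x", "1")
ap.partitioned = True
ap.write(ap.a, "x", "2")
stale_read = ap.read(ap.b, "x")
ap.heal()
healed_read = ap.read(ap.b, "x")
```

In the CP run both replicas still agree on the old value because the new write was refused; in the AP run replica b serves the old value until the partition heals, which is the eventual consistency behaviour.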
Algorithms
Row- versus column-orientation
Title                 Artist         Album            Year
Breaking the Law      Judas Priest   British Steel    1980
Aces High             Iron Maiden    Powerslave       1984
Kickstart My Heart    Motley Crue    Dr. Feelgood     1989
Raining Blood         Slayer         Reign in Blood   1986
I Wanna Be Somebody   W.A.S.P.       W.A.S.P.         1984
Row-oriented Data Storage Model:
Breaking the Law, Judas Priest, British Steel, 1980
Aces High, Iron Maiden, Powerslave, 1984
Kickstart My Heart, Motley Crue, Dr. Feelgood, 1989
Raining Blood, Slayer, Reign in Blood, 1986
I Wanna Be Somebody, W.A.S.P., W.A.S.P., 1984
Column-oriented Data Storage Model:
Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody
Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P.
British Steel, Powerslave, Dr. Feelgood, Reign in Blood, W.A.S.P.
1980, 1984, 1989, 1986, 1984
Comparison of Row- vs. Column-Orientation
• CREATE
• SELECT
• MAX, MIN, SUM, AVG, …
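The comparison can be made concrete with a small in-memory sketch of both layouts, using the song table from the earlier slide. The helper names are hypothetical, and the inserted record is placeholder data rather than a real song; real engines add compression, indexing, and disk layout concerns that are ignored here.

```python
FIELDS = ("title", "artist", "album", "year")

# Row-oriented: each record is stored contiguously.
rows = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]

# Column-oriented: each attribute is stored contiguously.
columns = {name: [row[i] for row in rows] for i, name in enumerate(FIELDS)}

# Aggregates: the column layout reads one list; the row layout must
# visit every record and pick out one field.
max_year_cols = max(columns["year"])
max_year_rows = max(row[3] for row in rows)
sum_year_cols = sum(columns["year"])


def select_row(table, index):
    """Row layout: a whole record is a single lookup."""
    return table[index]


def select_columns(cols, index):
    """Column layout: a whole record is stitched from every column."""
    return tuple(cols[name][index] for name in FIELDS)


def insert_row(table, record):
    """Row layout: one append."""
    table.append(record)


def insert_columns(cols, record):
    """Column layout: one append per column."""
    for name, value in zip(FIELDS, record):
        cols[name].append(value)


# Placeholder record, hypothetical data for illustration only.
placeholder = ("Example Title", "Example Artist", "Example Album", 2000)
insert_row(rows, placeholder)
insert_columns(columns, placeholder)
```

Creating or selecting a whole record touches one place in the row layout and every column in the column layout, while MAX, MIN, SUM, and AVG over a single attribute touch only one column in the column layout.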
MapReduce
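A minimal single-process sketch of the MapReduce pattern, counting songs per year from the table shown earlier. The function names are hypothetical and this is not the Hadoop or Google API; a real framework distributes the map and reduce calls across many machines and performs the shuffle over the network.

```python
from collections import defaultdict

songs = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    _title, _artist, _album, year = record
    yield (year, 1)


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


songs_per_year = map_reduce(songs)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can run in parallel, which is what makes the pattern suitable for large clusters.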
Technology
Implementations
BigTable
• High performance
• MapReduce
• Powers: Google Reader, Maps, Book Search, YouTube, Gmail, …
Hadoop
• MapReduce
• Yahoo!
• World Record Holder!
Cassandra
• Key-value
• MapReduce
• Facebook
• Eventual consistency
• Scalable, fault-tolerant
MySQL
• Relational
• LAMP
Redis
• Key-value
• What it lacks in durability, it makes up for in speed and simplicity.
HBase
• MapReduce
• Hadoop + HDFS
• Java and REST API
• Column-oriented
• Excellent fault-tolerance
• Replication
• Streaming
Neo4j
• Graph database
MongoDB
• Document-oriented
Conclusions
• Pick the right tool for the job.