
Introduction to Data Center Computing
Derek Murray, October 2010

What we’ll cover • Techniques for handling “big data” – Distributed storage – Distributed computation • Focus on recent papers describing real systems

Example: web search Crawling Indexing WWW Querying

A system architecture? Network Computers Storage

Data Center architecture Core switch Rack switch Server Server Server

Distributed storage • High volume of data • High volume of read/write requests • Fault tolerance

Brewer’s CAP theorem (2000) • A distributed system can guarantee at most two of: Consistency, Availability, Partition tolerance

The Google File System (2003) Client GFS Master Chunk server

Dynamo (2007) Client
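Dynamo partitions keys across storage nodes with consistent hashing: each node owns a position on a hash ring, and a key is stored at the first node clockwise from its hash. A minimal single-replica sketch (node names are hypothetical; the real system adds virtual nodes and replication):

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    """Map a key (or node name) to a position on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring: a key is owned by the first node clockwise."""
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        h = ring_hash(key)
        # bisect_right finds the first node past h; wrap around the ring with %.
        i = bisect_right(self.positions, h) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:42")
```

The payoff is incremental scalability: adding or removing a node moves only the keys in its arc of the ring, rather than rehashing everything.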

Distributed computation • Parallel distributed processing • Single Program, Multiple Data (SPMD) • Fault tolerance • Applications

Task farming Master Worker Storage Worker
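The task-farming pattern above can be sketched with a shared queue: the master enqueues independent tasks, and idle workers repeatedly pull the next one (a thread-based toy; the worker function is hypothetical, and a real system would handle worker failure by re-queuing tasks):

```python
import queue
import threading

def task_farm(tasks, worker_fn, num_workers=4):
    """Master puts tasks on a queue; each idle worker pulls the next one."""
    tasks = list(tasks)
    q = queue.Queue()
    results = {}
    lock = threading.Lock()

    for i, t in enumerate(tasks):
        q.put((i, t))

    def worker():
        while True:
            try:
                i, t = q.get_nowait()
            except queue.Empty:
                return  # no tasks left; worker exits
            r = worker_fn(t)
            with lock:
                results[i] = r

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [results[i] for i in range(len(tasks))]

squares = task_farm(range(10), lambda x: x * x)
```

Because workers pull work rather than being assigned it up front, slow machines naturally take fewer tasks, which is the basic load-balancing property of a task farm.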

MapReduce (2004)
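The MapReduce programming model can be illustrated in miniature: a map function emits key-value pairs, the framework groups values by key (the "shuffle"), and a reduce function combines each group. A single-process sketch using word count (the canonical example from the paper; this omits the distributed execution and fault tolerance that are the paper's real contribution):

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc):
    """Emit (word, 1) for every word in a document."""
    for word in doc.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Sum the counts for one word."""
    return (key, sum(values))

def mapreduce(docs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = chain.from_iterable(map_fn(d) for d in docs)
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: combine each group independently.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["the cat", "the dog"], map_fn, reduce_fn)
```

Because each map and each reduce call is independent, the framework can run them on different machines and simply re-execute any call whose machine fails.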

Dryad (2007) • Arbitrary directed acyclic graph (DAG) • Vertices and channels • Topological ordering
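A Dryad job is a DAG whose vertices are computations and whose edges are data channels; the scheduler can run any vertex once all its inputs are ready, which is exactly a topological ordering. A sketch using Kahn's algorithm (vertex names are hypothetical; Dryad itself schedules many ready vertices in parallel rather than one at a time):

```python
from collections import deque

def topological_order(vertices, edges):
    """Kahn's algorithm: repeatedly run a vertex with no pending inputs."""
    indegree = {v: 0 for v in vertices}
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph has a cycle; not a DAG")
    return order

# A small DAG: two map vertices feed a join vertex.
order = topological_order(["m1", "m2", "join"], [("m1", "join"), ("m2", "join")])
```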

DryadLINQ (2008) • Language Integrated Query (LINQ)
var table = PartitionedTable.Get<int>("…");
var result = from x in table select x * x;
int sumSquares = result.Sum();

Scheduling issues • Heterogeneous performance • Sharing a cluster fairly • Data locality

Percolator (2010) • Built on Google BigTable • Transactions via snapshot isolation • Per-column notifications (triggers)

Skywriting and CIEL (2010) • Universal distributed execution engine • Script language for distributed programs • Opportunities for student projects…

References
• Storage
– Ghemawat et al., “The Google File System”, Proceedings of SOSP 2003
– DeCandia et al., “Dynamo: Amazon’s Highly-Available Key-value Store”, Proceedings of SOSP 2007
• Computation
– Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Proceedings of OSDI 2004
– Isard et al., “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”, Proceedings of EuroSys 2007
– Yu et al., “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language”, Proceedings of OSDI 2008
– Olston et al., “Pig Latin: A Not-So-Foreign Language for Data Processing”, Proceedings of SIGMOD 2008
– Murray and Hand, “Scripting the Cloud with Skywriting”, Proceedings of HotCloud 2010
• Scheduling
– Zaharia et al., “Improving MapReduce Performance in Heterogeneous Environments”, Proceedings of OSDI 2008
– Isard et al., “Quincy: Fair Scheduling for Distributed Computing Clusters”, Proceedings of SOSP 2009
– Zaharia et al., “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”, Proceedings of EuroSys 2010
• Transactions
– Peng and Dabek, “Large-Scale Incremental Processing using Distributed Transactions and Notifications”, Proceedings of OSDI 2010

Conclusions • Data centers achieve high performance with commodity parts • Efficient storage requires application-specific trade-offs • Data parallelism simplifies distributed computation on the data

Questions • Now or after the lecture – Email • Derek.Murray@cl.cam.ac.uk – Web • http://www.cl.cam.ac.uk/~dgm36/