The Datacenter Needs an Operating System
Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica

Background
• Clusters of commodity servers have become a major computing platform in industry and academia
• Driven by data volumes outpacing the processing capabilities of single machines
• Democratized by cloud computing

Background
• Some have declared that “the datacenter is the new computer”
• Claim: this new computer increasingly needs an operating system
• Not necessarily a new host OS, but a common software layer that manages resources and provides shared services for the whole datacenter, like an OS does for one host

Why Datacenters Need an OS
• Growing number of applications
  – Parallel processing systems: MapReduce, Dryad, Pregel, Percolator, Dremel, MapReduce Online
  – Storage systems: GFS, BigTable, Dynamo, SCADS
  – Web apps and supporting services
• Growing number of users
  – 200+ users for Facebook’s Hadoop data warehouse, running near-interactive ad hoc queries

What Operating Systems Provide
• Resource sharing across applications & users
• Data sharing between programs
• Programming abstractions (e.g. threads, IPC)
• Debugging facilities (e.g. ptrace, gdb)
Result: OSes enable a highly interoperable software ecosystem that we now take for granted

An Analogy
• Today, a scientist analyzing data on a single machine can pipe it through a variety of tools, write new tools that interface with these through standard APIs, and trace across the stack (a minimal pipeline sketch follows below)
• In the future, the scientist should be able to fire up a cloud on EC2 and do the same thing:
  – Intermix a variety of apps & programming models
  – Write new parallel programs that talk to these
  – Get a unified interface for managing the cluster
  – Debug and trace across all these components
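
To make the single-machine half of the analogy concrete, here is a minimal Python sketch of that workflow: chaining standard tools through OS pipes. The file name and the grep/sort/uniq stages are placeholders, not anything from the talk.

```python
import subprocess

# Chain standard tools through pipes, as the analogy describes.
# "experiment.log" is a placeholder input file.
grep = subprocess.Popen(["grep", "ERROR", "experiment.log"],
                        stdout=subprocess.PIPE)
sort = subprocess.Popen(["sort"], stdin=grep.stdout,
                        stdout=subprocess.PIPE)
grep.stdout.close()  # let grep get SIGPIPE if sort exits early
uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout,
                        stdout=subprocess.PIPE)
sort.stdout.close()
print(uniq.communicate()[0].decode())
```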

Today’s Datacenter OS
• Hadoop MapReduce as common execution and resource sharing platform
• Hadoop InputFormat API for data sharing (the idea is sketched below)
• Abstractions for productivity programmers, but not for system builders
• Very challenging to debug across all the layers
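
The InputFormat point is worth unpacking. The sketch below is illustrative Python, not Hadoop's actual Java API: a data source is exposed as a list of splits plus a per-split record iterator, which is what lets independently written engines read the same data.

```python
import os

# Illustrative InputFormat-style reader: any data source becomes
# (a) a list of splits and (b) a record iterator per split.
class TextInputFormat:
    def __init__(self, path, split_size=64 * 1024 * 1024):
        self.path, self.split_size = path, split_size

    def get_splits(self):
        """Return (offset, length) byte ranges, one per map task."""
        size = os.path.getsize(self.path)
        return [(off, min(self.split_size, size - off))
                for off in range(0, size, self.split_size)]

    def read_records(self, split):
        """Yield one line per record from a split (simplified: real
        implementations handle records that straddle split boundaries)."""
        offset, length = split
        with open(self.path, "rb") as f:
            f.seek(offset)
            for line in f.read(length).splitlines():
                yield line
```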

Tomorrow’s Datacenter OS
• Resource sharing:
  – Lower-level interfaces for fine-grained sharing (Mesos is a first step in this direction; see the sketch below)
  – Optimization for a variety of metrics (e.g. energy)
  – Integration with network scheduling mechanisms (e.g. Seawall [NSDI ’11], NOX, Orchestra)
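
For concreteness, here is a toy simulation of the offer-based, two-level scheduling that Mesos pioneered. This is an illustration, not the real Mesos API: the allocator offers free resources to each framework, and the framework's own scheduler decides which offers to accept.

```python
# Toy two-level scheduler: master offers resources, frameworks choose.
class Framework:
    def __init__(self, name, task_demands):
        self.name = name
        self.tasks = task_demands          # CPU demand per pending task

    def resource_offer(self, offers):
        """Accept offers only while tasks remain and the offer fits."""
        accepted = []
        for node, cpus in offers:
            if self.tasks and self.tasks[0] <= cpus:
                accepted.append((node, self.tasks.pop(0)))
        return accepted

def allocate(free, frameworks):
    """One allocation round: offer all free resources to each framework."""
    for fw in frameworks:
        for node, cpus in fw.resource_offer(list(free.items())):
            free[node] -= cpus
            print(f"{fw.name}: launched task using {cpus} CPUs on {node}")

allocate({"node1": 4, "node2": 4},
         [Framework("mapreduce", [2, 2]), Framework("mpi", [4])])
```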

Tomorrow’s Datacenter OS
• Data sharing:
  – Standard interfaces for cluster file systems, key-value stores, etc.
  – In-memory data sharing (e.g. Spark, DFS cache), and a unified system to manage this memory
  – Streaming data abstractions (analogous to pipes)
  – Lineage instead of replication for reliability (RDDs; sketched below)
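
The lineage point deserves an example. Below is a minimal Python sketch in the spirit of RDDs (illustrative, not Spark's implementation): a dataset records its parent and the function that derives it, so a lost partition can be recomputed from lineage rather than restored from a replica.

```python
# Minimal lineage-based recovery: recompute lost partitions on demand.
class Dataset:
    def __init__(self, partitions=None, parent=None, fn=None):
        self.cache = partitions            # None until materialized
        self.parent, self.fn = parent, fn

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def compute(self, i):
        """Materialize partition i, recomputing from lineage if needed."""
        if self.cache is not None and self.cache[i] is not None:
            return self.cache[i]
        return [self.fn(x) for x in self.parent.compute(i)]

base = Dataset(partitions=[[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)
print(squares.compute(1))                  # recomputed via lineage: [9, 16]
```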

Tomorrow’s Datacenter OS
• Programming abstractions:
  – Tools that can be used to build the next MapReduce / BigTable in a week (e.g. BOOM)
  – Efficient implementations of communication primitives (e.g. shuffle, broadcast; a shuffle sketch follows below)
  – New distributed programming models
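
As an example of a reusable communication primitive, here is a single-process Python sketch of shuffle (a hypothetical helper, not from any named system): hash-partitioning the (key, value) outputs of map tasks so that all values for a key meet in one reduce partition.

```python
from collections import defaultdict

# Hash-partition map outputs: the step underlying reduce/groupBy.
def shuffle(map_outputs, num_partitions):
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for task_output in map_outputs:        # one list per map task
        for key, value in task_output:
            partitions[hash(key) % num_partitions][key].append(value)
    return partitions

maps = [[("a", 1), ("b", 1)], [("a", 1)]]
print(shuffle(maps, 2))                    # every "a" lands in one partition
```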

Tomorrow’s Datacenter OS
• Debugging facilities:
  – Tracing and debugging tools that work across the cluster software stack (e.g. X-Trace, Dapper; see the sketch below)
  – Replay debugging that takes advantage of limited languages / computational models
  – Unified monitoring infrastructure and APIs
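
To illustrate what Dapper/X-Trace-style cross-stack tracing involves, here is a toy Python sketch (the layer names and functions are hypothetical): a trace ID minted at the entry point is threaded through every layer a request touches, so per-layer events can later be stitched into one causal trace.

```python
import uuid

def log(trace_id, layer, event):
    # In a real system this would go to a trace collector, not stdout.
    print(f"[trace {trace_id}] {layer}: {event}")

def handle_query(query):
    trace_id = uuid.uuid4().hex[:8]        # minted once, at the entry point
    log(trace_id, "frontend", f"received {query!r}")
    run_task(trace_id, query)

def run_task(trace_id, query):
    log(trace_id, "scheduler", "task launched")
    read_block(trace_id)

def read_block(trace_id):
    log(trace_id, "storage", "block read")

handle_query("SELECT count(*) FROM logs")
```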

Putting it All Together
• A successful datacenter OS might let users:
  – Build a Hadoop-like software stack in a week using the OS’s abstractions, while gaining other benefits (e.g. cross-stack replay debugging)
  – Share data efficiently between independently developed programming models and applications
  – Understand cluster behavior without having to log into individual nodes
  – Dynamically share the cluster with other users

Conclusion
• Datacenters need an OS-like software stack for the same reasons single computers did: manageability, efficiency & programmability
• An OS is already emerging in an ad-hoc way
• Researchers can help by taking a long-term approach towards these problems

How Researchers Can Help
• Focus on paradigms, not performance
  – Industry is tackling performance but lacks the luxury of taking a long-term view towards abstractions
• Explore clean-slate approaches
  – Likelier to have impact here than in a “real” OS because datacenter software changes quickly!
• Bring cluster computing to non-experts
  – Much harder, and more rewarding, than serving big users