The Datacenter Needs an Operating System Matei Zaharia
- Slides: 14
The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica
Background
• Clusters of commodity servers have become a major computing platform in industry and academia
• Driven by data volumes outpacing the processing capabilities of single machines
• Democratized by cloud computing
Background
• Some have declared that “the datacenter is the new computer”
• Claim: this new computer increasingly needs an operating system
• Not necessarily a new host OS, but a common software layer that manages resources and provides shared services for the whole datacenter, like an OS does for one host
Why Datacenters Need an OS
• Growing number of applications
  – Parallel processing systems: MapReduce, Dryad, Pregel, Percolator, Dremel, MR Online
  – Storage systems: GFS, BigTable, Dynamo, SCADS
  – Web apps and supporting services
• Growing number of users
  – 200+ for Facebook’s Hadoop data warehouse, running near-interactive ad hoc queries
What Operating Systems Provide
• Resource sharing across applications & users
• Data sharing between programs
• Programming abstractions (e.g. threads, IPC)
• Debugging facilities (e.g. ptrace, gdb)
Result: OSes enable a highly interoperable software ecosystem that we now take for granted
An Analogy
• Today, a scientist analyzing data on a single machine can pipe it through a variety of tools, write new tools that interface with these through standard APIs, and trace across the stack
• In the future, the scientist should be able to fire up a cloud on EC2 and do the same thing:
  – Intermix a variety of apps & programming models
  – Write new parallel programs that talk to these
  – Get a unified interface for managing the cluster
  – Debug and trace across all these components
Today’s Datacenter OS
• Hadoop MapReduce as common execution and resource sharing platform
• Hadoop InputFormat API for data sharing
• Abstractions for productivity programmers, but not for system builders
• Very challenging to debug across all the layers
Tomorrow’s Datacenter OS
• Resource sharing:
  – Lower-level interfaces for fine-grained sharing (Mesos is a first step in this direction)
  – Optimization for a variety of metrics (e.g. energy)
  – Integration with network scheduling mechanisms (e.g. Seawall [NSDI ’11], NOX, Orchestra)
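To make "fine-grained sharing" concrete, here is a minimal sketch of an offer-based allocator, in which a central allocator offers each node's free CPUs to frameworks and each framework decides how much to take. It is only loosely inspired by offer-style sharing and is not Mesos's API; `Offer`, `GreedyFramework`, and `allocate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Offer:
    """Free CPUs on one node, offered to a framework (hypothetical type)."""
    node: str
    cpus: int


class GreedyFramework:
    """Hypothetical framework that accepts CPUs until its demand is met."""

    def __init__(self, name: str, demand: int) -> None:
        self.name = name
        self.demand = demand

    def consider(self, offer: Offer) -> int:
        # Accept as many of the offered CPUs as are still needed.
        take = min(offer.cpus, self.demand)
        self.demand -= take
        return take


def allocate(free: Dict[str, int],
             frameworks: List[GreedyFramework]) -> List[Tuple[str, str, int]]:
    """Offer each node's free CPUs to the frameworks in turn.

    Mutates `free` and returns (framework, node, cpus) placements.
    """
    placements = []
    for node in sorted(free):
        for fw in frameworks:
            if free[node] == 0:
                break
            taken = fw.consider(Offer(node, free[node]))
            if taken:
                free[node] -= taken
                placements.append((fw.name, node, taken))
    return placements


free_cpus = {"n1": 4, "n2": 4}
placements = allocate(free_cpus, [GreedyFramework("A", 3), GreedyFramework("B", 4)])
```

Here two frameworks end up sharing node `n1` at CPU granularity instead of each receiving whole machines, which is the kind of sharing the slide points toward.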
Tomorrow’s Datacenter OS
• Data sharing:
  – Standard interfaces for cluster file systems, key-value stores, etc.
  – In-memory data sharing (e.g. Spark, DFS cache), and a unified system to manage this memory
  – Streaming data abstractions (analogous to pipes)
  – Lineage instead of replication for reliability (RDDs)
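The "lineage instead of replication" idea can be sketched in a few lines: each derived dataset remembers its parent and the transformation that produced it, so a lost in-memory copy is recomputed rather than restored from a replica. This is a toy single-process illustration, not Spark's RDD API; `LineageDataset` and its methods are hypothetical, and the base data is assumed to live in reliable storage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LineageDataset:
    """A dataset that records how it was derived (hypothetical class)."""
    source: Optional[List[int]] = None         # base data, assumed durable
    parent: Optional["LineageDataset"] = None  # dataset this one derives from
    transform: Optional[Callable[[List[int]], List[int]]] = None
    cache: Optional[List[int]] = None          # in-memory copy, may be lost

    def map(self, fn: Callable[[int], int]) -> "LineageDataset":
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "LineageDataset":
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self) -> List[int]:
        # Rebuild from lineage whenever the cached copy is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.transform(self.parent.compute())
        return self.cache

    def lose_cache(self) -> None:
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self.cache = None


base = LineageDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()
evens_squared.lose_cache()          # drop the derived data
evens_squared.parent.lose_cache()   # and the intermediate result
recovered = evens_squared.compute() # recomputed from the recorded lineage
```

No second copy of the derived data is ever stored; only the recipe is kept, which is the trade-off the slide contrasts with replication.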
Tomorrow’s Datacenter OS
• Programming abstractions:
  – Tools that can be used to build the next MapReduce / BigTable in a week (e.g. BOOM)
  – Efficient implementations of communication primitives (e.g. shuffle, broadcast)
  – New distributed programming models
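As an illustration of what a shared shuffle primitive does, the sketch below hash-partitions the (key, value) pairs emitted by several mappers so that all values for a key land at one reducer. It runs in a single process and ignores the network transfer that makes real shuffles expensive; the function names and the choice of CRC32 as the partitioning hash are assumptions made for this example.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs: List[List[Tuple[str, int]]],
            num_reducers: int) -> List[Dict[str, List[int]]]:
    """Group values by key, sending each key to exactly one reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for mapper_output in map_outputs:
        for key, value in mapper_output:
            reducers[partition(key, num_reducers)][key].append(value)
    return [dict(r) for r in reducers]


# Two mappers emitting word counts; two reducers receive the grouped values.
map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
buckets = shuffle(map_outputs, num_reducers=2)
```

A datacenter OS that provided this primitive once, efficiently, would spare each new programming model from reimplementing it.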
Tomorrow’s Datacenter OS
• Debugging facilities:
  – Tracing and debugging tools that work across the cluster software stack (e.g. X-Trace, Dapper)
  – Replay debugging that takes advantage of limited languages / computational models
  – Unified monitoring infrastructure and APIs
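Cross-stack tracing generally rests on one mechanism: a small piece of trace metadata travels with each request through every layer, so events logged by different systems can be stitched into one causal chain. The following is a minimal sketch of that idea, not the X-Trace or Dapper API; `TraceContext`, `record`, and the three layer functions are hypothetical.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple

_span_ids = itertools.count(1)
events: List[Tuple[str, int, Optional[int], str, str]] = []


@dataclass
class TraceContext:
    """Metadata propagated with a request across layers (hypothetical)."""
    trace_id: str
    span_id: int
    parent_id: Optional[int]

    def child(self) -> "TraceContext":
        # Same trace, new span, linked back to the caller's span.
        return TraceContext(self.trace_id, next(_span_ids), self.span_id)


def record(ctx: TraceContext, layer: str, message: str) -> None:
    events.append((ctx.trace_id, ctx.span_id, ctx.parent_id, layer, message))


def storage_read(ctx: TraceContext, path: str) -> str:
    ctx = ctx.child()
    record(ctx, "storage", f"read {path}")
    return "data"


def run_task(ctx: TraceContext, path: str) -> str:
    ctx = ctx.child()
    record(ctx, "executor", "task start")
    return storage_read(ctx, path)


def submit_job(path: str) -> TraceContext:
    root = TraceContext("job-1", next(_span_ids), None)
    record(root, "scheduler", "job submitted")
    run_task(root, path)
    return root


root = submit_job("/logs/part-0")
```

Because every event carries the same `trace_id` and a `parent_id`, a single query over `events` reconstructs the path from scheduler to executor to storage without logging into individual nodes.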
Putting it All Together
• A successful datacenter OS might let users:
  – Build a Hadoop-like software stack in a week using the OS’s abstractions, while gaining other benefits (e.g. cross-stack replay debugging)
  – Share data efficiently between independently developed programming models and applications
  – Understand cluster behavior without having to log into individual nodes
  – Dynamically share the cluster with other users
Conclusion
• Datacenters need an OS-like software stack for the same reasons single computers did: manageability, efficiency & programmability
• An OS is already emerging in an ad-hoc way
• Researchers can help by taking a long-term approach towards these problems
How Researchers can Help
• Focus on paradigms, not performance
  – Industry is tackling performance but lacks the luxury to take a long-term view towards abstractions
• Explore clean-slate approaches
  – Likelier to have impact here than in a “real” OS because datacenter software changes quickly!
• Bring cluster computing to non-experts
  – Much harder and more rewarding than serving big users