Large-scale cluster management at Google with Borg
By: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes
Presented by: Xianghan Pei
EECS 582 – W16

Outline
• Introduction & Motivation
• The User Perspective
• Borg Architecture
• Scalability & Availability
• Utilization
• Isolation
• Lessons
• Future and Related Work

Introduction & Motivation
• Introduction
  • Borg: an OS for the cluster (datacenter)
• Motivation
  • Hide the details (programmers focus on the application)
  • Provide resource sharing
  • Provide high reliability and availability for the cluster
[Figure: The high-level architecture of Borg]

The User Perspective
• Workload
  • Production (prod) vs. non-production (non-prod)
• Cells
  • Heterogeneous; the median cell has around 10k machines
• Jobs and Tasks
  > borgcfg .../hello_world_webserver.borg up

The User Perspective
• Allocs
  • A reserved set of resources
• Priority, Quota, and Admission Control
  • Each job has a priority (higher-priority jobs can preempt lower-priority ones)
  • Quota is used to decide which jobs to admit for scheduling
• Naming and Monitoring
  • 50.jfoo.ubar.cc.borg.google.com
  • Monitoring of task health and thousands of performance metrics
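
The quota and naming bullets above can be illustrated with a minimal sketch. The `Quota` class and the `admit` and `bns_name` functions are hypothetical names, not Borg's API; the name layout (task index, job, user, cell) is inferred from the example name on the slide, and the two-resource quota is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Remaining quota for one user (hypothetical two-resource model)."""
    cpu: float      # cores
    ram_gib: float  # GiB of RAM


def admit(quota: Quota, cpu: float, ram_gib: float) -> bool:
    """Admit a job only if its whole request fits in the remaining quota."""
    if cpu > quota.cpu or ram_gib > quota.ram_gib:
        return False  # rejected at submission, never reaches the scheduler
    quota.cpu -= cpu
    quota.ram_gib -= ram_gib
    return True


def bns_name(task_index: int, job: str, user: str, cell: str) -> str:
    """Compose a DNS-style task name: <index>.<job>.<user>.<cell>.borg.google.com."""
    return f"{task_index}.{job}.{user}.{cell}.borg.google.com"


if __name__ == "__main__":
    q = Quota(cpu=10.0, ram_gib=64.0)
    print(admit(q, cpu=8.0, ram_gib=32.0))
    print(admit(q, cpu=4.0, ram_gib=8.0))
    print(bns_name(50, "jfoo", "ubar", "cc"))
```

Admission here is a check at submission time, which is separate from whether the scheduler later finds a machine for the admitted job.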

Borg Architecture
• Borgmaster
  • Main Borgmaster process & scheduler
  • Five replicas
• Borglet
  • Manages and monitors tasks and resources
  • The Borgmaster polls each Borglet every few seconds
[Figure: The high-level architecture of Borg (a cell)]

Borg Architecture
• Scheduling
  • Feasibility checking: find machines on which the task could run
  • Scoring: pick one of the feasible machines
    • User preferences & built-in criteria
    • E-PVM vs. best-fit
    • Tradeoff
[Figure: The high-level architecture of Borg (a cell)]
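
The two-phase scheduling pass on this slide could be sketched roughly as follows. `Machine`, `Task`, and both scoring functions are hypothetical. The spreading score is a simple stand-in for a load-spreading policy in the spirit of E-PVM, not E-PVM itself, and adding CPU and RAM leftovers without normalization is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram: float


@dataclass
class Task:
    cpu: float
    ram: float


def feasible(machines: List[Machine], task: Task) -> List[Machine]:
    """Phase 1, feasibility checking: keep machines with enough free resources."""
    return [m for m in machines
            if m.free_cpu >= task.cpu and m.free_ram >= task.ram]


def leftover(m: Machine, task: Task) -> float:
    """Free resources that would remain after placing the task (unnormalized)."""
    return (m.free_cpu - task.cpu) + (m.free_ram - task.ram)


def best_fit_score(m: Machine, task: Task) -> float:
    """Pack tightly: the less left over, the higher the score."""
    return -leftover(m, task)


def spreading_score(m: Machine, task: Task) -> float:
    """Spread load: the more left over, the higher the score (assumed stand-in)."""
    return leftover(m, task)


def schedule(machines: List[Machine], task: Task,
             score: Callable[[Machine, Task], float]) -> Optional[Machine]:
    """Phase 2, scoring: pick the best-scoring feasible machine, if any."""
    candidates = feasible(machines, task)
    if not candidates:
        return None
    return max(candidates, key=lambda m: score(m, task))
```

With the same task and cell, the two scores pick opposite ends of the feasible set, which is the tradeoff the slide refers to: tight packing leaves room for large tasks, while spreading leaves headroom on every machine.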

Scalability
• Separate scheduler
• Separate threads to poll the Borglets
• Partition functions across the five replicas
• Score caching
• Equivalence classes
• Relaxed randomization
[Figure: The high-level architecture of Borg (a cell)]
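
The last three techniques on this slide can be sketched in a few lines. All names here (`equivalence_classes`, `ScoreCache`, `sample_feasible`) are hypothetical, and task requirements are assumed to be hashable tuples such as `(cpu, ram)`; this illustrates the ideas and is not Borg's implementation.

```python
import random
from collections import defaultdict


def equivalence_classes(tasks):
    """Group task ids by identical requirements so each group is scored once."""
    groups = defaultdict(list)
    for task_id, requirements in tasks.items():
        groups[requirements].append(task_id)
    return dict(groups)


class ScoreCache:
    """Cache scores per (machine, requirements) until the machine changes."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.misses = 0  # number of times the real scoring function ran

    def score(self, machine, requirements):
        key = (machine, requirements)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score_fn(machine, requirements)
        return self._cache[key]

    def invalidate(self, machine):
        """Drop cached scores for a machine whose state has changed."""
        self._cache = {k: v for k, v in self._cache.items() if k[0] != machine}


def sample_feasible(machines, is_feasible, enough, rng):
    """Relaxed randomization: scan machines in random order, stop once enough fit."""
    order = list(machines)
    rng.shuffle(order)
    found = []
    for m in order:
        if is_feasible(m):
            found.append(m)
            if len(found) >= enough:
                break
    return found
```

A caller would pass something like `random.Random(0)` as `rng` to make the sampling reproducible in tests.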

Availability
• System
  • Borgmaster: five replicas
  • Borglet: failures handled by the Borgmaster
  • Take checkpoints
• Job
  • Reschedule evicted tasks
  • Spread tasks across failure domains
  • …
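
Spreading a job's tasks across failure domains might look like the greedy sketch below. The function name, the machine-to-domain mapping, and the tie-breaking rule are assumptions for illustration; the slide does not describe the actual placement policy.

```python
from collections import Counter


def spread_tasks(task_ids, machine_to_domain):
    """Place each task on a machine in the currently least-used failure domain.

    Ties are broken by the machine's own task count, then by machine name.
    """
    per_domain = Counter()
    per_machine = Counter()
    placement = {}
    for task in task_ids:
        machine = min(
            machine_to_domain,
            key=lambda m: (per_domain[machine_to_domain[m]], per_machine[m], m),
        )
        placement[task] = machine
        per_domain[machine_to_domain[machine]] += 1
        per_machine[machine] += 1
    return placement
```

With tasks balanced across domains, losing one rack or power domain takes out only a fraction of the job's tasks rather than all of them.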

Utilization
• Cell Sharing
  • Segregating prod and non-prod work into different cells would need more machines

Utilization
• Cell Size
  • Subdividing cells into smaller ones would require more machines

Utilization
• Fine-grained resource requests
  • No bucket sizes fit most of the tasks well
  • Bucketing resources would need more machines
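
A small arithmetic sketch shows why bucketing costs capacity. It assumes buckets that double in size from a smallest bucket, which is one plausible bucketing scheme rather than anything stated on the slide, and the request values are made up.

```python
def round_up_to_bucket(request: float, smallest: float) -> float:
    """Round a request up to the next bucket in smallest, 2*smallest, 4*smallest, ..."""
    bucket = smallest
    while bucket < request:
        bucket *= 2
    return bucket


def bucketing_overhead(requests, smallest: float) -> float:
    """Fraction of extra resources allocated because of rounding up."""
    requested = sum(requests)
    allocated = sum(round_up_to_bucket(r, smallest) for r in requests)
    return allocated / requested - 1.0


if __name__ == "__main__":
    cpu_requests = [0.6, 1.1, 2.5]  # hypothetical fine-grained CPU requests, in cores
    print(f"{bucketing_overhead(cpu_requests, smallest=0.5):.0%} extra CPU allocated")
```

Every request that falls just above a bucket boundary pays for nearly twice what it asked for, and that unused slack is what turns into additional machines.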

Utilization
• Resource reclamation
  • Estimate how many resources a task will use and reclaim the rest for other work
  • Kill non-prod tasks if resources are not available
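
The estimate-and-reclaim idea reduces to simple arithmetic. In this sketch the reservation is modeled as observed usage plus a safety margin, capped at the task's limit; the function names and the fixed-margin model are assumptions, not the estimator Borg actually uses.

```python
def reservation(limit: float, usage: float, safety_margin: float) -> float:
    """Estimated need: observed usage plus a safety margin, never above the limit."""
    return min(limit, usage + safety_margin)


def reclaimable(limit: float, usage: float, safety_margin: float) -> float:
    """Resources a task requested but is not expected to use."""
    return limit - reservation(limit, usage, safety_margin)


if __name__ == "__main__":
    # Hypothetical task: asked for 4 cores, uses 1.5, margin of 0.5 cores.
    print(reclaimable(limit=4.0, usage=1.5, safety_margin=0.5))
```

A smaller margin reclaims more but raises the chance that the task's usage grows into resources already handed to other work, which is when non-prod tasks have to be killed.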

Utilization
• Resource reclamation
  • Choose the medium setting for resource estimation

Isolation
• Security isolation
  • chroot jail as the primary security isolation mechanism
• Performance isolation
  • High-priority latency-sensitive (LS) tasks (prod) get the best treatment
  • Compressible resources: reclaimed
  • Non-compressible resources: kill or remove the task
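
The difference between the two resource classes can be sketched as a small decision function. The resource names, the `(name, priority)` task tuples, and the rule of picking the lowest-priority task as the victim are assumptions made for illustration.

```python
COMPRESSIBLE = {"cpu", "disk_io"}            # rate-based: can be throttled
NON_COMPRESSIBLE = {"memory", "disk_space"}  # cannot be taken back without killing


def react_to_shortage(resource: str, tasks):
    """Return (action, victim_name) for a shortage of the given resource.

    tasks is a list of (name, priority) pairs; a larger number means higher priority.
    """
    victim_name, _ = min(tasks, key=lambda t: t[1])
    if resource in COMPRESSIBLE:
        return ("throttle", victim_name)
    if resource in NON_COMPRESSIBLE:
        return ("kill", victim_name)
    raise ValueError(f"unknown resource: {resource}")


if __name__ == "__main__":
    running = [("web_frontend", 200), ("batch_index", 100)]  # hypothetical tasks
    print(react_to_shortage("cpu", running))
    print(react_to_shortage("memory", running))
```

In both cases the low-priority batch task absorbs the shortage, so the high-priority task keeps its resources.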

Lessons
• Bad
  • Jobs are restrictive as the only grouping mechanism for tasks
  • One IP address per machine complicates things
  • Optimizing for power users at the expense of casual ones
• Good
  • Allocs are useful
  • Cluster management is more than task management
  • Introspection is vital
  • The master is the kernel of a distributed system

Future Work
• Kubernetes
  • Evolved from Borg
  • An open-source system for automating deployment, operations, and scaling of containerized applications
  • Pods: groups of containers
  • Labels
  • Replication controller
  • Services
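
Labels address the "jobs are the only grouping mechanism" problem by letting any set of pods be selected by key/value pairs. The sketch below shows equality-based matching only; `matches` and `select` are hypothetical helpers, not the Kubernetes client API, and the pod names and labels are invented.

```python
def matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(pods: dict, selector: dict) -> list:
    """Return the names of pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if matches(selector, labels)]


if __name__ == "__main__":
    pods = {
        "web-1": {"app": "web", "tier": "frontend", "env": "prod"},
        "web-2": {"app": "web", "tier": "frontend", "env": "canary"},
        "db-1": {"app": "db", "tier": "backend", "env": "prod"},
    }
    print(select(pods, {"app": "web"}))
    print(select(pods, {"env": "prod"}))
```

The same pod can belong to several overlapping groups at once, which a single job membership cannot express.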

Related Work
• Apache Mesos
• YARN
• Tupperware

Questions?

References
• Borg paper
• Borg slides at InfoQ