Large-scale cluster management at Google with Borg Google Inc.

Borg
• A cluster manager.
  ◦ Runs hundreds of thousands of jobs from many thousands of different applications.
  ◦ Across a number of clusters, each with up to tens of thousands of machines.
• With very high reliability and availability.

Workloads
• Heterogeneous workload with two main parts.
  ◦ Long-running services
    ▪ Handle short-lived, latency-sensitive requests.
    ▪ High priority (prod).
  ◦ Batch jobs
    ▪ Take a few seconds to a few days to complete.
    ▪ Low priority (non-prod).

Jobs and Tasks
• Job
  ◦ Runs in one Borg cell.
  ◦ Consists of many tasks.
  ◦ Has properties and constraints.
    ▪ Name, owner, number of tasks, priority.
• Task
  ◦ Maps to a set of Linux processes running in a container on a machine.
  ◦ Has properties and constraints.
    ▪ Resource requirements (CPU cores, RAM, disk space, disk access rate, TCP ports, etc.).

Jobs and Tasks (Cont.)
• Non-overlapping priority bands
  ◦ Monitoring, production, batch, and best effort.
  ◦ Tasks from higher-priority jobs can preempt lower-priority ones.
  ◦ Tasks in the production priority band are not allowed to preempt one another.
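The preemption rule on this slide can be sketched in a few lines. This is an illustration only: the band names come from the slide, but the numeric priority ranges and the helpers `band_of` and `can_preempt` are hypothetical, not Borg's actual values or API.

```python
# Hypothetical priority ranges for each band (illustrative values only).
BANDS = [
    ("best_effort", 0, 99),
    ("batch", 100, 199),
    ("production", 200, 299),
    ("monitoring", 300, 399),
]


def band_of(priority: int) -> str:
    """Return the name of the band that contains `priority`."""
    for name, low, high in BANDS:
        if low <= priority <= high:
            return name
    raise ValueError(f"priority {priority} is outside every band")


def can_preempt(incoming: int, running: int) -> bool:
    """Decide whether a task at `incoming` priority may evict one at `running`."""
    # Only a strictly higher priority can preempt.
    if incoming <= running:
        return False
    # Production tasks never preempt one another, whatever their exact priorities.
    if band_of(incoming) == "production" and band_of(running) == "production":
        return False
    return True
```

With these assumed ranges, `can_preempt(250, 150)` is true (production over batch), while `can_preempt(260, 250)` is false because both tasks sit in the production band.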

Jobs and Tasks (Cont.)
• Jobs with insufficient quota are immediately rejected upon submission.
  ◦ Quota: a vector of resource quantities.
    ▪ (CPU, RAM, disk space, etc.)
  ◦ Higher-priority quota costs more.
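A minimal sketch of quota-based admission, treating quota as a vector of resource quantities as the slide describes. The resource names, the `job_demand` and `admit` helpers, and the idea of multiplying per-task requirements by the task count are assumptions made for illustration; the slide does not say how Borg computes a job's demand.

```python
from typing import Dict

# A resource vector, e.g. {"cpu": 2.0, "ram_gb": 4.0}. Key names are hypothetical.
Resources = Dict[str, float]


def job_demand(per_task: Resources, num_tasks: int) -> Resources:
    """Total demand of a job, assuming every task asks for the same resources."""
    return {name: amount * num_tasks for name, amount in per_task.items()}


def admit(demand: Resources, remaining_quota: Resources) -> bool:
    """Accept the job only if every requested quantity fits in the remaining quota.

    A resource that is missing from the quota vector counts as zero quota.
    """
    return all(
        amount <= remaining_quota.get(name, 0.0)
        for name, amount in demand.items()
    )
```

The check runs at submission time, so a job that fails it is rejected immediately instead of waiting in a queue.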

Architecture (Cont.)
• Cell
  ◦ A set of heterogeneous machines that run jobs in a cluster.
  ◦ Median cell size: 10k machines.
• Alloc
  ◦ A reserved set of resources on a machine.

Scheduler
• The scheduling algorithm consists of two parts.
  ◦ Feasibility checking: finds machines on which the task could run.
  ◦ Scoring: picks one of the feasible machines.
• Spreading load vs. best-fit
• Uses a hybrid method to reduce the amount of stranded resources: ones that cannot be used because another resource on the machine is fully allocated.
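The two-part structure (feasibility check, then scoring) can be sketched as follows. The scoring function here is a stand-in chosen for illustration, not Borg's actual hybrid model: it prefers the machine whose leftover resources are most balanced, so that no single resource is exhausted while others sit idle. The machine layout and all names (`feasible`, `stranding_score`, `pick_machine`) are hypothetical.

```python
from typing import Dict, Optional

Resources = Dict[str, float]


def feasible(free: Resources, request: Resources) -> bool:
    """Feasibility check: the machine has enough of every requested resource."""
    return all(free.get(name, 0.0) >= amount for name, amount in request.items())


def stranding_score(capacity: Resources, free: Resources, request: Resources) -> float:
    """Imbalance of the leftover fractions after placement (lower is better).

    A large gap between the most-free and least-free resource suggests that
    the plentiful one is likely to end up stranded.
    """
    leftover = [
        (free[name] - request.get(name, 0.0)) / capacity[name]
        for name in capacity
    ]
    return max(leftover) - min(leftover)


def pick_machine(machines: Dict[str, Dict[str, Resources]],
                 request: Resources) -> Optional[str]:
    """Return the feasible machine with the lowest score, or None if none fits."""
    candidates = [
        name for name, m in machines.items() if feasible(m["free"], request)
    ]
    if not candidates:
        return None
    # Ties are broken by machine name to keep the choice deterministic.
    return min(
        candidates,
        key=lambda n: (
            stranding_score(machines[n]["capacity"], machines[n]["free"], request),
            n,
        ),
    )
```

Separating the two steps keeps the hard constraints (does the task fit at all?) apart from the preference policy, so the scoring rule can change without touching the feasibility logic.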

Performance Isolation
• To help with overload and overcommitment.
• Latency-sensitive (LS) tasks vs. the rest (batch).
  ◦ LS tasks are capable of temporarily starving batch tasks for several seconds.
• Compressible vs. non-compressible resources.
  ◦ Low-priority tasks are terminated when a machine runs out of non-compressible resources.
  ◦ Usage is throttled (favoring LS tasks) when a machine runs out of compressible resources.
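The two reactions on this slide, terminating and throttling, can be sketched as below. This is a simplified model under stated assumptions: termination proceeds from lowest to highest priority until usage fits, and throttling serves LS tasks first and scales batch tasks down proportionally. The function names, the tuple layouts, and the proportional-scaling rule are hypothetical and not taken from Borg.

```python
from typing import Dict, List, Tuple


def tasks_to_terminate(tasks: List[Tuple[str, int, float]],
                       capacity: float) -> List[str]:
    """Pick tasks to kill when a non-compressible resource is over capacity.

    `tasks` holds (name, priority, usage) for a single resource such as RAM.
    Lowest-priority tasks are chosen first, until total usage fits.
    """
    total = sum(usage for _, _, usage in tasks)
    victims = []
    for name, _, usage in sorted(tasks, key=lambda t: t[1]):
        if total <= capacity:
            break
        victims.append(name)
        total -= usage
    return victims


def throttle(demands: List[Tuple[str, bool, float]],
             capacity: float) -> Dict[str, float]:
    """Share out a resource that can be throttled, favoring LS tasks.

    `demands` holds (name, is_latency_sensitive, demand). LS tasks are served
    first; batch tasks split whatever is left in proportion to their demand.
    """
    grants: Dict[str, float] = {}
    remaining = capacity
    for name, is_ls, demand in demands:
        if is_ls:
            grants[name] = min(demand, remaining)
            remaining -= grants[name]
    batch = [(name, demand) for name, is_ls, demand in demands if not is_ls]
    batch_total = sum(demand for _, demand in batch)
    scale = min(1.0, remaining / batch_total) if batch_total > 0 else 0.0
    for name, demand in batch:
        grants[name] = demand * scale
    return grants
```

Throttling slows batch work down without losing it, whereas termination discards progress, which is why the harsher reaction is reserved for resources that cannot be reclaimed any other way.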

Combined vs Segregated

Lessons Learned
• The bad:
  ◦ Jobs are restrictive as the only grouping mechanism for tasks.
  ◦ One IP address per machine complicates things.
  ◦ Optimizing for power users at the expense of casual ones.

Lessons Learned (Cont.)
• The good:
  ◦ Allocs are useful.
  ◦ Cluster management is more than task management.
  ◦ Introspection is vital.
  ◦ The master is the kernel of a distributed system.

Conclusion
• Virtually all of Google’s cluster workloads have switched to use Borg over the past decade.
• Google continues to evolve it and has applied the lessons learned from it to Kubernetes.