Bigtable: A Distributed Storage System for Structured Data

Abstract • Distributed Storage System • Petabytes of Structured Data • Web indexing, Google Earth, Google Finance……

1. Introduction • Goals: wide applicability, scalability, high performance, high availability • Does not support a full relational data model • Data is indexed by row and column names • Values are uninterpreted strings

2. Data Model • A sparse, distributed, persistent multidimensional sorted map • (row: string, column: string, time: int64) --> string
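
To make the map signature above concrete, here is a minimal single-process sketch. The class name `SparseSortedMap` and its methods are hypothetical, chosen only for illustration; nothing here is distributed or persistent, and it is not Bigtable's API.

```python
class SparseSortedMap:
    """Toy stand-in for the (row, column, time) -> string map (illustrative only)."""

    def __init__(self):
        # Only cells that were written take up space, which is what "sparse" means here.
        self._cells = {}  # (row, column, timestamp) -> value

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column, timestamp=None):
        """Return the value at `timestamp`, or the newest version if it is None."""
        versions = {ts: v for (r, c, ts), v in self._cells.items()
                    if r == row and c == column}
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start_row, end_row):
        """Yield (key, value) for start_row <= row < end_row, sorted by row,
        then column, then newest timestamp first."""
        keys = sorted(
            (k for k in self._cells if start_row <= k[0] < end_row),
            key=lambda k: (k[0], k[1], -k[2]),
        )
        for key in keys:
            yield key, self._cells[key]
```

Because rows are kept in sorted order, a short range scan touches only neighbouring keys, which is why the choice of row key matters for locality.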

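Rows • Row keys are arbitrary strings (up to 64 KB) • Every read or write of data under a single row key is atomic • Data is kept in lexicographic order by row key • The row range is dynamically partitioned into tablets; the tablet is the unit of distribution and load balancing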

Column Families • Column keys are grouped into sets called column families • Data in a family is usually of the same type (and is compressed together) • Column key syntax: family:qualifier • Access control and both disk and memory accounting are performed at the column-family level

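Timestamp • 64-bit integer (microseconds when assigned by Bigtable) • Each cell can contain multiple versions of the same data • Assigned by Bigtable or by the client • Versions are stored in decreasing timestamp order, so the most recent is read first • Old versions are garbage-collected automatically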
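
The garbage-collection bullet can be sketched as a small policy function. This assumes the two per-column-family settings described in the Bigtable paper (keep only the last n versions, or keep only new-enough versions); the function name and parameters are hypothetical.

```python
def garbage_collect(versions, keep_last=None, min_timestamp=None):
    """Return the surviving versions of one cell, newest first.

    versions: iterable of (timestamp, value) pairs in any order.
    keep_last: if set, keep at most this many of the newest versions.
    min_timestamp: if set, drop versions older than this timestamp.
    """
    newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
    if min_timestamp is not None:
        newest_first = [tv for tv in newest_first if tv[0] >= min_timestamp]
    if keep_last is not None:
        newest_first = newest_first[:keep_last]
    return newest_first
```

With neither setting given, the sketch keeps every version and only enforces the decreasing-timestamp order.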

3. API • Create and delete tables and column families • Change cluster, table, and column family metadata, such as access control rights • Read and write values in Bigtable

Features • Single-row transactions • Client-supplied scripts (Sawzall) for data processing on the servers (read-only; scripts cannot write back into Bigtable)

4. Building Blocks • Google File System(GFS) • Operates in a shared pool of machines • Depends on a cluster management system for scheduling jobs, managing resources, dealing with machine failures, and monitoring machine status

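SSTable and Chubby • SSTable – immutable, sorted file of key-value pairs – stored as a sequence of blocks – a block index locates the block for a key • Chubby – distributed lock service – uses Paxos to keep replicas consistent – directories and small files can be used as locks – sessions with session leases – callbacks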
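
As a rough illustration of the SSTable block index listed above, the sketch below splits sorted key-value pairs into fixed-size blocks and keeps the last key of each block in an in-memory index, so a lookup reads at most one block. `ToySSTable` and its `block_size` parameter are hypothetical; the real file format is not described in these slides.

```python
from bisect import bisect_left


class ToySSTable:
    """Immutable sorted key-value file with an in-memory block index (toy model)."""

    def __init__(self, sorted_items, block_size=2):
        # Each block stands in for one on-disk block of the file.
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # The index holds the last key of every block.
        self.index = [block[-1][0] for block in self.blocks]
        self.block_reads = 0  # counts simulated disk reads

    def get(self, key):
        # Binary search the index for the first block whose last key >= key.
        i = bisect_left(self.index, key)
        if i == len(self.blocks):
            return None  # key is past the end of the file: no block read needed
        self.block_reads += 1
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None
```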

Chubby • Bigtable uses Chubby for a variety of tasks – ensure there is at most one active master at any time – store the bootstrap location of Bigtable data – discover tablet servers and finalize tablet server deaths – store schema information – store access control lists

5. Implementation • Client library • Master server • Tablet server

5.1 Tablet Location • Three-level hierarchy: Chubby file --> root tablet --> other METADATA tablets --> user tablets

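METADATA • METADATA table: (table id, end row) --> (location of tablet, secondary information) • Client library caches tablet locations – if the cache is empty or an entry is stale (incorrect), the client moves back up the hierarchy – prefetches the locations of additional tablets on each METADATA read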
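
The (table id, end row) key makes a location lookup a single binary search: the tablet that owns a row is the first one whose end row is not smaller than that row. A minimal sketch under that reading follows; `ToyMetadata`, the server names, and the `"\uffff"` sentinel used as the end row of a table's last tablet are all assumptions made for the example.

```python
from bisect import bisect_left


class ToyMetadata:
    """Maps (table_id, end_row) to a tablet location (illustrative only)."""

    def __init__(self, entries):
        # entries: dict {(table_id, end_row): location}
        self._locations = dict(entries)
        self._keys = sorted(self._locations)

    def locate(self, table_id, row):
        # First key >= (table_id, row) is the tablet whose range ends at or after row.
        i = bisect_left(self._keys, (table_id, row))
        if i < len(self._keys) and self._keys[i][0] == table_id:
            return self._locations[self._keys[i]]
        raise KeyError(f"no tablet covers {table_id!r}/{row!r}")
```

A client-side cache would sit in front of `locate` and fall back to it when an entry is missing or turns out to be stale; that part is left out to keep the sketch short.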

5.2 Tablet Assignment • The master assigns tablets to tablet servers • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory (the servers directory) • A tablet server stops serving its tablets if it loses its exclusive lock (e.g. because it loses its Chubby session) • Whenever a tablet server terminates, it attempts to release its lock

• When a master is started by the cluster management system, it – grabs a unique master lock in Chubby – scans the servers directory to find the live tablet servers – communicates with every live tablet server to discover which tablets are already assigned to each server – scans the METADATA table and adds every unassigned tablet to the set of unassigned tablets – adds the root tablet to that set first if no assignment for it was discovered, so that the METADATA scan can proceed

• The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible • The master periodically asks each tablet server for the status of its lock

• The master initiates these tablet changes – when a table is created or deleted – when two tablets are merged • A tablet server initiates a tablet split – commits the split by recording the new tablet in METADATA – notifies the master

5.3 Tablet Serving

Write Operation • Check – well-formed – authorized • Write to commit log(group commit) • Insert into memtable
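
The write steps above can be sketched in order: validate, authorize, append to the commit log, then update the memtable. This is a toy model with hypothetical names (`ToyTabletServer`, `write_batch`); group commit is approximated by appending one log record for a whole batch of mutations, and the "well-formed" check is reduced to requiring a family:qualifier column key.

```python
class ToyTabletServer:
    """Toy write path: check, log, then apply to the memtable (illustrative only)."""

    def __init__(self, authorized_writers):
        self.authorized_writers = set(authorized_writers)  # stands in for the ACL
        self.commit_log = []  # one record per group commit
        self.memtable = {}    # (row, column, timestamp) -> value

    def write_batch(self, user, mutations):
        """mutations: list of (row, column, timestamp, value) tuples."""
        # 1. Well-formedness check (simplified: string keys, family:qualifier column).
        for row, column, timestamp, value in mutations:
            if not (isinstance(row, str) and isinstance(column, str) and ":" in column):
                raise ValueError(f"malformed mutation for row {row!r}")
        # 2. Authorization check.
        if user not in self.authorized_writers:
            raise PermissionError(f"{user!r} may not write")
        # 3. Group commit: the whole batch becomes a single log append.
        self.commit_log.append(list(mutations))
        # 4. Only after logging do the mutations become visible in the memtable.
        for row, column, timestamp, value in mutations:
            self.memtable[(row, column, timestamp)] = value
```

Logging before touching the memtable is what lets a restarted server rebuild the memtable from the log.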

Read Operation • Check – well-formed – authorized • Executed on a merged view of the SSTables and the memtable – SSTables and the memtable are lexicographically sorted data structures, so the merged view can be formed efficiently
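
Because every source is already sorted, the merged view is a k-way merge rather than a re-sort. The sketch below uses `heapq.merge` from the standard library and lets the newest source win when a key appears more than once; the function names and the "newest first" ordering of the input are assumptions of the example, and deletion markers are not modelled.

```python
import heapq


def _tag(age, source):
    """Attach the source's age (0 = newest) to each entry so ties resolve newest-first."""
    for key, value in source:
        yield key, age, value


def merged_view(memtable, sstables):
    """Yield (key, value) pairs in key order across all sources.

    memtable: sorted list of (key, value) pairs, the newest data.
    sstables: list of sorted lists of (key, value) pairs, newest first.
    """
    sources = [memtable] + list(sstables)
    tagged = [_tag(age, src) for age, src in enumerate(sources)]
    last_key = object()  # sentinel that equals no real key
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:  # first hit for a key comes from the newest source
            yield key, value
            last_key = key
```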

Recover Tablet • Memtable – recently committed updates • SSTables – older updates • To recover a tablet – the tablet server reads the tablet's metadata from the METADATA table – the metadata contains the list of SSTables and a set of redo points into the commit logs – the tablet server reads the SSTable indices into memory and reconstructs the memtable by applying all updates committed since the redo points
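
Recovery can be pictured as replaying the tail of the commit log. The sketch assumes a redo point is simply a log sequence number, with every update at or below it already captured in an SSTable; that encoding, and the function name, are hypothetical.

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild a memtable from log entries committed after `redo_point`.

    commit_log: list of (sequence_number, key, value) in commit order.
    redo_point: assumed to be the last sequence number already stored in SSTables.
    """
    memtable = {}
    for sequence_number, key, value in commit_log:
        if sequence_number > redo_point:
            memtable[key] = value  # later entries overwrite earlier ones
    # Return entries in key order, matching the sorted memtable of the slides.
    return dict(sorted(memtable.items()))
```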

5.4 Compactions • Minor compaction – when the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS • Merging compaction – merges a few SSTables and the memtable and writes out a new SSTable • Major compaction – a merging compaction that rewrites all SSTables into exactly one SSTable – produces an SSTable that contains no deletion markers or deleted data – Bigtable cycles through all of its tablets and regularly applies major compactions to them
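
The difference between minor and major compaction is easiest to see with deletions. In the sketch below a delete writes a marker (`TOMBSTONE`, a hypothetical sentinel) that shadows older values; only the major compaction can drop the marker, because only then is no older SSTable left for it to shadow. The class, its entry-count threshold, and the in-memory "SSTables" are all simplifications.

```python
TOMBSTONE = object()  # hypothetical deletion marker


class ToyTablet:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # newest first; each is an immutable tuple of sorted pairs
        self.memtable_limit = memtable_limit  # threshold counted in entries here

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def minor_compaction(self):
        """Freeze the memtable, write it out as a new SSTable, start a fresh memtable."""
        if self.memtable:
            self.sstables.insert(0, tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def major_compaction(self):
        """Rewrite the memtable and all SSTables into one SSTable without deleted data."""
        self.minor_compaction()
        merged = {}
        for table in reversed(self.sstables):  # oldest first, so newer entries win
            merged.update(table)
        live = tuple(sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE))
        self.sstables = [live]
```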

6. Refinements • The implementation described in the previous section required a number of refinements to achieve the high performance, availability, and reliability required by our users.

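Locality Groups • Clients can group multiple column families together into a locality group • A separate SSTable is generated for each locality group in each tablet • Effect – more efficient reads, since column families that are not accessed together are stored apart – tuning parameters can be set per locality group, e.g. in-memory (useful for small data that is read frequently, such as the location column family in the METADATA table)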

Compression • Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used • Many clients use a two-pass custom compression scheme • When similar data ends up clustered, applications achieve very good compression ratios

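Caching for Read Performance • Scan Cache – higher-level cache of key-value pairs – holds results returned by the SSTable interface – useful when the same data is read repeatedly • Block Cache – lower-level cache of SSTable blocks – holds blocks read from GFS – useful for sequential reads and other reads of data close to recently read data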

Bloom Filters • A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair • Kept in tablet server memory • Reduces the number of disk seeks for reads
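
A minimal Bloom filter sketch shows why the answer is "might contain": a negative answer is definite, so the SSTable can be skipped without touching disk, while a positive answer can occasionally be a false positive. The sizes, the SHA-256-based hashing, and the `"row/column"` key string are assumptions for illustration, not the parameters Bigtable uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

A read would consult the filter for each SSTable and open only those for which `might_contain` returns true.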

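Commit-log Implementation • A single commit log per tablet server, shared by all of its tablets • For recovery, log entries are first sorted by (table, row name, log sequence number); the sorting is parallelized across different tablet servers, and each recovering server then reads its tablet's mutations sequentially • To protect mutations from GFS latency spikes, each tablet server actually has two log writing threads, each writing to its own log file; only one of these two threads is actively in use at a time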
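
Sorting the shared log by (table, row name, sequence number) places all mutations for one tablet in a contiguous run, so a recovering server can read them with one sequential pass. The sketch below shows that effect on a single in-memory list; the tuple layout and function names are assumptions, and the parallel, segment-wise sorting is not modelled.

```python
from bisect import bisect_right


def sort_commit_log(entries):
    """entries: list of (table, row, sequence_number, mutation) tuples."""
    return sorted(entries, key=lambda e: (e[0], e[1], e[2]))


def mutations_for_tablet(sorted_entries, table, start_row, end_row):
    """Return the contiguous slice for rows in (start_row, end_row] of `table`."""
    keys = [(e[0], e[1]) for e in sorted_entries]
    lo = bisect_right(keys, (table, start_row))  # skip rows <= start_row
    hi = bisect_right(keys, (table, end_row))    # include rows == end_row
    return sorted_entries[lo:hi]
```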

Speeding Up Tablet Recovery • When the master moves a tablet, the source tablet server performs: first minor compaction --> stop serving the tablet --> second (usually very fast) minor compaction --> unload • After this second minor compaction is complete, the tablet can be loaded on another tablet server without requiring any recovery of log entries

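Exploiting Immutability • All of the SSTables that we generate are immutable • The only mutable data structure that is accessed by both reads and writes is the memtable • Memtable rows are copy-on-write, so reads and writes can proceed in parallel • Obsolete SSTables are removed by mark-and-sweep garbage collection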

7. Performance Evaluation