Bigtable: A Distributed Storage System for Structured Data
Jing Zhang
EECS 584, Fall 2011
Reference: Handling Large Datasets at Google: Current System and Future Directions, Jeff Dean
Outline
■ Motivation
■ Data Model
■ APIs
■ Building Blocks
■ Implementation
■ Refinement
■ Evaluation
Google’s Motivation – Scale!
■ Scale problem
  – Lots of data
  – Millions of machines
  – Different projects/applications
  – Hundreds of millions of users
■ Storage for (semi-)structured data
■ No commercial system big enough
  – Couldn’t afford it if there was one
■ Low-level storage optimization helps performance significantly
  – Much harder to do when running on top of a database layer
Bigtable
■ Distributed multi-level map
■ Fault-tolerant, persistent
■ Scalable
  – Thousands of servers
  – Terabytes of in-memory data
  – Petabytes of disk-based data
  – Millions of reads/writes per second, efficient scans
■ Self-managing
  – Servers can be added/removed dynamically
  – Servers adjust to load imbalance
Real Applications
Outline
■ Motivation
■ Data Model
■ APIs
■ Building Blocks
■ Implementation
■ Refinement
■ Evaluation
Data Model
■ A sparse, distributed, persistent, multi-dimensional sorted map
  (row, column, timestamp) -> cell contents
Data Model
■ Rows
  – Arbitrary string
  – Access to data in a row is atomic
  – Ordered lexicographically
Data Model
■ Columns
  – Two-level name structure:
    • family:qualifier
  – Column family is the unit of access control
Data Model
■ Timestamps
  – Store different versions of data in a cell
  – Lookup options:
    • Return most recent K values
    • Return all values
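To make the model concrete, here is a minimal in-memory sketch of a sparse, multi-dimensional sorted map with versioned cells. The class and method names are illustrative, not Bigtable's; the row and column values follow the paper's WebTable example.

```python
class ToyTable:
    """Sparse sorted map: (row, 'family:qualifier', timestamp) -> value."""

    def __init__(self):
        # row key -> column key -> list of (timestamp, value),
        # newest first, mirroring Bigtable's decreasing-timestamp order.
        self.rows = {}

    def set(self, row, column, timestamp, value):
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: -tv[0])  # most recent first

    def lookup(self, row, column, k=1):
        # Return the most recent k versions (k=None returns all).
        versions = self.rows.get(row, {}).get(column, [])
        return versions if k is None else versions[:k]

    def scan(self, start_row, end_row):
        # Rows are ordered lexicographically; iterate over a row range.
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                yield row, self.rows[row]

t = ToyTable()
t.set("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.set("com.cnn.www", "contents:", 5, "<html>...v5")
t.set("com.cnn.www", "contents:", 6, "<html>...v6")
print(t.lookup("com.cnn.www", "contents:", k=1))  # newest version only
```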
Data Model
■ The row range for a table is dynamically partitioned
■ Each row range is called a tablet
■ A tablet is the unit of distribution and load balancing
Outline
■ Motivation
■ Data Model
■ APIs
■ Building Blocks
■ Implementation
■ Refinement
■ Evaluation
APIs
■ Metadata operations
  – Create/delete tables and column families, change metadata
■ Writes
  – Set(): write cells in a row
  – DeleteCells(): delete cells in a row
  – DeleteRow(): delete all cells in a row
■ Reads
  – Scanner: read arbitrary cells in a bigtable
    • Each row read is atomic
    • Can restrict returned rows to a particular range
    • Can ask for just data from 1 row, all rows, etc.
    • Can ask for all columns, just certain column families, or specific columns
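A self-contained sketch of this API surface over an in-memory dict. The paper's real client library is C++ (RowMutation/Apply/Scanner), so every name below is illustrative only.

```python
class ToyClient:
    def __init__(self):
        self.table = {}  # row -> {column: value}

    def set(self, row, column, value):
        # Set(): write a cell in a row (per-row atomicity in real Bigtable).
        self.table.setdefault(row, {})[column] = value

    def delete_cells(self, row, column):
        # DeleteCells(): delete a cell in a row.
        self.table.get(row, {}).pop(column, None)

    def delete_row(self, row):
        # DeleteRow(): delete all cells in a row.
        self.table.pop(row, None)

    def scan(self, start="", end="\xff", families=None):
        # Scanner: iterate rows in a range, optionally filtered by family.
        for row in sorted(self.table):
            if start <= row < end:
                cells = {c: v for c, v in self.table[row].items()
                         if families is None or c.split(":")[0] in families}
                yield row, cells

c = ToyClient()
c.set("com.cnn.www", "anchor:cnnsi.com", "CNN")
c.set("com.cnn.www", "contents:", "<html>...")
for row, cells in c.scan(start="com.", families=["anchor"]):
    print(row, cells)  # only the anchor family is returned
```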
Outline
■ Motivation
■ Data Model
■ APIs
■ Building Blocks
■ Implementation
■ Refinement
■ Evaluation
Typical Cluster
■ Shared pool of machines that also run other distributed applications
Building Blocks
■ Google File System (GFS)
  – Stores persistent data (SSTable file format)
■ Scheduler
  – Schedules jobs onto machines
■ Chubby
  – Lock service: distributed lock manager
  – Master election, location bootstrapping
■ MapReduce (optional)
  – Data processing
  – Read/write Bigtable data
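The SSTable format itself is a persistent, ordered, immutable map from keys to values, looked up through a block index. A toy sketch of that idea, much simplified from the paper's description (real SSTables use binary blocks, typically 64 KB, with the index loaded into memory):

```python
import bisect, json

def write_sstable(path, items, block_size=2):
    """items: sorted (key, value) pairs; block_size: entries per block."""
    blocks, index = [], []
    for i in range(0, len(items), block_size):
        block = items[i:i + block_size]
        index.append(block[0][0])        # first key of each block
        blocks.append(block)
    with open(path, "w") as f:
        json.dump({"index": index, "blocks": blocks}, f)

def sstable_get(path, key):
    with open(path) as f:
        sst = json.load(f)
    # Binary-search the block index, then scan a single block.
    b = max(bisect.bisect_right(sst["index"], key) - 1, 0)
    for k, v in sst["blocks"][b]:
        if k == key:
            return v
    return None

write_sstable("/tmp/toy.sst", [("a", 1), ("b", 2), ("c", 3), ("d", 4)])
print(sstable_get("/tmp/toy.sst", "c"))  # -> 3
```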
Chubby
■ {lock/file/name} service
■ Coarse-grained locks
■ Each client has a session with Chubby
  – The session expires if the client is unable to renew its session lease within the lease expiration time
■ 5 replicas; a majority vote is needed to be active
■ Also an OSDI ’06 paper
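The session lease is what turns a network partition into a detectable failure. A minimal sketch of the lease mechanics, with illustrative names and timings:

```python
import time

class ToySession:
    def __init__(self, lease_seconds=12.0):
        self.lease_seconds = lease_seconds
        self.expires_at = time.monotonic() + lease_seconds

    def renew(self):
        # KeepAlive: extend the lease from "now".
        self.expires_at = time.monotonic() + self.lease_seconds

    def expired(self):
        # Once expired, locks held under this session are released,
        # which is how the master later detects a dead tablet server.
        return time.monotonic() > self.expires_at

s = ToySession(lease_seconds=0.1)
time.sleep(0.2)          # client partitioned away; no renewals arrive
print(s.expired())       # -> True
```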
Outline
■ Motivation
■ Overall Architecture & Building Blocks
■ Data Model
■ APIs
■ Implementation
■ Refinement
■ Evaluation
Implementation
■ Single-master distributed system
■ Three major components
  – Library that is linked into every client
  – One master server
    • Assigning tablets to tablet servers
    • Detecting addition and expiration of tablet servers
    • Balancing tablet-server load
    • Garbage collection
    • Metadata operations
  – Many tablet servers
    • Tablet servers handle read and write requests to their tablets
    • Split tablets that have grown too large
Implementation
Tablets
■ Each tablet is assigned to one tablet server
  – A tablet holds a contiguous range of rows
    • Clients can often choose row keys to achieve locality
  – Aim for ~100 MB to 200 MB of data per tablet
■ Each tablet server is responsible for ~100 tablets
  – Fast recovery:
    • 100 machines each pick up 1 tablet from a failed machine
  – Fine-grained load balancing:
    • Migrate tablets away from overloaded machines
    • Master makes load-balancing decisions
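A sketch of why ~100 small tablets per server gives fast recovery: the failed server's tablets can be reassigned one per machine and reloaded in parallel. The round-robin policy below is a stand-in; the real master also weighs server load:

```python
import itertools

def reassign(failed_tablets, live_servers):
    # Spread the failed server's tablets round-robin across live servers,
    # so each survivor reloads only a small slice of the lost data.
    assignment = {}
    for tablet, server in zip(failed_tablets, itertools.cycle(live_servers)):
        assignment[tablet] = server
    return assignment

tablets = [f"tablet-{i}" for i in range(100)]
servers = [f"ts-{i}" for i in range(100)]
plan = reassign(tablets, servers)
print(plan["tablet-0"], plan["tablet-99"])  # landed on different servers
```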
How to locate a Tablet?
■ Given a row, how do clients find the location of the tablet whose row range covers the target row?
  – METADATA table: key = table id + end row, data = location
■ Aggressive caching and prefetching at the client side
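Because METADATA rows are keyed by each tablet's end row, locating a tablet is a search for the first entry whose end row is at or past the target row. A sketch of just that key scheme; the real system adds a three-level hierarchy rooted in Chubby plus the client-side caching above:

```python
import bisect

# Sorted (end_row, location) entries for one table's tablets.
metadata = [
    ("apple",  "ts-7"),   # covers rows up to and including "apple"
    ("monkey", "ts-2"),   # covers rows after "apple" up to "monkey"
    ("\xff",   "ts-5"),   # covers the remainder of the table
]
end_rows = [end for end, _ in metadata]

def locate(row):
    i = bisect.bisect_left(end_rows, row)   # first end_row >= row
    return metadata[i][1]

print(locate("banana"))  # -> "ts-2"
```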
Tablet Assignment
■ Each tablet is assigned to one tablet server at a time
■ The master keeps track of the set of live tablet servers and the current assignment of tablets to servers
■ When a tablet is unassigned, the master assigns it to a tablet server with sufficient room
■ The master uses Chubby to monitor the health of tablet servers, and to restart/replace failed servers
Tablet Assignment
■ Chubby
  – A tablet server registers itself by acquiring a lock on a file in a specific Chubby directory
    • Chubby gives a “lease” on the lock, which must be renewed periodically
    • The server loses its lock if it gets disconnected
  – The master monitors this directory to find which servers exist/are alive
    • If a server is not contactable or has lost its lock, the master grabs the lock and reassigns its tablets
    • GFS replicates the data; prefer to start the tablet server on a machine that already holds the data
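A sketch of the master's rescan over the Chubby servers directory. The dict here stands in for Chubby, and real reassignment involves the master acquiring the dead server's lock first:

```python
def rescan(chubby_dir, assignments, unassigned):
    """chubby_dir: server -> still holds its lock?
    assignments: tablet -> server; mutates both collections."""
    dead = {s for s, holds_lock in chubby_dir.items() if not holds_lock}
    for tablet, server in list(assignments.items()):
        if server in dead:
            # In the real system the master grabs the lock, then reassigns.
            del assignments[tablet]
            unassigned.append(tablet)

chubby_dir = {"ts-1": True, "ts-2": False}   # ts-2 lost its lock
assignments = {"t-a": "ts-1", "t-b": "ts-2"}
unassigned = []
rescan(chubby_dir, assignments, unassigned)
print(assignments, unassigned)  # t-b now awaits reassignment
```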
Outline
■ Motivation
■ Overall Architecture & Building Blocks
■ Data Model
■ APIs
■ Implementation
■ Refinement
■ Evaluation
Refinement – Locality Groups & Compression
■ Locality groups
  – Clients can group multiple column families into a locality group
    • A separate SSTable is created for each locality group in each tablet
  – Segregating column families that are not typically accessed together enables more efficient reads
    • In WebTable, page metadata can be in one group and the contents of the page in another group
■ Compression
  – Many opportunities for compression
    • Similar values in the cell at different timestamps
    • Similar values in different columns
    • Similar values across adjacent rows
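A sketch of the locality-group split: each group of column families gets its own file per tablet, so a metadata-only scan never touches page contents. The group names follow the WebTable example; the layout itself is purely illustrative:

```python
locality_groups = {
    "meta":    {"language", "anchor"},   # small, frequently scanned
    "content": {"contents"},             # large page bodies
}

def split_by_group(row_cells):
    """row_cells: {'family:qualifier': value} -> {group: {column: value}}."""
    out = {g: {} for g in locality_groups}
    for col, val in row_cells.items():
        family = col.split(":")[0]
        for group, families in locality_groups.items():
            if family in families:
                out[group][col] = val
    return out

cells = {"language:": "EN", "anchor:cnnsi.com": "CNN",
         "contents:": "<html>..."}
per_group = split_by_group(cells)
print(per_group["meta"])     # read path for metadata-only scans
```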
Outline
■ Motivation
■ Overall Architecture & Building Blocks
■ Data Model
■ APIs
■ Implementation
■ Refinement
■ Evaluation
Performance – Scaling
■ Not linear! Why?
■ As the number of tablet servers is increased by a factor of 500:
  – Performance of random reads from memory increases by a factor of only 300
  – Performance of scans increases by a factor of 260
  – That is roughly 60% (300/500) and 52% (260/500) of linear scaling, respectively
Why not linear?
■ Load imbalance
  – Competition with other processes
    • Network
    • CPU
  – The rebalancing algorithm does not work perfectly
    • It throttles migration to reduce the number of tablet movements (a tablet is briefly unavailable while it moves)
    • The load shifts around as the benchmark progresses