HBase Introduction
Introduction
- Open source, non-relational, distributed database
- Modeled on Google's Bigtable
  - Bigtable is a data storage system built on the Google File System
- Part of the Apache Software Foundation
History
- 2006/11: Google publishes the Bigtable paper
- 2007/02: Initial HBase prototype is created as a tool for Hadoop
- 2007/10: First usable HBase is released
- 2008/01: Hadoop becomes a top-level Apache project, with HBase as a subproject
Features
- Column-oriented
- Data store rather than a database
- Key/value database
- NoSQL database
  - No SQL language?
  - CAP theorem?
- Distributed
Column-Oriented
- Row-oriented
  - 001: 10, Smith, Joe, 40000; 002: 12, Jones, Mary, 50000; 003: 11, Johnson, Cathy, 44000; 004: 22, Jones, Bob, 55000;
  - The primary key is the rowid, which maps to the indexed data
- Column-oriented
  - 10: 001, 12: 002, 11: 003, 22: 004; Smith: 001, Jones: 002, 004, Johnson: 003; Joe: 001, Mary: 002, Cathy: 003, Bob: 004; 40000: 001, 50000: 002, 44000: 003, 55000: 004;
  - The primary key is the data, which maps back to rowids
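
The two layouts above can be sketched in a few lines of Python. This is only an illustration of the idea, not how HBase stores data on disk. The column names (`emp_id`, `last_name`, `first_name`, `salary`) are assumed for the example and do not appear on the slide.

```python
from collections import defaultdict

# Row-oriented layout from the slide: rowid -> values of one record.
rows = {
    "001": (10, "Smith", "Joe", 40000),
    "002": (12, "Jones", "Mary", 50000),
    "003": (11, "Johnson", "Cathy", 44000),
    "004": (22, "Jones", "Bob", 55000),
}

# Hypothetical column names, chosen only for this sketch.
COLUMNS = ("emp_id", "last_name", "first_name", "salary")


def to_column_store(rows):
    """Build the column-oriented layout: column -> value -> list of rowids."""
    store = {name: defaultdict(list) for name in COLUMNS}
    for rowid, values in rows.items():
        for name, value in zip(COLUMNS, values):
            store[name][value].append(rowid)
    return store


columns = to_column_store(rows)

# An aggregate over one column reads only that column's data.
total_salary = sum(
    value * len(rowids) for value, rowids in columns["salary"].items()
)
```

Here `columns["last_name"]["Jones"]` maps back to rowids 002 and 004, matching the slide, and the salary total is computed without touching names or ids.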
Column-Oriented
- Advantages
  - Easy to update and rewrite the data of a column
  - Efficient for computations over many rows that touch only a small subset of the columns
  - Easy to parallelize the work
- Disadvantages
  - If rows are small, row-oriented storage is more efficient than column-oriented storage
  - Row-oriented storage is also more efficient when many columns of a single row are required at once
Key Value Database
- Used to handle huge volumes of data, as at Google and eBay
  - Scalability
- Cloud storage
  - Serves many different kinds of users and applications
  - Hard to separate the data and define a schema in advance
Key Value Database
- Advantages
  - High scalability
  - Easy to query
  - Many column families
- Disadvantages
  - No schema, so other applications cannot easily reuse the data
NoSQL
- NoSQL means "Not Only SQL"
  - Many NoSQL databases still provide an SQL-style query language
- Many of them are key/value databases
  - No schema restrictions
- Eventual consistency
  - Data is distributed across many machines, so replicas can be temporarily inconsistent
  - Example: Facebook
CAP Theorem
HBase Table
- Each cell is addressed by ('row key', 'family:label', 'timestamp')
- Example: ('ricky', 'score', 'T5') → ricky, T5, Eng: 30
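
A minimal in-memory sketch of this addressing scheme, assuming `Eng` is a label inside the `score` family and using plain integers for timestamps (5 for `T5`). The class `Table` and the second version (value 35 at timestamp 6) are hypothetical and exist only to show versioning. This is not the HBase client API.

```python
class Table:
    """Toy model: row key -> 'family:label' -> {timestamp: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, timestamp, value):
        # Each write adds a new version instead of overwriting the old one.
        cell = self.rows.setdefault(row_key, {}).setdefault(column, {})
        cell[timestamp] = value

    def get(self, row_key, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            # Without an explicit timestamp, return the newest version.
            timestamp = max(versions)
        return versions.get(timestamp)


table = Table()
table.put("ricky", "score:Eng", 5, 30)  # the cell from the slide (T5)
table.put("ricky", "score:Eng", 6, 35)  # hypothetical later version

at_t5 = table.get("ricky", "score:Eng", 5)
latest = table.get("ricky", "score:Eng")
```

Reading with an explicit timestamp returns that version of the cell. Omitting it returns the most recent one.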
HBase
ZooKeeper
- A high-performance coordination service for distributed applications
  - Naming, configuration management, synchronization, and group services
- Ensures there is only one active master in the cluster
- Stores the bootstrap location for region lookup (the location of the catalog table that lists all regions)
- Monitors the status of region servers and reports it to the master
- Stores the schema of HBase
  - Tables and the column families of each table
HMaster
- Assigns regions to region servers
- Balances the load across region servers
- Reassigns the regions of a failed region server to other region servers
- Garbage-collects obsolete files in the underlying file system (HDFS for HBase; GFS, the Google File System, for Bigtable)
- Handles schema change requests
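
The reassignment step can be illustrated with a small sketch. It assumes a simple "give each orphaned region to the least-loaded surviving server" policy, which is an illustration only and not HBase's actual balancer. The function name `reassign_regions` and the server and region names are hypothetical.

```python
def reassign_regions(assignments, failed_server):
    """Return new assignments with the failed server's regions moved away.

    assignments: dict of server name -> list of region names.
    Assumed policy: each orphaned region goes to the survivor that
    currently holds the fewest regions (ties broken by server name).
    """
    survivors = {
        server: list(regions)
        for server, regions in assignments.items()
        if server != failed_server
    }
    if not survivors:
        raise RuntimeError("no region server left to take over regions")
    for region in assignments.get(failed_server, []):
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(region)
    return survivors


assignments = {
    "rs1": ["r1", "r2"],
    "rs2": ["r3"],
    "rs3": ["r4", "r5", "r6"],
}
after_failure = reassign_regions(assignments, "rs3")
```

After `rs3` fails, its three regions are spread over `rs1` and `rs2` so that both end up with three regions each.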
Region Server
- Serves the regions assigned to it by the master
- Handles the I/O requests for its regions
- Manages the load of its regions and splits a region that has grown too large into smaller regions
HLog
- Stores all the edits made to the HStore
  - Each entry records the timestamp, sequence number, table name, and region name
- Implements write-ahead logging
  - All modifications are written to the log before they are applied
- One HLog per region server
- Identified by a unique long integer
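
Write-ahead logging can be sketched as follows. This is a simplified illustration under assumptions: entries are JSON lines in a local temporary file, and the names `WriteAheadLog` and `put` are hypothetical. The real HLog format and API differ.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log; every edit is persisted before it is applied."""

    def __init__(self, path):
        self.path = path
        self.sequence = 0

    def append(self, table, region, row, value, timestamp):
        self.sequence += 1
        entry = {
            "seq": self.sequence,
            "timestamp": timestamp,
            "table": table,
            "region": region,
            "row": row,
            "value": value,
        }
        with open(self.path, "a", encoding="utf-8") as log_file:
            log_file.write(json.dumps(entry) + "\n")
            log_file.flush()
            os.fsync(log_file.fileno())  # force the entry to disk
        return self.sequence

    def replay(self):
        with open(self.path, encoding="utf-8") as log_file:
            return [json.loads(line) for line in log_file]


def put(wal, store, table, region, row, value, timestamp):
    wal.append(table, region, row, value, timestamp)  # 1. log first
    store[row] = value                                # 2. then apply


log_path = os.path.join(tempfile.mkdtemp(), "hlog.jsonl")
wal = WriteAheadLog(log_path)
store = {}
put(wal, store, "t1", "region-a", "ricky", 30, 5)
put(wal, store, "t1", "region-a", "mary", 42, 6)

# After a crash the in-memory store is lost; replaying the log rebuilds it.
recovered = {entry["row"]: entry["value"] for entry in wal.replay()}
```

Because the log entry is durable before the in-memory update happens, replaying the log after a failure reproduces every acknowledged edit.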
Memstore
- An in-memory cache
  - Stores newly written data
- Improves performance
  - Recently written data is accessed more frequently than older data
- When certain thresholds are met, the Memstore data is flushed into an HFile
- Problem: flushing creates many HFiles, and read speed suffers
  - HBase periodically compacts multiple smaller HFiles into one big file
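
The flush and compaction cycle can be sketched with a toy store. It assumes a flush threshold counted in entries and models each HFile as an immutable sorted list held in memory. The class `Store` and its parameter `flush_threshold` are hypothetical and much simpler than HBase's real thresholds and file format.

```python
class Store:
    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memstore = {}   # newest writes, kept in memory
        self.hfiles = []     # flushed files, oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the Memstore out as one sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        # Every extra file is one more place to look: this is why
        # many small HFiles slow reads down.
        for hfile in reversed(self.hfiles):  # newest file wins
            for stored_key, value in hfile:
                if stored_key == key:
                    return value
        return None

    def compact(self):
        """Merge all files into one, keeping the newest value per key."""
        merged = {}
        for hfile in self.hfiles:  # oldest first, newer values overwrite
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())] if merged else []


store = Store(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.put(key, value)

files_before = len(store.hfiles)
store.compact()
files_after = len(store.hfiles)
```

In this run two files exist before compaction and one after, while reads still return the newest value for every key.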
HBase Process
- The client sends a request to the region server, which finds the target region
- The region server checks that the data is consistent with the schema
- Return the data to the client
- Update the HLog
- Update the Memstore
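
Finding the target region can be sketched as a lookup over sorted start keys, since each region covers a contiguous range of row keys. The class `RegionDirectory`, the split points `h` and `p`, and the server names are hypothetical. The real client resolves regions through HBase's catalog table rather than a local list.

```python
import bisect


class RegionDirectory:
    """Maps a row key to the region whose key range contains it."""

    def __init__(self, regions):
        # regions: list of (start_key, server); the first region must
        # start at "" so that every possible row key is covered.
        self.regions = sorted(regions)
        self.start_keys = [start for start, _ in self.regions]

    def locate(self, row_key):
        # The target is the last region whose start key is <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.regions[index]


directory = RegionDirectory([("", "rs1"), ("h", "rs2"), ("p", "rs3")])

region_for_ricky = directory.locate("ricky")
region_for_alice = directory.locate("alice")
region_for_h = directory.locate("h")
```

With these assumed split points, `ricky` lands in the region starting at `p`, `alice` in the first region, and a key equal to a start key belongs to the region that begins there.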
Questions?
Thanks for listening.