HBase Introduction

Introduction
• Open-source, non-relational, distributed database
• Modeled on Google's Bigtable
  • Bigtable is a data storage system built on the Google File System (GFS); HBase plays the same role on top of HDFS
• Part of the Apache Software Foundation

History
• 2006/11: Google publishes the Bigtable paper
• 2007/02: Initial HBase prototype created for use with Hadoop
• 2007/10: First usable HBase released
• 2008/01: Hadoop becomes an Apache project, with HBase as a subproject

Features
• Column-oriented
• A data store rather than a database
• Key/value database
• NoSQL database
• No SQL language?
• CAP theorem?
• Distributed

Column-Oriented
• Row-oriented
    001:10,Smith,Joe,40000;
    002:12,Jones,Mary,50000;
    003:11,Johnson,Cathy,44000;
    004:22,Jones,Bob,55000;
  • The primary key is the row ID, which maps to the indexed data
• Column-oriented
    10:001,12:002,11:003,22:004;
    Smith:001,Jones:002,004,Johnson:003;
    Joe:001,Mary:002,Cathy:003,Bob:004;
    40000:001,50000:002,44000:003,55000:004;
  • The primary key is the data itself, which maps back to row IDs
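The two layouts on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `row_oriented` and `column_oriented` are made up for this example, and the serialization simply mimics the strings shown on the slide rather than any real storage format.

```python
# The four employee records from the slide, keyed by row ID.
rows = {
    "001": (10, "Smith", "Joe", 40000),
    "002": (12, "Jones", "Mary", 50000),
    "003": (11, "Johnson", "Cathy", 44000),
    "004": (22, "Jones", "Bob", 55000),
}


def row_oriented(rows):
    """Serialize record by record: row ID first, then all of its values."""
    return "".join(
        f"{rowid}:{','.join(str(v) for v in values)};"
        for rowid, values in rows.items()
    )


def column_oriented(rows):
    """Serialize column by column: each value maps back to its row IDs."""
    n_cols = len(next(iter(rows.values())))
    lines = []
    for col in range(n_cols):
        index = {}  # value -> list of row IDs holding that value
        for rowid, values in rows.items():
            index.setdefault(values[col], []).append(rowid)
        lines.append(
            ",".join(f"{value}:{','.join(ids)}" for value, ids in index.items()) + ";"
        )
    return lines


print(row_oriented(rows))
for line in column_oriented(rows):
    print(line)
```

Note how the repeated last name "Jones" is stored once in the column layout and points at both rows 002 and 004.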

Column-Oriented
• Advantages
  • Updating or rewriting the data of a whole column is easy
  • Efficient for computations that span many rows but touch only a small subset of the columns
  • Easy to parallelize
• Disadvantages
  • When rows are small, row-oriented storage is more efficient than column-oriented storage
  • Less efficient when many columns of a single row are required at once
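A rough way to see this trade-off is to count how many stored values each layout has to touch. The sketch below is a toy model, not a benchmark: the column names (`id`, `last`, `first`, `salary`) and the "values touched" counter are assumptions introduced for illustration.

```python
# The same four employees, stored two ways.
row_store = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]
column_store = {
    "id": [10, 12, 11, 22],
    "last": ["Smith", "Jones", "Johnson", "Jones"],
    "first": ["Joe", "Mary", "Cathy", "Bob"],
    "salary": [40000, 50000, 44000, 55000],
}


def total_salary_from_rows(store):
    """Sum one field; every full record is read to reach it."""
    total, touched = 0, 0
    for record in store:
        touched += len(record)
        total += record[3]
    return total, touched


def total_salary_from_columns(store):
    """Sum one field; only that column is read."""
    column = store["salary"]
    return sum(column), len(column)


def fetch_row_from_columns(store, i):
    """Rebuild a single row; every column must be visited once."""
    return tuple(column[i] for column in store.values()), len(store)


print(total_salary_from_rows(row_store))        # (189000, 16)
print(total_salary_from_columns(column_store))  # (189000, 4)
print(fetch_row_from_columns(column_store, 1))  # ((12, 'Jones', 'Mary', 50000), 4)
```

The aggregate touches 4 values in the column layout versus 16 in the row layout, while rebuilding one full row needs a separate access per column.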

Key Value Database
• Used to handle very large data sets, as at Google and eBay
  • Scalability
• Cloud storage
  • Many different kinds of users and applications
  • Hard to separate the data and define a schema in advance

Key Value Database
• Advantages
  • High scalability
  • Easy to query
  • Many column families
• Disadvantage
  • There is no schema, so other applications cannot easily reuse the data

NoSQL
• NoSQL stands for "Not Only SQL"
  • Many NoSQL databases still provide an SQL language
• Many of them are key-value databases
• No schema restrictions
• Eventual consistency
  • Data is distributed across many machines, so copies can be temporarily inconsistent
  • Example: Facebook
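Eventual consistency can be pictured with two replicas that exchange updates only occasionally. The sketch below is a toy last-write-wins model; the `Replica` class, the `sync` function, and the integer version numbers are all assumptions made for this illustration and do not describe how any particular NoSQL product replicates data.

```python
class Replica:
    """One copy of the data; each key stores (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Last write wins: keep the entry with the highest version.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.data.items()):
            dst.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("status", "hello", version=1)

stale = r2.read("status")   # the update has not reached r2 yet
sync(r1, r2)
fresh = r2.read("status")   # after syncing, both replicas agree

print(stale, fresh)
```

Between the write and the sync, a reader hitting `r2` sees old data; once the replicas have exchanged updates, every reader sees the same value.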

CAP Theorem

HBase Table
• A value is addressed by ('row key', 'family:label', 'timestamp')
• Example: ('ricky', 'score', 'T5') → ricky, T5, Eng: 30
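The addressing scheme on this slide can be modeled as nested dictionaries. This is a minimal sketch, not the HBase client API: `ToyTable`, its `put` and `get` methods, and the use of plain integers for timestamps (5 standing in for T5) are assumptions for illustration.

```python
from collections import defaultdict


class ToyTable:
    """row key -> 'family:label' -> {timestamp: value}"""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, timestamp, value):
        # Each write adds a new version instead of overwriting the old one.
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions.get(timestamp)


table = ToyTable()
table.put("ricky", "score:Eng", 5, 30)
table.put("ricky", "score:Eng", 6, 35)

print(table.get("ricky", "score:Eng", 5))  # 30
print(table.get("ricky", "score:Eng"))     # 35
```

Keeping every timestamped version is what lets a reader ask either for the latest value or for the value as of a specific time.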

HBase

ZooKeeper
• A high-performance coordination service for distributed applications
  • Naming, configuration management, synchronization, and group services
• Ensures there is only one master in the group
• Stores the location of all regions
• Monitors the status of the region servers and reports it to the master
• Stores the HBase schema
  • Tables and the column families of each table

HMaster
• Allocates regions to region servers
• Balances the load across region servers
• Reallocates the regions of a failed region server to other region servers
• Garbage-collects obsolete files in the underlying file system (GFS in Bigtable, HDFS in HBase)
• Handles schema change requests
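The allocation and failover duties above can be sketched with a simple "least-loaded server" rule. This is a hypothetical toy policy: the functions `assign` and `handle_failure`, and the idea of measuring load by region count, are assumptions for illustration and not HBase's actual balancer.

```python
def least_loaded(assignment):
    """Pick the server that currently holds the fewest regions."""
    return min(assignment, key=lambda server: len(assignment[server]))


def assign(regions, servers):
    """Hand out regions one at a time to the least-loaded server."""
    assignment = {server: [] for server in servers}
    for region in regions:
        assignment[least_loaded(assignment)].append(region)
    return assignment


def handle_failure(assignment, failed_server):
    """Move the regions of a failed server to the surviving servers."""
    orphans = assignment.pop(failed_server)
    for region in orphans:
        assignment[least_loaded(assignment)].append(region)
    return assignment


assignment = assign(["r1", "r2", "r3", "r4", "r5", "r6"], ["rs1", "rs2", "rs3"])
print(assignment)

assignment = handle_failure(assignment, "rs2")
print(assignment)
```

After `rs2` fails, its two regions are spread over `rs1` and `rs3`, so no region is left unserved and the load stays even.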

Region Server
• Maintains the regions assigned to it by the master
• Handles I/O requests for its regions
• Balances the load of its regions by splitting regions that grow too large into smaller ones
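Region splitting can be sketched as cutting a sorted run of row keys in the middle once it passes a limit. This is a simplified illustration: `split_region` and the `max_rows` threshold are hypothetical, and counting rows is an assumption made to keep the example short (a real system would more likely look at size on disk).

```python
def split_region(row_keys, max_rows):
    """Split a sorted list of row keys in half if it exceeds max_rows."""
    if len(row_keys) <= max_rows:
        return [row_keys]  # small enough, keep as one region
    mid = len(row_keys) // 2
    # Each half covers a contiguous key range, so lookups stay simple.
    return [row_keys[:mid], row_keys[mid:]]


print(split_region(["a", "b"], max_rows=4))
print(split_region(["a", "b", "c", "d", "e"], max_rows=4))
```

Because the keys stay sorted, each resulting region still covers one contiguous key range and can be handed to a different region server.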

HLog
• Stores all the edits to the HStore
  • Timestamp, sequence number, table name, and region name
• Implements write-ahead logging
  • All modifications are written to a log before they are applied
• One HLog per region server
• Identified by a unique long integer
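Write-ahead logging can be sketched as "append to the log first, then change the in-memory state". The class below is a toy stand-in, not HBase code: `ToyRegionServer`, its `put` and `recover` methods, and the dictionary-based log entries are assumptions for illustration. The entry fields follow the ones listed on the slide.

```python
class ToyRegionServer:
    def __init__(self):
        self.log = []      # stands in for the HLog
        self.store = {}    # stands in for the in-memory state
        self.next_seq = 0

    def put(self, table, region, key, value, timestamp):
        entry = {
            "seq": self.next_seq,
            "timestamp": timestamp,
            "table": table,
            "region": region,
            "key": key,
            "value": value,
        }
        self.log.append(entry)                     # 1. write to the log first
        self.next_seq += 1
        self.store[(table, region, key)] = value   # 2. then apply the edit

    @classmethod
    def recover(cls, log):
        """Rebuild the in-memory state by replaying the log in order."""
        server = cls()
        for entry in sorted(log, key=lambda e: e["seq"]):
            server.store[(entry["table"], entry["region"], entry["key"])] = entry["value"]
        server.log = list(log)
        server.next_seq = max((e["seq"] for e in log), default=-1) + 1
        return server


server = ToyRegionServer()
server.put("users", "region-1", "ricky", 30, timestamp=5)
server.put("users", "region-1", "ricky", 35, timestamp=6)

# Simulate a crash: only the log survives, the in-memory state is rebuilt.
recovered = ToyRegionServer.recover(server.log)
print(recovered.store == server.store)
```

Because every edit reaches the log before it is applied, replaying the log after a crash reproduces the lost in-memory state.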

Memstore
• A cache
  • Stores newly written data
• Improves performance
  • Recently written data is accessed more often than older data
• When certain thresholds are met, the Memstore data is flushed into an HFile
• Problem: flushing creates many HFiles, and read speed suffers
  • HBase periodically compacts multiple smaller HFiles into one large HFile
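The flush and compaction cycle can be sketched with an in-memory dictionary and a list of immutable sorted files. This is a toy model: `ToyStore`, the entry-count `flush_threshold`, and the list-of-tuples "HFiles" are assumptions for illustration rather than HBase's real data structures or thresholds.

```python
class ToyStore:
    def __init__(self, flush_threshold=2):
        self.memstore = {}   # newest data, held in memory
        self.hfiles = []     # flushed files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the Memstore out as one sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        # The more files there are, the more places a read must check.
        for hfile in reversed(self.hfiles):  # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all files into one; newer values overwrite older ones."""
        merged = {}
        for hfile in self.hfiles:  # oldest first
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())] if merged else []


store = ToyStore(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.put(key, value)

print(len(store.hfiles), store.get("a"))  # 2 3
store.compact()
print(len(store.hfiles), store.get("a"))  # 1 3
```

Compaction does not change what a read returns; it only reduces the number of files a read may have to search.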

HBase Process
• The client sends a request to the region server
• Find the target region
• The region server checks that the data is consistent with the schema
• Return the data to the client
• Update the HLog
• Update the Memstore
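The "find the target region" step can be sketched as a lookup in a sorted list of region start keys. This is an illustrative sketch only: the region boundaries, the server names, and the `locate` function are hypothetical, and an empty start key is assumed to mean "beginning of the table".

```python
import bisect

# Each region covers the keys from its start key up to the next start key.
regions = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]
start_keys = [start for start, _ in regions]


def locate(row_key):
    """Return (start_key, server) of the region whose range holds row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return regions[index]


print(locate("alice"))  # ('', 'rs1')
print(locate("g"))      # ('g', 'rs2')
print(locate("ricky"))  # ('p', 'rs3')
```

Since the start keys are sorted, the lookup is a binary search, so it stays fast even when a table has many regions.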

Questions?

Thanks for listening.