Large-Scale Data Management HBase

HBase: Overview
• HBase is a distributed, column-oriented data store built on top of HDFS
• HBase is an Apache open source project whose goal is to provide storage for the Hadoop distributed computing environment
• Data is logically organized into tables, rows and columns

Difference
• Hive and HBase are two different Hadoop-based technologies:
  • Hive is an SQL-like engine that runs MapReduce jobs
  • HBase is a NoSQL key/value database on Hadoop
• Just as Google is used for search and Facebook for social networking, Hive is used for analytical queries and HBase for real-time querying

HBase: Part of Hadoop’s Ecosystem
• HBase is built on top of HDFS
• HBase files are internally stored in HDFS

HBase vs. HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes
• HDFS is good for batch processing (scans over big files)
  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates

HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place)
• HBase updates are done by creating new versions of values
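The "updates create new versions" point can be illustrated with a toy in-memory cell. This is only a sketch of the idea, not HBase's API: the class `VersionedCell` and its methods are hypothetical names, and the version numbers are sample values.

```python
class VersionedCell:
    """Toy model of a cell: an update adds a version, it never overwrites."""

    def __init__(self):
        self._versions = {}  # version number -> value (bytes)

    def put(self, value, version):
        # "Updating" only records one more (version, value) pair.
        self._versions[version] = value

    def get(self, max_versions=1):
        # Return the newest versions first, up to max_versions of them.
        newest_first = sorted(self._versions, reverse=True)
        return [(v, self._versions[v]) for v in newest_first[:max_versions]]


cell = VersionedCell()
cell.put(b"CNN", version=8)
cell.put(b"CNN.com", version=9)      # the "update": a second version

latest = cell.get()                  # newest version only
history = cell.get(max_versions=2)   # older value is still readable
```

A plain read returns only the newest version, while the older value stays available until it is cleaned up.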

HBase vs. HDFS (Cont’d)
• If the application has neither random reads nor random writes, stick to HDFS

HBase Data Model

HBase Data Model
• HBase is based on Google’s Bigtable model
  • Key-value pairs

HBase Logical View

HBase: Keys and Column Families
• Each row has a key
• Each record is divided into column families
• Each column family consists of one or more columns

• Key
  • Byte array
  • Serves as the primary key for the table
  • Indexed for fast lookup
• Column Family
  • Has a name (string)
  • Contains one or more related columns
  • Examples: column family named “anchor”, column family named “Contents”
• Column
  • Belongs to one column family
  • Included inside the row
  • Written as familyName:columnName
  • Example: column named “apache.com”

Version number for each row
• Version Number
  • Unique within each key
  • By default, the system’s timestamp
  • Data type is Long
• Value (Cell)
  • Byte array

Notes on Data Model
• HBase schema consists of several tables
• Each table consists of a set of column families
  • Columns are not part of the schema
• HBase has dynamic columns
  • Because column names are encoded inside the cells
  • Different cells can have different columns
• Example: the “Roles” column family has different columns in different cells

Notes on Data Model (Cont’d)
• The version number can be user-supplied
  • It does not even have to be inserted in increasing order
  • Version numbers are unique within each key
• Tables can be very sparse
  • Many cells are empty
• Keys are indexed as the primary key
• Example: a row that has two columns [cnnsi.com & my.look.ca]
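One way to picture these notes is as a nested map: row key, then familyName:columnName, then version, then value. The sketch below is a toy model under that assumption. The helper names `put`, `columns_of` and `latest` are hypothetical, and the value written with version 4 is made up to show an out-of-order insert.

```python
from collections import defaultdict

# row key -> "family:column" -> {version: value}
# Only cells that were written exist, so a sparse table costs nothing.
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, column, version, value):
    table[row_key][column][version] = value


def columns_of(row_key):
    # Columns are not declared anywhere; they are whatever this row holds.
    return sorted(table.get(row_key, {}))


def latest(row_key, column):
    versions = table.get(row_key, {}).get(column, {})
    return versions[max(versions)] if versions else None


put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com")
put("com.apache.www", "anchor:apache.com", 10, "APACHE")
# User-supplied version, inserted out of increasing order (made-up value).
put("com.apache.www", "anchor:apache.com", 4, "older text")
```

The two rows end up with different column sets inside the same column family, and the later insert with the smaller version number does not hide the newer value.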

HBase Physical Model

HBase Physical Model
• Each column family is stored in a separate file (called HTables)
• Key and version numbers are replicated with each column family
• Empty cells are not stored
• HBase maintains a multilevel index on values: <key, column family, column name, timestamp>
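The physical layout can be approximated by one sorted key list per column family. This is a simplified sketch, not the real on-disk format: `FamilyStore` is a hypothetical class, a flat sorted list stands in for the multilevel index, and the value at timestamp 7 is made up.

```python
import bisect


class FamilyStore:
    """Toy stand-in for the file that holds one column family."""

    def __init__(self, family):
        self.family = family
        self._keys = []    # sorted (row key, column name, -timestamp)
        self._values = {}  # same key tuple -> value

    def add(self, row, column, timestamp, value):
        # The row key and timestamp are repeated in every stored entry.
        key = (row, column, -timestamp)  # negated so the newest sorts first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def newest(self, row, column):
        # Binary search for the first entry of (row, column).
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            key = self._keys[i]
            return -key[2], self._values[key]
        return None


stores = {"anchor": FamilyStore("anchor"), "contents": FamilyStore("contents")}
stores["anchor"].add("com.cnn.www", "cnnsi.com", 9, "CNN")
stores["anchor"].add("com.cnn.www", "cnnsi.com", 7, "older anchor")  # made up
stores["anchor"].add("com.cnn.www", "my.look.ca", 8, "CNN.com")

hit = stores["anchor"].newest("com.cnn.www", "cnnsi.com")
miss = stores["contents"].newest("com.cnn.www", "html")
```

Nothing was written to the "contents" family, so its store holds no entries at all: empty cells take no space.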

Example

Column Families

HBase Regions
• Each HTable (column family) is partitioned horizontally into regions
  • Regions are the counterpart to HDFS blocks
• Each horizontal partition will be one region
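Horizontal partitioning means each region covers a contiguous range of row keys. A minimal sketch of finding the region for a key, assuming hypothetical split points and region names:

```python
import bisect

# Hypothetical split points: region i covers [region_starts[i], region_starts[i+1]).
# The empty string marks the start of the key space.
region_starts = ["", "com.cnn", "org"]
region_names = ["region-0", "region-1", "region-2"]


def region_for(row_key):
    # Find the last region whose start key is <= row_key.
    idx = bisect.bisect_right(region_starts, row_key) - 1
    return region_names[idx]


apache_region = region_for("com.apache.www")
cnn_region = region_for("com.cnn.www")
```

Because row keys are kept in sorted order, a binary search over the start keys is enough to route a request.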

HBase Architecture

Three Major Components
• The HBaseMaster
  • One master
• The HRegionServer
  • Many region servers
• The HBase client

HBase Architecture
• In HBase, tables are split into regions and are served by the region servers
• Regions are vertically divided by column families into “Stores”
• HBase has three major components: the client library, a master server, and region servers
• Region servers can be added or removed as required

HBase Architecture: MasterServer
• Assigns regions to the region servers, with the help of Apache ZooKeeper
• Handles load balancing of the regions across region servers
  • It unloads the busy servers and shifts regions to less occupied servers
• Maintains the state of the cluster by negotiating the load balancing
• Is responsible for schema changes and other metadata operations, such as the creation of tables and column families
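The load-balancing bullet can be sketched as repeatedly moving one region from the busiest server to the least occupied one. This is a deliberately naive illustration, not HBase's actual balancer: the function `rebalance` and the server and region names are hypothetical.

```python
def rebalance(assignment):
    """Move regions from the busiest server to the least occupied one
    until region counts differ by at most one."""
    servers = {name: list(regions) for name, regions in assignment.items()}
    while True:
        busiest = max(servers, key=lambda s: len(servers[s]))
        idlest = min(servers, key=lambda s: len(servers[s]))
        if len(servers[busiest]) - len(servers[idlest]) <= 1:
            return servers
        # Unload one region from the busy server onto the idle one.
        servers[idlest].append(servers[busiest].pop())


before = {"rs1": ["r1", "r2", "r3", "r4"], "rs2": [], "rs3": ["r5"]}
after = rebalance(before)
```

The sketch balances on region count only. A real balancer would weigh other factors, which the slides do not describe.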

Regions
• Regions are nothing but tables that are split up and spread across the region servers
• The region servers host these regions and:
  • Communicate with the client and handle data-related operations
  • Handle read and write requests for all the regions under them
• A deeper look into a region server shows that it contains regions and stores

• The store contains a memstore and HFiles. The memstore works like a cache.
• Anything written to HBase is stored here initially
• Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed
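The write path described here (memstore first, HFiles later) can be sketched as follows. This is a toy model: the class `Store` and its entry-count threshold are assumptions made for illustration, and the slides do not say what triggers a flush.

```python
class Store:
    """Toy store: writes land in the memstore, then get flushed to 'HFiles'."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}  # in-memory buffer for recent writes
        self.hfiles = []    # each flush adds one immutable, sorted file
        self.flush_threshold = flush_threshold  # hypothetical trigger

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}  # the memstore is emptied after the flush

    def get(self, key):
        # Check the memstore first, then the files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = Store(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")  # second entry reaches the threshold and flushes
store.put("row1", "c")  # newer value stays in the memstore for now
```

Reads consult the memstore before the flushed files, so the most recent write wins even though an older value for the same key is already on "disk".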

ZooKeeper
• ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization
• ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures and network partitions
• Clients communicate with region servers via ZooKeeper
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper

Big Picture

Get()
Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key             Time stamp   Column “anchor:”
“com.apache.www”    t12
                    t11
                    t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”       t9           “anchor:cnnsi.com” = “CNN”
                    t8           “anchor:my.look.ca” = “CNN.com”
                    t6
                    t5
                    t3
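A Get() is a lookup by row key, optionally narrowed to one column. The sketch below runs the query above against a toy copy of the sample table. The function `get` is a hypothetical stand-in, not the HBase client call, and timestamps t8 to t10 are written as plain integers.

```python
# row key -> "family:column" -> {timestamp: value}
table = {
    "com.apache.www": {"anchor:apache.com": {10: "APACHE"}},
    "com.cnn.www": {
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:my.look.ca": {8: "CNN.com"},
    },
}


def get(table, row_key, column):
    """Return the newest value of one column in one row, or None."""
    versions = table.get(row_key, {}).get(column)
    if not versions:
        return None
    return versions[max(versions)]


value = get(table, "com.apache.www", "anchor:apache.com")
```

Only the one row named by the key is touched, which is what makes this kind of lookup fast.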

Scan()
Select value from table where anchor=‘cnnsi.com’

Row key             Time stamp   Column “anchor:”
“com.apache.www”    t12
                    t11
                    t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”       t9           “anchor:cnnsi.com” = “CNN”
                    t8           “anchor:my.look.ca” = “CNN.com”
                    t6
                    t5
                    t3
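When the row key is not known, a Scan() walks the rows in key order and keeps those that hold the requested column. A toy sketch against the same sample data, with `scan` as a hypothetical helper and timestamps written as plain integers:

```python
# row key -> "family:column" -> {timestamp: value}
table = {
    "com.apache.www": {"anchor:apache.com": {10: "APACHE"}},
    "com.cnn.www": {
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:my.look.ca": {8: "CNN.com"},
    },
}


def scan(table, column):
    """Yield (row key, newest value) for every row that has the column."""
    for row_key in sorted(table):  # rows are visited in key order
        versions = table[row_key].get(column)
        if versions:
            yield row_key, versions[max(versions)]


matches = list(scan(table, "anchor:cnnsi.com"))
```

Unlike the Get() case, every row is visited, so the cost grows with the size of the scanned range.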

Operations On Regions: Delete()
• Marking table cells as deleted
• Multiple levels
  • Can mark an entire column family as deleted
  • Can mark all column families of a given row as deleted
• All operations are logged by the RegionServers
• The log is flushed periodically
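"Marking as deleted" can be pictured as adding a marker that hides cells rather than erasing them. The sketch below is a toy model under that assumption: the class `Row`, its marker tuples, the "contents" cell and the timestamps are all hypothetical.

```python
class Row:
    """Toy row: deletes add markers; the stored cells are left untouched."""

    def __init__(self):
        self.cells = []    # (family, column, timestamp, value)
        self.markers = []  # (family or None, column or None, timestamp)

    def put(self, family, column, timestamp, value):
        self.cells.append((family, column, timestamp, value))

    def delete(self, timestamp, family=None, column=None):
        # family=None marks every column family of the row as deleted.
        self.markers.append((family, column, timestamp))

    def _is_deleted(self, family, column, timestamp):
        for m_family, m_column, m_ts in self.markers:
            if m_family not in (None, family):
                continue
            if m_column not in (None, column):
                continue
            if timestamp <= m_ts:
                return True
        return False

    def visible(self):
        return [c for c in self.cells if not self._is_deleted(c[0], c[1], c[2])]


row = Row()
row.put("anchor", "cnnsi.com", 9, "CNN")
row.put("anchor", "my.look.ca", 8, "CNN.com")
row.put("contents", "html", 6, "<html>...")  # made-up cell

row.delete(timestamp=10, family="anchor")    # whole column family
after_family_delete = row.visible()

row.delete(timestamp=10)                     # all families of the row
after_row_delete = row.visible()
```

Both delete levels from the slide reduce to the same mechanism, and the original cells are still physically present afterwards.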

HBase: Joins
• HBase does not support joins
• Can be done in the application layer
  • Using scan() and get() operations
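An application-layer join can be sketched as a scan over one table with a get into the other for each row. The tables, column names and the helper `join` below are all hypothetical, and `scan` and `get` are plain-Python stand-ins for the HBase operations.

```python
def scan(table):
    """Stand-in for scan(): yield every row in key order."""
    for row_key in sorted(table):
        yield row_key, table[row_key]


def get(table, row_key):
    """Stand-in for get(): fetch one row by key, or None."""
    return table.get(row_key)


def join(left, right, fk_column):
    """Scan `left`; for each row, get the `right` row named by fk_column."""
    for left_key, left_row in scan(left):
        right_row = get(right, left_row.get(fk_column))
        if right_row is not None:
            yield left_key, {**left_row, **right_row}


# Made-up sample data.
pages = {
    "com.cnn.www": {"info:owner": "cnn"},
    "com.apache.www": {"info:owner": "asf"},
    "org.example": {"info:owner": "nobody"},
}
owners = {
    "asf": {"owner:name": "Apache Software Foundation"},
    "cnn": {"owner:name": "CNN"},
}

joined = list(join(pages, owners, "info:owner"))
```

This costs one get per scanned row, which is part of why joins are left to the application to plan.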

Logging Operations

HBase Deployment
(Figure: master node and slave nodes)

HBase vs. RDBMS

When to use HBase