HBase and Bigtable Storage
Xiaoming Gao, Judy Qiu, Hui Li

Outline
• HBase and Bigtable Storage
• HBase Use Cases
• Hands-on: Load a CSV file into an HBase table with MapReduce
• Demo: Search Engine System with MapReduce Technologies (Hadoop/HDFS/HBase/Pig)

HBase Introduction
• HBase is an open source, distributed, sorted map modeled after Google’s Bigtable
• HBase is built on Hadoop:
  – Fault tolerance
  – Scalability
  – Batch processing with MapReduce
• HBase uses HDFS for storage

HBase Cluster Architecture
• Tables are split into regions, which are served by region servers
• Regions are vertically divided by column families into “stores”
• Stores are saved as files on HDFS
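To make the region idea concrete, here is a minimal Python sketch of how a row key could be mapped to the region that serves it. It assumes each region covers a contiguous, sorted range of row keys; the start keys and server names are made up for illustration and are not taken from any real cluster.

```python
import bisect

# Hypothetical region layout: each region starts at a row key and runs up to
# (but not including) the next region's start key. "" marks the table start.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # made-up region server names


def locate_region(row_key):
    """Return (region index, server name) for the region holding row_key."""
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]


for key in ("apple", "g", "zebra"):
    print(key, locate_region(key))
```

Because the start keys are kept sorted, a binary search is enough to find the owning region, which is one reason a sorted-map layout distributes well across servers.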

Data Model: A Big Sorted Map
• Not a relational database; no SQL
• Tables consist of rows, each of which has a primary key (row key)
• Each row has any number of columns:
  SortedMap<RowKey, List(SortedMap(Column, List(Value, Timestamp)))>
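The nested-map shape above can be sketched in a few lines of Python. This is only a toy in-memory model, not how HBase stores data: the class name `SortedMapTable` and its methods are hypothetical, and column names such as `f1:name` are assumed for illustration.

```python
from collections import defaultdict


class SortedMapTable:
    """Toy model: row key -> column -> [(timestamp, value), ...]."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        versions = self._rows[row_key][column]
        versions.append((timestamp, value))
        # Keep the newest version first.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self):
        """Yield (row key, column, newest value) in sorted row/column order."""
        for row_key in sorted(self._rows):
            for column in sorted(self._rows[row_key]):
                yield row_key, column, self._rows[row_key][column][0][1]


table = SortedMapTable()
table.put("row2", "f1:name", "Mike", timestamp=1)
table.put("row1", "f1:name", "Tom", timestamp=1)
table.put("row1", "f1:name", "Tome", timestamp=2)
print(table.get("row1", "f1:name"))
print(list(table.scan()))
```

Each cell keeps a list of timestamped values, so an update adds a new version rather than overwriting the old one, and a scan always walks rows in key order.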

HBase vs. RDBMS

                      | RDBMS                                    | HBase
Data layout           | Row-oriented                             | Column-family-oriented
Indexes               | On row and columns                       | On row
Hardware requirement  | Large arrays of fast and expensive disks | Designed for commodity hardware
Max data size         | TBs                                      | ~1 PB
Read/write throughput | 1000s of queries/second                  | Millions of queries/second
Query language        | SQL (Join, Group)                        | Get/Put/
Ease of use           | Relational data modeling, easy to learn  | A sorted map, significant learning curve; communities and tools are growing

When to Use HBase
• Dataset scale
  – Indexing a huge number of web pages on the internet, or genome data
  – Data mining of large social media data sets
• Read/write scale
  – Reads and writes are distributed because tables are distributed across nodes
  – Writes are extremely fast and require no index updates
• Batch analysis
  – Massive and convoluted SQL queries can be executed in parallel via MapReduce jobs

Use Cases
• Facebook Analytics
  – Real-time counters of URLs shared, preferred links
• Twitter
  – 25 TB of messages every month
• Mozilla
  – Stores crash reports, 2.5 million per day

Programming with HBase
1. HBase shell
   – scan, list, create
2. Native Java API
   – Get(byte[] row, byte[] column, long ts, int version)
3. Non-Java clients
   – Thrift server (Ruby, C++, PHP)
   – REST server
4. HBase MapReduce API
   – hbase.mapreduce.TableMapper
   – hbase.mapreduce.TableReducer
5. High-level interface
   – Pig, Hive

Hands-on HBase MapReduce Programming
• HBase MapReduce API

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

Hands-on: load CSV file into HBase table with MapReduce
• CSV stands for comma-separated values
• CSV is a common file format in many scientific fields, such as flow cytometry in bioinformatics

Hands-on: load CSV file into HBase table with MapReduce
• Main entry point of the program

public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Strip generic Hadoop options; what remains are the program's own arguments
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Wrong number of arguments: " + otherArgs.length);
        System.err.println("Usage: <csv file> <hbase table name>");
        System.exit(-1);
    }
    Job job = configureJob(conf, otherArgs);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Hands-on: load CSV file into HBase table with MapReduce
• Configure the HBase MapReduce job

public static Job configureJob(Configuration conf, String[] args) throws IOException {
    Path inputPath = new Path(args[0]);
    String tableName = args[1];
    Job job = new Job(conf, tableName);
    job.setJarByClass(CSV2HBase.class);
    FileInputFormat.setInputPaths(job, inputPath);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(CSV2HBase.class);
    // Configures the job so that emitted Puts are written to the HBase table
    TableMapReduceUtil.initTableReducerJob(tableName, null, job);
    // Map-only job: no reduce phase is needed
    job.setNumReduceTasks(0);
    return job;
}

Hands-on: load CSV file into HBase table with MapReduce
• The map function

// count and checkpoint are fields of the mapper class (declarations not shown on the slide)
public void map(LongWritable key, Text line, Context context) throws IOException {
    // Input is a CSV file. Each map() call receives a single line;
    // with TextInputFormat the key is the line's byte offset in the file.
    // Each line is comma-delimited: row, family, qualifier, value
    String[] values = line.toString().split(",");
    if (values.length != 4) {
        return;  // skip malformed lines
    }
    byte[] row = Bytes.toBytes(values[0]);
    byte[] family = Bytes.toBytes(values[1]);
    byte[] qualifier = Bytes.toBytes(values[2]);
    byte[] value = Bytes.toBytes(values[3]);
    Put put = new Put(row);
    put.add(family, qualifier, value);
    try {
        context.write(new ImmutableBytesWritable(row), put);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    // Report progress every `checkpoint` records
    if (++count % checkpoint == 0) {
        context.setStatus("Emitting Put " + count);
    }
}
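The parsing logic of the map function can be tried without a Hadoop cluster. The Python sketch below mirrors only the split-and-validate step; the function name `csv_line_to_put` is hypothetical, and the returned tuple merely stands in for an HBase `Put`.

```python
def csv_line_to_put(line):
    """Split 'row,family,qualifier,value' into four byte strings.

    Returns None for malformed lines, mirroring the early return in map().
    """
    values = line.split(",")
    if len(values) != 4:
        return None
    # Bytes.toBytes(String) encodes as UTF-8; do the same here.
    row, family, qualifier, value = (v.encode("utf-8") for v in values)
    return row, family, qualifier, value


print(csv_line_to_put("r1,f1,q1,v1"))
print(csv_line_to_put("not,enough"))
```

Lines with the wrong number of fields are dropped silently, so a quick pass like this over a sample of the input is a cheap way to see how many records the real job would skip.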

Hands-on: steps to load CSV file into HBase table with MapReduce
1. Check the HBase installation in the Ubuntu Sandbox
   1. http://salsahpc.indiana.edu/ScienceCloud/virtualbox_appliance_guide.html
   2. echo $HBASE_HOME
2. Start the Hadoop and HBase clusters
   1. start-all.sh
   2. start-hbase.sh
3. Create an HBase table with the specified data schema
   1. hbase shell
   2. create "csv2hbase", "f1"
4. Compile the program with Ant
   1. cd "hbasetutorial"
   2. ant
5. Upload input.csv into HDFS
   1. hadoop dfs -mkdir input
   2. hadoop dfs -copyFromLocal input.csv input/input.csv
6. Run the program:
   /bin/hadoop jar dist/lib/cglHBaseSummerSchool.jar iu.pti.hbaseapp.CSV2HBase input/input.csv "csv2hbase"
7. Check the inserted records in the HBase table
   1. hbase shell
   2. scan "csv2hbase"

Extension: set HBase table as input
Use TableInputFormat and TableMapReduceUtil to use an HTable as input to a map/reduce job.

public static Job configureJob(Configuration conf, String[] args) throws IOException {
    // Assumed: the first two arguments are the table name and the column family
    // (their declarations are not shown on the slide)
    String tableName = args[0];
    String columnFamily = args[1];
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(new Scan()));
    conf.set(TableInputFormat.INPUT_TABLE, tableName);
    conf.set("index.tablename", tableName);
    conf.set("index.familyname", columnFamily);
    // The remaining arguments are the fields to index
    String[] fields = new String[args.length - 2];
    for (int i = 0; i < fields.length; i++) {
        fields[i] = args[i + 2];
    }
    conf.setStrings("index.fields", fields);
    // Note: this overrides the family name set from columnFamily above
    conf.set("index.familyname", "attributes");
    Job job = new Job(conf, tableName);
    job.setJarByClass(IndexBuilder.class);
    job.setMapperClass(Map.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(TableInputFormat.class);
    job.setOutputFormatClass(MultiTableOutputFormat.class);
    return job;
}

Extension: write output to HBase table

public static class Map extends
        Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Writable> {
    // family and indexes are initialized elsewhere in the class (not shown on the slide)
    private byte[] family;
    private HashMap<byte[], ImmutableBytesWritable> indexes;

    protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
            throws IOException, InterruptedException {
        for (java.util.Map.Entry<byte[], ImmutableBytesWritable> index : indexes.entrySet()) {
            byte[] qualifier = index.getKey();
            ImmutableBytesWritable tableName = index.getValue();
            byte[] value = result.getValue(family, qualifier);
            if (value != null) {
                // The original cell value becomes the row key of the index table;
                // the original row key is stored as the cell value.
                Put put = new Put(value);
                put.add(INDEX_COLUMN, INDEX_QUALIFIER, rowKey.get());
                context.write(tableName, put);
            }
        }
    }
}
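The mapper above builds a secondary index by swapping roles: a cell value becomes the row key of an index table, and the original row key becomes the stored value. A minimal Python sketch of that idea follows, with plain dictionaries standing in for HBase tables; the function name `build_index` and the sample rows are hypothetical.

```python
def build_index(rows, fields):
    """Build one index 'table' per field.

    rows:   {row_key: {qualifier: value}}
    fields: qualifiers to index
    Returns {field: {value: row_key}}.
    """
    index_tables = {field: {} for field in fields}
    for row_key, columns in rows.items():
        for field in fields:
            value = columns.get(field)
            if value is not None:
                # The cell value is the index row key; the original row key is
                # the stored value. A later row with the same value replaces
                # the earlier entry in this simplified single-version model.
                index_tables[field][value] = row_key
    return index_tables


rows = {
    "r1": {"name": "Tom", "age": "21"},
    "r2": {"name": "Lucy"},
}
index = build_index(rows, ["name", "age"])
print(index["name"].get("Lucy"))
```

With the index in place, a lookup by attribute value is a single keyed read on the index table instead of a full scan of the source table.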

Big Data Challenge
• Peta: 10^15
• Tera: 10^12
• Giga: 10^9
• Mega: 10^6

Search Engine System with MapReduce Technologies
1. Search Engine System for Summer School (SESSS)
2. Gives an example of how to use MapReduce technologies to solve big data challenges
3. Uses Hadoop/HDFS/HBase/Pig
4. Indexed 656K web pages (540 MB in size) selected from the ClueWeb09 data set
5. Calculated ranking values for 2 million web sites

Architecture for SESSS
[Figure: architecture diagram. Labeled components: Web UI; Apache Server; PHP script; Thrift client on Salsa Portal; Thrift server; HBase Tables (1. inverted index table, 2. page rank table); Inverted Indexing System (Apache Lucene); Ranking System (Pig script); Hive/Pig script; Hadoop Cluster on FutureGrid]

Demo: Search Engine System for Summer School
• build-index-demo.exe (build index with HBase)
• pagerank-demo.exe (compute page rank with Pig)
• http://salsahpc.indiana.edu/sesss/index.php

High Level Language: Pig Latin
Hui Li, Judy Qiu
Some material adapted from slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012

What is Pig
• A framework for analyzing large unstructured and semi-structured data sets on top of Hadoop
  – The Pig engine parses and compiles Pig Latin scripts into MapReduce jobs that run on Hadoop
  – Pig Latin is a simple but powerful data flow language, similar to scripting languages

Motivation for Using Pig
• Faster development
  – Fewer lines of code (writing MapReduce jobs becomes like writing SQL queries)
  – Code reuse (Pig library, Piggy Bank)
• One test: find the 5 most frequent words
  – 10 lines of Pig Latin vs. 200 lines of Java
  – 15 minutes in Pig Latin vs. 4 hours in Java
[Figure: bar chart comparing Pig Latin and Java (axis: minutes)]

Word Count using MapReduce

Pig Performance vs. MapReduce
• PigMix: Pig vs. MapReduce

Word Count using Pig

Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words) AS cnt;
Results = ORDER Counts BY cnt DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
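For readers who want to follow the data flow step by step, the same pipeline can be sketched in plain Python. This is only an illustration of the logic, not how Pig executes it: the function name `top_words` is hypothetical, and splitting on whitespace is a simplification of what TOKENIZE does.

```python
from collections import Counter


def top_words(lines, n=5):
    """Return the n most frequent words as (word, count) pairs."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one record per word
    words = [word for line in lines for word in line.split()]
    # GROUP ... BY word, then COUNT per group
    counts = Counter(words)
    # ORDER ... DESC, then LIMIT n
    return counts.most_common(n)


sample = ["error disk full", "warn disk slow", "error disk full"]
print(top_words(sample, n=2))
```

Each Pig Latin statement corresponds to one small transformation here, which is why the script stays so short compared with a hand-written MapReduce job.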

Who Uses Pig for What
• 70% of production jobs at Yahoo (tens of thousands per day)
• Twitter, LinkedIn, eBay, AOL, …
• Used to
  – Process web logs
  – Build user behavior models
  – Process images
  – Build maps of the web
  – Do research on raw data sets

Pig Tutorial
• Accessing Pig
• Basic Pig knowledge (Word Count)
  – Pig data types
  – Pig operations
  – How to run Pig scripts
• Advanced Pig features (K-means clustering)
  – Embedding Pig within Python
  – User Defined Functions

Accessing Pig
• Access approaches:
  – Batch mode: submit a script directly
  – Interactive mode: Grunt, the Pig shell
  – PigServer Java class, a JDBC-like interface
• Execution modes:
  – Local mode: pig -x local
  – MapReduce mode: pig -x mapreduce

Pig Data Types
• Concepts: fields, tuples, bags, relations
  – A field is a piece of data
  – A tuple is an ordered set of fields
  – A bag is a collection of tuples
  – A relation is a bag
• Simple types
  – int, long, float, double, boolean, null, chararray, bytearray
• Complex types
  – Tuple (like a row in a database)
    (0002576169, Tome, 21, "Male")
  – Data bag (like a table or view in a database)
    {(0002576169, Tome, 21, "Male"), (0002576170, Mike, 20, "Male"), (0002576171, Lucy, 20, "Female"), …}

Pig Operations
• Loading data
  – LOAD loads input data
  – Lines = LOAD 'input/access.log' AS (line: chararray);
• Projection
  – FOREACH … GENERATE … (similar to SELECT)
  – Takes a set of expressions and applies them to every record
• Grouping
  – GROUP collects together records with the same key
• Dump/Store
  – DUMP displays results on screen; STORE saves results to the file system
• Aggregation
  – AVG, COUNT, COUNT_STAR, MAX, MIN, SUM

How to Run Pig Latin Scripts
• Local mode
  – The local host and local file system are used
  – Neither Hadoop nor HDFS is required
  – Useful for prototyping and debugging
• MapReduce mode
  – Runs on a Hadoop cluster and HDFS
• Batch mode: run a script directly
  – pig -x local my_pig_script.pig
  – pig -x mapreduce my_pig_script.pig
• Interactive mode: use the Pig shell to run a script
  – grunt> Lines = LOAD '/input.txt' AS (line: chararray);
  – grunt> Unique = DISTINCT Lines;
  – grunt> DUMP Unique;

Hands-on: Word Count using Pig Latin
1. cd pigtutorial/pig-hands-on/
2. tar -xf pig-wordcount.tar
3. cd pig-wordcount
4. pig -x local
5. grunt> Lines = LOAD 'input.txt' AS (line: chararray);
6. grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
7. grunt> Groups = GROUP Words BY word;
8. grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
9. grunt> DUMP counts;

Sample: K-means using Pig Latin
K-means is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
• Assignment step: assign each observation to the cluster with the closest mean
• Update step: calculate the new means as the centroids of the observations in each cluster
Reference: http://en.wikipedia.org/wiki/K-means_clustering
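The two steps can be sketched for one-dimensional data in plain Python. This is a minimal illustration under simple assumptions (scalar observations, a fixed number of initial centroids, and a stop test on the average centroid movement); the function names are hypothetical and it is independent of the Pig implementation used in the tutorial.

```python
def assign(points, centroids):
    """Assignment step: group each point under its nearest centroid's index."""
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


def update(clusters, centroids):
    """Update step: move each centroid to the mean of its cluster.

    An empty cluster keeps its previous centroid.
    """
    return [sum(members) / len(members) if members else centroids[i]
            for i, members in sorted(clusters.items())]


def kmeans(points, centroids, max_iterations=100, tolerance=1e-6):
    """Repeat both steps until the centroids move less than tolerance on average."""
    for _ in range(max_iterations):
        new_centroids = update(assign(points, centroids), centroids)
        moved = sum(abs(a - b) for a, b in zip(centroids, new_centroids)) / len(centroids)
        centroids = new_centroids
        if moved < tolerance:
            break
    return centroids


print(kmeans([1.0, 2.0, 3.0, 9.0, 10.0, 11.0], [0.0, 12.0]))
```

The stop test averages how far the centroids moved between iterations, so the loop ends as soon as the clusters stabilize rather than always running the maximum number of iterations.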

K-means Using Pig Latin

PC = Pig.compile("""register udf.jar
    DEFINE find_centroid FindCentroid('$centroids');
    raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
    centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
    grouped = group centroided by centroid;
    result = foreach grouped generate group, AVG(centroided.gpa);
    store result into 'output';
""")

K-means Using Pig Latin

while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0.0
    # get the new centroids of this iteration and calculate how far they moved since the last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ……

Embedding Pig Statements in Python Scripts
• Pig Latin does not support control flow statements: if/else, while loops, for loops, etc.
• The Pig embedding API can leverage all language features provided by Python, including control flow:
  – Loops and exit criteria
  – Similar to a database embedding API
  – Easier parameter passing
• JavaScript is available as well
• The framework is extensible: any JVM implementation of a language could be integrated

User Defined Function
• What is a UDF
  – A way to perform an operation on a field or fields
  – Called from within a Pig script
  – Currently all done in Java
• Why use a UDF
  – You need to do more than grouping or filtering
  – Actually, filtering is a UDF
  – You may be more comfortable in Java than in SQL/Pig Latin
• Example (excerpt):
  P = Pig.compile("""register udf.jar DEFINE find_centroid FindCentroid('$centroids'); …
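The slides do not show the Java source of FindCentroid, so the following Python sketch is only an assumption about what such a UDF does: it parses a colon-separated list of centroids (the format of the string bound to `$centroids` in the run log later in this deck) and returns the centroid nearest to a given value. The name `make_find_centroid` is hypothetical.

```python
def make_find_centroid(centroids_param):
    """Build a nearest-centroid function from a string such as '0.0:1.0:2.0:3.0'."""
    centroids = [float(c) for c in centroids_param.split(":")]

    def find_centroid(value):
        # Return the centroid with the smallest absolute distance to value.
        return min(centroids, key=lambda c: abs(value - c))

    return find_centroid


find_centroid = make_find_centroid("0.0:1.0:2.0:3.0")
for gpa in (0.2, 2.4, 3.9):
    print(gpa, find_centroid(gpa))
```

Passing the centroids as a constructor-style string parameter is what lets the surrounding Python loop rebind new centroids into the same compiled Pig script on every iteration.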

Hands-on: Run Pig Latin K-means
1. export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
2. hadoop dfs -copyFromLocal input.txt ./input.txt
3. pig -x mapreduce kmeans.py
4. pig -x local kmeans.py

Hands-on: Run Pig Latin K-means

2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';

Input(s): Successfully read 10000 records (219190 bytes) from: "hdfs://iw-ubuntu/user/developer/student.txt"
Output(s): Successfully stored 4 records (134 bytes) in: "hdfs://iw-ubuntu/user/developer/output"
last centroids: [0.371927835052, 1.22406743491, 2.24162171881, 3.40173705722]

References
1. http://pig.apache.org (Pig official site)
2. http://en.wikipedia.org/wiki/K-means_clustering
3. Docs: http://pig.apache.org/docs/r0.9.0
4. Papers: http://wiki.apache.org/pig/PigTalksPapers
5. http://en.wikipedia.org/wiki/Pig_Latin
6. Slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012

Questions?

Acknowledgement
