Analyzing Twitter Data with Hadoop Gwen Shapira Software

I have 15 years of experience in moving data around © 2014 Cloudera, Inc.

BUILDING A HADOOP APPLICATION © 2012 Cloudera, Inc.

Hive Level Architecture (diagram labels: Data Source, Flume, HDFS, Hive + Oozie)

AN EXAMPLE USE CASE

HOW DO WE ANSWER THESE QUESTIONS?

FLUME

FLUME DEMO

HIVE

What is Hive?
• Created at Facebook
• HiveQL
• SQL like interface

HIVE DEMO

IT’S A TRAP

DATA FORMATS

IMPALA ASIDE

PUTTING IT ALL TOGETHER

Hive Level Architecture (diagram labels: Data Source, Flume, HDFS, Hive + Oozie)

IOUG SIG Meetings at OpenWorld
All meetings located in Moscone South - Room 208

Monday, September 29
• Exadata SIG: 2:00 p.m. - 3:00 p.m.
• BIWA SIG: 5:00 p.m. - 6:00 p.m.

Tuesday, September 30
• Internet of Things SIG: 11:00 a.m. - 12:00 p.m.
• Storage SIG: 4:00 p.m. - 5:00 p.m.
• SPARC/Solaris SIG: 5:00 p.m. - 6:00 p.m.

Wednesday, October 1
• Oracle Enterprise Manager SIG: 8:00 a.m. - 9:00 a.m.
• Big Data SIG: 10:30 a.m. - 11:30 a.m.
• Oracle 12c SIG: 2:00 p.m. - 3:00 p.m.
• Oracle Spatial and Graph SIG: 4:00 p.m. (*OTN lounge)

COLLABORATE 15 – IOUG Forum
April 12-16, 2015
Mandalay Bay Resort and Casino, Las Vegas, NV

The IOUG Forum Advantage
• Save more than $1,000 on education offerings like pre-conference workshops
• Access the brand-new, specialized IOUG Strategic Leadership Program
• Priority access to the hands-on labs with Oracle ACE support
• Advance access to supplemental session material and presentations
• Special IOUG activities with no "ante in" needed - evening networking opportunities and more

COLLABORATE 15 Call for Speakers ends October 10
www.collaborate.ioug.org
Follow us on Twitter at @IOUG or via the conference hashtag #C15LV!

In my spare time…
• Oracle ACE Director
• Member of Oak Table
• Blogger
• Presenter – Hotsos, IOUG, OOW, OSCON
• NoCOUG board
• Contributor to Apache Oozie, Sqoop, Kafka
• Author – Hadoop Application Architectures
© 2014 Cloudera, Inc. All rights reserved.

Analyzing Twitter
• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
• Which topics are trending?

Techniques
• Bring Data with Flume
• Complex data
  • Deeply nested
  • Variable schema
• Clean, Standardize, Partition, etc
• SQL
  • Filtering
  • Aggregation
  • Sorting

Flume Agent design

In our case…
• Twitter source
  • Pulls JSON format files from Twitter
• Memory Channel
• HDFS Sink – directory per hour
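The source, channel, and sink roles listed above can be illustrated with a toy model. This is only a sketch in plain Python, not Flume code: `MemoryChannel`, `hourly_dir`, and `drain` are hypothetical names, and events are assumed to be dictionaries with a `ts` timestamp and a `body` string.

```python
from collections import deque
from datetime import datetime


class MemoryChannel:
    """Toy stand-in for an in-memory buffer between a source and a sink."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # A bounded buffer: the source must back off when the channel is full.
        if len(self._events) >= self.capacity:
            raise OverflowError("channel full")
        self._events.append(event)

    def take(self):
        # Returns the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def hourly_dir(base, ts):
    """Directory for the hour containing ts (one directory per hour)."""
    return f"{base}/{ts:%Y/%m/%d/%H}"


def drain(channel, base):
    """Sink side: group every buffered event body under its hourly directory."""
    written = {}
    while (event := channel.take()) is not None:
        written.setdefault(hourly_dir(base, event["ts"]), []).append(event["body"])
    return written
```

Two events stamped 14:05 and 15:01 on the same day would end up under two different directories, which is what "directory per hour" implies for later partitioning.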

What is JSON?

    {
      "retweeted_status": {
        "contributors": null,
        "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
        "retweeted": false,
        "entities": {
          "hashtags": [
            { "text": "Crowdsourcing", "indices": [0, 14] },
            { "text": "bigdata", "indices": [129, 137] }
          ],
          "user_mentions": []
        }
      }
    }
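To show what "deeply nested" means in practice, here is a minimal Python sketch that walks the sample tweet and pulls out its hashtags using only the standard `json` module. The helper name `hashtags` is an assumption made for illustration; it is not part of the talk's code.

```python
import json

# The sample tweet from the slide, as a JSON string.
TWEET = """
{"retweeted_status": {"contributors": null,
  "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
  "retweeted": false,
  "entities": {"hashtags": [{"text": "Crowdsourcing", "indices": [0, 14]},
                            {"text": "bigdata", "indices": [129, 137]}],
               "user_mentions": []}}}
"""


def hashtags(tweet_json):
    """Return the hashtag texts of the retweeted status, or [] if absent."""
    tweet = json.loads(tweet_json)
    # Each level may be missing or null, so fall back to an empty dict or list.
    status = tweet.get("retweeted_status") or {}
    entities = status.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags") or []]
```

The defensive `or {}` fallbacks reflect the variable schema: a tweet that is not a retweet has no `retweeted_status` to descend into.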

But Wait! There’s More!
• Many sources – directory, files, log4j, net, JMS
• Interceptors – process data in flight
• Selectors – choose which sink
• Many channels – Memory, file
• Many sinks – HDFS, HBase, Solr

High Level Pipeline Architecture (diagram)
• Components: Web App with Flume Avro Client (three instances), Flume Agents, HDFS, HBase, Spark Streaming, ML Map/Reduce Jobs, Report App
• Annotations: Fan-in pattern; Multiple agents for failover and rolling restarts; Client providing multithreading, compression, encryption, and batching; Spark Streaming data is a subset of whole events; Pull near-real-time results; Batch report updates; Query with HBase API or Impala

Configuration

    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey =
    TwitterAgent.sources.Twitter.consumerSecret =
    TwitterAgent.sources.Twitter.accessTokenSecret =
    TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume, sqoop, oracle, oow
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/
    TwitterAgent.sinks.HDFS.serializer = text
    TwitterAgent.channels.MemChannel.type = memory
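A common mistake in a configuration like this is wiring a source or sink to a channel name that was never declared. The following is a small sanity check sketched in Python; `parse_properties` and `check_wiring` are hypothetical helpers, not Flume tools, and `SAMPLE` reproduces only the wiring lines of the slide. The one Flume detail relied on is that sources use the plural key `channels` while sinks use the singular `channel`.

```python
SAMPLE = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.channels.MemChannel.type = memory
"""


def parse_properties(text):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems: sources or sinks bound to undeclared channels."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        # A source may fan out to several channels (space-separated).
        for name in props.get(f"{agent}.sources.{source}.channels", "").split():
            if name not in declared:
                problems.append(f"source {source}: unknown channel {name}")
    for sink in props.get(f"{agent}.sinks", "").split():
        # A sink reads from exactly one channel.
        name = props.get(f"{agent}.sinks.{sink}.channel", "")
        if name not in declared:
            problems.append(f"sink {sink}: unknown channel {name!r}")
    return problems
```

Calling `check_wiring(parse_properties(SAMPLE), "TwitterAgent")` on the wiring above should report no problems; renaming the sink's channel to something undeclared should produce one.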

Hive Details
• Metastore contains table definitions
  • Stored in a relational database
  • Basically a data dictionary
• SerDes parse data and convert it to table/column structure
• SerDe:
  • CSV, XML, JSON, Avro, Parquet, ORC files
  • Or write your own (we created one for CopyBook)
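The idea of a SerDe, parsing a raw record and mapping it onto table columns at read time, can be sketched in a few lines of Python. This is a conceptual illustration only, not the Hive SerDe interface (which is Java); the column list and the helpers `get_path` and `deserialize` are assumptions made for the example.

```python
import json

# Hypothetical table definition: column names as dotted paths into the JSON.
COLUMNS = ["id", "text", "user.screen_name"]


def get_path(record, dotted):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    value = record
    for part in dotted.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


def deserialize(line, columns=COLUMNS):
    """Turn one JSON line into a row tuple matching the column list."""
    record = json.loads(line)
    return tuple(get_path(record, column) for column in columns)
```

A record missing a field still yields a row, with `None` in that column, which mirrors how a schema applied at read time tolerates variable input.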

Complex Data

    SELECT t.retweet_screen_name,
           sum(retweets) AS total_retweets,
           count(*) AS tweet_count
    FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name,
                 retweeted_status.text,
                 max(retweeted_status.retweet_count) AS retweets
          FROM tweets
          GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
    GROUP BY t.retweet_screen_name
    ORDER BY total_retweets DESC
    LIMIT 10;
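The query's two-level aggregation can be easier to follow when spelled out procedurally. The Python sketch below mirrors its logic on an in-memory list of tweet dictionaries; `top_retweeted` is a hypothetical helper. One deliberate difference, stated as an assumption: tweets without a `retweeted_status` are skipped here, whereas the SQL would group them under NULL.

```python
from collections import defaultdict


def top_retweeted(tweets, limit=10):
    """Rank users by total retweets, counting each distinct tweet text once."""
    # Inner query: the same original tweet shows up many times with a growing
    # retweet_count, so keep only the maximum per (screen_name, text).
    per_tweet = {}
    for tweet in tweets:
        status = tweet.get("retweeted_status")
        if not status:
            continue
        key = (status["user"]["screen_name"], status["text"])
        per_tweet[key] = max(per_tweet.get(key, 0), status["retweet_count"])

    # Outer query: sum the per-tweet maxima and count the tweets per user.
    totals = defaultdict(lambda: [0, 0])
    for (name, _text), retweets in per_tweet.items():
        totals[name][0] += retweets
        totals[name][1] += 1

    # ORDER BY total_retweets DESC LIMIT n
    ranked = sorted(
        ((name, total, count) for name, (total, count) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:limit]
```

Taking the maximum before summing is the key step: without it, every captured copy of a retweet would add its running count to the total again.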

Not a Database

                         RDBMS                    Hive                                             Impala
    Language             Generally >= SQL-92      Subset of SQL-92 plus Hive specific extensions
    Update Capabilities  INSERT, UPDATE, DELETE   Bulk INSERT, UPDATE, DELETE                      Insert, truncate
    Transactions         Yes                      Yes                                              No
    Latency              Sub-second               Minutes                                          Sub-second
    Indexes              Yes                                                                       No
    Data size            Few Terabytes            Petabytes                                        Lots of Terabytes

I don’t like our data
• Lots of small files
• JSON – requires parsing
• Can’t compress
• Sensitive to changes

I’d rather use Avro
• Few large files containing records
• Schema in file
• Schema evolution
• Can compress
• Well supported in Hadoop
• Clients in other languages

Let’s convert
• Create table AVRO_TWEETS
• Insert into AVRO_TWEETS select …. from tweets

Cloudera Impala
Real-Time Query for Data Stored in Hadoop.
• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• FLEXIBLE: Supports multiple storage engines & file formats
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Enterprise RTQ

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop

SPEED TO INSIGHT
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos

COST SAVINGS
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas

FULL FIDELITY ANALYSIS
• All data available for interactive queries
• No loss of fidelity from fixed data schemas

DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Cloudera Impala Details (diagram)
• Components: SQL App, ODBC, Hive Metastore, YARN, HDFS NN, State Store; on each node a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DN and HBase
• Annotations: Unified metadata and scheduler; Common Hive SQL and interface; Low-latency scheduler and cache (low-impact failures); Fully MPP, distributed; Local direct reads

LOAD DATA TO ORACLE

Oracle Connectors for Hadoop
• Oracle Loader for Hadoop
• Oracle SQL Connector for Hadoop
• Big Data SQL

Oracle Loader for Hadoop
• Load data from Hadoop into Oracle
• Map-Reduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
• Supports Avro and compression

Oracle SQL Connector for Hadoop
• Run a Java app
• Creates an external table
• Runs MapReduce when external table is queried
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression

Big Data SQL
• Also external table
• Can also use Hive metastore for schema
• But … NO MapReduce
• Instead – an agent will do SMART SCANS
  • Bloom filters
  • Storage indexes
  • Filters
• Supports any Hadoop data format
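For readers unfamiliar with Bloom filters, the structure named in the smart scan list, here is a minimal generic sketch in Python. It illustrates the data structure only and says nothing about how Oracle implements it; the class name, bit-array size, and SHA-256 based hashing are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Compact set summary: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True only if every position for this key is set.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

In a scan, a filter built from the join keys on one side lets the storage side discard rows whose key is definitely absent before shipping anything back, which is the general reason such filters reduce data movement.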

What next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• Clone the source repo
  • https://github.com/cloudera/cdh-twitter-example