Analyzing Twitter Data with Hadoop, Gwen Shapira
- Slides: 44
Analyzing Twitter Data with Hadoop. Gwen Shapira, Software Engineer, @Gwenshap. © 2012 Cloudera, Inc.
IOUG SIG Meetings at OpenWorld
All meetings located in Moscone South - Room 208
Monday, September 29
• Exadata SIG: 2:00 p.m. - 3:00 p.m.
• BIWA SIG: 5:00 p.m. - 6:00 p.m.
Tuesday, September 30
• Internet of Things SIG: 11:00 a.m. - 12:00 p.m.
• Storage SIG: 4:00 p.m. - 5:00 p.m.
• SPARC/Solaris SIG: 5:00 p.m. - 6:00 p.m.
Wednesday, October 1
• Oracle Enterprise Manager SIG: 8:00 a.m. - 9:00 a.m.
• Big Data SIG: 10:30 a.m. - 11:30 a.m.
• Oracle 12c SIG: 2:00 p.m. - 3:00 p.m.
• Oracle Spatial and Graph SIG: 4:00 p.m. (*OTN lounge)
COLLABORATE 15 - IOUG Forum
April 12-16, 2015, Mandalay Bay Resort and Casino, Las Vegas, NV
The IOUG Forum Advantage
• Save more than $1,000 on education offerings like pre-conference workshops
• Access the brand-new, specialized IOUG Strategic Leadership Program
• Priority access to the hands-on labs with Oracle ACE support
• Advance access to supplemental session material and presentations
• Special IOUG activities with no "ante in" needed - evening networking opportunities and more
www.collaborate.ioug.org
Follow us on Twitter at @IOUG or via the conference hashtag #C15LV!
I have 15 years of experience in moving data around © 2014 Cloudera, Inc. All rights reserved.
In my spare time…
• Oracle ACE Director
• Member of Oak Table
• Blogger
• Presenter - Hotsos, IOUG, OOW, OSCON
• NoCOUG board
• Contributor to Apache Oozie, Sqoop, Kafka
• Author - Hadoop Application Architectures
BUILDING A HADOOP APPLICATION
High Level Architecture
[Diagram. Components: Data Source, Flume, HDFS, Hive + Oozie, Impala / Oracle]
AN EXAMPLE USE CASE
Analyzing Twitter
• Social media is popular with marketing teams
• Twitter is an effective tool for promotion
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
• Which topics are trending?
HOW DO WE ANSWER THESE QUESTIONS?
Techniques
• Bring data with Flume
• Complex data
  • Deeply nested
  • Variable schema
• Clean, standardize, partition, etc.
• SQL
  • Filtering
  • Aggregation
  • Sorting
FLUME
Flume Agent design
In our case…
• Twitter source
  • Pulls JSON format files from Twitter
• Memory Channel
• HDFS Sink: directory per hour
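To make the source, channel, and sink roles concrete, here is a minimal Python sketch of the same flow. It is not the Flume API: `source_put`, `sink_drain`, the queue size, and the `tweets/%Y/%m/%d/%H` directory pattern are all hypothetical stand-ins chosen for illustration.

```python
import queue
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical stand-in for a memory channel: a bounded in-memory buffer.
channel = queue.Queue(maxsize=100)


def source_put(event_body, timestamp):
    """Source side: wrap an incoming record as an event and buffer it."""
    channel.put({"timestamp": timestamp, "body": event_body})


def sink_drain(store):
    """Sink side: take events off the channel and file them by hour."""
    while not channel.empty():
        event = channel.get()
        when = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        # One directory per hour (assumed pattern, for illustration only).
        directory = when.strftime("tweets/%Y/%m/%d/%H")
        store[directory].append(event["body"])
    return store


store = defaultdict(list)
source_put('{"text": "first"}', 1411999200)    # 2014-09-29 14:00:00 UTC
source_put('{"text": "second"}', 1411999260)   # one minute later, same hour
source_put('{"text": "third"}', 1412002800)    # 2014-09-29 15:00:00 UTC
sink_drain(store)
print(sorted(store))
```

The source and sink never talk to each other directly; the channel decouples them, which is why a slow sink only fills the buffer rather than blocking the source immediately.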
What is JSON?
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        { "text": "Crowdsourcing", "indices": [0, 14] },
        { "text": "bigdata", "indices": [129, 137] }
      ],
      "user_mentions": []
    }
  }
}
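A short Python sketch of reading such a record, using only the standard `json` module. The record below is a trimmed copy of the slide's example; the helper name `hashtags` is hypothetical. Because the schema is variable, the helper tolerates missing fields instead of assuming they exist.

```python
import json

# Trimmed version of the tweet shown on the slide.
raw = """
{
  "retweeted_status": {
    "contributors": null,
    "retweeted": false,
    "entities": {
      "hashtags": [
        {"text": "Crowdsourcing", "indices": [0, 14]},
        {"text": "bigdata", "indices": [129, 137]}
      ],
      "user_mentions": []
    }
  }
}
"""


def hashtags(tweet):
    """Return hashtag texts from the nested retweeted_status, or [] if absent."""
    status = tweet.get("retweeted_status") or {}
    entities = status.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags", [])]


tweet = json.loads(raw)
print(hashtags(tweet))
print(hashtags({}))  # a tweet that is not a retweet has no nested status
```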
But Wait! There's More!
• Many sources: directory, files, log4j, net, JMS
• Interceptors: process data in flight
• Selectors: choose which sink
• Many channels: memory, file
• Many sinks: HDFS, HBase, Solr
High Level Pipeline Architecture
[Diagram. Labels: Web App with Flume Avro Client (x3), client providing multithreading, compression, encryption, and batching; fan-in pattern; Flume Agents (multi agents for failover and rolling restarts); HDFS; HBase; Spark Streaming (Spark Streaming data is a subset of whole events); ML and Map/Reduce jobs; batch report updates; Report App pulls near-real-time results; query with HBase API or Impala]
Configuration
TwitterAgent.sources =
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume, sqoop, oracle, oow
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.serializer = text
TwitterAgent.channels.MemChannel.type = memory
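The wiring in a configuration like this is easy to get wrong: every source and sink must name a channel that the agent actually declares. The following Python sketch checks that for a subset of the properties above. It is only an illustration: `parse_properties` and `check_wiring` are hypothetical helpers, not part of Flume, and the parser handles just simple `key = value` lines.

```python
SAMPLE = """
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.channels.MemChannel.type = memory
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of source/sink keys that point at undeclared channels."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for key, value in props.items():
        is_source = key.startswith(f"{agent}.sources.") and key.endswith(".channels")
        is_sink = key.startswith(f"{agent}.sinks.") and key.endswith(".channel")
        if is_source or is_sink:
            for name in value.split():
                if name not in declared:
                    problems.append(f"{key} -> {name}")
    return problems


props = parse_properties(SAMPLE)
print(check_wiring(props, "TwitterAgent"))

# A sink pointing at a channel that was never declared is reported.
broken = dict(props, **{"TwitterAgent.sinks.HDFS.channel": "FileChannel"})
print(check_wiring(broken, "TwitterAgent"))
```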
FLUME DEMO
HIVE
What is Hive?
• Created at Facebook
• HiveQL
  • SQL-like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client
Hive Details
• Metastore contains table definitions
  • Stored in a relational database
  • Basically a data dictionary
• SerDes parse data and convert it to a table/column structure
• SerDes exist for:
  • CSV, XML, JSON, Avro, Parquet, ORC files
  • Or write your own (we created one for CopyBook)
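Conceptually, the deserializing half of a SerDe turns one raw record into one row whose columns match the table definition. The Python sketch below imitates that idea for JSON. It is not Hive's SerDe interface (which is Java); the column list and the `deserialize` helper are hypothetical, chosen only to show the record-to-row mapping.

```python
import json

# Hypothetical table definition: column names as dotted paths into the record.
COLUMNS = ["id", "text", "user.screen_name"]


def deserialize(line, columns=COLUMNS):
    """Map one JSON record to a row tuple; missing fields become None (NULL)."""
    record = json.loads(line)
    row = []
    for path in columns:
        value = record
        for part in path.split("."):
            # Walk down nested objects; stop yielding values once a level is missing.
            value = value.get(part) if isinstance(value, dict) else None
        row.append(value)
    return tuple(row)


full = deserialize('{"id": 1, "text": "hi", "user": {"screen_name": "gwenshap"}}')
partial = deserialize('{"id": 2}')
print(full)
print(partial)
```

Returning `None` for absent fields mirrors how a variable-schema source can still be read through a fixed table definition.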
Complex Data
SELECT t.retweet_screen_name,
       sum(retweets) AS total_retweets,
       count(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name,
             retweeted_status.text,
             max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
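The logic of this query can be followed step by step in plain Python. The sketch below mirrors it on a few made-up records: the inner step keeps the highest retweet count seen per (user, text) pair, and the outer step sums those per user and ranks them. The sample data and the `top_retweeted` helper are hypothetical, and unlike Hive, this sketch simply skips tweets without a `retweeted_status` rather than grouping them under NULL.

```python
from collections import defaultdict


def top_retweeted(tweets, limit=10):
    """Return (screen_name, total_retweets, tweet_count) rows, highest first."""
    # Inner query: max retweet_count per (screen_name, text).
    per_tweet = {}
    for tweet in tweets:
        status = tweet.get("retweeted_status")
        if not status:
            continue  # simplification: ignore tweets that are not retweets
        key = (status["user"]["screen_name"], status["text"])
        per_tweet[key] = max(per_tweet.get(key, 0), status["retweet_count"])

    # Outer query: sum and count per screen_name.
    totals = defaultdict(lambda: [0, 0])
    for (name, _text), retweets in per_tweet.items():
        totals[name][0] += retweets
        totals[name][1] += 1

    rows = [(name, total, count) for name, (total, count) in totals.items()]
    rows.sort(key=lambda row: row[1], reverse=True)  # ORDER BY total_retweets DESC
    return rows[:limit]


def retweet(name, text, count):
    """Build a minimal made-up tweet record for the example."""
    return {"retweeted_status": {"user": {"screen_name": name},
                                 "text": text, "retweet_count": count}}


sample = [
    retweet("alice", "A", 5),
    retweet("alice", "A", 7),   # same tweet seen later with a higher count
    retweet("alice", "B", 3),
    retweet("bob", "C", 8),
    {"text": "an original tweet"},
]
result = top_retweeted(sample)
print(result)
```

Taking the maximum first matters because the same original tweet appears once per captured retweet, each time with a running count; summing those directly would overcount.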
HIVE DEMO
IT'S A TRAP
Not a Database
Comparison of RDBMS, Hive, and Impala:
• Language: RDBMS generally >= SQL-92; Hive a subset of SQL-92 plus Hive-specific extensions
• Update capabilities: RDBMS INSERT, UPDATE, DELETE; Hive bulk INSERT, UPDATE, DELETE; Impala insert, truncate
• Transactions: RDBMS yes; Hive yes; Impala no
• Latency: RDBMS sub-second; Hive minutes; Impala sub-second
• Indexes: RDBMS yes; Impala no
• Data size: RDBMS few terabytes; Hive petabytes; Impala lots of terabytes
DATA FORMATS
I don't like our data
• Lots of small files
• JSON: requires parsing
• Can't compress
• Sensitive to changes
I'd rather use Avro
• Few large files containing records
• Schema in file
• Schema evolution
• Can compress
• Well supported in Hadoop
• Clients in other languages
Let's convert
• Create table AVRO_TWEETS
• Insert into AVRO_TWEETS select …. from tweets
IMPALA ASIDE
Cloudera Impala: Real-Time Query for Data Stored in Hadoop
• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• FLEXIBLE: Supports multiple storage engines & file formats
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Enterprise RTQ
Benefits of Cloudera Impala: Real-Time Query for Data Stored in Hadoop
• SPEED TO INSIGHT
  • Real-time queries run directly on source data
  • No ETL delays
  • No jumping between data silos
• COST SAVINGS
  • No double storage with EDW/RDBMS
  • Unlock analysis on more data
  • No need to create and maintain complex ETL between systems
  • No need to preplan schemas
• FULL FIDELITY ANALYSIS
  • All data available for interactive queries
  • No loss of fidelity from fixed data schemas
• DISCOVERABILITY
  • Single metadata store from origination through analysis
  • No need to hunt through multiple data silos
Cloudera Impala Details
[Diagram. Labels: SQL App over ODBC; common Hive SQL and interface; unified metadata and scheduler (Hive Metastore, YARN, HDFS NN, State Store); low-latency scheduler and cache (low-impact failures); fully MPP distributed nodes, each with Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, and HBase; local direct reads]
LOAD DATA TO ORACLE
Oracle Connectors for Hadoop
• Oracle Loader for Hadoop
• Oracle SQL Connector for Hadoop
• Big Data SQL
Oracle Loader for Hadoop
• Load data from Hadoop into Oracle
• MapReduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
• Supports Avro and compression
Oracle SQL Connector for Hadoop
• Run a Java app
• Creates an external table
• Runs MapReduce when the external table is queried
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression
Big Data SQL
• Also an external table
• Can also use Hive metastore for schema
• But …. NO MapReduce
• Instead, an agent will do SMART SCANS
  • Bloom filters
  • Storage indexes
  • Filters
• Supports any Hadoop data format
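For readers unfamiliar with Bloom filters, the general idea can be sketched in a few lines of Python. This is a generic textbook Bloom filter, not Oracle's implementation (the slides do not describe its internals); the class, its sizes, and the sample keys are assumptions for illustration. The filter can say "definitely not present" or "possibly present", so a scan can skip rows early without ever dropping a real match.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Build the filter from the join keys of a small table (made-up values).
wanted = ["gwenshap", "cloudera", "ioug"]
bloom = BloomFilter()
for key in wanted:
    bloom.add(key)

# While scanning a larger data set, keep only rows that might match.
scanned = ["gwenshap", "someone_else", "ioug", "another_user"]
kept = [row for row in scanned if bloom.might_contain(row)]
print(kept)
```

Rows that pass the filter still need the exact comparison afterwards, since a Bloom filter may let a small number of non-matching rows through.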
PUTTING IT ALL TOGETHER
High Level Architecture
[Diagram. Components: Data Source, Flume, HDFS, Hive + Oozie, Impala / Oracle]
What next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• Clone the source repo
  • https://github.com/cloudera/cdh-twitter-example