Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928)

Big data analytics
Miha Grčar (1, 2)
(1) Jožef Stefan Institute, (2) Sowa Labs GmbH
http://project-first.eu

Outline
• What is big data? What caused it? Who should care?
• Solving big data problems
• Examples
Frankfurt, 11/11/2013, Miha Grčar

What is big data?
• “How many terabytes?”
• We deliberately avoid being specific
• Big data refers to datasets that cannot be captured, stored, managed, and/or analyzed by mainstream storage and processing devices

What is big data? [figure]

What caused big data? Storage capacity and processing power
Source: Hilbert and López, “The world’s technological capacity to store, communicate, and compute information,” Science, 2011

What caused big data? Data availability (industry)
Source: IDC; US Bureau of Labor Statistics; McKinsey Global Institute analysis

What caused big data? Data availability (social media and mobile devices)
Source: www.creotivo.com

What caused big data? Data availability (sensors)
Source: Analyst interviews; McKinsey Global Institute analysis

What caused big data? Maturity of technologies & tools
[Figure scale: from “Emerging, hyped” to “Mature”]
Source: Gartner (July 2012)

Who should care about big data?
Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis

Solving big data problems
• Distributed infrastructure
  – Cloud: Amazon Elastic Compute Cloud (EC2)
• Distributed processing
  – MapReduce / batches: Hadoop
  – Distributed workflows / streams: Twitter Storm
• Distributed storage
  – Distributed FS/DB
  – NoSQL

Solving big data problems
• Distributed infrastructure
  – Cloud: Amazon Elastic Compute Cloud (EC2), Windows Azure, Google Cloud Platform, Cloudwatt…
• Distributed processing
  – MapReduce / batches: Hadoop, MS DryadLINQ, Disco, Misco, Phoenix, Cloud MapReduce, bashreduce, Qizmt…
  – Distributed workflows / streams: Storm (Twitter), S4 (Yahoo); “Real-time Hadoops”: Impala, HFlame, Spark…
• Distributed storage
  – Distributed FS/DB, NoSQL: Google File System, HDFS, Google Bigtable, HBase, Cassandra, MongoDB, CouchDB, Hive…

Amazon EC2 = ECC = Elastic Compute Cloud
• Central part of Amazon.com’s cloud computing service
• ~500,000 physical Linux machines
• Elastic: servers can be started and stopped according to demand; you pay only for running servers
• Instances (several examples)
  – Micro: 1 ECU, 1 core, 613 MiB
  – High-Memory XL: 6.5 ECUs, 2 cores, 17.1 GiB
  – High-CPU XL: 20 ECUs, 8 cores, 7 GiB
• OS
  – Windows
  – Linux
  – FreeBSD
• Storage
  – Temporary instance storage
  – Persistent Elastic Block Store (EBS)

MapReduce (Hadoop) [figure]

[Figure: A bunch of ballots, all mixed up… → Map → Still mixed up… (ballots for A, B, C) → Reduce → Election results]
Election results: A: 321,015; B: 179,539; C: 201,734
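The ballot analogy can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Hadoop code: the ballot list, the number of piles, and the function names `count_pile`, `merge_tallies`, and `count_ballots` are all hypothetical. The point it tries to show is that each pile can be tallied independently (the map side) and the partial tallies can then be added together (the reduce side).

```python
from collections import Counter
from functools import reduce


def count_pile(pile):
    """Map side: tally one pile of ballots independently of the others."""
    return Counter(pile)


def merge_tallies(left, right):
    """Reduce side: add two partial tallies together."""
    return left + right


def count_ballots(ballots, num_piles=3):
    """Split the mixed-up ballots into piles, tally each, then merge."""
    # Hypothetical split: every num_piles-th ballot goes to the same pile.
    piles = [ballots[i::num_piles] for i in range(num_piles)]
    partial_tallies = [count_pile(pile) for pile in piles]
    return reduce(merge_tallies, partial_tallies, Counter())


# Made-up ballots for illustration only; not the election data on the slide.
ballots = ["A", "B", "C", "A", "A", "B", "C", "B", "C", "A"]
results = count_ballots(ballots)
```

In a real cluster the piles would live on different machines; here the list comprehension merely stands in for that parallel step.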

MapReduce (Hadoop)
Stages: data → map → sort → copy → merge → reduce → output
data:
  195005150700+0000
  195005151200+0022
  195005151800-0011
  194903241200+0111
  194903241800+0078
map: (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
sort: 1950 [0, 22, -11]; 1949 [111, 78]
reduce: 1950 [22]; 1949 [111]
Source: Tom White: Hadoop: The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)
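The data flow on this slide (maximum temperature per year) can be imitated in plain Python as a rough sketch. It is not Hadoop code, and it assumes the simplified record layout that the slide suggests: the first four characters are the year and the last five are a signed temperature. The function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Records as shown on the slide (assumed layout: YYYYMMDDhhmm + signed reading).
RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]


def map_record(record):
    """Map: emit a (year, temperature) pair for one input record."""
    return record[:4], int(record[-5:])


def shuffle(pairs):
    """Sort the pairs by key and group the values of each key into a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        year: [temperature for _, temperature in group]
        for year, group in groupby(ordered, key=itemgetter(0))
    }


def reduce_max(temperatures):
    """Reduce: keep only the maximum temperature of one year."""
    return max(temperatures)


def max_temperature_by_year(records):
    grouped = shuffle(map(map_record, records))
    return {year: reduce_max(temps) for year, temps in grouped.items()}


grouped = shuffle(map(map_record, RECORDS))
output = max_temperature_by_year(RECORDS)
```

Here `grouped` corresponds to the sorted intermediate lists on the slide and `output` to the final per-year maxima.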

MapReduce (Hadoop) [figure]
Source: Tom White: Hadoop: The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)

Twitter Storm
[Figure: Spout (data source) → Bolts (data processors) → Bolt (data sink); step labels: Produce report, Print, Collate & bind, Sign, Send]

Twitter Storm: basic principle
data:
  195005150700+0000
  195005151200+0022
  195005151800-0011
  194903241200+0111
  194903241800+0078
[Figure: the Spout (data source) emits 194903241200+0111; a Bolt (data processor) receives 111, compares it with the current max 22, and sets the new max to 111; a Bolt (data sink/writer) overwrites 22 with 111]
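The streaming principle can be contrasted with the batch version in a small plain-Python sketch. It does not use the Storm API; `spout`, `MaxBolt`, `SinkBolt`, and `run_topology` are hypothetical names, and the record layout (last five characters are a signed reading) is an assumption taken from the slide. Each reading is processed as it arrives, and the sink is overwritten only when the running maximum changes.

```python
# Records in the order they appear on the slide.
RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]


def spout(records):
    """Data source: emit one reading at a time."""
    for record in records:
        yield int(record[-5:])


class MaxBolt:
    """Data processor: keeps the running maximum, emits it when it changes."""

    def __init__(self):
        self.current_max = None

    def process(self, reading):
        if self.current_max is None or reading > self.current_max:
            self.current_max = reading
            return reading  # new maximum, pass it downstream
        return None  # nothing to emit


class SinkBolt:
    """Data sink/writer: overwrites the stored value with each update."""

    def __init__(self):
        self.stored = None
        self.writes = 0

    def process(self, value):
        self.stored = value
        self.writes += 1


def run_topology(records):
    max_bolt, sink = MaxBolt(), SinkBolt()
    for reading in spout(records):
        new_max = max_bolt.process(reading)
        if new_max is not None:
            sink.process(new_max)
    return sink


sink = run_topology(RECORDS)
```

Unlike the batch job, a result is available after every record; the final stored value is simply the latest one.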

Twitter Storm: topology [figure]

Twitter Storm
[Figure panels: Stream; Pipelining; Parallelization; Pipelining and parallelization]

Examples
• Twitter sentiment and volume
  – Elections
  – Stock trading
• News cohesiveness, volume, and sentiment
  – Correlation with VIX, CDS
  – Correlation with big events
• Vocabulary in news & blogs
  – Pump & dump use case

Slovene elections
• 3 candidates, 3 live debates
• Sentiment analysis provider: Gama System & our team at JSI
• Streamed live, in real time, in prime time during the debates on POP TV
• During and after the debates (3 broadcasts), the sentiment chart was shown 5 times (with commentary)

[Figure labels: First live debate; Second live debate; Third live debate; Elections (first round)]

[Figure labels: Criticizing the gov; Supporting the gov; Criticizing a questionable pardoning of a criminal; Justifying it; Candidates justifying their wealth; Candidates joined by their wives]

Zver: “What kind of a political party leader were you if they (party members) didn’t follow your lead?”
Pahor: “Democratic.”

Polls vs. sentiment vs. outcome
Polls:
• Delo Stik (Delo, 9.11.): 44 / 31 / 25
• Mediana (Slovenske novice, 9.11.): 41.67 / 34.72 / 23.61
• Ninamedia (Mladina, 9.11.): 43.8 / 33.6 / 22.6
Twitter sentiment: “Borut Pahor will win”
Actual outcome (November 11, 2012):
1. Borut Pahor 40% (+4%)
2. Danilo Türk 36%
3. Milan Zver 24%

Twitter volume and election results
There’s no such thing as bad publicity.
“We believe that Twitter and other social media reflect the underlying trend in a political race that goes beyond a district’s fundamental geographic and demographic composition. If people must talk about you, even in negative ways, it is a signal that a candidate is on the verge of victory. The attention given to winners creates a situation in which all publicity is good publicity.” (DiGrazia, McKelvey, Bollen, Rojas: More tweets, more votes: Social media as a quantitative indicator of political behavior, February 2013)
Source: Smailović, Kranjc, Juršič, Grčar, Gačnik, Mozetič: Monitoring the Twitter sentiment during the Bulgarian elections (2013; to appear)

We’re looking at the stock of Amazon.com during 2012.
• The blue line shows the stock price.
• The red line shows the related Twitter sentiment.
• The black line is the 7-day moving average.
• The green-red line shows whether we profited (green) or not (red) from blindly following the social signals.
• A moving-average (MA) zero cross-over serves as a buy or sell signal.
Source: Sowa Labs GmbH
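The moving-average zero cross-over rule can be sketched as follows. This is a minimal illustration, not the Sowa Labs implementation: it assumes the moving average is taken over a daily sentiment score, that an upward crossing of zero means "buy" and a downward crossing means "sell", and the function names and sample data are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; None until a full window is available."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window : i + 1]) / window)
    return averages


def crossover_signals(sentiment, window=7):
    """Return (day_index, signal) pairs where the MA crosses zero."""
    ma = moving_average(sentiment, window)
    signals = []
    for day in range(1, len(ma)):
        previous, current = ma[day - 1], ma[day]
        if previous is None or current is None:
            continue
        if previous <= 0 < current:
            signals.append((day, "buy"))  # assumed: upward crossing = buy
        elif previous >= 0 > current:
            signals.append((day, "sell"))  # assumed: downward crossing = sell
    return signals


# Made-up daily sentiment scores; a 3-day window keeps the example short.
sentiment = [-1, -1, -1, 2, 2, 2, -3, -3, -3]
signals = crossover_signals(sentiment, window=3)
```

With the default `window=7` the same function corresponds to the 7-day moving average mentioned on the slide.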

[Chart markers: Q4/’11 results; Q1 results; Q2; Q3]
On April 26, 2012, Amazon announced financial results for its first quarter ended March 31, 2012. Amazon has been spending lots of money on expanding its operations, so analysts expected a huge drop in profit for this first quarter. However, Amazon blew analysts’ estimates away. Even though earnings did fall, they didn't decline nearly as much as analysts had feared. Amazon earned $130 million or 28 cents per share for the quarter that ended March 31. That was a 35% decline from a year ago, but it was much better than the 7 cents per share forecast by analysts polled by Thomson Reuters. Based on this news, Amazon shares surged nearly 16% on Friday morning, April 27, 2012.
Source: Sowa Labs GmbH

The sentiment MA cross-over happens well before the price jump.
Source: Sowa Labs GmbH

We’re looking at the stock of Google during 2012.
[Chart markers: Q4/’11; Q1; Q2 results; Q3 results]
On October 18, 2012, Google’s shares plunged by 9% after the search giant’s third-quarter earnings came in considerably lower than expected. The results were accidentally released several hours earlier than expected, leading to a halt in the shares’ trading for a time.
Source: Sowa Labs GmbH

The sentiment MA cross-over happens well before the price plunge.
Source: Sowa Labs GmbH

[Figure]
Source: Sowa Labs GmbH

Sentiment in news: Spain, Greece, Italy, Germany [figure]

News cohesiveness and VIX – implied volatility of S&P 500 (aka fear index)
Source: Rudjer Boskovic Institute, Boston University, Jožef Stefan Institute

News cohesiveness and CDS – Credit Default Swaps (insurance against default)
Source: Rudjer Boskovic Institute, Boston University, Jožef Stefan Institute

Pump & dump [figure]
Source: b-next, Goethe Universität, JSI (FIRST)

Pump & dump
[Figure labels: Country Black List, Industry Black List, Company Age, Bankrupt, Market Segment, History, Comp_FinInst, Market Capitalization, Trading Volume, Number of Trades, Sentiment, Pump & Dump, Financial Instrument, Trading, News, Content]
Source: b-next, Goethe Universität, JSI (FIRST)

Quick recap (1/3)
• Big data: volume, velocity, variety
• Enablers
  – Storage capacity & processing power
  – Maturity of technologies
  – Availability of data, e.g., social networks and mobile devices
  – Mindset
• Financial domain: one of the biggest gainers

Quick recap (2/3)
Solving big data problems
– Distributed infrastructure
  • Amazon EC2
– Distributed processing capacity
  • MapReduce (Hadoop)
  • Twitter Storm
– Distributed storage

Quick recap (3/3)
Examples
– Elections
  • No such thing as bad publicity
– Stock trading
  • Sentiment vs. price, Twitter volume vs. trading volume
– News & blogs
  • Volume & sentiment expose big events
  • Cohesiveness vs. VIX & CDS
  • Content and sentiment as inputs into a pump & dump detection model

Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu
http://www.sowalabs.de (coming really soon!)