Large-scale information extraction and integration infrastructure for supporting financial decision making
- Slides: 47
Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu Big data analytics Miha Grčar (1, 2) 1: Jožef Stefan Institute 2: Sowa Labs GmbH
Outline • What is big data? What caused it? Who should care? • Solving big data problems • Examples Frankfurt, 11/11/2013 Miha Grčar 2
What is big data? • “How many terabytes?” • We deliberately avoid being specific • Big data refers to datasets that cannot be captured, stored, managed, and/or analyzed by mainstream storage and processing devices
What is big data? [figure]
What caused big data? Storage capacity and processing power Source: Hilbert and López, “The world’s technological capacity to store, communicate, and compute information,” Science, 2011
What caused big data? Data availability (industry) Source: IDC; US Bureau of Labor Statistics; McKinsey Global Institute analysis
What caused big data? Data availability (social media and mobile devices) Source: www.creotivo.com
What caused big data? Data availability (sensors) Source: Analyst interviews; McKinsey Global Institute analysis
What caused big data? Maturity of technologies & tools (from emerging, hyped to mature) Source: Gartner (July 2012)
Who should care about big data? Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis
Solving big data problems • Distributed infrastructure – Cloud: Amazon Elastic Compute Cloud (EC2) • Distributed processing – MapReduce / batches: Hadoop – Distributed workflows / streams: Twitter Storm • Distributed storage – Distributed FS/DB – NoSQL
Solving big data problems • Distributed infrastructure – Cloud: Amazon EC2, Windows Azure, Google Cloud Platform, Cloudwatt… • Distributed processing – MapReduce / batches – Distributed workflows / streams: Hadoop, MS DryadLINQ, Disco, Misco, Phoenix, Cloud MapReduce, Storm (Twitter), S4 (Yahoo), bashreduce, Qizmt… “Real-time Hadoops”: Impala, HFlame, Spark… • Distributed storage – Distributed FS/DB – NoSQL: Google File System, HDFS, Google Bigtable, HBase, Cassandra, MongoDB, CouchDB, Hive…
Amazon EC2 = ECC = Elastic Compute Cloud • Central part of Amazon.com’s cloud computing service • ~500,000 physical Linux machines • Elastic: possibility to start / stop servers with respect to demand; pay only for running servers • Instances (several examples) – Micro, 1 ECU, 1 Core, 613 MiB – High-Memory XL, 6.5 ECUs, 2 Cores, 17.1 GiB – High-CPU XL, 20 ECUs, 8 Cores, 7 GiB • OS – Windows – Linux – FreeBSD • Storage – Temporary instance storage – Persistent Elastic Block Store (EBS)
MapReduce (Hadoop) [figure]
A bunch of ballots, all mixed up… [Map] Still mixed up… [Reduce] Election results for candidates A, B, and C: A: 321,015 • B: 179,539 • C: 201,734
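The ballot analogy can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop code: the function names (`map_pile`, `merge_tallies`, `count_ballots`) and the sample piles are assumptions made up for the example. Each worker tallies its own pile (the map step, here with local pre-aggregation), and the partial tallies are then merged into one result (the reduce step).

```python
from collections import Counter
from functools import reduce


def map_pile(pile):
    """Map step: one worker tallies the ballots in its own pile."""
    return Counter(pile)


def merge_tallies(left, right):
    """Reduce step: merge two partial tallies into one."""
    return left + right


def count_ballots(piles):
    """Tally every pile independently, then merge the partial results."""
    partial_tallies = [map_pile(pile) for pile in piles]
    return dict(reduce(merge_tallies, partial_tallies, Counter()))


if __name__ == "__main__":
    # Hypothetical piles of mixed-up ballots for candidates A, B, and C.
    sample_piles = [["A", "B", "A"], ["C", "A"], ["B"]]
    print(count_ballots(sample_piles))
```

Because each pile is tallied on its own, the map step could run on as many machines as there are piles; only the small partial tallies need to be brought together.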
MapReduce (Hadoop) • Data: 195005150700+0000, 195005151200+0022, 195005151800-0011, 194903241200+0111, 194903241800+0078 • Map: (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78) • Sort / copy / merge: 1950 [0, 22, -11], 1949 [111, 78] • Reduce (output): 1950 [22], 1949 [111] Source: Tom White: Hadoop: The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)
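The maximum-temperature walk-through can be reproduced with a small, single-process Python sketch. It is not Hadoop code and only mimics the map, sort/merge, and reduce stages in memory. The record layout is an assumption read off the sample data (a 12-character timestamp followed by a signed reading), and the function names are illustrative.

```python
from collections import defaultdict

RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]


def map_record(record):
    """Map: emit a (year, temperature) pair for one input record."""
    year = record[:4]
    temperature = int(record[12:])  # signed reading, e.g. "+0022" -> 22
    return year, temperature


def shuffle(pairs):
    """Sort/merge: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_max(year, temperatures):
    """Reduce: keep only the maximum temperature for a year."""
    return year, max(temperatures)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the records."""
    pairs = [map_record(record) for record in records]
    groups = shuffle(pairs)
    return dict(reduce_max(year, temps) for year, temps in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(RECORDS))  # {'1949': 111, '1950': 22}
```

In a real cluster the map calls run in parallel on separate chunks of the input and the grouping happens across the network; the per-key logic stays the same.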
MapReduce (Hadoop) [figure] Source: Tom White: Hadoop: The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)
Twitter Storm • Produce report (spout; data source) • Print (bolt) • Collate & bind (bolt) • Sign (bolt) • Send (data sink) • The bolts are the data processors
Twitter Storm: basic principle • Data: 195005150700+0000, 195005151200+0022, 195005151800-0011, 194903241200+0111, 194903241800+0078 • Spout (data source): emits 194903241200+0111 • Bolt (data processor): Received: 111, Current max: 22, New max: 111; emits 111 • Bolt (data sink/writer): Overwrite 22 with 111
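The streaming version of the same computation can be sketched with plain Python generators. This does not use the Storm API: the generators merely stand in for a spout and two bolts, and all function names are illustrative. Unlike the batch job, each record is handled as soon as it arrives, and the stored maximum is overwritten whenever a larger reading shows up.

```python
RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]


def spout(records):
    """Data source: emit raw records one at a time."""
    for record in records:
        yield record


def parse_bolt(stream):
    """Data processor: extract the signed reading from each record."""
    for record in stream:
        yield int(record[12:])


def max_bolt(stream):
    """Data processor: emit a value only when it beats the current max."""
    current_max = None
    for value in stream:
        if current_max is None or value > current_max:
            current_max = value
            yield current_max


def sink(stream):
    """Data sink/writer: overwrite the stored value with each new max."""
    stored = None
    for value in stream:
        stored = value
    return stored


if __name__ == "__main__":
    print(sink(max_bolt(parse_bolt(spout(RECORDS)))))  # 111
```

With the sample records the maximum goes from 0 to 22 to 111, which matches the “Overwrite 22 with 111” step on the slide.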
Twitter Storm: topology [figure]
Twitter Storm • Stream • Pipelining • Parallelization • Pipelining and parallelization
Examples • Twitter sentiment and volume – Elections – Stock trading • News cohesiveness, volume, and sentiment – Correlation with VIX, CDS – Correlation with big events • Vocabulary in news & blogs – Pump & dump use case
Slovene elections • 3 candidates, 3 live debates • Sentiment analysis provider: Gama System & our team at JSI • Streamed live, in real time, in prime time during the debates on POP TV • During and after the debates (3 broadcasts), the sentiment chart was shown 5 times (with commentary)
Chart markers: First live debate • Second live debate • Third live debate • Elections (first round)
Chart annotations: Criticizing the government • Supporting the government • Criticizing a questionable pardoning of a criminal • Justifying it • Candidates justifying their wealth • Candidates joined by their wives
Zver: “What kind of a political party leader were you if they (party members) didn’t follow your lead?” Pahor: “Democratic.”
Polls vs. sentiment vs. outcome • Actual outcome, November 11, 2012: 1. Borut Pahor 40% (+4%), 2. Danilo Türk 36%, 3. Milan Zver 24% • Delo Stik (Delo, 9.11.): 44 / 31 / 25 • Mediana (Slovenske novice, 9.11.): 41.67 / 34.72 / 23.61 • Ninamedia (Mladina, 9.11.): 43.8 / 33.6 / 22.6 • Twitter sentiment: “Borut Pahor will win”
Twitter volume and election results There’s no such thing as bad publicity. “We believe that Twitter and other social media reflect the underlying trend in a political race that goes beyond a district’s fundamental geographic and demographic composition. If people must talk about you, even in negative ways, it is a signal that a candidate is on the verge of victory. The attention given to winners creates a situation in which all publicity is good publicity.” (DiGrazia, McKelvey, Bollen, Rojas: More tweets, more votes: Social media as a quantitative indicator of political behavior, February 2013) Source: Smailović, Kranjc, Juršič, Grčar, Gačnik, Mozetič: Monitoring the Twitter sentiment during the Bulgarian elections (2013; to appear)
We’re looking at the stock of Amazon.com… …during 2012. • The blue line shows the stock price. • The red line shows the related Twitter sentiment. • The black line is the 7-day moving average. • An MA zero cross-over serves as a buy or sell signal. • The green-red line shows whether we profited (green) or not (red) from blindly following the social signals. Source: Sowa Labs GmbH
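The cross-over rule can be illustrated with a short Python sketch. Several details are assumptions, since the slide does not spell them out: the moving average is taken as a trailing simple average of daily sentiment scores, an upward crossing of zero is read as “buy” and a downward crossing as “sell”, and the function names and sample values are made up for illustration.

```python
def moving_average(values, window=7):
    """Trailing simple moving average; None until the window is full."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value leaving the window
        averages.append(running_sum / window if i >= window - 1 else None)
    return averages


def crossover_signals(values, window=7):
    """Return (day_index, 'buy' | 'sell') wherever the MA crosses zero."""
    ma = moving_average(values, window)
    signals = []
    for day in range(1, len(ma)):
        previous, current = ma[day - 1], ma[day]
        if previous is None or current is None:
            continue
        if previous <= 0 < current:
            signals.append((day, "buy"))  # assumed: upward crossing = buy
        elif previous >= 0 > current:
            signals.append((day, "sell"))  # assumed: downward crossing = sell
    return signals


if __name__ == "__main__":
    # Hypothetical daily sentiment scores; a 3-day window keeps the demo short.
    daily_sentiment = [-1, -1, -1, 2, 2, 2, -3, -3, -3]
    print(crossover_signals(daily_sentiment, window=3))
```

Smoothing with a moving average before looking for zero crossings reduces the number of signals triggered by single noisy days, at the cost of reacting a few days later.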
Chart markers: Q4/’11 results, Q1 results, Q2 results, Q3 results. On April 26, 2012, Amazon announced financial results for its first quarter ended March 31, 2012. Amazon has been spending lots of money on expanding its operations, so analysts expected a huge drop in profit for this first quarter. However, Amazon blew analysts’ estimates away. Even though earnings did fall, they didn’t decline nearly as much as analysts had feared. Amazon earned $130 million, or 28 cents per share, for the quarter that ended March 31. That was a 35% decline from a year ago, but it was much better than the 7 cents per share forecast by analysts polled by Thomson Reuters. Based on this news, Amazon shares surged nearly 16% on Friday morning, April 27, 2012. Source: Sowa Labs GmbH
The sentiment MA cross-over happens well before the price jump. Source: Sowa Labs GmbH
We’re looking at the stock of Google… …during 2012. Chart markers: Q4/’11 results, Q1 results, Q2 results, Q3 results. On October 18, 2012, Google’s shares plunged by 9% after the search giant’s third-quarter earnings came in considerably lower than expected. The results were accidentally released several hours earlier than expected, leading to a halt in the shares’ trading for a time. Source: Sowa Labs GmbH
The sentiment MA cross-over happens well before the price plunge. Source: Sowa Labs GmbH
Source: Sowa Labs GmbH
Sentiment in news: Spain, Greece, Italy, Germany
News cohesiveness and VIX – implied volatility of S&P 500 (aka fear index) Source: Rudjer Boskovic Institute, Boston University, Jozef Stefan Institute
News cohesiveness and CDS – Credit Default Swaps (insurance against default) Source: Rudjer Boskovic Institute, Boston University, Jozef Stefan Institute
Pump & dump Source: b-next, Goethe Universität, JSI (FIRST)
Pump & dump • Model diagram labels: Country Black List, Industry Black List, Company Age, Bankrupt, Market Segment, History, Comp_FinInst, Market Capitalization, Trading Volume, Number of Trades, Sentiment, Pump & Dump, Financial Instrument, Trading, News, Content Source: b-next, Goethe Universität, JSI (FIRST)
Quick recap (1/3) • Big data: volume, velocity, variety • Enablers – Storage capacity & processing power – Maturity of technologies – Availability of data, e.g., social networks and mobile devices – Mindset • Financial domain: one of the biggest gainers
Quick recap (2/3) Solving big data problems – Distributed infrastructure • Amazon EC2 – Distributed processing capacity • MapReduce (Hadoop) • Twitter Storm – Distributed storage
Quick recap (3/3) Examples – Elections • No such thing as bad publicity – Stock trading • Sentiment vs. price, Twitter volume vs. trading volume – News & blogs • Volume & sentiment expose big events • Cohesiveness vs. VIX & CDS • Content and sentiment as inputs into a pump & dump detection model
Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu http://www.sowalabs.de (coming really soon!)