Situational Business Intelligence
Volker Markl
Database Systems and Information Management, Technische Universität Berlin
© Chair for Database Systems and Information Management

Agenda
► Traditional Business Intelligence
► Next Generation Business Intelligence
► Building Blocks: Cloud Computing, MapReduce and Hadoop, Pig Latin, UIMA, Social Tagging
► The Long Tail of Situational Applications
► Situational Business Intelligence
► Challenges

Traditional Business Intelligence

How Did We Get Here?
[Figure labels: Batch Reporting, OLAP, Query/Reporting, Client Server Business Intelligence, Web enabled Business Intelligence, BI over Text. Chart: actual and forecasted BI tools software revenue as reported by IDC. Sources: IDC, Gartner]

2008 CIO Technology Priorities
To what extent will each of the following technologies be a Top 5 priority for you in 2008?

Technology                                          Rank 2008  Rank 2007  Rank 2006  2008 Increase*
Business Intelligence Applications                  1                                11.20%
Enterprise Applications (ERP, SCM, and CRM)         2          2          **         8.02%
Server and Storage Technologies (Virtualization)    3          5          9          8.45%
Legacy Application Modernization                    4          3          10         5.79%
Security Technologies                               5          6          2          8.53%
Technical Infrastructure                            6          8          12         4.67%
Networking, Voice, and Data Communications (VoIP)   7          4          8          6.83%
Collaboration Technologies                          8          10         4          7.75%
Document Management                                 9          9          **         7.91%
Service-Oriented Technologies (SOA and SOBA)        10         7          6          6.71%

* Unweighted average budget change
** New question for 2007
Source: 2008 Gartner Executive Programs CIO Survey, January 10, 2008

What are CIOs missing?
Survey question: "Please give me an example of how your business intelligence solution could better meet your organization's main objective?"
► Response categories: Better/more information, Faster/quick retrieval, Accurate/updated data, Consistent platform, Better integration, Standardization, Other single mentions
► Chart values: 22.9%, 14.3%, 11.4%, 8.6%, 40.0%
Source: Business Intelligence Survey, IDC

Next Generation Business Intelligence
The next generation of Business Intelligence (NGBI) correlates data warehouses with text and semi-structured data from web services of corporate intranets and the internet.
► Sources: Internet, Intranet (Text, XLS, XML), Data Warehouse, Data Marts
► Processing: Information Extraction, Semantic Integration (Schema and Entities), Load/Refresh or ad-hoc Analysis
► Example questions:
  Who is leading in American Idol?
  Who are the biggest players in the Linux market?
  Which insurance policy customers are at risk of being hit by a current storm?

Answering an NGBI Query
Who are the biggest players in the "Linux" market?
Web 2.0 documents from 332 Wiki News docs (January-March 2007)

Data Source Identification
Process: Data Source identification -> Atomic Entity extraction -> Schema extraction -> Data Cleansing -> Data Fusion
► Data Warehouse
► Masterdata
► Information Providers
► Information Marketplaces
► Crawling (Internet/Intranet)

Atomic Entity Extraction
Process: Data Source identification -> Atomic Entity extraction -> Schema extraction -> Data Cleansing -> Data Fusion
Options, ordered by additional extraction and data cleansing effort:
Out-of-the-box data
► Web Services for complex, atomic and named entities
Frameworks
► Infrastructures for extracting, managing and scalable storage of named entities
► Web Services for extracting named entities
Basic Components
► Screen scraper

Ad hoc analysis process
Data Source identification -> Atomic Entity extraction -> Schema extraction -> Data Cleansing -> Data Fusion

Schema Extraction
Process: Pre Process -> Base extraction -> Schema extraction -> Data Cleansing -> Data Fusion
► Company Technology -> Technology
► Company Technology -> Company

Data Cleansing
Process: Pre Process -> Base extraction -> Schema extraction -> Data Cleansing -> Data Fusion
► Duplicates

Data Fusion: Information Integration
Process: Pre Process -> Base extraction -> Schema extraction -> Data Cleansing -> Data Fusion
► Steps: Data Source A and Data Source B -> Schema Mapping -> Duplicate Detection -> Data Fusion
► Example: "Apple iPhone 3 Gen" and "Apple iPhone 3G" match as duplicates; fusion resolves the name by max length and the price (299.95 vs. 199.99) by min, giving "Apple iPhone 3 Gen", 199.99
► e.g., Hummer (U Potsdam)

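The fusion step on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout, the attribute names `product` and `price`, the helper names `longest` and `fuse`, and the assignment of each price to a particular source are hypothetical; the slide only states the two resolution rules (max length, min).

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def longest(values: List[str]) -> str:
    """Resolve a naming conflict by keeping the longest string (max length)."""
    return max(values, key=len)


def fuse(duplicates: List[Record],
         resolvers: Dict[str, Callable[[list], object]]) -> Record:
    """Fuse records already detected as duplicates into one record.

    Each attribute is resolved by its own conflict-resolution function.
    """
    fused: Record = {}
    for attribute, resolve in resolvers.items():
        values = [r[attribute] for r in duplicates
                  if r.get(attribute) is not None]
        fused[attribute] = resolve(values) if values else None
    return fused


# Hypothetical records after schema mapping and duplicate detection.
source_a = {"product": "Apple iPhone 3 Gen", "price": 299.95}
source_b = {"product": "Apple iPhone 3G", "price": 199.99}

fused = fuse([source_a, source_b], {"product": longest, "price": min})
print(fused)
```

Keeping one resolver per attribute makes the policy explicit: a different rule (for example "most recent" or "most trusted source") could be swapped in without touching the fusion loop.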
Data Fusion
Process: Pre Process -> Base extraction -> Schema extraction -> Data Cleansing -> Data Fusion
► Input tuples: (a, b, c, -), (a, b, -, d), (a, b, -, -), (a, b, c, -), (a, e, -, d)
► Integration of complementary tuples: (a, b, c, d)
► Elimination of identical tuples
► Elimination of subsumed tuples
► Conflict resolution: (a, f(b, e), c, d)
► Further tuples shown: (a, b, -, -), (a, b, c, -)

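The four tuple operations named on this slide can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: "-" is modeled as `None`, all input tuples are taken to describe the same real-world object, complementary tuples are merged greedily in input order, and the function names are hypothetical. The conflict function `f` simply builds the string "f(b,e)" to mirror the slide.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[Optional[str], ...]


def subsumes(t: Row, u: Row) -> bool:
    """t subsumes u if t differs from u and agrees with every non-null value of u."""
    return t != u and all(b is None or a == b for a, b in zip(t, u))


def merge_if_complementary(t: Row, u: Row) -> Optional[Row]:
    """Merge two tuples that never disagree on a non-null value, else return None."""
    if any(a is not None and b is not None and a != b for a, b in zip(t, u)):
        return None
    return tuple(a if a is not None else b for a, b in zip(t, u))


def resolve(t: Row, u: Row, f: Callable[[str, str], str]) -> Row:
    """Combine two conflicting tuples, applying f wherever the values differ."""
    def pick(a: Optional[str], b: Optional[str]) -> Optional[str]:
        if a is None:
            return b
        if b is None or a == b:
            return a
        return f(a, b)
    return tuple(pick(a, b) for a, b in zip(t, u))


def fuse(rows: List[Row], f: Callable[[str, str], str]) -> Row:
    """Fuse a non-empty list of tuples describing one object into one tuple."""
    rows = list(dict.fromkeys(rows))                  # eliminate identical tuples
    rows = [u for u in rows
            if not any(subsumes(t, u) for t in rows)]  # eliminate subsumed tuples
    merged: List[Row] = []
    for row in rows:                                  # integrate complementary tuples
        for i, other in enumerate(merged):
            combined = merge_if_complementary(other, row)
            if combined is not None:
                merged[i] = combined
                break
        else:
            merged.append(row)
    result = merged[0]
    for row in merged[1:]:                            # conflict resolution
        result = resolve(result, row, f)
    return result


rows = [("a", "b", "c", None), ("a", "b", None, "d"), ("a", "b", None, None),
        ("a", "b", "c", None), ("a", "e", None, "d")]
fused = fuse(rows, lambda x, y: f"f({x},{y})")
print(fused)
```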
Address Uncertainty: Query Refinement
► Extract -> SELECT -> PROJECT -> JOIN -> (COUNT, AVG, SUM, MEAN, ...)
► "Everything" about Dell?
► The market of "Linux" from 2007-2008?
► "What's the average analyst quote about the IBM stock price for the last month?"
► Drill down on region, time, organization, ...
[Figure labels: QUERY, DATA]

Building Blocks
► Cloud Computing
► MapReduce
► Pig
► UIMA
► Social Tagging

Cloud Computing
► What is Cloud Computing?
  Computing platform architecture
  Scales to any application
  High fault tolerance
  No generally accepted definition available
  Separation from Utility or Grid Computing is not obvious

Cloud Computing
► How does Cloud Computing work?
  Lots of loosely coupled computers
  Use of commodity hardware
  Flexible up- or downgrading of resources
  APIs offer access to cloud computing systems
  Software takes care of parallelization, hardware failures and error handling
  Resources (e.g. storage, computing power) can be bought as services (paying for usage, e.g. Amazon)

MapReduce – Programming Model
Program logic is split into 2 functions: Map(k, v) and Reduce(k, list(v))
► Functions receive and produce (Key, Value)-pairs
► Map(k, v) computes for each (k, v)-pair an intermediate (ki, vi)-pair
► Reduce(k, list(v)) merges all values with the same key k and outputs the result
► MapReduce programs are easy to develop
► Frameworks provide libraries
  Frameworks take care of parallelization, distribution and error handling
  Only application specific source code is required (no parallelization and error handling code)

MapReduce – Group AVG Example
► Input Data: (NewYork, US, 10), (LosAngeles, US, 40), (London, GB, 20), (Berlin, DE, 60), (Glasgow, GB, 10), (Munich, DE, 30), …
► MAP(k, v) -> Intermediate (K, V)-Pairs: (US, 10), (US, 40), (GB, 20), (DE, 60), (GB, 10), (DE, 30)
► Grouped by key: US: (US, 10), (US, 40); GB: (GB, 20), (GB, 10); DE: (DE, 60), (DE, 30)
► REDUCE(k, list(v)) -> Result: (DE, 45), (GB, 15), (US, 25)

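The Group AVG example can be reproduced with a single-process Python sketch of the programming model. This is a toy stand-in for a real framework, not Hadoop's API: `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, and the record numbers used as input keys are an assumption. The shuffle phase is simulated by grouping intermediate pairs in a dictionary.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def map_fn(key: int, value: Tuple[str, str, int]) -> Iterable[Tuple[str, int]]:
    """Emit one intermediate (country, amount) pair per input record."""
    _city, country, amount = value
    yield country, amount


def reduce_fn(key: str, values: List[int]) -> Tuple[str, float]:
    """Merge all values of one key into their average."""
    return key, sum(values) / len(values)


def run_mapreduce(records: Iterable[Tuple[int, tuple]],
                  mapper: Callable, reducer: Callable) -> List[tuple]:
    """Run map, group by key (the 'shuffle'), then reduce, all in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in mapper(key, value):
            groups[inter_key].append(inter_value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


data = [("NewYork", "US", 10), ("LosAngeles", "US", 40), ("London", "GB", 20),
        ("Berlin", "DE", 60), ("Glasgow", "GB", 10), ("Munich", "DE", 30)]
records = list(enumerate(data))  # key = record number (an assumption)

result = run_mapreduce(records, map_fn, reduce_fn)
print(result)  # [('DE', 45.0), ('GB', 15.0), ('US', 25.0)]
```

Only `map_fn` and `reduce_fn` are application specific; in a real framework the body of `run_mapreduce` is what gets parallelized and made fault tolerant.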
MapReduce
► MapReduce Programming Model
  For processing of huge amounts of data
  Massive parallelization of computing tasks
  Applicable to many real world applications
  MapReduce programs are easy to implement
► MapReduce Engine
  Environment to run MapReduce programs
  Distributes computing tasks
  Errors are transparently handled
  Very scalable architecture
  Examples: Google MapReduce & Apache Hadoop

Hadoop
► What is Hadoop?
  Free software framework for data intensive applications
  Enables distributed processing of vast amounts of data on cloud computing architectures
  Supports clouds with 1000+ nodes
  Two components: 1) Hadoop Distributed File System (HDFS) 2) MapReduce Engine
► Where can you get Hadoop?
  Top-level Apache Project: http://hadoop.apache.org/core/

Hadoop - HDFS
► Inspired by Google File System
► Distributed storage for large files
► Files are split up into multiple parts (default size 64 MB)
► Parts are spread over the HDFS nodes
► Each part is replicated (default 3 times)

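To make the splitting and replication concrete, here is a small Python sketch. It uses the two defaults from the slide (64 MB parts, 3 replicas), but everything else is an assumption: the function names are hypothetical and the round-robin placement is a deliberately simple toy policy, not the placement strategy HDFS actually uses.

```python
import math
from typing import Dict, List

BLOCK_SIZE_MB = 64   # default part size named on the slide
REPLICATION = 3      # default replication factor named on the slide


def split_into_blocks(file_size_mb: int,
                      block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the sizes (in MB) of the parts a file is split into."""
    count = math.ceil(file_size_mb / block_size_mb)
    return [min(block_size_mb, file_size_mb - i * block_size_mb)
            for i in range(count)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
            for block in range(num_blocks)}


blocks = split_into_blocks(200)  # a hypothetical 200 MB file
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
print(blocks)     # [64, 64, 64, 8]
print(placement)
```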
Hadoop – MapReduce Engine
► Runs MapReduce programs
► Libraries for Java and C++
► Assigns Map and Reduce tasks to computing nodes
► Reduction of data transfer volume
  Tasks are assigned to nodes holding the data
► Node failures are transparently handled
  Tasks are restarted on a node holding a replica of the data
[Figure: TaskManager assigning MAP tasks to nodes; one task FAILS and is restarted]

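The two scheduling ideas on this slide, data locality and restart on a replica, can be sketched as follows. This is a simplified illustration, not Hadoop's scheduler: the function names, the `placement` mapping from block to replica-holding nodes, and the `failing` set that simulates crashing nodes are all hypothetical.

```python
from typing import Dict, List, Set


def assign_task(block: int, placement: Dict[int, List[str]],
                alive: Set[str]) -> str:
    """Pick a live node that already holds a replica of the block (data locality)."""
    for node in placement[block]:
        if node in alive:
            return node
    raise RuntimeError(f"no live replica for block {block}")


def run_map_tasks(placement: Dict[int, List[str]], alive: Set[str],
                  failing: Set[str]) -> Dict[int, str]:
    """Run one map task per block; a task on a failing node is restarted elsewhere."""
    alive = set(alive)  # work on a copy so the caller's set is untouched
    schedule: Dict[int, str] = {}
    for block in sorted(placement):
        while True:
            node = assign_task(block, placement, alive)
            if node in failing:
                alive.discard(node)  # node crashed: exclude it and retry
                continue
            schedule[block] = node
            break
    return schedule


placement = {0: ["n1", "n2", "n3"], 1: ["n2", "n3", "n4"]}
schedule = run_map_tasks(placement, alive={"n1", "n2", "n3", "n4"}, failing={"n1"})
print(schedule)  # {0: 'n2', 1: 'n2'}
```

Because every block has several replicas, the retry never has to move data: the restarted task again runs on a node that holds the block locally.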
Hadoop
► Who uses Hadoop?
  Amazon A9.com (Search Index Building, Analytics)
  Facebook (Logfile Analysis)
  Google & IBM (University Initiative to Address Internet-Scale Computing Challenges)
  Yahoo! (Crawling, Indexing, Searching)
  Yahoo! Hadoop Cluster runs Terabyte Sort Benchmark in 209 seconds
  And many others… (see http://wiki.apache.org/hadoop/PoweredBy)
► Hadoop resembles Google's MapReduce Framework
  J. Dean, S. Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

The Pig Project
► A platform for analyzing large data sets
► Pig consists of two parts:
  Pig Latin: A Data Processing Language
  Pig Infrastructure (Grunt): An Evaluator for Pig Latin programs
► Where can you get Pig?
  Apache Incubator Project: http://incubator.apache.org/pig
► Alternatives:
  HIVE (Facebook)
  JAQL (IBM Research)

Pig Latin Data Processing Language
► Pig Latin is imperative (whereas SQL is declarative)
  Step-by-step programming approach
  Pig Latin queries are easy to write and understand
► Fully nestable data model
  Atomic values, tuples, bags, maps
► Operators of two flavors:
  Relational style operators (filter, join, etc.)
  Functional-programming style operators (map, reduce)
► Easy to extend by user functions
► Example: "Find the top 10 most visited pages in each category"
    visits      = load '/data/visits' as (user, url, time);
    gVisits     = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo     = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls     = foreach gCategories generate top(visitCounts, 10);
    store topUrls into '/data/topUrls';
Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing", Talk, SIGMOD 2008

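For readers who do not know Pig Latin, the same step-by-step data flow can be mirrored in plain Python. This is a single-machine sketch for illustration only; the function name, the sample visits and the sample URL metadata are made up, and ties are broken alphabetically by URL as an arbitrary choice.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def top_urls_per_category(visits: List[Tuple[str, str, int]],
                          url_info: List[Tuple[str, str, float]],
                          n: int = 10) -> Dict[str, List[Tuple[str, int]]]:
    """Return the n most visited URLs of each category with their visit counts."""
    # group visits by url; foreach group generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url (inner join)
    category_of = {url: category for url, category, _p_rank in url_info}
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            # group visitCounts by category
            by_category[category_of[url]].append((url, count))
    # foreach category generate top(visitCounts, n)
    return {category: sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]
            for category, pairs in by_category.items()}


# Hypothetical sample data: (user, url, time) and (url, category, pRank).
visits = [("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.com", 3),
          ("cat", "espn.com", 4), ("bob", "espn.com", 5), ("cat", "espn.com", 6),
          ("dan", "nba.com", 7)]
url_info = [("cnn.com", "news", 0.9), ("bbc.com", "news", 0.8),
            ("espn.com", "sports", 0.7), ("nba.com", "sports", 0.6)]

top = top_urls_per_category(visits, url_info, n=1)
print(top)  # {'news': [('cnn.com', 2)], 'sports': [('espn.com', 3)]}
```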
Pig Infrastructure
► Currently two modes:
  Local: Pig Latin programs are locally evaluated (run in a single JVM)
  MapReduce: Pig Latin programs are compiled to sequences of MapReduce programs and executed (e.g. on Hadoop)
► Example:
  Map 1 / Reduce 1: LOAD visits, GROUP BY url, FOREACH url GENERATE count
  Map 2 / Reduce 2: LOAD url info, JOIN on url
  Map 3 / Reduce 3: GROUP BY category, FOREACH category GENERATE top10(urls)
Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing", Talk, SIGMOD 2008

UIMA

UIMA
[Figure: Pre-Processing, Analysis Phase, Post-Processing]

UIMA
► Annotators for Part of Speech detection, Named-Entity detection and Relation detection

The Stratosphere Project
► Many BI queries exceed the capabilities of today's BI systems
  "Who are the biggest players in the Linux market?"
  "Which insurance policy customers are at risk of being hit by a current storm?"
► The Internet offers valuable information
  Enterprise announcements and public business reports
  User generated content: Blogs, Wikis, Reviews, Comments, etc.
  News websites and feeds
► Next Generation Business Intelligence (NGBI) requires joint analysis of internet and enterprise data
  Internet, Intranet, Data Warehouse and Local Data must be processed
Goal of the Stratosphere Project is to build an NGBI System on a Cloud Computing Platform

Stratosphere - Architecture
► Data sources: Intranet, Internet, Data Warehouse
► Further data sources: Office documents (spreadsheets), Email
► Computing Cloud (HADOOP): Crawl, Scan, Cache, Filter, Extract (UIMA), Retrieve, Extract, Process, Join, Group
► Query side: Query UI, Query Translation, Query Plan, Result

Stratosphere – Research Challenges
► Definition of an algebra for expressing NGBI queries
  Includes: traditional database operators, data retrieval operators, information extraction operators, and information integration operators
► Implementation of NGBI query operators
  Requirements: highly scalable, robust, self-tuning
  Leveraging Hadoop and MapReduce frameworks
► Implementation of a cloud computing monitoring infrastructure
  Enabling self-tuning NGBI operators

Related Project: DBLife

Related Projects: Avatar Email Search

Situational Business Intelligence Example
Which insurance policy customers are at risk of being hit by a current storm?
Severe weather – Meet Pete, an insurance agent in Louisiana.
1. He sees a news report of a severe storm. What is the company's risk?
2. Pete has an Excel spreadsheet with all policy holders he manages, which he filters to select only properties insured for more than $250,000.
3. Pete searches for a website that can predict flood levels for his area and finds www.floodlevels.com, a mashup which predicts the flood level for a geographic area based on USGS flood level forecasts, and GIS databases from
4. Pete connects his spreadsheet to www.floodlevels.com
5. He then forwards a risk summary to executives.
Data annotations: (Zipcode); (HUC = Hydrological Unit Code) http://water.usgs.gov/waterwatch/; (Geocode = Latitude/Longitude) edc.usgs.gov/; (Geocode = Latitude/Longitude) http://www.dotd.florida.gov/

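Pete's ad-hoc analysis boils down to a filter on the spreadsheet followed by a join with the flood predictions. The Python sketch below illustrates that idea under stated assumptions: the field names, the sample policies, the per-zipcode flood levels and the 1.0 m risk threshold are all invented for illustration; only the $250,000 filter comes from the example itself.

```python
from typing import Dict, List


def risk_summary(policies: List[dict], flood_level_by_zip: Dict[str, float],
                 min_value: int = 250_000,
                 flood_threshold_m: float = 1.0) -> dict:
    """Join policies with predicted flood levels by zipcode and sum the exposure."""
    at_risk = [p for p in policies
               if p["insured_value"] > min_value                        # step 2: filter
               and flood_level_by_zip.get(p["zipcode"], 0.0) >= flood_threshold_m]  # step 4: join
    return {
        "policies_at_risk": [p["holder"] for p in at_risk],
        "total_exposure": sum(p["insured_value"] for p in at_risk),
    }


# Hypothetical spreadsheet rows and flood predictions (meters) per zipcode.
policies = [
    {"holder": "Adams", "zipcode": "70112", "insured_value": 300_000},
    {"holder": "Baker", "zipcode": "70112", "insured_value": 200_000},
    {"holder": "Clark", "zipcode": "70801", "insured_value": 500_000},
    {"holder": "Davis", "zipcode": "71101", "insured_value": 400_000},
]
flood_level_by_zip = {"70112": 2.5, "70801": 1.2, "71101": 0.3}

summary = risk_summary(policies, flood_level_by_zip)
print(summary)
```

In the scenario itself the join key would first have to be standardized, since the sources use different location codes (zipcode, HUC, geocode); the sketch assumes that step has already happened.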
Flood Risk Assessment Mashup
[Figure labels: Search, Screen Scraping, Standardize, Standardization, Mashup, Report, Lineage; data sources: policy XLS, www.floodlevels.com, water.usgs.gov, edc.usgs.gov, dotd.louisiana.gov]

Situational BI Evolution
[Figure labels: Mashups, Portals, DataMart, DataWarehouse, SCA; Line of Business, IT Dept; New Initiatives, Proof of Concept; Best Effort, AdHoc; Mission Critical; Limited Time, Immediate; Lots of Time]

Conclusion
► BI over text will tap into a huge set of additional information for BI
► The next generation of business intelligence applications will utilize technologies for scalable processing and service computing to integrate data sources from warehouses, intranet, and internet
► Situational BI will create ad-hoc applications to answer complex questions over integrated data sources
► Open research problems:
  Which is the right extraction service?
  "How much" schema can be generated?
  "How much" optimization does the user have to add?
  How to optimize UIMA-based extraction plans on a Hadoop cloud?
  What is a suitable query language over Hadoop?
  Data cleansing, completion, and duplicate detection of extracted data?
  Data explanation: lineage, but also: why do I NOT see that data tuple?

Acknowledgements
► Discussions at IBM Research and IBM SWG: Anant Jhingran, Hamid Pirahesh, Kevin Beyer, David Simmen, Mehmet Altinel, et al.
► My team at TU Berlin: Alexander Löser, Fabian Hüske, Stephan Ewen, Helko Glathe

Thank You
Gracias (Spanish), Danke (German), Grazie (Italian), Merci (French), Obrigado (Brazilian Portuguese)