732A54 / TDDE31 Big Data Analytics, 6 hp
http://www.ida.liu.se/~732A54
http://www.ida.liu.se/~TDDE31

Teachers
- Examiner: Patrick Lambrix (B: 2B:474)
- Lectures: Patrick Lambrix, Christoph Kessler, Jose Pena, Olaf Hartig
- Labs: Huanyu Li, Jose Pena
- Director of studies: Patrick Lambrix

Course literature
- Articles (on web/handout)
- Lab descriptions (on web)

Data and Data Storage

Data and Data Storage
Database / Data source
- One (of several) ways to store data in electronic format
- Used in everyday life: bank, hotel reservations, library search, shopping

Databases / Data sources
- Database management system (DBMS): a collection of programs to create and maintain a database
- Database system = database + DBMS

Databases / Data sources
[Figure: database system architecture. Labels: Information, Model, Queries, Answer, Database system, Database management system (Processing of queries/updates; Access to stored data), Physical database.]

What information is stored?
- Model the information
  - Entity-Relationship model (ER)
  - Unified Modeling Language (UML)

What information is stored? - ER
- entities and attributes
- entity types
- key attributes
- relationships
- cardinality constraints
- EER: sub-types

DEFINITION  Homo sapiens adrenergic, beta-1-, receptor
ACCESSION   NM_000684
SOURCE ORGANISM  human
REFERENCE 1
  AUTHORS  Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka
  TITLE    Cloning of the cDNA for the human beta 1-adrenergic receptor
REFERENCE 2
  AUTHORS  Frielle, Kobilka, Lefkowitz, Caron
  TITLE    Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes

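Entity-relationship
[Figure: ER diagram. Entity type PROTEIN (attributes: protein-id, accession, definition, source) is connected through the m:n relationship Reference to entity type ARTICLE (attributes: article-id, title, author).]
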
How is the information stored? (high level)
How is the information accessed? (user level)
- Text (IR)
- Semi-structured data
- Data models (DB)
- Rules + Facts (KB)
[Figure: axis alongside the list, labeled "structure" and "precision"]

IR - formal characterization
Information retrieval model: (D, Q, F, R)
- D is a set of document representations
- Q is a set of queries
- F is a framework for modeling document representations, queries and their relationships
- R associates a real number to document-query pairs (ranking)

IR - Boolean model

        adrenergic  cloning  receptor
Doc 1   yes         yes      no        --> (1 1 0)
Doc 2   no          yes      no        --> (0 1 0)

Q1: cloning and (adrenergic or receptor) --> (1 1 0) or (1 1 1) or (0 1 1)
    Result: Doc 1
Q2: cloning and not adrenergic --> (0 1 0) or (0 1 1)
    Result: Doc 2

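The Boolean model on this slide can be tried out with a small sketch. The document vectors and the term order (adrenergic, cloning, receptor) come from the slide; the helper names `has` and `boolean_search` are illustrative assumptions, not part of the course material.

```python
# Term order follows the slide: (adrenergic, cloning, receptor).
TERMS = ("adrenergic", "cloning", "receptor")

# Binary document vectors as given on the slide.
DOCS = {
    "Doc 1": (1, 1, 0),
    "Doc 2": (0, 1, 0),
}


def has(vector, term):
    """Return True if the binary vector marks the given term as present."""
    return bool(vector[TERMS.index(term)])


def boolean_search(predicate):
    """Return the names of all documents whose vector satisfies the predicate."""
    return [name for name, vector in DOCS.items() if predicate(vector)]


# Q1: cloning and (adrenergic or receptor)
q1 = boolean_search(
    lambda d: has(d, "cloning") and (has(d, "adrenergic") or has(d, "receptor"))
)

# Q2: cloning and not adrenergic
q2 = boolean_search(lambda d: has(d, "cloning") and not has(d, "adrenergic"))

print(q1)  # ['Doc 1']
print(q2)  # ['Doc 2']
```

A document either matches a Boolean query or it does not, so the result is an unranked set.
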
IR - Vector model (simplified)
[Figure: term space with axes adrenergic, cloning, receptor, showing the vectors Doc 1 (1, 1, 0), Doc 2 (0, 1, 0) and Q (1, 1, 1)]
$\mathrm{sim}(d, q) = \frac{d \cdot q}{|d| \times |q|}$

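As a minimal sketch of the similarity formula, the following computes the cosine similarity between the query and each document vector from the slide. The function name `cosine_similarity` is an illustrative choice, and returning 0.0 for a zero vector is an assumption the slide does not address.

```python
import math


def cosine_similarity(d, q):
    """sim(d, q) = (d . q) / (|d| x |q|); returns 0.0 if either vector is all zeros."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


# Vectors from the slide, term order (adrenergic, cloning, receptor).
doc1 = (1, 1, 0)
doc2 = (0, 1, 0)
query = (1, 1, 1)

sim1 = cosine_similarity(doc1, query)  # 2 / (sqrt(2) * sqrt(3))
sim2 = cosine_similarity(doc2, query)  # 1 / (1 * sqrt(3))

print(round(sim1, 3), round(sim2, 3))  # 0.816 0.577
```

Unlike the Boolean model, this gives a ranking: Doc 1 scores higher than Doc 2 for the query Q.
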
Semi-structured data
[Figure: labeled graph rooted at Protein DB. A PROTEIN node has edges ACCESSION ("NM_000684"), DEFINITION ("Homo sapiens adrenergic, beta-1-, receptor"), SOURCE ("human") and two REFERENCE edges. Each reference has a TITLE ("Cloning of …", "Human beta-1 …") and AUTHOR edges (Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka).]

Semi-structured data - Queries

select source
from PROTEINDB.protein P
where P.accession = "NM_000684";

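To make the query concrete, here is a rough sketch that models the semi-structured protein data as nested Python dictionaries and lists and evaluates the same selection. The nesting and the key names are assumptions chosen to mirror the labels in the example record; no particular semi-structured query engine is implied.

```python
# Hypothetical in-memory representation of PROTEINDB as nested dicts and lists.
protein_db = {
    "protein": [
        {
            "accession": "NM_000684",
            "definition": "Homo sapiens adrenergic, beta-1-, receptor",
            "source": "human",
            "reference": [
                {
                    "author": ["Frielle", "Collins", "Daniel",
                               "Caron", "Lefkowitz", "Kobilka"],
                    "title": "Cloning of the cDNA for the human "
                             "beta 1-adrenergic receptor",
                },
                {
                    "author": ["Frielle", "Kobilka", "Lefkowitz", "Caron"],
                    "title": "Human beta 1- and beta 2-adrenergic receptors: "
                             "structurally and functionally related receptors "
                             "derived from distinct genes",
                },
            ],
        }
    ]
}

# select source from PROTEINDB.protein P where P.accession = "NM_000684"
# .get() is used because semi-structured objects need not carry every attribute.
result = [
    p.get("source")
    for p in protein_db["protein"]
    if p.get("accession") == "NM_000684"
]

print(result)  # ['human']
```
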
Relational databases

PROTEIN
PROTEIN-ID | ACCESSION | DEFINITION | SOURCE
1 | NM_000684 | Homo sapiens adrenergic, beta-1-, receptor | human

REFERENCE
PROTEIN-ID | ARTICLE-ID
1 | 1
1 | 2

ARTICLE-AUTHOR
ARTICLE-ID | AUTHOR
1 | Frielle
1 | Collins
1 | Daniel
1 | Caron
1 | Lefkowitz
1 | Kobilka
2 | Frielle
2 | Kobilka
2 | Lefkowitz
2 | Caron

ARTICLE-TITLE
ARTICLE-ID | TITLE
1 | Cloning of the cDNA for the human beta 1-adrenergic receptor
2 | Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes

Relational databases - SQL

select source
from protein
where accession = 'NM_000684';

PROTEIN
PROTEIN-ID | ACCESSION | DEFINITION | SOURCE
1 | NM_000684 | Homo sapiens adrenergic, beta-1-, receptor | human

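The query can be run against a throwaway database. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names are assumptions adapted from the slides (underscores instead of hyphens, and `protein_reference` in place of REFERENCE), and the second query is an added illustration of a join across the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema adapted from the slide tables; names are illustrative.
conn.executescript("""
    CREATE TABLE protein (
        protein_id INTEGER PRIMARY KEY,
        accession  TEXT,
        definition TEXT,
        source     TEXT
    );
    CREATE TABLE protein_reference (protein_id INTEGER, article_id INTEGER);
    CREATE TABLE article_title (article_id INTEGER PRIMARY KEY, title TEXT);
""")

conn.execute(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    (1, "NM_000684", "Homo sapiens adrenergic, beta-1-, receptor", "human"),
)
conn.executemany("INSERT INTO protein_reference VALUES (?, ?)", [(1, 1), (1, 2)])
conn.executemany(
    "INSERT INTO article_title VALUES (?, ?)",
    [
        (1, "Cloning of the cDNA for the human beta 1-adrenergic receptor"),
        (2, "Human beta 1- and beta 2-adrenergic receptors: structurally and "
            "functionally related receptors derived from distinct genes"),
    ],
)

# The query from the slide, with the accession passed as a string parameter.
sources = [
    row[0]
    for row in conn.execute(
        "SELECT source FROM protein WHERE accession = ?", ("NM_000684",)
    )
]

# Added illustration: titles of the articles that reference this protein.
titles = [
    row[0]
    for row in conn.execute(
        """
        SELECT t.title
        FROM protein p
        JOIN protein_reference r ON r.protein_id = p.protein_id
        JOIN article_title t ON t.article_id = r.article_id
        WHERE p.accession = ?
        ORDER BY t.article_id
        """,
        ("NM_000684",),
    )
]

conn.close()
print(sources)      # ['human']
print(len(titles))  # 2
```
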
Evolution of Database Technology
- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - Advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, temporal, multimedia, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems
  - NoSQL databases

Knowledge bases

(F) source(NM_000684, Human)
(R) source(P?, Human) => source(P?, Mammal)
(R) source(P?, Mammal) => source(P?, Vertebrate)

Q: ?- source(NM_000684, Vertebrate)
A: yes

Q: ?- source(x?, Mammal)
A: x? = NM_000684

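One way to see how the answers follow from the fact and the two rules is a small forward-chaining sketch. The tuple encoding of facts and the `forward_chain` helper are assumptions made for illustration; real knowledge base systems use more general inference engines.

```python
# Fact from the slide: source(NM_000684, Human)
facts = {("source", "NM_000684", "Human")}

# Rules from the slide, encoded as (premise, conclusion):
# source(P?, premise) => source(P?, conclusion)
rules = [("Human", "Mammal"), ("Mammal", "Vertebrate")]


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, subject, value in list(derived):
                new_fact = (predicate, subject, conclusion)
                if (predicate == "source" and value == premise
                        and new_fact not in derived):
                    derived.add(new_fact)
                    changed = True
    return derived


kb = forward_chain(facts, rules)

# Q: ?- source(NM_000684, Vertebrate)
answer_1 = ("source", "NM_000684", "Vertebrate") in kb

# Q: ?- source(x?, Mammal)
answer_2 = sorted(s for (p, s, v) in kb if p == "source" and v == "Mammal")

print(answer_1)  # True
print(answer_2)  # ['NM_000684']
```
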
Interested in more?
- 732A57/TDDD12/TDDD37/TDDD46/TDDD74/TDDD81 Database Technology (relational databases)
- TDDD43 Advanced data models and databases (IR, semi-structured data, DB, KB)

Analytics

Analytics
- Discovery, interpretation and communication of meaningful patterns in data

Analytics - IBM
- What is happening? Descriptive: Discovery and explanation
- Why did it happen? Diagnostic: Reporting, analysis, content analytics
- What could happen? Predictive analytics and modeling
- What action should I take? Prescriptive: Decision management
- What did I learn, what is best? Cognitive

Analytics - Oracle
- Classification
- Regression
- Clustering
- Attribute importance
- Anomaly detection
- Feature extraction and creation
- Market basket analysis

Why Analytics?
- The Explosive Growth of Data
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: Remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!

Ex.: Market Analysis and Management
- Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis: Find associations/co-relations between product sales, and predict based on such associations
- Customer profiling: What types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identify the best products for different groups of customers
  - Predict what factors will attract new customers
- Provision of summary information
  - Multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Ex.: Fraud Detection & Mining Unusual Patterns
- Approaches: Clustering & model construction for frauds, outlier analysis
- Applications: Health care, retail, credit card service, telecomm.
  - Auto insurance: ring of collisions
  - Money laundering: suspicious monetary transactions
  - Medical insurance
    - Professional patients, ring of doctors, and ring of references
    - Unnecessary or correlated screening tests
  - Telecommunications: phone-call fraud
    - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  - Anti-terrorism

Knowledge Discovery (KDD) Process
[Figure: KDD pipeline, from bottom to top: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and transformation -> Task-relevant Data -> Data Mining -> Pattern evaluation and presentation]

Data Mining – what kinds of patterns?
- Concept/class description:
  - Characterization: summarizing the data of the class under study in general terms
    - E.g. Characteristics of customers spending more than 10000 SEK per year
  - Discrimination: comparing target class with other (contrasting) classes
    - E.g. Compare the characteristics of products that had a sales increase to products that had a sales decrease last year

Data Mining – what kinds of patterns?
- Frequent patterns, association, correlations
  - Frequent itemset
  - Frequent sequential pattern
  - Frequent structured pattern
  - E.g. buy(X, "Diaper") => buy(X, "Beer") [support=0.5%, confidence=75%]
    - confidence: if X buys a diaper, then there is a 75% chance that X buys beer
    - support: of all transactions under consideration, 0.5% showed that diaper and beer were bought together
  - E.g. age(X, "20..29") and income(X, "20k..29k") => buys(X, "cd-player") [support=2%, confidence=60%]

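The two measures can be computed directly from a list of transactions. The sketch below uses a small made-up transaction set (hypothetical data, not the figures quoted on the slide) and illustrative function names `support` and `confidence`.

```python
# Hypothetical market-basket transactions, invented for illustration only.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"bread"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)


# Rule: buy(X, "diaper") => buy(X, "beer")
rule_support = support({"diaper", "beer"})          # 3 of 8 transactions
rule_confidence = confidence({"diaper"}, {"beer"})  # 3 of the 5 diaper transactions

print(rule_support)     # 0.375
print(rule_confidence)  # 0.6
```

Support is measured against all transactions, while confidence is measured only against the transactions that satisfy the left-hand side of the rule.
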
Data Mining – what kinds of patterns?
- Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction. The derived model is based on analyzing training data, that is, data whose class labels are known.
    - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown or missing numerical values

Data Mining – what kinds of patterns?
- Cluster analysis
  - Class label is unknown: Group data to form new classes, e.g., cluster customers to find target groups for marketing
  - Maximizing intra-class similarity & minimizing interclass similarity
- Outlier analysis
  - Outlier: Data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation

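As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is only one simple technique among many and is not necessarily the one used in the course; the call durations and the threshold of 2.0 are hypothetical.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical phone-call durations in minutes; one call deviates from the norm.
durations = [3, 4, 5, 4, 3, 5, 4, 60]

outliers = zscore_outliers(durations)
print(outliers)  # [60]
```

Whether a flagged value is noise or a meaningful exception (for example, fraud) still has to be decided from domain knowledge.
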
Interested in more?
- 732A95/TDDE01 Introduction to machine learning
- 732A75/TDDD41 Advanced data mining / Data mining – clustering and association analysis

Big Data

Big Data
- Data so large that it becomes difficult to process using a 'traditional' system

Big Data – 3 Vs
- Volume
  - size of the data

Volume - examples
- Facebook processes 500 TB per day
- Walmart handles 1 million customer transactions per hour
- Airbus generates 640 TB in one flight (10 TB per 30 minutes)
- 72 hours of video are uploaded to YouTube every minute
- SMS, e-mail, internet, social media

https://y2socialcomputing.files.wordpress.com/2012/06/social-media-visual-last-blog-post-what-happens-in-an-internet-minute-infographic.jpg

Big Data – 3 Vs
- Volume
  - size of the data
- Variety
  - type and nature of the data
    - text, semi-structured data, databases, knowledge bases

Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

Linked open data of US government
http://catalog.data.gov/
Format (# Datasets)
- HTML (27005)
- XML (24077)
- PDF (19628)
- CSV (10058)
- JSON (8948)
- RDF (6153)
- JPG (5419)
- WMS (5019)
- Excel (3389)
- WFS (2781)

Big Data – 3 Vs
- Volume
  - size of the data
- Variety
  - type and nature of the data
- Velocity
  - speed of generation and processing of data

Velocity - examples
- Traffic data
- Financial market
- Social networks

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Big Data – other Vs
- Variability
  - inconsistency of the data
- Veracity
  - quality of the data
- Value
  - useful analysis results
- …

BDA system architecture
[Figure: layered architecture, from top to bottom: Specialized services for domain A and Specialized services for domain B; Big Data Services Layer; Knowledge Management Layer; Data Storage and Management Layer]

BDA system architecture
Data Storage and Management Layer
- Large amounts of data, distributed environment
- Unstructured and semi-structured data
- Not necessarily a schema
- Heterogeneous
- Streams
- Varying quality

Data Storage and management – this course
- Data storage:
  - NoSQL databases
  - OLTP vs OLAP
  - Horizontal scalability
  - Consistency, availability, partition tolerance
- Data management
  - Hadoop
  - Data management systems

BDA system architecture
Knowledge Management Layer
- Semantic technologies
- Integration
- Knowledge acquisition

Knowledge management – this course
- Not a focus topic in this course
- For semantic and integration approaches see TDDD43

BDA system architecture
Big Data Services Layer
- Analytics services for Big Data

Big Data Services – this course
- Big data versions of analytics/data mining algorithms

[Figure: the three areas combined in the course: Databases, Parallel programming, Machine learning]

Course overview
- Review (ONLY 732A54):
  - Databases (lectures + labs)
- Databases for Big Data (lectures + lab)
- Parallel algorithms for processing Big Data (lectures + lab + exercise session)
- Machine Learning for Big Data (lectures + lab)
- Visit to National Supercomputer Centre

Info
- Results reported in connection to exams
- Info about handing in labs on web; strong recommendation to hand in as soon as possible
- Sign up for labs via web (in pairs)

Info
- ONLY 732A54: Relational database labs require a special database account: make sure you are registered for the course
- BDA labs require special access to NSC resources: fill out forms

Info
- Lab deadlines:
  - Final deadlines in connection to the exams; no reporting between exams
  - HARD DEADLINE: May exam (No guarantee NSC resources available after April.)

Examination
- Written exam
- Labs

What if I already took …? What if I also take …?
- ONLY 732A54: TDDD37/732A57 Database technology
  - RDB labs 1-2 in one of the courses, results registered for both

Changes w.r.t. last year

My own interest and research
- Modeling of data
  - Ontologies
- Ontology engineering
  - Ontology alignment (Winner Anatomy track OAEI 2008 / Organizer OAEI tracks since 2013)
  - Ontology debugging and completion (Founder and organizer WoDOOM/CoDeS 2012-2016)
- Ontologies and databases for Big Data
  - Swedish Veterinary Agency, animal health surveillance
  - Swedish e-Science Centre, materials design
  - Swedish Food Agency, toxics
  - EU VALCRI, crime scene analysis

My own interest and research
- Sports Analytics
  - Player, team performance
  - Injuries
- Former work: knowledge representation, data integration, knowledge-based information retrieval, object-centered databases
- http://www.ida.liu.se/~patla00/research.shtml

https://www.youtube.com/watch?v=LrNlZ7-SMPk