Machine Learning, Data Mining, and Knowledge Discovery: An Introduction
Gregory Piatetsky-Shapiro, with additional notes by OS

KDnuggets.com/Education: KDnuggets teaching modules
This site contains a set of teaching modules for a one-semester introductory course on Data Mining, suitable for advanced undergraduates or first-year graduate students. The teaching modules were created in 2004 (modified in 2006) by
§ Dr. Gregory Piatetsky-Shapiro, KDnuggets
§ Prof. Gary Parker, Connecticut College
This project was funded by a grant from the W. M. Keck Foundation, Los Angeles, CA, and the Howard Hughes Medical Institute, Chevy Chase, MD, as part of the Connecticut College Series of Modules in Emerging Fields.

Course Outline
§ Machine Learning
  § input, representation, decision trees
  § Weka - machine learning workbench
§ Data Mining
  § associations, deviation detection, clustering, visualization
§ Case Studies
  § targeted marketing, genomic microarrays
§ Data Mining, Privacy and Security
§ Final Project: Microarray Data Mining Competition (data accompanying the original course) or analysis of any data set of interest

CTU Course Outline
§ Basic concepts in databases & data warehouses
§ Machine Learning
  § input, representation, decision trees
  § Weka - machine learning workbench
§ Data preprocessing and visualization. SumatraTT
§ Data Mining
  § associations, deviation detection, clustering, visualization, ILP
  § LispMiner
§ Text mining
§ Case Studies
  § genomic microarrays, banking application, …
§ Data Mining, Privacy and Security
§ Final Project:
  § DM competition,
  § individual projects, …

Practical Problems of Knowledge Discovery (Praktické problémy dobývání znalostí)
Zdeněk Kouba, Olga Štěpánková et al.
1. DM: introduction, description and methodology of the process, and motivating examples
2. Machine learning methods used: trees and their pruning, association rules
3. Data preparation and the use of SumatraTT (Lenka Nováková)
4. Data visualization and its use in DM (Lenka Nováková)
5. LispMiner (Jan Rauch or M. Šimůnek, VŠE)
6. Working with relational data, ILP (Filip Železný?)
7. DM in biomedical informatics (Jiří Kléma?)
8. Text mining
9. Summary of the procedures and tools used in 7 PRO (Petr Křemen?)
10. Applications of data mining in bank marketing (Petr Husták?)
11. ?? (Jan Kout)
Prerequisites: overview of basic concepts in statistics, databases and data warehouses

Recommended Reading
§ Petr Berka: Dobývání znalostí z databází, Academia 2003
§ F. Železný, J. Kléma, O. Štěpánková: Strojové učení v dobývání dat, in Mařík et al. (eds): Umělá inteligence (4), Academia 2003
§ M. Kubát: Strojové učení, in Mařík et al. (eds): Umělá inteligence (1), Academia 1993
§ Michael Berthold, David J. Hand: Intelligent Data Analysis, Springer 1999, 2003
§ Daniel T. Larose: Discovering Knowledge in Data, Wiley 2005
§ Daniel T. Larose: Data Mining: Methods and Models, Wiley 2006
§ Oded Maimon, Lior Rokach (eds): The Data Mining and Knowledge Discovery Handbook, Springer 2005

Lesson Outline
§ Introduction: Data Flood
§ Data Mining Application Examples
§ Data Mining & Knowledge Discovery
§ Data Mining Tasks

Trends leading to Data Flood
§ More data is generated:
  § Bank, telecom, other business transactions ...
  § Scientific data: astronomy, biology, etc.
  § Web, text, and e-commerce

Big Data Examples
§ Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
  § storage and analysis are a big problem
§ AT&T handles billions of calls per day
  § so much data that it cannot all be stored; analysis has to be done "on the fly", on streaming data

Largest databases in 2003
§ Commercial databases:
  § Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T ~26 TB
§ Web
  § Alexa internet archive: 7 years of data, 500 TB
  § Google searches 4+ billion pages, many hundreds of TB
  § IBM WebFountain, 160 TB (2003)
  § Internet Archive (www.archive.org), ~300 TB
Note: 10^12 has the prefix Tera; it is called a trillion in English (short scale) and "bilion" in Czech.

Data Flood?
Prefix   Multiplier
Giga     10^9
Tera     10^12
Peta     10^15
Exa      10^18
Zetta    10^21
Yotta    10^24
§ The U.S. Library of Congress Web Capture team: "as of May 2008, the Library has collected more than 82.6 terabytes of data".
§ Ancestry.com claims approximately 600 terabytes of genealogical data with the inclusion of US Census data from 1790 to 1930.
§ In 1993 total Internet traffic was around 100 terabytes for the year. As of June 2008, Cisco Systems estimated Internet traffic at 160 terabytes per second (which equals about 5 Zettabytes for the year).
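The conversion from a per-second rate to a yearly total can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes decimal (SI) prefixes, a constant rate, and a 365-day year, and the names `PREFIX` and `yearly_volume_bytes` are introduced here for illustration.

```python
# Decimal (SI) prefixes from the slide, as powers of ten.
PREFIX = {
    "giga": 10**9,
    "tera": 10**12,
    "peta": 10**15,
    "exa": 10**18,
    "zetta": 10**21,
    "yotta": 10**24,
}

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def yearly_volume_bytes(rate_tb_per_second):
    """Total bytes transferred in one year at a constant rate given in TB/s."""
    return rate_tb_per_second * PREFIX["tera"] * SECONDS_PER_YEAR


total = yearly_volume_bytes(160)  # the 160 TB/s estimate quoted on the slide
print(round(total / PREFIX["zetta"], 2))  # yearly volume in zettabytes
```

Under these assumptions the result comes out at roughly 5 zettabytes, in line with the figure on the slide.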

From terabytes to exabytes to …
§ UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
  § www.sims.berkeley.edu/research/projects/how-much-info-2003/
§ US produces ~40% of new stored data worldwide
§ 2006 estimate: 161 exabytes (IDC study)
  § www.usatoday.com/tech/news/2007-03-05-data_N.htm
§ 2010 projection: 988 exabytes

Data Growth Rate
§ Twice as much information was created in 2002 as in 1999 (~30% growth rate)
§ Other growth rate estimates are even higher
§ Very little data will ever be looked at by a human
§ Knowledge Discovery is NEEDED to make sense and use of data.
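The link between "twice as much in three years" and a yearly growth rate is a compound-growth calculation. The sketch below assumes steady year-over-year compounding between 1999 and 2002; the function name `annual_growth_rate` is hypothetical.

```python
def annual_growth_rate(factor, years):
    """Constant yearly growth rate that multiplies a quantity by `factor` over `years` years."""
    return factor ** (1.0 / years) - 1.0


# Doubling between 1999 and 2002, i.e. a factor of 2 over 3 years.
rate = annual_growth_rate(2.0, 3)
print(f"{rate:.1%}")
```

With exact compounding the rate is about 26% per year, which the slide rounds to roughly 30%.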

Lesson Outline
§ Introduction: Data Flood
§ Data Mining Application Examples
§ Data Mining & Knowledge Discovery
§ Data Mining Tasks

Machine Learning / Data Mining Application areas
§ Science
  § astronomy, bioinformatics, drug discovery, …
§ Business
  § advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, eCommerce, targeted marketing, health care, …
§ Web:
  § search engines, advertising, web and text mining, …
§ Government
  § surveillance & anti-terror (?!), crime detection, profiling tax cheaters, …

DM applications in 2004 (in %)
§ 13%: Banking
§ 9%: Direct Marketing, Fraud Detection, Scientific data analysis
§ 8%: Bioinformatics
§ 7%: Insurance, Medical/Pharmaceutical Applications
§ 6%: eCommerce/Web, Telecommunications
§ 4%: Investments/Stocks, Manufacturing, Retail, Security
§ Below: Travel, Entertainment/News, …

Data Mining for Customer Modeling
§ Customer Tasks:
  § attrition prediction (customers leaving)
  § targeted marketing:
    § cross-sell, customer acquisition
  § credit-risk
  § fraud detection
§ Industries
  § banking, telecom, retail sales, …

Customer Attrition: Case Study
§ Situation: Attrition rate for mobile phone customers is around 25-30% a year!
§ Task:
  § Given customer information for the past N months, predict who is likely to attrite next month.
  § Also, estimate customer value and the most cost-effective offer to make to this customer.

Customer Attrition Results
§ Verizon Wireless built a customer data warehouse
§ Identified potential attriters
§ Developed multiple, regional models
§ Targeted customers with high propensity to accept the offer
§ Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)
(Reported in 2003)
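To see why a half-point drop in monthly attrition counts as a huge impact, the figures can be turned into customer counts. The sketch below treats the slide's bounds (2%, 1.5%, 30 M) as exact values, which is an assumption; rates are given in basis points to keep the arithmetic in integers, and `monthly_attriters` is a hypothetical helper.

```python
BASIS_POINTS = 10_000  # 1 basis point = 0.01%


def monthly_attriters(subscribers, rate_bp):
    """Customers lost per month at a monthly attrition rate given in basis points."""
    return subscribers * rate_bp // BASIS_POINTS


subscribers = 30_000_000                           # ">30 M subscribers", taken as exactly 30 M
before = monthly_attriters(subscribers, 200)       # 2% per month
after = monthly_attriters(subscribers, 150)        # 1.5% per month
kept_per_month = before - after
print(before, after, kept_per_month)  # prints: 600000 450000 150000
```

Under these assumptions the lower rate corresponds to about 150,000 fewer customers lost every month.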

Assessing Credit Risk: Case Study
§ Situation: A person applies for a loan
§ Task: Should the bank approve the loan?
§ Note: People who have the best credit don't need the loans, and people with the worst credit are not likely to repay. The bank's best customers are in the middle

Credit Risk - Results
§ Banks develop credit models using a variety of machine learning methods.
§ Mortgage and credit card proliferation are the result of being able to successfully predict whether a person is likely to default on a loan
§ Widely deployed in many countries

Successful e-commerce: Case Study
§ A person buys a book (product) at Amazon.com.
§ Task: Recommend other books (products) this person is likely to buy
§ Amazon does clustering based on books bought:
  § customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
§ Recommendation program is quite successful
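The "customers who bought X also bought Y" idea can be sketched with plain co-purchase counting. This is a deliberately simple stand-in chosen for illustration, not Amazon's actual method (the slide only says clustering based on books bought); the baskets and item names below are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, one set of items per customer.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "D"},
]


def co_purchase_counts(baskets):
    """Count, for every ordered pair of items, how many baskets contain both."""
    counts = Counter()
    for basket in baskets:
        for x, y in combinations(sorted(basket), 2):
            counts[(x, y)] += 1
            counts[(y, x)] += 1
    return counts


def also_bought(item, baskets, top_n=2):
    """Items most often bought together with `item`; ties are broken alphabetically."""
    counts = co_purchase_counts(baskets)
    scored = [(other, n) for (first, other), n in counts.items() if first == item]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in scored[:top_n]]


print(also_bought("A", baskets))
```

A real recommender would also have to normalize for item popularity and scale to millions of baskets; this sketch ignores both.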

Unsuccessful e-commerce case study (KDD-Cup 2000)
§ Data: clickstream and purchase data from Gazelle.com, legwear and legcare e-tailer
§ Q: Characterize visitors who spend more than $12 on an average order at the site
§ Dataset of 3,465 purchases, 1,831 customers
§ Very interesting analysis by Cup participants
  § thousands of hours: $X,000 (millions) of consulting
§ Total sales: $Y,000
§ Obituary: Gazelle.com out of business, Aug 2000

Genomic Microarrays: Case Study
Given microarray data for a number of samples (patients), can we
§ Accurately diagnose the disease?
§ Predict outcome for given treatment?
§ Recommend best treatment?

Example: ALL/AML data
§ 38 training cases, 34 test cases, ~7,000 genes
§ 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
§ Use training data to build a diagnostic model
Results on test data: 33/34 correct; the 1 error may be a mislabeled sample

Security and Fraud Detection Case Study
§ Credit Card Fraud Detection
§ Detection of Money Laundering
  § FAIS (US Treasury)
§ Securities Fraud
  § NASDAQ KDD system
§ Phone fraud
  § AT&T, Bell Atlantic, British Telecom/MCI
§ Bio-terrorism detection at Salt Lake Olympics 2002

Data Mining and Privacy
§ In 2006, the NSA (National Security Agency) was reported to be mining years of call information to identify terrorism networks
§ Social network analysis has the potential to find networks
§ Invasion of privacy: do you mind if your call information is in a government database?
§ What if the NSA program finds one real suspect for 1,000 false leads?

Some more examples from ČVUT
§ Results of the Sol-Eu-Net project (2000-2003), Sol-Eu-Net: DM and Decision Support for Business Competitiveness: A European Virtual Enterprise
§ Short-term prediction of local energy consumption
§ Early diagnosis of motor failures
…

Lesson Outline
§ Introduction: Data Flood
§ Data Mining Application Examples
§ Data Mining & Knowledge Discovery
§ Data Mining Tasks

Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
§ valid
§ novel
§ potentially useful
§ and ultimately understandable
patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996

Related Fields
[Figure: Data Mining and Knowledge Discovery and its related fields: Machine Learning, Visualization, Statistics, Databases]

Data Mining Development
• Similarity Measures • Hierarchical Clustering • IR Systems • Imprecise Queries • Textual Data • Web Search Engines
• Relational Data Model • SQL • Association Rule Algorithms • Data Warehousing • Scalability Techniques
• Bayes Theorem • Regression Analysis • EM Algorithm • K-Means Clustering • Time Series Analysis
• Algorithm Design Techniques • Algorithm Analysis • Data Structures
• Neural Networks • Decision Tree Algorithms

Statistics, Machine Learning and Data Mining
§ Statistics:
  § more theory-based
  § more focused on testing hypotheses
§ Machine learning
  § more heuristic
  § focused on improving performance of a learning agent
  § also looks at real-time learning and robotics, areas not part of data mining
§ Data Mining and Knowledge Discovery
  § integrates theory and heuristics
  § focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
§ Distinctions are fuzzy
(credit: witten&eibe)

Knowledge Discovery Process flow, according to CRISP-DM
See www.crisp-dm.org for more information
[Figure: CRISP-DM process flow diagram, including a Monitoring step]

Importance of the individual CRISP steps (in %): overall time requirements and success of the DM solution

Historical Note: Many Names of Data Mining
§ Data Fishing, Data Dredging: 1960-
  § used by statisticians (as a bad name)
§ Data Mining: 1990-
  § used by DB, business
  § in 2003: bad image because of TIA (Total Information Awareness: anti-terrorist project of the US Dept. of Defense)
§ Knowledge Discovery in Databases (1989-)
  § used by AI, Machine Learning community
§ also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently: Data Mining and Knowledge Discovery are used interchangeably

KDD Issues
§ Human Interaction
§ Multimedia Data
§ Overfitting
§ Missing Data
§ Outliers
§ Irrelevant Data
§ Interpretation
§ Noisy Data
§ Visualization
§ Changing Data
§ Large Datasets
§ Integration
§ High Dimensionality
§ Application

Lesson Outline
§ Introduction: Data Flood
§ Data Mining Application Examples
§ Data Mining & Knowledge Discovery
§ Data Mining Tasks

Major Data Mining Tasks
§ Classification: predicting an item class
§ Clustering: finding clusters in data
§ Associations: e.g. A & B & C occur frequently
§ Visualization: to facilitate human discovery
§ Summarization: describing a group
§ Deviation Detection: finding changes
§ Estimation: predicting a continuous value
§ Link Analysis: finding relationships
§ …
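The Associations task ("A & B & C occur frequently") comes down to counting how often groups of items appear together. Below is a minimal brute-force sketch on made-up transactions; the function name, the data, and the 0.6 support threshold are assumptions, and practical association-rule algorithms prune the search rather than enumerating every combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, one set of items each.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]


def frequent_itemsets(transactions, size, min_support):
    """Itemsets of the given size whose support (fraction of transactions) is at least min_support."""
    counts = Counter()
    for transaction in transactions:
        for itemset in combinations(sorted(transaction), size):
            counts[itemset] += 1
    n = len(transactions)
    return {itemset: count / n for itemset, count in counts.items() if count / n >= min_support}


print(frequent_itemsets(transactions, size=3, min_support=0.6))
```

Here only the itemset (A, B, C) reaches the threshold, since it appears in three of the five transactions.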

Major Data Mining Tasks/Larose
§ Description of patterns and trends with the intention to provide an intuitive interpretation and explanation; Exploratory Data Analysis
§ Prediction of some future values; NN, decision trees, k-nearest neighbor
§ Estimation (similar to classification, but the outcome is a real value); Statistical Analysis
§ Classification: predicting an item class
§ Clustering: finding clusters in data
§ Associations: e.g. A & B & C occur frequently
§ Visualization: to facilitate human discovery
§ Summarization: describing a group
§ Deviation Detection: finding changes

Data Mining Models and Tasks

Data Mining Tasks: Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
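As one concrete instance of learning from pre-labeled instances, here is a minimal 1-nearest-neighbor classifier. Nearest neighbor is just one simple approach picked for illustration; the training points, class names, and function names are all made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_1nn(train, query):
    """Return the label of the training instance closest to `query`.

    `train` is a list of (features, label) pairs, i.e. the pre-labeled instances.
    """
    _, label = min(train, key=lambda example: squared_distance(example[0], query))
    return label


# Hypothetical pre-labeled instances with two numeric features each.
train = [
    ((1.0, 1.0), "class_a"),
    ((1.2, 0.8), "class_a"),
    ((3.0, 3.2), "class_b"),
    ((2.9, 3.1), "class_b"),
]

print(predict_1nn(train, (1.1, 0.9)))
print(predict_1nn(train, (3.1, 3.0)))
```

The "learning" here is just storing the labeled instances; decision trees and neural networks instead build an explicit model from them.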

Data Mining Tasks: Clustering
Find "natural" grouping of instances given un-labeled data
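One common way to find such groupings is k-means. The sketch below is a bare-bones version under simplifying assumptions: the number of clusters and the starting centers are supplied by hand, the iteration count is fixed, and the points are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


def k_means(points, centers, iterations=10):
    """Alternate between assigning points to centers and recomputing the centers."""
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[k])
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical un-labeled data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(points, centers=[points[0], points[3]])
print(labels)
```

Unlike classification, no labels are given; the cluster indices are produced by the algorithm and carry no meaning beyond group membership.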

Some more DM tasks
§ Description: searches for ways to describe patterns and trends lying within data, e.g. exploratory data analysis
§ Estimation: e.g. "estimating the grade-point average of a graduate student based on his or her undergraduate results"
§ Prediction: like estimation, but the results lie in the future
§ Classification, clustering
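The grade-point example is an estimation task: predict a continuous value from known inputs. One simple way to do that is a least-squares straight line, sketched below. Linear regression is only one possible estimator, and the GPA pairs are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (undergraduate GPA, graduate GPA) pairs.
undergrad = [2.0, 2.5, 3.0, 3.5, 4.0]
graduate = [2.2, 2.6, 3.0, 3.4, 3.8]

slope, intercept = fit_line(undergrad, graduate)
estimate = slope * 3.2 + intercept  # estimated graduate GPA for an undergraduate GPA of 3.2
print(round(slope, 2), round(intercept, 2), round(estimate, 2))
```

Prediction in the slide's sense would use the same kind of model, with the estimated value lying in the future.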

Summary:
§ Technology trends lead to data flood
  § data mining is needed to make sense of data
§ Data Mining has many applications, successful and not
§ Knowledge Discovery Process
§ Data Mining Tasks
  § classification, clustering, …

More on Data Mining and Knowledge Discovery
KDnuggets.com
§ News, Publications
§ Software, Solutions
§ Courses, Meetings, Education
§ Publications, Websites, Datasets
§ Companies, Jobs
§ …

Data Mining Jobs in KDnuggets