Overview of Data Mining and the KDD Process

  • Slides: 25
Overview of Data Mining and the KDD Process
Bamshad Mobasher, DePaul University

From Data to Wisdom: The Information Hierarchy
  • Data: the raw material of information
  • Information: data organized and presented by someone
  • Knowledge: information read, heard, or seen, then understood and integrated
  • Wisdom: distilled knowledge and understanding that can lead to decisions

Why Data Mining?
  • The explosive growth of data: from terabytes to petabytes
    ◦ Data collection and data availability
      - Automated data collection tools, database systems, Web, computerized society
    ◦ Major sources of abundant data
      - Business: Web, e-commerce, transactions, stocks, …
      - Science: remote sensing, bioinformatics, scientific simulation, …
      - Society and everyone: news, images, video, documents
      - Internet …

[Figure omitted] Source: Intel

How much data?
  • Google: ~20-30 PB a day
  • Wayback Machine has ~4 PB + 100-200 TB/month
  • Facebook: ~3 PB of user data + 25 TB/day
  • eBay: ~7 PB of user data + 50 TB/day
  • CERN’s Large Hadron Collider generates 15 PB a year
  • In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB
"640K ought to be enough for anybody."

Big Data Growing
  • IDC predicts: from 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes, or 5,200 GB per person in 2020.
  • The Untapped Data Gap: most of the useful data will not be tagged or analyzed, partly due to a skill shortage.
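The IDC figures can be sanity-checked with a little arithmetic. The sketch below only uses the numbers quoted on the slide; the helper name `implied_doubling_time` and the decimal conversion 1 EB = 10^9 GB are assumptions made for illustration.

```python
import math

# Figures quoted on the slide (IDC prediction).
START_YEAR, END_YEAR = 2005, 2020
START_EB, END_EB = 130, 40_000
GB_PER_PERSON = 5_200
GB_PER_EB = 10**9  # assumption: decimal units, 1 EB = 10^9 GB


def implied_doubling_time(start, end, years):
    """Years per doubling implied by growing from `start` to `end` in `years`."""
    doublings = math.log2(end / start)
    return years / doublings


years = END_YEAR - START_YEAR
doubling_time = implied_doubling_time(START_EB, END_EB, years)

# Population implied by spreading the 2020 total over 5,200 GB per person.
implied_population = END_EB * GB_PER_EB / GB_PER_PERSON

print(f"growth factor: {END_EB / START_EB:.0f}x")
print(f"implied doubling time: {doubling_time:.2f} years")
print(f"implied population: {implied_population / 1e9:.1f} billion")
```

Under these assumptions the growth works out to a doubling roughly every 1.8 years, which is consistent with the rounded "every 2 years", and the per-person figure implies a world population of about 7.7 billion.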

What Is Data Mining?
  • We are drowning in data, but starving for knowledge!
  • “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
  • Data Mining: A Definition
    The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories
    ◦ Non-trivial: obvious knowledge is not useful
    ◦ Implicit: hidden, difficult-to-observe knowledge
    ◦ Previously unknown
    ◦ Potentially useful: actionable; easy to understand

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of the following disciplines]
  • Machine Learning
  • Pattern Recognition
  • Statistics
  • Visualization
  • High-Performance Computing
  • Database Technology
  • Algorithm
  • Applications

Data Mining’s Virtuous Cycle
  1. Identifying the problem
  2. Mining data to transform it into actionable information
  3. Acting on the information
  4. Measuring the results

The Knowledge Discovery Process
  • Data Mining vs. Knowledge Discovery in Databases (KDD)
    ◦ DM and KDD are often used interchangeably
    ◦ Actually, DM is only part of the KDD process
[Figure: The KDD Process]

Types of Knowledge Discovery
  • Two kinds of knowledge discovery: directed and undirected
  • Directed Knowledge Discovery
    ◦ Purpose: explain the value of some field in terms of all the others (goal-oriented)
    ◦ Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
    ◦ Examples:
      - what products show increased sales when cream cheese is discounted
      - which banner ad to use on a web page for a given user coming to the site
  • Undirected Knowledge Discovery
    ◦ Purpose: find patterns in the data that may be interesting (no target field)
    ◦ Method: clustering, affinity grouping
    ◦ Examples:
      - which products in the catalog often sell together
      - market segmentation (find groups of customers/users with similar characteristics or behavioral patterns)
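The contrast between the two kinds of discovery can be sketched in a few lines. This is only an illustration, not a method from the slides: the visitor log, the field names, the one-rule classifier, and the two-group clustering are all hypothetical stand-ins chosen to keep the example small. The directed part is given a target field (`clicked_ad`); the undirected part is given no target at all.

```python
from collections import Counter, defaultdict

# Hypothetical visitor log; every field name and value is made up.
visits = [
    {"referrer": "search", "device": "mobile", "clicked_ad": "sports"},
    {"referrer": "search", "device": "desktop", "clicked_ad": "sports"},
    {"referrer": "social", "device": "mobile", "clicked_ad": "music"},
    {"referrer": "social", "device": "desktop", "clicked_ad": "music"},
    {"referrer": "search", "device": "mobile", "clicked_ad": "music"},
    {"referrer": "search", "device": "desktop", "clicked_ad": "sports"},
]


def fit_one_rule(records, predictor, target):
    """Directed: learn the most common target value per predictor value."""
    votes = defaultdict(Counter)
    for r in records:
        votes[r[predictor]][r[target]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}


def two_means_1d(values, iterations=10):
    """Undirected: split numbers into two groups around two moving centers."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("need at least two distinct values")
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Index 1 (True) when v is strictly closer to the high center.
            groups[abs(v - high) < abs(v - low)].append(v)
        low = sum(groups[0]) / len(groups[0])
        high = sum(groups[1]) / len(groups[1])
    return sorted(groups[0]), sorted(groups[1])


# Directed: which banner ad to show, given where the visitor came from.
ad_rule = fit_one_rule(visits, predictor="referrer", target="clicked_ad")
print(ad_rule)

# Undirected: segment customers by (hypothetical) monthly spend, no target field.
monthly_spend = [12, 15, 14, 90, 95, 88, 11]
low_spenders, high_spenders = two_means_1d(monthly_spend)
print(low_spenders, high_spenders)
```

The directed function cannot run without being told which field to predict, while the clustering function simply looks for structure in whatever values it receives.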

From Data Mining to Data Science

Data Mining: On What Kinds of Data?
  • Database-oriented data sets and applications
    ◦ Relational database, data warehouse, transactional database
    ◦ Object-relational databases, heterogeneous databases, and legacy databases
  • Advanced data sets and advanced applications
    ◦ Data streams and sensor data
    ◦ Time-series data, temporal data, sequence data (incl. bio-sequences)
    ◦ Structure data, graphs, social networks, and information networks
    ◦ Spatial data and spatiotemporal data
    ◦ Multimedia database
    ◦ Text databases
    ◦ The World-Wide Web

Data Mining: What Kind of Data?
  • Structured Databases
    ◦ relational, object-relational, etc.
    ◦ can use SQL to perform parts of the process, e.g.,
        SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category
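To make the SQL example concrete, here is a minimal sketch that runs the slide's query against an in-memory SQLite database. The `Items` schema and the rows are assumptions invented for the demonstration; the query is the slide's, with the string literal quoted, the `category` column selected so each count is labeled, and an `ORDER BY` added only to make the output order predictable.

```python
import sqlite3

# In-memory database with a hypothetical Items table (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("item-1", "video", "comedy"),
        ("item-2", "video", "comedy"),
        ("item-3", "video", "drama"),
        ("item-4", "book", "comedy"),  # excluded by the WHERE clause
    ],
)

# Count video items per category.
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
conn.close()

for category, n in rows:
    print(category, n)
```

Aggregations like this cover the selection and summarization parts of the process; the pattern-finding steps usually happen outside the database.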

Data Mining: What Kind of Data?
  • Flat Files
    ◦ most common data source
    ◦ can be text (or HTML) or binary
    ◦ may contain transactions, statistical data, measurements, etc.
  • Transactional databases
    ◦ set of records, each with a transaction id, time stamp, and a set of items
    ◦ may have an associated “description” file for the items
    ◦ typical source of data used in market basket analysis
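A transactional record as described above (id, time stamp, set of items) maps naturally onto a small data class. The sketch below is illustrative only: the `Transaction` type, the sample baskets, and the helper names are hypothetical. The `support` and `confidence` measures are not named on the slide; they are the standard quantities used in market basket analysis and are introduced here as an assumption about what one would compute next.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Transaction:
    tid: int              # transaction id
    timestamp: datetime   # time stamp
    items: frozenset      # set of items bought together


# Hypothetical baskets, made up for the example.
transactions = [
    Transaction(1, datetime(2024, 1, 5, 9, 30), frozenset({"bagels", "cream cheese"})),
    Transaction(2, datetime(2024, 1, 5, 10, 15), frozenset({"bagels", "cream cheese", "coffee"})),
    Transaction(3, datetime(2024, 1, 6, 8, 45), frozenset({"coffee", "milk"})),
    Transaction(4, datetime(2024, 1, 6, 17, 0), frozenset({"bagels", "coffee"})),
]


def support(itemset, txns):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t.items for t in txns) / len(txns)


def confidence(antecedent, consequent, txns):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, txns) / support(antecedent, txns)


def pair_counts(txns):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for t in txns:
        counts.update(combinations(sorted(t.items), 2))
    return counts


print(support({"bagels", "cream cheese"}, transactions))
print(confidence({"bagels"}, {"cream cheese"}, transactions))
print(pair_counts(transactions))
```

Storing items as a set rather than a list makes the containment test (`itemset <= t.items`) a single operation, which is the core of every support count.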

Data Mining: What Kind of Data?
  • Other Types of Databases
    ◦ legacy databases
    ◦ multimedia databases (usually very high-dimensional)
    ◦ spatial databases (containing geographical information, such as maps or satellite imaging data)
    ◦ time-series and temporal data (time-dependent information such as stock market data; usually very dynamic)
  • World Wide Web
    ◦ basically a large, heterogeneous, distributed database
    ◦ need for new or additional tools and techniques
      - information retrieval, filtering, and extraction
      - agents to assist in browsing and filtering
      - Web content, usage, and structure (linkage) mining tools
    ◦ The “social Web”
      - user-generated metadata, social networks, shared resources, etc.

What Can Data Mining Do?
  • Many Data Mining Tasks
    ◦ often inter-related
    ◦ often need to try different techniques/algorithms for each task
    ◦ each task may require different types of knowledge discovery
  • What are some data mining tasks?
    ◦ Classification
    ◦ Prediction
    ◦ Clustering
    ◦ Affinity Grouping / Association discovery
    ◦ Sequence Analysis
    ◦ Characterization
    ◦ Discrimination

Some Applications of Data Mining
  • Business data analysis and decision support
    ◦ Marketing focalization
      - Recognizing specific market segments that respond to particular characteristics
      - Return on mailing campaign (target marketing)
    ◦ Customer Profiling
      - Segmentation of customers for marketing strategies and/or product offerings
      - Customer behavior understanding
      - Customer retention and loyalty
      - Mass customization / personalization

Some Applications of Data Mining
  • Business data analysis and decision support (cont.)
    ◦ Market analysis and management
      - Provide summary information for decision-making
      - Market basket analysis, cross selling, market segmentation
      - Resource planning
    ◦ Risk analysis and management
      - "What if" analysis
      - Forecasting
      - Pricing analysis, competitive analysis
      - Time-series analysis (e.g., stock market)

Some Applications of Data Mining
  • Fraud detection
    ◦ Detecting telephone fraud:
      - Telephone call model: destination of the call, duration, time of day or week
      - Analyze patterns that deviate from an expected norm
      - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion-dollar fraud scheme
    ◦ Detection of credit-card fraud
    ◦ Detecting suspicious money transactions (money laundering)
  • Text mining:
    ◦ Message filtering (e-mail, newsgroups, etc.)
    ◦ Newspaper article analysis
    ◦ Text and document categorization
  • Web Mining
    ◦ Mining patterns from the content, usage, and structure of Web resources
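"Patterns that deviate from an expected norm" can be illustrated with a very simple deviation score. The sketch below is an assumption-laden toy, not how any telecom operator actually detects fraud: it uses only call duration (the slide's call model also includes destination and time of day), the call history is invented, and the threshold of 3 standard deviations is an arbitrary illustrative choice.

```python
from statistics import mean, stdev

# Hypothetical call history for one account: durations in minutes.
history = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 4.5]


def deviation_score(value, baseline):
    """Number of sample standard deviations `value` lies from the baseline mean.

    Assumes the baseline has at least two values that are not all equal.
    """
    return abs(value - mean(baseline)) / stdev(baseline)


def flag_unusual(new_calls, baseline, threshold=3.0):
    """Return the calls whose duration deviates from the norm by more than `threshold`."""
    return [c for c in new_calls if deviation_score(c, baseline) > threshold]


new_calls = [4.0, 55.0, 3.0]
flagged = flag_unusual(new_calls, history)
print(flagged)
```

Here the "expected norm" is just the account's own past behavior, so the same 55-minute call could be normal for one caller and suspicious for another.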

Types of Web Mining
  • Web Content Mining
  • Web Usage Mining
  • Web Structure Mining

Types of Web Mining: Web Content Mining
Applications:
  • document clustering or categorization
  • topic identification / tracking
  • concept discovery
  • focused crawling
  • content-based personalization
  • intelligent search tools

Types of Web Mining: Web Usage Mining
Applications:
  • user and customer behavior modeling
  • Web site optimization
  • e-customer relationship management
  • Web marketing
  • targeted advertising
  • recommender systems

Types of Web Mining: Web Structure Mining
Applications:
  • document retrieval and ranking (e.g., Google)
  • discovery of “hubs” and “authorities”
  • discovery of Web communities
  • social network analysis
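The "hubs" and "authorities" idea can be sketched as a short iteration over a link graph, in the style of Kleinberg's HITS algorithm: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The slides do not give an algorithm, so treat this as one common formulation; the graph, the page names, and the iteration count are hypothetical.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal_a": ["page_x", "page_y"],
    "portal_b": ["page_x", "page_y"],
    "page_x": [],
    "page_y": ["page_x"],
}


def hits(graph, iterations=50):
    """Return (hub, authority) score dictionaries after repeated mutual updates."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


hub, auth = hits(links)
print("best authority:", max(auth, key=auth.get))
print("hub scores:", {p: round(s, 3) for p, s in sorted(hub.items())})
```

In this toy graph the two portal pages come out as the strongest hubs because they point at the pages everyone else points at, and `page_x` is the strongest authority because every other page links to it.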

The Knowledge Discovery Process
  • Next: we first focus on understanding the data and data preparation/transformation
[Figure: The KDD Process]