DATA MINING Introductory and Advanced Topics Part I

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Data Mining Outline
• PART I
  – Introduction
  – Related Concepts
  – Data Mining Techniques
• PART II
  – Classification
  – Clustering
  – Association Rules
• PART III
  – Web Mining
  – Spatial Mining
  – Temporal Mining

Ming-Yen Lin, IECS, FCU

Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
  – simple listing vs. purchase detail
• How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning
  – ...

Database Processing vs. Data Mining Processing [Fig. 1.1]

Database processing
• Query
  – Well defined
  – SQL
• Data
  – Operational data
• Output
  – Precise
  – Subset of database

Data mining processing
• Query
  – Poorly defined
  – No precise query language
• Data
  – Not operational data
• Output
  – Fuzzy
  – Not a subset of database

Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)
  – [Ex. 1.1: data mining helps to authorize a credit card transaction: 4 classes]
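The contrast between the two kinds of query can be sketched on a toy basket table. This is only an illustration: the data, the function names, and the 60% co-occurrence threshold are all assumptions, not part of the slides.

```python
from collections import Counter

# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]

def baskets_with(item, baskets):
    """Database-style query: a precise predicate whose output is a subset of the data."""
    return [b for b in baskets if item in b]

def frequently_bought_with(item, baskets, min_fraction=0.6):
    """Mining-style query: items that co-occur with `item` in at least
    `min_fraction` of the baskets containing it (the threshold is an assumption)."""
    matching = baskets_with(item, baskets)
    counts = Counter(other for b in matching for other in b if other != item)
    return {other for other, n in counts.items() if n / len(matching) >= min_fraction}
```

The first function returns rows that already exist in the table; the second returns a derived pattern that is not a subset of the stored data.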

Data Mining Algorithm
• Objective: Fit Data to a Model
• Characterize data mining algorithms as 3 parts
  – Model
  – Preference: criteria for choosing the best-fitting model
  – Search: technique to search the data
  – [Ex. 1.1 illustrated]
• Models
  – Predictive: make predictions about values of data
  – Descriptive: identify patterns/relationships in data [explore the properties of data]
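One way to make the three parts concrete is a deliberately naive line fit. In this sketch the model is a straight line, the preference criterion is the sum of squared errors, and the search is an exhaustive scan over a small candidate grid. All three choices, the function name, and the data are assumptions made for illustration.

```python
def fit_line(points, slopes, intercepts):
    """Model: y = a*x + b.
    Preference: smallest sum of squared errors.
    Search: exhaustive scan over the candidate grid (a naive, assumed strategy)."""
    def sse(a, b):
        return sum((y - (a * x + b)) ** 2 for x, y in points)
    candidates = ((a, b) for a in slopes for b in intercepts)
    return min(candidates, key=lambda ab: sse(*ab))

# Hypothetical data lying exactly on y = 2x + 1.
points = [(0, 1), (1, 3), (2, 5), (3, 7)]
best = fit_line(points, slopes=[0, 1, 2, 3], intercepts=[0, 1, 2])
```

Swapping any one of the three parts (a different model family, error measure, or search strategy) gives a different algorithm for the same objective.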

Data Mining Models and Tasks
[Figure: illustrative examples only, not an exhaustive listing]

Predictive Data Mining
• Classification maps data into predefined groups or classes
  – Supervised learning
  – Examples: loan, credit risk
  – Pattern recognition: a type of classification
    – Example: airport security screening (face patterns)
• Regression maps a data item to a real-valued prediction variable
  – Linear regression; error analysis to find the best-fitting function
• Prediction: predict future data (rather than current data)
  – Flooding, speech recognition, ...
    – Data collected by the sensors upriver, with respect to time
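Supervised classification can be sketched with a 1-nearest-neighbor rule: a new applicant receives the class of the most similar labeled example. The slides do not name this technique; it is chosen here only because it is short. The features, labels, and values are hypothetical.

```python
import math

def nearest_neighbor_classify(labeled, query):
    """Return the class label of the labeled example closest to `query`
    (Euclidean distance over the feature tuple)."""
    _features, label = min(labeled, key=lambda ex: math.dist(ex[0], query))
    return label

# Hypothetical training data: (income in $1000s, debt in $1000s) -> risk class.
training = [
    ((80, 5), "good"),
    ((75, 10), "good"),
    ((30, 25), "poor"),
    ((25, 30), "poor"),
]
```

The classes ("good", "poor") are predefined by the training labels, which is what makes the task supervised.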

Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior: Y[6..20] is similar to Z[13..27]
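A statement such as "a window of Y is similar to a window of Z" needs a similarity measure. A minimal sketch, assuming Euclidean distance between equal-length windows (the slides do not specify the measure) and using made-up series:

```python
import math

def window_distance(a, b):
    """Euclidean distance between two equal-length windows."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def best_match(pattern, series):
    """Start index of the window in `series` that is closest to `pattern`."""
    m = len(pattern)
    starts = range(len(series) - m + 1)
    return min(starts, key=lambda s: window_distance(pattern, series[s:s + m]))

# Hypothetical series: the shape of y reappears in z, shifted by two steps.
y = [1, 2, 4, 2, 1]
z = [0, 0, 1, 2, 4, 2, 1, 0]
```

Here the pattern in `y` is found in `z` at a different offset, which mirrors the idea that similar behavior can occur at different times in different series.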

Descriptive Data Mining
• Clustering groups similar data together into clusters. [vs. classification]
  – Unsupervised learning
  – Segmentation/Partitioning data
  – Example: demographic groups & specialized catalogs
• Summarization maps data into subsets with associated simple descriptions.
  – Characterization/Generalization
• Link Analysis uncovers relationships among data.
  – Affinity Analysis/Associations
  – Association Rules [store example]
  – Sequential Analysis (sequence discovery) determines sequential patterns.
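Clustering can be sketched with a one-dimensional k-means loop. K-means is picked here as one familiar option, not as the method the slide has in mind. The customer ages and the starting centroids are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning each value to its nearest centroid and
    moving each centroid to the mean of its assigned values."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customer ages and assumed starting centroids.
ages = [22, 25, 27, 61, 64, 70]
centroids, clusters = k_means_1d(ages, centroids=[20, 60])
```

No class labels are supplied; the two groups emerge from the data, which is the difference from classification noted on the slide.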

Data Mining Functionalities (I)
• Concept description: characterization and discrimination
  – Generalize, summarize
  – Contrast data characteristics
• Association (correlation and causality)
  – Diaper -> Beer [support 0.5%, confidence 75%]
• Classification and prediction
  – Build models (functions) that describe and distinguish classes or concepts, for use in future prediction
    – Example: classify countries based on climate, or classify cars based on gas mileage
  – Predict unknown or missing values
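The two numbers attached to an association rule such as Diaper -> Beer are its support (the fraction of all baskets containing both sides) and its confidence (the fraction of baskets containing the left side that also contain the right side). A small sketch with hypothetical baskets, chosen so the rule reaches 75% confidence:

```python
def support_and_confidence(baskets, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.
    Both sides are sets of items; each basket is a set of items."""
    both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
    with_antecedent = sum(1 for b in baskets if antecedent <= b)
    support = both / len(baskets)
    confidence = both / with_antecedent if with_antecedent else 0.0
    return support, confidence

# Hypothetical baskets, not data from the slides.
baskets = [
    {"diaper", "beer"}, {"diaper", "beer", "milk"}, {"diaper", "beer"},
    {"diaper", "milk"}, {"milk"}, {"bread"}, {"beer"}, {"milk", "bread"},
]
support, confidence = support_and_confidence(baskets, {"diaper"}, {"beer"})
```

A rule can have high confidence and still very low support, as in the slide's figures, when the left-hand side itself is rare.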

Data Mining Functionalities (II)
• Cluster analysis
  – Class labels unknown: group data by similarity
    – e.g., cluster houses to find distribution patterns
  – Maximize intra-class similarity
  – Minimize inter-class similarity
• Outlier analysis
  – Outlier: a data object that does not comply with the general behavior (pattern) of the data
  – Noise? Exception? No! Useful in fraud detection and rare-event analysis
• Trend and evolution analysis
  – Trend and deviation: regression analysis
  – Sequential pattern mining
  – Periodicity analysis
  – Similarity-based analysis
• Estimation, Visualization
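A very simple form of outlier analysis flags values that lie far from the mean in units of standard deviation. The z-score rule and the threshold of 2.0 are assumptions for this sketch, and the transaction amounts are made up.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold`
    population standard deviations (the threshold is an assumed cutoff)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts with one suspicious charge.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
```

In a fraud-detection setting the flagged value is the interesting result rather than noise to be discarded.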

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process (modified from [FPSS96C])
• Selection: Obtain data from various (heterogeneous) sources.
• Preprocessing: Cleanse (incorrect/missing) data.
• Transformation: Convert to common format; transform to new format; reduce data amount.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.

Data Mining: The KDD Process
• Data mining: the core of the knowledge discovery process
[Figure: Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

KDD: Knowledge Discovery in Databases
• KDD process (interactive and iterative)
  – Learning the application domain (relevant prior knowledge and goals of the application)
• Steps
  – Data selection: creating a target data set
  – Data cleaning and preprocessing: may take 60% of the effort!
  – Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
  – Data mining
    – Choose function: summarization / classification / clustering / regression / association
    – Choose algorithms
    – Search for interesting patterns
  – Pattern evaluation and knowledge presentation: visualization, transformation

KDD Process Example: Web Log
• Selection:
  – Select log data (dates and locations) to use
• Preprocessing:
  – Remove identifying URLs
  – Remove error logs
• Transformation:
  – Sessionize (sort and group)
• Data Mining:
  – Identify and count patterns
  – Construct data structure
• Interpretation/Evaluation:
  – Identify and display frequently accessed sequences.
• Potential User Applications:
  – Cache prediction
  – Personalization
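The web log steps above can be sketched end to end on a toy log. Everything here is assumed for illustration: the record layout, the 30-minute inactivity cutoff for sessions, and the choice to count consecutive page pairs as the "patterns".

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, url, HTTP status).
log = [
    ("u1", 0, "/home", 200), ("u1", 10, "/products", 200), ("u1", 20, "/cart", 200),
    ("u2", 5, "/home", 200), ("u2", 15, "/products", 200), ("u2", 18, "/missing", 404),
    ("u1", 4000, "/home", 200), ("u1", 4010, "/products", 200),
]

def sessionize(records, gap=1800):
    """Preprocess (drop error entries), then sort and group page views per user,
    starting a new session after `gap` seconds of inactivity (assumed cutoff)."""
    by_user = defaultdict(list)
    for user, ts, url, status in sorted(records, key=lambda r: (r[0], r[1])):
        if status < 400:
            by_user[user].append((ts, url))
    sessions = []
    for views in by_user.values():
        current, last_ts = [], None
        for ts, url in views:
            if last_ts is not None and ts - last_ts > gap:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

def frequent_pairs(sessions, min_count=2):
    """Count consecutive page pairs; keep those seen at least `min_count` times."""
    counts = Counter(pair for s in sessions for pair in zip(s, s[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

A frequently accessed pair such as home followed by products is the kind of result that could feed cache prediction or personalization.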

Visualization Techniques
• Graphical
  – bar charts, pie charts, histograms, line graphs
• Geometric
  – box plot, scatter diagram
• Icon-based
  – figures, colors
• Pixel-based
  – unique colored pixel
• Hierarchical
• Hybrid

Data Mining Development [Table 1.1]
• Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
• Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
• Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
• Algorithm Design Techniques, Algorithm Analysis, Data Structures
• Neural Networks, Decision Tree Algorithms

Data Mining Technologies
[Figure: Data Mining at the center, drawing on Decision Support, Statistics, Machine Learning, Database Management & Warehousing, Parallel Processing, Visualization, Algorithms, and Others]

Evolution of Database Technology
• 1960s: Data collection
  – Data collection, database creation, information management systems and network DBMS
• 1970s: Databases
  – Relational data model, relational DBMS implementation
• 1980s: Advanced databases
  – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s to 2000s: Data mining
  – Data mining and data warehousing, multimedia databases, and Web databases

Data Mining Implementation Issues
• Human Interaction
  – domain experts / technical experts
• Overfitting
  – model does not fit future states
• Outliers
• Interpretation
  – expert / common users
• Visualization
• Large Datasets
• High Dimensionality

Implementation Issues (cont'd)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
  – into traditional DBMS
• Application
  – determine the intended use, business practice

Data Mining: What Kinds of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced databases and information repositories
  – Object-oriented and object-relational databases
  – Spatial databases
  – Time-series data and temporal data
  – Text databases and multimedia databases
  – Heterogeneous and legacy databases
  – WWW

Data Mining Metrics
• Effectiveness/Usefulness measure
• Return on Investment (ROI)
• Accuracy in classification
• Space/Time complexity analysis
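Of these metrics, accuracy in classification is the easiest to show in code: it is the fraction of items whose predicted class matches the actual class. The labels below are hypothetical, and the per-pair counts are an added convenience rather than something the slide asks for.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual class labels."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair to show where the errors fall."""
    return Counter(zip(actual, predicted))

# Hypothetical credit-risk labels.
actual = ["good", "good", "poor", "poor", "good"]
predicted = ["good", "poor", "poor", "poor", "good"]
```

Accuracy alone hides which kind of mistake was made; the pair counts show that the single error here was a good applicant predicted as poor.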

Social Implications of Data Mining
• Privacy
• Profiling
• Unauthorized use

Database Perspective on Data Mining
• Scalability
• Real-world data: noisy, missing values
• Updates
• Ease of use
• Abstraction of data definition/access primitives, query processing support

Architecture of a Typical Data Mining System
[Figure: system components]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
• Database or data warehouse server
• Data cleaning & data integration; filtering
• Databases; data warehouse

The Future
• DMQL (data mining query language)
  – Access to concept hierarchy
  – Example (p. 18)
  – rule_spec
    – generalized relation / characteristic rule / discriminant rule / classification rule
• KDD process model: CRISP-DM (Cross-Industry Standard Process for Data Mining)
  – 5A: assess, access, analyze, act, automate

Reference Websites
• KDD
  – http://www.kdnuggets.com/
  – http://www.acm.org/sigkdd/
  – http://www.acm.org/sigmod/
• Reference slides
  – http://www.cs.uiuc.edu/~hanj/book
• Research papers
  – http://www.researchindex.com/
  – http://www.google.com/ (p. 20)