National United University, Department of Information Management, Data Mining course (陳士杰)

Motivation: “Necessity is the Mother of Invention”
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
- We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining
- Data warehousing and on-line analytical processing
- Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology
- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Hierarchical and network database systems
  - Relational data model, relational DBMS implementation
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, temporal, engineering, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
    - What data mining digs out is not merely data, but knowledge.
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, business intelligence, data/pattern analysis, data archeology, data dredging, information harvesting, etc.

- Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data (KDD). This is data mining in the broad sense.
- Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. This is data mining in the narrow sense.

Knowledge Discovery (KDD) Process
- Data mining: the core of knowledge discovery
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and Transformation -> Task-relevant Data -> Data Mining -> Patterns -> Evaluation and Presentation]

KDD Process: Several Key Steps
1. Data cleaning
   - Remove noise and inconsistent data.
   - May take 60% of the effort!
2. Data integration
   - Multiple data sources may be combined.
3. Data selection
   - Data relevant to the analysis task are retrieved from the database.
4. Data transformation
   - Data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
5. Data mining
   - Intelligent methods are applied in order to extract data patterns.
   - Choose the mining algorithm(s) for searching patterns of interest.
6. Pattern evaluation
   - Identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation
   - Visualization and knowledge representation techniques are used to present the mined knowledge to the user.
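The seven steps above can be sketched as a toy pipeline. This is only an illustration under assumed data: the records, field names, thresholds, and function names are hypothetical and do not come from the course material, and each step is reduced to its simplest possible form.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "item": "PC", "amount": 1200.0},
           {"customer": "c2", "item": "camera", "amount": None}]
store_b = [{"customer": "c3", "item": "PC", "amount": 950.0},
           {"customer": "c1", "item": "memory card", "amount": 40.0}]

def clean(records):
    """Step 1: drop records with a missing amount (a crude form of cleaning)."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Step 3: keep only the records relevant to the task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Step 4: aggregate total spend per customer."""
    totals = Counter()
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals, threshold):
    """Step 5: a trivial 'pattern': customers whose total exceeds the threshold."""
    return {c: t for c, t in totals.items() if t > threshold}

def run_pipeline():
    data = integrate(clean(store_a), clean(store_b))
    relevant = select(data, min_amount=10.0)
    totals = transform(relevant)
    patterns = mine(totals, threshold=1000.0)
    # Steps 6-7 (evaluation, presentation) are reduced to sorting and formatting.
    return [f"{c}: {t:.2f}" for c, t in sorted(patterns.items())]
```

Real systems would replace every function here with something far more elaborate; the point is only the order in which data flows through the steps.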

- We adopt a broad view of data mining functionality:
  - Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

Architecture: Typical Data Mining System
[Figure: layered architecture. Components, from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories). A Knowledge Base is also part of the system. Annotation: OLAP = on-line analytical processing.]

Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top. Layers, from top to bottom:
- Decision Making
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA.]

Watch out: Is everything “data mining”?
- Although there are many “data mining systems” on the market, not all of them can perform true data mining:
  - Machine learning systems, statistical data analysis tools
    - Do not handle large amounts of data.
  - OLAP, database systems, information retrieval systems
    - Can only perform data or information retrieval, including finding aggregate values, or deductive query answering in large databases.

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial and spatiotemporal databases
  - Temporal, sequence, and time-series databases
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - Data streams
  - WWW

Object-oriented and object-relational databases
- Object-oriented database
  - Each entity is considered as an object.
    - For instance, an employee class can contain variables like name, address, and birthday.
    - Suppose that the class sales_person is a subclass of the class employee. It would inherit all of the variables pertaining to its superclass, employee.
- Object-relational database
  - Inherits the essential concepts of object-oriented databases.
  - This model extends the relational model by providing a rich data type for handling complex objects and object orientation.
- For data mining in object-oriented or object-relational systems, techniques need to be developed for handling:
  - Complex object structure
  - Complex data types
  - Class and subclass hierarchies
  - Property inheritance
  - Methods and procedures
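The employee / sales_person example above can be illustrated with ordinary class inheritance. This is a sketch in Python rather than in any object-oriented database system; the `commission` attribute is a hypothetical addition used only to show that a subclass keeps its superclass's variables and can add its own.

```python
from dataclasses import dataclass, fields

@dataclass
class Employee:
    # Variables named on the slide.
    name: str
    address: str
    birthday: str

@dataclass
class SalesPerson(Employee):
    # Hypothetical extra attribute; everything else is inherited from Employee.
    commission: float = 0.0

def attribute_names(cls):
    """List the attribute names of a dataclass, inherited ones first."""
    return [f.name for f in fields(cls)]
```

Calling `attribute_names(SalesPerson)` lists the three inherited variables followed by the subclass's own attribute.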

Spatial and Spatiotemporal Databases
- Spatial database
  - Contains spatial-related information:
    - Spatial topological features
    - Spatial and non-spatial attribute features
    - Changes of objects over time
  - Examples include geographic databases, VLSI, and medical and satellite image databases.
  - Maps can be represented in vector format.
- Spatiotemporal database
  - Stores spatial objects that change with time.
    - Group the trends of moving objects and identify some strangely moving vehicles.
    - Distinguish a bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a disease with time.

Temporal, Sequence, and Time-Series Databases
- Temporal database
  - Stores relational data that include time-related attributes.
- Sequence database
  - Stores sequences of ordered events, with or without a concrete notion of time.
    - Customer shopping sequences, Web click streams, and biological sequences.
- Time-series database
  - Stores sequences of values or events obtained over repeated measurements of time.
    - The stock exchange, inventory control, the observation of natural phenomena.
- Data mining techniques can be used to find the characteristics of object evolution, or the trend of changes for objects in the database.
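One very simple way to expose the "trend of changes" in a time series is a moving average. The sketch below is an assumption chosen for illustration, not a technique named on the slide; the function name and window size are hypothetical.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

For example, `moving_average([1, 2, 3, 4, 5], 3)` smooths the five readings into three averaged points, which makes a steady upward trend easier to see in noisier data.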

Text databases and multimedia databases
- Text database
  - Databases that contain word descriptions for objects.
  - These word descriptions are usually not simple keywords but rather long sentences or paragraphs.
    - Product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
  - Text databases may be somewhat structured:
    - Highly unstructured (Web pages)
    - Semistructured (e-mail messages, XML web pages)
    - Well structured (library catalogue databases)
  - Highly regular structures typically can be implemented using relational database systems.
- Multimedia database
  - Stores image, audio, and video data.
  - Specialized storage and search techniques are also required.
  - Storage and search techniques need to be integrated with standard data mining methods.

Heterogeneous and legacy databases
- Heterogeneous database
  - Consists of a set of interconnected, autonomous component databases.
  - Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.
- Legacy database
  - Many enterprises acquire legacy databases as a result of the long history of information technology development.
  - A legacy database is a group of heterogeneous databases.
- Information exchange across such databases is difficult because it would require precise transformation rules from one representation to another, considering diverse semantics.

Data Streams
- A new kind of data:
  - Data flow in and out of an observation platform dynamically.
- Unique features:
  - Huge or possibly infinite volume
  - Dynamically changing
  - Flowing in and out in a fixed order
  - Demanding fast response time
  - Allowing only one or a small number of scans
- Main application settings: data produced in dynamic environments.
  - Video surveillance
  - Network traffic
  - Stock exchange
  - Weather or environment monitoring, and so on
- Because data streams are normally not stored in any kind of data repository, effective and efficient management and analysis of stream data pose great challenges to researchers.
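The "only one scan" constraint can be illustrated with a running mean and variance that are updated per element and never store the stream. The sketch uses Welford's online algorithm, which is an assumption chosen for illustration and is not mentioned on the slide; the class name is hypothetical.

```python
class RunningStats:
    """Single-pass mean and population variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Constant time and memory per element: the stream is never stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0
```

Each arriving value is passed to `update` once and then discarded, so memory use stays constant regardless of how long the stream runs.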

WWW
- The WWW and its associated distributed information services provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access.
- Although web pages may appear fancy and informative to human readers, they can be highly unstructured and lack a predefined schema, type, or pattern.
  - Web services that provide keyword-based searches without understanding the context behind the web pages can only offer limited help to users.
- What is mined:
  - Content retrieval (text retrieval)
  - Web access pattern retrieval

Data Mining Functionalities: What kinds of patterns can be mined?
- Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
- Data mining tasks can be classified into two categories:
  - Descriptive:
    - Characterize the general properties of the data in the database.
  - Predictive:
    - Perform inference on the current data in order to make predictions.
- In some cases, users may have no idea regarding what kinds of patterns in their data may be interesting, and hence may like to search for several different kinds of patterns in parallel.
- Thus it is important to have a data mining system that can mine multiple kinds of patterns to accommodate different ...

- Data mining functionalities, and the kinds of patterns they can discover, are described below:
  - Concept description: characterization and discrimination
  - Association analysis
  - Classification and prediction
  - Cluster analysis
  - Outlier analysis
  - Trend and evolution analysis

Concept Description: Characterization and Discrimination
- Concept description (or class description):
  - Describe a collection of data as distinct classes or concepts in summarized, concise, and precise terms.
    - For example, in the AllElectronics store:
      - Items for sale can be classified into computers and printers.
      - Customer concepts can be divided into bigSpenders and budgetSpenders.
- These descriptions can be derived via:
  - Data characterization
  - Data discrimination
  - Both data characterization and discrimination
- Covered in Chapter 4.

Data Characterization
- Summarization of the general characteristics or features of a target class of data.
  - Example: a data mining system should be able to summarize the characteristics of AllElectronics customers who spend more than $1,000 (big spenders):
    - Aged 40 to 50
    - Employed
    - Good credit rating
- The output of data characterization can be presented in various forms:
  - Pie charts
  - Bar charts
  - Curves
  - ...
- Covered in Chapter 4.
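A characterization of this kind can be sketched as a filter followed by a summary. The customer records, field names, and the `characterize` function below are hypothetical; only the idea of summarizing customers who spend more than $1,000 comes from the example.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"id": 1, "age": 42, "credit": "excellent", "spend": 1500},
    {"id": 2, "age": 47, "credit": "excellent", "spend": 2300},
    {"id": 3, "age": 25, "credit": "fair", "spend": 300},
    {"id": 4, "age": 44, "credit": "good", "spend": 1100},
]

def characterize(records, min_spend):
    """Summarize the target class: customers spending more than min_spend."""
    target = [r for r in records if r["spend"] > min_spend]
    if not target:
        return {"count": 0, "age_range": None, "credit": {}}
    ages = [r["age"] for r in target]
    return {
        "count": len(target),
        "age_range": (min(ages), max(ages)),
        "credit": dict(Counter(r["credit"] for r in target)),
    }
```

The resulting dictionary is the kind of summary that could then be drawn as a pie chart or bar chart.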

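Data Discrimination
- Comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
  - Example: a data mining system should be able to compare two groups of AllElectronics customers, those who buy computer products regularly (more than twice a month) and those who buy such products occasionally (fewer than three times a year):
    - Among frequent buyers, 80% are between 20 and 40 years old and have a university education.
    - Among occasional buyers, 60% are either older or younger than that and have no university degree.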
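Discrimination can be sketched as computing the same feature shares for a target class and a contrasting class and placing them side by side. The two customer groups and all names below are made up for illustration; they are not the AllElectronics data.

```python
def share(group, predicate):
    """Fraction of records in the group that satisfy the predicate."""
    if not group:
        return 0.0
    return sum(1 for g in group if predicate(g)) / len(group)

def discriminate(target, contrast, predicates):
    """Map each feature name to (share in target, share in contrast)."""
    return {
        name: (share(target, p), share(contrast, p))
        for name, p in predicates.items()
    }

# Hypothetical groups: frequent vs. occasional buyers of computer products.
frequent = [
    {"age": 25, "degree": True}, {"age": 35, "degree": True},
    {"age": 30, "degree": True}, {"age": 38, "degree": False},
    {"age": 55, "degree": True},
]
occasional = [
    {"age": 65, "degree": False}, {"age": 17, "degree": False},
    {"age": 30, "degree": True}, {"age": 70, "degree": False},
    {"age": 28, "degree": False},
]
features = {
    "age 20-40": lambda c: 20 <= c["age"] <= 40,
    "university degree": lambda c: c["degree"],
}
```

Calling `discriminate(frequent, occasional, features)` yields one pair of shares per feature, which is the raw material for a comparative description of the two classes.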

- Frequent patterns are patterns that occur frequently in data.
- Some kinds of frequent patterns:
  - Frequent itemset:
    - A set of items that frequently appear together in a transactional data set.
  - Frequent sequential pattern:
    - A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
  - Frequent structured pattern:
    - A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
    - If a substructure occurs frequently, it is called a frequent structured pattern.
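Frequent itemsets can be found, for tiny data, by counting every small item combination in every transaction. The sketch below is deliberately brute force (it is not Apriori or any optimized algorithm), and the transactions, function name, and `max_size` parameter are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # sort so each itemset has one canonical tuple
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical transactions.
transactions = [
    ["PC", "camera", "memory card"],
    ["PC", "camera"],
    ["PC", "printer"],
    ["camera", "memory card"],
]
```

With a minimum support of 0.5, the pair PC and camera qualifies because it appears in two of the four transactions, while printer does not.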

- Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
- Although the term prediction may refer to both numeric prediction and class label prediction, in this book we use it to refer primarily to numeric prediction.
  - Prediction can be viewed as a kind of classification; the difference is that prediction mainly forecasts the future state of the data rather than its current state.
- Because the classes are determined before the test data are analyzed, classification is often called supervised learning.
- Covered in Chapter 6.
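The contrast between a categorical label and a continuous value can be made concrete with two tiny models. Both are assumptions picked for brevity, not methods prescribed by the slides: a 1-nearest-neighbour rule for classification and a least-squares line for numeric prediction. All names and data are hypothetical.

```python
def nearest_neighbor_label(x, examples):
    """Classification: return the label of the closest (value, label) example."""
    return min(examples, key=lambda e: abs(e[0] - x))[1]

def fit_line(xs, ys):
    """Numeric prediction: least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical labelled examples: yearly spend -> customer class.
labelled = [(200, "budgetSpender"), (1500, "bigSpender")]
```

The first function can only ever answer with one of the known labels; the second returns coefficients that produce a number for any input.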

Outlier Analysis
- A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
  - Most data mining methods discard outliers as noise or exceptions.
  - However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
- Applications
  - Credit card fraud detection
  - Mobile phone fraud detection
  - Customer segmentation
  - Medical analysis (anomalies)
- The analysis of outlier data is referred to as outlier mining.
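One simple way to flag objects that "do not comply with the general behavior" is a z-score test: mark values that lie many standard deviations from the mean. This is an assumed, illustrative technique, not the one the course prescribes, and the threshold and sample amounts are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold * stdev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one suspicious charge.
amounts = [10, 12, 11, 13, 12, 11, 10, 12, 11, 500]
```

Note that a single extreme value also inflates the mean and standard deviation, so more robust measures (for example, median-based ones) are often preferred in practice.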

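Evolution Analysis
- Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
  - May include characterization and discrimination, association, classification, or prediction of time-related data.
- Example: suppose you have the major stock market (time-series) data of the New York Stock Exchange for the past several years and wish to invest in shares of high-tech companies. A data mining study of the stock exchange data may identify evolution regularities for the overall stock market and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices and support your investment decisions.
- Covered in Chapter 8.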

Why Data Mining? Potential Applications
- Data analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
- Other applications
  - Text mining (news groups, email, documents)
  - Web mining
  - Bioinformatics and bio-data analysis

Are All the “Discovered” Patterns Interesting?
- Data mining may generate thousands of patterns: not all of them are interesting.
- Some serious questions:
  - What makes a pattern interesting?
  - Can a data mining system generate all of the interesting patterns?
  - Can a data mining system generate only interesting patterns?

- The answer to the first question:
  - Interestingness measures
    - A pattern is interesting if it is:
      1. Easily understood by humans,
      2. Valid on new or test data with some degree of certainty,
      3. Potentially useful,
      4. Novel, or validates some hypothesis that a user seeks to confirm.
  - Objective vs. subjective interestingness measures
    - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
    - Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
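Support and confidence, the two objective measures named above, can be computed directly from a list of transactions using their standard definitions: support is the fraction of transactions containing an itemset, and the confidence of a rule A => B is support(A and B) divided by support(A). The transactions and function names below are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base

# Hypothetical transactions.
baskets = [
    ["PC", "camera", "memory card"],
    ["PC", "camera"],
    ["PC", "printer"],
    ["camera", "memory card"],
]
```

Here the rule PC => camera has support 0.5 (two of four baskets contain both) and confidence 2/3 (two of the three PC baskets also contain a camera). Thresholds on these values are one way to filter uninteresting patterns.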

- The answer to the second question:
  - Find all the interesting patterns: completeness
    - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
    - Heuristic vs. exhaustive search
    - Association vs. classification vs. clustering
- The answer to the third question:
  - Search for only interesting patterns: an optimization problem
    - Can a data mining system find only the interesting patterns?
    - Approaches:
      - First generate all the patterns and then filter out the uninteresting ones.
      - Generate only the interesting patterns: mining query optimization.

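Data Mining: Confluence of Multiple Disciplines
- Data mining is an interdisciplinary field, the confluence of a set of disciplines.
[Figure: data mining at the center, surrounded by database systems, machine learning, algorithms, statistics, visualization, and other disciplines (information retrieval, ...)]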

- Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems.
- Different views lead to different classifications:
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

- Kinds of databases mined:
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
- Kinds of knowledge mined:
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized:
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted:
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

- Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed.
- A data mining task can be specified in the form of a data mining query (Data Mining Query Language, DMQL), which is input to the data mining system.
- A data mining query is defined in terms of data mining task primitives.
  - These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or ...

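- The data mining primitives:
  - The set of task-relevant data to be mined
    - Specifies the portion of the database or data set that the user is interested in.
  - The kind of knowledge to be mined
    - Specifies the data mining function to be performed.
  - The background knowledge to be used in the discovery process
    - Background knowledge about the domain being mined is useful for guiding the knowledge discovery process and for evaluating the patterns found.
    - One way to express background knowledge: concept hierarchies.
  - The interestingness measures and thresholds for pattern evaluation
    - Used to guide the mining process or, after mining, to evaluate the discovered patterns.
    - Separate uninteresting patterns from knowledge.
  - The expected representation for visualizing the discovered patterns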
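The five primitives can be pictured as the fields of a single task description. The sketch below is a hypothetical Python container, not DMQL syntax and not any real system's API; the field names, the example values, and the one-level concept hierarchy are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class MiningTask:
    """Hypothetical container for the five task primitives (not DMQL)."""
    relevant_data: str                                      # which data to mine
    knowledge_kind: str                                     # which mining function to run
    concept_hierarchy: dict = field(default_factory=dict)   # background knowledge
    thresholds: dict = field(default_factory=dict)          # interestingness measures
    presentation: str = "table"                             # how to show the patterns

def generalize(value, hierarchy):
    """Climb one level of a concept hierarchy; unknown values stay unchanged."""
    return hierarchy.get(value, value)

task = MiningTask(
    relevant_data="sales of customers in Taiwan",
    knowledge_kind="association",
    concept_hierarchy={"Miaoli": "Taiwan", "Taipei": "Taiwan"},
    thresholds={"min_support": 0.05, "min_confidence": 0.7},
)
```

A concept hierarchy like the one above lets a system report patterns at the level of a country rather than of individual cities.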

Integration of Data Mining and Data Warehousing
- A good system architecture helps a data mining system do well on several fronts, including performance, interactivity, usability, and scalability.
- Most data today are stored in databases or data warehouses, and comprehensive information processing and information analysis functions are often built on top of them.
- A critical question in the design of a data mining system is how to integrate or couple the DM system with a database system and/or a data warehouse system.
  - No coupling
  - Loose coupling
  - Semitight coupling
  - Tight coupling

- No coupling:
  - The DM system will not utilize any function of a DB or DW system.
  - Simple.
  - Drawbacks:
    - The DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
    - The DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment.
- Loose coupling:
  - The DM system will use some facilities of a DB or DW system.
  - Better than no coupling.
  - Drawbacks:
    - Because mining does not explore data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.

- Semitight coupling:
  - Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
  - Some frequently used intermediate mining results can be precomputed and stored in the DB/DW system. This design will enhance the performance of a DM system.
- Tight coupling:
  - The DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system.
  - Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.
  - This will provide a uniform information processing environment.
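The difference between using the database only as a data source and letting it do part of the work can be sketched with Python's built-in `sqlite3` module. The table, rows, and function names are hypothetical, and the second function is only loosely analogous to semitight coupling (an aggregation pushed into the DB), not an implementation of it.

```python
import sqlite3
from collections import Counter

def make_db():
    """Create a small in-memory database with hypothetical sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, item TEXT)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("c1", "PC"), ("c1", "camera"), ("c2", "PC"), ("c3", "printer")],
    )
    return conn

def item_counts_loose(conn):
    """Loose coupling: fetch raw rows, then count in the application."""
    rows = conn.execute("SELECT item FROM sales").fetchall()
    return dict(Counter(item for (item,) in rows))

def item_counts_in_db(conn):
    """Push the aggregation into the database, which can use its own optimizer."""
    rows = conn.execute(
        "SELECT item, COUNT(*) FROM sales GROUP BY item"
    ).fetchall()
    return dict(rows)
```

Both functions return the same counts; the difference is where the work happens and therefore how much data has to leave the database, which is what matters at scale.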

Major Issues in Data Mining
- Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem

- Performance issues
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, and incremental mining methods
- Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)

Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining