Related Concepts Outline
Ming-Yen Lin, IECS, FCU

Goal: Examine some areas which are related to data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching

DB & OLTP Systems
On-Line Transaction Processing
• Schema
  - (ID, Name, Address, Salary, JobNo)
• Data Model
  - Entity-Relationship
  - Relational
• Transaction
• Query: SELECT Name FROM T WHERE Salary > 100000 [Fig. 2.1]
DM: Only imprecise queries
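
The precise query on this slide can be tried out with a minimal sketch like the one below. It assumes Python's built-in sqlite3 module and an in-memory table T with the slide's schema; the column types and the three rows are made up for illustration and do not come from the slides.

```python
import sqlite3

# In-memory database; table T follows the slide's schema (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT, "
    "Salary REAL, JobNo INTEGER)"
)

# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "1 Elm St", 120000, 10),
        (2, "Bob", "2 Oak St", 95000, 20),
        (3, "Cy", "3 Pine St", 100000, 10),
    ],
)

# The precise OLTP-style query from the slide: a row either satisfies it or not.
high_earners = [
    row[0]
    for row in conn.execute("SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name")
]
conn.close()
```

Only the row with Salary strictly above 100000 is returned; a salary of exactly 100000 does not qualify, which is what makes the query precise.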

Fuzzy Sets and Logic
• Fuzzy Set: Set membership function is a real-valued function with output in the range [0, 1].
• f(x): Probability x is in F.
• 1 - f(x): Probability x is not in F.
• Ex:
  - T = {x | x is a person and x is tall}
  - Let f(x) be the probability that x is tall
  - Here f is the membership function
• {x | x ∈ R and x.salary > 100,000} vs. {x | x ∈ R and x is tall}
DM: Prediction and classification are fuzzy.
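
The contrast between the crisp salary set and the fuzzy "tall" set can be sketched as two membership functions. The linear ramp between 160 cm and 190 cm is purely an assumption chosen for illustration; the slides do not define a particular shape for f.

```python
def crisp_high_salary(salary):
    """Crisp set {x | x.salary > 100,000}: membership is exactly 0 or 1."""
    return 1.0 if salary > 100_000 else 0.0


def fuzzy_tall(height_cm, low=160.0, high=190.0):
    """Fuzzy set T of tall people: membership rises linearly from 0 to 1.

    The thresholds `low` and `high` are hypothetical; any real-valued
    function with output in [0, 1] could serve as the membership function.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)
```

With these assumed thresholds, a person of 175 cm has membership 0.5 in T, whereas the crisp salary set never returns anything between 0 and 1.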

Fuzzy Sets & Fuzzy Logic
• Fuzzy logic: reasoning with uncertainty; multiple-valued logic
• Retrieve data with imprecise/missing values
• mem(¬x) = 1 - mem(x)
• mem(x ∧ y) = min(mem(x), mem(y))
• mem(x ∨ y) = max(mem(x), mem(y))
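
The three fuzzy connectives can be written down directly. This is a minimal sketch; the function names and the sample membership values are illustrative and not part of the slides.

```python
def fuzzy_not(a):
    """mem(not x) = 1 - mem(x)"""
    return 1.0 - a


def fuzzy_and(a, b):
    """mem(x and y) = min(mem(x), mem(y))"""
    return min(a, b)


def fuzzy_or(a, b):
    """mem(x or y) = max(mem(x), mem(y))"""
    return max(a, b)


# Hypothetical membership values for one person.
tall = 0.75
rich = 0.25

tall_and_rich = fuzzy_and(tall, rich)            # limited by the weaker membership
tall_or_rich = fuzzy_or(tall, rich)              # driven by the stronger membership
tall_and_not_tall = fuzzy_and(tall, fuzzy_not(tall))
```

Unlike Boolean logic, "tall and not tall" does not collapse to 0 here: with mem(tall) = 0.75 it evaluates to 0.25.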

Classification/Prediction is Fuzzy
[Figure: panels "Simple" and "Fuzzy"; axis label: Loan Amnt; regions: Accept, Reject, Grey area]

Information Retrieval
• Information Retrieval (IR): retrieving desired information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query: Find all documents about “data mining”.
DM: Similarity measures; Mine text/Web data.

Information Retrieval (cont’d)
• Similarity: measure of how close a query is to a document.
• Documents which are “close enough” are retrieved: sim(q, Di); sim(Di, Dj)
• Metrics:
  - Precision = |Relevant and Retrieved| / |Retrieved|
  - Recall = |Relevant and Retrieved| / |Relevant|
• Inverse Document Frequency:
  - IDFk = log(n / |documents containing k|) + 1
• Concept hierarchy [Fig. 2.7]
  - Replace ‘tiger’ with ‘CAT’
  - May be a Directed Acyclic Graph
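
The three formulas above can be checked on a toy collection. In this sketch the document identifiers and term sets are invented, and the natural logarithm is assumed for IDF because the slide does not state a base.

```python
import math


def precision(relevant, retrieved):
    """|Relevant and Retrieved| / |Retrieved|"""
    return len(relevant & retrieved) / len(retrieved)


def recall(relevant, retrieved):
    """|Relevant and Retrieved| / |Relevant|"""
    return len(relevant & retrieved) / len(relevant)


def idf(term, documents):
    """IDF_k = log(n / |documents containing k|) + 1 (natural log assumed).

    Assumes the term occurs in at least one document.
    """
    n = len(documents)
    containing = sum(1 for doc in documents if term in doc)
    return math.log(n / containing) + 1


# Hypothetical query result: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}

# Hypothetical collection, each document reduced to a set of keywords.
documents = [
    {"data", "mining"},
    {"data", "base"},
    {"web", "search"},
    {"data", "warehouse"},
]

p = precision(relevant, retrieved)   # 2 of the 3 retrieved documents are relevant
r = recall(relevant, retrieved)      # 2 of the 4 relevant documents were retrieved
rare = idf("mining", documents)      # term appears in 1 of 4 documents
common = idf("data", documents)      # term appears in 3 of 4 documents
```

A term that appears in fewer documents receives the larger IDF, so "mining" outweighs "data" in this collection.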

IR Query Result Measures and Classification
• Calculate precision/recall
[Figure: panels "IR" and "Classification"]

Decision Support Systems
• Improve decision making by providing specific information needed by management
• Executive Information Systems
• Executive Support Systems
• As a suite of tools, assist in the overall DSS process

Dimensional Modeling
• A different way to view and interrogate data in a DB
• View data in a hierarchical manner, more as business executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related attributes; axis for modeling data.
• Facts: data stored
• Ex:
  - Dimensions: products, locations, date
  - Facts: quantity, unit price
DM: May view data as dimensional.

Relational View of Data

Dimensional Modeling Queries
• Roll Up: more general dimension
• Drill Down: more specific dimension
• Dimension (Aggregation) Hierarchy
• SQL uses aggregation
• Multidimensional schemas
  - star schema
  - snowflake schema
  - fact constellation schema
• Multidimensional indexing
  - bitmap index, join index
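
Roll up and drill down amount to aggregating the same facts at different levels of a dimension hierarchy. The sketch below assumes a two-level location hierarchy (city, then country) and invented sales quantities; it mirrors what a SQL GROUP BY with SUM would do.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quantity).
sales = [
    ("Frankfurt", "Germany", 5),
    ("Frankfurt", "Germany", 3),
    ("Berlin", "Germany", 4),
    ("Toronto", "Canada", 7),
    ("Vancouver", "Canada", 1),
]


def aggregate(rows, key_index, measure_index):
    """Sum the measure per distinct value of one attribute (like GROUP BY + SUM)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[measure_index]
    return dict(totals)


by_city = aggregate(sales, 0, 2)     # more specific level (drill down)
by_country = aggregate(sales, 1, 2)  # more general level (roll up)
```

Because the measure is additive, the country totals equal the sums of their city totals.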

Cube View of Data

Aggregation Hierarchies
• Order relationship: second < minute
• Aggregate: sum (additive)

Star Schema
[Figure: star schema with tables Day, Product, Sales, Division, Location; labels: dimension, facts]
• Aggregate facts for efficiency

Example of Star Schema
• Sales Fact Table
  - Keys: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
• time: time_key, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_street, country
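
A cut-down version of this star schema can be built and queried as a sketch. It assumes Python's built-in sqlite3 module, keeps only the time and item dimensions (branch and location are left out for brevity), and uses invented rows and column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two dimension tables and one fact table; column types are assumptions.
conn.executescript("""
CREATE TABLE "time" (time_key INTEGER PRIMARY KEY, day_of_the_week TEXT,
                     month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE sales (time_key INTEGER REFERENCES "time"(time_key),
                    item_key INTEGER REFERENCES item(item_key),
                    units_sold INTEGER, dollars_sold REAL);
""")

# Hypothetical data, for illustration only.
conn.executemany('INSERT INTO "time" VALUES (?, ?, ?, ?, ?)', [
    (1, "Mon", 1, "Q1", 2024),
    (2, "Tue", 2, "Q1", 2024),
    (3, "Wed", 4, "Q2", 2024),
])
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)", [
    (1, "TV", "Acme", "electronics", "wholesale"),
    (2, "Radio", "Acme", "electronics", "retail"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, 1, 2, 1000.0),
    (2, 1, 1, 500.0),
    (2, 2, 3, 150.0),
    (3, 2, 4, 200.0),
])

# Join the fact table to its dimensions and roll the measure up to quarter level.
rows = conn.execute("""
SELECT t.quarter, i.item_name, SUM(s.dollars_sold)
FROM sales AS s
JOIN "time" AS t ON s.time_key = t.time_key
JOIN item AS i ON s.item_key = i.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter, i.item_name
""").fetchall()
conn.close()
```

Each fact row carries only keys and measures; descriptive attributes such as quarter and item_name are reached through one join per dimension.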

Options to implement star schema [Fig. 2.12]
(a) Flattened: store data for each dimension in exactly one table; roll up by SQL aggregate.
(b) Normalized: a table exists for each level in each dimension; each table has one tuple for every occurrence at the level.
(c) Expanded: the number of dimension tables is the same as in normalized; the lowest-level dimension table is the same as in flattened.
(d) Levelized: has one dimension table, as does the flattened, but aggregations have been performed.

Example of Snowflake Schema
• Sales Fact Table
  - Keys: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
• time: time_key, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
  - table keyed on supplier_key: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
  - table keyed on city_key: city_key, city, province_or_street, country

Example of Fact Constellation (Galaxy schema)
• Sales Fact Table
  - Keys: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table
  - Keys: time_key, item_key, shipper_key, from_location, to_location
  - Measures: dollars_cost, units_shipped
• time: time_key, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_street, country
• table keyed on shipper_key: shipper_key, shipper_name, location_key, shipper_type

Data Warehousing
• “Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon)
• Operational Data: Data used in day-to-day needs of company.
• Informational Data: Supports other functions such as planning and forecasting.
• Data mining tools often access data warehouses rather than operational data.
DM: May access data in warehouse.

What is a Data Warehouse?
• Definition
  - A separately maintained decision-support database, independent of the company's operational databases
  - Supports information processing and analysis by providing a solid platform of consolidated historical data
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing
  - The process of constructing and using data warehouses

Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  - relational databases
  - flat files
  - on-line transaction records
• Data cleaning and data integration techniques are applied
  - Ensure consistency among the different data sources
    - naming conventions
    - encoding structures
    - attribute measures
    - Ex: Hotel price: currency, tax, breakfast covered, etc.
  - Data has already been converted by the time it is moved into the warehouse

Data Warehouse: Time Variant
• The time horizon of a data warehouse is significantly longer than that of operational systems
  - Operational database: current value data.
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  - Contains an element of time, explicitly or implicitly
  - Operational data: does not necessarily contain a “time element”

Data Warehousing
• Traditional DB: operational data; data warehouse: informational data
  - ‘What if’ questions -> warehouse + query
  - e.g., analyze trends from historical data
• Basic components
  - data migration
  - warehouse
  - access tool

Transformation in Data Warehousing
• Transformation [Fig. 2.14]
  - remove unwanted data
  - convert heterogeneous sources into one common format
  - merge snapshots to create a historical view
  - summarize data at different levels
  - add derived data
  - handle missing/erroneous data
  - also called data scrubbing/data staging
• Improve performance of data warehouse applications
  - Summarization
  - Denormalization (speeds up joins!)
  - Partitioning
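
A few of the transformation steps (dropping records with missing data, converting to a common format, adding derived data) can be sketched on hotel prices. Everything concrete here is an assumption made for illustration: the record layout, the exchange rates, and the 10% tax rate.

```python
# Hypothetical conversion rates and tax rate, chosen only for illustration.
RATE_TO_USD = {"USD": 1.0, "EUR": 1.25}
TAX_RATE = 0.10


def transform(records):
    """Bring hotel prices from different sources into one common format.

    Output records hold a tax-inclusive price in USD. Records with a missing
    price or an unknown currency are removed.
    """
    cleaned = []
    for rec in records:
        price = rec.get("price")
        currency = rec.get("currency", "USD")
        if price is None or currency not in RATE_TO_USD:
            continue  # handle missing/erroneous data by dropping the record
        usd = price * RATE_TO_USD[currency]  # common currency
        if not rec.get("tax_included", False):
            usd *= 1 + TAX_RATE  # derived value: tax-inclusive price
        cleaned.append({"hotel": rec["hotel"], "price_usd": round(usd, 2)})
    return cleaned


# Hypothetical source records with differing conventions.
sources = [
    {"hotel": "A", "price": 100, "currency": "EUR", "tax_included": True},
    {"hotel": "B", "price": 80, "currency": "USD", "tax_included": False},
    {"hotel": "C", "price": None, "currency": "USD"},
]

warehouse_rows = transform(sources)
```

Dropping incomplete records is only one policy; imputing a value or flagging the record would be equally valid choices depending on the warehouse's needs.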

Operational vs. Informational

               Operational         Data Warehouse
Application    OLTP                OLAP
Use            Precise Queries     Ad Hoc
Temporal       Snapshot            Historical
Modification   Dynamic             Static
Orientation    Application         Business
Data           Operational Values  Integrated
Size           Gigabits            Terabits
Level          Detailed            Summarized
Access         Often               Less Often
Response       Few Seconds         Minutes
Data Schema    Relational          Star/Snowflake

OLAP
• Online Analytic Processing (OLAP): provides more complex queries than OLTP.
• On-Line Transaction Processing (OLTP): traditional database/transaction processing.
• Dimensional data; cube view
• Visualization of operations:
  - Slice: examine sub-cube.
  - Dice: rotate cube to look at another dimension.
  - Roll Up/Drill Down
DM: May use OLAP queries.

A Concept Hierarchy: Dimension (location)
• Levels, from most general to most specific:
  - all: all
  - region: Europe, ..., North_America
  - country: Germany, ..., Spain; Canada, ..., Mexico
  - city: Frankfurt, ...; Vancouver, ..., Toronto
  - office: L. Chan, ..., M. Wind
• Used for multi-level abstraction (for interactive mining)

Typical OLAP Operations
• Roll up (drill-up): summarize data
  - by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  - from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice: select a part of the cube
  - project and select
• Pivot (rotate):
  - reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
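
Slice, dice, and roll-up can be sketched on a tiny cube held in a dictionary. The representation (cells keyed by a (location, time, item) tuple), the function names, and all values are assumptions for illustration; real OLAP servers store and index cubes very differently.

```python
from collections import defaultdict

DIMS = ("location", "time", "item")

# Hypothetical cube: (location, time, item) -> units sold.
cube = {
    ("Toronto", "Q1", "TV"): 10,
    ("Toronto", "Q2", "TV"): 20,
    ("Vancouver", "Q1", "TV"): 5,
    ("Vancouver", "Q1", "Radio"): 7,
    ("Frankfurt", "Q1", "Radio"): 3,
}


def slice_cube(cube, dim, value):
    """Slice: select on a single dimension, e.g. time = Q1."""
    i = DIMS.index(dim)
    return {key: v for key, v in cube.items() if key[i] == value}


def dice(cube, **allowed):
    """Dice: select on two or more dimensions; each argument is a set of values."""
    return {
        key: v
        for key, v in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }


def roll_up(cube, dim, mapping):
    """Roll up: climb one level of a hierarchy and sum the merged cells."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[key[:i] + (mapping[key[i]],) + key[i + 1:]] += v
    return dict(totals)


CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Frankfurt": "Germany"}

q1 = slice_cube(cube, "time", "Q1")
canada_tv = dice(cube, location={"Toronto", "Vancouver"}, item={"TV"})
by_country = roll_up(cube, "location", CITY_TO_COUNTRY)
```

Drill down is the inverse of `roll_up` and needs the finer-grained cells to be kept, which is why the original cube is left unmodified.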

Cube Operations
• dice (location = x AND time = Y AND item = Z)
• roll-up (city to location)
• drill-down (quarter to month)
• slice (time = Q1)
• pivot

OLAP Operations
[Figure labels: Roll Up, Drill Down, Single Cell, Multiple Cells, Slice, Dice]
• OLAP tools: ROLAP (relational) or MOLAP (multidimensional)
  - ROLAP: a ROLAP server (middleware) creates a multidimensional (MD) view for users
  - MOLAP: specialized DBMS and software that directly support MD data
  - or a hybrid tool

Web Search Engines
• Can be viewed as query systems, like IR systems
• Query: keyword, boolean, weighted, …
• Conventional search engines suffer from
  - Abundance
  - Limited coverage
  - Limited query
  - Limited customization
• Web Mining
  - content/structure/usage
  - Web search => content mining

Statistics
• Simple descriptive models
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Exploratory Data Analysis:
  - Data can actually drive the creation of the model
  - Opposite of traditional statistical view.
• Data mining targeted to business user
DM: Many data mining methods come from statistical techniques.

Machine Learning
• Machine Learning: area of AI that examines how to write programs that can learn.
• Often used in classification and prediction
• Supervised Learning: learns by example.
• Unsupervised Learning: learns without knowledge of correct answers.
• Machine learning often deals with small static datasets.
• [Table 2.3]
DM: Uses many machine learning techniques.
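
"Learning by example" can be illustrated with about the simplest supervised method there is, a one-nearest-neighbor classifier. The choice of technique, the single income feature, and the labeled examples are all assumptions for illustration; the slides do not prescribe a specific algorithm.

```python
def nearest_neighbor_predict(training, x):
    """Return the label of the training example whose feature is closest to x."""
    nearest = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


# Hypothetical labeled examples: (annual income, loan decision).
training = [
    (20_000, "reject"),
    (35_000, "reject"),
    (60_000, "accept"),
    (90_000, "accept"),
]

low_income_decision = nearest_neighbor_predict(training, 40_000)
high_income_decision = nearest_neighbor_predict(training, 70_000)
```

The correct answers in `training` are what make this supervised; an unsupervised method would receive only the incomes and have to group them without labels.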

Pattern Matching (Recognition)
• Pattern Matching: finds occurrences of a predefined pattern in the data.
• Applications include speech recognition, information retrieval, time series analysis.
DM: Type of classification.
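
Finding occurrences of a predefined pattern can be sketched for the time-series case with exact matching over a sliding window. The series and pattern are invented, and exact equality is a simplifying assumption; practical applications usually need approximate matching.

```python
def find_pattern(series, pattern):
    """Return every start index at which `pattern` occurs exactly in `series`."""
    m = len(pattern)
    return [i for i in range(len(series) - m + 1) if series[i:i + m] == pattern]


# Hypothetical time series and predefined pattern.
series = [1, 2, 3, 1, 2, 3, 4, 1, 2]
pattern = [1, 2, 3]

occurrences = find_pattern(series, pattern)
```

The trailing `[1, 2]` at the end of the series is not reported because the window must contain the whole pattern.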

DM vs. Related Topics