Data Mining Concepts and Techniques Slides for Textbook
Data Mining: Concepts and Techniques
Slides for Textbook, Appendix A
©Jiawei Han and Micheline Kamber
Slides contributed by Jian Pei (peijian@cs.sfu.ca)
Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca
10/7/2020

Appendix A: An Introduction to Microsoft's OLE DB for Data Mining
- Introduction
- Overview and design philosophy
- Basic components
- Data set components
- Data mining models
- Operations on data model
- Concluding remarks

Why OLE DB for Data Mining?
- An industry standard is critical for data mining development, usage, interoperability, and exchange
- OLE DB for DM is a natural evolution from OLE DB and OLE DB for OLAP
- Building mining applications over relational databases is nontrivial
  - Need different customized data mining algorithms and methods
  - Significant work on the part of application builders
- Goal: ease the burden of developing mining applications in large relational databases

Motivation of OLE DB for DM
- Facilitate deployment of data mining models
  - Generate data mining models
  - Store, maintain and refresh models as data is updated
  - Programmatically use the model on other data sets
  - Browse models
- Enable enterprise application developers to participate in building data mining solutions

Features of OLE DB for DM
- Independent of provider or software
- Not specialized to any specific mining model
  - Structured to cater to all well-known mining models
- Part of upcoming release of Microsoft SQL Server 2000

Overview
- Core relational engine exposes OLE DB in a language-based API
- Analysis server exposes OLE DB OLAP and OLE DB DM
- Maintain SQL metaphor
- Reuse existing notions
[Figure: layered architecture, top to bottom: Data mining applications; OLE DB OLAP/DM; Analysis Server; OLE DB; RDB engine]

Key Operations to Support Data Mining Models
- Define a mining model
  - Attributes to be predicted
  - Attributes to be used for prediction
  - Algorithm used to build the model
- Populate a mining model from training data
- Predict attributes for new data
- Browse a mining model for reporting and visualization

DMM As Analogous to a Table in SQL
- Create a data mining model object
  - CREATE MINING MODEL [model_name]
- Insert training data into the model and train it
  - INSERT INTO [model_name]
- Use the data mining model
  - SELECT relation_name.[id], [model_name].[predict_attr]
  - Consult DMM content in order to make predictions and browse statistics obtained by the model
- Use DELETE to empty/reset the model
- Predictions on datasets: prediction join between a model and a data set (tables)
- Deploy DMM by just writing SQL queries!

Two Basic Components
- Cases/caseset: input data
  - A table or nested tables (for hierarchical data)
- Data mining model (DMM): a special type of table
  - A caseset is associated with a DMM and meta-info while creating a DMM
  - Saves the mining algorithm and resulting abstraction instead of the data itself
  - Fundamental operations: CREATE, INSERT INTO, PREDICTION JOIN, SELECT, DELETE FROM, and DROP

Flattened Representation of Caseset
Source tables:
- Customers: Customer ID, Gender, Hair Color, Age, Age Prob
- Product Purchases: Customer ID, Product Name, Quantity, Product Type
- Car Ownership: Customer ID, Car, Car Prob

Flattened caseset for customer 1:

CID  Gend  Hair   Age  Age prob  Prod  Quan  Type  Car  Car prob
1    Male  Black  35   100%      TV    1     Elec  Car  100%
1    Male  Black  35   100%      VCR   1     Elec  Car  100%
1    Male  Black  35   100%      Ham   6     Food  Car  100%
1    Male  Black  35   100%      TV    1     Elec  Van  50%
1    Male  Black  35   100%      VCR   1     Elec  Van  50%
1    Male  Black  35   100%      Ham   6     Food  Van  50%

Problem: Lots of replication!
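
The replication on this slide follows from joining one parent row with every combination of its child rows. The following Python sketch is only an illustration of that effect, not part of OLE DB for DM: the dictionaries, the `flatten` helper, and the field names are hypothetical stand-ins for the slide's three tables.

```python
from itertools import product

# Hypothetical in-memory copy of the slide's single customer case.
customer = {"cid": 1, "gender": "Male", "hair": "Black", "age": 35, "age_prob": 1.0}
purchases = [("TV", 1, "Elec"), ("VCR", 1, "Elec"), ("Ham", 6, "Food")]
cars = [("Car", 1.0), ("Van", 0.5)]


def flatten(customer, purchases, cars):
    """Repeat the parent row once per (purchase, car) pair, as a flat table must."""
    rows = []
    for (prod, quan, ptype), (car, car_prob) in product(purchases, cars):
        rows.append({**customer, "prod": prod, "quan": quan, "type": ptype,
                     "car": car, "car_prob": car_prob})
    return rows


flat_rows = flatten(customer, purchases, cars)
print(len(flat_rows))  # 3 purchases x 2 cars = 6 rows for one customer
```

Every one of the six rows repeats the same gender, hair color, and age, which is the redundancy the nested representation avoids.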

Logical Nested Table Representation of Caseset
- Use Data Shaping Service to generate a hierarchical rowset
  - Part of Microsoft Data Access Components (MDAC) products

CID: 1   Gend: Male   Hair: Black   Age: 35   Age prob: 100%
  Product Purchases
    Prod  Quan  Type
    TV    1     Elec
    VCR   1     Elec
    Ham   6     Food
  Car Ownership
    Car  Car prob
    Car  100%
    Van  50%

More About Nested Table
- Not necessary for the storage subsystem to support nested records
- Cases are only instantiated as nested rowsets prior to training/predicting data mining models
- Same physical data may be used to generate different casesets
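
As a rough illustration of the last point, the Python sketch below groups the same flat rows into two different nested casesets. It is not the Data Shaping Service; the `shape` helper, the second customer, and all field names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical flat tables; customer 2 is invented for illustration.
customers = [
    {"cid": 1, "gender": "Male", "age": 35},
    {"cid": 2, "gender": "Female", "age": 28},
]
sales = [
    {"cid": 1, "product": "TV", "quantity": 1},
    {"cid": 1, "product": "Ham", "quantity": 6},
    {"cid": 2, "product": "VCR", "quantity": 1},
]


def shape(parents, children, key, nested_name):
    """Attach to each parent row the child rows that share its key value."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append({k: v for k, v in child.items() if k != key})
    return [{**parent, nested_name: grouped.get(parent[key], [])} for parent in parents]


# Caseset 1: one case per customer, purchases nested inside.
by_customer = shape(customers, sales, "cid", "purchases")

# Caseset 2: one case per product, built from the same sales rows.
products = [{"product": name} for name in sorted({s["product"] for s in sales})]
by_product = shape(products, sales, "product", "buyers")

print(by_customer[0])
print(by_product[0])
```

The flat tables are never stored in nested form; the nesting exists only in the rowsets handed to the model.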

Defining a Data Mining Model
- The name of the model
- The algorithm and parameters
- The columns of caseset and the relationships among columns
  - "Source columns" and "prediction columns"

Example

    CREATE MINING MODEL [Age Prediction]                         %Name of Model
    (
        [Customer ID]       LONG KEY,                            %source column
        [Gender]            TEXT DISCRETE,                       %source column
        [Age]               DOUBLE DISCRETIZED() PREDICT,        %prediction column
        [Product Purchases] TABLE                                %source column
        (
            [Product Name]  TEXT KEY,                            %source column
            [Quantity]      DOUBLE NORMAL CONTINUOUS,            %source column
            [Product Type]  TEXT DISCRETE
                            RELATED TO [Product Name]            %source column
        )
    )
    USING [Decision_Trees_101]                                   %Mining algorithm used

Column Specifiers
- KEY
- ATTRIBUTE
- RELATION (RELATED TO clause)
- QUALIFIER (OF clause)
  - PROBABILITY: [0, 1]
  - VARIANCE
  - SUPPORT
  - PROBABILITY-VARIANCE
  - ORDER
- TABLE

Attribute Types
- DISCRETE
- ORDERED
- CYCLICAL
- CONTINUOUS
- DISCRETIZED
- SEQUENCE_TIME
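
To make the DISCRETIZED type concrete, here is a minimal Python sketch that maps a continuous attribute to bucket indices. The slides do not say which discretization method a provider uses, so equal-width bucketing, the `discretize` function, and the sample ages are all assumptions for illustration.

```python
def discretize(values, n_buckets):
    """Equal-width bucketing (an assumed method): map each value to a bucket index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width when all values are equal
    # The maximum value would land in bucket n_buckets, so clamp it into the last one.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]


ages = [18, 25, 35, 42, 60]  # hypothetical continuous Age values
buckets = discretize(ages, 3)
print(buckets)  # [0, 0, 1, 1, 2]
```

A model trained on the bucket indices treats Age as a small set of discrete states rather than as a continuous number.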

Populating a DMM
- Use INSERT INTO statement
- Consuming a case using the data mining model
- Use SHAPE statement to create the nested table from the input data

Example: Populating a DMM

    INSERT INTO [Age Prediction]
    (
        [Customer ID], [Gender], [Age],
        [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
    )
    SHAPE
        {SELECT [Customer ID], [Gender], [Age]
         FROM Customers ORDER BY [Customer ID]}
    APPEND
        ({SELECT [Cust. ID], [Product Name], [Quantity], [Product Type]
          FROM Sales ORDER BY [Cust. ID]}
         RELATE [Customer ID] TO [Cust. ID]
        ) AS [Product Purchases]

Using Data Model to Predict
- Prediction join
  - Prediction on dataset D using DMM M
  - Different from equi-join
- DMM: a "truth table"
- SELECT statement associated with PREDICTION JOIN specifies values extracted from DMM
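
One way to picture the "truth table" view is a lookup that returns a predicted value for every input case. The Python sketch below is a toy analogy under stated assumptions, not the OLE DB for DM mechanism: `ToyModel`, its majority-vote rule, the fallback for unseen inputs, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


class ToyModel:
    """Hypothetical stand-in for a DMM: majority vote of age group per gender."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.overall = Counter()

    def insert(self, cases):
        """Loose analog of INSERT INTO: consume (gender, age_group) training cases."""
        for gender, age_group in cases:
            self.counts[gender][age_group] += 1
            self.overall[age_group] += 1

    def predict(self, gender):
        # Unseen input values fall back to the overall majority (an assumption).
        votes = self.counts.get(gender) or self.overall
        return votes.most_common(1)[0][0]


def prediction_join(model, rows):
    """Pair every input row with a predicted value; no input row is dropped."""
    return [(cid, model.predict(gender)) for cid, gender in rows]


model = ToyModel()
model.insert([("Male", "30-39"), ("Male", "30-39"), ("Female", "20-29")])
new_customers = [(101, "Male"), (102, "Female"), (103, "Unknown")]
predictions = prediction_join(model, new_customers)
print(predictions)
```

In this toy version, customer 103 still receives a prediction even though "Unknown" never appeared in training, whereas an equi-join against a table of training rows would return nothing for that customer.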

Example: Using a DMM in Prediction

    SELECT t.[Customer ID], [Age Prediction].[Age]
    FROM [Age Prediction]
    PREDICTION JOIN
        (SHAPE
            {SELECT [Customer ID], [Gender]
             FROM Customers ORDER BY [Customer ID]}
         APPEND
            ({SELECT [Cust. ID], [Product Name], [Quantity]
              FROM Sales ORDER BY [Cust. ID]}
             RELATE [Customer ID] TO [Cust. ID]
            ) AS [Product Purchases]
        ) AS t
    ON  [Age Prediction].[Gender] = t.[Gender]
    AND [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name]
    AND [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]

Browsing DMM
- What is in a DMM?
  - Rules, formulas, trees, etc.
- Browsing DMM
  - Visualization

Concluding Remarks
- OLE DB for DM integrates data mining and database systems
  - A good standard for mining application builders
- How can we be involved?
  - Provide association/sequential pattern mining modules for OLE DB for DM?
  - Design more concrete language primitives?
- References
  - http://www.microsoft.com/data.oledb/dm.html

http://www.cs.sfu.ca/~han
Thank you!