Overview of SQL Server Data Mining CSD 305

  • Slides: 15
Download presentation
Overview of SQL Server Data Mining CSD 305 Advanced Databases

Overview of SQL Server Data Mining CSD 305 Advanced Databases

Data Mining Definition • Mining • Act of excavation in the earth from which

Data Mining Definition • Mining • Act of excavation in the earth from which ore or minerals can be extracted • Act of excavation in the data from which patterns can be extracted • Alternative name: Knowledge discovery in databases (KDD) • Multiple disciplines: database, statistics, artificial intelligence • Fast maturing technology • Unlimited applicability CSD 305 Advanced Databases • Data Mining 2

Data Mining Process Train the model Test the model Training Data Mining Management System

Data Mining Process Train the model Test the model Training Data Mining Management System (DMMS) Test Data Mining Model Prediction using the model Prediction Input Data CSD 305 Advanced Databases Define a model 3

CSD 305 Advanced Databases Data Mining Tasks 4

CSD 305 Advanced Databases Data Mining Tasks 4

Data Mining Problems Classification (prediction) • Is this student going to go to a

Data Mining Problems Classification (prediction) • Is this student going to go to a college? • Similar questions: • • • Is this a spam email? (spam filtering) How good/bad is your credit? (credit scoring) Recognition of hand-written letters (pen recognition) What is this gene like? (bioinformatics) Does this person behave like a terrorist? (TIA) CSD 305 Advanced Databases • Based on Gender, Parent. Income, Parent. Encouragement, IQ, etc. • E. g. , if Parent. Encouragement=Yes and IQ>100, College=Yes è Classification (prediction) 5

Decision Tree Attend College 55% Yes 45% No Attend College 79% Yes 21% No

Decision Tree Attend College 55% Yes 45% No Attend College 79% Yes 21% No Encouragement = Encouraged Attend College 94% Yes 6% No Attend College 35% Yes 65% No Encouragement = Not Encouraged Attend College 69% Yes 31% No CSD 305 Advanced Databases IQ <= 100 IQ > 100 6

Data Mining Problems Regression (prediction) • Based on Hobby, Marital. Status, Number. Of. Children,

Data Mining Problems Regression (prediction) • Based on Hobby, Marital. Status, Number. Of. Children, Income, House. Ownership, Number. Of. Cars, … • E. g. , If Marital. Status=Yes, Age = 20+4*Number. Of. Children+0. 0001*Income+… Regression (prediction) • Similar questions: • What’s the sales amount of ice cream next month? (sales prediction) • What’s the stock price of MSFT next week? (stock prediction) • What’s the income of a customer? (marketing) • What’s the life-time of a software bug? (bug tracking) CSD 305 Advanced Databases • What is the age of a person? 7

Data Mining Problems Segmentation (Clustering) • Who are my Web visitors? • Similar questions:

Data Mining Problems Segmentation (Clustering) • Who are my Web visitors? • Similar questions: • Identify groups of genes (bioinformatics) • Identify groups of locations of Cholera incidents in London (spatial data mining) • Identify group of customers in merchants (Amazon, E-Bay, MSN, Wal. Mart etc) (target marketing) • Identify groups of documents. (text categorisation) CSD 305 Advanced Databases • Identify similar groups based on demographics, visiting patterns • E. g. , Daily news readers, email users, shoppers, short-stayers, etc è Segmentation (clustering) 8

Data Mining Problems Association Analysis (recommendation, market analysis) • Based on previous purchases (shopping

Data Mining Problems Association Analysis (recommendation, market analysis) • Based on previous purchases (shopping cart) • E. g. , If a digital camera is purchased, flash memory, battery, printer are also purchased. èAssociation Analysis (recommendation, market basket analysis) • Similar questions: • What products to recommend in on-line stores such as Amazon. com. • What items should be displayed together in merchant. • What genes appear together in toxic mushrooms. CSD 305 Advanced Databases • What other products are purchased together with a digital camera? 9

Data Mining Problems Anomaly detection (outlier detection) • Predict likelihood of the network packet

Data Mining Problems Anomaly detection (outlier detection) • Predict likelihood of the network packet pattern è Anomaly detection (outlier detection) • Similar questions: • Are the hospital lab results normal (Adverse drug effect detection) • Is this credit transaction fraudulent? (fraud detection) • Does this person behave unusually, maybe worth highlevel of security clearance? (TIA) CSD 305 Advanced Databases • Could this network packet be from a virus attack? 10

 • Classification • Regression • Segmentation • Association Analysis • Anomaly detection •

• Classification • Regression • Segmentation • Association Analysis • Anomaly detection • Sequence Analysis • Time-series Analysis • Text categorization • Others CSD 305 Advanced Databases Data Mining Tasks - Summary 11

 • Decision Trees • Naïve Bayesian • Clustering • Sequence Clustering • Association

• Decision Trees • Naïve Bayesian • Clustering • Sequence Clustering • Association Rules • Neural Network • Time Series • Support Vector Machines • …. CSD 305 Advanced Databases Data Mining Algorithms 12

√ √ √ √ √ or k et w eu N √ √ √

√ √ √ √ √ or k et w eu N √ √ √ Classification Regression Segmentaion Assoc. Analysis Anomaly Detect. Seq. Analysis √ √ - first choice ra l. N ia tio oc ss A e C Se lu rie s st er n ru in g √ √ √ Ti m st lu C √ √ Se q. er in ay B ve aï N √ g es ee s Tr is io n ec D √ √ le s Data Mining Algorithms √ - second choice CSD 305 Advanced Databases Time series 13

Data Mining Vendors • SAS (Analytics) • http: //www. sas. com/technologies/analytics/index. html • IBM

Data Mining Vendors • SAS (Analytics) • http: //www. sas. com/technologies/analytics/index. html • IBM (DB 2 Info. Sphere Warehouse) • Oracle (ODM option to Oracle 11 g) • • • http: //www. oracle. com/technetwork/database/options/odm/index. html SPSS (Clementine) Insightsful (Insightful Miner) KXEN (Analytic Framework) Prudsys (Discoverer and its family) Microsoft (SQL Server 2012) Angoss (Knowledge. Server and its family) DBMiner (DBMiner) … and many others CSD 305 Advanced Databases • http: //www 01. ibm. com/software/data/infosphere/warehouse/mining. html 14

 • Pang-Ning Tan, Michael Steinbach, Vipin Kumar “Introduction to Data Mining” Pearson Education,

• Pang-Ning Tan, Michael Steinbach, Vipin Kumar “Introduction to Data Mining” Pearson Education, 2006 • T. Marakas “Modern Data Warehousing, Mining and Visualization: Core Concepts” Prentice Hall 2003 • http: //msdn. microsoft. com/enus/library/ms 132058. aspx • Data mining extensions to SQL server • http: //troels. arvin. dk/db/rdbms/ • Comparison of different SQL implementations • Also see the wikibook here http: //en. wikibooks. org/wiki/SQL_dialects_reference CSD 305 Advanced Databases References 15