Extensions to SQL for data mining CSD 305

  • Slides: 25
Download presentation
Extensions to SQL for data mining CSD 305 Advanced Databases Data mining algorithms 1

Extensions to SQL for data mining CSD 305 Advanced Databases Data mining algorithms 1

Data Mining Algorithms • A data mining algorithm is • Analyzes the data you

Data Mining Algorithms • A data mining algorithm is • Analyzes the data you provide, looking for specific types of patterns. • Uses the results of this analysis to define the best parameters for creating the mining model (training). • These parameters are then applied across the entire data set to extract patterns and statistics. CSD 305 Advanced Databases • a set of heuristics and calculations that creates a data mining model from data. 2

 • A set of clusters that describe how the cases in a dataset

• A set of clusters that describe how the cases in a dataset are related. • A decision tree that predicts an outcome. • A mathematical model that forecasts sales. • A set of rules that describe how products are grouped together in a transaction. CSD 305 Advanced Databases The mining model could be any of these: 3

How do you choose the right algorithm to use? CSD 305 Advanced Databases •

How do you choose the right algorithm to use? CSD 305 Advanced Databases • By type of algorithm or • By type of problem 4

 • Classification algorithms predict one or more discrete variables, based on the other

• Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset. • Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. • Segmentation algorithms divide data into groups, or clusters, of items that have similar properties. • Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis. • Sequence analysis algorithms summarize frequent sequences in data, such as a Web path flow. CSD 305 Advanced Databases Choosing an Algorithm by Type 5

Choosing an Algorithm by Task Flag the customers in a prospective buyers list as

Choosing an Algorithm by Task Flag the customers in a prospective buyers list as good or poor prospects. Calculate the probability that a server will fail within the next 6 months. Categorize patient outcomes and explore related factors. Microsoft algorithms to use Microsoft Decision Trees Algorithm Microsoft Naive Bayes Algorithm Microsoft Clustering Algorithm Microsoft Neural Network Algorithm CSD 305 Advanced Databases Predicting a discrete attribute examples 6

Choosing an Algorithm by Task • Forecast next year's sales. • Predict site visitors

Choosing an Algorithm by Task • Forecast next year's sales. • Predict site visitors given past historical and seasonal trends. • Generate a risk score given demographics. Microsoft algorithms to use Microsoft Decision Trees Algorithm Microsoft Time Series Algorithm Microsoft Linear Regression Algorithm CSD 305 Advanced Databases Predicting a continuous attribute examples 7

Choosing an Algorithm by Task • Perform clickstream analysis of a company's Web site.

Choosing an Algorithm by Task • Perform clickstream analysis of a company's Web site. • Analyze the factors leading to server failure. • Capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities. Microsoft algorithms to use Microsoft Sequence Clustering Algorithm CSD 305 Advanced Databases Predicting a sequence examples 8

Choosing an Algorithm by Task • Use market basket analysis to determine product placement.

Choosing an Algorithm by Task • Use market basket analysis to determine product placement. • Suggest additional products to a customer for purchase. • Analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities. Microsoft algorithms to use Microsoft Association Algorithm Microsoft Decision Trees Algorithm CSD 305 Advanced Databases Finding groups of common items in transactions examples 9

Choosing an Algorithm by Task • Create patient risk profiles groups based on attributes

Choosing an Algorithm by Task • Create patient risk profiles groups based on attributes such as demographics and behaviours. • Analyze users by browsing and buying patterns. • Identify servers that have similar usage characteristics. Microsoft algorithms to use Microsoft Clustering Algorithm Microsoft Sequence Clustering Algorithm CSD 305 Advanced Databases Finding groups of similar items examples 10

 • We need to be able to consider the accuracy of predictions from

• We need to be able to consider the accuracy of predictions from a number of different algorithms to help choose which is best • Example in Classification spreadsheet shows accuracy and error calculations for a binary classification • An Association mining structure example is also illustrated CSD 305 Advanced Databases Accuracy of predictions 11

Classification is the task of learning a target function f that maps each attribute

Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y Used for Descriptive Modelling and predictive modelling Descriptive Modelling Summarise data and define which features define a vertebrate as a mammal, reptile, bird or fish CSD 305 Advanced Databases Descriptive Modelling – Training data 12

A classification model can be used to predict the class label of unknown records.

A classification model can be used to predict the class label of unknown records. Can be treated as a black box, it automatically assigns class label when presented with and attribute set of unknown record. CSD 305 Advanced Databases Predictive Modelling 13

for a 2 -class problem F 11 correct Actual/predicted class=1 F 10 wrong Actual

for a 2 -class problem F 11 correct Actual/predicted class=1 F 10 wrong Actual class = 1 predicted class=0 F 01 wrong Actual 0 predicted 1 F 00 correct Actual 0 predicted 0 • Based on the counts of test records correctly and incorrectly modelled. • Tabulated in confusion matrix. • • • Total number of correct predictions is f 11+f 00 Total number of incorrect f 10+f 01 Although it provides information to determine how well a classification model performs, summarising this info will give a single number to help compare performance of different models. • • CSD 305 Advanced Databases Confusion Matrix Accuracy Or error rate 14

CSD 305 Advanced Databases Data Mining Modeling and Language 15

CSD 305 Advanced Databases Data Mining Modeling and Language 15

Data Mining Language • Large spectrum of applications: embedded to interactive BI • Interoperability

Data Mining Language • Large spectrum of applications: embedded to interactive BI • Interoperability between different DM providers (engine) and DM consumers (tools) • Data independence between content representation (trees, attributes, networks, etc) and data mining task (prediction, scoring, etc) • Requirements: • • Algorithm-neutral Task-oriented (specification of what we need, rather than how to) Vendor-neutral Flexible, extensible, declarative/self-contained • Sound familiar? • Yes, SQL CSD 305 Advanced Databases • New challenges in data mining API 16

SQL Revolution (1970’s) Before After Architecture File system, Hierarchical/network DB Relational DB API Proprietary

SQL Revolution (1970’s) Before After Architecture File system, Hierarchical/network DB Relational DB API Proprietary ISAM, X/OPEN CLI, etc SQL Data independence Physical model tied to logical model (appl logic) Physical model change requires re-develop the apps. Clear separation between physical/logical model No more app changes due to physical model update Appl dev tools Not many. Custom dev with consulting services Commodity. Product services than consulting services SQL (w/ RDB) is the biggest contributor to the maturity of DB industry. CSD 305 Advanced Databases 17

DMX Approach • Data Mining Extensions (DMX) to SQL • Table vs. Mining Model

DMX Approach • Data Mining Extensions (DMX) to SQL • Table vs. Mining Model TABLE MINING MODEL schema Column definition Attribute (variable) definition contains Rows Patterns, knowledge, cases DDL (create, drop, alter) Create/drop/alter a model DML (insert, delete) Train (populate) a model Query (select) Prediction/browsing a model operations CSD 305 Advanced Databases 18

Typical DM Process Using DMX Define a model: Train a model: INSERT INTO dmm

Typical DM Process Using DMX Define a model: Train a model: INSERT INTO dmm …. Data Mining Management System (DMMS) Training Data Prediction using a model: Mining Model CSD 305 Advanced Databases CREATE MINING MODEL …. SELECT … FROM dmm PREDICTION JOIN … Prediction Input Data 19

Defining a DM Model Defines • • Shape of “training cases” (top-level entity being

Defining a DM Model Defines • • Shape of “training cases” (top-level entity being modeled) Input/output attributes (variables): type, distribution Algorithms and parameters Example CREATE MINING MODEL College. Plan. Model (Student. ID LONG KEY, Gender TEXT DISCRETE, Parent. Income LONG NORMAL CONTINUOUS, Encouragement TEXT DISCRETE, College. Plans TEXT DISCRETE PREDICT ) USING Microsoft_Decision_Trees (complexity_penalty = 0. 5) CSD 305 Advanced Databases • 20

Training (processing) a DM Model • • • Accessing the training data possibly outside

Training (processing) a DM Model • • • Accessing the training data possibly outside the system Transformation (e. g. , discretization, normalization) Tokenization, numeric conversion, feature selection, etc. Learn the algorithm Persistency of patterns discovered • Multiple ways to specify training data CSD 305 Advanced Databases • Simply issue INSERT with training data • DMMS (data mining in Microsoft SQL Server) takes care of everything: • SELECT, OPENROWSET, SHAPE, etc. 21

INSERT INTO College. Plan. Model (Student. ID, Gender, Parent. Income, Encouragement, College. Plans) OPENROWSET(‘<provider>’,

INSERT INTO College. Plan. Model (Student. ID, Gender, Parent. Income, Encouragement, College. Plans) OPENROWSET(‘<provider>’, ‘<connection>’, ‘SELECT Student. ID, Gender, Parent. Income, Encouragement, College. Plans FROM College. Plans. Train. Data’) CSD 305 Advanced Databases Training a DM Model: Simple 22

Prediction Using a DM Model SELECT t. ID, CPModel. Plan FROM CPModel PREDICTION JOIN

Prediction Using a DM Model SELECT t. ID, CPModel. Plan FROM CPModel PREDICTION JOIN OPENQUERY(…, ‘SELECT * FROM New. Students’) AS t ON CPModel. Gender = t. Gender AND CPModel. IQ = t. IQ CPModel ID Gender IQ Plan ID Gender IQ New. Students CSD 305 Advanced Databases • PREDICTION JOIN 23

Your data mining exercises CSD 305 Advanced Databases • In the tutorial you will

Your data mining exercises CSD 305 Advanced Databases • In the tutorial you will explore the data mining that is possible in SQL Server 2017 Analytical Services • We will be using Adventureworks. DW 24

Adventure Works • Based on a fictional bicycle manufacturing company named Adventure Works Cycles.

Adventure Works • Based on a fictional bicycle manufacturing company named Adventure Works Cycles. • Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. • The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base. CSD 305 Advanced Databases • Adventure. Works. DW 25