
COURSE INTRODUCTION CSC 576: Data Mining


Today
• What is Data Mining?
• Syllabus / Course Webpage
• Types of Data

What is Data Mining?
How would you define data mining?
Data Mining and Business Analytics deal with collecting and analyzing data for better decision making.
Goal: solving business problems
• Data collection (more and more data is being collected)
• Warehousing of data (readily available for analysis; data from numerous sources already integrated)
• Computer storage and computing power get cheaper every day

Data Mining
… blends traditional data analysis (mathematical + statistical) with sophisticated machine learning algorithms.
Figure labels:
• Math
• CS: programming ability to process big data
• Business: businesses interested in decision making
• “Art” of data mining

Predictive Data Mining
Moving from data to insights to decisions.

Data Mining Applications
Businesses collect lots of data:
• Purchase information
• Web site browsing habits
• Social network data
Business goals: customer profiling, targeted marketing, fraud detection
Questions that an analyst will try to answer by data mining:
• “Who are the most profitable customers?”
• “What products can be cross-sold?”
• “What is the revenue outlook for the company next year?”
Many variables are collected; few turn out to be useful.

More Applications
• Price Prediction
• Fraud Detection
• Risk Assessment
• Diagnosis

What we will do in this Course
• Learn Basic-to-Intermediate Data Mining Techniques
• Apply them on Datasets
• Program using Python
• Read, Understand, Discuss, Critique Scientific Papers
• Perform Significant Individual Data Mining

Syllabus / Course Webpage


What is Data Mining?
• “the process of automatically discovering useful information in large data repositories”
• “to find novel and useful patterns that might otherwise remain unknown”
What is NOT Data Mining?
• “looking up records in a MySQL database” (database)
• “finding relevant web pages based on a Google search query” (information retrieval)

Data Mining and Knowledge Discovery
Process of converting raw data into useful information
Input Data
• MySQL
• .csv
• JSON
• Twitter API
Data Preprocessing
• Feature Selection
• Dimensionality Reduction
• Normalization
Data Mining
• Decision Trees
• Support Vector Machines
• Linear Regression
• Neural Networks
Postprocessing
• Visualization
• Pattern Interpretation
Reporting to Boss
• “closing the loop”
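Normalization, one of the preprocessing steps named above, can be illustrated with min-max scaling. This is a minimal sketch in plain Python under stated assumptions: the function name `min_max_normalize` and the sample income values are hypothetical and do not come from the course material.

```python
def min_max_normalize(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20_000, 50_000, 80_000]  # hypothetical attribute values
scaled = min_max_normalize(incomes)
print(scaled)  # [0.0, 0.5, 1.0]
```

Scaling attributes to a common range keeps an attribute with large raw values (such as income) from dominating one with small values (such as age) in distance-based methods.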

Input Data
• Data is available in a variety of formats:
  • Flat files (.csv or .txt)
  • Spreadsheets (Excel .xls is tougher to deal with)
  • Relational tables (MySQL)
  • Text, data on a web page (scraping necessary)
• Big Data / Data Warehouse
  • Data spread out over multiple locations
  • CS programming ability often necessary
• Sometimes an enormous amount of effort
  • Digitizing hand-written notes
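Loading flat files and JSON into one common record format might look like the sketch below. It uses only Python's standard `csv`, `io`, and `json` modules; the inline strings stand in for a real .csv file and a JSON response, and the field names `customer_id` and `purchases` are made up for illustration.

```python
import csv
import io
import json

# Hypothetical inline data standing in for a .csv file and a JSON response.
CSV_TEXT = "customer_id,purchases\n1,3\n2,7\n"
JSON_TEXT = '[{"customer_id": 3, "purchases": 5}]'


def load_csv(text):
    """Parse CSV text into a list of dicts, converting every field to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


# Both sources end up as the same structure: one dict per record.
records = load_csv(CSV_TEXT) + load_json(JSON_TEXT)
print(len(records))
```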

Preprocessing
To transform raw input data into an appropriate format for subsequent analysis
• Fusing data from multiple sources
• Cleaning data to remove noise
• Duplicate observations
• “garbage in – garbage out” also applies to data mining
• Selecting records and features that are relevant to the data mining task at hand
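The preprocessing steps above (fusing sources, removing duplicates, cleaning, selecting features) can be sketched in a few lines of plain Python. The records, the field names, and the rule that a row with a missing value counts as noise are all assumptions made for this example.

```python
# Hypothetical records from two sources; None marks a missing value.
source_a = [
    {"id": 1, "age": 34, "income": 52000, "notes": "ok"},
    {"id": 2, "age": None, "income": 61000, "notes": "ok"},
]
source_b = [
    {"id": 1, "age": 34, "income": 52000, "notes": "ok"},  # duplicate of id 1
    {"id": 3, "age": 45, "income": 48000, "notes": "call back"},
]


def preprocess(sources, features):
    """Fuse sources, skip duplicate ids and incomplete rows, keep chosen features."""
    seen, cleaned = set(), []
    for record in (r for source in sources for r in source):
        if record["id"] in seen:
            continue  # duplicate observation
        seen.add(record["id"])
        if any(record[f] is None for f in features):
            continue  # incomplete row, treated as noise in this sketch
        cleaned.append({f: record[f] for f in features})
    return cleaned


clean = preprocess([source_a, source_b], ["id", "age", "income"])
print(clean)
```

Dropping incomplete rows is only one possible policy; filling in missing values is another, and the right choice depends on the data mining task.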

Data Mining
Applying Appropriate Data Mining Task
• Linear Regression
• Support Vector Machines
• Decision Trees
• Clustering
• …
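As a small illustration of the first technique in this list, the sketch below fits a one-variable linear regression by ordinary least squares. The function name `fit_line` and the size/price numbers are hypothetical; the data was chosen to lie exactly on the line y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: size vs. price, constructed so that price = 2 * size + 1.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(sizes, prices)
print(slope, intercept)
```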

Postprocessing
Performing:
• Visualization
• Statistical significance tests, confidence intervals, hypothesis testing to eliminate spurious data mining results (yikes, math!)
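One way to check whether a discovered difference is spurious is a permutation test, sketched below with the standard library only. The slides do not name a specific test, so this choice is an assumption, as are the function name `permutation_p_value` and the spending figures.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling yields a mean gap at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return hits / n_permutations


# Hypothetical spend per customer with and without a promotion.
promo = [12.1, 13.4, 12.8, 14.0, 13.1, 12.9]
control = [10.2, 9.8, 10.9, 10.1, 9.5, 10.4]
p = permutation_p_value(promo, control)
print(p)
```

A small p-value suggests the gap between the groups is unlikely to come from chance labeling alone; a large one is a warning that the "pattern" may be noise.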

Challenges of Data Mining
Scalability
• Gigabytes, terabytes, petabytes, exabytes of data
• Storage, processing
• “are data mining algorithms scalable?”
• Limits of Python statistical framework libraries
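One common answer to the scalability question is to use single-pass (streaming) algorithms that never hold the whole dataset in memory. The sketch below uses Welford's online method for mean and variance; the generator is a stand-in for a file too large to load at once, and the function name is hypothetical.

```python
def running_mean_variance(stream):
    """Make one pass over any iterable; return (count, mean, sample variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # accumulates the sum of squared deviations
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# A generator stands in for data that is read piece by piece from disk.
count, mean, variance = running_mean_variance(float(i) for i in range(1, 100_001))
print(count, mean)
```

Memory use stays constant regardless of how many values flow through, which is the property that matters once data no longer fits in RAM.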

Challenges of Data Mining
High Dimensionality
• Datasets with hundreds or thousands of attributes
• Some traditional data analysis techniques were developed for low-dimensional data, and may not work well with high-dimensional data
• Many variables are collected; few turn out to be useful.
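A simple way to find the few useful variables among many is to rank attributes by how strongly they correlate with the target. The sketch below is one such filter, using Pearson correlation; the attribute names and values are invented, and this is only one of many feature selection approaches.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical dataset: three attributes and one target.
columns = {
    "ad_spend": [1, 2, 3, 4, 5],
    "shoe_size": [9, 7, 10, 8, 9],
    "constant": [1, 1, 1, 1, 1],
}
revenue = [11, 19, 32, 41, 50]
ranking = rank_features(columns, revenue)
print(ranking)
```

Correlation only captures linear relationships, so a low score does not prove an attribute is useless.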

Challenges of Data Mining
Heterogeneous and Complex Data
• Traditional data analysis often deals with data sets containing attributes of the same type (e.g. all continuous, all categorical)
• Non-traditional data: collection of web pages (w/ semi-structured text and hyperlinks)
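When a dataset mixes continuous and categorical attributes, one widely used workaround (not named in the slides, so treat it as an assumption) is one-hot encoding: each category becomes its own 0/1 attribute so that every column is numeric. The records and the `plan` field below are hypothetical.

```python
def one_hot_encode(records, categorical_field):
    """Replace one categorical field with 0/1 indicator fields, one per category."""
    categories = sorted({r[categorical_field] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != categorical_field}
        for c in categories:
            row[f"{categorical_field}={c}"] = 1 if r[categorical_field] == c else 0
        encoded.append(row)
    return encoded


# Hypothetical records mixing a continuous and a categorical attribute.
customers = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
]
encoded = one_hot_encode(customers, "plan")
print(encoded)
```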

Challenges of Data Mining
Data Ownership
• “Good data” may be geographically distributed or owned by more than one organization (e.g. medical records)
• Access to “good data”: Facebook and Google keep their collected data private

Sample Data
What is interesting in this data?
Vocabulary:
• Column: “attribute”, “feature”, “field”, “dimension”, “variable”
• Row: “instance”, “record”, “observation”

Data Mining Tasks
1. Predictive Tasks
• Objective: predict the value of a particular attribute, based on the values of other attributes
  • “Defaulted Borrower?” is the target (or dependent variable)
  • Attributes/features used for making the prediction are known as explanatory (or independent) variables


Supervised Machine Learning techniques automatically learn a model of the relationship between a set of descriptive features and a target feature from a set of historical examples.
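To make "learning a model from historical examples" concrete, the sketch below trains a decision stump (a one-split decision tree) that predicts whether a borrower defaulted from a single descriptive feature. The income figures, the function names, and the rule "default if income is below the threshold" are assumptions for illustration only.

```python
def fit_stump(examples):
    """Pick the income threshold that misclassifies the fewest training examples.

    The learned rule predicts "defaulted" when income is below the threshold.
    """
    incomes = sorted({income for income, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct incomes.
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        return sum((income < threshold) != defaulted for income, defaulted in examples)

    return min(candidates, key=errors)


def predict(threshold, income):
    """Apply the learned rule to a new, unseen borrower."""
    return income < threshold


# Hypothetical historical examples: (annual income in $1000s, defaulted?)
history = [(25, True), (30, True), (38, True), (52, False), (60, False), (75, False)]
threshold = fit_stump(history)
print(threshold, predict(threshold, 28), predict(threshold, 90))
```

Here income is the descriptive feature, "defaulted?" is the target feature, and the threshold is the model learned from the historical examples.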

Data Mining Tasks
2. Descriptive Tasks
• Objective: derive patterns (correlations, clusters) that summarize underlying relationships in data
• Often more exploratory and requires an explanation of found results
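Clustering is a typical descriptive task: there is no target attribute, only groups to discover. The sketch below runs k-means (Lloyd's algorithm) on one-dimensional data in plain Python; the spending values and the starting centroids are hypothetical, and real datasets would have more dimensions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Run Lloyd's algorithm on a list of numbers; return final centroids and clusters."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to its cluster mean; keep it in place if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical monthly spend for six customers.
spend = [5, 6, 7, 48, 50, 52]
centroids, clusters = k_means_1d(spend, centroids=[0.0, 100.0])
print(centroids, clusters)
```

The algorithm finds the two groups, but explaining what they mean (for example, occasional versus frequent buyers) is still up to the analyst.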

“Free Public Datasets”
• https://www.reddit.com/r/datasets/
• https://www.reddit.com/r/opendata/
• https://www.kaggle.com/datasets
• https://github.com/awesomedata/awesome-public-datasets
• https://www.forbes.com/sites/bernardmarr/2018/02/26/big-data-and-ai-30-amazing-and-free-public-data-sources-for-2018/

References
• Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
• Data Science from Scratch, 1st edition, Grus
• Introduction to Data Mining, 1st edition, Tan et al.
• Data Mining and Business Analytics in R, 1st edition, Ledolter