DATA MINING

Data mining deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large data sets.

11/28/2020 Data Mining - By Dr. S. C. Shirwaikar

Examples of information extracted using a query language
• List customers who use a credit card to purchase more than Rs 1000 worth of groceries
• List patients who had at least one heart attack

Examples of what data mining is used for
• Develop a general profile of credit card customers
• Determine patients whose lifestyle makes them prone to a heart attack in the near future
• Differentiate poor credit risk customers from good credit card customers
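The contrast between the two kinds of question can be sketched in a few lines of Python. This is only an illustration: the records, field names, and the idea of using an average and a most common category as a "profile" are assumptions made for the example, not something the slides specify.

```python
from statistics import mean

# Hypothetical purchase records; every field name and value is made up.
purchases = [
    {"customer": "A", "payment": "credit_card", "category": "groceries", "amount": 1500},
    {"customer": "B", "payment": "cash", "category": "groceries", "amount": 2000},
    {"customer": "C", "payment": "credit_card", "category": "groceries", "amount": 600},
    {"customer": "D", "payment": "credit_card", "category": "electronics", "amount": 900},
]

def query_big_grocery_card_users(rows, threshold=1000):
    """Query-style question: a precisely stated filter that returns a subset of the data."""
    return sorted({
        r["customer"] for r in rows
        if r["payment"] == "credit_card"
        and r["category"] == "groceries"
        and r["amount"] > threshold
    })

def profile_card_users(rows):
    """Mining-style question: describe credit card customers as a group."""
    card_rows = [r for r in rows if r["payment"] == "credit_card"]
    by_category = {}
    for r in card_rows:
        by_category[r["category"]] = by_category.get(r["category"], 0) + 1
    return {
        "average_amount": mean(r["amount"] for r in card_rows),
        "most_common_category": max(by_category, key=by_category.get),
    }

matching_customers = query_big_grocery_card_users(purchases)
card_profile = profile_card_users(purchases)
```

The query returns rows that satisfy a stated condition, while the profile is a derived description that appears in no single row.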

Data mining differs from usual query processing in many ways
1. The query cannot be well formed or precisely stated, because what you are looking for is usually hidden
2. Data in operational databases may not be sufficient. Data from various sources needs to be integrated and processed before quality mining can be done
3. The output is not just a subset of the data; it is analysed and presented as a pattern

Data explosion problem
• The explosive growth of data: from terabytes to petabytes
• Progress in hardware technology leading to automated data collection tools, storage media, and affordable computers
• Progress in database technology and relational technology leading to powerful database systems
• Tremendous amounts of data stored in databases, data warehouses, and other information repositories
• Quantity of data in the world roughly doubles every year
• Distribution and sharing of data is possible

• Due to the internet, hundreds of megabytes of data are distributed around the world
• Heterogeneous data sources can be shared using Open Database Connectivity (ODBC) tools
• Data exchange and integration through XML technology
• Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, …
• More data means less information
• We are drowning in data, but starving for knowledge!

Computers against computers
Automated data collection tools and the mechanical production and reproduction of data force us to develop mechanical methods for filtering, selecting, and interpreting data.

• Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Data mining: a misnomer?
• Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Knowledge discovery in databases (KDD) is a multistep process of finding useful information and patterns in data, while data mining is one of the steps in KDD: the use of algorithms to extract patterns.

Steps of KDD
1. Selection (data extraction): obtaining data from heterogeneous data sources such as databases, data warehouses, the World Wide Web, or other information repositories
2. Preprocessing (data cleaning): incomplete, noisy, or inconsistent data is cleaned. Missing data may be ignored or predicted; erroneous data may be deleted or corrected

3. Transformation (data integration): combines data from multiple sources into a coherent store. Data can be encoded in common formats, normalized, or reduced
4. Data mining: apply algorithms to the transformed data and extract patterns
5. Pattern interpretation/evaluation
   - Pattern evaluation: evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter the discovered patterns
   - Knowledge presentation: present the mined knowledge; visualization techniques can be used
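The five KDD steps can be read as a chain of functions, each consuming the output of the previous one. The following is a minimal sketch under stated assumptions: the two sources, the `income` and `bought` fields, the choice to drop records with missing values, min-max normalization, and the 0.6 interestingness threshold are all hypothetical choices made for illustration.

```python
def select(sources):
    """Selection: gather records from several (hypothetical) sources."""
    return [rec for src in sources for rec in src]

def preprocess(records):
    """Preprocessing: here, records with a missing income (None) are ignored."""
    return [r for r in records if r["income"] is not None]

def transform(records):
    """Transformation: min-max normalize income to the range [0, 1]."""
    values = [r["income"] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, "income": (r["income"] - lo) / span} for r in records]

def mine(records):
    """Data mining: a toy pattern, the share of buyers in each income band."""
    bands = {"low": [], "high": []}
    for r in records:
        bands["high" if r["income"] >= 0.5 else "low"].append(r["bought"])
    return {band: sum(flags) / len(flags) for band, flags in bands.items() if flags}

def evaluate(patterns, min_rate=0.6):
    """Evaluation: keep only patterns that pass an interestingness threshold."""
    return {band: rate for band, rate in patterns.items() if rate >= min_rate}

# Two made-up sources standing in for a data warehouse and web logs.
warehouse = [{"income": 20, "bought": False}, {"income": 80, "bought": True}]
web_logs = [{"income": None, "bought": True}, {"income": 100, "bought": True},
            {"income": 30, "bought": False}]

patterns = evaluate(mine(transform(preprocess(select([warehouse, web_logs])))))
```

On this toy data only the "high income" band survives the evaluation step, which mirrors how interestingness measures filter the discovered patterns before presentation.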

Visualization techniques
• Graphical: bar charts, pie charts, histograms
• Geometric: boxplot, scatter plot
• Icon-based: using colors or figures as icons
• Pixel-based: data as colored pixels
• Hierarchical: hierarchically dividing the display area
• Hybrid: combination of the above approaches

Knowledge discovery process
KDD is the nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data.
[Figure: KDD process diagram with the labels Operational Databases, Selection, Data Cleaning, Data Integration, Data Warehouses, Data Preprocessing, Data Transformation, Data Mining, Pattern Evaluation]

Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

The architecture of a typical data mining system may have the following major components
• Database, data warehouse, World Wide Web, or other information repository: data cleaning and data integration techniques may be performed on the data
• Database or data warehouse server: responsible for fetching the relevant data based on the user's data mining request

[Figure: architecture of a typical data mining system, with the labels Graphical User Interface, Pattern Evaluation, Data Mining Engine, Knowledge Base, Database or Data Warehouse Server, "data cleaning, integration, and selection", Database, Data Warehouse, World-Wide Web, Other Info Repositories]

• Data mining engine: a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, etc.
• Knowledge base: the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns
• Pattern evaluation module: applies interestingness measures to filter the discovered patterns
• Graphical user interface: lets the user specify a data mining query

Why Data Mining? Potential Applications
• Data analysis and decision support
  - Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
• Other applications
  - Text mining (news groups, email, documents) and Web mining
  - Stream data mining
  - Bioinformatics and bio-data analysis

Data mining algorithms
All algorithms attempt to fit a model closest to the data being examined. The model is based on the analysis of the attributes of a training data set. The model is then evaluated using a test data set.

The model can be
• Descriptive: characterizes and explores properties of the current data
• Predictive: performs inference on current data to make predictions about future data
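The fit-then-evaluate cycle described above can be sketched with a very small predictive model. The slides do not name a specific algorithm, so a nearest-centroid classifier is swapped in here purely for illustration; the two numeric features and the "good"/"poor" labels are made-up data.

```python
def train_centroids(training_set):
    """Fit the model: one mean point (centroid) per class label."""
    sums = {}
    for features, label in training_set:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: [t / n for t in total] for label, (total, n) in sums.items()}

def predict(model, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: dist(model[label]))

def accuracy(model, test_set):
    """Evaluate the model: fraction of test records labelled correctly."""
    hits = sum(predict(model, x) == y for x, y in test_set)
    return hits / len(test_set)

# Hypothetical labelled records: (feature vector, class label).
training_set = [((1, 1), "good"), ((2, 1), "good"), ((8, 9), "poor"), ((9, 8), "poor")]
test_set = [((1, 2), "good"), ((9, 9), "poor"), ((2, 2), "good")]

model = train_centroids(training_set)
test_accuracy = accuracy(model, test_set)
```

The key design point is that the test records are never used while fitting, so the accuracy is measured on data the model has not seen.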

Data Mining
• Descriptive: clustering, summarization, association rules, sequence discovery
• Predictive: classification, regression, time series analysis, prediction

• Classification: maps data into predefined groups or classes. It uses supervised learning: in a learning phase, the algorithm builds a classifier from a training data set containing data attributes and their associated class labels
• Regression: maps data to a real-valued prediction variable. The algorithm tries to find the function (linear or non-linear) that best fits the training data
• Time series analysis: the value of an attribute is examined as it varies over time. It can be used to determine similarities, classify behavior, or predict future values
• Prediction: predicts future values using regression, time series analysis, or other approaches
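As an illustration of regression finding the best-fitting function, here is a minimal sketch of a straight-line fit by ordinary least squares. The choice of least squares as the fitting criterion and the sample points are assumptions for the example; the slides only say that the algorithm looks for the function that best fits the training data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to the training points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # prediction for an unseen x value
```

Once the line is fitted, prediction is just evaluating the function at a new input, which is how regression feeds the prediction task.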

• Clustering: finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters
  - Unsupervised learning: no predefined classes
  - Interpretability and usability: results should be comprehensible and usable; a domain expert is required
• Summarization: maps data into subsets with simple descriptions. It extracts or derives representative summary information
• Association rules: discover relationships among data. Used in market basket analysis to find items frequently purchased together
• Sequence discovery: discovers sequential patterns in data, such as the order in which items are purchased or data is accessed
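For the market basket case, "items frequently purchased together" is commonly quantified with support and confidence. The slides do not name these measures, so they are brought in here as an assumption, and the transactions are made up for the sketch.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

bread_butter_support = support({"bread", "butter"}, transactions)
bread_to_butter = confidence({"bread"}, {"butter"}, transactions)
```

A rule such as "bread implies butter" would typically be reported only if both values exceed thresholds chosen by the analyst.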

Influence from many disciplines
[Figure: data mining algorithms draw on Database Technology, Machine Learning, Pattern Recognition, Statistics, Visualization, and Other Disciplines]

Depending on the data mining approach, techniques from other disciplines may be applied, such as
• Information retrieval
• Artificial intelligence
• Neural networks
• Fuzzy set theory
• Knowledge representation
• Logic programming
• High performance computing

Data mining issues
• Human interaction: interfaces are required with both domain and technical experts. A variety of databases and a variety of users lead to numerous data mining techniques. What is required is not known in advance, so the extraction process needs to be interactive
• Interpretation of results: experts are required because interpretability is a problem. Background knowledge or domain expertise is essential to guide the discovery process
• Visualization of results: visualization helps, but multidimensional data is problematic. The discovered knowledge should be expressed in the form of trees, tables, graphs, charts, curves, etc.

Data mining issues (continued)
• Large datasets: scalability is a problem because algorithms do not scale well with massive real-world datasets. Sampling and parallelization are effective tools
• High dimensionality: a conventional database may contain many different attributes, not all of which are relevant. This increases complexity and reduces efficiency (the dimensionality curse). Remedies: data reduction, dimensionality reduction
• Multimedia data: found in GIS databases; conventional data mining algorithms prove ineffective on it
• Missing data: it is not always possible to ignore missing data, but during preprocessing data mining algorithms can be used to replace missing data with estimates
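Replacing missing data with estimates can be as simple as substituting the mean of the known values. That particular estimator is an assumption chosen for this sketch (the slides do not say which estimate to use), and the `ages` list is made-up data.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    estimate = mean(known)
    return [estimate if v is None else v for v in values]

# Hypothetical attribute column with two missing entries.
ages = [20, None, 40, 30, None]
filled = impute_with_mean(ages)
```

More elaborate estimates, such as predicting the missing attribute from the other attributes of the same record, follow the same pattern: compute an estimate from what is known, then substitute it.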

Data mining issues (continued)
• Irrelevant data: data is reduced by removing irrelevant data
• Noisy data and outliers: invalid or incorrect data leads to poor-quality data mining. Outliers differ greatly from the rest of the data and do not fit the derived model
• Changing data: data warehouses contain non-volatile data. Dynamic data is uploaded and then the algorithms are reapplied
• Integration: KDD requests are one-time needs; data mining functions are now integrated into traditional database systems
• Applications: the effective use of a mining algorithm's output is the challenge, rather than the complexity of the mining algorithm

Data mining metrics
How to measure the effectiveness of the data mining process?
• The KDD process is expensive. The return on investment is the saving that results from decisions made using the results
• Difficult to measure and quantify
• Measured as increase in sales or reduction in advertising cost

Social implications of data mining
Two sides of the coin
• Data mining can be used to improve customer service and satisfaction
• Data mining can also intrude on one's right to privacy
• Omnipresent, invisible data mining affects everyone; profiling is used to label typical characteristics

Data mining should follow certain guidelines
The Organization for Economic Co-operation and Development (OECD) established a set of international guidelines referred to as fair information practices
• Purpose specification and use limitation: usage of collected data should not exceed the stated purpose
• Openness: individuals have the right to know the nature of the data collected about them
• Security safeguards: data is protected from loss, unauthorized access, destruction, use, modification, or disclosure

Data mining should follow certain guidelines
• Individual participation: the individual has the right to have the data erased, completed, or corrected
• Privacy-preserving data mining
  - Secure multiparty computation: data values are encoded so that no party can learn another's data values
  - Data obscuration: actual data is distorted by aggregation or by adding random noise. A reconstruction algorithm is essential for recovering the distribution of the original data
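The idea behind data obscuration by random noise can be illustrated as follows. This sketch is not a reconstruction algorithm; it only shows, under assumed made-up salary data and an assumed uniform noise range, that zero-mean noise changes each individual value while leaving an aggregate such as the mean close to the original.

```python
import random
from statistics import mean

def obscure(values, spread, rng):
    """Distort each value by adding uniform noise drawn from [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
salaries = [rng.randint(20, 80) for _ in range(10_000)]  # hypothetical data
released = obscure(salaries, spread=10, rng=rng)

# Individual values are distorted, but the noise averages out over many records.
mean_shift = abs(mean(released) - mean(salaries))
```

Recovering the full distribution of the original data, rather than just its mean, is the job of the reconstruction algorithm the slide refers to.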

BOOKS
• Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham and Sridhar, Pearson Education, ISBN 81-7758-785-4
• Data Mining Techniques, by Arun K Pujari, Universities Press (India) Limited, ISBN 81-7371-380-4
• Data Mining, by Pieter Adriaans and Dolf Zantinge, Pearson Education Asia, Addison Wesley Longman (Singapore), ISBN 81-7808-425-2
• Data Mining Techniques for Marketing, Sales and Customer Relationship Management, by Michael J. A. Berry and Gordon S. Linoff, Wiley-dreamtech India Pvt. Ltd., ISBN 81-265-0517-6
• Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 81-312-0535-5