Issues in Data Mining Infrastructure Authors Nemanja Jovanovic
- Slides: 72
Issues in Data Mining Infrastructure Authors: Nemanja Jovanovic, nemko@acm.org Valentina Milenkovic, tina@eunet.yu Prof. Dr. Veljko Milutinovic, vm@etf.bg.ac.yu http://galeb.etf.bg.ac.yu/~vm 1
Data Mining in a Nutshell § Uncovering the hidden knowledge § Huge NP-complete search space § Multidimensional interface 2
A Problem … You are a marketing manager for a cellular phone company § Problem: Churn is too high § Turnover (after contract expires) is 40% § Customers receive a free phone (cost $125) with the contract § You pay a sales commission of $250 per contract § Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful) § Bringing back a customer after quitting is both difficult and expensive 3
… A Solution § Three months before a contract expires, predict which customers will leave § If you want to keep a customer that is predicted to churn, offer them a new phone § The ones that are not predicted to churn need no attention § If you don’t want to keep the customer, do nothing § How can you predict future behavior? § Tarot Cards? § Magic Ball? § Data Mining? 4
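The economics behind this solution can be sketched with the slide's own figures ($125 phone, $250 commission, 40% turnover). The customer count and the model's recall and precision below are hypothetical assumptions, as is the simplification that every customer who is offered a phone stays.

```python
# Rough cost comparison for the churn example.
# PHONE_COST, COMMISSION and CHURN_RATE come from the slides; the customer
# count and the model quality (recall, precision) are illustrative assumptions.

PHONE_COST = 125      # cost of a free phone, in dollars
COMMISSION = 250      # sales commission per new contract, in dollars
CHURN_RATE = 0.40     # share of customers leaving when the contract expires


def cost_phone_for_everyone(customers: int) -> float:
    """Give every expiring customer a new phone (assumes all of them stay)."""
    return customers * PHONE_COST


def cost_do_nothing(customers: int) -> float:
    """Replace every churner with a new customer (phone plus commission)."""
    churners = customers * CHURN_RATE
    return churners * (PHONE_COST + COMMISSION)


def cost_targeted(customers: int, recall: float, precision: float) -> float:
    """Offer phones only to customers the model predicts will churn.

    recall    - assumed share of true churners that the model flags
    precision - assumed share of flagged customers who are true churners
    """
    churners = customers * CHURN_RATE
    caught = churners * recall
    flagged = caught / precision      # everyone who receives an offer
    missed = churners - caught        # churners who must be replaced
    return flagged * PHONE_COST + missed * (PHONE_COST + COMMISSION)
```

Under these assumptions, 1,000 expiring contracts cost about $125,000 with phones for everyone, $150,000 with no action, and $110,000 with a model of recall 0.8 and precision 0.5. The gap widens as the model improves.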
Still Skeptical? 5
The Definition The automated extraction of predictive information from (large) databases § Automated § Extraction § Predictive § Databases 6
History of Data Mining 7
Repetition in Solar Activity § 1613 – Galileo Galilei § 1859 – Heinrich Schwabe 8
The Return of Halley's Comet Edmund Halley (1656 - 1742) § Returns shown: 239 BC, 1531, 1607, 1682, 1910, 1986 § Next: 2061 ??? 9
Data Mining is Not § Data warehousing § Ad-hoc query/reporting § Online Analytical Processing (OLAP) § Data visualization 10
Data Mining is § Automated extraction of predictive information from various data sources § Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines 11
Data Mining can § Answer questions that were too time-consuming to resolve in the past § Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions 12
Focus of this Presentation § Data Mining problem types § Data Mining models and algorithms § Efficient Data Mining § Available software 13
Data Mining Problem Types 14
Data Mining Problem Types § 6 types § Often a combination solves the problem 15
Data Description and Summarization § Aims at concise description of data characteristics § Lower end of scale of problem types § Provides the user an overview of the data structure § Typically a subgoal 16
Segmentation § Separates the data into interesting and meaningful subgroups or classes § Manual or (semi)automatic § A problem in itself or just a step in solving a problem 17
Classification § Assumption: existence of objects with characteristics that belong to different classes § Building classification models which assign correct class labels to previously unseen objects § Arises in a wide range of applications § Segmentation can provide labels or restrict data sets 18
Concept Description § Understandable description of concepts or classes § Close connection to both segmentation and classification § Similarity and differences to classification 19
Prediction (Regression) § Finds the numerical value of the target attribute for unseen objects § Similar to classification - difference: discrete becomes continuous 20
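As one simple instance of prediction, a straight line can be fitted to known objects by ordinary least squares and then used to estimate the continuous target for an unseen object. This is only a sketch; the function names `fit_line` and `predict` are illustrative, and real problems usually involve many attributes.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model: Tuple[float, float], x: float) -> float:
    """Estimate the numerical target value for an unseen x."""
    slope, intercept = model
    return slope * x + intercept
```

A classifier would return a discrete label at this point; here the result is a number on a continuous scale.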
Dependency Analysis § Finding the model that describes significant dependencies between data items or events § Prediction of the value of a data item § Special case: associations 21
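For the special case of associations, the strength of a dependency "A implies B" is commonly measured by support and confidence. The slides do not name these measures, so the sketch below is an assumption about how such a dependency might be quantified; the basket data in the usage note is invented.

```python
from typing import FrozenSet, Sequence

Transaction = FrozenSet[str]


def support(transactions: Sequence[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: Sequence[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / base
```

With four made-up baskets {bread, milk}, {bread, butter}, {bread, milk, butter} and {milk}, the rule "bread implies milk" has support 0.5 and confidence of about 0.67.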
Data Mining Models 22
Neural Networks § Characterizes processed data with a single numeric value § Efficient modeling of large and complex problems § Based on biological structures: neurons § Network consists of neurons grouped into layers 23
Neuron Functionality § Inputs I1, I2, I3, …, In, each with its own weight W1, W2, W3, …, Wn, feed the function f, which produces the output § Output = f(W1*I1, W2*I2, …, Wn*In) 24
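A single neuron can be sketched as follows. The slide leaves f unspecified; summing the weighted inputs and passing the sum through a sigmoid is a common choice and is an assumption here, not something the slides state.

```python
import math
from typing import Callable, Sequence


def sigmoid(x: float) -> float:
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs: Sequence[float],
                  weights: Sequence[float],
                  activation: Callable[[float], float] = sigmoid) -> float:
    """Weight each input, sum the products, then apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return activation(weighted_sum)
```

Training a network amounts to adjusting the weights so that outputs like this one match known examples.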
Training Neural Networks 25
Neural Networks - Conclusion § Once trained, Neural Networks can efficiently estimate value of output variable for given input § Neurons and network topology are essentials § Usually used for prediction or regression problem types § Difficult to understand § Data pre-processing often required 26
Decision Trees § A way of representing a series of rules that lead to a class or value § Iterative splitting of data into discrete groups, maximizing the distance between them at each split § Classification trees and regression trees § Univariate splits and multivariate splits § Unlimited growth and stopping rules § CHAID, CART, QUEST, C5.0 27
Decision Trees § Example tree with the splits Balance>10 / Balance<=10, Age<=32 / Age>32, Married=NO / Married=YES 28
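One univariate splitting step can be sketched as a search for the threshold that leaves the purest groups. Gini impurity is used as the measure here because CART uses it; the slides do not say which measure they have in mind, so treat this as one possible reading of "maximizing distance".

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values: Sequence[float],
                   labels: Sequence[Hashable]) -> Optional[Tuple[float, float]]:
    """Find the split `value <= threshold` with the lowest weighted impurity.

    Returns (threshold, weighted_gini), or None when no split is possible.
    """
    n = len(values)
    best = None
    # The largest value is skipped: it would leave the right group empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best
```

A tree builder would apply this search to every attribute, pick the best one, and repeat on each resulting group until a stopping rule fires.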
Decision Trees 29
Rule Induction § Method of deriving a set of rules to classify cases § Creates independent rules that are unlikely to form a tree § Rules may not cover all possible situations § Rules may sometimes conflict in a prediction 30
Rule Induction § If balance>100.000 then confidence=HIGH & weight=1.7 § If balance>25.000 and status=married then confidence=HIGH & weight=2.3 § If balance<40.000 then confidence=LOW & weight=1.9 31
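The three rules can be applied to a case as sketched below, reading the thresholds as 100,000, 25,000 and 40,000. The slides do not say how conflicting rules are reconciled; adding up the weights per label and taking the heaviest label is an assumption made for illustration.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Rule(NamedTuple):
    condition: Callable[[dict], bool]
    label: str       # predicted confidence class
    weight: float    # rule weight as given on the slide


RULES: List[Rule] = [
    Rule(lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    Rule(lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    Rule(lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def classify(case: dict, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the label with the largest total weight among firing rules.

    Returns None when no rule fires, since rules may not cover every case.
    """
    votes: Dict[str, float] = {}
    for rule in rules:
        if rule.condition(case):
            votes[rule.label] = votes.get(rule.label, 0.0) + rule.weight
    if not votes:
        return None
    return max(votes, key=votes.get)
```

A married customer with a balance of 30,000 triggers both the second and the third rule; under the weighting assumed here HIGH (2.3) beats LOW (1.9). A single customer with a balance of 50,000 triggers no rule at all.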
K-nearest Neighbor and Memory-Based Reasoning (MBR) § Usage of knowledge of previously solved similar problems in solving the new problem § Assigning the class to the group where most of the k “neighbors” belong § First step – finding the suitable measure for distance between attributes in the data § How far is black from green? § + Easy handling of non-standard data types § - Huge models 32
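A minimal k-nearest-neighbor classifier is sketched below. The distance measure is the assumption to watch: numeric attributes contribute their squared difference, while categorical attributes such as colors contribute 0 when equal and 1 otherwise, which is only one of many possible answers to "how far is black from green?".

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Example = Tuple[Sequence, Hashable]   # (attribute values, class label)


def distance(a: Sequence, b: Sequence) -> float:
    """Mixed distance: squared difference for numbers, 0/1 mismatch otherwise."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)


def knn_classify(training: List[Example], query: Sequence, k: int = 3) -> Hashable:
    """Assign the class held by most of the k stored examples nearest the query."""
    if k < 1 or not training:
        raise ValueError("need k >= 1 and at least one training example")
    neighbors = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

The whole training set is the model, which is why the slide lists "huge models" as the drawback. Numeric attributes should be scaled to comparable ranges before they are mixed with 0/1 mismatches.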
K-nearest Neighbor and Memory-Based Reasoning (MBR) 33
Data Mining Models and Algorithms § Many other available models and algorithms § Logistic regression § Discriminant analysis § Generalized Additive Models (GAM) § Genetic algorithms § Etc… § Many application-specific variations of known models § Final implementation usually involves several techniques § Selection of the solution that gives the best results 34
Efficient Data Mining 35
[Figure: humorous troubleshooting flowchart with YES/NO branches. Boxes: Is It Working?; Don’t Mess With It!; Did You Mess With It?; You Shouldn’t Have!; Anyone Else Knows?; You’re in TROUBLE!; Hide It; Can You Blame Someone Else?; NO PROBLEM!; Will It Explode in Your Hands?; Look the Other Way] 36
DM Process Model § 5 A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate) § SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess) § CRISP–DM – is becoming a standard 37
CRISP - DM § CRoss-Industry Standard Process for DM § Conceived in 1996 by three companies: 38
CRISP – DM methodology § Four-level breakdown of the CRISP-DM methodology: § Phases § Generic Tasks § Specialized Tasks § Process Instances 39
Mapping generic models to specialized models § Analyze the specific context § Remove any details not applicable to the context § Add any details specific to the context § Specialize generic contents according to the concrete characteristics of the context § Possibly rename generic contents to provide more explicit meanings 40
Generalized and Specialized Cooking § Generalized: preparing food on your own § Find out what you want to eat § Find the recipe for that meal § Gather the ingredients § Prepare the meal § Enjoy your food § Clean up everything (or leave it for later) § Specialized: raw steak with vegetables? § Check the cookbook or call mom § Defrost the meat (if you had it in the fridge) § Buy missing ingredients or borrow from the neighbors § Cook the vegetables and fry the meat § Enjoy your food or even more § You were cooking so convince someone else to do the dishes 41
CRISP – DM model § Business understanding § Data understanding § Data preparation § Modeling § Evaluation § Deployment 42
Customizing a Web Page § User-friendly design § Prediction of the users' interests § Reduction of server workload § Reduction of Web traffic 43
Customizing a Web Page 44
Business Understanding § Determine business objectives § Assess situation § Determine data mining goals § Produce project plan 45
Business Understanding - Outputs § Background § Business objectives and success criteria § Inventory of resources § Requirements, assumptions, and constraints § Risks and contingencies § Terminology § Costs and benefits § Data mining goals and success criteria § Project plan § Initial assessment of tools and techniques 46
Customizing a Web Page – Business Understanding Example § Business objectives: make the users' surfing more comfortable; decrease of overhead for users; reduction of workload and Web traffic § Assess situation § Data mining goals: find patterns in the user behavior § Project plan 47
Data Understanding § Collect initial data § Describe data § Explore data § Verify data quality 48
Data Understanding - Outputs § Data collection report: background of data; list of data sources; for each data source, method of acquisition; problems encountered in data acquisition § Data description report: detailed description of each data source; list of tables or other database objects; description of each field including units, codes, etc. § Data exploration report: expected regularities or patterns and methods of detection; regularities or patterns found (expected and unexpected); any other surprises; conclusions for data transformation, data cleaning and any other pre-processing; conclusions related to data mining goals or business objectives § Data quality report: approach taken to assess data quality; results of data quality assessment 49
Customizing a Web Page – Data Understanding Example § Collecting the data § Data description § Results of data exploring § Verification of the quality of the data § In practice: update the server to monitor user behavior; record the users' activities into a storage; analyze recorded data; decide which data is usable for mining 50
Data Preparation § Select data § Clean data § Construct data § Integrate data § Format data 51
Data Preparation - Outputs § Dataset description report § Background including broad goals and plan for pre-processing § Description of pre-processing § Detailed description of resultant datasets § Rationale for inclusion/exclusion of attributes § Discoveries made during pre-processing and implications for further work § Dataset 52
Customizing a Web Page – Data Preparation Example § Decide from which period the users' monitored actions will be considered § Make assumptions about unnecessary monitored data and discard it § Classify user actions into categories, group interesting links, etc… § If more information about the user is available from other sources, use it § Transform data into suitable forms so that several modeling techniques can be applied 53
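The selection, cleaning and categorization steps of this example might look like the sketch below. Everything concrete in it is hypothetical: the log format (user, timestamp, URL), the URL prefixes in `CATEGORIES`, and the file suffixes treated as unnecessary.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from URL prefix to interest category.
CATEGORIES = {"/news": "news", "/sports": "sports", "/shop": "shopping"}
# Hypothetical list of requests assumed to carry no information about interests.
IGNORED_SUFFIXES = (".png", ".gif", ".css", ".js")

LogEntry = Tuple[str, datetime, str]   # (user id, time of request, URL path)


def categorize(url: str) -> str:
    """Classify a user action by the section of the site it touches."""
    for prefix, category in CATEGORIES.items():
        if url.startswith(prefix):
            return category
    return "other"


def prepare(log: Iterable[LogEntry],
            start: datetime,
            end: datetime) -> Dict[str, Counter]:
    """Build one category-count profile per user from the raw log."""
    profiles: Dict[str, Counter] = {}
    for user, when, url in log:
        if not (start <= when < end):
            continue                      # select: outside the chosen period
        if url.endswith(IGNORED_SUFFIXES):
            continue                      # clean: discard unnecessary records
        profiles.setdefault(user, Counter())[categorize(url)] += 1
    return profiles
```

The per-user counts form a flat table with one row per user and one column per category, a shape that most modeling techniques accept.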
Modeling § Select modeling technique § Generate test design § Build model § Assess model 54
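A very small test design is a holdout split: part of the data is set aside before the model is built and used only to assess it. The sketch below assumes labeled rows of the form (features, label) and a model that is simply a callable from features to a label; both are illustrative choices, as are the function names.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple

Row = Tuple[Sequence, Hashable]   # (features, known label)


def holdout_split(rows: Sequence[Row],
                  test_fraction: float = 0.25,
                  seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Shuffle reproducibly and return (training rows, test rows)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model: Callable[[Sequence], Hashable],
             test_rows: Sequence[Row]) -> float:
    """Share of held-out rows for which the model predicts the known label."""
    if not test_rows:
        raise ValueError("no test rows to assess the model on")
    correct = sum(1 for features, label in test_rows if model(features) == label)
    return correct / len(test_rows)
```

Building the model on the training rows and measuring accuracy only on the test rows gives an estimate of how it behaves on unseen data.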
Modeling - Outputs § Assessment of DM results with respect to business success criteria § Test design: broad description of the type of model and the training data to be used; explanation of how the model will be tested or assessed; description of any data required for testing; description of any planned examination of models by domain or data experts § Model description: type of model and relation to data mining goals; parameter settings used to produce model; detailed description of the model and any special features; conclusions regarding patterns in the data § Model assessment: overview of assessment process including deviations from the test plan; detailed assessment of the model; comments on models by domain or data experts; insights into why a certain modeling technique and certain parameter setting lead to good/bad results 55
Customizing a Web Page – Modeling Example § The problem is prediction of behavior § Regression could be a good solution due to distinct nature of the data § Create the software according to the project plan § Observe the behavior of the software § Tune the model after each evaluation phase if needed 56
Evaluation § Results = models + findings § Evaluate results § Review process § Determine next steps 57
Evaluation - Outputs § Assessment of DM results with respect to business success criteria: review of Business Objectives and Business Success Criteria; comparison between success criterion and DM results; conclusion about achievability of success criterion and suitability of data mining process; review of “Project Success”; are there new business objectives? § Review of process § List of possible actions 58
Customizing a Web Page – Evaluation Example § Observe the model behavior at work § Collect response from Beta testers § Check user satisfaction § Check server and network engagement § Classify results § Determine which parameter of the model should be changed § Present new ideas and modifications § Step back into previous phases as needed 59
Deployment § Plan deployment § Plan monitoring and maintenance § Produce final report § Review project 60
Deployment - Outputs § Monitoring and maintenance plan: overview of deployment results and indication of which results may require updating; description of how updating will be triggered; description of how updating will be performed § Final report: summary of Business Understanding (background, objectives and success criteria); summary of data mining process; summary of data mining results; summary of results evaluation; summary of deployment and maintenance plan; cost/benefit analysis; conclusions for the business; conclusions for future data mining 61
Customizing a Web Page – Deployment Example § Make the feature available to all users § Make plan for maintenance and user feedback § Analyze costs and benefits § Summarize the whole documentation § Summarize network and server additional activity § Collect the new ideas § Award according to results § Leave space for upgrade 62
At Last… 63
Available Software 64
Available Software § Discussion of data mining vendors and software is not included in this slide set 65
Conclusions 66
WWW.NBA.COM 67
Se7en 68
CD – ROM 69
Credits Anne Stern, SPSS, Inc. Djuro Gluvajic, ITE, Denmark Obrad Milivojevic, PC PRO, Yugoslavia 70
References § Bruha, I., “Data Mining, KDD and Knowledge Integration: Methodology and a Case Study”, SSGRR 2000 § Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., “Advances in Knowledge Discovery and Data Mining”, MIT Press, 1996 § Glymour, C., Madigan, D., Pregibon, D., Smyth, P., “Statistical Themes and Lessons for Data Mining”, Data Mining and Knowledge Discovery 1, 11-28, 1997 § Hecht-Nielsen, R., “Neurocomputing”, Addison-Wesley, 1990 § Pyle, D., “Data Preparation for Data Mining”, Morgan Kaufmann, 1999 § galeb.etf.bg.ac.yu/~vm § www.thearling.com § www.crisp-dm.com § www.twocrows.com § www.sas.com/products/miner § www.spss.com/clementine 71
The END 72