DATA MINING AND WAREHOUSING
EXAMINATION MARKING SCHEME
Teaching Scheme: Lectures: 3 Hrs/Week
Examination Scheme: In-Semester Assessment: 30; End-Semester Assessment: 70
COURSE OBJECTIVES:
• To understand the fundamentals of Data Mining
• To identify the appropriateness and need of mining the data
• To learn the preprocessing, mining and post processing of the data
• To understand various methods, techniques and algorithms in data mining
COURSE OUTCOMES:
On completion of the course the student should be able to:
• Apply basic, intermediate and advanced techniques to mine the data
• Analyze the output generated by the process of data mining
• Explore the hidden patterns in the data
• Optimize the mining process by choosing the best data mining technique
SYLLABUS Unit-I Introduction (8 Hrs.)
Data Mining, Data Mining Task Primitives, Data: Data, Information and Knowledge; Attribute Types: Nominal, Binary, Ordinal and Numeric attributes, Discrete versus Continuous Attributes; Introduction to Data Preprocessing, Data Cleaning: Missing values, Noisy data; Data integration: Correlation analysis; Transformation: Min-max normalization, z-score normalization and decimal scaling; Data reduction: Data Cube Aggregation, Attribute Subset Selection, sampling; and Data Discretization: Binning, Histogram Analysis
SYLLABUS Unit-II Data Warehouse (8 Hrs.)
Data Warehouse, Operational Database Systems and Data Warehouses (OLTP vs OLAP), A Multidimensional Data Model: Data Cubes, Stars, Snowflakes, and Fact Constellations Schemas; OLAP Operations in the Multidimensional Data Model, Concept Hierarchies, Data Warehouse Architecture, The Process of Data Warehouse Design, A three-tier data warehousing architecture, Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP.
SYLLABUS Unit-III Measuring Data Similarity and Dissimilarity (8 Hrs.)
Measuring Data Similarity and Dissimilarity, Proximity Measures for Nominal Attributes and Binary Attributes, interval scaled; Dissimilarity of Numeric Data: Minkowski Distance, Euclidean distance and Manhattan distance; Proximity Measures for Categorical, Ordinal Attributes, Ratio scaled variables; Dissimilarity for Attributes of Mixed Types, Cosine Similarity.
SYLLABUS Unit-IV Association Rules Mining (8 Hrs.)
Market basket Analysis, Frequent item set, Closed item set, Association Rules, Apriori Algorithm, Generating Association Rules from Frequent Item sets, Improving the Efficiency of Apriori, Mining Frequent Item sets without Candidate Generation: FP Growth Algorithm; Mining Various Kinds of Association Rules: Mining multilevel association rules, constraint based association rule mining, Meta rule-Guided Mining of Association Rules.
SYLLABUS Unit-V Classification (8 Hrs.)
Introduction to: Classification and Regression for Predictive Analysis, Decision Tree Induction, Rule-Based Classification: using IF-THEN Rules for Classification, Rule Induction Using a Sequential Covering Algorithm. Bayesian Belief Networks, Training Bayesian Belief Networks, Classification Using Frequent Patterns, Associative Classification, Lazy Learners: k-Nearest-Neighbor Classifiers, Case-Based Reasoning.
SYLLABUS Unit-VI Multiclass Classification (8 Hrs.)
• Multiclass Classification, Semi-Supervised Classification, Reinforcement learning, Systematic Learning, Wholistic learning and multiperspective learning.
• Metrics for Evaluating Classifier Performance: Accuracy, Error Rate, Precision, Recall, Sensitivity, Specificity
• Evaluating the Accuracy of a Classifier: Holdout Method, Random Subsampling and Cross-Validation.
TEXT BOOKS AND REFERENCE BOOKS
Text Books
1. Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier Publishers, ISBN: 9780123814791, 9780123814807.
2. Parag Kulkarni, “Reinforcement and Systemic Machine Learning for Decision Making”, Wiley-IEEE Press, ISBN: 978-0-470-91999-6
Reference Books
1. Matthew A. Russell, "Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More", Shroff Publishers, 2nd Edition, ISBN: 9780596006068
2. Maksim Tsvetovat, Alexander Kouznetsov, "Social Network Analysis for Startups: Finding connections on the social web", Shroff Publishers, ISBN-10: 1449306462
INTRODUCTION
• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Classification of data mining systems
• Data Mining Task Primitives
• Integration of data mining system with a DB and DW System
• Major issues in data mining
UNIT-1) INTRODUCTION, Data Mining Task Primitives, Data: Data, Information and Knowledge; Attribute Types: Nominal, Binary, Ordinal and Numeric attributes, Discrete versus Continuous Attributes; Introduction to Data Preprocessing, Data Cleaning: Missing values, Noisy data; Data integration: Correlation analysis; transformation: Min-max normalization, z-score normalization and decimal scaling; data reduction: Data Cube Aggregation, Attribute Subset Selection, sampling; and Data Discretization: Binning, Histogram Analysis
SOME DEFINITIONS
Data: Data are any facts, numbers, or text that can be processed by a computer.
• operational or transactional data, such as sales, cost, inventory, payroll, and accounting
• nonoperational data, such as industry sales, forecast data, and macroeconomic data
• meta data: data about the data itself, such as logical database design or data dictionary definitions
Information: The patterns, associations, or relationships among all this data can provide information.
DEFINITIONS (CONTINUED)
Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in terms of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Data Warehouses: Data warehousing is defined as a process of centralized data management and retrieval.
WHY DO WE NEED DATA MINING?
The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
  • Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: Remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
EVOLUTION OF DATABASE TECHNOLOGY
1960s:
• Data collection, database creation, IMS and network DBMS
1970s:
• Relational data model, relational DBMS implementation
1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
WHAT IS DATA MINING?
Data mining (knowledge discovery from data): finding hidden information in a database.
• Extraction of interesting (non-trivial (relevant), implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Data mining: a misnomer? The goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of the data itself.
Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, exploratory data analysis, data-driven discovery and deductive learning, etc.
Watch out: Is everything “data mining”? The following are not:
• Simple search and query processing
• (Deductive) expert systems
WHY DATA MINING? POTENTIAL APPLICATIONS
Data analysis and decision support
• Market analysis and management
  • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
• Risk analysis and management
  • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and detection of unusual patterns (outliers)
Other Applications
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis
EX. 1: MARKET ANALYSIS AND MANAGEMENT
Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
• Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
Cross-market analysis: Find associations/correlations between product sales, and predict based on such associations
Customer profiling: What types of customers buy what products (clustering or classification)
Customer requirement analysis
• Identify the best products for different customers
• Predict what factors will attract new customers
Provision of summary information
• Multidimensional summary reports
• Statistical summary information (data central tendency and variation)
EX. 2: CORPORATE ANALYSIS & RISK MANAGEMENT
Finance planning and asset evaluation
• cash flow analysis and prediction
• contingent claim analysis to evaluate assets
• cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
• summarize and compare the resources and spending
Competition
• monitor competitors and market directions
• group customers into classes and a class-based pricing procedure
• set pricing strategy in a highly competitive market
EX. 3: FRAUD DETECTION & MINING UNUSUAL PATTERNS
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
• Auto insurance: ring of collisions
• Money laundering: suspicious monetary transactions
• Medical insurance
  • Professional patients, ring of doctors, and ring of references
  • Unnecessary or correlated screening tests
• Telecommunications: phone-call fraud
  • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
• Retail industry
  • Analysts estimate that 38% of retail shrink is due to dishonest employees
• Anti-terrorism
DATA MINING MODELS AND TASKS (figure)
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
Classification: maps data into predefined groups or classes. It uses supervised learning: in a learning phase, the algorithm builds a classifier from a training data set containing data attributes and associated class labels.
Regression: maps data into a real-valued prediction variable. The algorithm tries to find the best function (linear or non-linear) that fits the training data.
Time Series Analysis: the value of an attribute is examined as it varies over time. It can be used to determine similarities, classify behavior or predict future values.
Prediction: predicts future values using regression, time series analysis or other approaches.
(Data Mining, by Dr. S. C. Shirwaikar, 11/5/2020)
Clustering: finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters.
• Unsupervised learning: no predefined classes
• Interpretability and usability: results should be comprehensible and usable; a domain expert is required
Summarization: maps data into subsets with simple descriptions. It extracts or derives representative summary information.
Association rules: discover relationships among data; used in market basket analysis to find items frequently purchased together.
Sequence Discovery: discovers sequential patterns in data, such as the order in which items are purchased or data is accessed.
KNOWLEDGE DISCOVERY (KDD) PROCESS
• Data mining: core of the knowledge discovery process
(Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation)
Data Mining process
KDD PROCESS: SEVERAL KEY STEPS
• Learning the application domain
  • relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation
  • Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
  • summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
DATA MINING AND BUSINESS INTELLIGENCE
(Figure: pyramid of layers; potential to support business decisions increases toward the top)
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD VS DATA MINING
KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories to help humans extract useful and previously unknown information (i.e. knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them.
KDD VS DATA MINING (CONTINUED)
The KDD process deals with mapping low-level data into other forms that are more compact, abstract and useful. This is achieved by creating short reports, modelling the process that generates the data, and developing predictive models that can predict future cases.
Data Mining is the application of a specific algorithm in order to extract patterns from data.
WHAT IS THE DIFFERENCE BETWEEN KDD AND DATA MINING?
Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.
Architecture of a typical data mining System
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
(Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, and Other Disciplines)
TECHNOLOGIES USED
Data mining includes many techniques from the domains below:
• Statistics
• Machine Learning
• Database Systems and Data Warehouses
• Information Retrieval
• Visualization
• High Performance Computing
• Pattern Matching
TECHNOLOGIES (CONTINUED)
Statistics: It studies the collection, analysis, interpretation and presentation of data.
• Statistical research develops tools for prediction and forecasting using data
• Statistical methods can also be used to verify data mining results
TECHNOLOGIES (CONTINUED)
Information Retrieval: It is the science of searching for documents or for information in documents.
Basic measures of text retrieval:
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
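
The two text-retrieval measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `precision_recall` and the document ids are made up for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 5 retrieved, 3 in common.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d4", "d7", "d9"}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here 3 of the 5 retrieved documents are relevant (precision 0.60) and 3 of the 4 relevant documents were retrieved (recall 0.75).
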
TECHNOLOGIES (CONTINUED)
Database Systems and Data Warehouses: This research focuses on the creation, maintenance and use of databases for organizations and end users.
TECHNOLOGIES (CONTINUED)
Machine Learning: It investigates how computers can learn or improve their performance based on data.
TECHNOLOGIES (CONTINUED)
High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.
TECHNOLOGIES (CONTINUED)
Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.
MAJOR ISSUES IN DATA MINING
Mining methodology
• Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
• Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
• Data mining query languages and ad-hoc mining
• Expression and visualization of data mining results
• Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
• Domain-specific data mining & invisible data mining
• Protection of data security, integrity, and privacy
WHY NOT TRADITIONAL DATA ANALYSIS?
Tremendous amount of data
• Algorithms must be highly scalable to handle data volumes such as terabytes
High-dimensionality of data
• Micro-array may have tens of thousands of dimensions
High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structure data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
New and sophisticated applications
MULTI-DIMENSIONAL VIEW OF DATA MINING
Data to be mined
• Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
• Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
DATA MINING: CLASSIFICATION SCHEMES
General functionality
• Descriptive data mining
• Predictive data mining
Different views lead to different classifications
• Data view: Kinds of data to be mined
• Knowledge view: Kinds of knowledge to be discovered
• Method view: Kinds of techniques utilized
• Application view: Kinds of applications adapted
DATA MINING: ON WHAT KINDS OF DATA?
Database-oriented data sets and applications
• Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structure data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web
DATA MINING FUNCTIONALITIES
Multidimensional concept description: Characterization and discrimination
• Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
• Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
Classification and prediction
• Construct models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas mileage)
• Predict some unknown or missing numerical values
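
The two bracketed numbers attached to the Diaper/Beer rule are its support (the fraction of all transactions containing both items) and its confidence (the fraction of diaper transactions that also contain beer). The sketch below computes both; the basket data and the function name `rule_metrics` are invented for illustration and were chosen only so that the confidence comes out at 75%.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets (not from the slides).
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"diaper", "beer", "nuts"},
    {"milk", "bread"},
    {"beer", "nuts"},
    {"bread"},
    {"milk"},
]
support, confidence = rule_metrics(baskets, {"diaper"}, {"beer"})
print(f"support={support:.3f} confidence={confidence:.2f}")
```

A high confidence only says the items co-occur; it does not by itself answer the slide's question of whether the relationship is causal.
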
DATA MINING FUNCTIONALITIES (2)
Cluster analysis
• Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
• Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
• Outlier: Data object that does not comply with the general behavior of the data
• Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
• Trend and deviation: e.g., regression analysis
• Periodicity analysis
• Similarity-based analysis
Other pattern-directed or statistical analyses
ARE ALL THE “DISCOVERED” PATTERNS INTERESTING?
Data mining may generate thousands of patterns: Not all of them are interesting
• Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
• A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
• Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
• Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
FIND ALL AND ONLY INTERESTING PATTERNS?
Find all the interesting patterns: Completeness
• Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
• Heuristic vs. exhaustive search
• Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
• Can a data mining system find only the interesting patterns?
• Approaches
  • First generate all the patterns and then filter out the uninteresting ones
  • Generate only the interesting patterns: mining query optimization
WHY DATA MINING QUERY LANGUAGE?
Automated vs. query-driven?
• Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many yet uninteresting
Data mining should be an interactive process
• User directs what is to be mined
Users must be provided with a set of primitives to be used to communicate with the data mining system
Incorporating these primitives in a data mining query language
• More flexible user interaction
• Foundation for design of graphical user interface
• Standardization of data mining industry and practice
DMQL: A DATA MINING QUERY LANGUAGE
Motivation
• A DMQL can provide the ability to support ad-hoc and interactive data mining
• By providing a standardized language like SQL
  • The hope is to achieve an effect similar to the one SQL has had on relational databases
  • Foundation for system development and evolution
  • Facilitate information exchange, technology transfer, commercialization and wide acceptance
AN EXAMPLE QUERY IN DMQL
INTEGRATION OF DATA MINING AND DATA WAREHOUSING
Data mining systems, DBMS, Data warehouse systems coupling
• No coupling, loose-coupling, semi-tight-coupling, tight-coupling
On-line analytical mining of data
• integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
• Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
• Characterized classification, first clustering and then association
COUPLING DATA MINING WITH DB/DW SYSTEMS
No coupling: flat file processing, not recommended
Loose coupling
• Fetching data from DB/DW
Semi-tight coupling: enhanced DM performance
• Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
Tight coupling: a uniform information processing environment
• DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
ARCHITECTURE: TYPICAL DATA MINING SYSTEM
(Figure: layers from top to bottom, with the Knowledge Base feeding the Data Mining Engine and Pattern Evaluation)
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine
• Database or Data Warehouse Server
• data cleaning, integration, and selection
• Database, Data Warehouse, World-Wide Web, Other Info Repositories
• Data Mining Engine: a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, etc.
• Knowledge Base: the domain knowledge used to guide the search or evaluate the interestingness of resulting patterns
• Pattern Evaluation module: applies interestingness measures to filter the discovered patterns
• Graphical User Interface: lets the user specify a data mining query
WHAT IS ASSOCIATION RULE MINING?
Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
Frequent pattern mining: finding regularities in data
• What products were often purchased together?
  • Bread and milk
• What are the subsequent purchases after buying a car?
• Can we automatically profile customers?
BASICS
Itemset: a set of items
• E.g., acm = {a, c, m}
Support of itemsets
• Sup(acm) = 3
Given min_sup = 3, acm is a frequent pattern
Frequent pattern mining: find all frequent patterns in a database

Transaction database TDB
TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
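
As a quick check of Sup(acm) = 3, the support of an itemset can be counted directly against the transaction database TDB above. This is a small illustrative sketch; the helper name `support` is an assumption, and items are written in lowercase.

```python
# Transaction database TDB from the slide, keyed by TID.
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}


def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)


min_sup = 3
sup_acm = support("acm", TDB)
print(sup_acm, sup_acm >= min_sup)
```

Transactions 100, 200 and 500 contain a, c and m, so the count is 3 and acm meets min_sup = 3.
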
FREQUENT PATTERN MINING METHODS
• Apriori and its variations/improvements
• Mining frequent-patterns without candidate generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level frequent patterns with flexible support constraints
• Interestingness: correlation and causality
APRIORI: CANDIDATE GENERATION-AND-TEST
Any subset of a frequent itemset must also be frequent: an anti-monotone property
• A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
• {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
No superset of any infrequent itemset should be generated or tested
• Many item combinations can be pruned
APRIORI-BASED MINING
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB
APRIORI ALGORITHM
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994). Example with min_sup = 2:

Database D
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D, 1-candidates (itemset:sup): a:2, b:3, c:3, d:1, e:3
Freq 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D, counting: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Freq 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D, Freq 3-itemsets: bce:2
THE APRIORI ALGORITHM
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;
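
The pseudocode above can be turned into a short runnable sketch. This is one possible rendering, not the authors' implementation: candidates are formed by taking unions of frequent k-itemsets (a simplification of the ordered self-join) and supports are counted by a plain scan rather than a hash-tree. The function name `apriori` and the variable names are assumptions; the data is the four-transaction database D from the Apriori example slide.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database per level: count candidates contained in t.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # L1: frequent single items.
    level = count({frozenset([i]) for t in transactions for i in t})
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have exactly k+1 items,
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # pruned so that every k-subset is itself frequent (anti-monotone property).
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(D, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), result[itemset])
```

With min_sup = 2 this yields the frequent itemsets a, b, c, e, ac, bc, be, ce and bce; d, ab and ae fall below the threshold.
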
IMPORTANT DETAILS OF APRIORI
How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
How to count supports of candidates?
HOW TO GENERATE CANDIDATES?
Suppose the items in Lk-1 are listed in an order
Step 1: self-join Lk-1
    INSERT INTO Ck
    SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
    FROM Lk-1 p, Lk-1 q
    WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
    For each itemset c in Ck do
        For each (k-1)-subset s of c do
            if (s is not in Lk-1) then delete c from Ck
EXAMPLE OF CANDIDATE GENERATION
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
Pruning:
• acde is removed because ade is not in L3
C4 = {abcd}
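
The self-join and prune steps can be sketched as follows, run on L3 = {abc, abd, acd, ace, bcd} (acd has to be in L3 for the join with ace to produce acde). The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for this illustration.

```python
from itertools import combinations


def apriori_gen(prev_level):
    """Build C_k from L_(k-1); itemsets are handled as sorted tuples of items."""
    prev = sorted(tuple(sorted(s)) for s in prev_level)
    prev_set = set(prev)
    k_minus_1 = len(prev[0]) if prev else 0
    candidates = []
    # Step 1: self-join. p and q agree on their first k-2 items and
    # the last item of p sorts before the last item of q.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    # Step 2: prune. Drop any candidate that has a (k-1)-subset not in L_(k-1).
    return [
        c for c in candidates
        if all(s in prev_set for s in combinations(c, k_minus_1))
    ]


L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])
```

The join produces abcd and acde; acde is pruned because its subset ade is not in L3, leaving abcd as the only candidate.
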
HOW TO COUNT SUPPORTS OF CANDIDATES?
Why is counting supports of candidates a problem?
• The total number of candidates can be huge
• One transaction may contain many candidates
Method:
• Candidate itemsets are stored in a hash-tree
• Leaf node of hash-tree contains a list of itemsets and counts
• Interior node contains a hash table
• Subset function: finds all the candidates contained in a transaction
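
A full hash-tree is beyond a short example, so the sketch below swaps in a simpler stand-in: a flat hash table (a Python dict) keyed by candidate itemset, with the subset function implemented by enumerating the k-subsets of each transaction. It shows the counting idea only; a real hash-tree avoids enumerating every subset. The names `count_supports`, `D` and `C2` are assumptions, and the data is the database D and 2-candidates from the Apriori example slide.

```python
from itertools import combinations


def count_supports(transactions, candidates, k):
    """Count each k-itemset candidate by enumerating k-subsets of transactions."""
    # Flat hash table of candidates (a simplification of the hash-tree).
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions:
        # "Subset function": every k-subset of t that is a stored candidate.
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts


D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
C2 = ["ab", "ac", "ae", "bc", "be", "ce"]
counts = count_supports(D, C2, k=2)
for itemset, n in sorted(counts.items()):
    print("".join(itemset), n)
```

Enumerating subsets costs time that grows quickly with transaction length, which is the cost the hash-tree's interior hash tables are designed to cut down.
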
CHALLENGES OF FREQUENT PATTERN MINING
Challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
Improving Apriori: general ideas
• Reduce number of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates
SUMMARY
• Data mining: Discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining
A BRIEF HISTORY OF DATA MINING SOCIETY
1989 IJCAI Workshop on Knowledge Discovery in Databases
• Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
• Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
• Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
• PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007
CONFERENCES AND JOURNALS ON DATA MINING
KDD Conferences
• ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
• SIAM Data Mining Conf. (SDM)
• (IEEE) Int. Conf. on Data Mining (ICDM)
• Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
• Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
• ACM SIGMOD
• VLDB
• (IEEE) ICDE
• WWW, SIGIR
• ICML, CVPR, NIPS
Journals
• Data Mining and Knowledge Discovery (DAMI or DMKD)
• IEEE Trans. on Knowledge and Data Eng. (TKDE)
• KDD Explorations
• ACM Trans. on KDD
WHERE TO FIND REFERENCES? DBLP, CITESEER, GOOGLE
Data mining and KDD (SIGKDD: CDROM)
• Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
• Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
• Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
• Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
• Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
• Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
• Conferences: SIGIR, WWW, CIKM, etc.
• Journals: WWW: Internet and Web Information Systems
Statistics
• Conferences: Joint Stat. Meeting, etc.
• Journals: Annals of Statistics, etc.
Visualization
• Conference proceedings: CHI, ACM-SIGGraph, etc.
• Journals: IEEE Trans. Visualization and Computer Graphics, etc.
RECOMMENDED REFERENCE BOOKS
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
• J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
• D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
• T. M. Mitchell. Machine Learning. McGraw-Hill, 1997
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
• P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005