Last update: 29 October 2008

Advanced databases – Inferring new knowledge from data(bases): Deductive Databases; Knowledge Discovery in Databases

Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2008w/adb/

Berendt: Advanced databases, first semester 2008, http://www.cs.kuleuven.be/~berendt/teaching/2008w/adb/

Agenda

- Motivation I: Application examples
- Motivation II: Types of reasoning
- A key concept of deductive DBs: Recursion
- The process of knowledge discovery (KDD)
- KDD: Origins and context
- A short overview of key KDD techniques
- An algorithm for decision-tree learning: ID3

What is the impact of genetically modified organisms?

Is our school system good for immigrants and/or children from poor backgrounds?

What are the effects of teaching in English at universities?

What makes people happy?

What do men and women like?

Is this a man or a woman?

And here's a somewhat speculative case... Who owes money to whom (causing the current financial crisis)?

Deductive database languages / Datalog: Motivation

SQL-92 (= SQL2) cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?

NB: SQL saw a new version (SQL3) in 1999 and further developments since then. Some DDB concepts are used to support the advanced features of more recent SQL standards.

What is a deductive database (system)?

A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.

Styles of reasoning: "All swans are white"

- Deductive: towards the consequences
    All swans are white.
    Tessa is a swan.
    Therefore: Tessa is white.
- Inductive: towards a generalisation of observations
    Joe and Lisa and Tex and Wili and ... (all observed swans) are swans.
    Joe and Lisa and Tex and Wili and ... (all observed swans) are white.
    Therefore: All swans are white.
- Abductive: towards the (most likely) explanation of an observation
    Tessa is white.
    All swans are white.
    Therefore: Tessa is a swan.

What about truth?

- Deductive: given the truth of the assumptions, a valid deduction guarantees the truth of the conclusion.
- Inductive: the premises of an argument (are believed to) support the conclusion but do not ensure it. Induction has been attacked several times by logicians and philosophers.
- Abductive: formally equivalent to the logical fallacy of affirming the consequent.

What about new knowledge?

C. S. Peirce:
- Introduced "abduction" to modern logic
- (after 1900): used "abduction" to mean: creating new rules to explain new observations (this meaning is actually closest to induction)
- "Abduction is the only logical process that actually creates anything new."
  Essential for scientific discovery

Deductive databases in a Computer Science context

- Deductive databases have grown out of the desire to combine logic programming with relational databases to construct systems that support a powerful formalism and are still fast and able to deal with very large datasets.
- Deductive databases are more expressive than relational databases but less expressive than logic programming systems.
- Deductive databases have not found widespread adoption outside academia, but some of their concepts are used in today's relational databases to support the advanced features of more recent SQL standards (≥ SQL:1999).

Datalog

- A query and rule language for deductive databases that syntactically is a subset of Prolog.
- Roots in the 1970s; the term Datalog was coined in the mid-1980s by a group of researchers interested in database theory.
- Query evaluation is sound and complete and can be done efficiently even for large databases.
- Query evaluation is usually done using bottom-up strategies.
- In contrast to Prolog, Datalog
  - disallows complex terms as arguments of predicates, e.g. P(1, 2) is admissible but not P(f1(1), 2),
  - imposes certain stratification restrictions on the use of negation and recursion, and
  - only allows range-restricted variables, i.e. each variable in the conclusion of a rule must also appear in a non-negated clause in the premise of this rule.

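The range-restriction condition can be checked mechanically. The following is a minimal sketch in Python, not part of any Datalog system: the `Literal` representation, the function names, and the two example rules are assumptions made for illustration, and variables are taken to be strings starting with an upper-case letter, as in Prolog.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical representation of one literal of a rule."""
    pred: str
    args: tuple
    negated: bool = False


def is_variable(term):
    # Prolog-style convention: variables start with an upper-case letter.
    return isinstance(term, str) and term[:1].isupper()


def variables(literal):
    return {a for a in literal.args if is_variable(a)}


def is_range_restricted(head, body):
    """True if every head variable occurs in some non-negated body literal."""
    positive_vars = set()
    for lit in body:
        if not lit.negated:
            positive_vars |= variables(lit)
    return variables(head).issubset(positive_vars)


# Ancestor(X, Y) :- Parent(X, Y).        (both head variables are bound)
safe = is_range_restricted(
    Literal("Ancestor", ("X", "Y")),
    [Literal("Parent", ("X", "Y"))],
)

# P(X, Y) :- Q(X), not R(Y).             (Y occurs only in a negated literal)
unsafe = is_range_restricted(
    Literal("P", ("X", "Y")),
    [Literal("Q", ("X",)), Literal("R", ("Y",), negated=True)],
)

print(safe, unsafe)
```

The second rule is rejected because Y could range over infinitely many values that R does not contain, so the answer would not be a finite relation.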
Deductive database languages / Datalog: Motivation

SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?

Can we extend the query language to cover such queries?
- Yes, by adding recursion.

Datalog

SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."

Datalog is a query language that has the same if-then flavor:
- New: The answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

Example

Find the components of a trike.

Assembly instance:

part    subpart  number
trike   wheel    3
trike   frame    1
frame   seat     1
frame   pedal    1
wheel   spoke    2
wheel   tire     1
tire    rim      1
tire    tube     1

We can write a relational algebra query to compute the answer on the given instance of Assembly. But there is no R.A. (or SQL-92) query that computes the answer on all Assembly instances.

The Problem with Relational Algebra and SQL-92

Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- Takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance!

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression!

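The argument can be checked on the trike instance. The sketch below, in Python, computes the (part, subpart) pairs reachable with a fixed number of self-joins of Assembly. The function name `components_with_joins` and the in-memory tuple representation are assumptions for illustration; the data is the Assembly instance of the example.

```python
# Assembly(part, subpart, number) for the trike example.
ASSEMBLY = [
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("frame", "seat", 1), ("frame", "pedal", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("tire", "rim", 1), ("tire", "tube", 1),
]


def components_with_joins(assembly, joins):
    """Union of the results of 0, 1, ..., `joins` self-joins of Assembly."""
    direct = {(part, sub) for part, sub, _qty in assembly}
    result = set(direct)
    frontier = set(direct)
    for _ in range(joins):
        # One more join: extend every path found so far by one Assembly edge.
        frontier = {
            (part, sub2)
            for part, sub in frontier
            for part2, sub2 in direct
            if sub == part2
        }
        result |= frontier
    return result


one_join = components_with_joins(ASSEMBLY, 1)
two_joins = components_with_joins(ASSEMBLY, 2)

print(("trike", "tire") in one_join)   # one level down: found with one join
print(("trike", "rim") in one_join)    # one level deeper: missed
print(("trike", "rim") in two_joins)   # found once a second join is added
```

Whatever fixed value is chosen for `joins`, an instance with a deeper hierarchy defeats it, which is the point of the slide.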
A Datalog Query that Does the Job

Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).

In each rule, the part to the left of :- is the head of the rule, :- stands for implication, and the part to the right is the body of the rule.

Can read the second rule as follows: "For all values of Part, Subpt and Qty, if there is a tuple (Part, Part2, Qty) in Assembly and a tuple (Part2, Subpt) in Comp, then there must be a tuple (Part, Subpt) in Comp."

Using a Rule to Deduce New Tuples

Each rule is a template: by assigning constants to the variables in such a way that each body "literal" is a tuple in the corresponding relation, we identify a tuple that must be in the head relation.
- By setting Part=trike, Subpt=wheel, Qty=3 in the first rule, we can deduce that the tuple (trike, wheel) is in the relation Comp.
- This is called an inference using the rule.
- Given a set of tuples, we apply the rule by making all possible inferences with these tuples in the body.

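Applying a rule can be written as a comprehension over all variable assignments that make the body true. The following Python sketch assumes the trike Assembly instance and represents relations as sets of tuples; the function names `apply_rule1` and `apply_rule2` are illustrative, not part of any Datalog engine.

```python
# Assembly(Part, Subpt, Qty) for the trike example.
ASSEMBLY = {
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("frame", "seat", 1), ("frame", "pedal", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("tire", "rim", 1), ("tire", "tube", 1),
}


def apply_rule1(assembly):
    # Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
    return {(part, subpt) for part, subpt, _qty in assembly}


def apply_rule2(assembly, comp):
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
    return {
        (part, subpt)
        for part, part2, _qty in assembly
        for comp_part, subpt in comp
        if part2 == comp_part          # the shared variable Part2
    }


comp = apply_rule1(ASSEMBLY)
print(("trike", "wheel") in comp)      # the inference described above

# All inferences of the second rule that yield tuples not yet in Comp.
new_tuples = sorted(apply_rule2(ASSEMBLY, comp) - comp)
print(new_tuples)
```

One application of the second rule adds exactly the components that sit two levels below a part, such as (trike, tire).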
Example

For any instance of Assembly, we can compute all Comp tuples by repeatedly applying the two rules. (Actually, we can apply Rule 1 just once, then apply Rule 2 repeatedly.)

(Figures: Comp tuples obtained by applying Rule 2 once; Comp tuples obtained by applying Rule 2 twice.)

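Repeated application until nothing new is derived is often called naive bottom-up evaluation. The sketch below is one possible Python rendering under that assumption, using the trike Assembly instance; `naive_evaluation` is a name chosen here, and real systems typically use more refined strategies.

```python
# Assembly(Part, Subpt, Qty) for the trike example.
ASSEMBLY = {
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("frame", "seat", 1), ("frame", "pedal", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("tire", "rim", 1), ("tire", "tube", 1),
}


def naive_evaluation(assembly):
    """Apply Rule 1 once, then Rule 2 until no new Comp tuple appears."""
    comp = {(part, subpt) for part, subpt, _qty in assembly}   # Rule 1
    productive_rounds = 0
    while True:
        derived = {                                            # Rule 2
            (part, subpt)
            for part, part2, _qty in assembly
            for comp_part, subpt in comp
            if part2 == comp_part
        }
        if derived.issubset(comp):
            return comp, productive_rounds    # fixpoint reached
        comp |= derived
        productive_rounds += 1


comp, productive_rounds = naive_evaluation(ASSEMBLY)
print(len(comp), productive_rounds)
print(sorted(subpt for part, subpt in comp if part == "trike"))
```

On this instance Rule 2 produces new tuples in two rounds, matching the two figures; a third application only confirms that the fixpoint has been reached.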
Datalog vs. SQL:1999 (SQL3) notation

A collection of Datalog rules can be rewritten in SQL syntax, if recursion is allowed (this is the case in SQL:1999).

WITH RECURSIVE Comp(Part, Subpt) AS (
    (SELECT A1.Part, A1.Subpt FROM Assembly A1)
    UNION
    (SELECT A2.Part, C1.Subpt
     FROM Assembly A2, Comp C1
     WHERE A2.Subpt = C1.Part)
)
SELECT * FROM Comp

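A recursive query of this kind can be tried out with SQLite through Python's standard `sqlite3` module. The sketch below is an illustration under assumptions: the table layout `Assembly(Part, Subpt, Qty)` and the trike data are taken from the example, the parentheses around the two SELECT branches are dropped because SQLite does not accept them there, and the final SELECT is narrowed to the components of the trike.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Assembly (Part TEXT, Subpt TEXT, Qty INTEGER)")
conn.executemany(
    "INSERT INTO Assembly VALUES (?, ?, ?)",
    [
        ("trike", "wheel", 3), ("trike", "frame", 1),
        ("frame", "seat", 1), ("frame", "pedal", 1),
        ("wheel", "spoke", 2), ("wheel", "tire", 1),
        ("tire", "rim", 1), ("tire", "tube", 1),
    ],
)

# Base case = Rule 1, recursive case = Rule 2 of the Datalog program.
QUERY = """
WITH RECURSIVE Comp(Part, Subpt) AS (
    SELECT A1.Part, A1.Subpt FROM Assembly A1
    UNION
    SELECT A2.Part, C1.Subpt
    FROM Assembly A2, Comp C1
    WHERE A2.Subpt = C1.Part
)
SELECT Subpt FROM Comp WHERE Part = 'trike' ORDER BY Subpt
"""

trike_parts = [row[0] for row in conn.execute(QUERY)]
total = conn.execute(
    QUERY.replace(
        "SELECT Subpt FROM Comp WHERE Part = 'trike' ORDER BY Subpt",
        "SELECT COUNT(*) FROM Comp",
    )
).fetchone()[0]
conn.close()

print(trike_parts)
print(total)
```

UNION (rather than UNION ALL) removes duplicates, which is also what guarantees termination if the Assembly relation ever contained a cycle.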
"Data mining" and "knowledge discovery"

- (informal definition): data mining is about discovering knowledge in (huge amounts of) data
- Therefore, it is clearer to speak about "knowledge discovery in data(bases)"

Recall: Data, information, and knowledge

Data represents a fact or statement of event without relation to other things.
- Ex: It is raining.

Information embodies the understanding of a relationship of some sort, possibly cause and effect.
- Ex: The temperature dropped 15 degrees and then it started raining.

Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described or what will happen next.
- Ex: If the humidity is very high and the temperature drops substantially, the atmosphere is often unlikely to be able to hold the moisture, so it rains.

(This is from knowledge-management theory. If you want to know about wisdom, check the Web page: G. Bellinger, D. Castro, & A. Mills: Data, Information, Knowledge, and Wisdom. http://www.systems-thinking.org/dikw.htm)

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes
- Data collection and data availability
  - Automated data collection tools, database systems, Web, computerized society
- Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: Remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, …

We are drowning in data, but starving for knowledge!

"Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Background: Evolution of Database Technology

1960s:
- Data collection, database creation, IMS and network DBMS
1970s:
- Relational data model, relational DBMS implementation
1980s:
- RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
- Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
- Data mining, data warehousing, multimedia databases, and Web databases
2000s:
- Stream data management and mining
- Data mining and its applications
- Web technology (XML, data integration) and global information systems

A note on: Data Warehousing for finding implicit knowledge in data, and why I don't include this in the course (now)

The KDD process

"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad, Piatetsky-Shapiro, Smyth, 1996)

- non-trivial process: multiple process
- valid: justified patterns/models
- novel: previously unknown
- useful: can be used
- understandable: by human and machine

The process part of knowledge discovery

CRISP-DM
- CRoss Industry Standard Process for Data Mining
- A data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

Knowledge discovery, machine learning, data mining

- Knowledge discovery = the whole process
- Machine learning = the application of induction algorithms and other algorithms that can be said to "learn"; this is the "modeling" phase
- Data mining: sometimes = KD, sometimes = ML

The KDD Process

(Figure: the KDD process as five numbered steps. Labels in the figure, in the order in which they survive:)
- Data organized by function
- Create/select target database
- Data warehousing
- Select sampling technique and sample data
- Supply missing values
- Eliminate noisy data
- Normalize values
- Transform values
- Create derived attributes
- Find important attributes & value ranges
- Select DM task(s)
- Transform to different representation
- Select DM method(s)
- Extract knowledge
- Test knowledge
- Refine knowledge
- Query & report generation
- Aggregation & sequences
- Advanced methods

Main Contributing Areas of KDD

- Statistics: Infer info from data (deduction & induction, mainly numeric data)
- Databases: Store, access, search, update data (deduction)
- Machine Learning: Computer algorithms that improve automatically through experience (mainly induction, symbolic data)

Also shown in the figure:
- [data warehouses: integrated data]
- [OLAP: On-Line Analytical Processing]

Data Mining: Classification Schemes

General functionality
- Descriptive data mining
- Predictive data mining

Different views lead to different classifications
- Data view: Kinds of data to be mined
- Knowledge view: Kinds of knowledge to be discovered
- Method view: Kinds of techniques utilized
- Application view: Kinds of applications adapted

Data Mining: Confluence of Multiple Disciplines

Data mining draws on:
- Database Technology
- Statistics
- Machine Learning
- Pattern Recognition
- Algorithm
- Visualization
- Other Disciplines

Why Not Traditional Data Analysis?

Tremendous amount of data
- Algorithms must be highly scalable to handle data volumes such as terabytes

High dimensionality of data
- A micro-array may have tens of thousands of dimensions

High complexity of data
- Data streams and sensor data
- Time-series data, temporal data, sequence data
- Structured data, graphs, social networks and multi-linked data
- Heterogeneous databases and legacy databases
- Spatial, spatiotemporal, multimedia, text and Web data
- Software programs, scientific simulations

New and sophisticated applications

Classification

"What factors determine cancerous cells?"

(Figure: examples (data on cancerous cells) are fed into a data mining algorithm, here a classification algorithm, which produces general patterns.)

Classification algorithms:
- Rule induction
- Decision tree
- Neural network

Classification: Rule Induction

"What factors determine whether a cell is cancerous?"

If Color = light and Tails = 1 and Nuclei = 2
Then Healthy Cell (certainty = 92%)

If Color = dark and Tails = 2 and Nuclei = 2
Then Cancerous Cell (certainty = 87%)

Classification: Decision Trees

Color = dark
    #nuclei = 1
        #tails = 1: healthy
        #tails = 2: cancerous
    #nuclei = 2: cancerous
Color = light
    #nuclei = 1: healthy
    #nuclei = 2
        #tails = 1: healthy
        #tails = 2: cancerous

Classification: Neural Networks

"What factors determine whether a cell is cancerous?"

(Figure: a neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous.)

Clustering

"Are there clusters of similar cells?"

(Figure: clusters of cells, labelled:)
- Light color with 1 nucleus
- Dark color with 2 tails, 2 nuclei
- 1 nucleus and 1 tail
- Dark color with 1 tail and 2 nuclei

Association Rule Discovery

Task: Discovering association rules among items in a transaction database.

An association among two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.

In general: A1, A2, … => B

Association Rule Discovery

"Are there any associations between the characteristics of the cells?"

If color = light and # nuclei = 1
then # tails = 1 (support = 12.5%; confidence = 50%)

If # nuclei = 2 and Cell = Cancerous
then # tails = 2 (support = 25%; confidence = 100%)

If # tails = 1
then Color = light (support = 37.5%; confidence = 75%)

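Support and confidence follow directly from counting records. The Python sketch below uses the common definitions (support: fraction of all records that satisfy both the antecedent and the consequent; confidence: that count divided by the number of records satisfying the antecedent). The eight cell records are hypothetical: they were chosen here so that the three rules come out with the values quoted above, and they are not the data behind the slide.

```python
# Hypothetical cell records (not the original data set).
CELLS = [
    {"color": "light", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 1, "tails": 2, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 1, "tails": 2, "cell": "cancerous"},
]


def matches(record, conditions):
    """True if the record has the required value for every listed attribute."""
    return all(record[attr] == value for attr, value in conditions.items())


def support_and_confidence(records, antecedent, consequent):
    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in records
    )
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


rule1 = support_and_confidence(CELLS, {"color": "light", "nuclei": 1}, {"tails": 1})
rule2 = support_and_confidence(CELLS, {"nuclei": 2, "cell": "cancerous"}, {"tails": 2})
rule3 = support_and_confidence(CELLS, {"tails": 1}, {"color": "light"})
print(rule1, rule2, rule3)
```

A rule can have high confidence and low support (rule 2 holds in every case where it applies, but applies to only a quarter of the records), which is why both measures are reported.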
Many Other Data Mining Techniques

- Genetic Algorithms
- Rough Sets
- Bayesian Networks
- Text Mining
- Statistics
- Time Series

A goal: From databases to deductive databases to inductive databases

- A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.
- Inductive databases contain not only data, but also patterns. In an IDB, inductive queries can be used to generate (mine), manipulate, and apply patterns. The IDB framework supports the process of knowledge discovery in databases (KDD):
  - the results of one (inductive) query can be used as input for another
  - nontrivial multi-step KDD scenarios can be supported, rather than just single data mining operations.

Input data ... Q: when does this person play tennis?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

The goal: a decision tree for classification / prediction

In which weather will someone play (tennis etc.)?

Constructing decision trees

Strategy: top down, in recursive divide-and-conquer fashion
- First: select attribute for root node
  - Create branch for each possible attribute value
- Then: split instances into subsets
  - One for each branch extending from the node
- Finally: repeat recursively for each branch, using only instances that reach the branch

Stop if all instances have the same class

Which attribute to select?

Criterion for attribute selection

Which is the best attribute?
- Want to get the smallest tree
- Heuristic: choose the attribute that produces the "purest" nodes

Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets

Strategy: choose attribute that gives greatest information gain

Computing information

Measure information in bits
- Given a probability distribution, the info required to predict an event is the distribution's entropy
- Entropy gives the information required in bits (can involve fractions of bits!)

Formula for computing the entropy:

entropy(p_1, p_2, ..., p_n) = -p_1 log2 p_1 - p_2 log2 p_2 - ... - p_n log2 p_n

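The entropy of a distribution is short to compute. The Python sketch below uses the standard Shannon entropy with base-2 logarithms; the helper `info`, which takes class counts instead of probabilities, is a convenience named after the info([...]) notation used in the information-gain computation, and the convention that terms with probability 0 contribute nothing is the usual one.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; terms with probability 0 contribute 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


def info(counts):
    """Entropy of the class distribution given as a list of class counts."""
    total = sum(counts)
    return entropy(c / total for c in counts)


print(entropy([0.5, 0.5]))          # a fair coin: exactly one bit
print(round(info([9, 5]), 3))       # 9 instances of one class, 5 of the other
print(round(info([2, 3]), 3))
print(info([4, 0]))                 # a pure node carries no information
```

The second value is a fraction of a bit: a 9-to-5 split is less uncertain than a fair coin, so less than one bit is needed on average.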
Example: attribute Outlook

Computing information gain

Information gain: information before splitting - information after splitting

gain(Outlook) = info([9,5]) - info([2,3], [4,0], [3,2])
              = 0.940 - 0.693
              = 0.247 bits

Information gain for attributes from weather data:

gain(Outlook)     = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity)    = 0.152 bits
gain(Windy)       = 0.048 bits

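These figures can be reproduced with a few lines of Python. The sketch below assumes the standard 14-instance weather data set (as used by Witten & Frank), encoded as tuples whose last element is the class; the names `ROWS`, `info`, and `gain` are chosen here for illustration.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Windy")

# Standard weather data: (Outlook, Temp, Humidity, Windy, Play).
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]


def info(labels):
    """Entropy (in bits) of the class labels of a set of instances."""
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())


def gain(rows, attr_index):
    """Information before splitting minus weighted information after."""
    before = info([r[-1] for r in rows])
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attr_index]].append(r[-1])
    after = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return before - after


gains = {name: round(gain(ROWS, i), 3) for i, name in enumerate(ATTRIBUTES)}
print(gains)
```

Outlook has the largest gain and therefore becomes the root of the tree.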
Continuing to split

For the instances in the Outlook = Sunny branch:

gain(Temperature) = 0.571 bits
gain(Humidity)    = 0.971 bits
gain(Windy)       = 0.020 bits

Final decision tree

- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when data can't be split any further

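The recursive construction and the information-gain criterion can be combined into a compact program. The following Python sketch is a simplified illustration in the spirit of ID3, not Quinlan's original implementation: it assumes the standard 14-instance weather data, nominal attributes only, a nested-dictionary tree representation chosen here, and a majority vote when no attributes are left.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("Outlook", "Temp", "Humidity", "Windy")

# Standard weather data: (Outlook, Temp, Humidity, Windy, Play).
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]


def info(rows):
    """Entropy (in bits) of the class labels of the given instances."""
    counts = Counter(r[-1] for r in rows)
    return sum(-c / len(rows) * log2(c / len(rows)) for c in counts.values())


def split(rows, attr):
    """Partition the instances by their value for attribute index `attr`."""
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r)
    return subsets


def gain(rows, attr):
    after = sum(len(s) / len(rows) * info(s) for s in split(rows, attr).values())
    return info(rows) - after


def id3(rows, attrs):
    labels = [r[-1] for r in rows]
    # Stop: all instances have the same class, or nothing is left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {
        ATTRIBUTES[best]: {
            value: id3(subset, rest) for value, subset in split(rows, best).items()
        }
    }


tree = id3(ROWS, list(range(len(ATTRIBUTES))))
print(tree)
```

On this data the sketch splits on Outlook at the root, then on Humidity for sunny days and on Windy for rainy days, while overcast days form a pure leaf.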
Wishlist for a purity measure

Properties we require from a purity measure:
- When node is pure, measure should be zero
- When impurity is maximal (i.e. all classes equally likely), measure should be maximal
- Measure should obey multistage property (i.e. decisions can be made in several stages), for example: measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])

Entropy is the only function that satisfies all three properties!

Properties of the entropy

The multistage property:

entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q / (q + r), r / (q + r))

Simplification of computation:

info([2,3,4]) = -2/9 × log2(2/9) - 3/9 × log2(3/9) - 4/9 × log2(4/9)
              = [-2 log2 2 - 3 log2 3 - 4 log2 4 + 9 log2 9] / 9

Note: instead of maximizing info gain we could just minimize information

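Both properties can be checked numerically. The Python sketch below assumes the usual statement of the multistage property, entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q + r), r/(q + r)), and the count-based form of the entropy, (N log2 N - sum of c_i log2 c_i) / N for class counts c_i with total N. The class counts [2, 3, 4] are used as the example.

```python
from math import isclose, log2


def entropy(*probabilities):
    """Shannon entropy in bits; terms with probability 0 contribute 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


counts = [2, 3, 4]
n = sum(counts)
p, q, r = (c / n for c in counts)

# Direct computation from the probabilities.
direct = entropy(p, q, r)

# Multistage: first decide "class 1 or not", then choose among the rest.
staged = entropy(p, q + r) + (q + r) * entropy(q / (q + r), r / (q + r))

# Simplified computation that works on the raw counts.
simplified = (n * log2(n) - sum(c * log2(c) for c in counts)) / n

print(round(direct, 3))
print(isclose(direct, staged), isclose(direct, simplified))
```

The count-based form avoids computing the fractions explicitly, which is convenient when the gain of many candidate splits has to be evaluated.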
Discussion / outlook decision trees

- Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
- Various improvements, e.g. C4.5: deals with numeric attributes, missing values, noisy data
- Gain ratio instead of information gain [see Witten & Frank slides, ch. 4, pp. 40-45]
- Similar approach: CART
- …

Agenda

- Motivation I: Application examples
- Motivation II: Types of reasoning
- A key concept of deductive DBs: Recursion
- The process of knowledge discovery (KDD)
- KDD: Origins and context
- A short overview of key KDD techniques
- An algorithm for decision-tree learning: ID3
- Mining semistructured and unstructured data

References / background reading

Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:

- a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
- a machine learning perspective: Witten, I. H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
- a statistics perspective: Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2

The CRISP-DM phase model can be found at http://www.crisp-dm.org

Acknowledgements

- pp. 12, 17: http://en.wikipedia.org/wiki/Deductive_database
- pp. 14, 15: http://en.wikipedia.org/wiki/Abductive_reasoning
- p. 18: http://en.wikipedia.org/wiki/Datalog
- pp. 19-26 taken from (with minor modifications):
- pp. 33, 36, 38, 43-50 were taken from (with minor modifications):
- Tzacheva, A. A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/Overview.II.ppt
- pp. 30, 31, 39-41 were taken from
- Tzacheva, A. A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/Overview.I.ppt
- pp. 47-50 were taken from
- Ramakrishnan, R. & Gehrke, J. (2002?). Database Management Systems, 3rd Edition 2002. Instructor Slides. Ch. 25 - Deductive Databases. http://pages.cs.wisc.edu/~dbbook/open.Access/third.Edition/slides3edenglish/Ch25_Ded.DB-95.pdf
- Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques, Chapter 1, Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt

The ID3 part is based on Witten, I. H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html

In particular, the instructor slides for that book available at http://books.elsevier.com/companions/9780120884070/ (chapters 1-4):
http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ... chapter2.pdf, chapter3.pdf, chapter4.pdf)
or
http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ... chapter2.odp, chapter3.odp, chapter4.odp)

Picture credits

- See "notes" of the slides