Classification n 202136 Bayesian Classification 1 Bayesian Classification

  • Slides: 47
Download presentation
Classification n 2021/3/6 Bayesian Classification 1

Classification n 2021/3/6 Bayesian Classification 1

Bayesian Classification: Why? n n Probabilistic learning: Calculate explicit probabilities for hypothesis, among the

Bayesian Classification: Why? n n Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured 2021/3/6 2

Bayesian Theorem n n n 2021/3/6 Given training data D, posteriori probability of a

Bayesian Theorem n n n 2021/3/6 Given training data D, posteriori probability of a hypothesis h, P(h|D) follows the Bayes theorem MAP (maximum posteriori) hypothesis Practical difficulty: require initial knowledge of many probabilities, significant computational cost 3

Naïve Bayes Classifier n n n 2021/3/6 P(Ci|X) = prob. that the sample X

Naïve Bayes Classifier n n n 2021/3/6 P(Ci|X) = prob. that the sample X is of class Ci. The naive Bayesian classifier assigns an unknown sample X to class Ci if and only if Idea: assign sample X to the class Ci if P(Ci|X) is maximal among P(C 1|X), P(C 2|X), …, P(Cm|X) 4

Estimating a-posteriori probabilities n Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X) n P(X) is

Estimating a-posteriori probabilities n Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X) n P(X) is constant for all classes n P(C) = relative freq of class C samples n Remaining problem: How to compute P(X|C) ? 2021/3/6 5

Naïve Bayesian Classifier n n 2021/3/6 Naïve assumption: attribute independence P(X|C) = P(x 1,

Naïve Bayesian Classifier n n 2021/3/6 Naïve assumption: attribute independence P(X|C) = P(x 1, …, xk|C) = P(x 1|C)·…·P(xk|C) If i-th attribute of X is categorical: P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C If i-th attribute is continuous: P(xi|C) is estimated thru a Gaussian density function Computationally easy in both cases 6

Play-tennis example: estimating outlook P(xi|C) P(sunny|p) = 2/9 P(sunny|n) = 3/5 P(overcast|p) = 4/9

Play-tennis example: estimating outlook P(xi|C) P(sunny|p) = 2/9 P(sunny|n) = 3/5 P(overcast|p) = 4/9 P(overcast|n) = 0 P(rain|p) = 3/9 P(rain|n) = 2/5 temperature P(hot|p) = 2/9 P(hot|n) = 2/5 P(mild|p) = 4/9 P(mild|n) = 2/5 P(cool|p) = 3/9 P(cool|n) = 1/5 humidity P(p) = 9/14 P(n) = 5/14 2021/3/6 P(high|p) = 3/9 P(high|n) = 4/5 P(normal|p) = 6/9 P(normal|n) = 2/5 windy P(true|p) = 3/9 P(true|n) = 3/5 P(false|p) = 6/9 P(false|n) = 2/5 7

Play-tennis example: classifying X n n An unseen sample X = <rain, hot, high,

Play-tennis example: classifying X n n An unseen sample X = <rain, hot, high, false> P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9· 2/9· 3/9· 6/9· 9/14 = 0. 010582 P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5· 4/5· 2/5· 5/14 = 0. 018286 Sample X is classified in class n (don’t play) 2021/3/6 8

The independence hypothesis… n … makes computation possible n … yields optimal classifiers when

The independence hypothesis… n … makes computation possible n … yields optimal classifiers when satisfied n n … but is seldom satisfied in practice, as attributes (variables) are often correlated. Attempts to overcome this limitation: n n 2021/3/6 Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes Decision trees, that reason on one attribute at the time, considering most important attributes first 9

Bayesian Belief Networks (I) Family History Smoker (FH, S) (FH, ~S)(~FH, S) (~FH, ~S)

Bayesian Belief Networks (I) Family History Smoker (FH, S) (FH, ~S)(~FH, S) (~FH, ~S) Lung. Cancer Emphysema LC 0. 8 0. 5 0. 7 0. 1 ~LC 0. 2 0. 5 0. 3 0. 9 The conditional probability table for the variable Lung. Cancer Positive. XRay Dyspnea Bayesian Belief Networks 2021/3/6 10

Bayesian Belief Networks (II) n Bayesian belief network allows a subset of the variables

Bayesian Belief Networks (II) n Bayesian belief network allows a subset of the variables conditionally independent n A graphical model of causal relationships n Several cases of learning Bayesian belief networks n Given both network structure and all the variables: easy n Given network structure but only some variables n When the network structure is not known in advance 2021/3/6 11

Classification n n 2021/3/6 Bayesian Classification by backpropagation 12

Classification n n 2021/3/6 Bayesian Classification by backpropagation 12

What Is Artificial Neural Network n 2021/3/6 ANN is an artificial intelligence which simulates

What Is Artificial Neural Network n 2021/3/6 ANN is an artificial intelligence which simulates the behaviors of the neurons of our brains. They are applied to many problems, such as recognition, decision, control, prediction… 13

Artificial Neuron(類神經元) I 1 W 1 I 2 W 2 x>T ? Y Wn

Artificial Neuron(類神經元) I 1 W 1 I 2 W 2 x>T ? Y Wn In 輸出(Output) 輸入(Inputs) 2021/3/6 15

Artificial Neural Networks(類神經網路 ) Input 1 Input 2 Input 3 Output Input N 2021/3/6

Artificial Neural Networks(類神經網路 ) Input 1 Input 2 Input 3 Output Input N 2021/3/6 16

Animal Recognition Shape eyes Size BRAIN color Speed 2021/3/6 17

Animal Recognition Shape eyes Size BRAIN color Speed 2021/3/6 17

Neural Networks n n Advantages n prediction accuracy is generally high n robust, works

Neural Networks n n Advantages n prediction accuracy is generally high n robust, works when training examples contain errors n output may be discrete, real-valued, or a vector of several discrete or real-valued attributes n fast evaluation of the learned target function Criticism n long training time n difficult to understand the learned function (weights) n not easy to incorporate domain knowledge 2021/3/6 18

A Neuron x 0 w 0 x 1 w 1 xn å f wn

A Neuron x 0 w 0 x 1 w 1 xn å f wn Input weight vector x vector w n - mk weighted sum output y Activation function The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping 2021/3/6 19

Network Training n n The ultimate objective of training n obtain a set of

Network Training n n The ultimate objective of training n obtain a set of weights that makes almost all the tuples in the training data classified correctly Steps n Initialize weights with random values n Feed the input tuples into the network one by one n For each unit n n 2021/3/6 Compute the net input to the unit as a linear combination of all the inputs to the unit Compute the output value using the activation function Compute the error Update the weights and the bias 20

Multi-Layer Perceptron Output vector Output nodes Hidden nodes wij Input nodes Input vector: xi

Multi-Layer Perceptron Output vector Output nodes Hidden nodes wij Input nodes Input vector: xi 2021/3/6 21

Example 2021/3/6 22

Example 2021/3/6 22

2021/3/6 23

2021/3/6 23

2021/3/6 24

2021/3/6 24

2021/3/6 26

2021/3/6 26

Classification and Prediction n 2021/3/6 Bayesian Classification by backpropagation Other Classification Methods 27

Classification and Prediction n 2021/3/6 Bayesian Classification by backpropagation Other Classification Methods 27

Other Classification Methods n k-nearest neighbor classifier n case-based reasoning n Genetic algorithm n

Other Classification Methods n k-nearest neighbor classifier n case-based reasoning n Genetic algorithm n Rough set approach n Fuzzy set approaches 2021/3/6 28

Instance-Based Methods n n Instance-based learning: n Store training examples and delay the processing

Instance-Based Methods n n Instance-based learning: n Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified Typical approaches n k-nearest neighbor approach n Instances represented as points in a Euclidean space. n Locally weighted regression n Constructs local approximation n Case-based reasoning n Uses symbolic representations and knowledgebased inference 2021/3/6 29

The k-Nearest Neighbor Algorithm n n n All instances correspond to points in the

The k-Nearest Neighbor Algorithm n n n All instances correspond to points in the n-D space. The nearest neighbor are defined in terms of Euclidean distance. The target function could be discrete- or real- valued. For discrete-valued, the k-NN returns the most common value among the k training examples nearest to xq. Vonoroi diagram: the decision surface induced by 1 NN for a typical set of training examples. _ _ + _ _ 2021/3/6 _. + + xq . _ + . . 30

Discussion on the k-NN Algorithm n n The k-NN algorithm for continuous-valued target functions

Discussion on the k-NN Algorithm n n The k-NN algorithm for continuous-valued target functions n Calculate the mean values of the k nearest neighbors Distance-weighted nearest neighbor algorithm n Weight the contribution of each of the k neighbors according to their distance to the query point xq n giving greater weight to closer neighbors n Similarly, for real-valued target functions Robust to noisy data by averaging k-nearest neighbors Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes. n To overcome it, axes stretch or elimination of the least relevant attributes. Feature selection 2021/3/6 31

Case-Based Reasoning n n n Also uses: lazy evaluation + analyze similar instances Difference:

Case-Based Reasoning n n n Also uses: lazy evaluation + analyze similar instances Difference: Instances are not “points in a Euclidean space” Example: Water faucet problem in CADET (Sycara et al’ 92) Methodology n Instances represented by rich symbolic descriptions (e. g. , function graphs) n Multiple retrieved cases may be combined n Tight coupling between case retrieval, knowledge-based reasoning, and problem solving Research issues n Indexing based on syntactic similarity measure, and when failure, backtracking, and adapting to additional cases 2021/3/6 32

Remarks on Lazy vs. Eager Learning n n n Instance-based learning: lazy evaluation Decision-tree

Remarks on Lazy vs. Eager Learning n n n Instance-based learning: lazy evaluation Decision-tree and Bayesian classification: eager evaluation Key differences n Lazy method may consider query instance xq when deciding how to generalize beyond the training data D n Eager method cannot since they have already chosen global approximation when seeing the query Efficiency: Lazy - less time training but more time predicting Accuracy n Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function n Eager: must commit to a single hypothesis that covers the entire instance space 2021/3/6 33

Introduction to Genetic Algorithm n n 2021/3/6 Principle: survival-of-the-fitness Characteristics of GA n Robust

Introduction to Genetic Algorithm n n 2021/3/6 Principle: survival-of-the-fitness Characteristics of GA n Robust n Error-tolerant n Flexible n When you have no idea about solving problems… 34

Population Crossover Mutation Fitness Selection Mutation. . .

Population Crossover Mutation Fitness Selection Mutation. . .

Component of Genetic Algorithm n n Representation Genetic operations: n Crossover, mutation, inversion, as

Component of Genetic Algorithm n n Representation Genetic operations: n Crossover, mutation, inversion, as you wish Selection n Elitism, total, steady state, …as you wish Fitness n Problem dependent Everybody has different survival approaches. 2021/3/6 36

How to implement a GA ? Representation n Fitness n Operators design n Selection

How to implement a GA ? Representation n Fitness n Operators design n Selection strategy n 2021/3/6 37

Example(I) n 2021/3/6 Maximize 38

Example(I) n 2021/3/6 Maximize 38

2021/3/6 39

2021/3/6 39

Example(I): Representation n n 2021/3/6 Standard GA binary string n x = 5, x

Example(I): Representation n n 2021/3/6 Standard GA binary string n x = 5, x = 101 n x = 3. 25 x = 011. 1 n … Something noticeable n Length is predefined. n Not the only way. chromosome gene 40

Example(I): Fitness function n 2021/3/6 In this case, it is known already 41

Example(I): Fitness function n 2021/3/6 In this case, it is known already 41

Example(I): Genetic Operator n Standard crossover (one-point crossover) 2021/3/6 42

Example(I): Genetic Operator n Standard crossover (one-point crossover) 2021/3/6 42

Example(I): Genetic Operator n Standard mutation (point mutation) Randomly 2021/3/6 Randomly 43

Example(I): Genetic Operator n Standard mutation (point mutation) Randomly 2021/3/6 Randomly 43

Example(I): Selection n 2021/3/6 Standard selection (roulette wheel) 44

Example(I): Selection n 2021/3/6 Standard selection (roulette wheel) 44

Population Crossover Mutation Fitness Selection Mutation. . .

Population Crossover Mutation Fitness Selection Mutation. . .

2021/3/6 46

2021/3/6 46

Genetic Algorithms n n n GA: based on an analogy to biological evolution Each

Genetic Algorithms n n n GA: based on an analogy to biological evolution Each rule is represented by a string of bits An initial population is created consisting of randomly generated rules n e. g. , IF NOT A 1 and Not A 2 then C 2 can be encoded as 001 Based on the notion of survival of the fittest, a new population is formed to consists of the fittest rules and their offsprings The fitness of a rule is represented by its classification accuracy on a set of training examples Offsprings are generated by crossover and mutation 2021/3/6 47

Rough Set Approach n n n 2021/3/6 Rough sets are used to approximately or

Rough Set Approach n n n 2021/3/6 Rough sets are used to approximately or “roughly” define equivalent classes A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C) Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard but a discernibility matrix is used to reduce the computation intensity 48