Chapter 8 Machine learning Xiujun GONG Ph D

  • Slides: 29
Download presentation
Chapter 8 Machine learning Xiu-jun GONG (Ph. D) School of Computer Science and Technology,

Chapter 8 Machine learning Xiu-jun GONG (Ph. D) School of Computer Science and Technology, Tianjin University gongxj@tju. edu. cn http: //cs. tju. edu. cn/faculties/gongxj/course/ai/

Outline p What is machine learning p Tasks of Machine Learning p The Types

Outline p What is machine learning p Tasks of Machine Learning p The Types of Machine Learning p Performance Assessment p Summary

What is the “machine learning” p machine learning is concerned with the design and

What is the “machine learning” p machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn“ n n Acquiring knowledge Mastering skill Improving system’s performance Theorizing, posting hypothesis, discovering the law The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods.

A Generic System Input Variables: Hidden Variables: Output Variables: … … System

A Generic System Input Variables: Hidden Variables: Output Variables: … … System

Another View of Machine Learning p Machine Learning aims to discover the relationships between

Another View of Machine Learning p Machine Learning aims to discover the relationships between the variables of a system (input, output and hidden) from direct samples of the system p The study involves many fields: n Statistics, mathematics, theoretical computer science, physics, neuroscience, etc

Defining the Learning Task Improve on task, T, with respect to performance metric, P,

Defining the Learning Task Improve on task, T, with respect to performance metric, P, based on experience, E. T: Playing checkers P: Percentage of games won against an arbitrary opponent E: Playing practice games against itself T: Recognizing hand-written words P: Percentage of words correctly classified E: Database of human-labeled images of handwritten words T: Driving on four-lane highways using vision sensors P: Average distance traveled before a human-judged error E: A sequence of images and steering commands recorded while observing a human driver. T: Categorize email messages as spam or legitimate. P: Percentage of email messages correctly classified. E: Database of emails, some with human-given labels

Formulating the Learning Problem Data matrix: X n lines = patterns (data points, examples):

Formulating the Learning Problem Data matrix: X n lines = patterns (data points, examples): samples, patients, documents, images, … m columns = features: (attributes, input variables): genes, proteins, words, pixels, … n instance Colon cancer, Alon et al 1999 m attributes Output A 11, A 12, …, A 1 m A 21, A 22, …, A 2 m … … An 1, An 2, …, Anm ---C 1 ---C 2 ---… ---Cn

Supervised Learning p p Generates a function that maps inputs to desired outputs Classification

Supervised Learning p p Generates a function that maps inputs to desired outputs Classification & regression Training & test Algorithms n Global model: BN, NN, SVM, Decision Tree n Local model: KNN, CBR(Case-base reasoning) n instance Task m attributes Output A 11, A 12, …, A 1 m A 21, A 22, …, A 2 m … … An 1, An 2, …, Anm ---C 1 ---C 2 ---… ---Cn a 1, a 2, …, am ---? √ √ … … √ Training

Unsupervised learning p p Models a set of inputs: labeled examples are not available.

Unsupervised learning p p Models a set of inputs: labeled examples are not available. Clustering & data compression Cohension & divergence Algorithms n K-means, SOM, Bayesian, MST… m attributes n instance A 11, A 12, …, A 1 m A 21, A 22, …, A 2 m … … An 1, An 2, …, Anm Output ---C 1 ---C 2 ---… ---Cn Task X X … … X

Semi-Supervised Learning p p p Combines both labeled and unlabeled examples to generate an

Semi-Supervised Learning p p p Combines both labeled and unlabeled examples to generate an appropriate function or classifier. With large unlabeled sample, small labeled samples Algorithms n Co-training n EM n Latent variables m attributes n instance Task A 11, A 12, …, A 1 m A 21, A 22, …, A 2 m … … An 1, An 2, …, Anm a 1, a 2, …, am Output ---C 1 ---? ---… ---Cn ---? √ X … … √

Other types p Reinforcement learning n n p concerned with how an agent ought

Other types p Reinforcement learning n n p concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward find a policy that maps states of the world to the actions the agent ought to take in those states. Multi-task learning n Learns a problem together with other related problems at the same time, using a shared representation.

Learning Models(1) p A single Model: Motivation - build a single good model n

Learning Models(1) p A single Model: Motivation - build a single good model n n n Linear models Kernel methods Neural networks Probabilistic models Decision trees

Learning Models (2) p An Ensemble of Models n n Motivation – a good

Learning Models (2) p An Ensemble of Models n n Motivation – a good single model is difficult to compute (impossible? ), so build many and combine them. Combining many uncorrelated models produces better predictors. . . Boosting: Specific cost function Bagging: Bootstrap Sample: Uniform random sampling (with replacement) Active learning: Select samples for training actively

Linear models p f(x) = w x +b = S j=1: n wj xj

Linear models p f(x) = w x +b = S j=1: n wj xj +b Linearity in the parameters, NOT in the input components. Sw p f(x) = w F(x) +b = (Perceptron) p f(x) = i=1: m ai k(xi, x) +b method) S j j fj(x) +b (Kernel

Linear Decision Boundary hyperplane x 2 x 3 x 1 x 2 x 1

Linear Decision Boundary hyperplane x 2 x 3 x 1 x 2 x 1

Non-linear Decision Boundary x 2 x 3 x 2 x 1

Non-linear Decision Boundary x 2 x 3 x 2 x 1

Kernel Method x 1 x 2 xn Potential functions, Aizerman et al 1964 k(x

Kernel Method x 1 x 2 xn Potential functions, Aizerman et al 1964 k(x 1, x) k(x 2, x) a 1 a 2 S am k(xm, x) 1 b f(x) = Si ai k(xi, x) + b k(. , . ) is a similarity measure or “kernel”.

What is a Kernel? A kernel is: p a similarity measure p a dot

What is a Kernel? A kernel is: p a similarity measure p a dot product in some feature space: k(s, t) = F(s) F(t) But we do not need to know the F representation. Examples: p k(s, t) = exp(-||s-t||2/s 2) Gaussian kernel p k(s, t) = (s t)q Polynomial kernel

Probabilistic models p Bayesian network p Latent semantic model p Time series model-HMM

Probabilistic models p Bayesian network p Latent semantic model p Time series model-HMM

Decision Trees f 2 Choose f 1 All the data f 1 At each

Decision Trees f 2 Choose f 1 All the data f 1 At each step, choose the feature that “reduces entropy” most. Work towards “node purity”.

Decision Trees CART (Breiman, 1984) C 4. 5 (Quinlan, 1993) J 48

Decision Trees CART (Breiman, 1984) C 4. 5 (Quinlan, 1993) J 48

Boosting p Main assumption: n Combining many weak predictors to produce an ensemble predictor.

Boosting p Main assumption: n Combining many weak predictors to produce an ensemble predictor. p Each predictor is created by using a biased sample of the training data n p Instances (training examples) with high error are weighted higher than those with lower error Difficult instances get more attention

Bagging p Main assumption: n Combining many unstable predictors to produce a ensemble (stable)

Bagging p Main assumption: n Combining many unstable predictors to produce a ensemble (stable) predictor. n Unstable Predictor: small changes in training data produce large changes in the model. e. g. Neural Nets, trees p Stable: SVM, nearest Neighbor. p p Each predictor in ensemble is created by taking a bootstrap sample of the data. Bootstrap sample of N instances is obtained by drawing N example at random, with replacement. Encourages predictors to have uncorrelated errors.

Active learning Labeled Data NBClassifier Model Unlabeled data Data Pool Selector Learning incrementally Classifying

Active learning Labeled Data NBClassifier Model Unlabeled data Data Pool Selector Learning incrementally Classifying incrementally Computing the evaluation function incrementally

Performance Assessment Cost matrix Truth: y Predictions: F(x) Class -1 Class +1 Class -1

Performance Assessment Cost matrix Truth: y Predictions: F(x) Class -1 Class +1 Class -1 tn fp Class +1 fn tp Total rej=tn+fn sel=fp+tp Precision = tp/sel Class+1 /Total neg=tn+f p pos=fn+tp Class +1 / Total m=tn+fp +fn+tp Frac. selected = sel/m False alarm = fp/neg Hit rate = tp/pos Compare F(x) = sign(f(x)) to the target y, and report: • Error rate = (fn + fp)/m • {Hit rate , False alarm rate} or {Hit rate , Precision} or {Hit rate , Frac. selected} • Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 – (sensitivity+specificity)/2 • F the measure = 2 precision. recall/(precision+recall) Vary decision threshold q in F(x) = sign(f(x)+q), and plot: • ROC curve: Hit rate vs. False alarm rate • Lift curve: Hit rate vs. Fraction selected • Precision/recall curve: Hit rate vs. Precision

Challenges training example 5 10 s Sylva 104 102 Ada Dexter, Nova Madelon 103

Challenges training example 5 10 s Sylva 104 102 Ada Dexter, Nova Madelon 103 NIPS 2003 & WCCI 2006 Gisette Gina Arcene, Dorothea, Hiva 10 inputs 10 102 103 104 105

BER/<BER> Challenge Winning Methods

BER/<BER> Challenge Winning Methods

Issues in Machine Learning What algorithms are available for learning a concept? How well

Issues in Machine Learning What algorithms are available for learning a concept? How well do they perform? p How much training data is sufficient to learn a concept with high confidence? p When is it useful to use prior knowledge? p Are some training examples more useful than others? p What are best tasks for a system to learn? p What is the best way for a system to represent its knowledge? p