Machine Learning: Foundations Yishay Mansour Tel-Aviv University

Typical Applications
• Classification/Clustering problems:
  – Data mining
  – Information retrieval
  – Web search
• Self-customization:
  – news
  – mail

Typical Applications
• Control:
  – Robots
  – Dialog systems
• Complicated software:
  – driving a car
  – playing a game (backgammon)

Why Now?
• Technology ready:
  – Algorithms and theory.
• Information abundant:
  – Flood of data (online).
• Computational power:
  – Sophisticated techniques.
• Industry and consumer needs.

Example 1: Credit Risk Analysis
• Typical customer: a bank.
• Database: current clients' data, including:
  – basic profile (income, house ownership, delinquent accounts, etc.)
  – basic classification.
• Goal: predict/decide whether to grant credit.

Example 1: Credit Risk Analysis
• Rules learned from the data:
  IF Other-Delinquent-Accounts > 2 AND Number-Delinquent-Billing-Cycles > 1 THEN DENY CREDIT
  IF Other-Delinquent-Accounts = 0 AND Income > $30k THEN GRANT CREDIT

Example 2: Clustering News
• Data: Reuters news / Web data.
• Goal: basic category classification:
  – Business, sports, politics, etc.
  – classify into subcategories (unspecified).
• Methodology:
  – consider "typical words" for each category.
  – classify using a "distance" measure.

Example 3: Robot Control
• Goal: control a robot in an unknown environment.
• Needs both:
  – to explore (new places and actions)
  – to use acquired knowledge to gain benefits.
• The learning task "controls" what it observes!

A Glimpse into the Future
• Today's status:
  – First-generation algorithms: neural nets, decision trees, etc.
  – Well-formed databases.
• Future:
  – many more problems: networking, control, software.
  – Main advantage is flexibility!

Relevant Disciplines
• Artificial intelligence
• Statistics
• Computational learning theory
• Control theory
• Information theory
• Philosophy
• Psychology and neurobiology

Types of Models
• Supervised learning
  – Given access to classified (labeled) data.
• Unsupervised learning
  – Given access to data, but no classification.
• Control learning
  – Selects actions and observes consequences.
  – Maximizes long-term cumulative return.

Learning: Complete Information
• Probability distribution D1 over one class (the "smiley" points) and D2 over the other class; the two classes are equally likely.
• Compute the probability of "smiley" given a point (x, y).
• Use Bayes' formula (spelled out below).
• Let p be that probability.
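Since the slide's figure is not reproduced here, the following is a hedged reconstruction of the Bayes computation it refers to, under the stated equal-priors assumption:

```latex
% Posterior probability of the "smiley" class at a point (x, y),
% assuming equal priors Pr[smiley] = Pr[other] = 1/2:
p \;=\; \Pr[\text{smiley} \mid (x,y)]
  \;=\; \frac{D_1(x,y)\,\Pr[\text{smiley}]}
             {D_1(x,y)\,\Pr[\text{smiley}] + D_2(x,y)\,\Pr[\text{other}]}
  \;=\; \frac{D_1(x,y)}{D_1(x,y) + D_2(x,y)}.
```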

Predictions and Loss Model
• Boolean (0/1) error
  – Predict a Boolean value.
  – For each error we lose 1 (no error, no loss).
  – Compare the probability p to 1/2.
  – Predict deterministically with the higher value (justified below).
  – Optimal prediction (for this loss).
• Cannot recover the probabilities!
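The threshold-at-1/2 rule can be justified in one line (a short addition, not spelled out on the slide):

```latex
% Expected 0/1 loss of the two deterministic predictions,
% where p = Pr[outcome = 1]:
\mathbb{E}[\text{loss} \mid \text{predict } 1] = 1 - p,
\qquad
\mathbb{E}[\text{loss} \mid \text{predict } 0] = p.
% Predicting the more likely value achieves the optimum \min(p, 1-p),
% i.e. predict 1 exactly when p \ge 1/2.
```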

Predictions and Loss Model
• Quadratic loss
  – Predict a "real number" q for outcome 1.
  – Loss (q - p)^2 for outcome 1.
  – Loss ([1-q] - [1-p])^2 for outcome 0.
  – Expected loss: (p - q)^2.
  – Minimized for q = p (optimal prediction).
• Recovers the probabilities.
• Needs to know p to compute the loss!

Predictions and Loss Model
• Logarithmic loss
  – Predict a "real number" q for outcome 1.
  – Loss log(1/q) for outcome 1.
  – Loss log(1/(1-q)) for outcome 0.
  – Expected loss: -p log q - (1-p) log(1-q).
  – Minimized for q = p (optimal prediction; derivation below).
• Recovers the probabilities.
• The loss does not depend on p!
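The slide states the minimizer without proof; a short derivation (my addition) sets the derivative of the expected loss to zero:

```latex
% Expected logarithmic loss as a function of the prediction q:
L(q) = -p \log q - (1-p)\log(1-q).
% Setting the derivative to zero:
L'(q) = -\frac{p}{q} + \frac{1-p}{1-q} = 0
\;\Longleftrightarrow\; p(1-q) = q(1-p)
\;\Longleftrightarrow\; q = p.
% Since L is convex on (0,1), q = p is the unique minimizer.
```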

The Basic PAC Model
• Distribution D over domain X.
• Unknown target function f(x).
• Goal: find h(x) such that h(x) approximates f(x).
• Given H, find h ∈ H that minimizes Pr_D[h(x) ≠ f(x)].

Basic PAC Notions
• Example: (x, f(x)).
• S: a sample of m examples drawn i.i.d. using D.
• True error: e(h) = Pr_D[h(x) ≠ f(x)].
• Observed error: e'(h) = (1/m) |{ x ∈ S : h(x) ≠ f(x) }|.
• Basic question: how close is e(h) to e'(h)?

Bayesian Theory
• Prior distribution Pr[h] over H.
• Given a sample S, compute a posterior distribution: Pr[h|S] ∝ Pr[S|h] Pr[h].
• Maximum Likelihood (ML): choose h maximizing Pr[S|h].
• Maximum A Posteriori (MAP): choose h maximizing Pr[h|S].
• Bayesian Predictor: Σ_h h(x) Pr[h|S].

Nearest Neighbor Methods
• Classify using nearby examples.
• Assume a "structured space" and a "metric" (see the sketch below).
• [Figure: labeled + and - points in the plane, with an unlabeled query point "?" classified by its neighbors.]
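A minimal nearest-neighbor sketch in Python (my illustration, not from the slides; the Euclidean metric and the 1-NN rule are assumptions):

```python
import math

def nearest_neighbor_classify(query, examples):
    """Classify `query` by the label of its closest example.

    `examples` is a list of (point, label) pairs, where each point is a
    tuple of coordinates. Uses the Euclidean metric (an assumption).
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    _, label = min(examples, key=lambda ex: dist(query, ex[0]))
    return label

# Usage: points labeled '+' or '-', query point '?'
data = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((4.0, 4.2), '-'), ((4.5, 3.9), '-')]
print(nearest_neighbor_classify((1.1, 1.0), data))  # -> '+'
```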

Computational Methods
• How to find an h ∈ H with low observed error?
• In most cases the computational task is provably hard.
• Heuristic algorithms for specific classes.

Separating Hyperplane
• Perceptron: h(x) = sign(Σ_i w_i x_i).
• Find the weights w_1, ..., w_n (see the sketch below).
• Limited representation.
• [Figure: inputs x_1, ..., x_n multiplied by weights w_1, ..., w_n, summed, and passed through sign.]
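A minimal perceptron-training sketch (my illustration; the classic mistake-driven update rule is assumed, and the iteration count is arbitrary):

```python
def perceptron_train(samples, n_iters=100):
    """Learn weights w for sign(sum_i w_i * x_i) from labeled samples.

    `samples` is a list of (x, y) pairs with x a tuple of n features and
    y in {-1, +1}. Classic update: on a mistake, add y * x to w.
    """
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(n_iters):
        mistakes = 0
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # converged on linearly separable data
            break
    return w
```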

Neural Networks
• Sigmoidal gates: a = Σ_i w_i x_i and output = 1/(1 + e^(-a)) (see the sketch below).
• Training: Back Propagation.
• [Figure: a feed-forward network with inputs x_1, ..., x_n.]
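The forward computation of a single sigmoidal gate, as described on the slide (a minimal sketch; back propagation itself is not shown):

```python
import math

def sigmoid_gate(x, w):
    """Output of one sigmoidal gate: 1 / (1 + exp(-a)) with a = sum_i w_i * x_i."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

# Usage
print(sigmoid_gate([1.0, 2.0], [0.5, -0.25]))  # a = 0.0 -> output 0.5
```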

Decision Trees
• [Example tree: the root tests x_1 > 5; one branch is a leaf labeled +1, the other tests x_6 > 2 and leads to leaves labeled +1 and -1.]

Decision Trees
• Limited representation.
• Efficient algorithms.
• Aim: find a small decision tree with low observed error.

Decision Trees
• PHASE I: construct the tree greedily, using a local index function, e.g. the Gini index G(x) = x(1-x) or the entropy H(x) (see the sketch below).
• PHASE II: prune the decision tree while maintaining low observed error.
• Good experimental results.
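The two index functions named on the slide, written out (a small sketch; x is the fraction of positive examples at a node):

```python
import math

def gini_index(x):
    """Gini index G(x) = x * (1 - x) for a node whose positive fraction is x."""
    return x * (1.0 - x)

def entropy(x):
    """Binary entropy H(x) = -x log2 x - (1 - x) log2 (1 - x)."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

# A greedy splitter picks the attribute whose split minimizes the
# weighted index of the resulting children.
```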

Complexity versus Generalization
• Hypothesis complexity versus observed error.
• More complex hypotheses have lower observed error, but might have higher true error.

Basic Criteria for Model Selection
• Minimum Description Length: e'(h) + |code length of h|.
• Structural Risk Minimization: e'(h) + sqrt( log|H| / m ) (see the bound below).
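The SRM penalty corresponds to a standard uniform-convergence bound for a finite class H (stated here with explicit constants that are not on the slide):

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size m,
% simultaneously for every h in a finite class H:
e(h) \;\le\; e'(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}} .
% Minimizing the right-hand side over h is a form of Structural Risk Minimization.
```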

Genetic Programming
• A search method. Example: decision trees.
• Local mutation operations: change a node in a tree.
• Cross-over operations: replace a subtree by another tree.
• Keeps the "best" candidates: trees with low observed error (see the sketch below).
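A schematic of the search loop (my illustration only; the `observed_error`, `mutate`, and `crossover` helpers and the population parameters are hypothetical placeholders, not from the slides):

```python
import random

def genetic_search(population, observed_error, mutate, crossover,
                   generations=50, keep=10):
    """Generic evolutionary search over candidate trees.

    `observed_error(t)` scores a candidate; `mutate(t)` changes one node;
    `crossover(a, b)` replaces a subtree of `a` with a subtree of `b`.
    All three are supplied by the caller (hypothetical helpers).
    """
    for _ in range(generations):
        # Keep only the best candidates (lowest observed error).
        population.sort(key=observed_error)
        survivors = population[:keep]
        # Refill the population with mutated and crossed-over offspring.
        offspring = []
        while len(survivors) + len(offspring) < len(population):
            a, b = random.sample(survivors, 2)
            child = mutate(a) if random.random() < 0.5 else crossover(a, b)
            offspring.append(child)
        population = survivors + offspring
    return min(population, key=observed_error)
```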

General PAC Methodology
• Minimize the observed error.
• Search for a small-size classifier.
• Hand-tailored search methods for specific classes.

Weak Learning
• Small class of predicates H.
• Weak learning: assume that for any distribution D, there is some predicate h ∈ H that predicts better than 1/2 + ε.
• Weak learning ⇒ strong learning (via boosting).

Boosting Algorithms
• Functions: weighted majority of the predicates.
• Methodology: change the distribution to target "hard" examples.
• The weight of an example is exponential in the number of its incorrect classifications (see the sketch below).
• Extremely good experimental results and efficient algorithms.
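A sketch of the exponential reweighting step (my illustration; the update factor `beta` and the normalization follow the usual AdaBoost-style scheme, which the slide does not spell out):

```python
def reweight(weights, predictions, labels, beta=0.5):
    """Exponentially increase the weight of misclassified examples.

    `weights[i]` is the current distribution weight of example i;
    `predictions[i]` and `labels[i]` are in {-1, +1}. After several rounds,
    an example misclassified j times has weight proportional to (1/beta)**j.
    """
    new = [w * (1.0 if p == y else 1.0 / beta)
           for w, p, y in zip(weights, predictions, labels)]
    total = sum(new)
    return [w / total for w in new]  # renormalize to a distribution

# The final classifier is a weighted majority vote of the weak predicates
# learned on these successively reweighted distributions.
```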

Support Vector Machine
• [Figure: data in n dimensions projected into m dimensions.]

Support Vector Machine
• Project the data into a high-dimensional space.
• Use a hyperplane in the LARGE space.
• Choose a hyperplane with a large MARGIN (see the sketch below).
• [Figure: + and - points separated by a wide margin.]
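A sketch of the "project, then separate" idea with an explicit quadratic feature map (my illustration; a real SVM would use a kernel and a margin-maximizing solver, neither of which is shown here):

```python
def quadratic_feature_map(x):
    """Map (x1, ..., xn) to a higher-dimensional space containing all
    original coordinates plus all pairwise products x_i * x_j.
    Data that is not linearly separable in n dimensions may become
    separable by a hyperplane in this larger space."""
    n = len(x)
    features = list(x)
    for i in range(n):
        for j in range(i, n):
            features.append(x[i] * x[j])
    return features

# Usage: the XOR-like points below are not linearly separable in 2D,
# but the cross-term x1*x2 separates them in the mapped space.
for point in [(1, 1), (-1, -1), (1, -1), (-1, 1)]:
    print(point, "->", quadratic_feature_map(point))
```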

Other Models: Membership Queries
• [Figure: the learner submits a query x and receives the value f(x).]

Fourier Transform
• f(x) = Σ_z a_z χ_z(x), where χ_z(x) = (-1)^⟨x, z⟩.
• Many simple classes are well approximated using the large coefficients.
• Efficient algorithms for finding the large coefficients (see the sketch below).
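A sketch of estimating one Fourier coefficient by sampling (my illustration; the efficient large-coefficient search algorithms the slide alludes to are not reproduced here):

```python
import random

def chi(z, x):
    """Character chi_z(x) = (-1)^<x, z> over the Boolean cube {0,1}^n."""
    return -1 if sum(zi & xi for zi, xi in zip(z, x)) % 2 else 1

def estimate_coefficient(f, z, n, samples=10000):
    """Estimate a_z = E_x[f(x) * chi_z(x)] by uniform sampling.

    `f` maps a 0/1 tuple of length n to {-1, +1}. The estimation error is
    roughly 1/sqrt(samples) with high probability.
    """
    total = 0
    for _ in range(samples):
        x = tuple(random.randint(0, 1) for _ in range(n))
        total += f(x) * chi(z, x)
    return total / samples

# Usage: for f = chi_z itself the estimate should be close to 1.
n = 4
z = (1, 0, 1, 0)
print(estimate_coefficient(lambda x: chi(z, x), z, n))
```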

Reinforcement Learning
• Control problems.
• Changing the parameters changes the behavior.
• Search for optimal policies.

Clustering: Unsupervised learning

Unsupervised learning: Clustering

Basic Concepts in Probability
• For a single hypothesis h:
  – Given an observed error, bound the true error.
• Markov inequality
• Chebyshev inequality
• Chernoff inequality (statements below)
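The three inequalities named on the slide, stated in their standard forms (the precise variants used in the course may differ):

```latex
% Markov (X \ge 0, a > 0):
\Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a}.
% Chebyshev:
\Pr\big[\,|X - \mathbb{E}[X]| \ge a\,\big] \;\le\; \frac{\operatorname{Var}[X]}{a^2}.
% Chernoff-Hoeffding (X_1, \dots, X_m i.i.d. in [0,1], \bar{X} their mean):
\Pr\big[\,|\bar{X} - \mathbb{E}[\bar{X}]| \ge \epsilon\,\big] \;\le\; 2\,e^{-2m\epsilon^2}.
```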

Basic Concepts in Probability
• Switching from h1 to h2:
  – Given the observed errors, predict whether h2 is better.
• Total error rate.
• Cases where h1(x) ≠ h2(x):
  – a more refined comparison.