Probably Approximately Correct (PAC) Learning. Leslie G. Valiant
Probably Approximately Correct (PAC) Learning. Leslie G. Valiant, "A Theory of the Learnable," Comm. ACM 27(11): 1134-1142, 1984
Recall: Bayesian learning • Create a model based on some parameters • Assume some prior distribution on those parameters • Learning problem – Adjust the model parameters so as to maximize the likelihood of the model given the data – Use Bayes' formula for this.
PAC Learning • Given distribution D over observables X • Given a family of functions (concepts) F • For each x ∈ X and f ∈ F: f(x) provides the label for x • Given a family of hypotheses H, seek a hypothesis h such that Error(h) = Pr_{x∼D}[f(x) ≠ h(x)] is minimal
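As a quick illustration (not part of the original slides), a minimal Python sketch of estimating Error(h) by sampling from D; the uniform distribution, threshold target f, and hypothesis h below are hypothetical stand-ins.

import random

def empirical_error(h, f, sample_D, m=10000):
    """Monte Carlo estimate of Error(h) = Pr_{x~D}[f(x) != h(x)].
    sample_D() draws one x from the (unknown) distribution D."""
    xs = [sample_D() for _ in range(m)]
    return sum(1 for x in xs if f(x) != h(x)) / m

# hypothetical example: D uniform on [0, 1], target f = 1{x >= 0.3}, hypothesis h = 1{x >= 0.35}
f = lambda x: x >= 0.3
h = lambda x: x >= 0.35
print(empirical_error(h, f, lambda: random.random()))   # close to D([0.3, 0.35)) = 0.05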
PAC New Concepts • Large family of distributions D • Large family of concepts F • Family of hypotheses H • Main questions: – Is there a hypothesis h ∈ H that can be learned? – How fast can it be learned? – What error can be expected?
Estimation vs. approximation • Note: – The distribution D is fixed – There is no noise in the system (currently) – F is a space of binary functions (concepts) • This is thus an approximation problem, as the function is given exactly for each x ∈ X • Estimation problem: f is not given exactly but estimated from noisy data
Example (PAC) • Concept: Average body-size person • Inputs: for each person: – height – weight • Sample: labeled examples of persons – label + : average body-size – label - : not average body-size • Two-dimensional inputs
Observable space X with concept f
Example (PAC) • Assumption: target concept is a rectangle. • Goal: – Find a rectangle h that “approximates” the target – Hypothesis family H of rectangles • Formally: – With high probability, output a rectangle whose error is low.
Example (Modeling) • Assume: – Fixed distribution over persons. • Goal: – Low error with respect to THIS distribution!!! • What does the distribution look like? – Highly complex. – Each parameter is not uniform. – Highly correlated.
Model-based approach • First try to model the distribution. • Given a model of the distribution: – find an optimal decision rule. • Bayesian Learning
PAC approach • Assume that the distribution is fixed. • Samples are drawn i.i.d. – independent – identically distributed • Concentrate on the decision rule rather than the distribution.
PAC Learning • Task: learn a rectangle from examples. • Input: points x with labels f(x) ∈ {+, −}, assigned by a target rectangle R • Goal: – using the fewest examples – compute h – such that h is a good approximation of f
PAC Learning: Accuracy • Testing the accuracy of a hypothesis: – using the distribution D of examples. • Error = h Δ f (symmetric difference) • Pr[Error] = D(Error) = D(h Δ f) • We would like Pr[Error] to be controllable. • Given a parameter ε: – Find h such that Pr[Error] < ε.
PAC Learning: Hypothesis • Which Rectangle should we choose? – Similar to parametric modeling?
Setting up the Analysis: • Choose smallest rectangle. • Need to show: – For any distribution D and target rectangle f – input parameters: ε and δ – select m(ε, δ) examples – let h be the smallest consistent rectangle – then with probability 1 − δ (over the sample): D(f Δ h) < ε
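A small illustrative sketch (our own, not from the slides) of the learner the analysis assumes: take the tightest axis-aligned rectangle around the positive examples.

def tightest_rectangle(samples):
    """Smallest axis-aligned rectangle consistent with labeled 2-D samples.
    samples: list of ((height, weight), label) with label True for '+'.
    Returns (x_min, x_max, y_min, y_max), or None if there are no positives."""
    pos = [p for p, label in samples if label]
    if not pos:
        return None
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def rect_hypothesis(rect, point):
    """Hypothesis induced by the learned rectangle."""
    if rect is None:
        return False
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

# hypothetical sample of (height, weight) points
S = [((170, 65), True), ((180, 80), True), ((150, 45), False), ((200, 120), False)]
rect = tightest_rectangle(S)
print(rect, rect_hypothesis(rect, (175, 70)))   # (170, 180, 65, 80) True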
More general case (no rectangle) • A distribution: D (unknown) • Target function: ct from C – ct: X → {0, 1} • Hypothesis: h from H – h: X → {0, 1} • Error probability: – error(h) = Pr_D[h(x) ≠ ct(x)] • Oracle: EX(ct, D)
PAC Learning: Definition • C and H are concept classes over X. • C is PAC learnable by H if • there exists an algorithm A such that: – for any distribution D over X and ct in C – for every input ε and δ – A outputs a hypothesis h in H – while having access to EX(ct, D) – such that with probability 1 − δ we have error(h) < ε • Complexities.
Finite Concept class • Assume C = H and finite. • h is ε-bad if error(h) > ε. • Algorithm: – Sample a set S of m(ε, δ) examples. – Find h in H which is consistent. • Algorithm fails if h is ε-bad.
PAC learning: formalization (1) • X is the set of all possible examples • D is the distribution from which the examples are drawn • H is the set of all possible hypotheses, c ∈ H • m is the number of training examples • error(h) = Pr(h(x) ≠ c(x) | x is drawn from X according to D) • h is approximately correct if error(h) ≤ ε
PAC learning: formalization (2) • (Figure: hypothesis space H, the subset Hbad of ε-bad hypotheses, and the target c.) • To show: after m examples, with high probability, all consistent hypotheses are approximately correct, i.e., all consistent hypotheses lie in an ε-ball around c.
Complexity analysis (1) • The probability that a hypothesis hbad ∈ Hbad is consistent with the first m examples: error(hbad) > ε by definition. The probability that it agrees with one example is thus at most (1 − ε), and with m independent examples at most (1 − ε)^m
Complexity analysis (2) • For Hbad to contain a consistent hypothesis, at least one hypothesis in it must be consistent. Pr(Hbad has a consistent hypothesis) ≤ |Hbad|(1 − ε)^m ≤ |H|(1 − ε)^m
Complexity analysis (3) • To reduce the probability of error below δ: |H|(1 − ε)^m ≤ δ • This is possible when at least m ≥ (1/ε)(ln(1/δ) + ln |H|) examples are seen. • This is the sample complexity
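A quick sketch of this sample-complexity bound as code (our own illustration); the values ε = 0.1, δ = 0.05 and |H| = 2^(2^10) are chosen only to echo the next slide.

import math

def sample_size_consistent(eps, delta, H_size):
    """m >= (1/eps) * (ln(1/delta) + ln|H|): enough examples so that, with
    probability 1 - delta, every consistent hypothesis has error < eps."""
    return math.ceil((1.0 / eps) * (math.log(1.0 / delta) + math.log(H_size)))

# all boolean functions on n = 10 attributes: |H| = 2**(2**10)
print(sample_size_consistent(0.1, 0.05, 2 ** (2 ** 10)))   # 7128, grows linearly in 2**n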
Complexity analysis (4) • “at least m examples are necessary to build a consistent hypothesis h that is wrong at most ε of the time with probability 1 − δ” • Since |H| = 2^(2^n), the complexity grows exponentially with the number of attributes n • Conclusion: learning any boolean function is no better in the worst case than table lookup!
PAC learning -- observations • “Hypothesis h(X) is consistent with m examples and has an error of at most ε with probability 1 − δ” • This is a worst-case analysis. Note that the result is independent of the distribution D! • Growth rate analysis: – for ε → 0, m grows proportionally to 1/ε – for δ → 0, m grows logarithmically in 1/δ – for |H| → ∞, m grows logarithmically in |H|
PAC: comments • We only assumed that examples are i.i.d. • We have two independent parameters: – Accuracy ε – Confidence δ • No assumption about the likelihood of a hypothesis. • Hypothesis is tested on the same distribution as the sample.
PAC: non-feasible case • What happens if ct is not in H? • Need to redefine the goal. • Let h* in H minimize the error: b = error(h*) • Goal: find h in H such that error(h) ≤ error(h*) + ε = b + ε
Analysis* • For each h in H: – let obs-error(h) be the average error on the sample S. • Compute the probability that Pr{|obs-error(h) − error(h)| ≥ ε/2} is small. Chernoff bound: Pr < exp(−(ε/2)^2 m) • Union bound over all of H: Pr < |H| exp(−(ε/2)^2 m) • Sample size m > (4/ε^2) ln(|H|/δ)
Correctness • Assume that for all h in H: – |obs-error(h) − error(h)| < ε/2 • In particular: – obs-error(h*) < error(h*) + ε/2 – error(h) − ε/2 < obs-error(h) • For the output h: – obs-error(h) ≤ obs-error(h*) • Conclusion: error(h) < error(h*) + ε
Sample size issue • Due to the use of the Chernoff bound: Pr{|obs-error(h) − error(h)| ≥ ε/2} < exp(−(ε/2)^2 m), and over the entire H: Pr < |H| exp(−(ε/2)^2 m) • It follows that the sample size is m > (4/ε^2) ln(|H|/δ), not (1/ε) ln(|H|/δ) as before
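For comparison with the earlier calculator, a sketch of the non-feasible (agnostic) bound; the parameters and the choice |H| = 3^20 (OR functions over 20 variables) are illustrative only.

import math

def sample_size_agnostic(eps, delta, H_size):
    """Non-feasible case: m > (4/eps^2) * ln(|H|/delta) examples make every
    hypothesis' observed error lie within eps/2 of its true error, w.p. 1 - delta."""
    return math.ceil((4.0 / eps ** 2) * math.log(H_size / delta))

print(sample_size_agnostic(0.1, 0.05, 3 ** 20))   # 9988: note the 1/eps^2 dependence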
Example: Learning OR of literals • Inputs: x1, …, xn • Literals: xi and ¬xi • OR functions: for each variable, the target disjunction may contain xi, contain ¬xi, or omit it; thus the number of disjunctions is 3^n
ELIM: Algorithm for learning OR • Keep a list of all literals • For every example whose classification is 0: – Erase all the literals that are 1. • Example: c(00110) = 0 results in deleting ¬x1, ¬x2, x3, x4, ¬x5 • Correctness: – Our hypothesis h: an OR of our set of literals. – Our set of literals includes the target OR literals. – Every time h predicts zero: we are correct. • Sample size: m > (1/ε) ln(3^n/δ)
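A sketch of ELIM as code (our own illustration), assuming examples are 0/1 tuples and literals are encoded as (index, sign) pairs of our own choosing.

def elim(n, examples):
    """ELIM: learn an OR of literals over x_1..x_n from labeled bit-vectors.
    examples: list of (x, label) with x a tuple of 0/1 and label 0/1.
    Literal (i, True) means x_{i+1}, (i, False) means its negation."""
    literals = {(i, pos) for i in range(n) for pos in (True, False)}
    for x, label in examples:
        if label == 0:
            # erase every literal that evaluates to 1 on a negative example
            literals -= {(i, pos) for (i, pos) in literals if bool(x[i]) == pos}
    return literals

def predict(literals, x):
    return int(any(bool(x[i]) == pos for (i, pos) in literals))

# c(00110) = 0 erases the literals satisfied by 00110
print(sorted(elim(5, [((0, 0, 1, 1, 0), 0)])))   # keeps x1, x2, ¬x3, ¬x4, x5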
Learning parity • Functions: e.g. x1 ⊕ x7 ⊕ x9 • Number of functions: 2^n • Algorithm: – Sample a set of examples – Solve the resulting linear equations over GF(2) • Sample size: m > (1/ε) ln(2^n/δ)
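A sketch of the parity learner (our own illustration): Gauss-Jordan elimination over GF(2) on the sampled examples; the toy data at the end is hypothetical.

import numpy as np

def learn_parity(X, y):
    """Learn a parity function by solving a linear system over GF(2).
    X: m x n 0/1 matrix of examples, y: length-m 0/1 labels.
    Returns a 0/1 vector a with X @ a = y (mod 2), assuming the system is consistent."""
    A = np.concatenate([X % 2, (y % 2).reshape(-1, 1)], axis=1)
    m, n = X.shape
    row = 0
    for col in range(n):
        pivot = next((r for r in range(row, m) if A[r, col]), None)
        if pivot is None:
            continue
        A[[row, pivot]] = A[[pivot, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]          # eliminate with XOR (addition mod 2)
        row += 1
    a = np.zeros(n, dtype=int)
    r = 0
    for col in range(n):                # read off pivot variables; free variables stay 0
        if r < m and A[r, col] and not A[r, :col].any():
            a[col] = A[r, -1]
            r += 1
    return a

# hypothetical target x1 XOR x3 on 3 variables
X = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0]])
y = X[:, 0] ^ X[:, 2]
print(learn_parity(X, y))   # -> [1 0 1]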
Infinite Concept class • X = [0, 1] and H = {cq | q in [0, 1]} • cq(x) = 0 iff x < q • Assume C = H. (Figure: sample points on [0, 1]; min is the largest negative point, max the smallest positive point.) • Which cq should we choose in [min, max]?
Proof I • Define max = min{x | c(x) = 1} and min = max{x | c(x) = 0} over the sample • Show that Pr[D([min, max]) > ε] < δ • Proof: by contradiction. – If D([min, max]) > ε, the probability that a single example falls in [min, max] is at least ε – The probability that we never sample from [min, max] is (1 − ε)^m – Needs m > (1/ε) ln(1/δ) • There is something wrong
Proof II (correct): • Let max’ be such that D([q, max’]) = ε/2 • Let min’ be such that D([min’, q]) = ε/2 • Goal: show that with high probability – some positive example x+ lies in [q, max’] and – some negative example x− lies in [min’, q] • In such a case any threshold in [x−, x+] is good. • Compute the sample size!
Proof II (correct): • Pr[none of x1, x2, …, xm is in [min’, q]] = (1 − ε/2)^m < exp(−mε/2) • Similarly for the other side • We require 2 exp(−mε/2) < δ • Thus m > (2/ε) ln(2/δ)
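A sketch of the resulting learner and sample-size rule for the threshold class (our own illustration); returning the midpoint of [x−, x+] is one arbitrary but valid choice.

import math

def learn_threshold(samples):
    """Learn c_q on [0, 1], where c_q(x) = 0 iff x < q.
    samples: list of (x, label). Any threshold between the largest negative
    and the smallest positive point is consistent; we return the midpoint."""
    neg = [x for x, label in samples if label == 0]
    pos = [x for x, label in samples if label == 1]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2.0

def threshold_sample_size(eps, delta):
    """m > (2/eps) * ln(2/delta), from the two-sided eps/2 argument above."""
    return math.ceil((2.0 / eps) * math.log(2.0 / delta))

print(learn_threshold([(0.1, 0), (0.25, 0), (0.4, 1), (0.7, 1)]))   # -> 0.325
print(threshold_sample_size(0.1, 0.05))                              # -> 74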
Comments • The hypothesis space was very simple H={cq | q in [0, 1]} • There was no noise in the data or labels • So learning was trivial in some sense (analytic solution)
Non-Feasible case: Label Noise • Suppose the sampled labels are noisy. • Algorithm: – Find the function h with the lowest observed error!
Analysis • Define: zi as an ε/4-net (w.r.t. D) • For the optimal h* and our h there are – zj: |error(h[zj]) − error(h*)| < ε/4 – zk: |error(h[zk]) − error(h)| < ε/4 • Show that with high probability: – |obs-error(h[zi]) − error(h[zi])| < ε/4
Exercise (Submission Mar 29, 04) 1. Assume there is Gaussian (0, σ) noise on xi. Apply the same analysis to compute the required sample size for PAC learning. Note: Class labels are determined by the non-noisy observations.
General ε-net approach • Given a class H, define a class G – for every h in H – there exists a g in G such that – D(g Δ h) < ε/4 • Algorithm: find the best h in H. • Compute the confidence and sample size.
Occam Razor W. Occam (1320) “Entities should not be multiplied unnecessarily” A. Einstein “Simplify a problem as much as possible, but no simpler” Information-theoretic grounds?
Occam Razor • Finding the shortest consistent hypothesis. • Definition: (a, b)-Occam algorithm, with a > 0 and b < 1 – Input: a sample S of size m – Output: hypothesis h – For every (x, b) in S: h(x) = b (consistency) – size(h) < size(ct)^a · m^b – Efficiency.
Occam algorithm and compression • (Figure: party A holds the labeled sample S = {(xi, bi)} and transmits information to party B, who already knows x1, …, xm.)
Compression • Option 1: – A sends B the values b1, …, bm – m bits of information • Option 2: – A sends B the hypothesis h – Occam: for large enough m, size(h) < m • Option 3 (MDL): – A sends B a hypothesis h and “corrections” – complexity: size(h) + size(errors)
Occam Razor Theorem • A: (a, b)-Occam algorithm for C using H • D: distribution over inputs X • ct in C the target function, n = size(ct) • Sample size: m = O((1/ε) ln(1/δ) + (n^a/ε)^(1/(1−b))) • With probability 1 − δ, A(S) = h has error(h) < ε
Occam Razor Theorem • Use the bound for a finite hypothesis class. • Effective hypothesis class size: 2^size(h) < 2^(n^a m^b) • Sample size: solve m ≥ (1/ε)(ln(1/δ) + n^a m^b ln 2) for m • The VC dimension will later replace 2^size(h)
Exercise (Submission Mar 29, 04) 2. For an (a, b)-Occam algorithm, given noisy data with noise ~ N(0, σ^2), find the limitations on m. Hint: ε-net and Chernoff bound.
Learning OR with few attributes • Target function: OR of k literals • Goal: learn in time – polynomial in k and log n – with ε and δ constant • ELIM makes “slow” progress – disqualifies one literal per round – may remain with O(n) literals
Set Cover - Definition • Input: S1, …, St with Si ⊆ U • Output: Si1, …, Sik with ∪j Sij = U • Question: are there k sets that cover U? (NP-complete)
Set Cover Greedy algorithm • j = 0; U0 = U; C = ∅ • While Uj ≠ ∅: – Let Si be arg max |Si ∩ Uj| – Add Si to C – Let Uj+1 = Uj − Si – j = j + 1
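The greedy algorithm as a short sketch (our own illustration); the set and function names are invented for the example.

def greedy_set_cover(U, sets):
    """Greedy set cover: repeatedly pick the set covering the most uncovered
    elements. If a cover of size k exists, it uses at most about k*ln|U| sets."""
    uncovered = set(U)
    chosen = []
    while uncovered:
        i = max(range(len(sets)), key=lambda j: len(sets[j] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("the given sets do not cover U")
        chosen.append(i)
        uncovered -= sets[i]
    return chosen

print(greedy_set_cover({1, 2, 3, 4, 5},
                       [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]))   # -> [0, 3]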
Set Cover: Greedy Analysis • At termination, C is a cover. • Assume there is a cover C’ of size k. • C’ is a cover of every Uj, so some S in C’ covers at least |Uj|/k elements of Uj. • Analysis of Uj: |Uj+1| ≤ |Uj| − |Uj|/k = (1 − 1/k)|Uj| • Solving the recursion: the number of sets used is j < k ln |U| (exercise: solve the recursion)
Learning decision lists • Lists of arbitrary size can represent any boolean function. Lists whose tests are conjunctions of at most k ≤ n literals define the k-DL boolean language; for n attributes, the language is k-DL(n). • Each test in Conj(n, k) can appear with outcome Y, outcome N, or be absent, so there are at most 3^|Conj(n, k)| distinct sets of components • |k-DL(n)| ≤ 3^|Conj(n, k)| · |Conj(n, k)|! (the components can appear in any order) • |Conj(n, k)| = Σ_{i=0..k} C(2n, i) = O(n^k)
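A small sketch (our own) that evaluates these counting bounds numerically; math.comb requires Python 3.8+.

from math import comb, factorial

def conj_size(n, k):
    """|Conj(n, k)| = sum_{i=0}^{k} C(2n, i): conjunctions of at most k of 2n literals."""
    return sum(comb(2 * n, i) for i in range(k + 1))

def kdl_bound(n, k):
    """Upper bound |k-DL(n)| <= 3^|Conj(n,k)| * |Conj(n,k)|! (Y / N / absent, any order)."""
    c = conj_size(n, k)
    return 3 ** c * factorial(c)

print(conj_size(5, 2))   # 1 + 10 + 45 = 56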
Building an Occam algorithm • Given a sample S of size m – Run ELIM on S – Let LIT be the set of surviving literals – There exist k literals in LIT that classify all of S correctly • Negative examples: – any subset of LIT classifies them correctly
Building an Occam algorithm • Positive examples: – search for a small subset of LIT that classifies S+ correctly – for a literal z build Tz = {x ∈ S+ | z is 1 on x} – there are k sets Tz that cover S+ – greedily find k ln m sets that cover S+ • Output h = the OR of the chosen k ln m literals • size(h) < k ln m · log(2n) • Sample size: m = O(k log n · log(k log n))
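A sketch combining the two steps (ELIM, then greedy cover of S+); it reuses the hypothetical (index, sign) encoding of literals from the earlier ELIM sketch, and the toy target at the end is our own.

def occam_or(n, examples):
    """Occam-style learner for an OR of few literals over x_1..x_n.
    examples: list of (x, label) with x a 0/1 tuple. Literal (i, True) means
    x_{i+1}, (i, False) its negation. Returns an OR of about k*ln(m) literals."""
    # Step 1 (ELIM): keep only literals consistent with every negative example.
    lits = {(i, pos) for i in range(n) for pos in (True, False)}
    for x, label in examples:
        if label == 0:
            lits -= {(i, pos) for (i, pos) in lits if bool(x[i]) == pos}
    # Step 2 (greedy set cover): cover S+ with the sets T_z = {x in S+ : z is 1 on x}.
    positives = [x for x, label in examples if label == 1]
    T = {z: {j for j, x in enumerate(positives) if bool(x[z[0]]) == z[1]}
         for z in lits}
    uncovered, hypothesis = set(range(len(positives))), []
    while uncovered:
        z = max(T, key=lambda lit: len(T[lit] & uncovered))
        if not T[z] & uncovered:
            raise ValueError("sample is not consistent with any OR over LIT")
        hypothesis.append(z)
        uncovered -= T[z]
    return hypothesis

# hypothetical target x1 OR ¬x3 over n = 3 variables
data = [((1, 0, 1), 1), ((0, 1, 0), 1), ((0, 1, 1), 0), ((0, 0, 1), 0)]
print(occam_or(3, data))   # -> the two literals x1 and ¬x3 (in some order)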
Criticism of PAC model • The worst-case emphasis makes it unusable – Useful for analysis of computational complexity – Methods to estimate the cardinality of the concept space (VC dimension); unfortunately not sufficiently practical • The notions of target concepts and noise-free training are too restrictive – True; the switch to concept approximation is only a weak remedy – Some extensions handle label noise, and fewer handle attribute noise
Summary • PAC model – Confidence and accuracy – Sample size • Finite (and infinite) concept class • Occam Razor
References • L. G. Valiant. A theory of the learnable. Comm. ACM 27(11): 1134-1142, 1984. (Original work) • D. Haussler. Probably approximately correct learning. (Review) • M. Kearns. Efficient noise-tolerant learning from statistical queries. (Review of noise methods) • F. Denis et al. PAC learning with simple examples. (Simple examples)
Learning algorithms • OR function • Parity function • OR of a few literals • Open problems: – OR in the non-feasible case – Parity of a few literals