Probably Approximately Correct (PAC) Learning. Leslie G. Valiant
Probably Approximately Correct (PAC) Learning. Leslie G. Valiant, "A Theory of the Learnable," Comm. ACM 27(11): 1134-1142, 1984
Recall: Bayesian learning • Create a model based on some parameters • Assume some prior distribution on those parameters • Learning problem – Adjust the model parameters so as to maximize the likelihood of the model given the data – Use Bayes' formula for this.
PAC Learning • Given distribution D over observables X • Given a family of functions (concepts) F • For each x ∈ X and f ∈ F: f(x) provides the label for x • Given a family of hypotheses H, seek a hypothesis h such that Error(h) = Pr_{x∼D}[f(x) ≠ h(x)] is minimal
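As a quick illustration (not part of the original slides), a minimal Python sketch of estimating Error(h) by sampling from D; the uniform distribution, threshold target f, and hypothesis h below are hypothetical stand-ins.

import random

def empirical_error(h, f, sample_D, m=10000):
    """Monte Carlo estimate of Error(h) = Pr_{x~D}[f(x) != h(x)].
    sample_D() draws one x from the (unknown) distribution D."""
    xs = [sample_D() for _ in range(m)]
    return sum(1 for x in xs if f(x) != h(x)) / m

# hypothetical example: D uniform on [0, 1], target f = 1{x >= 0.3}, hypothesis h = 1{x >= 0.35}
f = lambda x: x >= 0.3
h = lambda x: x >= 0.35
print(empirical_error(h, f, lambda: random.random()))   # close to D([0.3, 0.35)) = 0.05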
PAC New Concepts • Large family of distributions D • Large family of concepts F • Family of hypotheses H • Main questions: – Is there a hypothesis h ∈ H that can be learned? – How fast can it be learned? – What error can be expected?
Estimation vs. approximation • Note: – The distribution D is fixed – There is no noise in the system (currently) – F is a space of binary functions (concepts) • This is thus an approximation problem, as the function is given exactly for each x ∈ X • Estimation problem: f is not given exactly but estimated from noisy data
Example (PAC) • Concept: Average body-size person • Inputs: for each person: – height – weight • Sample: labeled examples of persons – label + : average body-size – label - : not average body-size • Two-dimensional inputs
Observable space X with concept f
Example (PAC) • Assumption: target concept is a rectangle. • Goal: – Find a rectangle h that “approximates” the target – Hypothesis family H of rectangles • Formally: – With high probability, output a rectangle whose error is low.
Example (Modeling) • Assume: – Fixed distribution over persons. • Goal: – Low error with respect to THIS distribution!!! • What does the distribution look like? – Highly complex. – Each parameter is not uniform. – Highly correlated.
Model-based approach • First try to model the distribution. • Given a model of the distribution: – find an optimal decision rule. • Bayesian Learning
PAC approach • Assume that the distribution is fixed. • Samples are drawn i.i.d. – independent – identically distributed • Concentrate on the decision rule rather than the distribution.
PAC Learning • Task: learn a rectangle from examples. • Input: points x with labels f(x) ∈ {+, −}, assigned by a target rectangle R • Goal: – using the fewest examples – compute h – such that h is a good approximation of f
PAC Learning: Accuracy • Testing the accuracy of a hypothesis: – using the distribution D of examples. • Error = h Δ f (symmetric difference) • Pr[Error] = D(Error) = D(h Δ f) • We would like Pr[Error] to be controllable. • Given a parameter ε: – Find h such that Pr[Error] < ε.
PAC Learning: Hypothesis • Which Rectangle should we choose? – Similar to parametric modeling?
Setting up the Analysis: • Choose smallest rectangle. • Need to show: – For any distribution D and target rectangle f – input parameters: ε and δ – select m(ε, δ) examples – let h be the smallest consistent rectangle – then with probability 1 − δ (over the sample): D(f Δ h) < ε
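A small illustrative sketch (our own, not from the slides) of the learner the analysis assumes: take the tightest axis-aligned rectangle around the positive examples.

def tightest_rectangle(samples):
    """Smallest axis-aligned rectangle consistent with labeled 2-D samples.
    samples: list of ((height, weight), label) with label True for '+'.
    Returns (x_min, x_max, y_min, y_max), or None if there are no positives."""
    pos = [p for p, label in samples if label]
    if not pos:
        return None
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def rect_hypothesis(rect, point):
    """Hypothesis induced by the learned rectangle."""
    if rect is None:
        return False
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

# hypothetical sample of (height, weight) points
S = [((170, 65), True), ((180, 80), True), ((150, 45), False), ((200, 120), False)]
rect = tightest_rectangle(S)
print(rect, rect_hypothesis(rect, (175, 70)))   # (170, 180, 65, 80) True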
More general case (no rectangle) • A distribution: D (unknown) • Target function: ct from C – ct: X → {0, 1} • Hypothesis: h from H – h: X → {0, 1} • Error probability: – error(h) = Pr_D[h(x) ≠ ct(x)] • Oracle: EX(ct, D)
PAC Learning: Definition • C and H are concept classes over X. • C is PAC learnable by H if • there exists an algorithm A such that: – for any distribution D over X and ct in C – for every input ε and δ – A outputs a hypothesis h in H – while having access to EX(ct, D) – such that with probability 1 − δ we have error(h) < ε • Complexities.
Finite Concept class • Assume C = H and finite. • h is ε-bad if error(h) > ε. • Algorithm: – Sample a set S of m(ε, δ) examples. – Find h in H which is consistent. • Algorithm fails if h is ε-bad.
PAC learning: formalization (1) • X is the set of all possible examples • D is the distribution from which the examples are drawn • H is the set of all possible hypotheses, c ∈ H • m is the number of training examples • error(h) = Pr(h(x) ≠ c(x) | x is drawn from X according to D) • h is approximately correct if error(h) ≤ ε
PAC learning: formalization (2) • (Figure: hypothesis space H, the subset Hbad of ε-bad hypotheses, and the target c.) • To show: after m examples, with high probability, all consistent hypotheses are approximately correct, i.e., all consistent hypotheses lie in an ε-ball around c.
Complexity analysis (1) • The probability that a hypothesis hbad ∈ Hbad is consistent with the first m examples: error(hbad) > ε by definition. The probability that it agrees with one example is thus at most (1 − ε), and with m independent examples at most (1 − ε)^m
Complexity analysis (2) • For Hbad to contain a consistent hypothesis, at least one hypothesis in it must be consistent. Pr(Hbad has a consistent hypothesis) ≤ |Hbad|(1 − ε)^m ≤ |H|(1 − ε)^m
Complexity analysis (3) • To reduce the probability of error below δ: |H|(1 − ε)^m ≤ δ • This is possible when at least m ≥ (1/ε)(ln(1/δ) + ln |H|) examples are seen. • This is the sample complexity
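A quick sketch of this sample-complexity bound as code (our own illustration); the values ε = 0.1, δ = 0.05 and |H| = 2^(2^10) are chosen only to echo the next slide.

import math

def sample_size_consistent(eps, delta, H_size):
    """m >= (1/eps) * (ln(1/delta) + ln|H|): enough examples so that, with
    probability 1 - delta, every consistent hypothesis has error < eps."""
    return math.ceil((1.0 / eps) * (math.log(1.0 / delta) + math.log(H_size)))

# all boolean functions on n = 10 attributes: |H| = 2**(2**10)
print(sample_size_consistent(0.1, 0.05, 2 ** (2 ** 10)))   # 7128, grows linearly in 2**n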
Complexity analysis (4) • “at least m examples are necessary to build a consistent hypothesis h that is wrong at most ε of the time with probability 1 − δ” • Since |H| = 2^(2^n), the complexity grows exponentially with the number of attributes n • Conclusion: learning any boolean function is no better in the worst case than table lookup!
PAC learning -- observations • “Hypothesis h(X) is consistent with m examples and has an error of at most ε with probability 1 − δ” • This is a worst-case analysis. Note that the result is independent of the distribution D! • Growth rate analysis: – for ε → 0, m grows proportionally to 1/ε – for δ → 0, m grows logarithmically in 1/δ – for |H| → ∞, m grows logarithmically in |H|
PAC: comments • We only assumed that examples are i.i.d. • We have two independent parameters: – Accuracy ε – Confidence δ • No assumption about the likelihood of a hypothesis. • Hypothesis is tested on the same distribution as the sample.
PAC: non-feasible case • What happens if ct is not in H? • Need to redefine the goal. • Let h* in H minimize the error: b = error(h*) • Goal: find h in H such that error(h) ≤ error(h*) + ε = b + ε
Analysis* • For each h in H: – let obs-error(h) be the average error on the sample S. • Compute the probability that Pr{|obs-error(h) − error(h)| ≥ ε/2} is small. Chernoff bound: Pr < exp(−(ε/2)^2 m) • Union bound over all of H: Pr < |H| exp(−(ε/2)^2 m) • Sample size m > (4/ε^2) ln(|H|/δ)
Correctness • Assume that for all h in H: – |obs-error(h) − error(h)| < ε/2 • In particular: – obs-error(h*) < error(h*) + ε/2 – error(h) − ε/2 < obs-error(h) • For the output h: – obs-error(h) ≤ obs-error(h*) • Conclusion: error(h) < error(h*) + ε
Sample size issue • Due to the use of the Chernoff bound: Pr{|obs-error(h) − error(h)| ≥ ε/2} < exp(−(ε/2)^2 m), and over the entire H: Pr < |H| exp(−(ε/2)^2 m) • It follows that the sample size is m > (4/ε^2) ln(|H|/δ), not (1/ε) ln(|H|/δ) as before
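For comparison with the earlier calculator, a sketch of the non-feasible (agnostic) bound; the parameters and the choice |H| = 3^20 (OR functions over 20 variables) are illustrative only.

import math

def sample_size_agnostic(eps, delta, H_size):
    """Non-feasible case: m > (4/eps^2) * ln(|H|/delta) examples make every
    hypothesis' observed error lie within eps/2 of its true error, w.p. 1 - delta."""
    return math.ceil((4.0 / eps ** 2) * math.log(H_size / delta))

print(sample_size_agnostic(0.1, 0.05, 3 ** 20))   # 9988: note the 1/eps^2 dependence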
Example: Learning OR of literals • Inputs: x1, …, xn • Literals: xi and ¬xi • OR functions: for each variable, the target disjunction may contain xi, contain ¬xi, or omit it; thus the number of disjunctions is 3^n
ELIM: Algorithm for learning OR • Keep a list of all literals • For every example whose classification is 0: – Erase all the literals that are 1. • Example: c(00110) = 0 results in deleting ¬x1, ¬x2, x3, x4, ¬x5 • Correctness: – Our hypothesis h: an OR of our set of literals. – Our set of literals includes the target OR literals. – Every time h predicts zero: we are correct. • Sample size: m > (1/ε) ln(3^n/δ)
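A sketch of ELIM as code (our own illustration), assuming examples are 0/1 tuples and literals are encoded as (index, sign) pairs of our own choosing.

def elim(n, examples):
    """ELIM: learn an OR of literals over x_1..x_n from labeled bit-vectors.
    examples: list of (x, label) with x a tuple of 0/1 and label 0/1.
    Literal (i, True) means x_{i+1}, (i, False) means its negation."""
    literals = {(i, pos) for i in range(n) for pos in (True, False)}
    for x, label in examples:
        if label == 0:
            # erase every literal that evaluates to 1 on a negative example
            literals -= {(i, pos) for (i, pos) in literals if bool(x[i]) == pos}
    return literals

def predict(literals, x):
    return int(any(bool(x[i]) == pos for (i, pos) in literals))

# c(00110) = 0 erases the literals satisfied by 00110
print(sorted(elim(5, [((0, 0, 1, 1, 0), 0)])))   # keeps x1, x2, ¬x3, ¬x4, x5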
Learning parity • Functions: e.g. x1 ⊕ x7 ⊕ x9 • Number of functions: 2^n • Algorithm: – Sample a set of examples – Solve the resulting linear equations over GF(2) • Sample size: m > (1/ε) ln(2^n/δ)
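A sketch of the parity learner (our own illustration): Gauss-Jordan elimination over GF(2) on the sampled examples; the toy data at the end is hypothetical.

import numpy as np

def learn_parity(X, y):
    """Learn a parity function by solving a linear system over GF(2).
    X: m x n 0/1 matrix of examples, y: length-m 0/1 labels.
    Returns a 0/1 vector a with X @ a = y (mod 2), assuming the system is consistent."""
    A = np.concatenate([X % 2, (y % 2).reshape(-1, 1)], axis=1)
    m, n = X.shape
    row = 0
    for col in range(n):
        pivot = next((r for r in range(row, m) if A[r, col]), None)
        if pivot is None:
            continue
        A[[row, pivot]] = A[[pivot, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]          # eliminate with XOR (addition mod 2)
        row += 1
    a = np.zeros(n, dtype=int)
    r = 0
    for col in range(n):                # read off pivot variables; free variables stay 0
        if r < m and A[r, col] and not A[r, :col].any():
            a[col] = A[r, -1]
            r += 1
    return a

# hypothetical target x1 XOR x3 on 3 variables
X = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0]])
y = X[:, 0] ^ X[:, 2]
print(learn_parity(X, y))   # -> [1 0 1]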
Infinite Concept class • X = [0, 1] and H = {cq | q in [0, 1]} • cq(x) = 0 iff x < q • Assume C = H. (Figure: sample points on [0, 1]; min is the largest negative point, max the smallest positive point.) • Which cq should we choose in [min, max]?
Proof I • Define max = min{x | c(x) = 1} and min = max{x | c(x) = 0} over the sample • Show that Pr[D([min, max]) > ε] < δ • Proof: by contradiction. – If D([min, max]) > ε, the probability that a single example falls in [min, max] is at least ε – The probability that we never sample from [min, max] is (1 − ε)^m – Needs m > (1/ε) ln(1/δ) • There is something wrong
Proof II (correct): • Let max’ be such that D([q, max’]) = ε/2 • Let min’ be such that D([min’, q]) = ε/2 • Goal: show that with high probability – some positive example x+ lies in [q, max’] and – some negative example x− lies in [min’, q] • In such a case any threshold in [x−, x+] is good. • Compute the sample size!
Proof II (correct): • Pr[none of x1, x2, …, xm is in [min’, q]] = (1 − ε/2)^m < exp(−mε/2) • Similarly for the other side • We require 2 exp(−mε/2) < δ • Thus m > (2/ε) ln(2/δ)
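A sketch of the resulting learner and sample-size rule for the threshold class (our own illustration); returning the midpoint of [x−, x+] is one arbitrary but valid choice.

import math

def learn_threshold(samples):
    """Learn c_q on [0, 1], where c_q(x) = 0 iff x < q.
    samples: list of (x, label). Any threshold between the largest negative
    and the smallest positive point is consistent; we return the midpoint."""
    neg = [x for x, label in samples if label == 0]
    pos = [x for x, label in samples if label == 1]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2.0

def threshold_sample_size(eps, delta):
    """m > (2/eps) * ln(2/delta), from the two-sided eps/2 argument above."""
    return math.ceil((2.0 / eps) * math.log(2.0 / delta))

print(learn_threshold([(0.1, 0), (0.25, 0), (0.4, 1), (0.7, 1)]))   # -> 0.325
print(threshold_sample_size(0.1, 0.05))                              # -> 74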
Comments • The hypothesis space was very simple H={cq | q in [0, 1]} • There was no noise in the data or labels • So learning was trivial in some sense (analytic solution)
Non-Feasible case: Label Noise • Suppose the sampled labels are noisy. • Algorithm: – Find the function h with the lowest observed error!
Analysis • Define: zi as an ε/4-net (w.r.t. D) • For the optimal h* and our h there are – zj: |error(h[zj]) − error(h*)| < ε/4 – zk: |error(h[zk]) − error(h)| < ε/4 • Show that with high probability: – |obs-error(h[zi]) − error(h[zi])| < ε/4
Exercise (Submission Mar 29, 04) 1. Assume there is Gaussian (0, σ) noise on xi. Apply the same analysis to compute the required sample size for PAC learning. Note: Class labels are determined by the non-noisy observations.
General ε-net approach • Given a class H, define a class G – for every h in H – there exists a g in G such that – D(g Δ h) < ε/4 • Algorithm: find the best h in H. • Compute the confidence and sample size.
Occam Razor W. Occam (1320) “Entities should not be multiplied unnecessarily” A. Einstein “Simplify a problem as much as possible, but no simpler” Information-theoretic grounds?
Occam Razor • Finding the shortest consistent hypothesis. • Definition: (a, b)-Occam algorithm, with a > 0 and b < 1 – Input: a sample S of size m – Output: hypothesis h – For every (x, b) in S: h(x) = b (consistency) – size(h) < size(ct)^a · m^b – Efficiency.
Occam algorithm and compression • (Figure: party A holds the labeled sample S = {(xi, bi)} and transmits information to party B, who already knows x1, …, xm.)
Compression • Option 1: – A sends B the values b1, …, bm – m bits of information • Option 2: – A sends B the hypothesis h – Occam: for large enough m, size(h) < m • Option 3 (MDL): – A sends B a hypothesis h and “corrections” – complexity: size(h) + size(errors)
Occam Razor Theorem • A: (a, b)-Occam algorithm for C using H • D: distribution over inputs X • ct in C the target function, n = size(ct) • Sample size: m = O((1/ε) ln(1/δ) + (n^a/ε)^(1/(1−b))) • With probability 1 − δ, A(S) = h has error(h) < ε
Occam Razor Theorem • Use the bound for a finite hypothesis class. • Effective hypothesis class size: 2^size(h) < 2^(n^a m^b) • Sample size: solve m ≥ (1/ε)(ln(1/δ) + n^a m^b ln 2) for m • The VC dimension will later replace 2^size(h)
Exercise (Submission Mar 29, 04) 2. For an (a, b)-Occam algorithm, given noisy data with noise ~ N(0, σ^2), find the limitations on m. Hint: ε-net and Chernoff bound.
Learning OR with few attributes • Target function: OR of k literals • Goal: learn in time – polynomial in k and log n – with ε and δ constant • ELIM makes “slow” progress – disqualifies one literal per round – may remain with O(n) literals
Set Cover - Definition • Input: S1, …, St with Si ⊆ U • Output: Si1, …, Sik with ∪j Sij = U • Question: are there k sets that cover U? (NP-complete)
Set Cover Greedy algorithm • j = 0; U0 = U; C = ∅ • While Uj ≠ ∅: – Let Si be arg max |Si ∩ Uj| – Add Si to C – Let Uj+1 = Uj − Si – j = j + 1
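The greedy algorithm as a short sketch (our own illustration); the set and function names are invented for the example.

def greedy_set_cover(U, sets):
    """Greedy set cover: repeatedly pick the set covering the most uncovered
    elements. If a cover of size k exists, it uses at most about k*ln|U| sets."""
    uncovered = set(U)
    chosen = []
    while uncovered:
        i = max(range(len(sets)), key=lambda j: len(sets[j] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("the given sets do not cover U")
        chosen.append(i)
        uncovered -= sets[i]
    return chosen

print(greedy_set_cover({1, 2, 3, 4, 5},
                       [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]))   # -> [0, 3]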
Set Cover: Greedy Analysis • At termination, C is a cover. • Assume there is a cover C’ of size k. • C’ is a cover of every Uj, so some S in C’ covers at least |Uj|/k elements of Uj. • Analysis of Uj: |Uj+1| ≤ |Uj| − |Uj|/k = (1 − 1/k)|Uj| • Solving the recursion: the number of sets used is j < k ln |U| (exercise: solve the recursion)
Learning decision lists • Lists of arbitrary size can represent any boolean function. Lists whose tests are conjunctions of at most k ≤ n literals define the k-DL boolean language; for n attributes, the language is k-DL(n). • Each test in Conj(n, k) can appear with outcome Y, outcome N, or be absent, so there are at most 3^|Conj(n, k)| distinct sets of components • |k-DL(n)| ≤ 3^|Conj(n, k)| · |Conj(n, k)|! (the components can appear in any order) • |Conj(n, k)| = Σ_{i=0..k} C(2n, i) = O(n^k)
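A small sketch (our own) that evaluates these counting bounds numerically; math.comb requires Python 3.8+.

from math import comb, factorial

def conj_size(n, k):
    """|Conj(n, k)| = sum_{i=0}^{k} C(2n, i): conjunctions of at most k of 2n literals."""
    return sum(comb(2 * n, i) for i in range(k + 1))

def kdl_bound(n, k):
    """Upper bound |k-DL(n)| <= 3^|Conj(n,k)| * |Conj(n,k)|! (Y / N / absent, any order)."""
    c = conj_size(n, k)
    return 3 ** c * factorial(c)

print(conj_size(5, 2))   # 1 + 10 + 45 = 56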
Building an Occam algorithm • Given a sample S of size m – Run ELIM on S – Let LIT be the set of surviving literals – There exist k literals in LIT that classify all of S correctly • Negative examples: – any subset of LIT classifies them correctly
Building an Occam algorithm • Positive examples: – search for a small subset of LIT that classifies S+ correctly – for a literal z build Tz = {x ∈ S+ | z is 1 on x} – there are k sets Tz that cover S+ – greedily find k ln m sets that cover S+ • Output h = the OR of the chosen k ln m literals • size(h) < k ln m · log(2n) • Sample size: m = O(k log n · log(k log n))
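A sketch combining the two steps (ELIM, then greedy cover of S+); it reuses the hypothetical (index, sign) encoding of literals from the earlier ELIM sketch, and the toy target at the end is our own.

def occam_or(n, examples):
    """Occam-style learner for an OR of few literals over x_1..x_n.
    examples: list of (x, label) with x a 0/1 tuple. Literal (i, True) means
    x_{i+1}, (i, False) its negation. Returns an OR of about k*ln(m) literals."""
    # Step 1 (ELIM): keep only literals consistent with every negative example.
    lits = {(i, pos) for i in range(n) for pos in (True, False)}
    for x, label in examples:
        if label == 0:
            lits -= {(i, pos) for (i, pos) in lits if bool(x[i]) == pos}
    # Step 2 (greedy set cover): cover S+ with the sets T_z = {x in S+ : z is 1 on x}.
    positives = [x for x, label in examples if label == 1]
    T = {z: {j for j, x in enumerate(positives) if bool(x[z[0]]) == z[1]}
         for z in lits}
    uncovered, hypothesis = set(range(len(positives))), []
    while uncovered:
        z = max(T, key=lambda lit: len(T[lit] & uncovered))
        if not T[z] & uncovered:
            raise ValueError("sample is not consistent with any OR over LIT")
        hypothesis.append(z)
        uncovered -= T[z]
    return hypothesis

# hypothetical target x1 OR ¬x3 over n = 3 variables
data = [((1, 0, 1), 1), ((0, 1, 0), 1), ((0, 1, 1), 0), ((0, 0, 1), 0)]
print(occam_or(3, data))   # -> the two literals x1 and ¬x3 (in some order)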
Criticism of PAC model • The worst-case emphasis makes it unusable – Useful for analysis of computational complexity – Methods to estimate the cardinality of the concept space (VC dimension); unfortunately not sufficiently practical • The notions of target concepts and noise-free training are too restrictive – True; the switch to concept approximation is only a weak remedy – Some extensions handle label noise, and fewer handle attribute noise
Summary • PAC model – Confidence and accuracy – Sample size • Finite (and infinite) concept class • Occam Razor
References • L. G. Valiant. A theory of the learnable. Comm. ACM 27(11): 1134-1142, 1984. (Original work) • D. Haussler. Probably approximately correct learning. (Review) • M. Kearns. Efficient noise-tolerant learning from statistical queries. (Review of noise methods) • F. Denis et al. PAC learning with simple examples. (Simple examples)
Learning algorithms • OR function • Parity function • OR of a few literals • Open problems: – OR in the non-feasible case – Parity of a few literals