Experts and Boosting Algorithms
Experts: Motivation
• Given a set of experts
  – No prior information
  – No consistent behavior
  – Goal: Predict as well as the best expert
• Model
  – Online model
  – Input: historical results
Experts: Model
• N strategies (experts)
• At time t:
  – Learner A chooses a distribution over the N experts.
  – Let pt(i) be the probability of the i-th expert.
  – Clearly Σi pt(i) = 1
  – The learner then receives a loss vector lt
  – Loss at time t: Σi pt(i) lt(i)
• Assume bounded loss, lt(i) in [0, 1]
Experts: Goal
• Match the loss of the best expert.
• Loss:
  – LA = cumulative loss of the learner A
  – Li = cumulative loss of expert i
• Can we hope to do better?
Example: Guessing Letters
• Setting:
  – Alphabet Σ of k letters
• Loss:
  – 1 for an incorrect guess
  – 0 for a correct guess
• Experts:
  – Each expert always guesses the same fixed letter.
• Game: guess the most popular letter online.
Example 2: Rock-Paper-Scissors
• Two-player game.
• Each player chooses: Rock, Paper, or Scissors.
• Loss matrix (row = our move, column = opponent's move):

              Rock   Paper   Scissors
    Rock      1/2    1       0
    Paper     0      1/2     1
    Scissors  1      0       1/2

• Goal: Play as well as we can given the opponent.
Example 3: Placing a Point
• Action: choosing a point d.
• Loss (given the true location y): ||d − y||.
• Experts: one for each point.
• Important: the loss is convex.
• Goal: Find a "center"
Experts Algorithm: Greedy
• For each expert define its cumulative loss: Li^t = Σ_{s≤t} ls(i)
• Greedy: At time t choose the expert with minimum cumulative loss so far, namely arg mini Li^{t−1}
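A minimal sketch of this rule in Python (the function name and the loss-matrix input are mine, not from the slides): it tracks cumulative losses and deterministically plays a currently best expert. In the worst case such a deterministic rule can be forced to suffer roughly N times the loss of the best expert, which is what the analysis on the next slide addresses and what motivates the randomized Hedge algorithm later on.

```python
import numpy as np

def greedy_expert(loss_vectors):
    """Greedy (follow-the-leader) over a T x N array of losses in [0, 1].

    At each round, put all probability on an expert with minimum
    cumulative loss so far; ties are broken by lowest index.
    Returns the learner's total loss.
    """
    T, N = loss_vectors.shape
    cumulative = np.zeros(N)                  # L_i^{t-1} for each expert i
    total_loss = 0.0
    for t in range(T):
        choice = int(np.argmin(cumulative))   # arg min_i L_i^{t-1}
        total_loss += loss_vectors[t, choice]
        cumulative += loss_vectors[t]
    return total_loss
```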
Greedy Analysis
• Theorem: Let LG^T be the loss of Greedy at time T; then LG^T is bounded in terms of the loss of the best expert, but the bound is roughly N times that loss.
• Proof!
Better Expert Algorithms
• Would like to bound the regret LA − mini Li.
Expert Algorithm: Hedge(β)
• Maintains a weight vector wt
• Probabilities: pt(k) = wt(k) / Σj wt(j)
• Initialization: w1(i) = 1/N
• Updates:
  – wt+1(k) = wt(k) · Uβ(lt(k))
  – where β in [0, 1] and
  – β^r ≤ Uβ(r) ≤ 1 − (1 − β)r
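A minimal sketch of Hedge(β) in Python, using the common concrete choice Uβ(r) = β^r (which satisfies the bracketing condition above); the T×N array of per-round loss vectors is an assumed input, and the function name is mine.

```python
import numpy as np

def hedge(loss_vectors, beta=0.5):
    """Run Hedge(beta) over a sequence of loss vectors, each in [0, 1]^N.

    Returns the learner's total expected loss and the final weights.
    """
    T, N = loss_vectors.shape
    w = np.ones(N) / N              # w1(i) = 1/N
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()             # pt(k) = wt(k) / sum_j wt(j)
        l = loss_vectors[t]
        total_loss += p @ l         # expected loss sum_i pt(i) * lt(i)
        w = w * (beta ** l)         # wt+1(k) = wt(k) * beta^{lt(k)}
    return total_loss, w
```

Usage: compare `hedge(losses)[0]` against `loss_vectors.sum(axis=0).min()`, the cumulative loss of the best expert in hindsight.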
Hedge Analysis
• Lemma: For any sequence of losses, Σi wT+1(i) ≤ exp(−(1 − β) LHedge)
• Proof!
• Corollary: LHedge ≤ −ln(Σi wT+1(i)) / (1 − β)
Hedge: Properties
• Bounding the weights: wT+1(i) = w1(i) Πt Uβ(lt(i)) ≥ w1(i) β^{Li}
• Similarly for a subset of experts.
Hedge: Performance
• Let k be the expert with minimal loss: Lk = mini Li
• Therefore wT+1(k) ≥ (1/N) β^{Lk}, and combining with the corollary:
  LHedge ≤ (Lk ln(1/β) + ln N) / (1 − β)
Hedge: Optimizing β
• For β = 1/2 we have LHedge ≤ 1.39 Lk + 2 ln N
• Better selection of β: tune β as a function of an upper bound on Lk (see the derivation sketch below).
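A sketch of the standard weight-function argument behind these bounds (in the style of Freund and Schapire), stated under the initialization w1(i) = 1/N and the update bracketing from the Hedge slide; the form of the tuned-β bound is quoted from the standard analysis, not from these slides.

```latex
\frac{1}{N}\,\beta^{L_k}
  \;\le\; w_{T+1}(k)
  \;\le\; \sum_{i} w_{T+1}(i)
  \;\le\; \prod_{t=1}^{T}\bigl(1-(1-\beta)\,p_t\cdot l_t\bigr)
  \;\le\; e^{-(1-\beta)\,L_{Hedge}} .
```

Taking logarithms and rearranging gives LHedge ≤ (Lk ln(1/β) + ln N)/(1 − β); setting β = 1/2 yields the 1.39·Lk + 2 ln N bound above, and tuning β using an upper bound L̃ ≥ Lk gives a bound of the form Lk + O(√(L̃ ln N) + ln N).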
Occam Razor
Occam Razor
• Finding the shortest consistent hypothesis.
• Definition: (α, β)-Occam algorithm
  – α > 0 and β < 1
  – Input: a sample S of size m
  – Output: hypothesis h
  – For every (x, b) in S: h(x) = b (h is consistent with S)
  – size(h) ≤ size^α(ct) · m^β
• Efficiency: runs in polynomial time.
Occam algorithm and compression
• Setting (diagram): A and B both know the points x1, …, xm; A also knows the labels and wants to communicate the labeled sample S = {(xi, bi)} to B.
Compression
• Option 1:
  – A sends B the values b1, …, bm
  – m bits of information
• Option 2:
  – A sends B the hypothesis h
  – Occam: for large enough m, size(h) < m
• Option 3 (MDL):
  – A sends B a hypothesis h and "corrections"
  – Complexity: size(h) + size(errors)
Occam Razor Theorem
• A: an (α, β)-Occam algorithm for C using H
• D: a distribution over inputs X
• ct in C: the target function
• Sample size: m = O( (1/ε) ln(1/δ) + (size^α(ct)/ε)^{1/(1−β)} )
• Then with probability 1 − δ, A(S) = h has error(h) < ε
Occam Razor Theorem
• Use the bound for a finite hypothesis class.
• Effective hypothesis class size: |Heff| ≤ 2^{size(h)}, with size(h) ≤ n^α · m^β
• Sample size: choose m so that the finite-class bound holds for this effective class (see the sketch below).
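A sketch of how the sample size falls out, assuming the standard bound for a consistent learner over a finite class, m ≥ (1/ε)(ln|H| + ln(1/δ)):

```latex
m \;\ge\; \frac{1}{\epsilon}\Bigl(\mathrm{size}(h)\ln 2 + \ln\tfrac{1}{\delta}\Bigr)
  \quad\text{with}\quad
  \mathrm{size}(h) \;\le\; n^{\alpha} m^{\beta}.
```

Both requirements are met once

```latex
m \;\ge\; \frac{2}{\epsilon}\ln\frac{1}{\delta}
  \qquad\text{and}\qquad
  m \;\ge\; \Bigl(\frac{2\,n^{\alpha}\ln 2}{\epsilon}\Bigr)^{\!\frac{1}{1-\beta}},
```

since each of the two terms is then at most m/2; this gives the m = O((1/ε) ln(1/δ) + (n^α/ε)^{1/(1−β)}) sample size quoted above.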
Weak and Strong Learning
PAC Learning Model
• There exists a distribution D over domain X
• Examples: <x, c(x)>
  – Use c for the target function (rather than ct)
• Goal:
  – With high probability (1 − δ)
  – find h in H such that
  – error(h, c) < ε
  – ε arbitrarily small.
Weak Learning Model
• Goal: error(h, c) < 1/2 − γ
• The parameter γ is small
  – a constant
  – or 1/poly
• Intuitively: a much easier task
• Question:
  – Assume C is weakly learnable.
  – Is C PAC (strongly) learnable?
Majority Algorithm
• Hypothesis: hM(x) = MAJ[ h1(x), …, hT(x) ]
• size(hM) ≤ T · size(ht)
• Using Occam Razor: if hM is consistent with the sample and T is small, hM generalizes.
Majority: Outline
• Sample m examples
• Start with a distribution of 1/m per example.
• Repeatedly modify the distribution and get ht
• The hypothesis is the majority vote of h1, …, hT
• Terminate when the sample is classified perfectly
Majority: Algorithm
• Use the Hedge algorithm (a code sketch follows below).
• The "experts" are associated with the sample points.
• A point suffers loss when it is classified correctly:
  – lt(i) = 1 − | ht(xi) − c(xi) |
• Set β = 1 − γ
• hM(x) = MAJORITY( h1(x), …, hT(x) )
• Q: How do we set T?
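A minimal sketch of this boosting-by-majority loop in Python; the `weak_learner` callback, the 0/1 label encoding, and the function names are my assumptions, not part of the slides.

```python
import numpy as np

def boost_by_majority(X, y, weak_learner, T, gamma):
    """Boosting via Hedge over the m sample points (a sketch).

    weak_learner(X, y, p) -> h is an assumed callback returning a
    hypothesis h(x) in {0, 1} with error at most 1/2 - gamma under p.
    """
    y = np.asarray(y)
    m = len(X)
    beta = 1.0 - gamma
    w = np.ones(m) / m                       # one "expert" per example
    hypotheses = []
    for _ in range(T):
        p = w / w.sum()
        h = weak_learner(X, y, p)
        hypotheses.append(h)
        preds = np.array([h(x) for x in X])
        loss = 1.0 - np.abs(preds - y)       # loss 1 when the point is classified correctly
        w = w * (beta ** loss)               # Hedge update: correctly classified points lose weight
    def h_M(x):                              # majority vote of the weak hypotheses
        votes = sum(h(x) for h in hypotheses)
        return int(votes > T / 2)
    return h_M
```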
Majority: Analysis
• Consider the set of errors S
  – S = { i | hM(xi) ≠ c(xi) }
• For every i in S:
  – Li / T ≤ 1/2 (Proof!)
• From the Hedge properties, this bounds the total weight, and hence the number, of such points.
MAJORITY: Correctness
• Error probability on the sample: decreases exponentially in γ²T
• Number of rounds: T = O((ln m) / γ²)
• Terminate when the error is less than 1/m (then every sample point is classified correctly)
AdaBoost: Dynamic Boosting
• Better bounds on the error
• No need to "know" γ
• Each round uses a different β
  – as a function of the error
AdaBoost: Input
• Sample of size m: <xi, c(xi)>
• A distribution D over examples
  – We will use D(xi) = 1/m
• A weak learning algorithm
• A constant T (number of iterations)
AdaBoost: Algorithm
• Initialization: w1(i) = D(xi)
• For t = 1 to T do:
  – pt(i) = wt(i) / Σj wt(j)
  – Call the Weak Learner with pt
  – Receive ht
  – Compute the error εt of ht on pt
  – Set βt = εt / (1 − εt)
  – wt+1(i) = wt(i) · (βt)^e, where e = 1 − |ht(xi) − c(xi)|
• Output: hA(x) = 1 iff Σt ln(1/βt) ht(x) ≥ (1/2) Σt ln(1/βt)
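A minimal sketch of this loop in Python, in the βt = εt/(1 − εt) formulation of the slide; the `weak_learner` callback, the 0/1 label encoding, the error clipping, and the function names are my assumptions.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost with beta_t = eps_t / (1 - eps_t) (a sketch).

    weak_learner(X, y, p) -> h is an assumed callback; h(x) returns 0 or 1.
    y is a 0/1 label vector; p is a distribution over the m examples.
    """
    y = np.asarray(y)
    m = len(X)
    w = np.ones(m) / m                        # w1(i) = D(xi) = 1/m
    hypotheses, betas = [], []
    for _ in range(T):
        p = w / w.sum()                       # pt(i) = wt(i) / sum_j wt(j)
        h = weak_learner(X, y, p)
        preds = np.array([h(x) for x in X])
        eps = float(np.sum(p * np.abs(preds - y)))   # weighted error of ht under pt
        eps = min(max(eps, 1e-10), 1 - 1e-10)        # avoid division by zero in the sketch
        beta = eps / (1.0 - eps)
        hypotheses.append(h)
        betas.append(beta)
        # wt+1(i) = wt(i) * beta^(1 - |ht(xi) - c(xi)|): correctly classified points shrink
        w = w * beta ** (1.0 - np.abs(preds - y))
    alphas = [np.log(1.0 / b) for b in betas]
    def h_A(x):                               # weighted majority vote
        score = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return int(score >= 0.5 * sum(alphas))
    return h_A
```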
AdaBoost: Analysis
• Theorem:
  – Given ε1, …, εT,
  – the error ε of hA is bounded by ε ≤ 2^T Πt √(εt (1 − εt))
AdaBoost: Proof
• Let lt(i) = 1 − |ht(xi) − c(xi)|
• By definition: pt · lt = 1 − εt
• Upper bounding the sum of weights
  – From the Hedge analysis.
• An error occurs on xi only if Πt βt^{lt(i)} ≥ (Πt βt)^{1/2}
AdaBoost Analysis (cont.)
• Bounding the weight of a point
• Bounding the sum of weights
• Final bound as a function of βt
• Optimizing βt:
  – βt = εt / (1 − εt)
AdaBoost: Fixed Bias
• Assume εt = 1/2 − γ
• We bound: ε ≤ (1 − 4γ²)^{T/2} ≤ e^{−2γ²T}
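A sketch of the arithmetic for this fixed-bias case, plugging εt = 1/2 − γ into the bound from the theorem above:

```latex
\epsilon \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
 \;=\; \prod_{t=1}^{T} 2\sqrt{\Bigl(\tfrac12-\gamma\Bigr)\Bigl(\tfrac12+\gamma\Bigr)}
 \;=\; \bigl(1-4\gamma^{2}\bigr)^{T/2}
 \;\le\; e^{-2\gamma^{2}T}.
```

In particular, T = O((ln m)/γ²) rounds drive the error on a sample of size m below 1/m, matching the round count of the Majority algorithm.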
Learning OR with Few Attributes
• Target function: OR of k literals
• Goal: learn in time
  – polynomial in k and log n
  – ε and δ constant
• ELIM makes "slow" progress
  – disqualifies one literal per round
  – may remain with O(n) literals
Set Cover - Definition
• Input: S1, …, St with Si ⊆ U
• Output: Si1, …, Sik such that ∪j Sij = U
• Question: Are there k sets that cover U?
• NP-complete
Set Cover: Greedy Algorithm
• j = 0; Uj = U; C = ∅
• While Uj ≠ ∅:
  – Let Si be arg max |Si ∩ Uj|
  – Add Si to C
  – Let Uj+1 = Uj − Si
  – j = j + 1
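A minimal sketch of the greedy rule in Python (function name and input representation are mine): repeatedly pick the set covering the most still-uncovered elements.

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover (a sketch).

    universe: a set of elements; sets: a list of Python sets.
    Returns the indices of the chosen sets.
    """
    uncovered = set(universe)
    cover = []
    while uncovered:
        # arg max_i |S_i ∩ U_j|
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:     # nothing left can be covered: U is not coverable
            break
        cover.append(best)
        uncovered -= sets[best]
    return cover
```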
Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C' of size k.
• C' is a cover for every Uj
• Some S in C' covers at least |Uj|/k elements of Uj
• Analysis of Uj: |Uj+1| ≤ |Uj| − |Uj|/k
• Solving the recursion: number of sets j ≤ k ln |U|
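Working out the recursion (a standard calculation, not spelled out on the slide):

```latex
|U_{j+1}| \;\le\; |U_j|\Bigl(1-\tfrac1k\Bigr)
\;\Longrightarrow\;
|U_j| \;\le\; |U|\Bigl(1-\tfrac1k\Bigr)^{j} \;\le\; |U|\,e^{-j/k} \;<\; 1
\quad\text{once } j > k\ln|U|,
```

so the greedy algorithm uses at most about k ln|U| sets, a ln|U| factor more than the optimal cover of size k.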
Building an Occam algorithm
• Given a sample S of size m
  – Run ELIM on S
  – Let LIT be the set of literals returned by ELIM
  – There exist k literals in LIT that classify all of S correctly
• Negative examples:
  – any subset of LIT classifies them correctly
Building an Occam algorithm
• Positive examples:
  – Search for a small subset of LIT that classifies S+ correctly
  – For a literal z, build Tz = { x in S+ | z satisfies x }
  – There are k sets Tz that cover S+
  – Greedy set cover finds k ln m sets that cover S+
• Output h = the OR of the k ln m chosen literals
• size(h) ≤ k ln m · log(2n)
• Sample size: m = O( k log n · log(k log n) )
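A minimal end-to-end sketch in Python combining ELIM with the greedy set cover over the positive examples; the function names, the bit-vector encoding of examples, and the literal indexing are my assumptions. It assumes the sample is realizable by an OR over the surviving literals, as the slides do.

```python
def learn_sparse_or(samples, n):
    """Occam algorithm for an OR of few literals (a sketch).

    samples: list of (x, label) with x a tuple of n bits, label in {0, 1}.
    Literal j in [0, n) means x[j] == 1; literal j in [n, 2n) means x[j - n] == 0.
    Returns a hypothesis h(x) -> {0, 1}, the OR of the chosen literals.
    """
    def lit_val(j, x):
        return x[j] if j < n else 1 - x[j - n]

    # ELIM: keep only literals consistent with every negative example
    lits = set(range(2 * n))
    for x, label in samples:
        if label == 0:
            lits -= {j for j in lits if lit_val(j, x)}

    # Greedy set cover over the positives: Tz = positives satisfied by literal z
    positives = [x for x, label in samples if label == 1]
    uncovered = set(range(len(positives)))
    chosen = []
    while uncovered and lits:
        best = max(lits, key=lambda j: sum(lit_val(j, positives[i]) for i in uncovered))
        covered = {i for i in uncovered if lit_val(best, positives[i])}
        if not covered:            # not realizable as an OR over LIT; stop
            break
        chosen.append(best)
        uncovered -= covered

    return lambda x: int(any(lit_val(j, x) for j in chosen))  # OR of the chosen literals
```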