Active Learning
Lecture 26
Maria-Florina Balcan
Active Learning
[Diagram: unlabeled examples flow from the Data Source to the Learning Algorithm; the algorithm sends the Expert/Oracle a request for the label of an example and receives a label for that example; after several such rounds the algorithm outputs a classifier.]
• The learner can choose specific examples to be labeled.
• It works harder computationally in order to use fewer labeled examples.
What Makes a Good Algorithm?
• Guaranteed to output a relatively good classifier for most learning problems.
• Doesn't make too many label requests.
• Chooses its label requests carefully, to get informative labels.
Can It Really Do Better Than Passive?
• YES! (sometimes)
• We often need far fewer labels for active learning than for passive learning.
• This is predicted by theory and has been observed in practice.
Can adaptive querying help? [CAL 92, Dasgupta 04]
• Threshold functions on the real line: $h_w(x) = \mathbf{1}(x \ge w)$, $C = \{h_w : w \in \mathbb{R}\}$.
[Figure: points on the line, labeled $-$ to the left of $w$ and $+$ to the right.]
Active algorithm: sample $O(1/\epsilon)$ unlabeled examples; do binary search over them.
• Binary search needs just $O(\log 1/\epsilon)$ labels.
Passive supervised learning: $\Omega(1/\epsilon)$ labels to find an $\epsilon$-accurate threshold. Active: only $O(\log 1/\epsilon)$ labels. Exponential improvement.
There are other interesting results as well.
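The threshold example is simple enough to run. Below is a minimal Python sketch of the slide's setup; the function name `active_threshold` and the oracle interface are illustrative, not from the lecture. It draws a pool of unlabeled points and then binary searches for the decision boundary, so the number of label queries is logarithmic in the pool size.

```python
import random

def active_threshold(oracle, n_unlabeled, rng=None):
    """Learn a threshold h_w(x) = 1(x >= w) on [0, 1] by binary search.

    oracle(x) returns the true label in {0, 1}.  With a pool of
    O(1/eps) unlabeled draws, only O(log(1/eps)) labels are requested.
    """
    rng = rng or random.Random(0)
    xs = sorted(rng.uniform(0.0, 1.0) for _ in range(n_unlabeled))
    if oracle(xs[0]) == 1:        # whole pool is positive
        return xs[0]
    if oracle(xs[-1]) == 0:       # whole pool is negative
        return xs[-1]
    lo, hi = 0, len(xs) - 1       # invariant: xs[lo] labeled 0, xs[hi] labeled 1
    while hi - lo > 1:            # binary search over the sorted pool
        mid = (lo + hi) // 2
        if oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
    return (xs[lo] + xs[hi]) / 2  # any point in the gap is an eps-accurate threshold

# Usage: with ~2000 unlabeled points, about a dozen labels suffice.
w_hat = active_threshold(lambda x: int(x >= 0.37), n_unlabeled=2000)
```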
Active Learning might not help [Dasgupta 04]
In general, the number of queries needed depends on C and also on D.
• $C$ = {linear separators in $\mathbb{R}^1$}: active learning reduces sample complexity substantially.
• $C$ = {linear separators in $\mathbb{R}^2$}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution!
[Figure: hypotheses $h_0, h_1, h_2, h_3$.]
In this case, learning to accuracy $\epsilon$ requires $1/\epsilon$ labels.
Examples where Active Learning helps
In general, the number of queries needed depends on C and also on D.
• $C$ = {linear separators in $\mathbb{R}^1$}: active learning reduces sample complexity substantially, no matter what the input distribution is.
• $C$ = homogeneous linear separators in $\mathbb{R}^d$, $D$ = uniform distribution over the unit sphere: need only $O(d \log 1/\epsilon)$ labels to find a hypothesis with error rate $< \epsilon$.
[Freund et al. '97; Dasgupta, Kalai, Monteleoni, COLT 2005; Balcan-Broder-Zhang, COLT 2007]
Region of uncertainty [CAL 92]
• Current version space: the part of C consistent with the labels so far.
• "Region of uncertainty" = the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space).
• Example: data lies on a circle in $\mathbb{R}^2$ and hypotheses are homogeneous linear separators.
[Figure: current version space and the corresponding region of uncertainty in the data space.]
Region of uncertainty [CAL 92]
[Figure: current version space and region of uncertainty.]
Algorithm: pick a few points at random from the current region of uncertainty and query their labels.
Region of uncertainty [CAL 92]
After the queried labels are incorporated, the version space shrinks.
[Figure: new version space and the new, smaller region of uncertainty in the data space.]
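To make the region-of-uncertainty loop concrete, here is a hedged Python sketch for the one-dimensional threshold class, where the version space is an interval and the region of uncertainty is exactly that interval. The helper name `cal_thresholds` is made up for illustration; the general CAL algorithm maintains a version space over an arbitrary class.

```python
import random

def cal_thresholds(oracle, stream, eps):
    """CAL-style (disagreement-based) learner for thresholds on [0, 1].

    Version space: thresholds w in (lo, hi] consistent with labels so far.
    A point x is queried only if it falls in the region of uncertainty,
    i.e. lo < x < hi; all other points are discarded label-free.
    """
    lo, hi = 0.0, 1.0
    labels_used = 0
    for x in stream:
        if lo < x < hi:            # x lies in the region of uncertainty
            labels_used += 1
            if oracle(x) == 1:     # h_w(x) = 1 means w <= x: shrink from above
                hi = x
            else:                  # h_w(x) = 0 means w > x: shrink from below
                lo = x
        if hi - lo <= eps:         # version space (hence error) is small: stop
            break
    return (lo + hi) / 2, labels_used

# Usage: stream many unlabeled points; only a handful get labeled.
rng = random.Random(1)
stream = (rng.uniform(0.0, 1.0) for _ in range(100_000))
w_hat, used = cal_thresholds(lambda x: int(x >= 0.37), stream, eps=1e-3)
```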
Region of uncertainty [CAL 92], Guarantees
Algorithm: pick a few points at random from the current region of uncertainty and query their labels.
[Balcan, Beygelzimer, Langford, ICML 06] analyze a version of this algorithm that is robust to noise:
• $C$ = linear separators on the line, low noise: exponential improvement.
• $C$ = homogeneous linear separators in $\mathbb{R}^d$, $D$ = uniform distribution over the unit sphere:
  • low noise: need only $O(d^2 \log 1/\epsilon)$ labels to find a hypothesis with error rate $< \epsilon$;
  • realizable case: $O(d^{3/2} \log 1/\epsilon)$ labels;
  • supervised learning (for comparison): $O(d/\epsilon)$ labels.
Margin-Based Active Learning Algorithm [Balcan-Broder-Zhang, COLT 07]
Use $O(d)$ labeled examples to find $w_1$ of error $\le 1/8$.
Iterate $k = 2, \ldots, \log(1/\epsilon)$:
• rejection sample $m_k$ samples $x$ from $D$ satisfying $|w_{k-1}^T x| \le \gamma_k$;
• label them;
• find $w_k \in B(w_{k-1}, 1/2^k)$ consistent with all these examples.
End iterate.
[Figure: $w^*$, $w_k$, $w_{k+1}$, and the band of width $\gamma_k$ around $w_{k-1}$'s decision boundary.]
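A runnable sketch of this loop, assuming the setting of the next slides (uniform $P_X$ on the sphere, homogeneous target $w^*$). The band schedule `gamma_k` proportional to $2^{-k}/\sqrt{d}$ is the standard scaling but the constants are illustrative, and the averaging update below is a simple stand-in for the paper's step of finding a consistent $w_k \in B(w_{k-1}, 1/2^k)$, not the exact subroutine.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def margin_based_al(oracle, d, n_rounds, m_k=200, seed=0):
    """Margin-based active learning sketch (after Balcan-Broder-Zhang '07)."""
    rng = np.random.default_rng(seed)
    # Round 1: O(d) labeled random examples; label-weighted mean as a crude w_1.
    X = sample_sphere(8 * d, d, rng)
    y = np.array([oracle(x) for x in X])          # labels in {-1, +1}
    w = (y[:, None] * X).mean(axis=0)
    w /= np.linalg.norm(w)
    for k in range(2, n_rounds + 1):
        gamma_k = 2.0 ** (-k) / np.sqrt(d)        # shrinking band width
        band = []
        while len(band) < m_k:                    # rejection sample inside the band
            x = sample_sphere(1, d, rng)[0]
            if abs(w @ x) <= gamma_k:
                band.append(x)
        Xb = np.array(band)
        yb = np.array([oracle(x) for x in Xb])    # only band points get labeled
        w = w + (yb[:, None] * Xb).mean(axis=0)   # nudge w toward consistency
        w /= np.linalg.norm(w)
    return w

# Usage: recover a hidden direction in R^10 from band queries only.
rng = np.random.default_rng(42)
w_star = sample_sphere(1, 10, rng)[0]
w_hat = margin_based_al(lambda x: 1 if w_star @ x >= 0 else -1, d=10, n_rounds=6)
```

The point of the rejection step is that examples far from the current boundary would almost surely be labeled correctly already, so labels are spent only where the current hypothesis is uncertain.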
Margin-Based Active Learning, Realizable Case
Theorem: If $P_X$ is uniform over the unit sphere in $\mathbb{R}^d$ and the $m_k$ and $\gamma_k$ are chosen appropriately, then after $s = \log(1/\epsilon)$ iterations, $w_s$ has error $\le \epsilon$.
Fact 1: under the uniform distribution, the disagreement between homogeneous separators $u$ and $v$ equals the angle between them divided by $\pi$: $\Pr_x[\operatorname{sign}(u \cdot x) \ne \operatorname{sign}(v \cdot x)] = \theta(u, v)/\pi$.
[Figures: Facts 2 and 3, geometric bounds used in the proof.]
BBZ 07, Proof Idea
Iterate $k = 2, \ldots, \log(1/\epsilon)$: rejection sample $m_k$ samples $x$ from $D$ satisfying $|w_{k-1}^T x| \le \gamma_k$; ask for their labels and find $w_k \in B(w_{k-1}, 1/2^k)$ consistent with all these examples.
Assume $w_k$ has error $\le \alpha$. We are done if there exists a $\gamma_k$ such that $w_{k+1}$ has error $\le \alpha/2$, and we only need $O(d \log 1/\epsilon)$ labels in round $k$.
[Figure: $w^*$, $w_k$, $w_{k+1}$, and the band of width $\gamma_k$.]
BBZ 07, Proof Idea (continued)
Key point: under the uniform distribution assumption, for an appropriately chosen $\gamma_k$ (on the order of $2^{-k}/\sqrt{d}$), the error of $w_{k+1}$ outside the band $\{x : |w_k \cdot x| \le \gamma_k\}$ is at most $\alpha/4$.
Key point: so it is enough to ensure that the error of $w_{k+1}$ inside the band is at most $\alpha/4$, and we can do so using only $O(d \log 1/\epsilon)$ labels in round $k$.
[Figure: $w^*$, $w_k$, $w_{k+1}$, and the band of width $\gamma_k$.]
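Putting the two key points together gives the halving recursion the proof needs; here is the accounting as a LaTeX sketch, with $\alpha$ denoting the error of $w_k$ (notation assumed, not from the slides):

```latex
% Split the error of w_{k+1} across the band around w_k's boundary:
\begin{align*}
\operatorname{err}(w_{k+1})
 &= \Pr_x\!\big[\operatorname{sign}(w_{k+1}\cdot x) \ne \operatorname{sign}(w^*\cdot x),\ |w_k \cdot x| > \gamma_k\big] \\
 &\quad + \Pr_x\!\big[\operatorname{sign}(w_{k+1}\cdot x) \ne \operatorname{sign}(w^*\cdot x),\ |w_k \cdot x| \le \gamma_k\big]
 \;\le\; \frac{\alpha}{4} + \frac{\alpha}{4} \;=\; \frac{\alpha}{2}.
\end{align*}
% Unrolling over k = 2, ..., s = log(1/eps) rounds starting from
% err(w_1) <= 1/8 gives err(w_s) <= eps, with O(d log(1/eps)) labels per round.
```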