Active Learning Workshop
Active Learning Challenge
http://clopinet.com/al

Schedule
See: http://clopinet.com/al
• 8:00-12:00 talks
• 12:00-13:30 posters and lunch
• 13:30-16:30 talks again
• 16:30-20:00 beach
• 20:00-22:00 dinner
----------------------------
Preprints: login: AISTATS; password: papers 2010
- Sign up for transportation
- Sign up for excursion

Results of the Active Learning Challenge
Isabelle Guyon (Clopinet, California)
Gavin Cawley (University of East Anglia, UK)
Gideon Dror (Academic College of Tel-Aviv-Yaffo, Israel)
Vincent Lemaire (Orange, France)

Data donors
This project would not have been possible without generous donations of data:
• Chemoinformatics -- Charles Bergeron, Kristin Bennett and Curt Breneman (Rensselaer Polytechnic Institute, New York) contributed a dataset, which will be used for final testing.
• Embryology -- Emmanuel Faure, Thierry Savy, Louise Duloquin, Miguel Luengo Oroz, Benoit Lombardot, Camilo Melani, Paul Bourgine, and Nadine Peyriéras (Institut des systèmes complexes, France) contributed the ZEBRA dataset.
• Handwriting recognition -- Reza Farrahi Moghaddam, Mathias Adankon, Kostyantyn Filonenko, Robert Wisnovsky, and Mohamed Chériet (Ecole de technologie supérieure de Montréal, Quebec) contributed the IBN_SINA dataset.
• Marketing -- Vincent Lemaire, Marc Boullé, Fabrice Clérot, Raphael Féraud, Aurélie Le Cam, and Pascal Gouzien (Orange, France) contributed the ORANGE dataset, previously used in the KDD Cup 2009.
We also reused data made publicly available on the Internet:
• Chemoinformatics -- The National Cancer Institute (USA) for the HIVA dataset.
• Ecology -- Jock A. Blackard, Denis J. Dean, and Charles W. Anderson (US Forest Service, USA) for the SYLVA dataset (Forest cover type).
• Text processing -- Tom Mitchell (USA) and Ron Bekkerman (Israel) for the NOVA dataset (derived from the Twenty Newsgroups).

Also, many thanks to:
• Web platform: Olivier Guyon (MisterP.net, France)
• Computer hosting: Prof. Joachim Buhmann (ETH Zurich, Switzerland)
• System administration: Thomas Fuchs (ETH Zurich, Switzerland)
• Advisors and beta testers:
  Olivier Chapelle (Yahoo!, California)
  Amir Reza Saffari Azar (Graz University of Technology)
• Causality workbench team:
  Constantin Aliferis (New York University, USA)
  Greg Cooper (University of Pittsburgh, USA)
  Andre Elisseeff (Nhumi, Switzerland)
  Jean-Philippe Pellet (IBM Zurich, Switzerland)
  Peter Spirtes (Carnegie Mellon University, USA)
  Alexander Statnikov (New York University, USA)

Labeling data is expensive

Examples of domains
• Chemo-informatics
• Handwriting and speech recognition
• Image processing
• Text processing
• Marketing
• Ecology
• Embryology

What is active learning?

Focus on “pool-based” AL
• Simplest scenario for a challenge.
  Training data: labels can be queried
  Test data: unknown labels
• Methods developed for pool-based AL should also be useful for stream-based AL.

Datasets

Difficulties
• Sparse data
• Missing values
• Unbalanced classes
• Categorical variables
• Noisy data
• Large datasets

Protocol (using the Virtual Lab)
0. Get 1 seed labeled example from the positive class
1. Predict
2. Sample
3. Submit a query
4. Retrieve the labels
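
The protocol steps can be sketched as a small simulation. This is only an illustration under stated assumptions, not the Virtual Lab itself: the names `run_protocol` and `nearest_label_predict`, the 1-D toy data, the random sampling step, and the in-memory `oracle` list standing in for the label server are all hypothetical.

```python
import random

def nearest_label_predict(pool, labeled):
    """Toy predictor (assumption): copy the label of the closest labeled sample."""
    return [labeled[min(labeled, key=lambda j: abs(pool[j] - x))] for x in pool]

def run_protocol(pool, oracle, seed_index, batch_size, rounds, rng):
    """Simulate the loop: seed, predict, sample, submit a query, retrieve labels."""
    labeled = {seed_index: oracle[seed_index]}        # step 0: one positive seed
    history = []                                      # (labels used, predictions)
    for _ in range(rounds):
        predictions = nearest_label_predict(pool, labeled)   # step 1: predict
        history.append((len(labeled), predictions))
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # step 2: sample (here uniformly at random; any strategy could go here)
        query = rng.sample(unlabeled, min(batch_size, len(unlabeled)))
        for i in query:                               # steps 3-4: query, get labels
            labeled[i] = oracle[i]
    return labeled, history

pool = [0.1, 0.2, 0.8, 0.9]      # hypothetical 1-D features
oracle = [-1, -1, 1, 1]          # hidden labels, revealed only when queried
labeled, history = run_protocol(pool, oracle, seed_index=2, batch_size=1,
                                rounds=10, rng=random.Random(0))
```

Each entry of `history` pairs the number of labels used so far with the predictions made at that point, which is the raw material for a learning curve.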

Evaluation
Area under the Learning Curve (ALC)
Linear interpolation. Horizontal extrapolation.
[Figure: learning curves for one query, five queries, thirteen queries, and a lazy learner that asks for all labels at once]
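
A minimal sketch of the area computation, assuming the learning curve is given as (number of labels, score) points: linear interpolation between points is the trapezoid rule, and horizontal extrapolation extends the last score to the total number of labels. The function name, the division by the curve width, and the example numbers are assumptions; any axis scaling or normalization used in the official score is not stated on the slide and is not reproduced here.

```python
def area_under_learning_curve(points, x_max):
    """Average height of a learning curve over [first x, x_max].

    points: iterable of (labels_used, score) pairs.
    x_max:  total number of labels available (end of the x-axis).
    """
    pts = sorted(points)
    if x_max <= pts[0][0]:
        raise ValueError("x_max must exceed the first point's x value")
    if pts[-1][0] < x_max:
        # Horizontal extrapolation: hold the last score until x_max.
        pts.append((x_max, pts[-1][1]))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Linear interpolation between consecutive points (trapezoid rule).
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (pts[-1][0] - pts[0][0])

# Hypothetical curves: an early query versus asking for all labels at once.
eager = area_under_learning_curve([(1, 0.5), (3, 0.9)], x_max=5)
lazy = area_under_learning_curve([(1, 0.5), (5, 0.9)], x_max=5)
```

In this toy example the learner that reaches a good score after few labels gets the larger area, which is the behavior the ALC is meant to reward.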

Participation
• Over 300 registered participants.
• 30 manually verified teams of 1 to 20 members for the final test.
• 50 countries represented.
• ~700 development entries and ~250 final test entries.

Who won? Everybody!
• Avicenna (HWR): Yanjie Fu and ChuanRen Liu, Chinese Academy of Sciences.
• Banana (Marketing): Ming-Hen Tsai and Chia-Hua Ho, National Taiwan University.
• Chemo (Chemoinformatics): Cristian Grozea, Fraunhofer Institute FIRST, Germany, and Marius Popescu, University of Bucharest, Faculty of Mathematics and Computer Science, Romania.
• Docs (Document classification): Zalan Bodo, Zsolt Minier, and Lehel Csato, Babes-Bolyai University, Romania.
• Embryo (Embryology): Yukun Chen and Subramani Mani, Vanderbilt University, USA.
• Forest (Ecology): Alexander Borisov and Eugene Tuv, Intel, USA.

Who won?
Winners by average rank over all datasets:
• 1st place: Alexander Borisov and Eugene Tuv, Intel, with Ensembles of Decision Trees.
• 2nd place: Ming-Hen Tsai and Chia-Hua Ho, National Taiwan University, with Support Vector Machines.
• 3rd place: Jianjun Xie, First American CoreLogic, USA, and Tao Xiong, eBay, USA, with Stochastic Semi-supervised Learning.

Global Score (ALC)

AUC

Strategies

Uncertainty Sampling
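
As a rough illustration of the general technique (the slide content itself is not available here), uncertainty sampling for a binary classifier can be sketched as querying the unlabeled samples whose predicted probability is closest to 0.5. The function name and the example probabilities are hypothetical.

```python
def uncertainty_sample(probabilities, unlabeled, k):
    """Return the k unlabeled indices whose predicted probability of the
    positive class is closest to 0.5, i.e. where the model is least sure."""
    ranked = sorted(unlabeled, key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Hypothetical model outputs for five samples; sample 4 is already labeled.
probabilities = [0.9, 0.52, 0.1, 0.45, 0.5]
query = uncertainty_sample(probabilities, unlabeled=[0, 1, 2, 3], k=2)
```

Here the two samples with probabilities 0.52 and 0.45 are selected, while the confidently classified samples (0.9 and 0.1) are left for later.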

Implementations

Conclusion
• High level of participation.
• Wide variety of techniques.
• Semi-supervised learning is needed to achieve good performance in the first part of the learning curve.
• Some degree of randomization in the query process is necessary to achieve good results.
• Verification with systematic experiments (Cawley, 2010).
• Viability of the Virtual Laboratory, which will be used in upcoming challenges on experimental
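
One possible way to add randomization to the query process, shown only as a sketch: with probability `epsilon` each query is drawn at random from the remaining pool, otherwise the most uncertain remaining sample is taken. The slides do not say how participants randomized their queries, so the epsilon scheme, the function name, and the example values are all assumptions.

```python
import random

def randomized_query(probabilities, unlabeled, k, epsilon, rng):
    """Pick k samples, mixing random picks with most-uncertain-first picks."""
    # Most uncertain first: predicted probability closest to 0.5.
    remaining = sorted(unlabeled, key=lambda i: abs(probabilities[i] - 0.5))
    chosen = []
    while remaining and len(chosen) < k:
        if rng.random() < epsilon:
            pick = rng.choice(remaining)    # exploration: random sample
        else:
            pick = remaining[0]             # exploitation: most uncertain
        remaining.remove(pick)
        chosen.append(pick)
    return chosen

probabilities = [0.9, 0.52, 0.1, 0.45]     # hypothetical model outputs
greedy = randomized_query(probabilities, [0, 1, 2, 3], 2, 0.0, random.Random(0))
mixed = randomized_query(probabilities, [0, 1, 2, 3], 2, 1.0, random.Random(0))
```

With `epsilon=0.0` this reduces to plain uncertainty sampling; with `epsilon=1.0` every query is random.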