Machine Learning Research Presented by Isabelle Guyon BIOwulf


BIOwulf
1 - People
2 - Technology
3 - Results

1 - People BIOwulf Technologies

Research people
+ Isabelle Guyon
+ Peter Bartlett
+ Asa Ben Hur
+ Nello Cristianini
+ René Doursat
+ David Lewis (c)
+ Ed Reiss
+ Shelia Guberman
+ Vladimir Vapnik (c)
+ Bernhard Schölkopf
+ André Elisseeff
+ Olivier Chapelle
- Olivier Bousquet
+ Jason Weston
- Alex Smola (c)
+ Hong Zhang

2 - Technology BIOwulf Technologies

Technology: SVM
• Kernel machines: $F(x) = \sum_k \alpha_k K(x_k, x)$
• Sparsity: the sum runs only over the support vectors
Boser-Guyon-Vapnik (1992): http://www.clopinet.com/isabelle/Papers/colt92.ps.Z
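As a rough illustration of the kernel expansion on this slide, the sketch below evaluates F(x) as a weighted sum of kernel values over the support vectors only. The Gaussian kernel, the parameter `gamma`, the function names, and the two toy support vectors are assumptions made for the example; the slide does not specify a kernel, and the coefficients here are taken to carry the class sign.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||a - b||^2); an assumed choice of K."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def decision_function(x, support_vectors, alphas, kernel=rbf_kernel):
    """F(x) = sum_k alpha_k * K(x_k, x), summed over support vectors only."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))

# Hypothetical toy model: two support vectors with signed coefficients.
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([1.0, -1.0])

f_near_first = decision_function([0.1, 0.0], support_vectors, alphas)
f_near_second = decision_function([2.0, 1.9], support_vectors, alphas)
f_midpoint = decision_function([1.0, 1.0], support_vectors, alphas)
```

In this toy setup a point near the first support vector gets F(x) > 0, a point near the second gets F(x) < 0, and the midpoint lies on the boundary F(x) = 0. Evaluation cost scales with the number of support vectors, not with the size of the training set.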

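SVM: Universality & Generalization
[Figure: input space x = (x1, x2) with axes x1 and x2; the decision boundary F(x) = 0 separates the region F(x) > 0 from the region F(x) < 0.]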

Neural Networks: Local Optima

SVM key properties

Core problems: Feature/Pattern Selection, Classification, Kernel Methods, Regression, Model Selection, Novelty Detection, Control Problems, Clustering, Statistical Learning Theory, Causality Inference, SVMs

3 - Results BIOwulf Technologies

Scope Life Sciences Seismic Geological Imaging & Signal Processing Telecom Internet BIOWulf Technologies Financial Military Security Fraud & Abuse

Strategy Data Collection Data Analysis Result validation

Data
Data Collection: Spectra, Medical Images, Microarray data, Medical & Biology Literature, Genomic Sequences, Medical & Demographic Records

Discovery Platform demo
[Diagram; labels: Internet, Numerical Lab, raw data, DA, Information Center, IR, prospects, service, customers, tool, scientists, numerical results, data analyst, structured info, researcher.]

Microarray Data
- Preprocessing
- Gene selection
- Data cleaning
[Figure: microarray data matrix; labels: Outlier, BPH, G4.]
Prostate cancer, Stamey-Guyon, Dec. 2000

Two best genes Golub SVM Prostate cancer, Stamey-Guyon, Dec. 2000

Tree Explorer
[Figure: gene tree; node labels include R88740, T62947, M59040, H64807, H08393, T94579, H81558, T64012, R55310, T86444, U09564, H06524, U19969, T58861, L08069, M82919, L03840, D14812, L06895.]
Guyon-Doursat-Reiss, 2000

Spectroscopy
[Figure: spectrum f(t) (Class 1) and spectrum g(t) (Class 2), each plotted against t.]
Simple kernel: $K(f, g) = \int f(t)\, g(t)\, dt$
Alignment kernel: $K(f, g) = \iint f(t)\, g(t - x)\, \exp(-\gamma x^2)\, dt\, dx$
Infrared spectra, Elisseeff-Bartlett, Feb. 2001
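To make the two spectral kernels concrete, here is a minimal numerical sketch that approximates both integrals with Riemann sums on a uniform grid. The grid, the toy Gaussian peaks, the shift range `max_shift`, the value of `gamma`, and the zero-padding of g outside the sampled range are all assumptions for illustration, not details taken from the slide.

```python
import numpy as np

def simple_kernel(f, g, dt):
    """Riemann-sum approximation of the integral of f(t) g(t) dt."""
    return float(np.dot(f, g) * dt)

def alignment_kernel(f, g, dt, gamma, max_shift):
    """Approximate the double integral of f(t) g(t - x) exp(-gamma x^2) dt dx.

    Shifts x are multiples of dt in [-max_shift*dt, max_shift*dt], with
    max_shift < len(f); g is treated as zero outside the sampled range.
    """
    n = len(f)
    total = 0.0
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            overlap = np.dot(f[s:], g[:n - s])   # sum_i f[i] * g[i - s]
        else:
            overlap = np.dot(f[:n + s], g[-s:])  # same sum for negative s
        total += overlap * np.exp(-gamma * (s * dt) ** 2)
    return float(total * dt * dt)

# Hypothetical spectra: the same narrow peak at two different positions.
t = np.linspace(0.0, 10.0, 201)
dt = t[1] - t[0]
f = np.exp(-((t - 3.0) / 0.1) ** 2)
g = np.exp(-((t - 4.0) / 0.1) ** 2)

k_simple = simple_kernel(f, g, dt)
k_align = alignment_kernel(f, g, dt, gamma=1.0, max_shift=40)
k_align_swapped = alignment_kernel(g, f, dt, gamma=1.0, max_shift=40)
```

With these toy peaks the simple kernel is close to zero because the peaks do not overlap, while the alignment kernel is clearly positive: it credits the match found at a shift of about one unit, down-weighted by exp(-gamma x^2).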

Ciphergen Spectra
• 299 features (peak values)
• 385 examples (325 training, 60 test)
• 4 classes (15 test examples per class): A = BPH, B and C = cancer (B < C), D = ref.
• SVM multi-class error rate: 15% (9/60)
• 59 peaks separate the training set perfectly
[Figure: class ordering D < A < B < C.]
Prostate cancer, Elisseeff-Guyon-Weston, May 2001

Conclusions
SVM advantages in pattern recognition:
• Superior prediction performance on test data.
• Unique, easy-to-interpret solution.
• Better feature selection (only 2-7 genes in the microarray experiments).
• Uses all the data, with automatic data cleaning.
• Incorporates knowledge about the task in the kernel.
• Can be combined with other methods.