Interactive Discovery in Large Data Sets Guest lecture

- Slides: 32
Interactive Discovery in Large Data Sets
Guest lecture for CS 886, Applied Machine Learning
Kiri L. Wagstaff, Jet Propulsion Laboratory
kiri.wagstaff@jpl.nasa.gov
October 11, 2012
Joint work with David R. Thompson (JPL), Nina Lanza (Los Alamos National Lab), Thomas G. Dietterich (Oregon State University), and Diana Blaney (JPL)
This work was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, © 2012. Government sponsorship acknowledged. It was also supported by the Defense Advanced Research Projects Agency (DARPA) under Contract W911NF-11-C-0088. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the DARPA, the Army

Interactive Discovery in Large Data Sets
• Discovery
  ◦ What is interesting? Novel?
  ◦ Big NASA data sets
    • LSST: 28 TB/day
    • SKA: 86 TB/day
• Explanations
  ◦ AI: actions + reasons for them
• Why “interactive”?
  ◦ No general definition of “interesting”
[Images: Large Synoptic Survey Telescope (LSST); Square Kilometre Array (SKA)]

Example: Mars Rover Panorama
Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)
22,348 x 5771 pixels = 386 MB
What’s most interesting here?

Zooming in
Portion of Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)

Zooming in
Labeled features: bright rocks, impact ejecta, dark rocks, possible meteorite, exposed layers, sand ripples
Portion of Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)

Discovery
• Exploration of large data sets
[Diagram: unsupervised learning loop in which the system chooses an item, then the system learns a model]
• Desiderata
  ◦ Diverse sampling of data set

What to select?
• Items that differ from those previously seen
• Principal Components Model
  ◦ Approximate model of data set variation
  ◦ Keep only the top K vectors from U
[Figure: known items; credit M. Scholz, 2006]

What to select?
• Items that differ from those previously seen
• Principal Components Model
  ◦ Approximate model of data set variation
  ◦ Keep only the top K vectors from U
  ◦ Select items in D that are difficult to represent with model U
    • Reconstruction error for x in D
[Equation annotations: mean of X; reconstruction of x]
[Figure: known items]

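The equation on this slide did not survive extraction. Its annotations (reconstruction error, mean of X, reconstruction of x) suggest the standard PCA reconstruction error, sketched below. The symbols \mu (mean of the known items X) and \hat{x} (reconstruction of x) are notation assumed here, and U is taken to hold the top K orthonormal basis vectors as columns.

```latex
\hat{x} = U U^{\top} (x - \mu) + \mu,
\qquad
R(x) = \left\lVert x - \hat{x} \right\rVert_2
```

Items in D with the largest R(x) are the ones the current model represents worst. Ranking by the squared norm gives the same order.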
Updating model U with new x
• Redo PCA from scratch: expensive
• Incrementally update U: fast!
  ◦ U depends only on previous U and new x
  ◦ [Ross et al., 2008]
[Diagram: principal components U1 to U5 computed from data X1 to X5 over successive iterations]

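As a rough illustration of why the incremental route is cheap, here is a minimal sketch of folding one new item into an existing thin SVD. It is a simplified stand-in, not the method of [Ross et al., 2008], which also updates the mean and supports a forgetting factor. The function name and the `k_max` parameter are assumptions made for this example.

```python
import numpy as np

def incremental_svd_update(U, S, x, k_max):
    """Fold one new column x into a thin SVD (U, S), keeping at most k_max components.

    Assumes U has orthonormal columns and S holds the matching singular values.
    No mean tracking is done here (a simplification).
    """
    p = U.T @ x                      # coordinates of x in the current basis
    r = x - U @ p                    # part of x the basis cannot represent
    r_norm = np.linalg.norm(r)
    k = S.shape[0]

    # Small (k+1) x (k+1) matrix; its SVD yields the updated factors,
    # so the cost depends on k, not on the number of items seen so far.
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = np.diag(S)
    K[:k, k] = p
    K[k, k] = r_norm
    Uk, Sk, _ = np.linalg.svd(K)

    # New direction contributed by x (zero vector if x is already in the span).
    j = r / r_norm if r_norm > 1e-12 else np.zeros_like(x)
    U_new = np.hstack([U, j[:, None]]) @ Uk
    return U_new[:, :k_max], Sk[:k_max]
```

When no components are truncated, the updated singular values match a batch SVD of all items; truncating to K components makes the result an approximation.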
DEMUD: Discovery through Eigenbasis Modeling of Uninteresting Data
Flowchart (loop):
• Initial ranking of D by PCA-1 reconstruction error
• Select most interesting x in D
• Treat x as “no longer interesting”
• Update model U to include x
• Compute score for all x in D using U, then return to the selection step

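The loop in the flowchart can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the model is refit with a batch SVD after each selection instead of being updated incrementally, items are stored as columns of D, and all function names are assumptions for this example.

```python
import numpy as np

def recon_error(X, U, mu):
    """Squared reconstruction error of each column of X under basis U and mean mu."""
    centered = X - mu[:, None]
    resid = centered - U @ (U.T @ centered)
    return np.sum(resid ** 2, axis=0)

def fit_basis(X, k):
    """Batch PCA of the columns of X: mean plus up to k basis vectors."""
    mu = X.mean(axis=1)
    U, S, _ = np.linalg.svd(X - mu[:, None], full_matrices=False)
    keep = S > 1e-10                 # drop directions with no variance
    return U[:, keep][:, :k], mu

def demud(D, k, n_select):
    """Return column indices of D in the order a DEMUD-style loop selects them."""
    U, mu = fit_basis(D, 1)          # initial PCA-1 model of the whole data set
    remaining = list(range(D.shape[1]))
    selected = []
    for _ in range(n_select):
        scores = recon_error(D[:, remaining], U, mu)
        best = remaining.pop(int(np.argmax(scores)))   # most interesting item
        selected.append(best)
        # The chosen item is now "no longer interesting": model it.
        U, mu = fit_basis(D[:, selected], k)
    return selected
```

Because every selected item is absorbed into the model, near-duplicates of earlier selections score low, which pushes the loop toward a diverse sample.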
McMurdo selections
• 1200 features: 100 x 100 RGB, downsampled 5x
• K=20
[Image: Selection #1]

McMurdo selections
• 1200 features: 100 x 100 RGB, downsampled 5x
• K=20
[Image: Selection #2]

McMurdo selections
• 1200 features: 100 x 100 RGB, downsampled 5x
• K=20
[Image: Selection #10]

Explanations
• Reconstruction residuals
[Figure: Original, Features, and Residuals shown for selections 1, 2, and 3]
Residuals: dark means lower intensity than expected; bright means higher intensity than expected
Dark area at bottom has small residual: learning happened!

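A minimal sketch of how such a residual explanation could be computed, assuming the model is a basis U with orthonormal columns and a mean mu. The function name and the `top` parameter are hypothetical.

```python
import numpy as np

def explain(x, U, mu, top=3):
    """Signed residual of x under the model, plus the most surprising features.

    Positive residual: the feature is higher than the model expected (bright).
    Negative residual: the feature is lower than the model expected (dark).
    """
    reconstruction = U @ (U.T @ (x - mu)) + mu
    residual = x - reconstruction
    top_features = np.argsort(-np.abs(residual))[:top]
    return residual, top_features
```

Features with near-zero residual are ones the model has already learned to predict, which is what the small residual in the dark area at the bottom indicates.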
Interactive Discovery
• Guided exploration of large data sets
[Diagram: semi-supervised learning loop in which the system chooses an item, then the user reviews the item]
• Desiderata
  ◦ Quickly find items of interest, even if rare
  ◦ Don’t miss anything!

Interactive DEMUD
Flowchart (loop):
• Initial ranking of D by PCA-1 reconstruction error
• Select most interesting x in D
• Query user on x: interesting or uninteresting?
• If uninteresting, update model U
• Compute score for all x in D using U, then return to the selection step

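The interactive variant differs only in when the model is updated. A minimal sketch follows, under the same assumptions as a batch-refit reading of DEMUD (items as columns of D, batch SVD in place of the incremental update). The `is_interesting` callback stands in for the user and is hypothetical.

```python
import numpy as np

def recon_error(X, U, mu):
    """Squared reconstruction error of each column of X under basis U and mean mu."""
    centered = X - mu[:, None]
    resid = centered - U @ (U.T @ centered)
    return np.sum(resid ** 2, axis=0)

def fit_basis(X, k):
    """Batch PCA of the columns of X: mean plus up to k basis vectors."""
    mu = X.mean(axis=1)
    U, S, _ = np.linalg.svd(X - mu[:, None], full_matrices=False)
    keep = S > 1e-10                 # drop directions with no variance
    return U[:, keep][:, :k], mu

def interactive_demud(D, k, is_interesting, max_queries):
    """Query the user on high-error items; model only the uninteresting ones."""
    U, mu = fit_basis(D, 1)          # initial PCA-1 model of the whole data set
    remaining = list(range(D.shape[1]))
    found, uninteresting = [], []
    for _ in range(min(max_queries, len(remaining))):
        scores = recon_error(D[:, remaining], U, mu)
        best = remaining.pop(int(np.argmax(scores)))
        if is_interesting(best):
            found.append(best)       # model unchanged: similar items stay high-scoring
        else:
            uninteresting.append(best)
            U, mu = fit_basis(D[:, uninteresting], k)
    return found, uninteresting
```

Leaving interesting items out of the model keeps their neighbors poorly reconstructed, so they continue to rank near the top of later queries.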
Alternatives: 1SVM-int
• Train a one-class SVM to model the interesting class
• Select most interesting item
[Figure axes: Varied vs. Smooth; Light vs. Dark]

Alternatives: 1SVM-unint
• Train a one-class SVM to model the uninteresting class
• Select least uninteresting item
[Figure axes: Varied vs. Smooth; Light vs. Dark]

Alternatives: 2SVM-both
• Train a two-class SVM to model both classes
• Select most interesting item
[Figure axes: Varied vs. Smooth; Light vs. Dark]

Alternatives: 2SVM-active
• Train a two-class SVM to model both classes
• Select most ambiguous item
[Figure axes: Varied vs. Smooth; Light vs. Dark]

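The four SVM-based alternatives differ mainly in which item they pick once a model has been trained. A minimal sketch of just the selection rules follows, assuming each unlabeled item already has a signed decision value where larger means "more like the modeled class" (one-class SVMs) or "more like the interesting class" (two-class SVMs). That score convention and the function name are assumptions for this example; training the SVMs is not shown.

```python
import numpy as np

def select_next(scores, rule):
    """Pick the index of the next item to show, given per-item decision values."""
    scores = np.asarray(scores, dtype=float)
    if rule in ("1SVM-int", "2SVM-both"):
        return int(np.argmax(scores))          # most interesting item
    if rule == "1SVM-unint":
        return int(np.argmin(scores))          # least uninteresting item
    if rule == "2SVM-active":
        return int(np.argmin(np.abs(scores)))  # most ambiguous: closest to the boundary
    raise ValueError(f"unknown rule: {rule}")
```

Only 1SVM-unint, like DEMUD, models the uninteresting class; the other three depend on positives having been found already.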
Premature Specialization
(A possible danger of training on positives in a discovery setting)

Alternatives: Static Baseline
• Select by PCA-K ordering
  ◦ Same initial model as DEMUD
• No feedback

Faces Data Set
• 40 people, 10 poses each
• High dimensionality: 10,304
• Goal: Discover 3 women
  ◦ Data set is mostly men
  ◦ Challenge: disjunction
Data from AT&T Laboratories Cambridge

Faces: Three Women Discovery
[Plot annotations: Perfect; Premature specialization; Random; Delayed start]

Exploration Rate
Number of queries to find one image of each woman

Algorithm        Queries
DEMUD            43
1SVM-unint       50
1SVM-int         164
2SVM-both        124
2SVM-active      124
Static baseline  223

Rare Class Discovery
Plot 1: 6 classes, d=7, n=336, K=4
Plot 2: 6 classes, d=9, n=214, K=4
Plot annotation: DEMUD
Figures from [He and Carbonell, 09]

CRISM: Magnesite Discovery
• Magnesite (MgCO3): possible groundwater
• CRISM data: 0.364 to 3.92 μm, 197 bands
• Only 17 of 15,400 items match deposit
[Figure: random subset of CRISM data, with MgCO3 highlighted]
Data from Mars Reconnaissance Orbiter/CRISM

CRISM: Magnesite Discovery
[Plot annotations: Perfect; Random; Delayed start; Ran out of time (3 days!), ~2 mins per query]

ChemCam: Carbonates
• ChemCam: LIBS instrument on MSL
• Data set: 60 lab samples + 40 carbonates
  ◦ 6143 features (bands)
  ◦ K = 8, chosen to capture 90% of the variance
[Figure panels: Regular PCA; 1SVM-int; DEMUD. Colored items are carbonates; white are non-carbonates]

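One common way to pick K from a variance target is to take the smallest number of components whose cumulative variance reaches the chosen fraction. The sketch below shows that rule; whether the slide's K was computed exactly this way is an assumption, as are the function names. Items are stored as columns of D.

```python
import numpy as np

def k_for_variance(variances, frac=0.9):
    """Smallest K whose leading components reach `frac` of the total variance.

    `variances` must be sorted from largest to smallest.
    """
    variances = np.asarray(variances, dtype=float)
    cumulative = np.cumsum(variances) / np.sum(variances)
    k = int(np.searchsorted(cumulative, frac) + 1)
    return min(k, len(variances))

def choose_k(D, frac=0.9):
    """Apply the rule to a d x n data matrix D (one item per column)."""
    centered = D - D.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)   # singular values, descending
    return k_for_variance(s ** 2, frac)             # variance is proportional to s^2
```

For example, component variances of 5, 3, 1, 1 give cumulative fractions 0.5, 0.8, 0.9, 1.0, so a target of 0.85 needs three components.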
DEMUD: Explanations
• Top 10 items chosen by DEMUD
[Figure: items ranked by “interestingness”, shown with their explanations (residuals)]

Other Applications
• Text
  ◦ Long Wavelength Array system log files
  ◦ Detect anomalous system behavior
• Onboard prioritization
  ◦ Imaging spectrometers
    • Hyperion on EO-1: 256 x 6000 x 242
  ◦ Assign priorities for input to onboard compression: ROI-ICER

Summary
• Discovery
  ◦ PCA-based model + reconstruction error
• Explanations
  ◦ Why was it chosen?
• Interactive discovery
  ◦ Model the uninteresting to avoid it
• Next challenge: evolving class of interest
Thank you! Contact: kiri.wagstaff@jpl.nasa.gov