
Interactive Discovery in Large Data Sets
Guest lecture for CS 886, Applied Machine Learning

Kiri L. Wagstaff
Jet Propulsion Laboratory
kiri.wagstaff@jpl.nasa.gov
October 11, 2012

Joint work with David R. Thompson (JPL), Nina Lanza (Los Alamos National Lab), Thomas G. Dietterich (Oregon State University), and Diana Blaney (JPL)

This work was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, © 2012. Government sponsorship acknowledged. It was also supported by the Defense Advanced Research Projects Agency (DARPA) under Contract W911NF-11-C-0088. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the Army

Interactive Discovery in Large Data Sets
  • Discovery
    ◦ What is interesting? Novel?
    ◦ Big NASA data sets
      • LSST (Large Synoptic Survey Telescope): 28 TB/day
      • SKA (Square Kilometre Array): 86 TB/day
  • Explanations
    ◦ AI: actions + reasons for them
  • Why “interactive”?
    ◦ No general definition of “interesting”
Interactive Discovery in Large Data Sets - CS 886

Example: Mars Rover Panorama
Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)
22,348 x 5771 pixels = 386 MB
What’s most interesting here?
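The 386 MB figure is consistent with uncompressed 8-bit RGB storage at 3 bytes per pixel. The slide does not state the pixel format, so that is an assumption here. A quick arithmetic check:

```python
# Assumption: uncompressed 8-bit RGB, i.e. 3 bytes per pixel (not stated on the slide).
WIDTH, HEIGHT = 22_348, 5_771
BYTES_PER_PIXEL = 3


def raw_image_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Raw (uncompressed) image size in bytes."""
    return width * height * bytes_per_pixel


pixels = WIDTH * HEIGHT
size_bytes = raw_image_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
size_mb = size_bytes / 1_000_000  # decimal megabytes
print(f"{pixels:,} pixels, about {size_mb:.1f} MB")
```

At roughly 129 million pixels, a single panorama is already too large to inspect exhaustively by eye, which motivates automated selection.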

Zooming in
Portion of Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)

Zooming in
  • Bright rocks
  • Impact ejecta
  • Dark rocks
  • Possible meteorite
  • Exposed layers
  • Sand ripples
Portion of Spirit’s McMurdo Panorama, 1000 sols, October 2006 (NASA/JPL/Cornell)

Unsupervised learning: Discovery
  • Exploration of large data sets
    ◦ Loop: system chooses item, system learns model
  • Desiderata
    ◦ Diverse sampling of data set

What to select?
  • Items that differ from those previously seen
  • Principal Components Model
    ◦ Approximate model of data set variation
    ◦ Keep only the top K vectors from U
(Figure: known items [M. Scholz, 2006])

What to select?
  • Items that differ from those previously seen
  • Principal Components Model
    ◦ Approximate model of data set variation
    ◦ Keep only the top K vectors from U
    ◦ Select items in D that are difficult to represent with model U
  • Reconstruction error, for x in D
(Equation image not preserved; its labeled terms are the mean of X and the reconstruction of x. Figure label: known items.)
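The reconstruction error on this slide can be sketched as follows. The exact formula did not survive extraction, so the form used here (Euclidean distance between x and its projection onto the top-K basis, taken around the data mean) is an assumption based on the surviving labels. The function names and toy data are illustrative only.

```python
import numpy as np


def fit_pca(X: np.ndarray, k: int):
    """Return (U, mu): top-k principal directions (columns of U) and the mean of X.

    X holds one item per row (n items x d features).
    """
    mu = X.mean(axis=0)
    # SVD of the centered data with items as columns, so the left singular
    # vectors are the principal directions in feature space.
    U_full, _, _ = np.linalg.svd((X - mu).T, full_matrices=False)
    return U_full[:, :k], mu


def reconstruct(x, U, mu):
    """Project x onto the span of U (around mu) and map back to feature space."""
    return U @ (U.T @ (x - mu)) + mu


def reconstruction_error(x, U, mu) -> float:
    """Assumed form: Euclidean norm of x minus its reconstruction."""
    return float(np.linalg.norm(x - reconstruct(x, U, mu)))


rng = np.random.default_rng(0)
# Toy "known items": points that vary only along the first axis.
X_known = np.zeros((20, 3))
X_known[:, 0] = rng.normal(size=20)
U, mu = fit_pca(X_known, k=1)

familiar = np.array([2.0, 0.0, 0.0])  # lies along the modeled direction
novel = np.array([0.0, 5.0, 0.0])     # varies in an unmodeled direction
err_familiar = reconstruction_error(familiar, U, mu)
err_novel = reconstruction_error(novel, U, mu)
print(err_familiar, err_novel)
```

The item that varies in a direction the model has not seen gets the larger error, which is what makes it a candidate for selection.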

Updating model U with new x
  • Redo PCA from scratch: expensive
  • Incrementally update U: fast!
    ◦ U depends only on previous U and new x
    ◦ [Ross et al., 2008]
(Figure: principal components U1 to U5 updated as data items X1 to X5 arrive over successive iterations)
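A minimal sketch of a one-item incremental update, in the spirit of the approach cited on the slide but not a reproduction of it: there is no forgetting factor and no block update, and the function names are hypothetical. It relies on the fact that adding x changes the scatter matrix of the centered data by a single rank-one term, so only a small (r+1) x (r+1) SVD is needed rather than a full recomputation.

```python
import numpy as np


def batch_pca(X):
    """Principal directions, singular values, and mean of the rows of X."""
    mu = X.mean(axis=0)
    U, S, _ = np.linalg.svd((X - mu).T, full_matrices=False)
    return U, S, mu


def incremental_update(U, S, mu, n, x, k):
    """Fold one new item x into a PCA model built from n items; keep top k components."""
    mu_new = (n * mu + x) / (n + 1)
    # With this scaling, the new scatter matrix equals the old one plus b b^T.
    b = np.sqrt(n / (n + 1)) * (x - mu)
    proj = U.T @ b          # part of b already explained by U
    resid = b - U @ proj    # part of b orthogonal to U
    rnorm = np.linalg.norm(resid)
    q = resid / rnorm if rnorm > 1e-12 else np.zeros_like(resid)
    # Small matrix whose SVD rotates the extended basis [U, q] into the new basis.
    r = len(S)
    R = np.zeros((r + 1, r + 1))
    R[:r, :r] = np.diag(S)
    R[:r, r] = proj
    R[r, r] = rnorm
    Ur, Sr, _ = np.linalg.svd(R)
    U_new = np.hstack([U, q[:, None]]) @ Ur
    return U_new[:, :k], Sr[:k], mu_new, n + 1


rng = np.random.default_rng(1)
X = rng.normal(size=(5, 6))          # 5 items, 6 features
U, S, mu = batch_pca(X[:4])          # model of the first 4 items
U_inc, S_inc, mu_inc, n_inc = incremental_update(U, S, mu, 4, X[4], k=4)
_, S_batch, mu_batch = batch_pca(X)  # reference: redo PCA from scratch
print(np.allclose(S_inc, S_batch[:4]), np.allclose(mu_inc, mu_batch))
```

When no components are truncated before the update, the incremental result matches the from-scratch result up to floating-point error. Once the basis is truncated to K components, the update becomes an approximation.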

DEMUD: Discovery through Eigenbasis Modeling of Uninteresting Data
  • Initial ranking of D by PCA-1 reconstruction error
  • Loop:
    ◦ Select most interesting x in D
    ◦ Treat x as “no longer interesting”
    ◦ Update model U to include x
    ◦ Compute score for all x in D using U
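The loop above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the model is refit from scratch on the selected items at each step rather than updated incrementally, and the function names and toy data are hypothetical.

```python
import numpy as np


def fit_model(X, k, tol=1e-10):
    """PCA model of the rows of X: up to k directions with nonzero variance, plus the mean."""
    mu = X.mean(axis=0)
    U, S, _ = np.linalg.svd((X - mu).T, full_matrices=False)
    U = U[:, S > tol]  # drop directions that carry no variance
    return U[:, :k], mu


def scores(X, U, mu):
    """Reconstruction error of every row of X under the model (U, mu)."""
    centered = X - mu
    resid = centered - (centered @ U) @ U.T
    return np.linalg.norm(resid, axis=1)


def demud_sketch(X, k, n_select):
    """Return indices of selected items, most interesting first."""
    remaining = list(range(len(X)))
    selected = []
    # Initial ranking: a one-component PCA model of the full data set.
    U, mu = fit_model(X, 1)
    for _ in range(min(n_select, len(X))):
        s = scores(X[remaining], U, mu)
        pick = remaining.pop(int(np.argmax(s)))  # most interesting item
        selected.append(pick)
        # The pick is now "no longer interesting": fold it into the model.
        # Simplification: refit on all selected items instead of updating U.
        U, mu = fit_model(X[selected], k)
    return selected


# Toy data: 11 items spread along one axis, plus two off-axis items.
X = np.zeros((13, 3))
X[:11, 0] = np.linspace(-5, 5, 11)
X[11] = [0.0, 3.0, 0.0]
X[12] = [0.0, 0.0, 2.0]
selected = demud_sketch(X, k=2, n_select=3)
print(selected)
```

Because each selected item is absorbed into the model, items similar to it score lower on later iterations, which pushes the selections toward a diverse sample of the data set.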

McMurdo selections
  • 1200 features: 100 x 100 RGB, downsampled 5x
  • K=20
Selection #1
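The feature count follows from the window size: a 100 x 100 window downsampled by 5 in each dimension gives 20 x 20 pixels, and 20 x 20 x 3 color channels = 1200 features. A sketch of one way to compute such features is below. The downsampling method (block averaging) and the function name are assumptions, since the slide does not say how the downsampling was done.

```python
import numpy as np


def window_features(window: np.ndarray, factor: int = 5) -> np.ndarray:
    """Downsample an (H, W, 3) RGB window by block averaging, then flatten it."""
    h, w, c = window.shape
    if h % factor or w % factor:
        raise ValueError("window size must be divisible by the downsampling factor")
    # Group pixels into factor x factor blocks and average within each block.
    blocks = window.reshape(h // factor, factor, w // factor, factor, c)
    small = blocks.mean(axis=(1, 3))  # shape (H/factor, W/factor, 3)
    return small.reshape(-1)


window = np.full((100, 100, 3), 7.0)  # placeholder image window
features = window_features(window)
print(features.shape)  # (1200,)
```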

McMurdo selections
  • 1200 features: 100 x 100 RGB, downsampled 5x
  • K=20
Selection #2

McMurdo selections
  • 1200 features: 100 x 100 RGB, downsampled 5x
  • K=20
Selection #10

Explanations
  • Reconstruction residuals
    ◦ Dark: lower intensity than expected
    ◦ Bright: higher intensity than expected
  • Dark area at bottom has small residual: learning happened!
(Figure: original features and residuals for selections 1, 2, and 3)
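A residual explanation can be sketched as the per-feature difference between an item and its reconstruction. The sign convention (positive means brighter than the model expected) follows the slide. The toy four-pixel model and the function name are illustrative assumptions.

```python
import numpy as np


def residual_explanation(x, U, mu):
    """Per-feature residual: positive means higher intensity than the model expected."""
    x_hat = U @ (U.T @ (x - mu)) + mu
    return x - x_hat


def describe(r: float, tol: float = 1e-9) -> str:
    if r > tol:
        return "brighter than expected"
    if r < -tol:
        return "darker than expected"
    return "as expected"


# Toy model over 4 "pixels" that only knows about uniform brightness changes.
mu = np.full(4, 0.5)
U = np.full((4, 1), 0.5)  # unit-length direction (1, 1, 1, 1) / 2
x = np.array([0.5, 0.5, 0.9, 0.1])
residual = residual_explanation(x, U, mu)
for i, r in enumerate(residual):
    print(i, round(float(r), 3), describe(float(r)))
```

Features with near-zero residual are ones the model already represents well. Large residuals point to the parts of the item that made it stand out.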

Semi-supervised learning: Interactive Discovery
  • Guided exploration of large data sets
    ◦ Loop: system chooses item, user reviews item
  • Desiderata
    ◦ Quickly find items of interest, even if rare
    ◦ Don’t miss anything!

Interactive DEMUD
  • Initial ranking of D by PCA-1 reconstruction error
  • Loop:
    ◦ Select most interesting x in D
    ◦ Query user on x: interesting or uninteresting?
    ◦ If uninteresting, update model U
    ◦ Compute score for all x in D using U
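A sketch of the interactive variant, under the same simplifications as a from-scratch refit: the user is simulated by a callback (`is_interesting`, a hypothetical name), and the model is refit on the items marked uninteresting rather than updated incrementally. The key difference from the unsupervised loop is that the model changes only when the user rejects an item.

```python
import numpy as np


def fit_model(X, k, tol=1e-10):
    """PCA model of the rows of X: up to k directions with nonzero variance, plus the mean."""
    mu = X.mean(axis=0)
    U, S, _ = np.linalg.svd((X - mu).T, full_matrices=False)
    return U[:, S > tol][:, :k], mu


def scores(X, U, mu):
    """Reconstruction error of every row of X under the model (U, mu)."""
    centered = X - mu
    return np.linalg.norm(centered - (centered @ U) @ U.T, axis=1)


def interactive_demud_sketch(X, k, n_queries, is_interesting):
    """Return indices the (simulated) user marked interesting, in query order."""
    remaining = list(range(len(X)))
    uninteresting, found = [], []
    U, mu = fit_model(X, 1)  # initial ranking: PCA-1 on the full data set
    for _ in range(min(n_queries, len(X))):
        s = scores(X[remaining], U, mu)
        pick = remaining.pop(int(np.argmax(s)))
        if is_interesting(pick):
            found.append(pick)  # model is left unchanged
        else:
            uninteresting.append(pick)
            # Simplification: refit on all rejected items instead of updating U.
            U, mu = fit_model(X[uninteresting], k)
    return found


# Toy data: items 0..10 lie along one axis; items 11 and 12 are the rare class.
X = np.zeros((13, 3))
X[:11, 0] = np.linspace(-5, 5, 11)
X[11] = [0.0, 3.0, 0.0]
X[12] = [0.0, 0.0, 2.0]
found = interactive_demud_sketch(X, k=2, n_queries=5, is_interesting=lambda i: i >= 11)
print(found)
```

Leaving the model untouched after a positive answer keeps similar items scoring high, so further examples of the interesting class remain near the top of the ranking.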

Alternatives: 1SVM-int
  • Train a one-class SVM to model the interesting class
  • Select most interesting item
(Figure: example items labeled Varied, Smooth, Light, Dark)

Alternatives: 1SVM-unint
  • Train a one-class SVM to model the uninteresting class
  • Select least uninteresting item
(Figure: example items labeled Varied, Smooth, Light, Dark)

Alternatives: 2SVM-both
  • Train a two-class SVM to model both classes
  • Select most interesting item
(Figure: example items labeled Varied, Smooth, Light, Dark)

Alternatives: 2SVM-active
  • Train a two-class SVM to model both classes
  • Select most ambiguous item
(Figure: example items labeled Varied, Smooth, Light, Dark)

Premature Specialization
(A possible danger of training on positives in a discovery setting)

Alternatives: Static Baseline
  • Select by PCA-K ordering
    ◦ Same initial model as DEMUD
  • No feedback

Faces Data Set
  • 40 people, 10 poses each
  • High dimensionality: 10,304
  • Goal: Discover 3 women
    ◦ Data set is mostly men
    ◦ Challenge: disjunction
Data from AT&T Laboratories Cambridge

Faces: Three Women Discovery
(Plot annotations: Perfect, Premature specialization, Random, Delayed start)

Exploration Rate
Number of queries to find one image of each woman

Algorithm         Queries
DEMUD             43
1SVM-unint        50
1SVM-int          164
2SVM-both         124
2SVM-active       124
Static baseline   223

Rare Class Discovery
(Figure panels, each with a DEMUD result overlaid)
  • 6 classes, d=7, n=336, K=4
  • 6 classes, d=9, n=214, K=4
Figures from [He and Carbonell, 09]

CRISM: Magnesite Discovery
  • Magnesite (MgCO3): possible groundwater
  • CRISM data: 0.364 to 3.92 μm, 197 bands
  • Only 17 of 15,400 items match deposit
(Figure: random subset of CRISM data, with MgCO3 marked)
Data from Mars Reconnaissance Orbiter/CRISM

CRISM: Magnesite Discovery
(Plot annotations: Perfect; Ran out of time (3 days!), ~2 mins per query; Random; Delayed start)

ChemCam: Carbonates
  • ChemCam: LIBS instrument on MSL
  • Data set: 60 lab samples + 40 carbonates
    ◦ 6143 features (bands)
    ◦ K (8) chosen to capture 90% variance
(Figure panels: Regular PCA, 1SVM-int, DEMUD. Colored items are carbonates; white are non-carbonates.)
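Choosing K to capture a target fraction of variance can be sketched as below: take the squared singular values of the centered data, accumulate them, and pick the smallest K whose cumulative share reaches the target. The function name and toy data are illustrative; the slide does not describe the exact procedure used.

```python
import numpy as np


def choose_k(X: np.ndarray, target: float = 0.90) -> int:
    """Smallest K whose top-K principal components explain at least `target` of the variance."""
    mu = X.mean(axis=0)
    S = np.linalg.svd(X - mu, compute_uv=False)
    explained = np.cumsum(S**2) / np.sum(S**2)
    # First index where the cumulative share reaches the target, converted to a count.
    return int(np.searchsorted(explained, target) + 1)


# Toy data with variance concentrated on the first axis (squared sums 18, 2, 0.5).
X = np.array([
    [3.0, 0.0, 0.0], [-3.0, 0.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
    [0.0, 0.0, 0.5], [0.0, 0.0, -0.5],
])
k_90 = choose_k(X, 0.90)
k_80 = choose_k(X, 0.80)
print(k_90, k_80)
```

In the toy data the first component explains about 88% of the variance, so a 90% target needs two components while an 80% target needs only one.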

DEMUD: Explanations
  • Top 10 items chosen by DEMUD
(Figure: items ranked by “interestingness”, each shown with its explanation (residuals))

Other Applications
  • Text
    ◦ Long Wavelength Array system log files
    ◦ Detect anomalous system behavior
  • Onboard prioritization
    ◦ Imaging spectrometers
      • Hyperion on EO-1: 256 x 6000 x 242
    ◦ Assign priorities for input to onboard compression: ROI-ICER

Summary
  • Discovery
    ◦ PCA-based model + reconstruction error
  • Explanations
    ◦ Why was it chosen?
  • Interactive discovery
    ◦ Model the uninteresting to avoid it
  • Next challenge: evolving class of interest
Thank you!
Contact: kiri.wagstaff@jpl.nasa.gov