Novelty Detection OneClass SVM OCSVM 1 Outline Introduction

  • Slides: 36
Download presentation
Novelty Detection & One-Class SVM (OCSVM) 1

Novelty Detection & One-Class SVM (OCSVM) 1

Outline Introduction Quantile Estimation OCSVM – Theory OCSVM – Application to Jet Engines 2

Outline Introduction Quantile Estimation OCSVM – Theory OCSVM – Application to Jet Engines 2

Novelty Detection is An unsupervised learning problem (data unlabeled) About the identification of new

Novelty Detection is An unsupervised learning problem (data unlabeled) About the identification of new or unknown data or signal that a machine learning system is not aware of during training 3

Example 1 “Normal” “Novel” 4

Example 1 “Normal” “Novel” 4

So what’s seems to be the problem? It’s a 2 -Class problem. “Normal vs.

So what’s seems to be the problem? It’s a 2 -Class problem. “Normal vs. “Novel” 5

The Problem is That “All positive examples are alike but each negative example is

The Problem is That “All positive examples are alike but each negative example is negative in its own way”. 6

Example 2 Suppose we want to build a classifier that recognizes web pages about

Example 2 Suppose we want to build a classifier that recognizes web pages about “pickup sticks”. How can we collect a training data? n We can surf the web and pretty easily assemble a sample to be our collection of positive examples. What about negative examples ? n The negative examples are… the rest of the web. That is ~(“pickup sticks web page”) So the negative examples come from an unknown # of negative classes. 7

Applications Many exist n n n n Intrusion detection Fraud detection Fault detection Robotics

Applications Many exist n n n n Intrusion detection Fraud detection Fault detection Robotics Medical diagnosis E-Commerce And more… 8

Possible Approaches Density Estimation: n n Estimate a density based on training data Threshold

Possible Approaches Density Estimation: n n Estimate a density based on training data Threshold the estimated density for test points Quantile Estimation: n n Estimate a quantile of the distribution underlying the training data: for a fixed constant , attempt to find a small set such that Check whether test points are inside or outside 9

Quantile Estimation (QE) A quantile function with respect to is defined as : -

Quantile Estimation (QE) A quantile function with respect to is defined as : - a class of measurable subsets of - a real valued function. denotes the that attains the infimum 10

Quantile Estimation (QE) The empirical quantile function is defined as above where is the

Quantile Estimation (QE) The empirical quantile function is defined as above where is the empirical distribution: - denotes the infimum on the training set. Thus the goal is to estimate that attains the through 11

Quantile Estimation Choosing Intelligently and is important On one hand large class many small

Quantile Estimation Choosing Intelligently and is important On one hand large class many small sets that contain a fraction of the training examples. On the other hand, if we allowed just any set, the chosen set could consist of only the training points poor generalization 12

Complex vs. Simple 13

Complex vs. Simple 13

Support Vector Method for Novelty Detection Bernhard Schölkof, Robert Williams, Alex Smola, John Shawe-Taylor,

Support Vector Method for Novelty Detection Bernhard Schölkof, Robert Williams, Alex Smola, John Shawe-Taylor, John Platt 14

Problem Formulation Suppose we are given a training ample drawn from an underlying distribution

Problem Formulation Suppose we are given a training ample drawn from an underlying distribution We want to a estimate a “simple” subset such that for a test point drawn from the distribution , We approach the problem by trying to estimate a function which is positive on and negative on the complement 15

The SV Approach to QE The class is defined as the set of halfspaces

The SV Approach to QE The class is defined as the set of halfspaces in a feature space (via kernel ) Here we define , where are respectively a weight vector and an offset parameterizing a hyperplane in 16

“Hey, Just a second” If we use hyperplanes & offsets, doesn’t it mean we

“Hey, Just a second” If we use hyperplanes & offsets, doesn’t it mean we separate the “positive” sample? But, separate from what? From the Origin 17

OCSVM 18

OCSVM 18

OCSVM Serves as a penalizer like “C” in the 2 -class svm (recall thatthe

OCSVM Serves as a penalizer like “C” in the 2 -class svm (recall thatthe data set) from the origin To separate we solve the following quadric program: subject to Notice that no “y”s are incorporated in the constraint since there are no labels 19

OCSVM The decision is therefore: Since the slack variables are penalized in the objective

OCSVM The decision is therefore: Since the slack variables are penalized in the objective function, we can expect that if and solve the problem then will equal 1 for most example in the training set, while still stays small 20

OCSVM Using multipliers we get the Lagrangian, 21

OCSVM Using multipliers we get the Lagrangian, 21

OCSVM Setting the derivatives of yields: w. r. t to 0 1) 2) ,

OCSVM Setting the derivatives of yields: w. r. t to 0 1) 2) , 22

OCSVM Eq. 1 transforms into a kernel expansion: The offset eq can by exploiting

OCSVM Eq. 1 transforms into a kernel expansion: The offset eq can by exploiting Substituting. 1 be & 2 recovered into yields the dual that for any the corresponding problem: pattern satisfies: subject to , 23

- Property Assume the solution of the primal problem satisfies. The following statements hold:

- Property Assume the solution of the primal problem satisfies. The following statements hold: n is an upper bound on the fraction outliers. n is a lower bound on the fraction SVs n With probability 1, asymptotically, equals both the fraction of SVs and the fraction of outliers. (under certain conditions of P(x) and the kernel) 24

X axis – svm magnitude y axis – frequency Results – USPS (“ 0”)

X axis – svm magnitude y axis – frequency Results – USPS (“ 0”) For = 50%, we get: 50% SVs 49% Outliers For = 5%, we get: 6% SVs 4% Outliers 25

OCSM - Shortcomings Implicitly assumes that the “negative” data lies around the origin. Ignores

OCSM - Shortcomings Implicitly assumes that the “negative” data lies around the origin. Ignores completely “negative” data even if such data partially exist. 26

Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra Paul Hyton, Bernhard Schölkof,

Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra Paul Hyton, Bernhard Schölkof, Lionel Tarassenko, Paul Anuzis 27

Intro. Jet engines have pass-off tests before they can be delivered to the customer.

Intro. Jet engines have pass-off tests before they can be delivered to the customer. Through vibration tests an engine’s “vibration signature” can be extracted While normal vibration signatures are common, we may be short of abnormal signatures. Or even worse, the engine under test may show up a type of abnormality which has never been seen before. 28

Feature Selection A vibration gauges are attached to the engine’s case The engine under

Feature Selection A vibration gauges are attached to the engine’s case The engine under test is slowly accelerated from idle to full speed and decelerated back to idle The vibration signal is then recorded The final feature is calculated over a weighted average of the vibration for 10 different speed ranges Thus yielding a 10 -D vector 29

Algorithm Slightly more general than the regular OCSVM In addition to the “normal” data

Algorithm Slightly more general than the regular OCSVM In addition to the “normal” data points we take into account some abnormal points Rather than separating from the origin we separate from the mean of 30

Primal Form subject to and the decision function is 31

Primal Form subject to and the decision function is 31

Dual Form where and subject to , 32

Dual Form where and subject to , 32

2 D Toy Example 33

2 D Toy Example 33

Training Data 99 Normal Engines were used as training data 40 Normal Engines were

Training Data 99 Normal Engines were used as training data 40 Normal Engines were used as validation data 23 Abnormal Engines used as test data 34

Standard OCSM Results 35

Standard OCSM Results 35

Modified OCSM Results 36

Modified OCSM Results 36