Data Mining: Anomaly Detection
Lecture Notes for Chapter 10
Introduction to Data Mining by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Anomaly/Outlier Detection
- What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data
- Anomalies are usually rare
- Also called outliers

Anomaly/Outlier Detection
- Variants of anomaly/outlier detection problems
  - Given a database D, find all the data points x ∈ D with anomaly scores f(x) greater than some threshold t
  - Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
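The two problem variants above differ only in how points are selected once an anomaly score f(x) is available. A minimal sketch, assuming the scores have already been computed by some scoring function; the function names and the score values are made up for illustration:

```python
def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical anomaly scores f(x) for five data points.
scores = [0.10, 0.90, 0.20, 0.75, 0.05]
print(above_threshold(scores, 0.7))  # [1, 3]
print(top_n(scores, 1))              # [1]
```

The threshold variant needs a meaningful scale for the scores, while the top-n variant only needs their ranking.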

Anomaly/Outlier Detection
- Applications:
  - Credit card fraud detection
  - Telecommunication fraud detection
  - Network intrusion detection
  - Fault detection

Importance of Anomaly Detection: Ozone Depletion History
- In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources:
  http://exploringdata.cqu.edu.au/ozone.html
  http://www.epa.gov/ozone/science/hole/size.html

Anomaly Detection
- Challenges
  - How many outliers are there in the data?
  - Method is unsupervised
    - Validation can be quite challenging (just like for clustering)
  - Finding a needle in a haystack
- Working assumption:
  - There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes
- General steps
  - Build a profile of the “normal” behavior
    - Profile can be patterns or summary statistics for the overall population
  - Use the “normal” profile to detect anomalies
    - Anomalies are observations whose characteristics differ significantly from the normal profile
- Types of anomaly detection schemes
  - Graphical & statistical-based
  - Distance-based
  - Model-based
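The two general steps can be sketched with the simplest possible profile: summary statistics (mean and standard deviation) of a single attribute. This is only an illustration under stated assumptions; the function names, the sample readings, and the cutoff of 3 standard deviations are choices made here, not part of the slides.

```python
import statistics


def build_profile(normal_values):
    """Step 1: summarize "normal" behavior as (mean, sample std deviation)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomaly(x, profile, max_z=3.0):
    """Step 2: flag x if it lies more than max_z std deviations from the mean."""
    mean, std = profile
    if std == 0:
        # Degenerate profile: every normal value is identical.
        return x != mean
    return abs(x - mean) / std > max_z


# Hypothetical sensor readings assumed to represent normal behavior.
profile = build_profile([9.8, 10.1, 10.0, 9.9, 10.2])
print(is_anomaly(10.3, profile))  # False
print(is_anomaly(12.0, profile))  # True
```

Richer profiles (patterns, multivariate statistics) follow the same build-then-compare structure.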

Graphical Approaches
- Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
- Limitations
  - Time consuming
  - Subjective

Statistical Approaches
- Assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
  - Data distribution
  - Parameters of the distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)

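Grubbs’ Test
- Detects outliers in univariate data
- Assumes the data comes from a normal distribution
- Detects one outlier at a time: remove the outlier and repeat
  - H0: There is no outlier in the data
  - HA: There is at least one outlier
- Grubbs’ test statistic: $G = \max_i |x_i - \bar{x}| / s$, where $\bar{x}$ is the sample mean and $s$ the sample standard deviation
- Reject H0 (two-sided test at significance level $\alpha$, with $N$ data points) if: $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N-2+t^2_{\alpha/(2N),\,N-2}}}$, where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t distribution with $N-2$ degrees of freedom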
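A minimal sketch of one round of Grubbs' test, using the standard statistic G = max |x_i - mean| / s and the usual two-sided rejection threshold. The function names are hypothetical, and the t critical value (upper alpha/(2N) point, N-2 degrees of freedom) is assumed to be supplied by the caller, for example from a statistical table, because the Python standard library has no t distribution.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index of the most extreme value), G = max |x_i - mean| / s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / s, idx


def grubbs_critical_value(n, t_crit):
    """Threshold for G given the caller-supplied t critical value t_crit."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


def has_outlier(values, t_crit):
    """One round of the test: (reject H0?, index of the suspect value)."""
    g, idx = grubbs_statistic(values)
    return g > grubbs_critical_value(len(values), t_crit), idx


# Made-up measurements; the last value is the suspect.
g, suspect = grubbs_statistic([2.0, 2.1, 1.9, 2.0, 5.0])
print(suspect)  # 4
```

To follow the "detect one, remove, repeat" procedure, the caller would drop the flagged value and rerun the test with the critical value for the reduced N.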

Limitations of Statistical Approaches
- Most of the tests are for a single attribute
- In many cases, the data distribution may not be known
- For high-dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches
- Data is represented as a vector of features
- Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based

Nearest-Neighbor Based Approach
- Approach:
  - Compute the distance between every pair of data points
  - There are various ways to define outliers:
    - Data points for which there are fewer than p neighboring points within a distance D
    - The top n data points whose distance to the kth nearest neighbor is greatest
    - The top n data points whose average distance to the k nearest neighbors is greatest
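The three outlier definitions above can be sketched directly from pairwise distances. This is an illustrative brute-force version, assuming Euclidean distance and a small data set; the function names and sample points are made up, and "within a distance D" is read here as distance <= D.

```python
import math


def knn_distances(points, k):
    """For each point, the sorted distances to its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result


def fewer_than_p_neighbors(points, p, D):
    """Definition 1: points with fewer than p neighbors within distance D."""
    outliers = []
    for i, a in enumerate(points):
        count = sum(1 for j, b in enumerate(points)
                    if j != i and math.dist(a, b) <= D)
        if count < p:
            outliers.append(i)
    return outliers


def top_n_by_kth_distance(points, k, n):
    """Definition 2: top n points by distance to the kth nearest neighbor."""
    scores = [d[k - 1] for d in knn_distances(points, k)]
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


def top_n_by_mean_distance(points, k, n):
    """Definition 3: top n points by average distance to the k nearest neighbors."""
    scores = [sum(d) / k for d in knn_distances(points, k)]
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


# Four clustered points and one distant point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(fewer_than_p_neighbors(points, p=2, D=1.5))  # [4]
print(top_n_by_kth_distance(points, k=2, n=1))     # [4]
```

Computing all pairwise distances costs O(N^2) time, which is the main practical limitation of this approach on large data sets.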

Density-based: LOF Approach
- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
- Outliers are points with the largest LOF values
- In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
[Figure: data set with points p1 and p2 marked]
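A simplified sketch of the LOF idea. It is not the full published LOF algorithm (which uses reachability distances); here the local density of a point is assumed to be the inverse of its mean distance to its k nearest neighbors, and the score is the average ratio of each neighbor's density to the point's own density, so that values well above 1 indicate a point that is sparser than its neighborhood. Function names and data are made up, and the data is assumed to contain no duplicate points.

```python
import math


def k_neighbors(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def local_density(points, i, k):
    """Inverse of the mean distance from points[i] to its k nearest neighbors."""
    nbrs = k_neighbors(points, i, k)
    return k / sum(math.dist(points[i], points[j]) for j in nbrs)


def simplified_lof(points, k):
    """Average ratio of neighbor density to own density, for every point."""
    dens = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        nbrs = k_neighbors(points, i, k)
        scores.append(sum(dens[j] / dens[i] for j in nbrs) / k)
    return scores


# A tight cluster of four points and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
lof = simplified_lof(points, k=2)
print(max(range(len(points)), key=lambda i: lof[i]))  # 4
```

Because the score is relative to the neighbors' density, a point just outside a very dense cluster can score high even when its absolute nearest-neighbor distance is small, which is the p2 situation described above.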

Clustering-Based
- Basic idea:
  - Cluster the data into groups of different density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between candidate points and non-candidate clusters
    - If candidate points are far from all other non-candidate points, they are outliers
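The candidate-selection and distance steps above can be sketched as follows. The sketch assumes cluster labels have already been produced by some clustering algorithm, and it simplifies "far from all non-candidate points" to "far from the centroid of every non-candidate cluster". The function name, the parameters min_size and max_dist, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict


def cluster_outliers(points, labels, min_size, max_dist):
    """Indices of points in small clusters that are far from all large clusters."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)

    # Clusters with fewer than min_size members hold the candidate outliers.
    small = {lab for lab, members in clusters.items() if len(members) < min_size}

    # Centroids of the remaining (non-candidate) clusters.
    centroids = [
        tuple(sum(coord) / len(members) for coord in zip(*members))
        for lab, members in clusters.items() if lab not in small
    ]

    outliers = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        if lab in small and all(math.dist(p, c) > max_dist for c in centroids):
            outliers.append(i)
    return outliers


# Made-up data: one large cluster (label 0) and two singleton clusters.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (1.5, 1.5)]
labels = [0, 0, 0, 0, 1, 2]
print(cluster_outliers(points, labels, min_size=3, max_dist=3.0))  # [4]
```

The point (1.5, 1.5) forms its own small cluster but lies close to the large cluster, so it remains a candidate that is not reported as an outlier.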