Data Mining: Anomaly Detection
Lecture Notes for Chapter 10, Introduction to Data Mining, by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Anomaly/Outlier Detection
- What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data
- Anomalies are usually rare
- Also called outliers

Anomaly/Outlier Detection
- Variants of Anomaly/Outlier Detection Problems
  - Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  - Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
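
The two problem variants differ only in how the anomaly scores f(x) are cut off. A minimal sketch, assuming the scores have already been computed and stored in a dictionary (the names `above_threshold`, `top_n`, and the sample scores are hypothetical, chosen for illustration):

```python
from typing import Dict, List

def above_threshold(scores: Dict[str, float], t: float) -> List[str]:
    """Variant 1: ids of all points whose anomaly score is greater than t."""
    return [pid for pid, s in scores.items() if s > t]

def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Variant 2: ids of the n points with the largest anomaly scores."""
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)[:n]

# Hypothetical anomaly scores f(x) for four points.
scores = {"a": 0.1, "b": 0.9, "c": 0.4, "d": 0.7}
print(above_threshold(scores, 0.5))
print(top_n(scores, 2))
```

The threshold variant can return any number of points (including none), while the top-n variant always returns n points even if none of them is particularly unusual.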

Anomaly/Outlier Detection
- Applications:
  - Credit card fraud detection
  - Telecommunication fraud detection
  - Network intrusion detection
  - Fault detection

Importance of Anomaly Detection: Ozone Depletion History
- In 1985, three researchers (Farman, Gardiner, and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources:
- http://exploringdata.cqu.edu.au/ozone.html
- http://www.epa.gov/ozone/science/hole/size.html

Anomaly Detection
- Challenges
  - How many outliers are there in the data?
  - Method is unsupervised
    - Validation can be quite challenging (just like for clustering)
  - Finding a needle in a haystack
- Working assumption:
  - There are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data

Anomaly Detection Schemes
- General Steps
  - Build a profile of the "normal" behavior
    - Profile can be patterns or summary statistics for the overall population
  - Use the "normal" profile to detect anomalies
    - Anomalies are observations whose characteristics differ significantly from the normal profile
- Types of anomaly detection schemes
  - Graphical & Statistical-based
  - Distance-based
  - Model-based
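
The two general steps can be sketched for a single numeric attribute. This is only an illustration: it assumes the profile consists of two summary statistics (mean and sample standard deviation) and that "differs significantly" means more than `max_z` standard deviations from the mean. The function names, the cutoff of 3, and the sample values are all assumptions, not part of the slides.

```python
import statistics

def build_profile(normal_values):
    """Step 1: summarize 'normal' behavior as (mean, sample standard deviation)."""
    return statistics.fmean(normal_values), statistics.stdev(normal_values)

def is_anomalous(x, profile, max_z=3.0):
    """Step 2: flag x if it lies more than max_z standard deviations from the mean."""
    mean, std = profile
    return abs(x - mean) / std > max_z

# Hypothetical observations assumed to be mostly normal.
profile = build_profile([10, 11, 9, 10, 10, 11, 9, 10])
print(is_anomalous(10.5, profile))
print(is_anomalous(30, profile))
```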

Graphical Approaches
- Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
- Limitations
  - Time consuming
  - Subjective
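
A boxplot is usually read by eye, but the rule it commonly draws can be written down. The sketch below assumes the widespread convention that points more than 1.5 times the interquartile range beyond the quartiles are plotted as outliers. That convention, the function name, and the sample data are assumptions; the slides do not specify a rule.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return values lying beyond whisker * IQR from the first or third quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical 1-D data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_outliers(data))
```

Automating the rule removes some of the subjectivity noted above, but the choice of the whisker multiplier remains a judgment call.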

Statistical Approaches
- Assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
  - Data distribution
  - Parameters of the distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)
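
Grubbs' Test
- Detects outliers in univariate data
- Assumes the data come from a normal distribution
- Detects one outlier at a time: remove the outlier, and repeat
  - H0: There is no outlier in the data
  - HA: There is at least one outlier
- Grubbs' test statistic: $G = \dfrac{\max_i |x_i - \bar{x}|}{s}$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation
- Reject H0 if: $G > \dfrac{N-1}{\sqrt{N}} \sqrt{\dfrac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$, where $N$ is the number of data points and $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t distribution with $N-2$ degrees of freedom at significance level $\alpha/(2N)$ (two-sided form of the test)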
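
The statistic itself is easy to compute. The sketch below evaluates the standard Grubbs statistic, the largest absolute deviation from the sample mean divided by the sample standard deviation, and reports which point produced it. The critical value needs a t-distribution quantile and is deliberately not computed here; the function name and sample data are hypothetical.

```python
import statistics

def grubbs_statistic(values):
    """Return (G, index) where G = max |x_i - mean| / s and index is the most extreme point."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by N - 1)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / s, idx

# Hypothetical univariate measurements with one suspicious value.
values = [10.0, 10.2, 9.9, 10.1, 10.0, 15.0]
G, idx = grubbs_statistic(values)
print(G, values[idx])
```

To run the full procedure, compare G against the critical value for the current N. If H0 is rejected, remove `values[idx]` and repeat on the remaining data.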

Limitations of Statistical Approaches
- Most of the tests are for a single attribute
- In many cases, the data distribution may not be known
- For high-dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches
- Data is represented as a vector of features
- Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based

Nearest-Neighbor Based Approach
- Approach:
  - Compute the distance between every pair of data points
  - There are various ways to define outliers:
    - Data points for which there are fewer than p neighboring points within a distance D
    - The top n data points whose distance to the kth nearest neighbor is greatest
    - The top n data points whose average distance to the k nearest neighbors is greatest
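
The three outlier definitions can be sketched with a brute-force pairwise distance computation. This assumes Euclidean distance and small data sets (the cost is quadratic in the number of points); the function names and sample points are hypothetical.

```python
import math

def sorted_distances(points, i):
    """Distances from points[i] to every other point, in ascending order."""
    return sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)

def few_neighbors(points, p, D):
    """Definition 1: indices of points with fewer than p neighbors within distance D."""
    return [i for i in range(len(points))
            if sum(d <= D for d in sorted_distances(points, i)) < p]

def kth_nn_scores(points, k):
    """Definition 2: score each point by the distance to its kth nearest neighbor."""
    return [sorted_distances(points, i)[k - 1] for i in range(len(points))]

def avg_knn_scores(points, k):
    """Definition 3: score each point by its average distance to the k nearest neighbors."""
    return [sum(sorted_distances(points, i)[:k]) / k for i in range(len(points))]

def top_n(scores, n):
    """Indices of the n largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

# Hypothetical 2-D data: a tight square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(few_neighbors(points, p=2, D=2.0))
print(top_n(kth_nn_scores(points, k=2), 1))
print(top_n(avg_knn_scores(points, k=2), 1))
```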

Density-based: LOF approach
- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of p's nearest neighbors to the density of p itself, so a point in a sparser region than its neighbors gets an LOF well above 1
- Outliers are points with the largest LOF values
- In the slide's example (points p1 and p2), the NN approach does not consider p2 an outlier, while the LOF approach finds both p1 and p2 as outliers
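
A simplified version of this idea can be sketched in a few lines. It assumes density is defined as the inverse of the average distance to the k nearest neighbors. This is a simplification, not the reachability-distance formulation of the full LOF algorithm. The function names and sample points are hypothetical.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: (math.dist(points[i], points[j]), j))[:k]

def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    avg = sum(math.dist(points[i], points[j]) for j in knn(points, i, k)) / k
    return 1.0 / avg  # fails if all k neighbors are exact duplicates of points[i]

def relative_density_scores(points, k):
    """Average neighbor density divided by own density; values well above 1 suggest outliers."""
    dens = [density(points, i, k) for i in range(len(points))]
    return [sum(dens[j] for j in knn(points, i, k)) / k / dens[i]
            for i in range(len(points))]

# Hypothetical 2-D data: a tight square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = relative_density_scores(points, k=2)
print(scores)
```

Because each score compares a point only with its own neighborhood, a point just outside a dense cluster can score high even when its absolute distance to its neighbors is small.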

Clustering-Based
- Basic idea:
  - Cluster the data into groups of different density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between candidate points and non-candidate clusters
    - If candidate points are far from all other non-candidate points, they are outliers
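
The candidate-then-verify logic can be sketched as follows, assuming cluster labels have already been produced by some clustering algorithm. The parameters `min_size` (what counts as a small cluster) and `far` (what counts as far), the use of the nearest non-candidate point as the distance to a cluster, and the sample data are all assumptions made for illustration.

```python
import math
from collections import Counter

def cluster_based_outliers(points, labels, min_size, far):
    """Return indices of points in small clusters that are far from all non-candidate points."""
    sizes = Counter(labels)
    candidates = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    regular = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    outliers = []
    for i in candidates:
        # Distance to the closest non-candidate point (requires at least one such point).
        nearest = min(math.dist(points[i], points[j]) for j in regular)
        if nearest > far:
            outliers.append(i)
    return outliers

# Hypothetical data: one cluster of four points and two singleton clusters.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (1.5, 1.5)]
labels = [0, 0, 0, 0, 1, 2]
print(cluster_based_outliers(points, labels, min_size=2, far=3.0))
```

The singleton near the large cluster is a candidate but is not reported, because it is close to non-candidate points; only the distant singleton is.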