Outlier Detection Using k-Nearest Neighbour Graph Ville Hautamäki, Ismo Kärkkäinen and Pasi Fränti Department of Computer Science University of Joensuu, Finland
What is an outlier? An outlier is an observation that deviates so much from the other observations that it arouses suspicion it was generated by a different mechanism.
Motivation Classical problem: noise in measurements must be removed because of its detrimental effect on statistical inference, for example to make clustering more robust. Data mining: a deviating or surprising measurement is itself interesting; in this case we want to detect the outliers. Typical applications: • Breast cancer detection • Intrusion detection in network security systems • Detecting credit card fraud
Taxonomy of outlier detection Distribution-based methods: the classical statistical approach; if the distribution is known, define a discordance test for that distribution. Clustering-based methods: define an observation as an outlier if it does not fit the overall clustering pattern. Distance-based methods: define an observation as an outlier if at least p% of the samples lie more than distance dmin away from it. Density-based methods: an observation is an outlier if the local density around it is low.
k-Nearest Neighbour Graph Definition: every vector in the data set forms one node, and every node has pointers to its k nearest neighbours.
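The definition above can be sketched in pure Python. This is a minimal brute-force O(n²) illustration, not the authors' implementation; the function and variable names are my own.

```python
def knn_graph(data, k):
    """Directed kNN graph: each vector points to its k nearest neighbours.

    data: list of equal-length numeric tuples.
    Returns {node index: list of its k nearest neighbour indices}.
    Brute-force sketch for illustration only.
    """
    def dist(a, b):
        # Euclidean distance between two vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    graph = {}
    for i, v in enumerate(data):
        # Sort all other vectors by distance and keep the k closest.
        others = sorted((j for j in range(len(data)) if j != i),
                        key=lambda j: dist(v, data[j]))
        graph[i] = others[:k]
    return graph
```

With four points, three clustered near the origin and one far away, each clustered point links to the other nearby points while the distant point still gets k outgoing edges (every node does, by construction).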
KDIST: Density-based outlier detection [S. Ramaswamy et al., SIGMOD, Texas, 2000] Define the k-nearest-neighbour distance (KDIST) as the distance to the k-th nearest vector. Vectors are sorted by their KDIST value, and the last n vectors in the list are defined as outliers. The intuitive idea is that when KDIST is large, the vector lies in a sparse area and is likely to be an outlier. Problem: the user has to know in advance how many outliers the data set contains.
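A minimal sketch of the KDIST idea, assuming Euclidean distance and a brute-force neighbour search; names are my own, not from the original paper.

```python
def kdist_outliers(data, k, n):
    """Rank vectors by the distance to their k-th nearest neighbour
    and flag the n vectors with the largest KDIST as outliers.
    Illustrative sketch, not the authors' implementation."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    kdist = []
    for i, v in enumerate(data):
        dists = sorted(dist(v, data[j]) for j in range(len(data)) if j != i)
        kdist.append((dists[k - 1], i))  # distance to the k-th nearest vector
    kdist.sort(reverse=True)             # sparse vectors first
    return [i for _, i in kdist[:n]]
```

Note how n must be supplied up front, which is exactly the problem stated on the slide.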
MeanDIST: Proposed improvement We define MeanDIST as the mean of the k nearest distances. User-supplied parameters of the algorithms: cutting point k in KDIST, local threshold t in MeanDIST.
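One simple reading of MeanDIST, assuming t acts as an absolute threshold on the mean of the k nearest distances; the paper's local-threshold rule may differ in detail, so treat this as a sketch.

```python
def meandist_outliers(data, k, t):
    """Flag vector i as an outlier when the mean of its k nearest
    distances exceeds threshold t. Sketch under the assumption that
    t is an absolute cutoff on the mean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    outliers = []
    for i, v in enumerate(data):
        dists = sorted(dist(v, data[j]) for j in range(len(data)) if j != i)
        if sum(dists[:k]) / k > t:  # mean of the k nearest distances
            outliers.append(i)
    return outliers
```

Averaging over k distances instead of taking the single k-th distance smooths out the effect of one unusually close or far neighbour, which is the intuition behind the robustness claim made later in the slides.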
MkNN: Mutual k-Nearest Neighbour [M. R. Brito et al., Statistics & Probability Letters, 35(1):33-42, August 1997] Mutual k-Nearest Neighbour (MkNN) uses a special case of the kNN graph: there is an undirected edge between two vectors a and b only if they belong to each other's k-nearest neighbourhoods. Connected components are designated as clusters; isolated vectors are outliers.
ODIN: Outlier Detection using Indegree Number (proposed) Definition: given the kNN graph G for data set S, an outlier is a vertex whose indegree is less than or equal to a threshold T.
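The ODIN definition translates almost directly into code: count, for each vector, how many other vectors list it among their k nearest neighbours. A brute-force sketch with assumed names:

```python
def odin_outliers(data, k, T):
    """ODIN sketch: build the directed kNN graph and flag every
    vertex whose indegree (number of vectors that list it as one
    of their k nearest neighbours) is <= T."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    indegree = [0] * len(data)
    for i, v in enumerate(data):
        knn = sorted((j for j in range(len(data)) if j != i),
                     key=lambda j: dist(v, data[j]))[:k]
        for j in knn:
            indegree[j] += 1  # i voted for j as a near neighbour
    return [i for i, d in enumerate(indegree) if d <= T]
```

With T = 0 this reports only vectors that nobody chooses as a neighbour; raising T relaxes the criterion.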
Test data sets We used Receiver Operating Characteristics (ROC) with two error types: False Rejection (FR) and False Acceptance (FA). Half Total Error Rate: HTER = (FR + FA)/2.
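Under one common reading of these error types (FR = fraction of inliers wrongly flagged, FA = fraction of true outliers missed), the metric can be computed as follows; the helper name and exact rate definitions are my assumptions, not stated on the slide.

```python
def hter(true_outliers, detected, n_total):
    """Half Total Error Rate, assuming:
    FR = inliers wrongly flagged / number of inliers,
    FA = true outliers missed / number of true outliers."""
    true_outliers, detected = set(true_outliers), set(detected)
    n_inliers = n_total - len(true_outliers)
    fr = len(detected - true_outliers) / n_inliers        # false rejections
    fa = len(true_outliers - detected) / len(true_outliers)  # false acceptances
    return (fr + fa) / 2
```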
Summary of results HTER error rates (with best k and threshold): • ODIN achieves zero error for HR, NHL 1 and NHL 2. • MeanDIST is better for the larger synthetic set. • For KDD, none of the methods works well.
KDIST vs. MeanDIST Should we use the kNN distance, as in the original paper, or the mean of the distances, as proposed? MeanDIST is more robust as a function of k for synthetic data.
Parameter Stability MeanDIST for the KDD data set: the value of k is not important as long as the threshold is below 0.1. ODIN for the S1 data set: a clear valley of minimum error in the error surface between threshold values 20 and 50.
Conclusions • The proposed method (ODIN) compares well with other similar methods. • The proposed improvement to the RSS method (MeanDIST) outperforms KDIST in one case; in the other cases the improvement is not significant. • With a small number of observations, density estimation methods perform poorly.
Future Work • Improve the performance of ODIN by integrating ideas from density estimation schemes. • Test with more real-world data sets. • Consider the validity of other proximity graphs for the outlier detection task; in particular, study the applicability of the Adaptive Neighbourhood Graph. • Improve parameter robustness in the proposed algorithms.