Unsupervised Intrusion Detection Using Clustering Approach
Muhammet Kabukçu, Sefa Kılıç, Ferhat Kutlu, Teoman Toraman
1/29
Outline
Ø Introduction
Ø Using Clustering for Intrusion Detection
Ø Methodology
Ø Overall Summary
Ø Conclusion
Ø References
2/29
Introduction
• Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for signs of possible incidents.
• Incidents are violations or imminent threats of violation of:
  * computer security policies,
  * acceptable use policies,
  * standard security practices.
3/29
Introduction
• An intrusion detection system (IDS) is software that automates the intrusion detection process.
• IDSs focus primarily on identifying possible incidents and on detecting when an attacker has successfully compromised a system by exploiting a vulnerability in it.
4/29
Introduction
Methodologies of IDS Technologies
Ø Signature-Based Detection
Ø Anomaly-Based Detection
Ø Stateful Protocol Analysis
5/29
Signature-Based Detection
Ø A signature is a pattern that corresponds to a known threat (e.g., a telnet attempt with a username of "root", which is a violation of an organization's security policy).
Ø Signature-based detection is the process of comparing signatures against observed events to identify possible incidents.
Advantage: Very effective at detecting known threats.
Disadvantage: Ineffective at detecting previously unknown threats.
6/29
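A minimal sketch of the comparison step, using the telnet/"root" example from the slide. The event fields (`service`, `username`) and the dictionary representation are assumptions made for illustration; real IDS products use their own event and rule formats.

```python
def matches_signature(event, signature):
    """True if every field named in the signature has the same value in the event."""
    return all(event.get(field) == value for field, value in signature.items())

# Hypothetical signature: a telnet attempt with a username of "root".
ROOT_TELNET = {"service": "telnet", "username": "root"}

# Hypothetical observed events.
events = [
    {"service": "telnet", "username": "alice"},
    {"service": "telnet", "username": "root"},
    {"service": "ssh", "username": "root"},
]

alerts = [e for e in events if matches_signature(e, ROOT_TELNET)]
```

Only the second event is flagged. An attack that matches no stored signature passes unnoticed, which is the disadvantage listed on the slide.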
Anomaly-Based Detection
Ø The process of comparing definitions of what activity is considered normal against observed events to identify significant deviations.
Ø Capable of detecting previously unknown threats.
Ø Uses host- or network-specific profiles.
7/29
Detection by Stateful Protocol Analysis
Ø The process of comparing predetermined profiles of generally accepted definitions of benign protocol activity for each protocol state against observed events to identify deviations.
Ø Relies on vendor-developed universal profiles that specify how particular protocols should and should not be used.
8/29
Using Clustering for Intrusion Detection
Ø Methods other than signature-based detection use data mining and machine learning algorithms to train on labeled network data.
Ø For training data, there are two major paradigms: misuse detection and anomaly detection.
Which one to use?
9/29
Using Clustering for Intrusion Detection - Misuse Detection
Ø In misuse detection, machine learning algorithms are trained on labeled data.
Ø Network data is classified using features extracted from the labeled network traffic.
Ø When new data containing new types of attacks becomes available, the detection models are retrained.
10/29
Using Clustering for Intrusion Detection - Anomaly Detection
Ø In anomaly detection, models are built by training on normal data, and deviations from the normal model are then searched for.
Ø Generating purely normal data is very difficult and costly in practice.
Ø It is very hard to guarantee that no attacks occur while the traffic is being collected from the network.
11/29
Using Clustering for Intrusion Detection
Misuse Detection, Anomaly Detection
Ø Use a mechanism that detects intrusions by training on unlabeled data.
Ø Find the intrusions buried within that data.
12/29
Using Clustering for Intrusion Detection
[Flow diagram: A Set of Unlabeled Data → Unsupervised Anomaly Detection Algorithm → Detected Intrusion Clusters → Connection Comparison with Detected Clusters → Detected malicious attacks]
Assumptions for the unsupervised anomaly detection algorithm:
1. The intrusions are rare with respect to normal network traffic.
2. The intrusions are different from normal network traffic.
As a result: the intrusions will appear as outliers in the data.
13/29
Using Clustering for Intrusion Detection
Ø The unsupervised anomaly detection algorithm groups the unlabeled data instances into clusters using a simple distance-based metric.
14/29
Using Clustering for Intrusion Detection
Once the data is clustered, all of the instances that appear in small clusters are labeled as anomalies because:
Ø The normal instances should form large clusters compared to the intrusions.
Ø Malicious intrusions and normal instances are qualitatively different, so they do not fall into the same cluster.
[Figure labels: Intrusion cluster, Normal cluster]
15/29
Methodology
1. Description of the dataset
2. Metric & Normalization
3. Clustering Algorithm
   a) Portnoy et al.
   b) Y-means Algorithm
4. Labeling Clusters
5. Intrusion Detection
16/29
Description of the dataset
• KDD Cup 1999 Data
• Main attack categories
  – DOS: Denial of Service (e.g., SYN flood)
  – R2L: Unauthorized access from a remote machine (e.g., guessing a password)
  – U2R: Unauthorized access to local superuser (root) privileges (e.g., various buffer overflow attacks)
  – Probing: Surveillance and other probing (e.g., port scanning)
• In total, 24 attack types in the training data; 14 additional ones in the test data.
17/29
Metric & Normalization
• Euclidean metric (for distance computation)
• Feature normalization (to eliminate the difference in the scale of features)
18/29
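A short sketch of why normalization matters before computing Euclidean distances. The slide does not say which normalization is used; the z-score form below (subtract the feature mean, divide by the feature standard deviation) is an assumption, and the two features in the sample records are hypothetical.

```python
import math

def standardize(rows):
    """Rescale each feature column to zero mean and unit standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(dims)
    ]
    # A constant feature has zero spread; divide by 1 so it maps to 0.0.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(r[j] - means[j]) / stds[j] for j in range(dims)] for r in rows]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical connection records: (duration in seconds, bytes sent).
raw = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
scaled = standardize(raw)
```

On the raw records the byte count dominates the distance because its values are thousands of times larger. After scaling, both features contribute equally.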
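Clustering Algorithm (Portnoy et al.)
[Figure: an instance Xi from the training set and its distances d1, d2, d3 to the existing clusters]
- Start with an empty set of clusters.
- For each instance Xi in the training set, d1, the distance to the nearest cluster, is selected.
- If d1 < W (a predefined threshold value), Xi is assigned to that cluster.
- Otherwise, a new cluster is created and Xi is assigned to it.
19/29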
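The single-pass procedure on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each cluster is represented by the first instance assigned to it (the slide does not say how a cluster's position is defined), and the sample points and width are hypothetical.

```python
import math

def fixed_width_clustering(instances, width):
    """Single pass: join the nearest cluster if closer than `width`, else start a new one."""
    clusters = []  # each cluster: {"center": [...], "members": [...]}
    for x in instances:
        nearest, nearest_dist = None, math.inf
        for c in clusters:
            d = math.dist(x, c["center"])
            if d < nearest_dist:
                nearest, nearest_dist = c, d
        if nearest is not None and nearest_dist < width:
            nearest["members"].append(x)
        else:
            # Assumption: the first instance of a cluster stays its center.
            clusters.append({"center": list(x), "members": [x]})
    return clusters

# Hypothetical, already normalized instances.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (0.2, 0.1), (10.3, 9.9)]
clusters = fixed_width_clustering(points, width=1.0)
sizes = [len(c["members"]) for c in clusters]  # [3, 2]
```

The number of clusters is an output of the procedure, while the result depends on the chosen width W and on the order in which instances arrive.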
Clustering Algorithm (Portnoy et al.)
• Advantage: No need to know the number of clusters in advance.
• Disadvantage: The threshold W must be known in advance, and it may cause instances to be labeled incorrectly in some cases.
• However…
20/29
Clustering Algorithm (Y-means Algorithm)
• 3 main parts:
  1. Assigning instances to k clusters
  2. Splitting clusters
  3. Merging clusters
21/29
Clustering Algorithm (Y-means Algorithm)
1. Assigning instances to k clusters
[Figure: instances of the dataset are assigned to clusters and the cluster centroids are redefined]
k: number of clusters, n: number of instances, 1 < k < n
22/29
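The assign-and-redefine loop of this step can be sketched as a plain k-means iteration. Several details here are assumptions that the slide does not state: the initial centroids are passed in by the caller, an empty cluster keeps its old centroid, and the loop stops when the centroids no longer change.

```python
import math

def assign_to_clusters(instances, centroids, max_iter=100):
    """Alternate between nearest-centroid assignment and centroid update."""
    centroids = [list(c) for c in centroids]
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for x in instances:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(x, centroids[i]),
            )
            groups[nearest].append(x)
        updated = []
        for old, members in zip(centroids, groups):
            if members:
                # Redefine the centroid as the mean of the assigned instances.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # assumption: keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups

# Hypothetical instances and initial centroids, with k = 2 and n = 4.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids, groups = assign_to_clusters(data, [(0.0, 0.0), (10.0, 10.0)])
```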
Clustering Algorithm (Y-means Algorithm)
2. Splitting clusters
t (normal threshold) = 2.32σ, where σ = standard deviation
[Figure: a cluster with its confident area of radius t and an instance Xi at distance di]
• If di > t, Xi is an outlier.
• New clusters are created first from the farthest outliers.
23/29
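A sketch of the outlier test used for splitting. The slide gives t = 2.32σ but does not say how σ is computed; the code assumes σ is the root mean squared distance of a cluster's members to its centroid, and the sample cluster is hypothetical.

```python
import math

CONFIDENT_FACTOR = 2.32  # t = 2.32 * sigma, as on the slide

def find_outliers(members, centroid):
    """Return the threshold t and the members outside the confident area, farthest first."""
    dists = [math.dist(x, centroid) for x in members]
    # Assumption: sigma is the RMS distance of the members to the centroid.
    sigma = math.sqrt(sum(d * d for d in dists) / len(dists))
    t = CONFIDENT_FACTOR * sigma
    outliers = [(d, x) for d, x in zip(dists, members) if d > t]
    outliers.sort(key=lambda pair: pair[0], reverse=True)
    return t, [x for _, x in outliers]

# Hypothetical cluster: nine instances close together and one far away.
members = [(1.0, 0.0)] * 9 + [(10.0, 0.0)]
centroid = [sum(col) / len(members) for col in zip(*members)]
t, outliers = find_outliers(members, centroid)
```

Here the far instance lies beyond t and would seed a new cluster, after which the assignment step is run again.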
Clustering Algorithm (Y-means Algorithm)
3. Merging clusters
If an instance Xi is in the confident area of two clusters, merge these clusters back.
24/29
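The merge condition can be sketched as an overlap test on the confident areas. The cluster representation (a `centroid`, a `radius` standing for the threshold t, and a `members` list) and the sample values are assumptions made for illustration.

```python
import math

def in_confident_area(x, cluster):
    """True if x lies within the cluster's confident radius."""
    return math.dist(x, cluster["centroid"]) <= cluster["radius"]

def should_merge(cluster_a, cluster_b):
    """True if some instance lies inside the confident area of both clusters."""
    candidates = cluster_a["members"] + cluster_b["members"]
    return any(
        in_confident_area(x, cluster_a) and in_confident_area(x, cluster_b)
        for x in candidates
    )

# Hypothetical clusters; "radius" plays the role of t = 2.32 * sigma.
a = {"centroid": (0.0, 0.0), "radius": 3.0, "members": [(0.0, 0.0), (2.0, 0.0)]}
b = {"centroid": (4.0, 0.0), "radius": 3.0, "members": [(4.0, 0.0), (3.0, 0.0)]}
c = {"centroid": (20.0, 0.0), "radius": 3.0, "members": [(20.0, 0.0)]}

merge_ab = should_merge(a, b)  # (2.0, 0.0) lies within 3.0 of both centroids
merge_ac = should_merge(a, c)  # no instance is shared
```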
Labeling Clusters
• Our first assumption: # of normal instances >> # of intrusions
• Label instances in large clusters: normal
• Label instances in small clusters: intrusion
• Starting from the largest cluster, label clusters as normal until 99% of the data is labeled as normal; label the rest as intrusion.
[Figure labels: Normal cluster, Intrusion cluster]
25/29
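The labeling rule can be sketched as a walk over the clusters from largest to smallest. One detail is an assumption: the cluster that crosses the 99% mark is still labeled normal, since the slide does not say how that boundary case is handled. The cluster sizes are hypothetical.

```python
def label_clusters(cluster_sizes, normal_fraction=0.99):
    """Label the largest clusters normal until the target share of data is covered."""
    total = sum(cluster_sizes)
    order = sorted(
        range(len(cluster_sizes)),
        key=lambda i: cluster_sizes[i],
        reverse=True,
    )
    labels = [None] * len(cluster_sizes)
    covered = 0
    for i in order:
        if covered < normal_fraction * total:
            labels[i] = "normal"
            covered += cluster_sizes[i]
        else:
            labels[i] = "intrusion"
    return labels

# Hypothetical cluster sizes, 1003 instances in total.
labels = label_clusters([600, 5, 395, 3])
# ["normal", "intrusion", "normal", "intrusion"]
```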
Intrusion Detection
For a test instance x:
Ø Measure the distance to each cluster.
Ø Select the nearest cluster C.
Ø If C is a normal cluster, label x as normal.
Ø Otherwise, label x as intrusion.
26/29
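The detection step amounts to a nearest-cluster lookup, sketched below. It assumes that each cluster is summarized by a centroid and a label, and that x has been normalized in the same way as the training data; the cluster values are hypothetical.

```python
import math

def classify(x, clusters):
    """Return the label of the cluster whose centroid is nearest to x."""
    nearest = min(clusters, key=lambda c: math.dist(x, c["centroid"]))
    return nearest["label"]

# Hypothetical labeled clusters produced by the previous steps.
clusters = [
    {"centroid": (0.0, 0.0), "label": "normal"},
    {"centroid": (8.0, 8.0), "label": "intrusion"},
]

first = classify((0.5, -0.2), clusters)   # "normal"
second = classify((7.0, 9.0), clusters)   # "intrusion"
```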
Overall Summary
• IDS & IDS Technologies
• Using Clustering for Intrusion Detection
• Methodology
  1. Description of the dataset
  2. Metric & Normalization
  3. Clustering Algorithm
  4. Labeling Clusters
  5. Intrusion Detection
Conclusion
• Unsupervised clustering is chosen.
• KDD Cup 1999 Data
• The Y-means algorithm is used for creating the ID system.
27/29
References
[1] KDD Cup 1999 data. http://kdd.ics.uci.edu/databases/kddcup99.html.
[2] Y. Guan and A. A. Ghorbani. Y-means: A clustering method for intrusion detection. In Proceedings of Canadian Conference on Electrical and Computer Engineering, pages 1083-1086, 2003.
[3] L. Portnoy, E. Eskin, and S. Stolfo. Intrusion detection with unlabeled data using clustering. In Proceedings of ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001), 2001.
[4] K. Scarfone and P. Mell. Guide to intrusion detection and prevention systems (IDPS), 2007.
28/29
Questions? 29/29