Malicious Activity Detection And Identification Using LSH Clustering

Project Students: Amit Elran, Yoav Beeri, Shaked Sagi
Advisors: Prof. Shlomi Dolev, Mr. Mohammad Ghanayim

Introduction
Before we dive in, some basic terms…
• Malware – A form of hostile or intrusive software.
• Classification – The problem of identifying which of a set of categories a new observation belongs to.
• Clustering – The task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters).
http://www.wikipedia.org

Target
Machine learning of:
• Detection and classification of malicious and benign programs.
  – Direct classification.
  – Indirect classification.
• Detection and analysis of certain characteristics and behaviors.

Dataset - API Trace
* Trace - Recorded information about a program’s execution.
Our dataset consists of API traces, each of the form:
Program Name    Trace Data: API System Calls
* N-Gram (or N-Shingle) - A sequence of N consecutive tokens (characters or API calls) that appears in the trace.
Trace data is parsed into a set of N-Grams.
http://www.mmds.org
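As a rough illustration of the parsing step, the sketch below turns a trace into a set of N-Grams. It assumes a trace is already available as a list of API call names; the function name `shingles` and the sample call names are hypothetical and not taken from the dataset.

```python
def shingles(tokens, n):
    """Return the set of n-grams (as tuples) over a sequence of tokens."""
    # A trace shorter than n yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Hypothetical trace: the API call names are illustrative only.
trace = ["CreateFile", "WriteFile", "CloseHandle", "CreateFile", "WriteFile"]
grams = shingles(trace, 2)
```

Because the result is a set, the repeated pair (CreateFile, WriteFile) is stored only once, which is what the later set-similarity steps expect.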

MinHash – Data Representation
MinHash converts large sets to short signatures while preserving similarity.
http://www.mmds.org
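One common way to build such signatures is sketched below; it is not necessarily the project's implementation. It assumes hash functions of the form (a·x + b) mod p with random a and b, and uses `zlib.crc32` to map each shingle to an integer. The names `make_hash_params`, `minhash_signature` and `estimate_similarity` are hypothetical.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, far larger than any crc32 value


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs defining hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, params):
    """Signature = minimum hash value over the set, one entry per hash function.

    Assumes `items` is non-empty.
    """
    # crc32 is deterministic across runs, unlike Python's built-in str hash.
    xs = [zlib.crc32(repr(item).encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in xs) for a, b in params]


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


params = make_hash_params(20)
set_a = {("CreateFile", "WriteFile"), ("WriteFile", "CloseHandle"),
         ("CloseHandle", "ExitProcess")}
set_b = {("CreateFile", "WriteFile"), ("WriteFile", "CloseHandle"),
         ("CloseHandle", "CreateFile")}
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)
estimate = estimate_similarity(sig_a, sig_b)
```

The two example sets share 2 of 4 distinct shingles, so their exact Jaccard similarity is 0.5. With only 20 hash functions the estimate can deviate noticeably from that value.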

LSH – Traces Similarity
Naïvely, we would compute the Jaccard similarity of every pair of traces.
Problem! The number of pairs of traces may be too large!
Solution – LSH (Locality-Sensitive Hashing):
• Divide the Signature Matrix into b bands of r rows each.
• Each band has its own bucket array (clusters) and a hash function that takes vectors of r elements and hashes them to these buckets.
• Any pair hashed to the same bucket, in any band, is considered a candidate pair.
http://www.mmds.org
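The banding steps above can be sketched as follows. This is a minimal version under stated assumptions: each band is used directly as a dictionary key, so two traces share a bucket only when the band matches exactly, and the names `lsh_candidate_pairs` and `candidate_probability` are hypothetical. The probability formula 1 - (1 - s^r)^b is the standard banding result for two sets with Jaccard similarity s.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """signatures: dict mapping trace name -> signature of length bands * rows."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)  # a separate bucket array for each band
        for name, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            # Every pair sharing a bucket in this band becomes a candidate.
            candidates.update(combinations(sorted(names), 2))
    return candidates


def candidate_probability(s, rows, bands):
    """Chance that two sets with Jaccard similarity s become a candidate pair."""
    return 1 - (1 - s ** rows) ** bands


# Toy signatures (illustrative values only): "a" and "b" agree on the first band.
sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
pairs = lsh_candidate_pairs(sigs, bands=2, rows=2)
```

With the 20 MinHash functions listed under Results, b × r = 20 must hold (for example b = 5, r = 4). The slides do not say which split was used.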

Model Work Flow
Trace → Shingling → MinHash → Locality-Sensitive Hashing → Trace classification
• Shingling output: the set of tokens of length N that appear in the trace.
• MinHash output: the Signature Matrix that represents the sets.
• Locality-Sensitive Hashing output: Candidate Pairs, those pairs of signatures that we need to test for similarity.
• Trace classification: by similarity to the clusters' medoids. Medoids are representative objects of a cluster, whose average similarity to all the objects in the cluster is maximal.
http://www.mmds.org
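The final classification step might look like the sketch below. It assumes clusters are lists of shingle sets compared with exact Jaccard similarity, and that a trace receives the label of its most similar medoid when that similarity reaches a threshold. The threshold, the labels, and the function names are assumptions, not details given in the slides.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def medoid(cluster):
    """Member whose average similarity to all cluster members is maximal."""
    return max(cluster,
               key=lambda c: sum(jaccard(c, o) for o in cluster) / len(cluster))


def classify(sample, medoids, threshold):
    """medoids: dict label -> shingle set. Return the best label, or None."""
    best_label, best_sim = None, 0.0
    for label, rep in medoids.items():
        sim = jaccard(sample, rep)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None


# Toy cluster of shingle sets (integers stand in for shingles).
cluster = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 9}]
rep = medoid(cluster)
label = classify({1, 2, 3, 5}, {"malicious": rep}, threshold=0.5)
```

Comparing a new trace against one medoid per cluster keeps the cost proportional to the number of clusters rather than to the size of the training set.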

Results
Execution parameters:
• Shingle size = 12.
• Number of MinHash functions = 20.
[Chart: Classifier Error Rate - Malicious Training Set. Series: Malicious test set (48 trace logs), Benign test set (48 trace logs), Combined (96 trace logs). X axis: training set size (malicious trace logs only): 100, 200, 600, 1300, 1400. Y axis: error rate, 0.00% to 45.00%.]

Questions?