Automatic Analysis of Malware Behavior using Machine Learning

Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz
Presented by: Satyajeet, Dept. of Computer & Information Sciences, University of Delaware
CISC 879 - Machine Learning for Solving Systems Problems

Abstract & Introduction
• Malware
  • Poses a major threat to the security of computer systems.
  • Very diverse: viruses, Internet worms, trojan horses.
  • Amount of malware: millions of hosts infected.
• Obfuscation and polymorphism impede detection at the file level.
• Dynamic analysis helps in characterizing malware and defending against it.

Abstract & Introduction Contd.
• Framework for automatic analysis of malware behavior using machine learning.
• Clustering: the framework automatically identifies novel classes of malware with similar behavior.
• Classification: unknown malware is assigned to these discovered classes.
• An incremental approach combines both for behavior-based analysis.

Automatic Analysis of Malware Behavior
• Framework steps and procedure
  • Malware binaries are executed and monitored in a sandbox environment; a report is generated on the system calls and their arguments.
  • The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.
  • ML techniques are then applied to the embedded reports to identify and classify malware.
  • Incremental analysis progresses by alternating between clustering and classification.

Report representation
• Reports can be textual or XML.
  • Human-readable and suitable for computing general statistics.
  • But not efficient for automatic analysis.
• Hence MIST (Malware Instruction Set)
  • Inspired by instruction sets used in processor design.

MIST
• Category of system calls.
• Operation: reflects a particular system call.
• Arguments, given as argblocks.

[Figure: Sandbox and MIST representation]

Representation
• These sequential reports identify typical malware behavior, e.g. changing registry keys or modifying system files.
• But they are still not suitable for efficient analysis techniques.
• Hence the need to embed behavior reports in a vector space using instruction q-grams.
• This embedding expresses the similarity of behavior geometrically, by calculating distances.
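The q-gram embedding can be illustrated with a minimal sketch. The instruction names below are hypothetical placeholders rather than real MIST instructions, and the binary, unit-length weighting is an assumption made for illustration, not something stated on the slide.

```python
import math

def embed(report, q=2):
    """Map a sequence of instructions to a sparse, unit-length q-gram vector.

    Each distinct q-gram (q consecutive instructions) is one dimension.
    Assumption: a q-gram counts once, however often it occurs.
    """
    grams = {tuple(report[i:i + q]) for i in range(len(report) - q + 1)}
    if not grams:
        return {}
    weight = 1.0 / math.sqrt(len(grams))
    return {g: weight for g in grams}

def distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

# Hypothetical reports: a and b share behavior, c does something unrelated.
a = embed(["open_file", "write_file", "set_registry_key", "close_file"])
b = embed(["open_file", "write_file", "close_file"])
c = embed(["connect", "send", "recv"])

print(distance(a, b) < distance(a, c))  # similar behavior lies closer together
```

With unit-length vectors the distance ranges from 0 (identical q-gram sets) to sqrt(2) (no q-gram in common), so one threshold can be applied to reports of very different lengths.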

Clustering and Classification
• With the reports embedded in a vector space, they are ready for ML techniques.
• Clustering of behavior: classes of malware with similar behavior are identified.
• Classification of behavior: malware is assigned to known classes of behavior.
• What allows us to do this? Malware binaries typically belong to families of similar variants with similar behavior patterns!


Algorithms
• Prototype extraction
  • Extracts a small set of prototypes from the set of reports.
  • The first one is chosen at random.
• Clustering using prototypes
  • Iterative algorithm.
  • At the beginning, each prototype is an individual cluster.
  • The algorithm repeatedly determines and merges the nearest pair of clusters.
• Classification using prototypes
  • Allows learning to discriminate between classes of malware.
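A minimal sketch of prototype extraction and prototype-based clustering follows. The slide only states that the first prototype is random and that nearest clusters are merged; the farthest-first selection rule, the complete-linkage cluster distance, and the thresholds `max_dist` and `merge_dist` are assumptions made for illustration. Reports are represented as plain coordinate tuples.

```python
import math
import random

def extract_prototypes(points, max_dist, seed=0):
    """Pick prototype indices until every point is within max_dist of one.

    The first prototype is random; each further one is the point farthest
    from its nearest prototype (farthest-first rule, an assumption here).
    """
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    prototypes = [first]
    nearest = [math.dist(p, points[first]) for p in points]
    while max(nearest) > max_dist:
        far = max(range(len(points)), key=nearest.__getitem__)
        prototypes.append(far)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[far]))
    return prototypes

def cluster_prototypes(points, prototypes, merge_dist):
    """Agglomerative clustering: start with one cluster per prototype and
    merge the nearest pair until no pair is closer than merge_dist."""
    clusters = [[p] for p in prototypes]

    def linkage(a, b):
        # Complete linkage (assumption): distance of the farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > merge_dist:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
protos = extract_prototypes(points, max_dist=0.5)
print(len(protos), cluster_prototypes(points, protos, merge_dist=1.0))
```

Clustering only the prototypes rather than all reports is what keeps the pairwise distance computation small; the remaining reports can afterwards be attached to the cluster of their nearest prototype.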

Algorithms Contd.
• For each report, the algorithm determines the nearest prototype among the clusters in the training data; if the report lies within the radius, it is assigned to that cluster.
• Otherwise the report is rejected and held back for later incremental analysis.
• Incremental analysis
  • Reports to be analyzed are received from a source.
  • They are initially classified using the prototypes of known clusters.
  • Variants of known malware are thereby identified for further analysis.
  • Prototypes are extracted from the remaining reports and clustered again.
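The classification-with-rejection step of the incremental analysis might look like the sketch below. The names `classify` and `split_batch`, the labels, and the single global `radius` are hypothetical choices for illustration; reports are again plain coordinate tuples.

```python
import math

def classify(point, prototypes, radius):
    """Return the label of the nearest prototype, or None (reject) if that
    prototype is farther away than radius.

    prototypes is a list of (vector, label) pairs from known clusters.
    """
    best = min(prototypes, key=lambda pl: math.dist(point, pl[0]), default=None)
    if best is None or math.dist(point, best[0]) > radius:
        return None
    return best[1]

def split_batch(batch, prototypes, radius):
    """One incremental step: separate variants of known malware from
    reports that must be held back for a new round of clustering."""
    known, rejected = [], []
    for point in batch:
        label = classify(point, prototypes, radius)
        if label is None:
            rejected.append(point)
        else:
            known.append((point, label))
    return known, rejected

# Hypothetical prototypes of two known clusters and one incoming batch.
prototypes = [((0, 0), "family_a"), ((5, 5), "family_b")]
batch = [(0.2, 0.1), (5.1, 4.9), (10, 10)]
known, rejected = split_batch(batch, prototypes, radius=1.0)
print(known)
print(rejected)
```

The rejected reports are the input for the next round: prototypes are extracted from them and clustered, and the resulting prototypes are added to the known set before the next batch arrives.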

Experiments and Results

Evaluating components
• Prototype extraction
  • Precision of 0.99 when the corpus is compressed by 2.9% and 7%.
• Clustering
  • Evaluated using precision, recall and compression.
  • F-measure in the experiments: MIST 1 = 0.93 and MIST 2 = 0.95, better than previous related work (0.881).
• Classification
  • F-measure in the experiments: MIST 1 = 0.96 and MIST 2 = 0.99.
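For reference, the F-measure is the harmonic mean of precision and recall. The sketch below shows one common way to define precision and recall for a clustering against known class labels; whether the experiments use exactly these definitions is an assumption, and the toy labels are made up.

```python
from collections import Counter

def clustering_scores(clusters, labels):
    """Precision, recall and F-measure of a clustering against true labels.

    Assumed definitions: precision counts, per cluster, the reports of its
    majority label; recall counts, per label, the reports that ended up in
    that label's largest cluster. Both are divided by the number of reports.
    """
    n = len(labels)
    if n == 0 or n != len(clusters):
        raise ValueError("need one cluster id and one label per report")
    by_cluster, by_label = {}, {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
        by_label.setdefault(y, Counter())[c] += 1
    precision = sum(max(cnt.values()) for cnt in by_cluster.values()) / n
    recall = sum(max(cnt.values()) for cnt in by_label.values()) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy example: six reports, three clusters, two true families.
print(clustering_scores([0, 0, 0, 1, 1, 2], ["a", "a", "b", "b", "b", "a"]))
```

Under these definitions, precision drops when a cluster mixes families and recall drops when one family is scattered over several clusters, so the F-measure penalizes both kinds of error.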



Conclusion
• A new framework is introduced which overcomes several deficiencies of previous work.
• The framework is learning-based.
• The framework can be implemented in practice.
  • Steps: collect malware, study it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification).
  • This process is efficient and learns automatically after initial setup and run.

Thank you!