
Malware Detection using Machine Learning
Nawaf Abudawaood
University of Colorado at Colorado Springs (UCCS)
Project Defense, 5/15/2019

Outline
• Introduction
• Motivation
• Problem Statement
• Related Work
• Malware Binary to an Image
• Project Approach
• Malware Images Dataset
• Image Sizes
• CNN ILSVRC Models
• Project Results
• Conclusion
• References

Introduction
• Malware keeps improving at evading malware detectors, and thousands of new malware samples are created every day.
• Malware can be hidden within code using different techniques.
• Changing a file's signature can make the malware difficult to detect.
• Abnormal behavior can occur after malicious code is executed.
• Reverse engineers can find malware code by analyzing malicious files.
• The intent of malware authors is to destroy or otherwise harm computer systems.

Rapid Increase of Malware
Taken from Kalash et al., 2018

Motivation
• Classifying malware images with CNN models from the ILSVRC has had a great impact on image-based malware detection. These models achieve strong results because of the deep, layered computations they perform.
• CNNs extract features automatically, which saves reverse engineers much of the time needed to analyze malicious code or images and supports accurate predictions when classifying malicious code.
• Convolutional Neural Networks with deep architectures are used in speech recognition, bioinformatics, and computer vision.
• They are especially helpful in malware detection and cyber security, where deep CNN models have been reported to outperform classical classifiers such as the Support Vector Machine.
• CNNs excel at classifying images, which has shifted malware detection away from the classic approach and toward classifying malware from images of the malicious binaries using CNN architectures.
• Binary files are represented as 2-dimensional greyscale images, which allows a CNN model to be trained to classify them.
• Malware images are recognized through their most important patterns: many malware binaries are generated from similar code, so they produce visually similar images.
• Deep CNNs can also substantially reduce image classification error.

Problem Statement
• The problem that we are trying to solve is classifying malware object code into various malware families.
• We have 9,342 malware samples given in the form of images obtained from the object code.
• There are 25 malware families, with the biggest family containing 2,950 samples and the smallest containing 81 samples.
• Our approach involves classifying these images using deep learning models that have performed well in image classification.

Related Work
• Malware detection can use an Artificial Neural Network (ANN) that combines Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM).
• Saxe and Berlin (2015) extract features based on byte entropy.
• Wang et al. (2018) detected malware using a dynamic analysis approach.
• Kolosnjaji et al. (2018) cover n-gram and behavior-based approaches that focus on raw bytes.

Related Work
• Saxe and Berlin (2015) avoid overfitting by using cross-validation, which evaluates the classifier on more than one split of the data.
• Alazab et al. (2011) discuss the danger of zero-day attacks.
• Sheneamer, Roy, and Kalita (2018): "A detection framework for semantic code clones and obfuscated code."

Malware Binary to an Image
Taken from Kalash et al., 2018
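The binary-to-image step can be sketched in a few lines. This is a minimal illustration, not the exact procedure used by Kalash et al.: the function name `bytes_to_grayscale`, the fixed `width` parameter, and the zero padding of the last row are assumptions made for the example. Each byte (0 to 255) becomes one greyscale pixel.

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape raw bytes into a 2-D uint8 array, one byte per pixel.

    `width` is an assumed, fixed image width; the last row is zero-padded.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = -(-len(pixels) // width)          # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(pixels)] = pixels             # copy bytes, keep padding at the end
    return padded.reshape(height, width)

# Stand-in for the contents of a binary file (778 bytes of made-up data).
sample = bytes(range(256)) * 3 + b"\x90" * 10
image = bytes_to_grayscale(sample, width=64)
print(image.shape)  # (13, 64)
```

In practice the bytes would come from reading the executable with `open(path, "rb").read()`, and the resulting array would be resized to the input size the CNN expects.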

Related Work
• Szegedy et al., GoogLeNet (Inception v1): "Going deeper with convolutions." (2015)
• Szegedy et al., Inception v3: "Rethinking the Inception architecture for computer vision." (2015)
• Al-Dujaili et al. (2018) discuss malware represented as binary images, where the location of each pixel reflects how the portable executable (PE) file is encoded into binary.

Project Approach
• We propose a method that uses CNN models from the ILSVRC to classify malware images.
• CNNs extract features automatically.
• We use the Vision Lab dataset with the aim of achieving strong results.
• We compare the results of the ILSVRC models on the same malimg dataset.

Image Sizes

CNN's Importance for Image Classification
• CNNs are well suited to classifying images because the model itself learns to extract the features.
• Convolution layers compress the data as features are extracted, which helps achieve strong results in visual recognition.
• Fully connected layers connect every node to every node in the following layer, which ties the whole network together; this is one reason CNNs are used over non-CNN models.
• CNNs resemble how the brain recognizes things from visual features, such as the eyes and how far apart they are on the face of a human or an animal. CNNs can recognize such features for image classification.
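To make the convolution step concrete, the sketch below slides a small kernel over an image and sums the element-wise products at each position. It is an illustration only: `conv2d_valid` is a hypothetical helper, it uses no padding and a stride of 1, and, like most CNN frameworks, it computes cross-correlation (the kernel is not flipped). In a real CNN the kernel values are learned rather than chosen by hand.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 4x4 image: dark left half, bright right half.
toy = np.array([[0, 0, 1, 1]] * 4, dtype=np.float64)
# Hand-picked kernel that responds where brightness increases left to right.
edge_kernel = np.array([[-1.0, 1.0]])

response = conv2d_valid(toy, edge_kernel)
print(response.shape)  # (4, 3)
print(response[0])     # [0. 1. 0.]
```

The output is smaller than the input and is non-zero only at the vertical edge, which is the sense in which convolution both compresses the data and extracts a feature.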

Simple CNN Architecture
Taken from Wikimedia Commons, 2015

CNN ILSVRC Models
• The selected ILSVRC models were winners or top performers in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), which makes them strong candidates for classifying images.
• VGG-Net 16 (2014)
• Inception V3 (2015)
• ResNet (2015)
• CNN-SVM (2018)

VGG-Net 16
Taken from Simonyan et al., 2014

Inception V3
• Inception v3 replaces the 5 x 5 convolution used in GoogLeNet (Inception v1) with two stacked 3 x 3 convolutions. A 5 x 5 convolution is 25/9 ≈ 2.78 times more expensive than a single 3 x 3 convolution, so the two 3 x 3 convolutions (18 weights instead of 25) cost about 28% less, and the resulting model performs better than the previous models.
Taken from Szegedy et al., 2015
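The arithmetic behind the 2.78 figure can be checked directly. The sketch below counts weights per input/output channel pair, which is proportional to the multiply-adds per output position; the helper name `conv_weights` is made up for this example. Note that 2.78 is the ratio between one 5 x 5 and one 3 x 3 convolution; the factorized pair of 3 x 3 convolutions saves about 28% relative to the 5 x 5.

```python
def conv_weights(kernel_size: int, layers: int = 1) -> int:
    """Weights per input/output channel pair for `layers` stacked square kernels."""
    return layers * kernel_size * kernel_size

single_5x5 = conv_weights(5)            # 25
single_3x3 = conv_weights(3)            # 9
two_3x3 = conv_weights(3, layers=2)     # 18, same 5x5 receptive field

ratio = single_5x5 / single_3x3         # cost of a 5x5 relative to one 3x3
saving = 1 - two_3x3 / single_5x5       # saving from replacing 5x5 with two 3x3

print(f"{ratio:.2f}")   # 2.78
print(f"{saving:.0%}")  # 28%
```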

Inception V3
Taken from Szegedy et al., 2015

Grid Reduction and Auxiliary Classifier
Taken from Szegedy et al., 2015

Inception V3 Model Description
• Inception v3 is 42 layers deep and was the runner-up for image classification in ILSVRC 2015.
• It reduced the error rate while using fewer parameters.
Taken from Szegedy et al., 2015

ResNet-50
Taken from He et al., 2016
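The central idea of the residual networks of He et al. is the skip connection: a block outputs F(x) + x rather than F(x) alone, so the layers only need to learn a correction to the identity. The sketch below is a simplified stand-in, not the ResNet-50 bottleneck block: `residual_block` is a hypothetical helper that uses two plain matrix multiplications in place of convolution layers and omits batch normalization.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Return relu(F(x) + x), where F is two linear maps with a ReLU between them."""
    fx = relu(x @ w1) @ w2      # the learned residual F(x)
    return relu(fx + x)         # skip connection adds the input back

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))

# With all-zero weights F(x) = 0, so the block reduces to relu(x).
y = residual_block(x, zeros, zeros)
print(y)  # [1. 0. 3.]
```

The example with zero weights shows why very deep stacks of such blocks remain trainable: a block that has learned nothing still passes its input through.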

CNN-SVM
Taken from Agarap, 2017

Project Results

Conclusion
• Malware has been a major concern.
• Malware continues to bypass malware detectors.
• Machine learning can be very helpful in improving malware detection.
• A number of ILSVRC models were used.
• These models classify malware images from the Vision Lab dataset.

References
• [1] Sethi, Kamalakanta, et al. "A novel malware analysis for malware detection and classification using machine learning algorithms." Proceedings of the 10th International Conference on Security of Information and Networks. ACM, 2017.
• [2] Sewak, Mohit, Sanjay K. Sahay, and Hemant Rathore. "Comparison of deep learning and the classical machine learning algorithm for the malware detection." 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). IEEE, 2018.
• [3] Kolosnjaji, Bojan, et al. "Deep learning for classification of malware system call sequences." Australasian Joint Conference on Artificial Intelligence. Springer, Cham, 2016.
• [4] Buczak, Anna L., and Erhan Guven. "A survey of data mining and machine learning methods for cyber security intrusion detection." IEEE Communications Surveys & Tutorials 18.2 (2016): 1153-1176.
• [5] Sethi, Kamalakanta, et al. "A Novel Malware Analysis Framework for Malware Detection and Classification using Machine Learning Approach." Proceedings of the 19th International Conference on Distributed Computing and Networking. ACM, 2018.

References
• [6] Cui, Zhihua, et al. "Detection of malicious code variants based on deep learning." IEEE Transactions on Industrial Informatics 14.7 (2018): 3187-3196.
• [7] Sheneamer, Abdullah, Swarup Roy, and Jugal Kalita. "A detection framework for semantic code clones and obfuscated code." Expert Systems with Applications 97 (2018): 405-420.
• [8] Kolosnjaji, Bojan, et al. "Adversarial Malware Binaries: Evading Deep Learning for Malware Detection in Executables." arXiv preprint arXiv:1803.04173 (2018).
• [9] Wang, Yao, Wan‐dong Cai, and Peng‐cheng Wei. "A deep learning approach for detecting malicious JavaScript code." Security and Communication Networks 9.11 (2016): 1520-1534.
• [10] Yan, Jinpei, Yong Qi, and Qifan Rao. "Detecting malware with an ensemble method based on deep neural network." Security and Communication Networks 2018 (2018).

References
• [11] Saxe, Joshua, and Konstantin Berlin. "Deep neural network based malware detection using two dimensional binary program features." Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on. IEEE, 2015.
• [12] Al-Dujaili, Abdullah, et al. "Adversarial deep learning for robust detection of binary encoded malware." 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018.
• [13] Alazab, Mamoun, et al. "Zero-day malware detection based on supervised learning algorithms of API call signatures." Proceedings of the Ninth Australasian Data Mining Conference-Volume 121. Australian Computer Society, Inc., 2011.
• [14] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
• [15] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

References
• [16] Peiravian, Naser, and Xingquan Zhu. "Machine learning for android malware detection using permission and api calls." 2013 IEEE 25th international conference on tools with artificial intelligence. IEEE, 2013.
• [17] Rathore, H., Agarwal, S., Sahay, S. K., & Sewak, M. (2019, April 4). Malware Detection Using Machine Learning and Deep Learning
• [18] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, R. E. Bryant, "Semantics-aware malware detection", Proc. 2005 IEEE Symp. Security Privacy, pp. 32-46, 2005.
• [19] Kalash, Mahmoud, et al. "Malware classification with deep convolutional neural networks." 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS). IEEE, 2018.
• [20] Agarap, Abien Fred. "An architecture combining convolutional neural network (CNN) and support vector machine (SVM) for image classification." arXiv preprint arXiv:1712.03541 (2017).

References
• [21] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
• [22] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).