Anomaly Detection in Data Science Oneclass Classification with

- Slides: 34

Anomaly Detection in Data Science One-class Classification with Privileged Information for Malware Detection Pavel Erofeev, IITP RAS, Airbus Group Russia

Find the Panda

Anomaly Detection: Hadlum vs Hadlum ◎ Mrs. Hadlum gave birth to a child 349 days after Mr. Hadlum left for military service ◎ The average human pregnancy lasts 280 days (40 weeks) ◎ Statistically, 349 days is an outlier

“An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.” Hawkins, 1980

Defining Anomaly Detection ◎ Digital representation: vectors describing observations ◎ Mixture of “nominal” and “abnormal” points ◎ Anomaly points are generated by a different generative process than the nominal points

Possible Settings in CS ◎ Supervised (Known attacks) ○ Training data labeled with “nominal” or “anomaly” ◎ Clean (Zero-day attacks) ○ Training data are all “nominal”, test data may be contaminated with “anomaly” ◎ Unsupervised (Unknown attacks) ○ Training data consists of a mixture of “nominal” and “anomaly” points

Real World Data Problems ◎ Data is multivariate ◎ There is usually more than one generating mechanism underlying the “normal” data ◎ Anomalies may represent a different class of objects, so there are many of them ◎ Domain-specific definition of what to count as an anomaly ◎ Normality evolves in time

Anomaly Taxonomy Point Anomaly

Anomaly Taxonomy Contextual Anomaly

Anomaly Taxonomy Causal Anomaly

Taxonomy

Imbalanced classification ■ Normal data - a lot of samples ■ Abnormal - very few ■ Standard methods do not work as expected ■ Standard metrics do not apply
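A minimal sketch of why a standard metric such as accuracy misleads on imbalanced data. The labels below are hypothetical toy data (990 nominal, 10 anomalies), not taken from the talk; the "detector" simply calls everything nominal.

```python
# Hypothetical toy labels: 1 = anomaly, 0 = nominal (990 nominal, 10 anomalies).
y_true = [0] * 990 + [1] * 10
# A trivial "detector" that labels every observation as nominal.
y_pred = [0] * 1000


def accuracy(y_true, y_pred):
    """Fraction of observations labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true anomalies that were actually flagged."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.3f}, anomaly recall={rec:.3f}")
```

The detector scores high accuracy while finding no anomalies at all, which is why recall, precision, or ranking-based metrics on the rare class are usually reported instead.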

Imbalanced classification ◎ Weights for classes ○ Proved not to be helpful in most cases ◎ Resampling methods ○ Oversampling (Bootstrap, SMOTE, etc.) ○ Undersampling ◎ How to choose which method to use? ◎ How to choose the resampling parameter? ○ We compared several methods ○ We proposed a meta-model that on average gives the best results [Papanov, Erofeev, Burnaev, 2015]
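As an illustration of the simplest oversampling variant (bootstrap-style duplication, not SMOTE and not the meta-model from the cited paper), here is a sketch in plain Python. The function name `random_oversample` and the `ratio` argument are hypothetical; `ratio` plays the role of the resampling parameter mentioned above.

```python
import random


def random_oversample(X, y, minority_label, ratio=1.0, seed=0):
    """Append minority samples drawn with replacement until the minority
    class has int(ratio * n_majority) samples (no-op if it already does)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if not minority:
        raise ValueError("no samples with the minority label")
    n_majority = len(y) - len(minority)
    n_extra = max(0, int(ratio * n_majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(X) + extra, list(y) + [minority_label] * n_extra


# Hypothetical toy data: 95 nominal points, 5 anomalies.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_res, y_res = random_oversample(X, y, minority_label=1, ratio=1.0)
print(y_res.count(0), y_res.count(1))
```

Resampling should be applied to the training split only; duplicating anomalies before splitting would leak copies into the test set.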

Statistics-based models ◎ Assumption on the normal data generation procedure (e.g. Gaussian distribution) ◎ PCA is a method commonly used to extract the highest-variance combinations of features in the data ◎ PCA-based anomaly detection is good for highly correlated environments
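One common way to turn PCA into an anomaly detector is to score each point by its reconstruction error after projection onto the leading components. The sketch below assumes that approach (the slides do not say which PCA-based score was used), with hypothetical synthetic data and an assumed 99% quantile threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical nominal data: two strongly correlated features.
t = rng.normal(size=500)
X_train = np.column_stack([t, 2.0 * t + 0.05 * rng.normal(size=500)])

# Fit PCA via SVD of the centered training data; keep the top component.
mean = X_train.mean(axis=0)
_, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = vt[:1]  # shape (1, n_features)


def reconstruction_error(X):
    """Distance between each point and its projection onto the kept components."""
    centered = np.atleast_2d(X) - mean
    projected = centered @ components.T @ components
    return np.linalg.norm(centered - projected, axis=1)


# Assumed cut-off: the 99% quantile of errors on the training data.
threshold = np.quantile(reconstruction_error(X_train), 0.99)

x_nominal = np.array([1.0, 2.0])   # follows the learned correlation
x_anomaly = np.array([1.0, -2.0])  # breaks the learned correlation
print(reconstruction_error(x_nominal)[0] > threshold)
print(reconstruction_error(x_anomaly)[0] > threshold)
```

The anomalous point has unremarkable values in each coordinate taken separately; it is flagged only because it violates the correlation structure, which is the case where PCA-based scores help most.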

Density-based models ◎ SVM-based and nearest-neighbours-based ◎ How to choose the best kernel parameter?
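For the nearest-neighbours family, a simple score is the mean distance to the k nearest training points: large values indicate a low-density region. This is a sketch of that idea only; `knn_score` and the choice `k=5` are assumptions, and the data is synthetic.

```python
import numpy as np


def knn_score(X_train, X_query, k=5):
    """Anomaly score = mean Euclidean distance to the k nearest training points."""
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))           # hypothetical nominal data
X_query = np.array([[0.0, 0.0], [6.0, 6.0]])  # dense region vs. far away
scores = knn_score(X_train, X_query)
print(scores)
```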

One-class SVM with Privileged Information Evgeny Burnaev Dmitry Smolyakov Skoltech, IITP RAS

One-Class SVM

One-Class SVM

One-Class SVM
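The slide figures are not preserved here, so as a stand-in, this is a minimal sketch of fitting a one-class SVM in the "clean" setting with scikit-learn's `OneClassSVM`. The data is synthetic and the values `nu=0.05` and `gamma=0.5` are arbitrary assumptions, not the settings used in the talk.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))           # assumed nominal-only training data
X_test = np.array([[0.1, -0.2], [5.0, 5.0]])  # near the bulk vs. far from it

model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)
labels = model.predict(X_test)            # +1 = inlier, -1 = outlier
scores = model.decision_function(X_test)  # larger = more nominal
print(labels, scores)
```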

One-Class SVM Kernel Trick

Kernel Trick
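To make the kernel trick concrete, the following sketch checks numerically that a degree-2 polynomial kernel equals an inner product in an explicit feature space. The kernel and the feature map `phi` are a textbook example chosen for illustration; the slides may have used a different kernel.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2 in 2-D."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, z)
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(k_implicit, k_explicit)
```

The kernel evaluates the feature-space inner product without ever constructing `phi(x)`, which matters when the feature space is very high-dimensional or, as with the RBF kernel, infinite-dimensional.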

Hyper-parameter Influence
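A rough way to see the influence of the hyper-parameters is to refit the model on a small grid and count how many training points are flagged. The sketch assumes the RBF-kernel one-class SVM from scikit-learn with its `nu` and `gamma` parameters; the grid values and data are hypothetical.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 2))  # hypothetical nominal data


def flagged_fraction(nu, gamma):
    """Return (fraction of training points flagged, number of support vectors)."""
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
    flagged = float(np.mean(model.predict(X_train) == -1))
    return flagged, len(model.support_)


results = {}
for nu in (0.05, 0.5):
    for gamma in (0.1, 10.0):
        results[(nu, gamma)] = flagged_fraction(nu, gamma)
        frac, n_sv = results[(nu, gamma)]
        print(f"nu={nu}, gamma={gamma}: flagged={frac:.2f}, support vectors={n_sv}")
```

Here `nu` roughly controls the share of training points treated as outliers, while `gamma` sets the kernel width and hence how tightly the boundary wraps around the data.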

Decision Functions

Learning with Privileged Info Example: Image classification with textual description

Learning with Privileged Info

Learning with Privileged Info

Learning with Privileged Info

Microsoft Malware Classification Challenge Kaggle competition data (2015)

Problem Description ◎ 9 malware families ○ Ramnit, Lollipop, Kelihos ver 3, Vundo, Simda, Tracur, Kelihos ver 1, Obfuscator.ACY, Gatak ◎ Raw data ○ Hexadecimal representation of the raw binary content ○ Metadata extracted from the binaries, including function calls, strings, etc.

Features ◎ Original features ○ Information from binary files such as ◉ Frequencies of bytes ◉ Number of different N-grams, etc. ◎ Privileged features ○ Information from the disassembled code such as ◉ Frequencies of commands ◉ Number of calls to external DLLs ○ Bytecode as an image ◉ Features based on image texture, which are commonly used for image classification
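The "original" features above can be sketched directly on raw bytes. The helper names `byte_frequencies` and `distinct_ngrams` are hypothetical, and the hex string is a made-up 16-byte sample rather than data from the competition.

```python
from collections import Counter


def byte_frequencies(data: bytes) -> list[float]:
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def distinct_ngrams(data: bytes, n: int = 2) -> int:
    """Number of different byte n-grams occurring in the data."""
    return len({data[i:i + n] for i in range(len(data) - n + 1)})


# Hypothetical sample, parsed from a hexadecimal dump.
sample = bytes.fromhex("4d5a90000300000004000000ffff0000")
freqs = byte_frequencies(sample)
n_bigrams = distinct_ngrams(sample, n=2)
print(freqs[0x00], n_bigrams)
```

Such features need only the binary itself, whereas the privileged features require disassembly and so may be available at training time only.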

Features

Experimental Setup

Results

Thanks! Any questions? pavel.erofeev@phystech.edu