Anomaly detection on user agent strings
Eirini Spyropoulou (1), Jordan Noble (2), Chris Anagnostopoulos (2)
(1) Barclays, (2) Mentat

Outline
1. Background and motivation for this work
2. Problem definition
3. Proposed methodology
4. Experimental results
5. Computational Scalability & Future Work

Background & Motivation

Malicious HTTP traffic
• Malicious activity is analyzed by means of the cyber kill chain
• Malware that is already installed uses the HTTP protocol to communicate with the Command and Control server: to receive updates, send stolen information, or download commands to execute (the Command and Control phase of the cyber kill chain)
• These requests hide in the terabytes of legitimate HTTP traffic of an organisation
• One way to detect such HTTP requests is through the user agent string

What is a user agent string?
The quoted field in the proxy log entry below is the user agent string:
2005-04-13 19:12:24 126 45.0.0.198 200 TCP_MISS 2021 591 GET http news.google.com /news ?imgefp=8kuNBkunk18J&imgurl=images.newsfactor.com/images/id/4456/microsoft_patches_critical_flaws_win.jpg - DIRECT news.google.com image/jpeg "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" PROXIED News/Media - 192.16.170.42 SG-HTTP-Service - none -
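As a rough illustration of where the user agent sits in such a log entry, the sketch below pulls out the first double-quoted field. It assumes the user agent is the only quoted field in the line, which holds for the sample entry but may not hold for every proxy log format. The function name `extract_user_agent` is hypothetical, and the URL query in the sample is shortened.

```python
import re

# Sample entry based on the slide (URL query shortened for readability).
LOG_LINE = (
    '2005-04-13 19:12:24 126 45.0.0.198 200 TCP_MISS 2021 591 GET http '
    'news.google.com /news ?imgefp=8kuNBkunk18J - DIRECT news.google.com '
    'image/jpeg "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
    '.NET CLR 1.1.4322)" PROXIED News/Media - 192.16.170.42 '
    'SG-HTTP-Service - none -'
)

def extract_user_agent(line):
    """Return the contents of the first double-quoted field, or None.

    Assumption: the user agent is the first quoted field in the entry.
    """
    match = re.search(r'"([^"]*)"', line)
    return match.group(1) if match else None

print(extract_user_agent(LOG_LINE))
```

A production parser would more likely read the field order from the log header rather than rely on quoting.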

User agents as indicators of malicious traffic
• User agent strings are hard to spoof, as they are typically created at runtime using version information from various parts of the operating system
• Malware sets the HTTP User-Agent to a constant string

The need for anomaly detection
• Rule-based techniques are highly empirical and can only capture malicious events that are identical to ones that have happened in the past
• For user agent strings in particular, it is impossible to enumerate everything that can go wrong with them
• Malicious events are very rare compared to the volume of traffic in an organisation
• Gathering a sufficient amount of labeled data is not realistic

Problem definition

Variability of user agent strings
Browser user agents
• Internet Explorer: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
• Chrome: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
• Safari: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.4

Variability of user agent strings
Application user agents
• Microsoft Office/11.0 (Windows NT 5.2; Microsoft Office Outlook 11.0.8172; Pro)
• SEP/12.1.5337.5000, MID/{743AA76E-7F26-20AA-73F6F91D50A6454D}, SID/1 SEQ/141001002
• Datek Streamer v6.6.4-8508954355968836270
• JNLP/6.0 javaws/1.6.0_33 (b05) Java/1.8.0_102

Detecting anomalous user agents in enterprise traffic
• Flagging rare user agents as anomalous would result in a lot of false positives
• There would be no way of ranking them

Proposed Methodology

Proposed Methodology
• Because they lack a strict syntax, user agent strings are hard to extract features from
• We represent user agents directly in the space defined by their distances
• Plenty of anomaly detection techniques can then be applied

Known String Distances
• Levenshtein
• Jaccard
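For reference, the two distances named on this slide can be sketched in a few lines. The scaling of Levenshtein by the length of the longer string is an assumption (the slides do not state how it is scaled), and Jaccard is computed on sets of character n-grams with n = 4, as in the comparisons that follow. All function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def scaled_levenshtein(a, b):
    """Assumed scaling: divide by the longer length, giving a value in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def char_ngrams(s, n=4):
    """Set of all contiguous character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard_distance(a, b, n=4):
    """1 - |A & B| / |A | B| on the n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)
```

Both functions treat the user agent as a flat sequence of characters, with no notion of tokens or versions.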

1st Shortcoming
Mozilla/4.0 (compatible; MSlE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; {43EE6E71-5B19424C-9880-28BBBF1360B9}))
• Jaccard on sets of 4-grams: 0.54
• Scaled Levenshtein: 0.45

2nd Shortcoming
Mozilla/4.0 (compatible; MSlE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)
• Jaccard on sets of 4-grams: 0.08
• Scaled Levenshtein: 0.02

A flexible parser for user agent strings
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Java/1.4.2_06
{'additional': [('Java', '1.4.2_06')],
 'device': [('compatible', ''), ('MSIE', '6.0'), ('Windows NT', '5.1')],
 'prefix': ('Mozilla', '4.0'),
 'unparsable': ''}
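The parser itself is not shown in the slides. The sketch below is one possible way to produce a dictionary with the same four keys for the example above. It is an illustration under simple assumptions (a leading name/version prefix, one parenthesised block split on ";", then further name/version tokens), not the authors' implementation, and every name in it is hypothetical.

```python
import re

PRODUCT = re.compile(r"([^\s/()]+)/(\S+)")  # name/version, e.g. Java/1.4.2_06
VERSION = re.compile(r"^\d[\w.]*$")         # e.g. 6.0, 5.1, 1.1.4322

def split_token(token):
    """Split 'Windows NT 5.1' into ('Windows NT', '5.1'); no version gives ''."""
    name, _, last = token.rpartition(" ")
    if name and VERSION.match(last):
        return (name, last)
    return (token, "")

def parse_user_agent(ua):
    result = {"prefix": ("", ""), "device": [], "additional": [], "unparsable": ""}
    rest = ua.strip()
    match = PRODUCT.match(rest)
    if not match:
        # No recognisable prefix: keep the whole string as unparsable.
        result["unparsable"] = rest
        return result
    result["prefix"] = (match.group(1), match.group(2))
    rest = rest[match.end():].strip()
    if rest.startswith("(") and ")" in rest:
        # First parenthesised block holds the ';'-separated device tokens.
        inside, rest = rest[1:].split(")", 1)
        result["device"] = [
            split_token(tok.strip()) for tok in inside.split(";") if tok.strip()
        ]
        rest = rest.strip()
    leftovers = []
    for piece in rest.split():
        match = PRODUCT.fullmatch(piece)
        if match:
            result["additional"].append((match.group(1), match.group(2)))
        else:
            leftovers.append(piece)
    result["unparsable"] = " ".join(leftovers)
    return result

print(parse_user_agent(
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Java/1.4.2_06"
))
```

Anything the rules do not recognise lands in 'unparsable' rather than raising an error, which is what makes this style of parser tolerant of the loose syntax.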

Removal of random looking strings
• We encode every character of each token in the parsed string as 0 for a digit, 1 for a letter and 2 for a non-alphanumeric character
• For every token, we compute the entropy of the probability distribution of all transitions between these classes
• We threshold the entropy based on the cumulative distribution of the values obtained for all tokens in the corpus
• Intuition: random-looking strings will have higher entropy
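The steps above can be sketched as follows. This is a minimal reading of the slide: transitions are taken between the classes of consecutive characters, entropy is measured in bits, and the threshold is an empirical quantile of the corpus entropies. The 0.95 quantile and all function names are assumptions for illustration.

```python
import math
from collections import Counter

def char_class(ch):
    """0 for a digit, 1 for a letter, 2 for anything else."""
    if ch.isdigit():
        return 0
    if ch.isalpha():
        return 1
    return 2

def transition_entropy(token):
    """Shannon entropy (bits) of the class-to-class transitions in a token."""
    classes = [char_class(ch) for ch in token]
    transitions = Counter(zip(classes, classes[1:]))
    total = sum(transitions.values())
    if total == 0:
        return 0.0  # tokens shorter than two characters have no transitions
    return -sum(
        (count / total) * math.log2(count / total)
        for count in transitions.values()
    )

def entropy_threshold(entropies, quantile=0.95):
    """Empirical quantile of the corpus entropies (0.95 is a made-up default)."""
    if not entropies:
        raise ValueError("need at least one entropy value")
    ranked = sorted(entropies)
    index = min(len(ranked) - 1, int(quantile * len(ranked)))
    return ranked[index]

for token in ["Windows", "MSIE", "43EE6E71"]:
    print(token, transition_entropy(token))
```

A token made only of letters has a single transition type and therefore zero entropy, while an identifier that mixes digits and letters spreads its mass over several transition types.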

Our proposed distance
• Distance between token-value tuples
• Distance between user agent strings

Experimental Evaluation

Anomaly detection
Given the matrix of all pairwise distances:
• 1-KNN: the score is the distance to the first nearest neighbor
• One-class SVM: the score is the distance to the decision boundary
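The 1-KNN score is simple enough to sketch directly. The first function works on a precomputed square distance matrix; the second scores unseen strings against a set of normal strings. The `stand_in_distance` function is only a placeholder built on Python's `difflib` so that the example runs; it is not the distance proposed in this talk.

```python
from difflib import SequenceMatcher

def stand_in_distance(a, b):
    """Placeholder distance in [0, 1]; NOT the proposed user agent distance."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def one_nn_scores(matrix):
    """Score item i by its smallest distance to any other item j != i.
    Expects a square matrix with at least two rows."""
    return [
        min(d for j, d in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    ]

def one_nn_scores_against(train, test, distance):
    """Score each test item by its distance to the closest training item."""
    return [min(distance(item, ref) for ref in train) for item in test]

normal = ["Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
          "Mozilla/5.0 (Windows NT 6.1) Chrome/20.0.1132.57"]
unseen = ["Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", "xyz-bot"]
for ua, score in zip(unseen, one_nn_scores_against(normal, unseen, stand_in_distance)):
    print(round(score, 3), ua)
```

Higher scores mean the string is further from anything seen in normal traffic, which gives the ranking that plain rarity counts cannot provide.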

Experimental setup
• How well can we predict malicious user agent strings using data only from normal traffic?
• We trained anomaly detection models on normal traffic from Bluecoat logs and tested on the union of a held-out set and honeypot data
• AUC scores using 10-fold cross validation

                      Size of raw data   Distinct user agent strings
Bluecoat proxy data   7,943,657          2,760
Honeypot data         994,693            4,048
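For readers unfamiliar with the metric, AUC can be computed directly from anomaly scores as the probability that a randomly chosen malicious item receives a higher score than a randomly chosen normal one, with ties counted as one half. The sketch below uses that pairwise definition; the function name and the toy scores are illustrative, not results from the experiments.

```python
def auc(scores_normal, scores_malicious):
    """Fraction of (malicious, normal) pairs where the malicious item
    scores higher; ties count as half a win."""
    wins = 0.0
    for m in scores_malicious:
        for n in scores_normal:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(scores_malicious) * len(scores_normal))

# Toy anomaly scores, made up for the example.
held_out_normal = [0.02, 0.10, 0.15, 0.40]
honeypot = [0.35, 0.60, 0.90]
print(auc(held_out_normal, honeypot))
```

An AUC of 1.0 means every malicious item outranks every normal one; 0.5 corresponds to a random ranking.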

AUC curves

Computational Scalability and Future Work

Scaling up the current method
Computing all pairwise distances is a quadratic problem.
For the case of KNN:
• Directly mining outliers [Bay et al. SIGKDD 2003]
  – Randomises the data into buckets
  – Uses only the first bucket to get a good estimate of the pruning threshold, which is then used to prune non-outliers
  – Works well for non-uniform data
  – Theoretical time is still quadratic; practical time is polynomial with a smaller exponent
  – Works for any k and any number of outliers
• BK-trees [Burkhard et al. CACM 1973]
  – Efficient data structure that stores all strings based on their relative distances
  – Can be built in O(n log n) time, and retrieval of all nearest neighbors is also O(n log n)
  – Has to fit in memory
For the case of SVMs:
  – Standard string kernels (such as the p-spectrum kernel) could be used, and efficient solvers for this case investigated
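To make the BK-tree option concrete, here is a minimal sketch of the data structure. It assumes an integer-valued metric and uses plain Levenshtein distance for that purpose; whether the proposed user agent distance can be used this way is not stated in the slides. The class and method names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

class BKTree:
    """Each node is (word, children), where children maps an integer
    distance to the child subtree at exactly that distance."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, word) pairs within radius of query."""
        matches = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= radius:
                matches.append((d, word))
            # Triangle inequality: a match can only sit in a subtree whose
            # edge label lies in [d - radius, d + radius].
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(matches)

tree = BKTree(levenshtein)
for word in ["book", "books", "cake", "boo", "cape", "cart"]:
    tree.add(word)
print(tree.search("book", 1))
```

The pruning step is what avoids comparing the query against every stored string, at the cost of holding the whole tree in memory.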

Future work
• Get additional user feedback on our proposed user agent string distance using only enterprise data
• Explore methodologies for finding suspicious look-alikes
• Explore host-based methods

Thank you!
