Anomaly detection on user agent strings, Eirini Spyropoulou
Anomaly detection on user agent strings
Eirini Spyropoulou (1), Jordan Noble (2), Chris Anagnostopoulos (2)
(1) Barclays, (2) Mentat
Outline
1. Background and motivation for this work
2. Problem definition
3. Proposed methodology
4. Experimental results
5. Computational scalability and future work
Background & Motivation
Malicious HTTP traffic
• Malicious activity is analyzed in terms of the cyber kill chain
• Malware that is already installed uses the HTTP protocol to communicate with its Command and Control server: to receive updates, send stolen information, or download commands to execute (the Command and Control phase of the cyber kill chain)
• This traffic hides in the terabytes of legitimate HTTP traffic of an organisation
• One way to detect such HTTP requests is through the user agent string
What is a user agent string?
2005-04-13 19:12:24 126 45.0.0.198 200 TCP_MISS 2021 591 GET http news.google.com /news ?imgefp=8kuNBkunk18J&imgurl=images.newsfactor.com/images/id/4456/microsoft_patches_critical_flaws_win.jpg - DIRECT news.google.com image/jpeg "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" PROXIED News/Media - 192.16.170.42 SG-HTTP-Service - none -
The quoted field near the end of the record is the user agent string.
User agents as indicators of malicious traffic
• User agent strings are hard to spoof, as they are typically assembled at runtime from version information of various parts of the operating system
• Malware sets the HTTP User-Agent header to a constant string
The need for anomaly detection
• Rule-based techniques are highly empirical and can only capture malicious events identical to ones that have happened in the past
• For user agent strings in particular, it is impossible to enumerate everything that can go wrong
• Malicious events are very rare compared to the volume of traffic in an organisation
• Gathering a sufficient amount of labeled data is not realistic
Problem definition
Variability of user agent strings: browser user agents
• Internet Explorer: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
• Chrome: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
• Safari: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.4
Variability of user agent strings: application user agents
• Microsoft Office/11.0 (Windows NT 5.2; Microsoft Office Outlook 11.0.8172; Pro)
• SEP/12.1.5337.5000, MID/{743AA76E-7F26-20AA-73F6F91D50A6454D}, SID/1 SEQ/141001002
• Datek Streamer v6.6.4-8508954355968836270
• JNLP/6.0 javaws/1.6.0_33 (b05) Java/1.8.0_102
Detecting anomalous user agents in enterprise traffic
• Flagging rare user agents as anomalous would produce many false positives
• There would be no way of ranking them
Proposed Methodology
Proposed Methodology
• Because they lack a strict syntax, user agent strings are hard to extract features from
• We represent user agents directly in the space defined by their pairwise distances
• Many anomaly detection techniques can then be applied
Known String Distances
• Levenshtein
• Jaccard
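The two known distances can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the slides do not say how the Levenshtein distance is scaled, so dividing by the longer string length is an assumption, and the function names are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def scaled_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed scaling)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ngrams(s: str, n: int = 4) -> set:
    """Set of all character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard_distance(a: str, b: str, n: int = 4) -> float:
    """One minus the Jaccard similarity of the two n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    if not union:
        return 0.0
    return 1.0 - len(ga & gb) / len(union)
```

Both functions treat the user agent as a flat character sequence, which is why a long random identifier or a one-character look-alike affects them in ways that do not match an analyst's notion of similarity.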
1st Shortcoming
Mozilla/4.0 (compatible; MSlE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; {43EE6E71-5B19424C-9880-28BBBF1360B9}))
Jaccard on sets of 4-grams: 0.54
Scaled Levenshtein: 0.45
2nd Shortcoming
Mozilla/4.0 (compatible; MSlE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6 0; Windows NT 5.2)
Jaccard on sets of 4-grams: 0.08
Scaled Levenshtein: 0.02
A flexible parser for user agent strings
Input: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Java/1.4.2_06
Output: {'additional': [('Java', '1.4.2_06')], 'device': [('compatible', ''), ('MSIE', '6.0'), ('Windows NT', '5.1')], 'prefix': ('Mozilla', '4.0'), 'unparsable': ''}
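A parser producing the structure above could look roughly like the following sketch. The slides do not describe the real parser's rules, so the regular expressions, the helper `split_token` and the function name `parse_user_agent` are assumptions chosen only to reproduce the example output.

```python
import re

# "name/version" product token, e.g. Mozilla/4.0 or Java/1.4.2_06
PRODUCT = re.compile(r"([^\s/()]+)/([^\s()]+)")
# "name version" inside the parenthesised block, e.g. "Windows NT 5.1"
VERSION_SUFFIX = re.compile(r"^(.*?)[ /]v?(\d[\w.]*)$")


def split_token(token: str) -> tuple:
    """Split a device token into (name, version); version is '' if absent."""
    token = token.strip()
    m = VERSION_SUFFIX.match(token)
    return (m.group(1), m.group(2)) if m else (token, "")


def parse_user_agent(ua: str) -> dict:
    result = {"prefix": ("", ""), "device": [], "additional": [], "unparsable": ""}
    rest = ua.strip()
    m = PRODUCT.match(rest)
    if not m:                      # no leading product token: give up
        result["unparsable"] = rest
        return result
    result["prefix"] = (m.group(1), m.group(2))
    rest = rest[m.end():].strip()
    if rest.startswith("("):       # semicolon-separated device block
        close = rest.find(")")
        if close == -1:
            result["unparsable"] = rest
            return result
        inner = rest[1:close]
        result["device"] = [split_token(t) for t in inner.split(";") if t.strip()]
        rest = rest[close + 1:].strip()
    leftovers = []
    for piece in rest.split():     # trailing product tokens
        m = PRODUCT.fullmatch(piece)
        if m:
            result["additional"].append((m.group(1), m.group(2)))
        else:
            leftovers.append(piece)
    result["unparsable"] = " ".join(leftovers)
    return result
```

Anything the sketch cannot place ends up in `unparsable` instead of raising an error, which is one way to stay flexible with strings that follow no strict syntax.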
Removal of random-looking strings
• We encode every character of each token of the parsed string as 0 for a digit, 1 for a letter and 2 for a non-alphanumeric character
• For every token, we compute the entropy of the probability distribution of all transitions
• We threshold the entropy based on the cumulative distribution of the values obtained for all tokens in the corpus
• Intuition: random-looking strings will have higher entropy
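The steps above can be sketched as follows. Two details are assumptions, since the slides leave them open: a "transition" is taken to be a pair of consecutive character classes, and the threshold is taken as a fixed quantile of the corpus entropies (the 0.95 default is illustrative).

```python
import math
from collections import Counter


def encode(token: str) -> str:
    """Map each character to 0 (digit), 1 (letter) or 2 (anything else)."""
    return "".join("0" if c.isdigit() else "1" if c.isalpha() else "2" for c in token)


def transition_entropy(token: str) -> float:
    """Shannon entropy (bits) of the distribution of consecutive class pairs."""
    codes = encode(token)
    transitions = Counter(zip(codes, codes[1:]))
    total = sum(transitions.values())
    if total == 0:  # tokens shorter than two characters have no transitions
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in transitions.values())


def entropy_threshold(entropies, quantile=0.95):
    """Entropy value at the given quantile of the corpus (assumed rule)."""
    ordered = sorted(entropies)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]
```

A token such as `Windows` only ever moves from letter to letter and scores zero, while an identifier mixing digits, letters and hyphens spreads its mass over many transition types and scores higher.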
Our proposed distance
• Distance between token-value tuples
• Distance between user agent strings
Experimental Evaluation
Anomaly detection
Given the matrix of all pairwise distances:
• 1-KNN: the score is the distance to the first nearest neighbor
• One-class SVM: the score is the distance to the decision boundary
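As a minimal illustration of the 1-KNN score, assume the distances are already available as nested lists; the function names are hypothetical. The one-class SVM variant is not shown because it additionally needs a kernel derived from the distances, which the slides do not specify.

```python
def one_nn_scores(test_to_train):
    """test_to_train[i][j] = distance from test item i to training item j.

    The anomaly score of a test item is the distance to its closest
    training item: the larger the score, the more anomalous the item.
    """
    return [min(row) for row in test_to_train]


def one_nn_scores_within(square):
    """Same score inside a single data set, skipping the zero self-distance."""
    return [min(d for j, d in enumerate(row) if j != i)
            for i, row in enumerate(square)]
```

Sorting items by this score gives the ranking that simple rarity-based flagging lacks.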
Experimental setup
• How well can we predict malicious user agent strings using data only from normal traffic?
• We trained anomaly detection models on normal traffic from Bluecoat logs and tested on the union of a held-out set and honeypot data
• AUC scores are computed using 10-fold cross-validation
Data set              Size of raw data   Distinct user agent strings
Bluecoat proxy data   7,943,657          2,760
Honeypot data         994,693            4,048
AUC curves
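For reference, the AUC of a set of anomaly scores can be computed directly from its rank interpretation: the probability that a randomly chosen malicious item scores higher than a randomly chosen normal one. The sketch below is a plain quadratic implementation written for clarity; the function name and the example scores are made up.

```python
def auc(scores_anomalous, scores_normal):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in scores_anomalous:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(scores_anomalous) * len(scores_normal))


# Hypothetical scores: honeypot items should score higher than held-out normal items
print(auc([0.9, 0.8, 0.4], [0.1, 0.2, 0.5]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.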
Computational Scalability and Future Work
Scaling up the current method
Computing all pairwise distances is a quadratic problem.
For the case of KNN:
• Directly mining outliers [Bay et al. SIGKDD 2003]
– Randomises the data into buckets
– Uses only the first bucket to get a good estimate of the pruning threshold, which is then used to prune non-outliers
– Works well for non-uniform data
– Theoretical time is still quadratic; practical time is polynomial with a smaller exponent
– Works for any k and any number of outliers
• B-K trees [Burkhard et al. CACM 1973]
– An efficient data structure that stores all strings based on their relative distances
– Can be built in O(n log n) time, and retrieval of all nearest neighbors is also O(n log n)
– Has to fit in memory
For the case of SVMs:
– Standard string kernels (such as the p-spectrum kernel) could be used, and efficient solvers for this case investigated
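To make the B-K tree idea concrete, here is a small in-memory sketch. It uses plain Levenshtein distance as the metric for illustration; whether the proposed user agent distance satisfies the triangle inequality that the pruning step relies on is not stated in the slides. Class and method names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (integer-valued metric)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (item, children), children keyed by distance to the item."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return                      # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def within(self, query, radius):
        """All stored items at distance <= radius from query, as (d, item)."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append((d, item))
            # Triangle inequality: only subtrees with edge in [d - r, d + r]
            # can contain matches.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for word in ["book", "books", "cake", "boo", "cape", "cart"]:
    tree.add(word)
print(tree.within("bool", 1))
```

A nearest-neighbor query can be built on top of `within` by growing the radius until at least one result is returned.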
Future work
• Get additional user feedback on our proposed user agent string distance using only enterprise data
• Explore methodologies for finding suspicious look-alikes
• Explore host-based methods
Thank you!