Detecting Algorithmically Generated Domains Using Data Visualizations and N-grams Methods
Authors: Tianyu Wang and Li-Chiou Chen
Presenter: Tianyu Wang

Contents
1. Introduction
2. Data Sets
3. Classification Features
4. Model Improvement
5. Result Analysis
6. Conclusion & Future Research
Seidenberg

Introduction
• Example: 1002n0q11m17h017r1shexghfqf.com
• A Domain Generation Algorithm (DGA) can:
  – Generate a long list of domain names
  – Keep sending name resolution requests
  – Evade blacklist-based detection
• Identifying DGA domains would:
  – Help to detect distributed Command & Control botnets
  – Help to monitor potentially malicious activities

Dataset

               Legit Domain    DGA Domain
Data Source    Alexa Top 1M    Click.Security Project
Sample Size    1,000           52,665

• Data Processing
  – Remove Top-Level Domain (TLD)
    • For example, google.com -> google
  – Clean Data

Legit domain    DGA domain
google          1002n0q11m17h017r1shexghfqf
facebook        1002ra86698fjpgqke1cdvbk5
youtube         1008bnt1iekzdt1fqjb76pijxhr
yahoo           100f3a11ckgv438fpjz91idu2ag
baidu           100fjpj1yk5l751n4g9p01bgkmaf
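The TLD-removal step can be illustrated with a minimal sketch. The function name `strip_tld` is hypothetical, and the sketch assumes the TLD is always the final dot-separated label; multi-label suffixes such as co.uk would need a public suffix list, which the slides do not discuss.

```python
def strip_tld(hostname: str) -> str:
    """Drop the final dot-separated label (assumed to be the TLD)."""
    # Normalise case and remove any trailing root dot.
    hostname = hostname.strip().lower().rstrip(".")
    if "." not in hostname:
        return hostname  # nothing to strip
    # Keep everything before the last dot.
    return hostname.rsplit(".", 1)[0]


if __name__ == "__main__":
    for name in ["google.com", "1002n0q11m17h017r1shexghfqf.com"]:
        print(name, "->", strip_tld(name))
```

Note that a name such as www.google.com would keep its leading label under this assumption, so any further cleaning would have to be handled separately.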

Supervised Classification

Data Frame Sample

domain                                       class    length    entropy
theukwebdesigncompany                        legit    21        4.070656
texaswithlove1982-amomentlikethis            legit    33        4.051822
congresomundialjjrperu2009                   legit    26        4.056021
a17btkyb38gxe41pwd50nxmzjxiwjwdwfrp52        dga      37        4.540402
a17c49l68ntkqnuhvkrmyb28fubvn30e31g43dq      dga      39        4.631305
a17d60gtnxk47gskti15izhvlviyksh64nqkz        dga      37        4.270132
a17erpzfzh64c69csi35bqgvp52drita67jzmy       dga      38        4.629249
a17fro51oyk67b18ksfzoti55j36p32o11fvc29cr    dga      41        4.305859
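The length and entropy columns can be reproduced with a short sketch. It assumes the entropy feature is Shannon entropy in bits over the character frequencies of the domain string; the slides do not state the formula, but this assumption matches the sample value 4.070656 for theukwebdesigncompany. The function names are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def basic_features(domain: str) -> dict:
    """Build one row with the two initial features: length and entropy."""
    return {
        "domain": domain,
        "length": len(domain),
        "entropy": shannon_entropy(domain),
    }


if __name__ == "__main__":
    print(basic_features("theukwebdesigncompany"))
```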

Data Visualization - Plotting

Prepare for Classification
• Re-sampling
  – Shuffle data randomly for training/testing (80/20 split)
• Choose Classification Algorithms
  – Random Forest
  – Support Vector Machines (SVM)
  – Naïve Bayes
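A random 80/20 shuffle split can be sketched with the standard library alone. The helper name `shuffle_split` and the fixed seed are assumptions made for reproducibility of the example; the slides do not say which tool or seed was used.

```python
import random


def shuffle_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and return (train, test) with the given test fraction."""
    shuffled = list(rows)                   # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # seeded, reproducible shuffle
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    data = [("google", "legit"), ("facebook", "legit"), ("youtube", "legit"),
            ("yahoo", "legit"), ("1002ra86698fjpgqke1cdvbk5", "dga")]
    train, test = shuffle_split(data)
    print(len(train), "training rows,", len(test), "test rows")
```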

Random Forest

                Predicted
True        dga       legit      All
dga         2991      6379       9370
legit       427       127532     127959
All         3418      133911     137329

True Positive Rate (TPR) = 31.92%
True Negative Rate (TNR) = 99.67%
False Negative Rate (FNR) = 68.08%
False Positive Rate (FPR) = 0.33%
False Acceptance Rate (FAR) = 4.76%
False Rejection Rate (FRR) = 12.49%
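The six rates can be recomputed from the four cells of a confusion matrix, treating dga as the positive class. The FAR and FRR formulas in this sketch are inferred from the reported figures (FAR as false negatives over all predicted legit, FRR as false positives over all predicted dga) and are an assumption, since the slides do not define them.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Rates from a 2x2 confusion matrix, with dga as the positive class."""
    return {
        "TPR": tp / (tp + fn),   # dga correctly flagged
        "TNR": tn / (tn + fp),   # legit correctly passed
        "FNR": fn / (tp + fn),   # dga missed
        "FPR": fp / (tn + fp),   # legit wrongly flagged
        "FAR": fn / (fn + tn),   # assumed: share of predicted legit that is dga
        "FRR": fp / (tp + fp),   # assumed: share of predicted dga that is legit
    }


if __name__ == "__main__":
    # Random Forest counts from the table above.
    rf = rates(tp=2991, fn=6379, fp=427, tn=127532)
    for name, value in rf.items():
        print(f"{name} = {value:.2%}")
```

The same function applies unchanged to the SVM and Naïve Bayes matrices.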

SVM

                Predicted
True        dga       legit      All
dga         1160      8210       9370
legit       105       127854     127959
All         1265      136064     137329

TPR = 12.38%
TNR = 99.92%
FNR = 87.62%
FPR = 0.08%
FAR = 6.03%
FRR = 8.30%

Naïve Bayes

                Predicted
True        dga       legit      All
dga         3332      6038       9370
legit       5061      122898     127959
All         8393      128936     137329

TPR = 35.56%
TNR = 96.04%
FNR = 64.44%
FPR = 3.96%
FAR = 4.68%
FRR = 60.30%

Result Comparisons

                 Accuracy Rate        Error Rate
                 TPR       TNR        FNR       FPR      FAR      FRR
Random Forest    31.92%    99.67%     68.08%    0.33%    4.76%    12.49%
SVM              12.38%    99.92%     87.62%    0.08%    6.03%    8.30%
Naïve Bayes      35.56%    96.04%     64.44%    3.96%    4.68%    60.30%

• We need to improve our model.

Improvement Insight
• Domains
  – Many DGAs are dictionary-based algorithms
  – How to measure the similarity among domains
• Introduce new features based on N-grams
  – Build up a text corpus matrix
    • Legit Domain Matrix
    • Dictionary Words Matrix
  – Calculate a similarity score based on the matrix

Similarity Score using N-Gram
[Figure: a domain vector D (1×m), for example abc, google, …; a matrix L (m×n) of N-grams of domains with N = 3, 4, 5, for example goo, oog, ogl, gle, goog, oogl, ogle; and a score vector S (1×n) obtained by summation (∑).]
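One way to picture an N-gram similarity score is the sketch below. It counts character N-grams (N = 3, 4, 5) over a reference corpus and scores a domain by summing the weights of its own N-grams. The log10(1 + count) weighting and all function names are assumptions for illustration; the slides do not specify the weighting, so this is not necessarily the authors' exact computation.

```python
import math
from collections import Counter


def char_ngrams(text: str, sizes=(3, 4, 5)) -> list:
    """All contiguous character N-grams of the given sizes."""
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]


def build_ngram_weights(corpus, sizes=(3, 4, 5)) -> dict:
    """Weight each N-gram seen in the corpus (assumed: log10 of 1 + count)."""
    counts = Counter()
    for word in corpus:
        counts.update(char_ngrams(word, sizes))
    return {gram: math.log10(1 + c) for gram, c in counts.items()}


def ngram_score(domain: str, weights: dict, sizes=(3, 4, 5)) -> float:
    """Sum the corpus weights of every N-gram in the domain (0 if unseen)."""
    return sum(weights.get(gram, 0.0) for gram in char_ngrams(domain, sizes))


if __name__ == "__main__":
    # Tiny stand-in corpus; a real run would use the legit-domain list.
    weights = build_ngram_weights(["google", "googlemail", "facebook"])
    for name in ["google", "wdqdreklqnpp"]:
        print(name, round(ngram_score(name, weights), 3))
```

Under this scheme a domain that shares many substrings with the reference corpus scores high, while a random-looking string scores near zero, which is the separation the new features aim for.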

Alexa Gram
• We calculate an alexa_grams score for every domain.

domain                   class    length    entropy    alexa_grams
investmentsonthebeach    legit    21        3.368      144.722
infiniteskills           legit    14        2.807      81.379
dticash                  legit    7         2.807      26.558
healthyliving            legit    13        3.239      76.710
asset-cache              legit    11        2.732      46.268
wdqdreklqnpp             dga      12        3.085      11.242
wdqjkpltirjhtho          dga      15        3.507      14.304
wdqxavemaedon            dga      13        3.239      28.468
wdraokbcnspexm           dga      14        3.807      25.935
wdsqfivqnqcbna           dga      14        3.325      4.598

Dictionary Gram
• Similarly, we calculate a dict_grams score for every domain.
• Dictionary: 479,623 commonly used English words

domain                   class    length    entropy    dict_grams
investmentsonthebeach    legit    21        3.368      109.723
infiniteskills           legit    14        2.807      72.786
dticash                  legit    7         2.807      23.710
healthyliving            legit    13        3.239      61.722
asset-cache              legit    11        2.732      31.691
wdqdreklqnpp             dga      12        3.085      6.367
wdqjkpltirjhtho          dga      15        3.507      16.554
wdqxavemaedon            dga      13        3.239      28.700
wdraokbcnspexm           dga      14        3.807      19.785
wdsqfivqnqcbna           dga      14        3.325      3.629

New Data Frame Sample
• New Features
  – Calculate N-grams for Alexa domains, alexa_grams
  – Calculate N-grams for the dictionary, dict_grams

domain                   class    length    entropy    alexa_grams    dict_grams
investmentsonthebeach    legit    21        3.368      144.722        109.723
infiniteskills           legit    14        2.807      81.379         72.786
dticash                  legit    7         2.807      26.558         23.710
healthyliving            legit    13        3.239      76.710         61.722
asset-cache              legit    11        2.732      46.268         31.691
wdqdreklqnpp             dga      12        3.085      11.242         6.367
wdqjkpltirjhtho          dga      15        3.507      14.304         16.554
wdqxavemaedon            dga      13        3.239      28.468         28.700
wdraokbcnspexm           dga      14        3.807      25.935         19.785
wdsqfivqnqcbna           dga      14        3.325      4.598          3.629

More Plots

New Features in Classification
• Keep the same classification parameters
• Train the model with four features
  – Length
  – Entropy
  – Alexa_grams
  – Dict_grams

Model Improvement

                          Old                                       New
Performance Rate          Random Forest   SVM      Naïve Bayes      Random Forest   SVM      Naïve Bayes
Accuracy Rate    TPR      31.92%          12.38%   35.56%           97.53%          92.03%   76.87%
                 TNR      99.67%          99.92%   96.04%           99.80%          99.58%   99.72%
Error Rate       FNR      68.08%          87.62%   64.44%           2.47%           7.97%    23.13%
                 FPR      0.33%           0.08%    3.96%            0.20%           0.42%    0.28%
                 FAR      4.76%           6.03%    4.68%            0.18%           0.58%    1.67%
                 FRR      12.49%          8.30%    60.30%           2.70%           5.83%    4.68%

Detail Compare – Random Forest

Accuracy Rate    Old       New
TPR              31.92%    97.53%
TNR              99.67%    99.80%

Error Rate       Old       New
FNR              68.08%    2.47%
FPR              0.33%     0.20%
FAR              4.76%     0.18%
FRR              12.49%    2.70%

• New Model is Better.

Detail Compare – SVM

Accuracy Rate    Old       New
TPR              12.38%    92.03%
TNR              99.92%    99.58%

Error Rate       Old       New
FNR              87.62%    7.97%
FPR              0.08%     0.42%
FAR              6.03%     0.58%
FRR              8.30%     5.83%

• New Model is Better.

Detail Compare – Naïve Bayes

Accuracy Rate    Old       New
TPR              35.56%    76.87%
TNR              96.04%    99.72%

Error Rate       Old       New
FNR              64.44%    23.13%
FPR              3.96%     0.28%
FAR              4.68%     1.67%
FRR              60.30%    4.68%

• New Model is Better.

Conclusion
• Implemented machine learning in the cybersecurity area
• Introduced two new features for DGA domain classification
• Identified DGA domains successfully
• Compared the performance of three algorithms on the new features
• Further research will focus on real-time monitoring

Q&A
Thank You