Spam Campaign Cluster Detection Using Redirected URLs and Randomized Sub-Domains


Spam Campaign Cluster Detection Using Redirected URLs and Randomized Sub-Domains
Authors: Abu Awal Md Shoeb, Dibya Mukhopadhyay, Shahid Al Noor, Alan Sprague, and Gary Warner
Dec 14, 2014, Harvard University

Outline
§ Introduction
§ Why spam detection is important
§ Why it is difficult to detect
§ Our approach
§ Results
§ Conclusion

Introduction
Spam email
§ Unsolicited bulk email (also known as junk email) sent to numerous recipients
§ Not only annoying but also dangerous
Spam campaign
§ Spam emails constructed from the same template to market the same product
Redirected URL
§ Address of the actual website reached from a given URL
Randomized sub-domain
§ Different web addresses generated from a single domain or web address

Why Spam Detection is Important
§ More than 70% of email is spam!
§ Malware-infected users help spammers harm others!
§ Saves large organizations the extra time spent responding to infections
§ Lets users find useful email easily without going through a large set of spam

Why Spam Detection is Difficult
Attributes of spam change very quickly
§ Variety of subjects
§ Variety of given URLs
§ Given URL shuts down after some time
Criminals hide their identity through botnets
§ Infected computers are used for campaigns
Lack of access to instantaneous spam data
§ No benefit if a campaign is detected after it achieves its goal

Our Approach
Why given URL
§ This is what comes with most spam
Why campaign, why not spam only
§ A campaign is a superset of spam emails
Why redirected URL
§ The given URL changes very often, but the redirected URL doesn't

Block Diagram/Overview
Spam Dataset
§ Each spam email may be considered as a cluster of size 1
§ Level 1: Merge all spam emails that redirect to the same website
§ Level 2: Merge each pair of clusters that contain an exact subject in common
§ Level 3: Merge each pair of clusters that contain a URL with a common domain
§ Remove clusters whose size is smaller than a particular threshold T
Large spam campaigns, each represented as a cluster

Our Algorithm
Level 1: Redirected URL-Based Clustering
§ Put all spam emails that have the same redirected URL into the same cluster and assign the redirected URL as KEY
§ Remaining spam emails are treated as individual clusters
Level 2: Exact Subject-Based Clustering
§ Merge existing clusters if their subjects match
§ Assign the subject as KEY if the redirected URL is not a key
Level 3: Randomized Sub-Domain-Based Clustering
§ Extract the given URL and merge existing clusters if they are members of the same domain
Thresholding
§ Apply different thresholds to discard tiny clusters (outliers)
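
The three levels and the thresholding step can be sketched in Python as a union-find merge. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names (subject, given_url, redirected_url), the helper base_domain, and the use of union-find are all hypothetical choices, the domain key is a naive "last two host labels" rule, and the KEY bookkeeping from the slide is not modeled.

```python
from collections import defaultdict
from urllib.parse import urlparse


class DisjointSet:
    """Minimal union-find over email indices."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def base_domain(url):
    """Naive domain key (assumption): the last two labels of the host."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def cluster_campaigns(emails, threshold):
    """Cluster emails (dicts with hypothetical keys 'subject', 'given_url',
    'redirected_url') and return index lists of clusters with size >= threshold."""
    ds = DisjointSet(len(emails))
    levels = (
        lambda e: e.get("redirected_url"),                                # Level 1
        lambda e: e.get("subject"),                                       # Level 2
        lambda e: base_domain(e["given_url"]) if e.get("given_url") else None,  # Level 3
    )
    for key_of in levels:
        first_seen = {}  # key -> index of the first email carrying that key
        for i, email in enumerate(emails):
            key = key_of(email)
            if not key:
                continue  # emails without this attribute stay where they are
            if key in first_seen:
                ds.union(first_seen[key], i)
            else:
                first_seen[key] = i
    clusters = defaultdict(list)
    for i in range(len(emails)):
        clusters[ds.find(i)].append(i)
    # Thresholding: discard clusters smaller than the threshold.
    return [members for members in clusters.values() if len(members) >= threshold]
```

Because each level only ever merges clusters, applying the three keys one after another on the same union-find structure gives the same partition as merging cluster pairs level by level.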

Dataset and Tools
Source
§ The Center for Information Assurance and Joint Forensics Research (CIA-JFR)
Spam Data (approx. half a million)
§ 15 April 2014 (6 hours) to build prototype
§ 20, 21 August 2014 (full day) to test
Attributes of Data
§ Subject, given URL, redirected URL (derived)
Language
§ Python (Mechanize library)

Example: Redirected URL
Given URLs (20 Aug 2014)
§ http://zadlaku.com/4oHEeC55iO13V/
§ http://www.neutrohost.com/hjVZDdcE1vl1Q/
§ http://latinheatrl.com/gIRkg2Z4X9B3i/
§ http://papiocreekpreschool.org/7v4O9emEoDj3L/
§ http://hiltonheadhypnotherapy.com/29IHXiw0KTb2k/
All of them have the same redirected URL
§ http://dietscoop24.com
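
How a given URL maps to its redirected URL can be sketched as a loop that follows HTTP redirects until a non-redirect response arrives. The slides only say that Python with the Mechanize library was used, so everything below is an assumption for illustration: resolve_redirect and the caller-supplied fetch function are hypothetical names, and only HTTP-level redirects (status code plus Location header) are handled, not meta-refresh or script-based redirection.

```python
from urllib.parse import urljoin

# HTTP status codes that signal a redirect with a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_redirect(url, fetch, max_hops=10):
    """Follow redirects from `url` and return the last URL reached.

    `fetch(url)` is a caller-supplied (hypothetical) function that returns
    a tuple (status_code, location_header_or_None).
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            break  # redirect loop: stop and return the URL where it was detected
        seen.add(url)
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url  # final page reached
        url = urljoin(url, location)  # Location may be relative to the current URL
    return url  # loop detected or hop limit reached
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP client; in practice it would wrap whichever library performs the request.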

Example: Randomized Sub-Domain
All given URLs have the same domain
§ http://standpoint.rhin.ru/
§ http://chessboard.rhin.ru/
§ http://controlled.rhin.ru/
§ http://civilian.rhin.ru/
§ http://alderman.rhin.ru/
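
Grouping given URLs by their shared domain can be sketched with the standard library. The function names registered_domain and group_by_domain are hypothetical, and taking the last two host labels is a simplifying assumption: it works for hosts such as standpoint.rhin.ru but would mis-handle multi-part suffixes such as .co.uk, where a Public Suffix List lookup would be needed.

```python
from collections import defaultdict
from urllib.parse import urlparse


def registered_domain(url):
    """Return the last two labels of the URL's host (naive assumption)."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def group_by_domain(urls):
    """Map each domain to the list of given URLs that fall under it."""
    groups = defaultdict(list)
    for url in urls:
        groups[registered_domain(url)].append(url)
    return dict(groups)


given_urls = [
    "http://standpoint.rhin.ru/",
    "http://chessboard.rhin.ru/",
    "http://controlled.rhin.ru/",
    "http://civilian.rhin.ru/",
    "http://alderman.rhin.ru/",
]
groups = group_by_domain(given_urls)
```

With the five URLs from the slide, every entry lands under the single key rhin.ru, which is what lets Level 3 merge clusters despite the randomized sub-domain labels.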

Results: Number of Clusters

Dataset        # of Spam   Redirected URL   Same Subject   Randomized Sub-domain   Threshold 500
15 April           60995            17253           1086                     289               4
20 Aug            249389           156077           5145                    1044              18
21 Aug            247922           166645           7938                    1037              15
20, 21 Aug        497311           322722          11878                    1670              26

Results: April 15, 2014
Total spam: 60995

Spam Campaign (SC)   Size (No. of Spam Emails)   Share
1                    22195                       36%
2                    19870                       32%
3                    11114                       18%
4                    3582                        6%
5                    271
6                    256

Results: August 20, 2014
Total spam: 249389

Spam Campaign (SC)   Size (No. of Spam Emails)   Share
1                    90144                       36%
2                    75992                       30%
3                    28813                       11.5%
4                    16736                       6.7%
5                    2996
6                    2236
7                    1813

Results: August 21, 2014
Total spam: 247922

Spam Campaign (SC)   Size (No. of Spam Emails)   Share
1                    84536                       34%
2                    79848                       32%
3                    42558                       17%
4                    8823                        3.5%
5                    1788
6                    1466
7                    1250
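
The per-campaign shares on the three results slides can be re-derived from the cluster sizes and daily totals. The sketch below does only that arithmetic; the sizes are the bar values as transcribed from the charts (their assignment to SC 1, SC 2, and so on is inferred from the order and matches the stated percentages), and the function names are hypothetical.

```python
# Cluster sizes as transcribed from the three results charts, largest first.
campaign_sizes = {
    "15 April 2014": [22195, 19870, 11114, 3582, 271, 256],
    "20 August 2014": [90144, 75992, 28813, 16736, 2996, 2236, 1813],
    "21 August 2014": [84536, 79848, 42558, 8823, 1788, 1466, 1250],
}
total_spam = {
    "15 April 2014": 60995,
    "20 August 2014": 249389,
    "21 August 2014": 247922,
}


def campaign_shares(sizes, total):
    """Percentage of the day's spam covered by each campaign."""
    return [100.0 * size / total for size in sizes]


def top_k_coverage(sizes, total, k=4):
    """Percentage of the day's spam covered by the k largest campaigns."""
    return 100.0 * sum(sorted(sizes, reverse=True)[:k]) / total


for day, sizes in campaign_sizes.items():
    shares = [round(s, 1) for s in campaign_shares(sizes, total_spam[day])[:4]]
    coverage = round(top_k_coverage(sizes, total_spam[day]), 1)
    print(day, "top-4 shares:", shares, "top-4 coverage:", coverage)
```

On these figures the four largest campaigns cover about 93% of the spam on 15 April, about 85% on 20 August, and about 87% on 21 August.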

Products Advertised in Large Campaigns
15 April 2014 campaigns: 60995 spam emails
§ Viagra, 5%
§ Others, 9%
§ Anti-Aging Pills, 36%
§ Weight-Loss Solution, 18%
§ Weight-Loss Pills, 32%

Behavior of Campaigns

Redirected URL              15 April   20 August   21 August
http://wghtnews.com         Yes        No          No
http://fjxnewsdaily.com     Yes        No          No
http://mscdailynews.com     Yes (less frequent)
http://dietscoop24.com      No         Yes
http://skinnewsdaily7.com   No         Yes

Conclusion - 1
It is a real-time spam campaign detection approach
§ No predefined model/role is required
§ Can be applied as soon as spam arrives
Our approach is very effective
§ Almost 90% of half a million spam emails fall into 4 major campaigns
Can detect campaigns consistently
§ No matter if a campaign's subject changes
§ No matter if the given URL changes

Conclusion - 2
With large clusters identified, rather than blocking the spam, we need to identify a new approach towards spam campaigns
§ Community awareness
§ Law enforcement

Thank You ☺ Question Time!
Presented by: Abu Awal Md Shoeb
The SECuRE and Trustworthy Computing Lab (SECRETLab)
http://secret.cis.uab.edu/
shoeb@uab.edu

Problems of Current Approaches
Content-Based
§ Requires longer processing time
Blacklist-Based/IP-Based
§ Attackers change host IP or path
Whitelist-Based
§ Detecting and maintaining the list is not easy
Challenge-Response-Based
§ Deadlock when both parties implement this