Labeling the VirusShare Dataset: Lessons Learned
John Seymour, seymour1@umbc.edu, @_delta_zero
2016-04-23
Outline
• Introduction to Malware Classification
• Labeling the VirusShare Corpus
• Building a Malware Index using PySpark
• Pretty graphs, words of caution, and useful extensions
Where to find malware
600 samples
• Lots of exploit kits
• Includes analyses
Kaggle dataset: 10,868 samples (about 500 GB)
• 9 families of malware
• Hexdumps/Assembly files (from IDA)
• Neutered: PE headers removed
271,092 samples
• Labeled by KAV
• Last update: 2007
• Most-used academic dataset
VirusShare: 24,783,626 samples
• Split into chunks of 65,536 samples
• Available by Torrent
• Unlabeled (until now!)
As many as you want
• VirusTotal: Needs Private API
• Research Requests
• Licensing issues
Why VirusShare?
• Size (27 million samples: almost 2,500x the Kaggle dataset size, and not neutered)
• Consistently updated
• Make future research more reproducible
• Scrape your own => you’ll probably overfit
• VirusTotal: can't release any raw data from the platform
• Short descriptions of dataset: “Chunks 25, 60, 90”
Overview
• VirusTotal has an awesome API
• Two versions: Private/Research and Public
  • Public is rate-limited
  • Private/Research has licensing agreements
    • Meaning we wouldn’t be able to distribute results
• Took 30 people + 6 months to label
  • Due to rate-limiting
  • Mostly undergraduates wanting extra credit
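A rough back-of-the-envelope check on why rate-limiting dominates the labeling time. The limit of 4 requests per minute per key is an assumption (the slides do not state it), as is the idea that each of the 30 people ran one key around the clock with one sample per request. The sample count is the 24,783,626 figure from the "Where to find malware" slide.

```python
# Rough estimate of labeling time under a rate-limited public API.
# ASSUMPTIONS (not from the slides): 4 requests per minute per key,
# one sample per request, one key per person, querying 24 hours a day.
REQUESTS_PER_MINUTE_PER_KEY = 4   # assumed
PEOPLE = 30                       # from the slide
TOTAL_SAMPLES = 24_783_626        # sample count quoted earlier in the deck


def days_to_label(total_samples, keys, requests_per_minute):
    """Return the number of days needed to query every sample once."""
    per_day = keys * requests_per_minute * 60 * 24
    return total_samples / per_day


days = days_to_label(TOTAL_SAMPLES, PEOPLE, REQUESTS_PER_MINUTE_PER_KEY)
print(f"{days:.0f} days with {PEOPLE} keys")
print(f"{days * PEOPLE / 365:.1f} years with a single key")
```

Under these assumptions the estimate comes out at roughly 143 days, the same order of magnitude as the six months reported above.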
Labels available here: https://drive.google.com/file/d/0B_IN6RzP69b2TkNrYVdOMnQ4LVE/view
What is an index and why do I need one?
• An index lists each malware label along with how many samples of it occur in each VirusShare chunk
• Why? We want to minimize download time + size on hard drive
• Not all VirusShare chunks have all types of malware
• Very useful when we want to download a large number of samples of a particular type of malware
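A minimal sketch of such an index, written in plain Python rather than PySpark so it runs without a cluster. The record layout (chunk number, label) and the example labels are hypothetical; the actual label file may be structured differently.

```python
from collections import Counter, defaultdict


def build_index(records):
    """Map each label to a Counter of {chunk number: sample count}."""
    index = defaultdict(Counter)
    for chunk, label in records:
        index[label][chunk] += 1
    return index


# Hypothetical (chunk, label) pairs, one per labeled sample.
records = [
    (25, "Trojan.Zbot"), (25, "Trojan.Zbot"), (25, "Worm.Allaple"),
    (60, "Trojan.Zbot"), (90, "Worm.Allaple"), (90, "Worm.Allaple"),
]

index = build_index(records)
for label in sorted(index):
    # Print each label with its per-chunk counts in chunk order.
    print(label, dict(sorted(index[label].items())))
```

In PySpark the same counting could plausibly be expressed as a map to `((label, chunk), 1)` pairs followed by `reduceByKey`, though that is one possible formulation and not necessarily the one used in the talk.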
It’s really easy to use!
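One way using the index might look, as a sketch: given per-chunk counts for a label, pick the fewest chunks that cover a target number of samples. The function name `chunks_to_download`, the index shape, and all counts below are hypothetical.

```python
def chunks_to_download(index, label, wanted):
    """Pick the chunks richest in `label` until `wanted` samples are covered.

    Returns (chunk numbers, samples covered). Taking the largest counts
    first minimizes how many chunks have to be downloaded.
    """
    counts = index.get(label, {})
    chosen, covered = [], 0
    # Largest count first; ties broken by lower chunk number.
    for chunk, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= wanted:
            break
        chosen.append(chunk)
        covered += n
    return chosen, covered


# Hypothetical index: label -> {chunk number: sample count}.
index = {
    "Trojan.Zbot": {25: 1200, 60: 300, 90: 4500},
    "Worm.Allaple": {25: 50, 90: 10},
}

chunks, covered = chunks_to_download(index, "Trojan.Zbot", 5000)
print(chunks, covered)
```

If the index holds fewer samples of a label than requested, the function returns every chunk containing that label together with the smaller total, so the caller can see the shortfall.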
Pretty Graphs