Botnet Detection Distinguishing Between Bots and Human Activities

Botnet Detection – Distinguishing Between Bots and Human Activities 2010/11/26 Speaker: Li-Ming Chen

What is a Botnet?
• Bots: compromised hosts ("zombies")
• Botnets: networks of bots that are under the control of a human operator (the botmaster)
• A botnet generally looks like a worm plus a C&C channel
  - Command and Control (C&C) channel: disseminates the botmaster's commands to the bot army
  - Communication: IRC, HTTP, ... (can be encrypted)
  - Attack: DoS, spamming, phishing sites, ...
  - Propagation: vulnerabilities, file sharing, P2P, ...

Lifecycle of a Typical Botnet Infection
Uses of botnets:
• Phishing attacks
• Spam
• ID/information theft
• DDoS
• Distributing other malware

Why are Botnets so Daunting?
• Underground economics!
• Multilayered/multifunction C&C architecture
• Botnet structures change (e.g., P2P)
• Secure communication!
• Always behind the mirror
• Fast-flux (hide C&C servers or other bots behind an ever-changing network)
• Multi-vector exploitation + social engineering techniques

Botnet Detection
• Using honeypots or infiltration techniques
  - To understand the basic behavior of botnets
• Passive anomaly analysis
  - Detect malicious activities
  - Detect C&C traffic
    * Traffic signature or statistical features
    * Response crowd phenomena
• Graph-based analysis
  - Detect a botnet's centralized/P2P structures

Fundamental Problems
• How to detect newly appeared botnets?
  - Botnet structures are moving from centralized to decentralized
  - Botnets may use their own custom C&C protocols
• How to identify the applications behind network traffic?
  - Investigating a huge amount of unknown traffic is inevitable in botnet detection

Automatic Discovery of Botnet Communities on Large-Scale Communication Networks
Wei Lu, Mahbod Tavallaee, and Ali A. Ghorbani (Univ. of New Brunswick, Canada)
ASIACCS 2009

Two-leveled Botnet Detection
• Propose a hierarchical framework for automatic botnet discovery
  - Higher level: classify unknown network traffic into different network application communities
  - Lower level: within each application community, differentiate malicious botnet behavior from normal application traffic

Traffic Classification
• Current techniques:
  - Use transport-layer port numbers, payload signatures, statistical signatures, machine learning & clustering
• Proposed approach:
  - (Hybrid) combine (1) payload signatures with (2) a cross-association clustering algorithm

Traffic Classification – (Step 1) Using Payload Signature
• Set up 470 application signatures, each composed of 10 fields
• Apply them to a one-day trace of the Fred-eZone WiFi network
• Flows that match no signature require further analysis

Traffic Classification – (Step 2) Identify Unknown Traffic App.
• Input: both unknown and known flows (cross-association)
• Clustering stages: Src. IP against Dst. IP first; then, within one cluster, Dst. IP against Dst. Port (to obtain the exact applications underlying a general application category)
• After classification, we need to label each application community
  - Assign labels to unknown flows based on the probability of known flows in a community (?)
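The labeling step above can be sketched as follows. This is only a minimal illustration: it assumes a community is given as a list of per-flow labels with None marking an unknown flow, and it uses a simple majority vote over the known flows. The function names and the majority-vote rule are assumptions for illustration; the slide itself marks the exact rule as unclear.

```python
from collections import Counter

def label_community(flow_labels):
    """Return (label, probability) of the most common known label.

    flow_labels: list of application labels; None marks an unknown flow.
    Returns (None, 0.0) when the community has no known flows at all.
    """
    known = [lab for lab in flow_labels if lab is not None]
    if not known:
        return None, 0.0
    label, count = Counter(known).most_common(1)[0]
    return label, count / len(known)

def assign_unknown(flow_labels):
    """Give every unknown flow the community's dominant known label."""
    label, _ = label_community(flow_labels)
    return [label if lab is None else lab for lab in flow_labels]

# Hypothetical community produced by the clustering step.
community = ["HTTP", "HTTP", None, "IRC", None]
print(label_community(community))
print(assign_unknown(community))
```

A community with no known flows stays unlabeled here, which matches the need for further analysis of such traffic.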

Botnet Detection
• Measure the 1-gram byte distribution with time bins (N time bins), giving vectors F1, F2, ..., FN with 256 entries each
• Repeatedly find the closest pair and merge it, until there are 2 clusters
• Calculate σ1 and σ2 for the two clusters
• The botnet cluster has the smaller σ
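The clustering idea on this slide can be sketched as below. This is a rough illustration under stated assumptions, not the paper's implementation: the distance is Euclidean, clusters are compared by their centroids, and σ is taken to be the root-mean-square distance of a cluster's members from its centroid. All function names and the sample payloads are hypothetical.

```python
import math
from collections import Counter

def byte_distribution(payload: bytes):
    """256-entry relative frequency vector of byte values (1-gram)."""
    counts = Counter(payload)
    total = len(payload) or 1
    return [counts.get(b, 0) / total for b in range(256)]

def distance(u, v):
    """Euclidean distance between two vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_into_two(vectors):
    """Merge the closest pair of clusters until only 2 remain.

    Expects at least two input vectors.
    """
    clusters = [[v] for v in vectors]
    while len(clusters) > 2:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def spread(cluster):
    """Assumed sigma: RMS distance of members from the cluster centroid."""
    c = centroid(cluster)
    return math.sqrt(sum(distance(v, c) ** 2 for v in cluster) / len(cluster))

def pick_botnet_cluster(vectors):
    """Return the cluster with the smaller sigma."""
    a, b = cluster_into_two(vectors)
    return a if spread(a) < spread(b) else b

# Hypothetical payloads: three repetitive bot-like bins, two varied ones.
bins = [b"PING :srv\r\n" * 4, b"PING :srv\r\n" * 5, b"PING :srv\r\n" * 3,
        bytes(range(256)), bytes(range(128)) * 2]
botnet = pick_botnet_cluster([byte_distribution(p) for p in bins])
print(len(botnet))
```

The pairwise search is quadratic per merge, which is fine for a sketch but would need a proper linkage implementation for real traces.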

Performance Evaluation
• Accuracy > 85%
• Accuracy (for inserted C&C flows) ~ 100%
[Figure labels: TP, FP]

My Comments
• The proposed approach supports automatic traffic classification
• However, it only analyzes botnets in the IRC and HTTP/Web communities
• The classification rules are unclear
• Botnet classification focuses only on the "byte distribution" of payloads

Are Your Hosts Trading or Plotting? Telling P2P File-Sharing and Bots Apart
Ting-Fang Yen, Michael K. Reiter (CMU, UNC)
ICDCS 2010 (International Conference on Distributed Computing Systems)

Problem and Motivation
• P2P bots are more and more popular
  - Botnet C&C traffic will tend to blend into a background of P2P file sharing
• Problem:
  - Differentiate bots (Plotters) from other P2P hosts (Traders)
• P2P bot characteristics:
  - Volume (not for file sharing)
  - Persistence (maintain connectivity)
  - Peer churn (less churn in peer membership)
  - Human-driven vs. machine-driven (botnet traffic is more regular and periodic)

Dataset
• CMU dataset (basis)
• Trader dataset
  - Known P2P traffic (Gnutella, eMule, BitTorrent) in the CMU dataset
• Plotter dataset
  - Collected from honeypots (Storm & Nugache bots)
  - Ignore spamming and scanning activities; preserve botnet control traffic
  - Inserted into the CMU dataset for evaluation

Approach
[Figure (Volume Test): CDF of avg. # bytes sent per flow]
[Figure (Peer Churn Test): new IPs connected (%) vs. hour index]
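The two per-host quantities behind these tests can be computed roughly as below. This is a sketch under assumptions: a flow is represented as a hypothetical (hour_index, dst_ip, bytes_sent) tuple, and "new" means an IP not contacted in any earlier hour. Thresholds for deciding Plotter vs. Trader are not given on the slide and are left out.

```python
def avg_bytes_per_flow(flows):
    """Volume test input: average number of bytes sent per flow.

    flows: non-empty list of (hour_index, dst_ip, bytes_sent) for one host.
    """
    return sum(f[2] for f in flows) / len(flows)

def new_ip_fraction_per_hour(flows):
    """Peer churn test input: per hour, the fraction of contacted IPs
    that were not contacted in any earlier hour."""
    by_hour = {}
    for hour, dst, _ in flows:
        by_hour.setdefault(hour, set()).add(dst)
    seen = set()
    result = {}
    for hour in sorted(by_hour):
        ips = by_hour[hour]
        result[hour] = len(ips - seen) / len(ips)
        seen |= ips
    return result

# Hypothetical flows of a single host.
flows = [(0, "a", 100), (0, "b", 300), (1, "a", 200), (1, "c", 400)]
print(avg_bytes_per_flow(flows))
print(new_ip_fraction_per_hour(flows))
```

A host that keeps a stable peer list would show a low new-IP fraction after the first hour, in line with the "less churn" characteristic listed for P2P bots.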

Approach (cont'd)
• Observe the interstitial time distribution of flows to the same destination IP for each host
[Figure (Human-driven vs. Machine-driven): interstitial time in seconds vs. flow index]
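Interstitial times can be extracted along these lines. This is a minimal sketch that assumes one host's flows are given as hypothetical (start_time_seconds, dst_ip) pairs; the population standard deviation is used here only as a simple indicator of regularity and is not claimed to be the paper's test statistic.

```python
from collections import defaultdict
from statistics import pstdev

def interstitial_times(flows):
    """Gaps between consecutive flow start times, per destination IP.

    flows: iterable of (start_time_seconds, dst_ip) for one host.
    """
    starts = defaultdict(list)
    for t, dst in flows:
        starts[dst].append(t)
    gaps = {}
    for dst, times in starts.items():
        times.sort()
        gaps[dst] = [b - a for a, b in zip(times, times[1:])]
    return gaps

# Hypothetical host: periodic contact with "x", irregular contact with "y".
flows = [(0, "x"), (60, "x"), (120, "x"), (5, "y"), (47, "y"), (300, "y")]
gaps = interstitial_times(flows)
for dst, g in gaps.items():
    print(dst, g, pstdev(g))
```

Machine-driven traffic would be expected to show tightly concentrated gaps, while human-driven traffic would spread out.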

Performance Evaluation
• Initial data reduction:
  - Filter out hosts (and their flows) that have relatively low failed connection rates (neither a Trader nor a Plotter)
• Identifying Plotters:
  [Figure annotations: TP degrades; FP rate is reduced]
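The initial data reduction can be sketched as a simple filter. The flow representation (dst_ip, succeeded) and the min_failed_rate threshold are assumptions for illustration; the slide does not state how "relatively low" is defined.

```python
def failed_connection_rate(flows):
    """Fraction of a host's flows that failed.

    flows: non-empty list of (dst_ip, succeeded) pairs, succeeded is a bool.
    """
    failed = sum(1 for _, ok in flows if not ok)
    return failed / len(flows)

def reduce_hosts(host_flows, min_failed_rate):
    """Keep only hosts whose failed connection rate reaches the threshold.

    host_flows: dict mapping host -> list of (dst_ip, succeeded).
    min_failed_rate: hypothetical cutoff; hosts below it are dropped.
    """
    return {
        host: flows
        for host, flows in host_flows.items()
        if flows and failed_connection_rate(flows) >= min_failed_rate
    }

# Hypothetical hosts: h1 rarely fails, h2 fails often (P2P-like).
hosts = {
    "h1": [("a", True), ("b", True), ("c", True), ("d", False)],
    "h2": [("a", False), ("b", False), ("c", True), ("d", False)],
}
print(sorted(reduce_hosts(hosts, min_failed_rate=0.5)))
```

Only the hosts that survive this filter would then be passed to the volume, churn, and human-vs-machine tests.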

My Comments
• The authors develop a series of tests for separating Plotters and Traders
  - Focus on flow characteristics (instead of packet-level information)
  - Evaluate the effectiveness of the three tests and their combination
• Compared with BotGrep:
  - BotGrep only detects a P2P communication structure in the network
  - This work can distinguish P2P bots from normal P2P users

Other Bots
• Chat bots
  - e.g., (good) help operate chat rooms, entertain chat users
  - e.g., (bad) distribute chat spam ("spim"), malware
• Input data modification attacks
  - e.g., online game cheating, click fraud, auctions, ...
• Problem:
  - Is a human in control, or is it a bot (computer)?

Measurement and Classification of Humans and Bots in Internet Chat
Steven Gianvecchio, Mengjun Xie, Zhenyu Wu, and Haining Wang (The College of William and Mary)
USENIX Security Symp. 2008

Detecting Chat Bots
• Observation:
  - Perform a series of measurements on Yahoo! chat to study the behaviors of chat bots and humans
• Motivation:
  - Human behavior is more complex than bot behavior
• Propose a classification system to accurately distinguish chat bots from human users
  - (1) an entropy classifier
    * Based on message time and size
  - (2) a machine-learning classifier
    * Based on message content

Measurement & Pre-labeling
• Input: public messages posted to Yahoo! chat rooms
• Time: Aug. to Nov. 2007
• Pre-labeling:
  - The examiner observes a long conversation between a test subject and one or more third parties, and then decides whether the subject is a human or a chat bot
  - Criteria: lack of intelligent response, repetition of similar phrases, presence of spam or malware URLs
• Labeling results:
  - Human, bot {periodic, random, responder, or replay bots}, ambiguous

[Figures: probability distributions of inter-message delay (sec.) and message size (bytes) for humans, periodic bots, random bots, responder bots, and replay bots]

Approach
• Motivation: human behavior is more complex
• Entropy classifier
  - Based on message time and size
  - Define a cutoff score (entropy, entropy rate)
  - If the test score > cutoff score, classify as human
• Machine-learning classifier
  - Based on the content of chat messages to identify chat bots
  - Uses Bayesian classification to decide P(bot | M)
  - M is a feature vector <f1, f2, ..., fn>
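The cutoff rule of the entropy classifier can be illustrated with a plain Shannon entropy over binned values. This is a simplified sketch: the bin width, the cutoff value, and the function names are assumptions, and the entropy rate mentioned on the slide is not computed here.

```python
import math
from collections import Counter

def binned_entropy(values, bin_width):
    """Shannon entropy (in bits) of values after fixed-width binning.

    values: non-empty list of inter-message delays or message sizes.
    """
    bins = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

def classify(values, bin_width, cutoff):
    """Label as human if the entropy score exceeds the cutoff, else bot."""
    score = binned_entropy(values, bin_width)
    return "human" if score > cutoff else "bot"

# Hypothetical inter-message delays in seconds.
periodic_bot = [30, 30, 30, 30]
human = [3, 17, 42, 95, 8, 61, 120, 29]
print(classify(periodic_bot, bin_width=10, cutoff=1.0))
print(classify(human, bin_width=10, cutoff=1.0))
```

A strictly periodic sender falls into a single bin and scores zero, which is why a low score points to a bot under the slide's rule.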

Performance Evaluation
• EN: entropy test
• CCE: corrected conditional entropy test
[Figure annotation: more difficult to detect]

My Comments
• Present measurement results and a chat bot classification system
  - Chat bot behavior is very different from that of human users, which motivates the use of entropy-based classification
  - In addition, a machine learning-based classification scheme is proposed
  - The entropy-based classifier can detect unknown bots, while the machine-learning classifier is more efficient
• Complete work
• Can this approach be extended to detect other bots?

Other Bots
• Chat bots
  - e.g., (good) help operate chat rooms, entertain chat users
  - e.g., (bad) distribute chat spam ("spim"), malware
• Input data modification attacks
  - e.g., online game cheating, click fraud
• Problem:
  - Is a human in control, or is it a bot (computer)?

Is a Bot at the Controls? Detecting Input Data Attacks
Travis Schluessler, Stephen Goglin, Erik Johnson (Intel Corporation)
NetGames 2007 (Workshop on Network and Systems Support for Games)

Detecting Input Data Modification Attacks
• Current methods:
  - CAPTCHA, anti-cheat software
• Proposed approach:
  - Ensure input data enters a system through a physically present human input device (HID)
  - Host-based approach
    * OS/"software stack" independent
  - (Idea) "input data generated by HIDs" must be the same as "input data consumed by an application"
    * If the two data streams differ, some form of illicit modification occurred!
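The core comparison can be shown with a purely logical sketch. It assumes both streams are available as lists of hypothetical event strings and reports the first position where they diverge. How the real system captures and protects the two streams is not described on the slide and is not modeled here.

```python
def find_illicit_modification(generated, consumed):
    """Compare the HID-generated stream with the application-consumed one.

    Returns the index of the first differing event, or None when the
    two streams are identical (no modification detected).
    """
    for i, (g, c) in enumerate(zip(generated, consumed)):
        if g != c:
            return i
    if len(generated) != len(consumed):
        # One stream has extra (injected) or missing (dropped) events.
        return min(len(generated), len(consumed))
    return None

# Hypothetical event streams.
hid = ["key:W", "mouse:+3,-1", "key:SPACE"]
app_ok = ["key:W", "mouse:+3,-1", "key:SPACE"]
app_aimbot = ["key:W", "mouse:+9,-4", "key:SPACE"]
app_injected = hid + ["key:FIRE"]

print(find_illicit_modification(hid, app_ok))
print(find_illicit_modification(hid, app_aimbot))
print(find_illicit_modification(hid, app_injected))
```

An exact-match comparison like this also shows why attacks that only alter the timing of events, without changing their content, would need a different check.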

System Architecture
[Figure: system architecture; labels: HIDs, illicit modification, tamper-evident to modification]

Steady State Operation

Evaluation/Discussion
• Performance overhead (Quake 3)
• Detection limitations:
  - Platform hardware modification
  - Programmable hardware
  - Attacks that alter the timing of the arrival of input data
• Cost:
  - Implementation/deployment
  - Cost of circumvention

My Comments
• This paper describes a method to detect attacks that modify input data coming from HIDs
• The idea is simple and useful, but hard to implement
  - Need to isolate the "input verification service (IVS)"
  - The software application needs to register with and cooperate with the IVS
• Suitable for checking software/system inputs, not outputs (e.g., packets sent)