Netflow and Botnets Steven M. Bellovin Columbia University
Hypothesis
• Most hosts are either clients or servers
  – P2P traffic is an exception
• Bots talk to other bots and thus to the command-and-control node
• By looking for unusual traffic flows (client-to-client traffic that isn’t P2P), we can find bots
Methodology
• Use Netflow data to identify clients and servers
• Classify nodes as clients or servers
• Build a traffic matrix from the data to see which clients talk to which other clients
• Exclude P2P traffic, which is generally identifiable based on flow size
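
The traffic-matrix step above might be sketched as follows. This is only an illustration under stated assumptions: the function name, the `(src, dst, byte_count)` tuple layout, and the `roles` mapping are hypothetical, and the P2P test is passed in by the caller because the slides do not say what flow-size rule is used. The 1,000,000-byte cutoff in the example is an arbitrary placeholder, not a value from the talk.

```python
from collections import defaultdict


def client_to_client_matrix(flows, roles, is_p2p):
    """Sum bytes per (src, dst) pair, keeping only non-P2P client-to-client flows.

    flows:  iterable of (src, dst, byte_count) tuples (hypothetical layout)
    roles:  dict mapping host -> "client" or "server" (from a prior classification step)
    is_p2p: caller-supplied predicate on byte_count; the real rule is not given in the slides
    """
    matrix = defaultdict(int)
    for src, dst, nbytes in flows:
        # Keep only flows where both endpoints were classified as clients.
        if roles.get(src) != "client" or roles.get(dst) != "client":
            continue
        # Drop flows the caller's size-based rule attributes to P2P.
        if is_p2p(nbytes):
            continue
        matrix[(src, dst)] += nbytes
    return dict(matrix)


flows = [
    ("10.0.0.1", "10.0.0.9", 400),        # client -> server: ignored
    ("10.0.0.1", "10.0.0.2", 300),        # client -> client, small
    ("10.0.0.1", "10.0.0.2", 200),        # same pair again
    ("10.0.0.3", "10.0.0.2", 5_000_000),  # client -> client, dropped by the placeholder rule
]
roles = {
    "10.0.0.1": "client",
    "10.0.0.2": "client",
    "10.0.0.3": "client",
    "10.0.0.9": "server",
}

# Placeholder size rule (an assumption for this example only).
matrix = client_to_client_matrix(flows, roles, is_p2p=lambda n: n > 1_000_000)
print(matrix)
```

The pairs that remain in `matrix` are the candidates for closer inspection; how well this works depends on the quality of both the role classification and the P2P rule.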
Netflow
• Originally from Cisco; now implemented by most router vendors
  – Also an IETF “Proposed Standard”
• Records “flow information” for “connections” through a given router: src/dst pairs (addresses and port numbers), length, timing, etc.
• Intended for accounting and for traffic engineering
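
For readers who have not worked with flow data, one way to picture a single record is the small structure below. The class and field names are hypothetical and chosen for readability; they are not the Netflow export format, which carries more fields than the ones the slide lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One unidirectional flow as seen by a router (illustrative fields only)."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    byte_count: int   # "length" of the flow
    start: float      # timestamps in seconds (unit is an assumption)
    end: float

    @property
    def duration(self) -> float:
        # Elapsed time between the first and last packet seen for the flow.
        return self.end - self.start

    def endpoints(self):
        # The (address, port) pair for each side of the flow.
        return (self.src_addr, self.src_port), (self.dst_addr, self.dst_port)


rec = FlowRecord("10.0.0.1", 50000, "10.0.0.9", 80, 300, 100.0, 102.5)
print(rec.duration, rec.endpoints())
```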
Problems with Netflow
• Flows are unidirectional; need two records for complete picture
• This is a consequence of Internet topology; most inter-ISP connections follow asymmetric paths
• Routers often deliver sampled data; can miss flow start/end packets
• Does not give unambiguous indication of client versus server
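
To make the first problem concrete, here is a minimal sketch of stitching unidirectional records into two-way conversations and flagging those for which only one direction was observed. The function names and the `(src_addr, src_port, dst_addr, dst_port, byte_count)` tuple layout are assumptions for this example, and it ignores protocol numbers and timing, which a real tool would also need.

```python
def pair_flows(records):
    """Group unidirectional records by conversation.

    Returns a dict keyed by an order-independent pair of (addr, port) endpoints;
    each value maps a sending endpoint to the bytes it was seen to send.
    """
    conversations = {}
    for src, sport, dst, dport, nbytes in records:
        a, b = (src, sport), (dst, dport)
        # Canonical key: the same conversation maps to one key in either direction.
        key = (a, b) if a <= b else (b, a)
        sent = conversations.setdefault(key, {})
        sent[a] = sent.get(a, 0) + nbytes
    return conversations


def one_sided(conversations):
    """Conversations where records from only one endpoint were seen."""
    return [key for key, sent in conversations.items() if len(sent) < 2]


records = [
    ("10.0.0.1", 50000, "10.0.0.9", 80, 300),    # request direction
    ("10.0.0.9", 80, "10.0.0.1", 50000, 1200),   # matching reply direction
    ("10.0.0.2", 51000, "10.0.0.9", 443, 500),   # reply never observed
]

conversations = pair_flows(records)
print(one_sided(conversations))
```

A one-sided entry does not by itself mean the reply never happened; with asymmetric paths or sampling, the other direction may simply not have been recorded at this router.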
Strategy
• Build tools at Columbia
  – Easy access to machines and data
• Use existing archive of CU Netflow data
  – Unclear if there are botnets present; get classification right first
• Get other Netflow archives (e.g., from predict.org)
• Bring nominally-working code to AT&T to experiment with large-scale datasets
• Compare with previous results from AT&T as a check on correctness
Node Classification
• Must use heuristics
  – Flag field in Netflow data doesn’t show client vs. server
  – Timestamp not useful because of sampling
• Current strategy: look at port number distribution
  – Clients usually use ports 48K-64K
• Considering using node degree
  – But: problems with low-activity hosts?
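
The port-distribution heuristic might look roughly like the sketch below. It is not the classifier used in this work: the function name, the tuple layout, and the 0.8 / 0.2 cutoffs are hypothetical, and reading "48K-64K" as ports 49152 through 65535 (K = 1024) is an assumption.

```python
from collections import defaultdict

# Assumed reading of "48K-64K": K = 1024, so ports 49152..65535.
HIGH_PORT_MIN = 48 * 1024
HIGH_PORT_MAX = 64 * 1024 - 1


def classify_hosts(flows, client_cut=0.8, server_cut=0.2):
    """Label each host by the share of its own ports that fall in the high range.

    flows: iterable of (src_addr, src_port, dst_addr, dst_port) tuples.
    client_cut / server_cut are illustrative thresholds, not values from the slides.
    """
    high = defaultdict(int)
    total = defaultdict(int)
    for src, sport, dst, dport in flows:
        # Each record contributes one port observation for each endpoint.
        for host, port in ((src, sport), (dst, dport)):
            total[host] += 1
            if HIGH_PORT_MIN <= port <= HIGH_PORT_MAX:
                high[host] += 1

    roles = {}
    for host, n in total.items():
        frac = high[host] / n
        if frac >= client_cut:
            roles[host] = "client"
        elif frac <= server_cut:
            roles[host] = "server"
        else:
            roles[host] = "ambiguous"
    return roles


flows = [
    ("10.0.0.1", 50000, "10.0.0.9", 80),
    ("10.0.0.9", 80, "10.0.0.1", 50000),
    ("10.0.0.2", 51000, "10.0.0.9", 443),
    ("10.0.0.3", 52000, "10.0.0.2", 6667),
]

roles = classify_hosts(flows)
print(roles)
```

Here `10.0.0.2` uses a high port in one flow and a low port in another, so it lands in the "ambiguous" bucket. A host with only one or two flows can be labeled on very little evidence, which is the same low-activity concern raised for node degree.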
Classification is Hard
• Simple heuristics have not been satisfactory
• Building visualization tools to help us understand the data
Client: Port Number by Volume
Client: Port Number Scatter Plot
Server: Port Number by Volume
Server: Port Number Scatter Plot
Ambiguous Host
Ambiguous Host Scatter Plot
• Is this the sort of host we’re looking for?
Current Status
• Have basic tools built
• Working with visualization tools to understand the data
• Next steps:
  – Refine classification algorithms
  – Confirm analysis of bots in sample data
  – Try tools on larger dataset