Learning Communication Rules Srikanth Kandula Ranveer Chandra and Dina Katabi

Network Admins. are Groping in the Dark

Focus on Traffic Volume. But, What’s Going On?
• TCP=80%, HTTP=30%
• Traffic follows plan?
• Misconfigurations
• Suspicious Traffic
• Adapt report categories (e.g., AutoFocus)
  – Much traffic from ports 500-600

Besides focusing on volume, learn rules underlying the traffic
(Active) user browsing web, reading/sending mail
(Automatic) SMS scan on a network, Outlook refresh

[Figure: occurrences of flowX and flowY along a time axis t, illustrating a rule between X and Y; example pair: (http, DNS)]

Whenever flowY happens, flowX is likely to occur

If you could learn such rules directly from a trace,
• Infer the actual behavior of applications
  – AFS root servers direct traffic to volume servers evenly
  – mail to the incoming MX is forwarded onto group MXes
• Notice misconfigurations and badness
  – these clients should not be talking on known command-control ports
  – this server should not be responding to DHCP requests
  – this mail server should not attempt connections to non-existent MXes

Report all significant rules with no specific knowledge about a trace

Mining for Rules is Hard
• How to define significance?
  – When is a group of flows interesting enough to report?
• Avoid observer bias but cannot evaluate everything
  – Focus on one server, miss what you are not looking for
• Practical, deal with noise, search quickly

eXpose
1. A scoring function for significance
2. Heuristics that bias search toward high hit-rate
3. Empirical validation on enterprise traces

Overview

[Figure: Packet Trace → Activity Matrix (columns flow1 … flowK, rows time1 … timeR) → Rules]

• Packet trace to Activity Matrix
  o Rows are 1 s windows; columns are flows
  o Is flow active in [time(i-1), time(i))? (at least one packet)
• Association rule mining (X, Y are r.v. for columns)
• Need not worry about interleaving
• Dependencies are at these time-scales (an RTT, a server response)

All windows in the [0.25 s, 2 s] range yield similar rules
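
The packet-trace-to-activity-matrix step can be sketched as follows. This is a minimal illustration, not the eXpose implementation: the (timestamp, flow_id) packet records, the flow labels, and the choice to start the first window at the earliest packet are all assumptions made for the example.

```python
from collections import defaultdict

def activity_matrix(packets, window=1.0):
    """Build a 0/1 activity matrix from (timestamp, flow_id) records.

    Returns (flows, rows): flows lists the column labels, and rows[i][j]
    is 1 if flows[j] sent at least one packet during window i.
    """
    if not packets:
        return [], []
    start = min(ts for ts, _ in packets)        # assumption: windows start at first packet
    flows = sorted({flow for _, flow in packets})
    col = {flow: j for j, flow in enumerate(flows)}
    active = defaultdict(set)                   # window index -> active column indices
    for ts, flow in packets:
        active[int((ts - start) // window)].add(col[flow])
    n_rows = max(active) + 1
    rows = [[1 if j in active[i] else 0 for j in range(len(flows))]
            for i in range(n_rows)]
    return flows, rows

# Hypothetical trace: a DNS flow and an HTTP flow from one client.
packets = [(0.1, "c1:dns"), (0.2, "c1:http"), (1.5, "c1:http"),
           (3.2, "c1:dns"), (3.3, "c1:http")]
flows, rows = activity_matrix(packets)
```

Each column of rows is then one binary random variable (X or Y) for the rule-mining step; changing window lets one try other sizes in the [0.25 s, 2 s] range.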

Which Rules are Significant? (X ⇒ Y)
• High Joint Probability?
  o X, Y may occur very often individually (e.g., breeze, sun shining)
• High Conditional Probability?
  o Say Y occurs only when X does, but both are rare (lottery, buy a jet)
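
A small numeric illustration of why neither measure works alone. The two column pairs below are made-up data, chosen only to mimic the "breeze, sun shining" and "lottery, buy a jet" cases; the helper name probabilities is hypothetical.

```python
def probabilities(x, y):
    """Return (P(X), P(Y), P(X and Y), P(Y|X)) for two 0/1 columns."""
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    pxy = sum(a & b for a, b in zip(x, y)) / n
    p_y_given_x = pxy / px if px else 0.0
    return px, py, pxy, p_y_given_x

# Two very common, nearly independent flows: high joint probability.
common_x = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
common_y = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]

# Two rare flows that co-occur once in ten windows: P(Y|X) = 1.
rare_x = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
rare_y = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

common = probabilities(common_x, common_y)
rare = probabilities(rare_x, rare_y)
```

The common pair has a joint probability of 0.8 although the flows are close to independent (0.9 × 0.9 = 0.81), while the rare pair has a perfect conditional probability backed by a single window.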

Which Rules are Significant? (X ⇒ Y)
• High Joint Probability?
• High Conditional Probability?
• We use mutual information (combines the two)
  * Measures fraction of change in Y due to X
    Score=0, if Y is independent of X
    Score=Max, if Y is fully dependent on X
  * Trades off dependency & frequency
  * Encodes Directionality

[Figure labels: Kerberos, Reservation]
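
The backup slides label the score "Modified JMeasure". As a rough sketch of the starting point, the code below computes the standard, unmodified J-measure of a rule X ⇒ Y from two 0/1 columns; the modifications eXpose applies are not reproduced here, and the function name and test columns are assumptions for illustration.

```python
from math import log2

def j_measure(x, y):
    """Standard J-measure of the rule X => Y for two 0/1 columns.

    J = P(X=1) * sum over y in {0, 1} of
        P(Y=y | X=1) * log2(P(Y=y | X=1) / P(Y=y))
    """
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    if px == 0:
        return 0.0
    p_y1 = sum(a & b for a, b in zip(x, y)) / n / px    # P(Y=1 | X=1)
    score = 0.0
    for cond, prior in ((p_y1, py), (1 - p_y1, 1 - py)):
        if cond > 0 and prior > 0:                      # treat 0 * log(0) as 0
            score += cond * log2(cond / prior)
    return px * score

independent = j_measure([1, 1, 0, 0], [1, 0, 1, 0])    # Y unrelated to X
dependent = j_measure([1, 1, 0, 0], [1, 1, 0, 0])      # Y identical to X
forward = j_measure([1, 0, 0, 0], [1, 1, 0, 0])        # X => Y
backward = j_measure([1, 1, 0, 0], [1, 0, 0, 0])       # Y => X
```

The independent pair scores zero, the identical pair scores highest, and the last two calls differ, which is the directionality property listed on the slide. The leading P(X=1) factor is what trades dependency off against frequency.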

Modifying Scores for Networking
• Negative Correlation
  – Flows with little overlap
  – P(¬Y|X) ≈ 1 leads to high score
• Long Running Flows
  – Large downloads, ssh/remote desktop
  – Trivial overlaps with long flow: P(Y|X) ≈ 1
  – Distinguish new vs. present
  – Present rules reported only if small mismatch in freq.
• Too Many Possibilities
  – Bias, focus on pairs with at least one common IP
  – Miss rules, but hit-rate up 1000x and costs down 10x
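
The "at least one common IP" bias can be sketched as a filter over candidate flow pairs. The flow representation (ip1, port1, ip2, port2) and the addresses are hypothetical, picked only to make the example self-contained.

```python
from itertools import combinations

def shares_ip(flow_a, flow_b):
    """True if two flows, given as (ip1, port1, ip2, port2), share an endpoint IP."""
    return bool({flow_a[0], flow_a[2]} & {flow_b[0], flow_b[2]})

def candidate_pairs(flows):
    """Keep only the flow pairs that have at least one IP address in common."""
    return [(a, b) for a, b in combinations(flows, 2) if shares_ip(a, b)]

dns = ("10.0.0.1", 5000, "10.0.0.53", 53)
web = ("10.0.0.1", 5001, "10.0.0.80", 80)
mail = ("10.0.0.9", 6000, "10.0.0.25", 25)

pairs = candidate_pairs([dns, web, mail])
```

Of the three possible pairs only (dns, web) survives, because both involve 10.0.0.1. Rules between flows with no shared endpoint are never scored, which is the "miss rules" cost the slide accepts in exchange for a smaller search.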

Generics
- Miss, if no client accesses server often
+ Rules that abstract away parts of a flow

[Figure: example rules with the client abstracted]
  Client : Server : Database, with Client abstracted to * (any client)
  Client : Rsrv., Client : Kerberos, with Client abstracted to * on both (any client, but same on both sides)
  Rsrv. = Reservation

To do this automatically,
• what to abstract? (IP addresses at non-server port)
• which pairs to consider for rule?
  – flows match IP, generics match abstracted IP
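
One way to picture the abstraction step: replace the endpoint that is not at a server port with a wildcard, so that flows from different clients collapse into one generic. The SERVER_PORTS set, the tuple layout, and the decision to wildcard the client port along with the IP are assumptions for this sketch; the slide only says that IP addresses at the non-server port are abstracted.

```python
SERVER_PORTS = {25, 53, 80, 88}    # assumption: illustrative list of server ports

def generalize(flow, server_ports=SERVER_PORTS):
    """Abstract the non-server endpoint of (ip1, port1, ip2, port2) to '*'."""
    ip1, port1, ip2, port2 = flow
    if port1 not in server_ports:
        ip1, port1 = "*", "*"
    if port2 not in server_ports:
        ip2, port2 = "*", "*"
    return (ip1, port1, ip2, port2)

# Two different clients contacting the same (hypothetical) Kerberos server.
flow_a = ("10.0.0.1", 5000, "10.0.0.88", 88)
flow_b = ("10.0.0.2", 6123, "10.0.0.88", 88)

generic_a = generalize(flow_a)
generic_b = generalize(flow_b)
```

Both flows map to the same generic, so activity from clients that individually contact the server too rarely can still add up to a rule. The "same client on both sides" constraint from the slide is not modeled here; it would require pairing generics only when the abstracted IPs were equal in the underlying flows.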

Mining for Rules
Techniques extend to arbitrary sized rules: O(f^(n+1))
Instead,
1. Focus on pair-wise rules (simpler is likelier): O(f^2)
2. Group similar rules
   – Eliminate weak rules between strongly connected groups
   – Transitive closure to read off clusters
   – Recursive Spectral Partitioning (VKV’00)

[Figure: rule graph; label: Rule Score]

Rule Mining digests 10^5 to 10^6 flows into 10^2 to 10^3 rule clusters
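
The grouping step can be approximated with a much simpler stand-in than the recursive spectral partitioning cited on the slide: drop rules below a score threshold, then take the transitive closure of what remains with union-find. The threshold, rule tuples, and flow names are assumptions for illustration.

```python
def rule_clusters(rules, min_score):
    """Cluster flows connected by rules scoring at least min_score.

    rules is an iterable of (flow_x, flow_y, score). Flows that appear
    only in weak rules are left out of the result.
    """
    parent = {}

    def find(f):
        parent.setdefault(f, f)
        while parent[f] != f:
            parent[f] = parent[parent[f]]   # path halving
            f = parent[f]
        return f

    for x, y, score in rules:
        if score >= min_score:              # weak rules do not join groups
            parent[find(x)] = find(y)

    clusters = {}
    for f in list(parent):
        clusters.setdefault(find(f), set()).add(f)
    return sorted(clusters.values(), key=sorted)

rules = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.01), ("d", "e", 0.7)]
clusters = rule_clusters(rules, min_score=0.1)
```

The weak c-d rule is discarded, so the flows split into {a, b, c} and {d, e}. A fixed threshold is cruder than spectral partitioning, which decides where to cut from the structure of the rule graph rather than from a single cutoff.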

Recap: eXpose Mines for Rules

[Figure: Packet Trace → Activity Matrix (columns flow1 … flowK, rows time1 … timeR, entries present | new) → Rules (… flowi.new, flowj.present …) → Rule Clusters]

Contributions
Learn all significant rules without prior knowledge
  o Scoring function for rule significance
  o Avoids observer bias, yet stays feasible by focusing on high hit-rate
  o Algorithms to mine and prune

Related Work
Semi-Automated Discovery of App. Session Structure (KJPK’06)
Sherlock (Diagnosing Performance Problems, BCGKMZ’07)
Autofocus (ESV’03)
BLINC (KPF’05)
Stepping Stones (ZP’00)

Learn all significant rules without prior knowledge
  o Avoids observer bias, yet stays feasible by focusing on high hit-rate
  o Scoring function for rule significance
  o Algorithms to mine and prune

Results

Evaluation Setup

[Figure: trace collection points: CSAIL’s access link, access link of conf. LANs, before CSAIL’s servers, inside Microsoft]

• Traces at access and internal server-facing links
  – Packet headers, connection records (Bro), some anon.
• Operational network with 10^3 clients, diverse traffic mix
• Corroborated on test-bed traffic & vetted by admins.
• Ran eXpose on a 2.4 GHz x86 with 8 GB RAM

Rules Discovered by eXpose
• Dependencies for Major Applications

email @ microsoft
  Client.* – PFS1.X
  Client.* – DC.88
  Client.* – PFS2.X
  Client.* – Mail.135
  Client.* – Proxy.80

Rules Discovered by eXpose
• Dependencies for Major Applications

afs @ csail
  C.7001 – *.*
  C.7001 – AFS2.7000
  C.7001 – Root.7003
  C.7001 – AFS1.7000 – Root.7002

Rules Discovered by eXpose
• Dependencies for Major Applications
  – web, e-mail, file-servers, IM, print, video broadcast

web @ microsoft
  Proxy2.80 – *.*
  Proxy3.80 – *.*
  Proxy1.80 – *.*
  Proxy4.80 – *.*

Rules Discovered by eXpose
• Dependencies for Major Applications
  – web, e-mail, file-servers, IM, print, video broadcast
• Configuration Errors & Other Badness

smtp + IDENT @ csail
  Client.113 – MailServer.*
  Client.* – MailServer.25

Rules Discovered by eXpose
• Dependencies for Major Applications
  – web, e-mail, file-servers, IM, print, video broadcast
• Configuration Errors & Other Badness
  – IDENT, Legacy emails, ssh scans, wingate

Legacy email ids @ csail
  UnivMail.* – Old1.25
  UnivMail.* – Old3.25
  UnivMail.* – Old2.25

Rules Discovered by eXpose
• Dependencies for Major Applications
  – web, e-mail, file-servers, IM, print, video broadcast
• Configuration Errors & Other Badness
  – IDENT, Legacy emails, ssh scans, wingate
• Rules for stuff we didn’t know before

Nagios monitors @ csail
  Nagios.7001 – AFS2.7000
  Nagios.* – Mail1.25
  Nagios.7001 – AFS1.7000
  Nagios.* – Mail2.25

Rules Discovered by eXpose
• Dependencies for Major Applications
  – web, e-mail, file-servers, IM, print, video broadcast
• Configuration Errors & Other Badness
  – IDENT, Legacy emails, ssh scans, wingate
• Rules for stuff we didn’t know before
  – Nagios, LLMNR, iTunes

Link level multicast name resolution @ hotspots
  H.137 – Wins.137
  H.* – Multicast.5355
  H.* – DNS.53

Black box: Little prior knowledge about servers, applications, or users
Can evolve

Correctness & Completeness
• False Positives
  – 13% of rule-clusters in CSAIL trace, we couldn’t explain
• False Negatives
  – Main CSAIL Web Server (too many different activities)
  – Dependencies on Personal Web Pages (too little traffic)
  – PlanetLab Traffic (punted)
• Other Limitations
  – IPSec, Anonymized, Cover Traffic
• Extensions
  – Rules repeat over time, and across traces
  – Application whitelisting, Customize Generics

Time to Mine for Rules

[Figure: time to mine for rules vs. # flows (x 10^6)]

At CSAIL’s access link, high fan-out with many distinct flows
Stream Mining Appears Feasible!

[Figure: Packet Trace → eXpose → Rules for frequently reoccurring flow sets]

Learn all significant rules with no specific knowledge
  o Avoids observer bias, but feasible by focusing on high hit-rate
  o Scoring function for rule significance
  o Algorithms to mine and prune

Empirical validation on enterprise traces
• found configurations & protocols that we didn’t know existed
• learnt rules for actual behavior of applications
• found config. errors, bot scans, infected machines

http://research.microsoft.com/~srikanth

Backup

Expanding Search Space (# of flows)… exposes few significant rules!

[Figure: # of Discovered Rules vs. Rule Score (Modified JMeasure)]

Expanding Search Space (# of flows)… exposes few rules & costs a lot in time, memory

[Figure: Memory Footprint (million rules) and Time to Mine Rules (s) vs. # Top Active Flows]

Varying Size of Time Windows

[Figure: # of Discovered Rules vs. Rule Score (Modified JMeasure)]

All window sizes in [0.25 s, 2 s] produce similar rules!

Joint Probability

[Figure: joint probability for all rules X ⇒ Y; axis labels: Prob.(X), Prob.(Y)]