Learning to Detect Computer Intrusions with (Extremely) Few False Alarms
Jude Shavlik and Mark Shavlik

Two Basic Approaches for Intrusion Detection Systems (IDS)
- Pattern Matching
  - If a packet contains "site exec" and ... then sound an alarm
  - Famous example: SNORT.org
  - Weakness: don't (yet) have patterns for new attacks
- Anomaly Detection
  - Usually based on statistics measured during normal behavior
  - Weakness: does anomaly = intrusion?
- Both approaches often suffer from too many false alarms
  - Admins ignore an IDS when flooded by false alarms

How to Get Training Examples for Machine Learning?
- Ideally, get measurements during
  1. normal operation, vs.
  2. intrusions
- However, it is hard to define the space of possible intrusions
- Instead, learn from "positive examples only"
  - Learn what is normal and define all else as anomalous

Behavior-Based Intrusion Detection
- Need to go beyond looking solely at external network traffic and log files
  - File-access patterns
  - Typing behavior
  - Choice of programs run
  - ...
- Like the human immune system, continually monitor and notice "foreign" behavior

Our General Approach
- Identify ≈unique characteristics of each user's/server's behavior
- Every second, measure hundreds of Windows 2000 properties
  - in/out network traffic, programs running, keys pressed, kernel usage, etc.
- Predict Prob( normal | measurements )
- Raise an alarm if recent measurements seem unlikely for this user/server

Goal: Choose a "Feature Space" that Widely Separates the User from the General Population
- Choose a separate set of "features" for each user
- [Figure: the specific user's measurements vs. the general population's, plotted over the possible measurements in the chosen space]

What We're Measuring (in Windows 2000)
- Performance Monitor (Perfmon) data
  - File bytes written per second
  - TCP/IP/UDP/ICMP segments sent per second
  - System calls per second
  - Number of processes, threads, events, ...
- Event-Log entries
- Programs running, CPU usage, working-set size
  - MS Office, Wordpad, Notepad
  - Browsers: IE, Netscape
  - Program-development tools, ...
- Keystroke and mouse events

Temporal Aggregates
- Actual value measured
- Average of the previous 10 values
- Average of the previous 100 values
- Difference between the current value and the previous value
- Difference between the current value and the average of the last 10
- Difference between the current value and the average of the last 100
- Difference between the averages of the previous 10 and the previous 100
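
As a concrete illustration, here is a minimal sketch of these aggregates for a single measured property; the class and field names are invented for this example, not taken from the paper.

```python
from collections import deque

class TemporalAggregates:
    """Derives the seven temporal aggregates listed above for one measured property."""

    def __init__(self):
        self.history = deque(maxlen=100)            # the previous (up to) 100 raw values

    def update(self, value):
        """Return the derived features for the newest one-second measurement."""
        hist = list(self.history) or [value]        # avoid the empty-history corner case
        prev = hist[-1]
        avg10 = sum(hist[-10:]) / len(hist[-10:])   # average of the previous 10 values
        avg100 = sum(hist) / len(hist)              # average of the previous 100 values
        self.history.append(value)
        return {
            "actual": value,
            "avg_prev_10": avg10,
            "avg_prev_100": avg100,
            "diff_from_prev": value - prev,
            "diff_from_avg_10": value - avg10,
            "diff_from_avg_100": value - avg100,
            "diff_avg10_vs_avg100": avg10 - avg100,
        }
```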

Using (Naïve) Bayesian Networks
- Learning the network structure is too CPU-intensive
- Plus, naïve Bayes frequently works best
- Test-set results
  - Naïve Bayes: 59.2% of intrusions detected, about 2 false alarms per day per user
  - This paper's approach: 93.6% detected, 0.3 false alarms per day per user
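
As a rough illustration of the naïve Bayes baseline idea (not the authors' code), a per-feature, count-based profile might look like the following; the class name, binning scheme, and smoothing constant are assumptions for this sketch.

```python
import math

class NaiveBayesProfile:
    """Estimates Prob(measurements | normal) as a product of per-feature bin probabilities."""

    def __init__(self, n_bins=10, smoothing=1.0):
        self.n_bins, self.smoothing = n_bins, smoothing
        self.counts = {}      # feature -> list of per-bin counts seen during normal operation
        self.totals = {}      # feature -> total observations

    def observe(self, feature, bin_index):
        bins = self.counts.setdefault(feature, [0] * self.n_bins)
        bins[bin_index] += 1
        self.totals[feature] = self.totals.get(feature, 0) + 1

    def log_prob_normal(self, example):
        """example: dict of feature -> bin_index.  Higher scores look more 'normal'."""
        logp = 0.0
        for feature, bin_index in example.items():
            bins = self.counts.get(feature, [0] * self.n_bins)
            total = self.totals.get(feature, 0)
            p = (bins[bin_index] + self.smoothing) / (total + self.smoothing * self.n_bins)
            logp += math.log(p)        # naïve independence assumption across features
        return logp
```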

Our Intrusion-Detection Template
- Examine the last W (window width) measurements, one per second
- If score(current measurements) > T, then raise a "mini" alarm
- If the number of "mini" alarms in the window > N, then predict an intrusion
- Use the tuning set to choose good per-user values for T and N
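
A minimal sketch of this template, assuming a per-second stream of measurement vectors and some learned scoring function (for example, the Winnow-weighted vote described on a later slide); the function name and signature are invented for illustration.

```python
def detect(measurement_stream, score, T, N, W):
    """Yield one intrusion prediction per full window of W one-second measurements."""
    mini_alarms = []
    for measurements in measurement_stream:
        mini_alarms.append(score(measurements) > T)   # raise a "mini" alarm for this second?
        if len(mini_alarms) == W:                     # only check when the window is full
            yield sum(mini_alarms) > N                # more than N mini alarms => predict intrusion
            mini_alarms.clear()                       # then clear the window (see "Some Questions" slide)
```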

Methodology – for Training and Evaluating Learned Models
- Replay user X's recorded behavior through the learned models
  - If the model of user X (the machine's owner) raises an alarm, count it as a false alarm
  - If the model of another user Y raises an alarm, count it as a detected "intrusion"
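
A sketch of this replay loop, assuming models[u] is user u's learned detector, data[u] is user u's recorded measurement stream, and fires(model, stream) is a hypothetical helper that replays the stream and reports whether any alarm was raised.

```python
def evaluate(models, data, fires):
    """Replay every user's data through every user's model and tally the outcomes."""
    detections = trials = false_alarms = 0
    for owner, model in models.items():
        for replayed_user, stream in data.items():
            alarmed = fires(model, stream)
            if replayed_user == owner:
                false_alarms += int(alarmed)      # the owner's own behavior set off the alarm
            else:
                trials += 1
                detections += int(alarmed)        # another user's behavior detected as an "intrusion"
    return detections / max(trials, 1), false_alarms
```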

Learning to Score Windows 2000 Measurements (done for each user)
1. Initialize the weight on each feature to 1
2. For each training example:
   a. Set weightedVotesFOR = 0 and weightedVotesAGAINST = 0
   b. If measurement i is "unlikely" (i.e., has low probability), add weight_i to weightedVotesFOR; else add weight_i to weightedVotesAGAINST
   c. If weightedVotesFOR > weightedVotesAGAINST, raise a "mini alarm"
   d. If the decision about an intrusion is incorrect, multiply the weight by ½ on all measurements that voted incorrectly (the Winnow algorithm)
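
A minimal sketch of this Winnow-style update; unlikely(i, x) is an assumed helper that returns True when measurement i of example x falls in a low-probability region of this user's profile, and is_intrusion is the example's true label.

```python
def train_winnow(examples, n_features, unlikely, demotion=0.5):
    """Learn per-feature weights by demoting features that vote incorrectly."""
    weights = [1.0] * n_features                    # step 1: every feature starts at weight 1
    for x, is_intrusion in examples:                # step 2: one pass over the training examples
        voted_for, voted_against = [], []
        votes_for = votes_against = 0.0
        for i in range(n_features):
            if unlikely(i, x):                      # this feature votes FOR an intrusion
                votes_for += weights[i]
                voted_for.append(i)
            else:                                   # this feature votes AGAINST
                votes_against += weights[i]
                voted_against.append(i)
        mini_alarm = votes_for > votes_against
        if mini_alarm != is_intrusion:              # the decision was wrong:
            for i in (voted_for if not is_intrusion else voted_against):
                weights[i] *= demotion              # halve the weights that voted incorrectly
    return weights
```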

Choosing Good Parameter Values
- For each user:
  - Use the training data to estimate probabilities and to weight the individual measurements
  - Try 20 values for T and 20 values for N
  - For each T × N pairing, compute the detection and false-alarm rates on the tuning set
  - Select the T × N pairing that
    1. has a false-alarm rate less than the spec (e.g., 1 per day), and
    2. has the highest detection rate
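
A sketch of this parameter search, assuming a hypothetical evaluate_on_tune_set(T, N) helper that runs the window detector with those thresholds on the tuning data and returns (detection_rate, false_alarms_per_day).

```python
def choose_thresholds(T_values, N_values, evaluate_on_tune_set, spec=1.0):
    """Pick the (T, N) pairing that meets the false-alarm spec with the best detection rate."""
    best = None
    for T in T_values:                              # e.g., 20 candidate score thresholds
        for N in N_values:                          # e.g., 20 candidate mini-alarm counts
            detection, fa_per_day = evaluate_on_tune_set(T, N)
            if fa_per_day >= spec:                  # 1. false-alarm rate must be below spec
                continue
            if best is None or detection > best[0]:
                best = (detection, T, N)            # 2. among those, maximize detection
    return best                                     # (detection_rate, T, N), or None if spec unmet
```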

Experimental Data
- Subjects
  - Insiders: 10 employees at Shavlik Technologies
  - Outsiders: 6 additional Shavlik employees
- Unobtrusively collected data for 6 weeks
  - 7 GBytes archived
- Task: are the current measurements from user X?

Training, Tuning, and Testing Sets
- Very important in machine learning not to use the testing data to optimize parameters!
- Train set: first two weeks of data
  - Build a (statistical) model
- Tune set: middle two weeks of data
  - Can tune to zero false alarms and high detection rates!
  - Choose good parameter settings
- Test set: last two weeks of data
  - Evaluate the "frozen" model
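
A tiny sketch of this chronological split, assuming a week_of(record) helper that maps each measurement record to its week number (0-5); the point is that the split is by time, never random, so the final two weeks stay untouched until the model is frozen.

```python
def split_by_weeks(records, week_of):
    """Split six weeks of per-second measurement records into train/tune/test in time order."""
    train = [r for r in records if week_of(r) in (0, 1)]   # weeks 1-2: build the model
    tune  = [r for r in records if week_of(r) in (2, 3)]   # weeks 3-4: choose T, N, etc.
    test  = [r for r in records if week_of(r) in (4, 5)]   # weeks 5-6: evaluate once
    return train, tune, test
```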

Experimental Results on the Testset

Highly Weighted Measurements (% of time in the Top Ten, across users & experiments)
- Number of Semaphores (43%)
- Logon Total (43%)
- Print Jobs (41%)
- System Driver Total Bytes (39%)
- CMD: Handle Count (35%)
- Excel: Handle Count (26%)
- Number of Mutexes (25%)
- Errors Access Permissions (24%)
- Files Opened Total (23%)
- TCP Connections Passive (23%)
- Notepad: % Processor Time (21%)
- 73 measurements occur more than 10% of the time

Confusion Matrix – Detection Rates (3 of 10 Subjects Shown)
[Table: columns are the replayed "intruders" (A, B, C) and rows are the owners' models (A, B, C); the off-diagonal detection rates shown are 100%, 91%, 100%, 25%, and 94%.]

Some Questions
- What if user behavior changes? (Called "concept drift" in machine learning.)
  - One solution:
    - Assign a "half life" to the counts used to compute the probabilities
    - Multiply the counts by f < 1 each day (10/20 vs. 1000/2000)
- Are the CPU and memory demands too large?
  - Measuring the features and updating the counts takes < 1% of the CPU
  - Tuning of parameters needs to be done off-line
- How often to check for intrusions?
  - Only check when the window is full, then clear the window
  - Else too many false alarms
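
A minimal sketch of the half-life idea, assuming the per-user profile stores a (matching count, total count) pair per feature: decaying both counts once per day leaves the estimated probability unchanged (1000/2000 shrinks toward 10/20) but lets new observations move it much faster.

```python
def decay_counts(profile, f=0.9):
    """profile maps feature -> [matching_count, total_count]; apply one day's decay."""
    for counts in profile.values():
        counts[0] *= f        # multiply the numerator...
        counts[1] *= f        # ...and the denominator by the same f < 1
    return profile
```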

Future Directions
- Measure the system while applying various known intrusion techniques
  - Compare to measurements taken during normal operation
  - Train on known methods 1, ..., N-1
  - Test using data from known method N
- Analyze simultaneous measurements from a network of computers
- Analyze the impact of an intruder's behavior changing the recorded statistics
- Current results: probability of detecting an intruder in the first W seconds

Some Related Work on Anomaly Detection
- Machine learning for intrusion detection
  - Lane & Brodley (1998)
  - Ghosh et al. (1999)
  - Lee et al. (1999)
  - Warrender et al. (1999)
  - Agarwal & Joshi (2001)
  - Typically Unix-based
  - Streams of programs invoked or network traffic analyzed
- Analysis of keystroke dynamics
  - Monrose & Rubin (1997)
  - For authenticating passwords

Conclusions
- Can accurately characterize individual user behavior using simple models based on measuring many system properties
- Such "profiles" can provide protection without too many false alarms
- Separate data into train, tune, and test sets
  - "Let the data decide" good parameter settings, on a per-user basis (including which measurements to use)

Acknowledgements
- DARPA's Insider Threat Active Profiling (ITAP) program, within the ATIAS program
- Mike Fahland, for help with data collection
- The Shavlik, Inc. employees who allowed collection of their usage data

Using Relative Probabilities
- Base the alarm on the ratio:
    Prob( keystrokes | machine owner ) / Prob( keystrokes | population )
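
A sketch of this relative-probability score, assuming hypothetical owner_prob and population_prob estimators for the machine's owner and for the general population; a low (negative) log-ratio suggests the behavior is rare for this owner without being rare for everyone, which favors raising an alarm.

```python
import math

def relative_log_score(measurements, owner_prob, population_prob, eps=1e-9):
    """Sum of per-measurement log-ratios: log P(m | owner) - log P(m | population)."""
    score = 0.0
    for m in measurements:
        score += math.log(owner_prob(m) + eps) - math.log(population_prob(m) + eps)
    return score      # compare against a tuned threshold; low values favor an alarm
```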

Value of Relative Probabilities
- Using relative probabilities
  - Separates "rare for this user" from "rare for everyone"
  - An example of variance reduction
    - Reduce the variance in a measurement by comparing it to another (e.g., paired t-tests)

Tradeoff between False Alarms and Detected Intrusions (ROC Curve)
[ROC curve; the "spec" marker indicates the false-alarm specification.] Note: the left-most value results from ZERO tune-set false alarms.

Conclusions
- Can accurately characterize individual user behavior using simple models based on measuring many system properties
- Such "profiles" can provide protection without too many false alarms
- Separate data into train, tune, and test sets
  - "Let the data decide" good parameter settings, on a per-user basis (including which measurements to use)
- Normalize probabilities by the general-population probabilities
  - Separates "rare for this user (or server)" from "rare for everyone"

Outline
- Approaches for Building Intrusion-Detection Systems
- A Bit More on What We Measure
- Experiments with Windows 2000 Data
- Wrap-up