Kargus: A Highly-scalable Software-based Intrusion Detection System
M. Asim Jamshed*, Jihyung Lee†, Sangwoo Moon†, Insu Yun*, Deokjin Kim‡, Sungryoul Lee‡, Yung Yi†, KyoungSoo Park*
* Networked & Distributed Computing Systems Lab, KAIST
† Laboratory of Network Architecture Design & Analysis, KAIST
‡ Cyber R&D Division, NSRI
Network Intrusion Detection Systems (NIDS)
• Detect known malicious activities
  – Port scans, SQL injections, buffer overflows, etc.
• Deep packet inspection
  – Detect malicious signatures (rules) in each packet
• Desirable features
  – High performance (> 10 Gbps) with precision
  – Easy maintenance
    • Frequent ruleset updates
Hardware vs. Software
• H/W-based NIDS
  – Specialized hardware: ASIC, TCAM, etc.
  – High performance
  – Expensive (annual servicing costs)
    • IDS/IPS sensors (10s of Gbps): ~US$20,000-60,000
    • IDS/IPS M8000 (10s of Gbps): ~US$10,000-24,000
  – Low flexibility
• S/W-based NIDS
  – Commodity machines
  – High flexibility
  – Low performance (open-source S/W: ≤ ~2 Gbps)
    • DDoS/packet drops
Goals
• High performance
• The benefits of a S/W-based NIDS
  – Commodity machines
  – High flexibility
Typical Signature-based NIDS Architecture
Example rule:
    alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS 80 (msg:"possible attack attempt BACKDOOR optix runtime detection"; content:"/whitepages/page_me/100.html"; pcre:"/body=\x2521\x2521Optix\s+Pro\s+v\d+\x252E\d+\s+\S+\s+Server\s+Online\x2521\x2521/")
Pipeline (sketched in code below):
• Packet acquisition
• Preprocessing: decode, flow management, reassembly
• Multi-string pattern matching: failure → innocent flow; success → continue
• Rule options evaluation (e.g., PCRE): failure → innocent flow; success → output malicious flow
The multi-string matching and rule options evaluation stages are the bottlenecks.
* PCRE: Perl Compatible Regular Expression
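To make the two-stage flow concrete, here is a minimal C sketch of a single-rule check: a cheap literal-content prefilter stands in for the multi-string (Aho-Corasick) stage, and the expensive PCRE option runs only on a prefilter hit. The rule_matches() name and the memmem() prefilter are illustrative assumptions, not Kargus code; pcre_exec() is the classic PCRE1 API call.

    #define _GNU_SOURCE     /* for memmem() */
    #include <string.h>
    #include <pcre.h>

    /* Two-stage check for one rule: prefilter on the literal 'content'
     * string, then run the 'pcre' option only on a hit. */
    static int rule_matches(pcre *re, const char *payload, int len)
    {
        static const char content[] = "/whitepages/page_me/100.html";
        int ovector[30];

        /* Stage 1: string match; most innocent traffic exits here. */
        if (memmem(payload, len, content, sizeof(content) - 1) == NULL)
            return 0;
        /* Stage 2: rule options (PCRE) evaluation. */
        return pcre_exec(re, NULL, payload, len, 0, 0, ovector, 30) >= 0;
    }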
Contributions
• Goal: a highly-scalable software-based NIDS for high-speed networks (slow software NIDS → fast software NIDS)
• Bottlenecks and solutions
  – Inefficient packet acquisition → multi-core packet acquisition
  – Expensive string & PCRE pattern matching → parallel processing & GPU offloading
• Outcome: the fastest S/W signature-based IDS, at 33 Gbps
  – 100% malicious traffic: 10 Gbps
  – Real network traffic: ~24 Gbps
Challenge 1: Packet Acquisition
• Default packet module: the Packet CAPture (PCAP) library
  – Unsuitable for multi-core environments
  – Low performance: 0.4-6.7 Gbps packet RX bandwidth at 100% CPU utilization*
  – High power consumption
• A multi-core packet capture library is required
▲ Setup: 12 cores serving four 10 Gbps NICs (A-D)
* Intel Xeon X5680, 3.33 GHz, 12 MB L3 cache
Solution: PacketShader I/O*
• PacketShader I/O
  – Uniformly distributes packets based on flow information via RSS hashing
    • Source/destination IP addresses, port numbers, protocol ID
  – One core can read packets from the RSS queues of multiple NICs
  – Reads packets in batches (32-4096); see the RX loop sketch below
• Symmetric Receive-Side Scaling (RSS)
  – Passes the packets of one connection to the same queue
▲ Setup: cores 1-5 each reading per-NIC RSS queues (RxQ A1-A5, RxQ B1-B5) on two 10 Gbps NICs
• Results: packet RX bandwidth 0.4-6.7 Gbps → 40 Gbps; CPU utilization 100% → 16-29%
* S. Han et al., "PacketShader: a GPU-accelerated software router", ACM SIGCOMM 2010
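A minimal per-core RX loop in the style of the publicly released PacketShader I/O library gives the flavor of batched acquisition. Treat the exact structure and function names (ps_handle, ps_queue, ps_init_handle, ps_attach_rx_device, ps_alloc_chunk, ps_recv_chunk) as assumptions from that release, and engine_consume() as a placeholder:

    #include "psio.h"   /* PacketShader I/O user-level header (assumed path) */

    int rx_loop(int ifindex, int queue_idx)
    {
        struct ps_handle handle;
        struct ps_queue queue = { .ifindex = ifindex, .qidx = queue_idx };
        struct ps_chunk chunk;

        ps_init_handle(&handle);
        ps_attach_rx_device(&handle, &queue);   /* bind this core to one RSS queue */
        ps_alloc_chunk(&handle, &chunk);

        for (;;) {
            chunk.cnt = 64;                     /* read up to 64 packets per call */
            chunk.recv_blocking = 1;
            int n = ps_recv_chunk(&handle, &chunk);
            if (n > 0)
                engine_consume(&chunk, n);      /* hand the whole batch to the engine */
        }
    }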
Challenge 2: Pattern Matching
• CPU-intensive tasks for serial packet scanning
• Major bottlenecks
  – Multi-string matching (Aho-Corasick phase)
  – PCRE evaluation (if the 'pcre' rule option exists in a rule)
• On an Intel Xeon X5680 (3.33 GHz, 12 MB L3 cache)
  – Aho-Corasick analyzing bandwidth per core: 2.15 Gbps
  – PCRE analyzing bandwidth per core: 0.52 Gbps
Solution: GPU for Pattern Matching
• GPUs
  – Contain 100s of SIMD processors (512 cores on an NVIDIA GTX 580)
  – Ideal for parallel data processing without branches
• DFA-based pattern matching on GPUs (see the sketch below)
  – Multi-string matching using the Aho-Corasick algorithm
  – PCRE matching
• Pipelined execution in CPU/GPU
  – Concurrent copy and execution
  – Each IDS engine thread offloads work to a GPU dispatcher thread through multi-string matching and PCRE matching queues
• Results: Aho-Corasick bandwidth 2.15 Gbps → 39 Gbps; PCRE bandwidth 0.52 Gbps → 8.9 Gbps
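The core of both offloaded stages is a branch-light DFA table walk; a minimal C sketch follows. On the GPU, each SIMD thread runs this same loop over a different packet, so hundreds of packets are scanned concurrently. The flat 256-entries-per-state table layout and the final-state flag bit are illustrative assumptions:

    #include <stdint.h>

    #define FINAL_STATE_BIT 0x80000000u

    /* Walk a DFA transition table over the packet payload; return 1 as
     * soon as any pattern (an accepting state) is reached. */
    static int dfa_match(const uint32_t *next_state /* [num_states][256] */,
                         const uint8_t *payload, int len)
    {
        uint32_t state = 0;
        for (int i = 0; i < len; i++) {
            state = next_state[(state & ~FINAL_STATE_BIT) * 256 + payload[i]];
            if (state & FINAL_STATE_BIT)
                return 1;   /* match: hand off to rule options evaluation */
        }
        return 0;
    }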
Optimization 1: IDS Architecture
• How to best utilize the multi-core architecture?
• Pattern matching is the eventual bottleneck (GNU gprof results):
  – acsmSearchSparseDFA_Full: 51.56% (multi-string matching)
  – List_GetNextState: 13.91% (multi-string matching)
  – mSearch: 9.18% (multi-string matching)
  – in_chksum_tcp: 2.63% (preprocessing)
• → Run the entire engine on each core
Solution: Single-process Multi-thread
• Runs multiple IDS engine threads & GPU dispatcher threads concurrently
  – Shared address space
  – Less GPU memory consumption (1/6 the GPU memory usage)
  – Higher GPU utilization & shorter service latency
▲ Layout: one engine thread (packet acquisition → preprocess → multi-string matching → rule option evaluation) pinned to each of cores 1-5; the GPU dispatcher thread on core 6 (thread-spawning sketch below)
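A sketch of the single-process multi-thread layout: one engine thread pinned per CPU core, all sharing one address space (and thus one copy of the GPU-side matching tables). This uses the standard GNU pthread affinity extension; engine_main() is a placeholder, not Kargus code:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void *engine_main(void *arg);   /* per-core IDS engine loop */

    void spawn_engines(int num_cores)
    {
        for (long core = 0; core < num_cores; core++) {
            pthread_t tid;
            pthread_attr_t attr;
            cpu_set_t cpus;

            CPU_ZERO(&cpus);
            CPU_SET(core, &cpus);                    /* pin this engine to one core */
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
            pthread_create(&tid, &attr, engine_main, (void *)core);
        }
    }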
Architecture
• Non-Uniform Memory Access (NUMA)-aware
• Core framework as deployed on a dual hexa-core system
• Can be reconfigured for other NUMA set-ups
▲ Kargus configuration on a dual-node NUMA machine (hexa-core per node) with 4 NICs and 2 GPUs
Optimization 2: GPU Usage
• Caveats of GPU offloading
  – Long per-packet processing latency: buffering in the GPU dispatcher
  – More power consumption (NVIDIA GTX 580: 512 cores)
• Use:
  – the CPU when the ingress rate is low (GPU idle)
  – the GPU when the ingress rate is high
Solution: Dynamic Load Balancing
• Load balancing between CPU & GPU, driven by the length of the internal packet queue (per engine)
  – Reads packets from the NIC queues every cycle
  – Analyzes a small number of packets per cycle at low load (chunk sizes a < b < c, switched at queue-length thresholds α, β, γ)
  – Increases the analyzing rate as the queue length grows
  – Activates the GPU only when the queue length keeps increasing (see the sketch below)
• Per-packet latency: 13 μs on the CPU vs. 640 μs with the GPU
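A minimal C sketch of the queue-length-driven switch. The thresholds ALPHA < BETA < GAMMA and chunk sizes A < B < C mirror the slide's α/β/γ and a/b/c; the concrete values are illustrative assumptions:

    enum { ALPHA = 256, BETA = 1024, GAMMA = 4096 };   /* queue-length thresholds */
    enum { A = 32, B = 128, C = 512 };                 /* packets analyzed per cycle */

    struct cycle_plan { int chunk; int use_gpu; };

    struct cycle_plan plan_cycle(int queue_len)
    {
        struct cycle_plan p = { A, 0 };
        if (queue_len > ALPHA) p.chunk = B;            /* queue building up: analyze more */
        if (queue_len > BETA)  p.chunk = C;
        if (queue_len > GAMMA) p.use_gpu = 1;          /* CPU can't keep up: offload */
        return p;
    }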
Optimization 3: Batched Processing
• Huge per-packet processing overhead
  – > 10 million packets per second for small-sized packets at 10 Gbps
  – Reduces overall processing throughput
• Function call batching (sketched below)
  – Reads a group of packets from the RX queues at once
  – Passes the whole batch to each function: Decode(p), Preprocess(p), Multistring_match(p) become Decode(list-p), Preprocess(list-p), Multistring_match(list-p)
  – 2x faster processing rate
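A sketch of function-call batching: each pipeline stage is invoked once per batch rather than once per packet, amortizing the per-call overhead across the whole group. The stage functions are placeholders for the corresponding engine stages:

    struct packet;

    void decode_batch(struct packet **pkts, int n);
    void preprocess_batch(struct packet **pkts, int n);
    void multistring_match_batch(struct packet **pkts, int n);

    /* One call per stage for all n packets, instead of 3*n calls. */
    void process_batch(struct packet **pkts, int n)
    {
        decode_batch(pkts, n);
        preprocess_batch(pkts, n);
        multistring_match_batch(pkts, n);
    }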
Kargus Specifications
• Per NUMA node (x 2):
  – 12 GB DRAM (3 GB x 4): $100
  – Intel X5680 3.33 GHz (hexa-core), 12 MB L3 NUMA-shared cache: $1,210
  – NVIDIA GTX 580 GPU: $370
  – Intel 82599 10 Gigabit Ethernet adapter (dual port): $512
• Total cost (incl. server board): ~$7,000
IDS Benchmarking Tool
• Generates packets at line rate (40 Gbps)
  – Random TCP packets (innocent)
  – Attack packets generated from an attack ruleset
• Supports packet replay from PCAP files
• Useful for performance evaluation
Kargus Performance Evaluation
• Micro-benchmarks
  – Input traffic rate: 40 Gbps
  – Evaluate Kargus (~3,000 HTTP rules) against:
    • Kargus CPU-only (12 engines)
    • Snort with PF_RING
    • MIDeA*
• Refer to the paper for more results
* G. Vasiliadis et al., "MIDeA: a multi-parallel intrusion detection architecture", ACM CCS 2011
Innocent Traffic Performance
• 2.7-4.5x faster than Snort
• 1.9-4.3x faster than MIDeA
▲ Figure: throughput (Gbps) vs. packet size (64-1518 B) for MIDeA, Snort w/ PF_RING, Kargus CPU-only, and Kargus CPU/GPU, with the actual payload-analyzing bandwidth annotated
Malicious Traffic Performance
• 5x faster than Snort
▲ Figure: throughput (Gbps) vs. packet size (64-1518 B) for Kargus and Snort+PF_RING at 25%, 50%, and 100% malicious traffic
Real Network Traffic
• Three 10 Gbps LTE backbone traces of a major ISP in Korea
  – Duration of each trace: 30 min - 1 hour
  – TCP/IPv4 traffic: 84 GB of PCAP traces, 109.3 million packets, 845K TCP sessions
• Total analyzing rate: 25.2 Gbps
  – Bottleneck: flow management (preprocessing)
Effects of Dynamic GPU Load Balancing
• Varying incoming traffic rates (packet size: 1518 B)
▲ Figure: power consumption (Watts) vs. offered incoming traffic (0-33 Gbps) for Kargus w/o LB (polling), w/o LB, and w/ LB; load balancing cuts power consumption by 8.7-20%
Conclusion
• Software-based NIDS
  – Based on commodity hardware
  – Competes with hardware-based counterparts
    • > 25 Gbps (real traffic), > 33 Gbps (synthetic traffic)
  – 5x faster than previous S/W-based NIDS
  – Power efficient
  – Cost effective: ~US$7,000
Thank You
fast-ids@list.ndsl.kaist.edu
https://shader.kaist.edu/kargus/
Backup Slides
Kargus vs. MIDeA (update: MIDeA → Kargus → outcome)
• Packet acquisition: PF_RING → PacketShader I/O → 70% lower CPU utilization
• Detection engine: GPU support for Aho-Corasick & PCRE → 65% faster detection rate
• Architecture: process-based → thread-based → 1/6 the GPU memory usage
• Batch processing: batching only for the detection engine (GPU) → batching from packet acquisition to output → 1.9x higher throughput
• Power efficiency: always GPU (offloads unless the packet size is too small) → opportunistic offloading to GPUs based on the ingress traffic rate → 15% power saving
* G. Vasiliadis, M. Polychronakis, and S. Ioannidis, "MIDeA: a multi-parallel intrusion detection architecture", ACM CCS 2011
Receive-Side Scaling (RSS)
• RSS uses the Toeplitz hash function with a random secret key (RSK)
• The slide's pseudocode, made concrete in C: for every set bit of the input, XOR in the 32-bit window of the key aligned at that bit position, then slide the window one bit

    #include <stdint.h>

    uint32_t compute_rss_hash(const uint8_t *input, int len, const uint8_t *rsk)
    {
        uint32_t ret = 0;
        /* Left-most 32 bits of the RSK as the initial key window. */
        uint32_t key = ((uint32_t)rsk[0] << 24) | (rsk[1] << 16) | (rsk[2] << 8) | rsk[3];

        for (int i = 0; i < len * 8; i++) {
            if (input[i / 8] & (0x80 >> (i % 8)))
                ret ^= key;
            key <<= 1;                                /* shift the key window left */
            if (rsk[(i / 8) + 4] & (0x80 >> (i % 8)))
                key |= 1;                             /* pull in the next key bit */
        }
        return ret;
    }
Symmetric Receive-Side Scaling
• Update the RSK so that both directions of a connection hash identically (Shinae et al.): repeating the 16-bit word 0x6d5a throughout the key makes the Toeplitz hash symmetric
▲ Figure: grid of 16-bit key words with 0x6d5a appearing at each window alignment
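A quick check of the symmetry property, reusing compute_rss_hash() from the previous slide. With a 16-bit-periodic key, swapping the source and destination fields (whose offsets differ by multiples of 16 bits) leaves the hash unchanged, so both directions of a TCP connection land in the same RSS queue. The tuple values here are illustrative:

    #include <assert.h>
    #include <stdint.h>

    void check_symmetry(void)
    {
        /* 40-byte RSK with 0x6d5a repeated throughout. */
        uint8_t rsk[40];
        for (int i = 0; i < 40; i += 2) { rsk[i] = 0x6d; rsk[i + 1] = 0x5a; }

        /* TCP 4-tuple: src IP, dst IP, src port, dst port (12 bytes). */
        uint8_t fwd[12] = { 10,0,0,1,  10,0,0,2,  0x1f,0x90,  0xc3,0x50 };
        uint8_t rev[12] = { 10,0,0,2,  10,0,0,1,  0xc3,0x50,  0x1f,0x90 };

        /* Both directions of the connection produce the same hash. */
        assert(compute_rss_hash(fwd, 12, rsk) == compute_rss_hash(rev, 12, rsk));
    }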
Why use a GPU?
• A CPU spends much of its die area on control logic and caches; a GPU devotes it to ALUs
• Xeon X5680: 6 cores vs. GTX 580: 512 cores
* Slide adapted from NVIDIA CUDA C Programming Guide Version 4.2 (Figure 1-2)
GPU Microbenchmarks - Aho-Corasick
▲ Figure: throughput (Gbps) vs. batch size (32-16,384 pkts/batch); GPU throughput (2 B per entry) reaches 39 Gbps, vs. 2.15 Gbps on the CPU
GPU Microbenchmarks - PCRE
▲ Figure: throughput (Gbps) vs. batch size (32-16,384 pkts/batch); GPU throughput reaches 8.9 Gbps, vs. 0.52 Gbps on the CPU
Effects of NUMA-aware Data Placement
• Minimal use of global variables
  – Avoids compulsory cache misses
  – Eliminates cross-NUMA cache-bouncing effects
▲ Figure: performance speedup (up to ~2.8x) vs. packet size (64-1518 B) for innocent and malicious traffic
CPU-only Analysis for Small-sized Packets
• Offloading small-sized packets to the GPU is expensive
  – Contention with the GPU over page-locked, DMA-accessible memory
  – The GPU's operational cost for packet metadata increases
▲ Figure: total and pattern-matching latency for CPU vs. GPU across 64-128 B packets; the GPU is slower below a crossover around 82 B
Challenge 1: Packet Acquisition (backup)
• Default packet module: the Packet CAPture (PCAP) library
  – Unsuitable for multi-core environments
  – Low performance
▲ Figure: receiving throughput (Gbps) and CPU utilization (%) vs. packet size (64-1518 B); PCAP and polling reach only 0.4-6.7 Gbps at ~100% CPU
Solution: PacketShader* I/O (backup)
▲ Figure: receiving throughput (Gbps) and CPU utilization (%) vs. packet size (64-1518 B); PSIO sustains up to 40 Gbps with far lower CPU utilization, while PCAP and polling peak at 6.7 Gbps at ~100% CPU