Node.fz: Fuzzing the Server-Side Event-Driven Architecture

Node.fz: Fuzzing the Server-Side Event-Driven Architecture
James Davis, Arun Thekumparampil*, Dongyoon Lee
Department of Computer Science (*Electrical and Computer Engineering)
-1 -

This talk will answer three questions
1. Why should you care about the EDA?
2. What kinds of bugs happen in EDA programs?
3. How can we more effectively catch EDA bugs?
-2 -

Two Main Contributions
1. Concurrency bug study
2. Node.fz
-3 -

What is the Event-Driven Architecture? -4 -

The One Thread Per Client Architecture (OTPC)
[Diagram labels: Dispatcher; many threads; "Handle request"]
-5 -

The Event-Driven Architecture (EDA)
[Diagram: pending events feed the event loop; the event loop offloads work to the worker pool, which returns completed work]
-6 -

The key difference is multiplexing
OTPC: Clients get dedicated resources; preemptive multi-tasking
EDA: Clients share resources (multiplexing); cooperative multi-tasking
Tradeoff: efficiency vs. reliability
-7 -

Why should you care about the Event-Driven Architecture? -8 -

Node.js: A Server-Side JavaScript EDA Framework
• “Full stack JavaScript” (Ryan Dahl, 2009)
• 3.5M+ developers (April 2016)
• 450K+ modules (March 2017)
• 2B+ module downloads/week (March 2017)
-9 -

What can go wrong in the EDA? - 10 -

Programming in the EDA is different
[Diagram labels: event queue; event loop (single-threaded); task queue; worker pool (k threads); done queue]
- 11 -

Research Question 1
What are race conditions like in the server-side EDA?
The bug study
- 12 -

Example: Atomicity Violation
    var NEXT_ID = 0;
    function onConnect (client) {
      client.id = NEXT_ID;        // read NEXT_ID now ...
      setTimeout(nextStep, 1);    // ... but update it in a later callback
    }
    function nextStep () {
      NEXT_ID++;
    }
[Illustration: successive client ids 0, 0, 1. Two clients that connect before nextStep runs share id 0.]
- 13 -
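The bug-study table later in the deck lists "R/W in same callback" as the fix for this class of bug. A minimal sketch of that fix applied to the example above; the two test clients `a` and `b` and the logging in `nextStep` are assumptions added for illustration:

```javascript
// Sketch of the "R/W in same callback" fix: read and update NEXT_ID
// together, so no other callback can run between the two steps.
var NEXT_ID = 0;

function onConnect(client) {
  client.id = NEXT_ID; // read ...
  NEXT_ID++;           // ... and write before yielding to the event loop
  setTimeout(nextStep, 1, client);
}

function nextStep(client) {
  // Later asynchronous work no longer touches NEXT_ID.
  console.log('client', client.id, 'is set up');
}

// Two clients connecting back to back now receive distinct ids.
var a = {};
var b = {};
onConnect(a);
onConnect(b);
```

Because a callback runs to completion on the single-threaded event loop, keeping the read and the write in one callback makes the pair atomic without any lock.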

Example: Ordering Violation
    var done = 0;
    fs.readFile(f1, afterRead);
    fs.readFile(f2, afterRead);
    function afterRead (f) {
      if (f === f2)    // assumes f2 always completes last
        done = 1;
    }
[Diagram: after start, the reads of f1 and f2 pass through the task queue and the done queue in either order, so it is uncertain whether both are finished when done = 1 is set ("Done ?= 1")]
- 14 -
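The race in this example can be reproduced without real file I/O. The sketch below is a toy model, not the slide's code: `runOrderingExample` is a hypothetical helper that completes the two reads in a chosen order and records which files had finished when `done` was set.

```javascript
// Toy model of the ordering violation: complete the reads in a given
// order and snapshot which files were finished when `done` flipped to 1.
function runOrderingExample(completionOrder) {
  var done = 0;
  var finished = [];           // files whose read has completed so far
  var finishedWhenDone = null; // snapshot taken when `done` is set

  function afterRead(f) {
    if (f === 'f2') {          // BUG: assumes f2 always completes last
      done = 1;
      finishedWhenDone = finished.slice();
    }
  }

  completionOrder.forEach(function (f) {
    finished.push(f);
    afterRead(f);
  });
  return finishedWhenDone;
}

console.log(runOrderingExample(['f1', 'f2'])); // [ 'f1', 'f2' ]: f1 was finished
console.log(runOrderingExample(['f2', 'f1'])); // [ 'f2' ]: done set before f1 finished
```

Both completion orders are legal, because the worker pool gives no guarantee that reads finish in the order they were started.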

Bug study overview
• 12 bugs from GitHub
• From Node.js applications and npm modules
• Patterns, manifestations, and fixes
1. Races on many kinds of events
   Tools must span the entire Node.js framework, not just JavaScript
2. Races on shared memory and system resources
   Not like the client-side EDA
3. Races can have severe consequences
   Affect all clients, not just one
Much more in the paper
- 15 -

Research Question 2
How can we more easily identify race conditions in Node.js applications?
Node.fz
- 16 -

Node.fz scales to the server
Thousands or millions of events
Random, not exhaustive, schedule exploration
“Schedule fuzzing”
- 17 -

Original Node.js (libuv) architecture
[Diagram labels: event queue; event loop (single-threaded); task queue; worker pool (k threads); done queue]
- 18 -

1: Add a scheduler
[Diagram: a scheduler (marked 1) is added to the event queue, event loop (single-threaded), task queue, worker pool (k threads), and done queue]
- 19 -

2: Add scheduling hooks
[Diagram: same components (scheduler, event loop (single-threaded), worker pool (k threads), event queue, task queue, done queue), with hooks marked 2 at two points]
- 20 -

3: Serialize callbacks
[Diagram: the worker pool is reduced to 1 thread (change marked 3); scheduler, event loop (single-threaded), event queue, task queue, and done queue remain]
- 21 -

4: Remove done queue
[Diagram: the done queue is gone (change marked 4); scheduler, event loop (single-threaded), worker pool (1 thread), event queue, and task queue remain]
- 22 -

5: Fuzz! – node.fz
[Diagram: scheduler, event loop (single-threaded), worker pool (1 thread), event queue, and task queue, with fuzzing marked 5 at two points]
- 23 -
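The idea behind steps 1 to 5 can be illustrated with a toy scheduler that picks a random ready callback instead of the oldest one. This is only a sketch of schedule fuzzing in plain JavaScript, not Node.fz's implementation (the slides place the real scheduler inside libuv); `runFuzzed` and `trial` are hypothetical names, and the application under test is the NEXT_ID code from the Atomicity Violation slide.

```javascript
// Toy schedule fuzzer: run ready callbacks in a random order.
function runFuzzed(initialCallbacks, rand) {
  var ready = initialCallbacks.slice();
  var ctx = {
    // Callbacks use this to enqueue follow-up work (stands in for setTimeout).
    schedule: function (cb) { ready.push(cb); }
  };
  while (ready.length > 0) {
    // Pick a random ready callback instead of the oldest one.
    var i = Math.floor(rand() * ready.length);
    var cb = ready.splice(i, 1)[0];
    cb(ctx);
  }
}

// The atomicity violation from the earlier slide, run under the toy scheduler.
function trial(rand) {
  var NEXT_ID = 0;
  var ids = [];
  function onConnect(ctx) {
    ids.push(NEXT_ID);                                // read NEXT_ID now ...
    ctx.schedule(function nextStep() { NEXT_ID++; }); // ... update it later
  }
  runFuzzed([onConnect, onConnect], rand);
  return ids;
}

var duplicates = 0;
for (var t = 0; t < 100; t++) {
  var ids = trial(Math.random);
  if (ids[0] === ids[1]) duplicates++;
}
console.log('duplicate ids in', duplicates, 'of 100 fuzzed schedules');
```

Passing `rand` as a parameter is a design choice for this sketch: swapping `Math.random` for a seeded generator would make a failing schedule replayable.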

Node.fz improves bug reproduction rates
[Bar chart: bug reproduction rate (0 to 1) per studied bug, e.g. MKD, KUE, CLF, AKA, for three configurations: node.js, node.fz (no fuzzing), node.fz (fuzzing)]
- 24 -

Stuff I didn’t talk about
• The details of the bug study
• “Commutative” Ordering Violations
• In-depth discussion of how Node.js is implemented, and all the sources of non-determinism
• Node.fz is a legal, viable alternative to Node.js
• Node.fz’s parameters
• Node.fz measurably increases schedule exploration
• Node.fz exposed two new bugs
• Tuning Node.fz parameters can increase bug reproduction rate
• “Guided fuzzing”
• Evaluation of performance overhead
- 25 -

Closing thoughts
1. You should care about the EDA.
2. EDA race conditions are due to multiplexing.
3. Schedule “fuzzing” is simple and effective.
- 26 -

Additional material - 27 -

Node.js (and the EDA) is booming
[Line chart: module counts for different languages, 2011 to 2017, y-axis 0 to 500K. Series: CPAN, Gopm (go), Maven Central (Java), npm (node.js), Packagist (PHP), PyPI]
www.modulecounts.com, 27 March 2017
- 28 -

Bug study in the server-side EDA
• npm modules, Node.js applications
• Patterns, manifestations, and fixes
• Searched GitHub: race, JavaScript, closed bugs
• Studied 12 well-documented bugs (hard to come by)
Name                Abbr.  LoC    Dl/mo   Descr.
cinovo-logger-file  CLF    0.9K   111     Logging
mkdirp              MKD    0.5K   23.3M   mkdir -p
agentkeepalive      AKA    1.9K   194K    HTTP keepalive agent
kue                 KUE    6.6K   69K     Priority job queue (Redis)
restify             RST    5.5K   232K    Help building RESTful APIs
…                   …      …      …       …
- 29 -

Selected Findings from Bug Study
Abbr.  Type   Racing Events  Race on      Impact               Fix
CLF    AV     FS-Call        Variable     Duplicate file       R/W in same callback
AKA    AV     NW-Timer       Variable     Throws error         R/W in same callback
MKD    AV     FS-FS          File system  No mkdir             Check error code
KUE    OV     NW-NW          Database     Job repeats          Order async calls
RST    (C)OV  FS-X           Array        Incomplete response  Use an async barrier
…      …      …              …            …                    …
AV: atomicity violation | OV: ordering violation | (C)OV: “commutative” ordering violation
- 30 -

“Commutative” Ordering Violations
    var fs = require('fs');
    var N = 4;
    var i;
    for (i = 1; i <= N; i++)
      start(i);
    function start (i) {
      fs.readFile('/tmp/f', function () {
        if (i === N) {
          /* BUG -- not finished! */
          nextStep();
        }
      });
    }
[Diagram columns: Goal, Code, Result. The Goal and Result columns show completion orders of reads 1 to 4.]
- 31 -
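The bug-study table lists "Use an async barrier" as the fix for this kind of violation. A minimal sketch of such a barrier, assuming a hypothetical `startAll` helper that takes the read function as a parameter so the completion order can be simulated without touching the file system:

```javascript
// Async barrier sketch: call nextStep after the LAST completion,
// whichever read that happens to be.
function startAll(N, read, nextStep) {
  var remaining = N;
  for (var i = 1; i <= N; i++) {
    read(i, function () {
      remaining--;
      if (remaining === 0) nextStep(); // all N reads have completed
    });
  }
}

// Simulate reads that complete out of order: 1, 2, 4, 3.
var pending = {};
var calls = [];
startAll(
  4,
  function (i, cb) { pending[i] = cb; },   // "start" read i, complete it later
  function () { calls.push('nextStep'); }
);
[1, 2, 4, 3].forEach(function (i) {
  calls.push('read ' + i);
  pending[i]();
});
console.log(calls.join(', ')); // read 1, read 2, read 4, read 3, nextStep
```

Counting completions is order-independent, which is what the original `i === N` check was missing.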

Node.js Architecture
[Layer diagram labels: Application; Node.js Bindings (Node APIs); JS libs; V8 JavaScript Engine; libuv; C++ libs; ...; C++ addons]
Based on http://stackoverflow.com/q/36766696
- 32 -

Libuv’s event loop
[Flowchart: Update loop time → Loop alive? (if not, End) → Timers (1) → Pending callbacks → Idle callbacks → Prepare handles → Poll for I/O (epoll) → Check handles → Close callbacks → Timers (2)]
http://docs.libuv.org/en/v1.x/design.html
- 33 -

Sources of Non-Determinism in Node.js Programs
• External – input
  • Network traffic
  • Timers (global clock)
  • UNIX signals
• Internal – due to cooperative multitasking (partitioning algos.)
  • Partitioning on I/O-bound activities
    − Network traffic (for remote services like database queries)
    − FS responsiveness
    − Worker pool thread schedule
  • Partitioning on CPU-bound activities
    − CPU speed (compression, crypto, etc.)
- 34 -
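Partitioning of a CPU-bound activity can be sketched as follows. The helper `sumChunked` is hypothetical and the workload (summing 1 to n) is only a stand-in for compression or crypto; the point is that each `setImmediate` yields to the event loop, so other callbacks may run between chunks.

```javascript
// Partition a CPU-bound loop into chunks, yielding between chunks.
function sumChunked(n, chunkSize, onDone) {
  var total = 0;
  var next = 1;
  function step() {
    var end = Math.min(next + chunkSize - 1, n);
    for (; next <= end; next++) total += next;
    if (next <= n) setImmediate(step); // yield: other callbacks may run here
    else onDone(total);
  }
  step();
}

var log = [];
sumChunked(10, 3, function (total) {
  log.push('sum=' + total);
  console.log(log.join(' '));
});
// This callback is queued while the sum is still in progress and runs
// between two of its chunks.
setImmediate(function () { log.push('other'); });
```

How many chunks run before an unrelated callback depends on chunk size and machine speed, which is one way a cooperative program ends up with schedule-dependent behavior.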

Node.fz is a legal, viable replacement for Node.js
• Legal: Compliant with Node.js documentation
• Viable: Matches internal assumptions about libuv behavior
Node.fz passes the Node.js test suite*
- 35 -

Fuzzing parameters
Parameter               Description                                                        Default
EL: epoll DoF           epoll queue shuffle distance                                       -1
EL: epoll deferral %    Probability of deferring a ready epoll item until the next loop    10%
EL: Timer deferral %    Probability of deferring a ready timer until the next loop         20%
EL: closing deferral %  Probability of deferring a ready “close” until the next loop       5%
WP: DoF                 Work queue shuffle distance                                        -1
WP: Max delay           Time to wait for full task queue                                   0.1 ms
WP: epoll threshold     Time to wait when EL also waits                                    0.1 ms
EL: Event Loop | WP: Worker Pool | DoF: Degrees of Freedom
- 36 -
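One way to read the two kinds of parameters is sketched below. The semantics are assumptions, since the slide does not define them: "shuffle distance" is taken to bound how many earlier items a picked item may jump ahead of (with -1 meaning unbounded), and a deferral percentage is taken as an independent per-item probability. `boundedShuffle` and `splitDeferred` are hypothetical names, not Node.fz functions.

```javascript
// Assumed meaning of DoF: each pick is made among the first dof+1 pending
// items, so a picked item jumps ahead of at most `dof` others.
// dof < 0 means the whole queue is eligible.
function boundedShuffle(queue, dof, rand) {
  var remaining = queue.slice();
  var out = [];
  while (remaining.length > 0) {
    var span = dof < 0 ? remaining.length : Math.min(dof + 1, remaining.length);
    var i = Math.floor(rand() * span);
    out.push(remaining.splice(i, 1)[0]);
  }
  return out;
}

// Assumed meaning of a deferral percentage: each ready item is pushed to
// the next loop iteration with the given probability.
function splitDeferred(readyItems, deferralProbability, rand) {
  var runNow = [];
  var deferred = [];
  readyItems.forEach(function (item) {
    (rand() < deferralProbability ? deferred : runNow).push(item);
  });
  return { runNow: runNow, deferred: deferred };
}

var epollItems = ['a', 'b', 'c', 'd'];
var split = splitDeferred(epollItems, 0.10, Math.random);
console.log(boundedShuffle(split.runNow, -1, Math.random), 'deferred:', split.deferred);
```

With `dof` set to 0 the queue order is preserved, which corresponds to running with the shuffle disabled.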

Schedule variation induced by fuzzing
[Chart: y-axis: normalized edit distance]
- 37 -
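The slide does not say how the normalized edit distance is computed. One common choice, assumed here, is the Levenshtein distance between two schedules (sequences of event labels) divided by the length of the longer one. The function names and the event labels in the example are hypothetical.

```javascript
// Levenshtein distance between two sequences, using a rolling row.
function editDistance(a, b) {
  var prev = [];
  for (var j = 0; j <= b.length; j++) prev[j] = j;
  for (var i = 1; i <= a.length; i++) {
    var cur = [i];
    for (j = 1; j <= b.length; j++) {
      var cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(
        prev[j] + 1,       // delete a[i-1]
        cur[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost // substitute, or match when cost is 0
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// Assumed normalization: divide by the longer length, giving a value in [0, 1].
function normalizedEditDistance(a, b) {
  var longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : editDistance(a, b) / longest;
}

var scheduleA = ['timer', 'read f1', 'read f2'];
var scheduleB = ['timer', 'read f2', 'read f1'];
console.log(normalizedEditDistance(scheduleA, scheduleB)); // 2 edits over 3 events
```

A value near 0 means two runs executed nearly the same schedule; larger values mean the schedules diverged more.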

We found two new bugs
• Socket.io: PR 2721
  • Test case fails to clean up repeating temp socket
  • “Temp” socket connects to server during later test
  • Steals resource from subsequent tests: timeout
• Kue: Issue 967
  • Fails on Node.NFZ and Node.FZ
  • Timeout while acquiring lock from Redis
  • Suggests atomicity violation
- 38 -

“Guided fuzzing” increases repro. rate
• Assert failed on 3/50 trials using “standard parameterization”
• Assert referred to a timer going off early
• Tuned parameters to improve timer accuracy
  • E.g. defer worker pool tasks and event loop events
  • The event loop spends most of its time spinning and timers are identified quickly
• New parameter values improved repro rate to 13/50.
- 39 -

Node.fz Performance Overhead
[Chart: y-axis: normalized time to run test suite]
- 40 -

Experimental slides - 41 -

The One Thread Per Client Architecture (OTPC)
[Diagram: a dispatcher assigns each arriving request to one of many threads. Steps shown per thread: Request arrives; Assigned to thread; Handle request (Get RegEx from DB; Evaluate input against RegEx pattern; Prepare response); Send response]
- 42 -

Example: Ordering Violation (KUE 483)
    Job.prototype.markFailed () {
      ...
      if (...) {
        this.update().delayed();
      }
      ...
    }
[Slide annotation: "Asynchronous"]
- 43 -