REBOUND: Defending Distributed Systems with Bounded-Time Recovery
Neeraj Gandhi, Edo Roth, Brian Sandler, Andreas Haeberlen, Linh Thi Xuan Phan
Department of Computer and Information Science, University of Pennsylvania
REBOUND – EuroSys '21 – April 2021

What are Cyber-Physical Systems (CPS)?
[Figure: sensors S1 and S2, control nodes N1-N4 running tasks 1-8, and actuators A1-A4; labeled "Chemical"]
Distributed systems using sensor data to change actuator behavior

CPS Can be Attacked!
[Figure: sensors S1 and S2, control nodes N1-N4 running tasks 1-8, and actuators A1-A4; labeled "Chemical"]
Need fault tolerance for more than crash faults

One Option Is to Use BFT
[Figure: sensors S1 and S2, control nodes N1-N4, and actuators A1-A4, with tasks 1-8 shown in multiple copies]

Insight: Many CPS Do Not Need Perfect Masking
[Figure: sensors S1 and S2, control nodes N1-N4, and actuators A1-A4]
[Timeline: Correct Operation, then Fault, then a Recoverable period, then a Non-recoverable period ending in Catastrophe]
We can leverage this “recovery” period!

How fast does recovery need to be?

System                               Timing
DC/DC converters (STM)               20 μs
Direct torque control (ABB)          25 μs
AC/DC converters                     50 μs
Electronic throttle control (Ford)   5 ms
Traction control (Ford)              20 ms
Micro-scale race cars                40 ms
Autonomous vehicle steering          50 ms
Energy-efficient building control    500 ms

Source: M. Morari. Fast model predictive control (MPC).
• Timing varies by system
• Many CPS have sufficiently large recovery time bounds

Bounded-Time Recovery from Faults
[Timeline: Correct Operation, then Fault, then a Recovery Period of Bounded Time, then Recovered and Correct Operation again]

How can we guarantee BTR?
[Figure: control nodes N1-N4; tasks 1-5 each appear twice across the nodes]

Challenges
• Identify when a fault occurs
• Identify which node is faulty
• How to prevent
  • Incorrect attribution?
  • False negatives?
  • False positives?
• How to bound detection + recovery time?
  …even if the adversary tries to delay detection and/or recovery
[Figure: sensors S1 and S2, control nodes N1-N4, and actuators A1-A4, with the conflicting reports "N3 faulty!" and "All is well!"]

Outline
• Introduction
  • Goal: Protect against attacks
  • Bounded-Time Recovery (BTR)
  • Approach: Detect and Recover
• Technique
  • Model
  • REBOUND
• Evaluation

Model
[Figure: tasks 1-8 ranked between Most Critical and Least Critical; execution proceeds over time in Round 0, Round 1, Round 2]

Modes and Transitions
• Tree stores per-mode task allocation
• Faults trigger mode transitions
[Figure: mode tree rooted at "No Faults" with child modes such as "Node 2 Faulty", "Link 1-2 Faulty", and "Nodes 1 & 2 Faulty"; each mode shows an allocation of tasks to nodes N1-N4, and some tasks are marked "Drop"]

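The mode tree on this slide can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the `Mode` class, the `dropped_tasks` helper, and the task placements are hypothetical, and it assumes that tasks 7 and 8 are the least critical ones.

```python
from dataclasses import dataclass, field


@dataclass
class Mode:
    """One node of a mode tree: a task allocation for a given set of faults."""
    faults: frozenset                              # faults seen so far, e.g. {"N2"}
    allocation: dict                               # node name -> task ids it runs
    children: dict = field(default_factory=dict)   # newly detected fault -> next Mode

    def on_fault(self, fault):
        """Follow the precomputed transition for a newly detected fault."""
        return self.children[fault]


def dropped_tasks(before, after):
    """Tasks that ran in `before` but have no slot in `after`."""
    ran_before = {t for tasks in before.allocation.values() for t in tasks}
    runs_after = {t for tasks in after.allocation.values() for t in tasks}
    return sorted(ran_before - runs_after)


# Hypothetical allocations: when N2 fails, its tasks move to surviving nodes
# and the tasks assumed to be least critical (7 and 8) are dropped.
no_faults = Mode(frozenset(), {"N1": [1, 2], "N2": [3, 4], "N3": [5, 6], "N4": [7, 8]})
n2_faulty = Mode(frozenset({"N2"}), {"N1": [1, 2], "N3": [3, 5], "N4": [4, 6]})
no_faults.children["N2"] = n2_faulty

current = no_faults.on_fault("N2")
print(current.allocation, dropped_tasks(no_faults, current))
```

Because every transition is a lookup in a tree built ahead of time, the cost of switching modes does not depend on solving a scheduling problem at fault time, which is one plausible way to keep recovery time bounded.
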
Failures trigger mode transitions
[Figure: control nodes N1-N4 running tasks 1-8; two replicas each labeled "Inputs {2, 3}", "Function +", "Output 5", joined by "Cross Check"; the value 17 also appears next to one output]

Avoid Errors by Requiring Signatures
All messages are signed
[Figure: nodes N1-N4. Inputs: {2, 2}, signed by N1. Output: 4, signed by N2. Accusation "N3 misbehaved!" with evidence consisting of Inputs: {2, 2} (signed by N1) and Output: 5 (signed by N3), the evidence itself signed by N2]

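The evidence check in this example can be sketched in a few lines. This is only an illustration under loud assumptions: HMAC tags with hypothetical per-node keys stand in for real digital signatures (an HMAC is symmetric, so it would not give transferable evidence in practice), addition stands in for the replicated function, and the names `sign`, `verify`, `task`, and `audit` are invented here.

```python
import hashlib
import hmac
import json

# Hypothetical per-node keys; a real system would use public-key signatures.
KEYS = {"N1": b"key-n1", "N2": b"key-n2", "N3": b"key-n3"}


def sign(node, payload):
    """Tag a JSON-serializable payload on behalf of `node`."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(KEYS[node], body, hashlib.sha256).hexdigest()


def verify(node, payload, tag):
    return hmac.compare_digest(sign(node, payload), tag)


def task(inputs):
    """Stand-in for the replicated control function (addition, as on the slide)."""
    return sum(inputs)


def audit(inputs, in_signer, in_tag, output, out_signer, out_tag):
    """Return the node whose signed output contradicts the signed inputs, else None."""
    if not verify(in_signer, {"inputs": inputs}, in_tag):
        return None  # inputs are not authentic, so this is not valid evidence
    if not verify(out_signer, {"output": output}, out_tag):
        return None  # output is not authentic, so nobody can be blamed for it
    return out_signer if task(inputs) != output else None


inputs = [2, 2]
in_tag = sign("N1", {"inputs": inputs})
print(audit(inputs, "N1", in_tag, 4, "N2", sign("N2", {"output": 4})))  # None
print(audit(inputs, "N1", in_tag, 5, "N3", sign("N3", {"output": 5})))  # N3
```

The sketch omits details a real record would need, such as binding each signature to a round number and task id so that old messages cannot be replayed as evidence.
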
Omission Faults
• Replicas cannot verify that a message was not received by another node
• Allow unilateral link fault declaration (LFD) if node is a link endpoint
• LFDs are all signed

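The endpoint rule for LFDs can be written as a simple acceptance check. This is a minimal sketch under assumptions: the `LFD` record, its fields, and the `accept_lfd` and `faulty_links` helpers are hypothetical, and signature verification is abstracted into a boolean supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LFD:
    """A link fault declaration (hypothetical record layout)."""
    link: frozenset   # the two endpoints of the link, e.g. {"N1", "N2"}
    declarer: str     # node that signed the declaration
    round_no: int     # round in which the declaration was issued


def accept_lfd(lfd, signature_ok):
    """Accept only signed declarations made by one of the link's own endpoints."""
    return signature_ok and len(lfd.link) == 2 and lfd.declarer in lfd.link


def faulty_links(declarations):
    """Collect the links covered by accepted (lfd, signature_ok) pairs."""
    return {lfd.link for lfd, ok in declarations if accept_lfd(lfd, ok)}


link = frozenset({"N1", "N2"})
declarations = [
    (LFD(link, "N1", 0), True),                         # endpoint, signed: accepted
    (LFD(frozenset({"N2", "N3"}), "N4", 0), True),      # not an endpoint: rejected
    (LFD(frozenset({"N3", "N4"}), "N3", 0), False),     # bad signature: rejected
]
print(faulty_links(declarations))
```

Restricting declarations to endpoints means a compromised node can only take down links it is attached to, which limits the damage of a false declaration.
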
REBOUND: Who is at fault?
Adversaries have more complex strategies we must address!
What happens when compromised nodes work together?
[Figure: nodes N1, N2, N3 across Round 0 and Round 1, with messages "Data" and "LFD(N1 N2)"; annotation: "There should be either data or an LFD received in this round"]

REBOUND: Finding Message Omitter
Ask:
• What should have happened in previous round?
• What should have happened in earlier rounds?
[Timeline: Round 0, Round 1, Round 2]

REBOUND: Require Per-Round Messages
1. All nodes need to know when any other node fails
2. Ensure attackers do not forge heartbeats
Each node must, per round, send its (signed) opinion of the system mode
[Figure: sensors S1 and S2, control nodes N1-N4, and actuators A1-A4; message "New mode: N4 faulty"]

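The per-round heartbeat requirement can be illustrated as follows. This is a sketch under assumptions rather than the protocol from the paper: HMAC tags with hypothetical per-node keys stand in for real signatures, and the message layout and the helper names `make_heartbeat`, `valid`, and `silent_nodes` are invented for the example.

```python
import hashlib
import hmac

# Hypothetical per-node keys; a real deployment would use public-key signatures.
KEYS = {n: f"key-{n}".encode() for n in ("N1", "N2", "N3", "N4")}


def make_heartbeat(node, round_no, mode):
    """A node's signed opinion of the system mode for one round."""
    msg = f"{round_no}|{mode}".encode()
    tag = hmac.new(KEYS[node], msg, hashlib.sha256).hexdigest()
    return {"node": node, "round": round_no, "mode": mode, "tag": tag}


def valid(hb, round_no):
    """Reject heartbeats from unknown nodes, other rounds, or with bad tags."""
    if hb["node"] not in KEYS or hb["round"] != round_no:
        return False
    msg = f"{hb['round']}|{hb['mode']}".encode()
    expected = hmac.new(KEYS[hb["node"]], msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, hb["tag"])


def silent_nodes(heartbeats, peers, round_no):
    """Peers from which no valid heartbeat arrived in this round."""
    heard = {hb["node"] for hb in heartbeats if valid(hb, round_no)}
    return sorted(set(peers) - heard)


received = [make_heartbeat(n, 1, "no-faults") for n in ("N1", "N2", "N3")]
received.append({"node": "N4", "round": 1, "mode": "no-faults", "tag": "00" * 32})  # forged
received.append(make_heartbeat("N4", 0, "no-faults"))  # replayed from an older round
print(silent_nodes(received, KEYS.keys(), 1))
```

Binding the round number into the signed message is what makes a replayed heartbeat useless: the forged and the stale message for N4 are both rejected, so N4 counts as silent in round 1.
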
Much More in Paper + TR
• Forwarding + Auditing Protocol Layers
• Scheduling + Optimizations
• Bus optimizations
• Multisignatures for scalability
• Handling edge cases to guarantee BTR
• And more!

Multisignatures
• Allow sets of nodes to sign messages
• Constant size & verification time
• Approach requires many messages
  • Message content often same
• Only use for larger topologies
[Figure: nodes N1-N6, each sending the same message "No faults!"]

Outline
• Introduction
  • Goal: Protect against attacks
  • Bounded-Time Recovery (BTR)
  • Approach: Detect and Recover
• Technique
  • Model
  • REBOUND
• Evaluation

Evaluation
List of Results
• Runtime
  • Bandwidth/computation
  • Mode change timing, bandwidth consumption
  • Overall runtime costs
• Scheduling overheads
  • BFT scheduling comparison
• Case Studies
  • Volvo XC90
  • Raspberry Pi hardware experiments

What are the costs of REBOUND?
[Figure: three plots against number of nodes (0-100), each comparing Unoptimized and Full REBOUND: bandwidth (KB per link per round), number of crypto ops per round per node, and storage (KB/node)]
• Synthetic workloads in randomized network topologies
• Compare REBOUND with and without multisignatures
Takeaways
• Overhead is manageable
• Scalability greatly improved by multisignatures!

Embedded Platform: No Protection
[Figure: sensor S2, control nodes N1-N4, and actuators A1-A4; applications: Pressure Valve, Pressure Alarm, Monitor]
No Recovery Occurs

Embedded Platform: With REBOUND
[Figure: sensor S2, control nodes N1-N4, and actuators A1-A4; tasks 1-8 ranked between Most Critical and Least Critical for the Pressure Valve, Pressure Alarm, and Monitor applications; legend: Recovered Application, Dropped Application]
Takeaway
Our approach can detect + recover quickly in a real embedded platform!

Conclusion
• Thank you!