An Integrated Framework for Dependable and Revivable Architecture Using Multicore Processors
Weidong Shi (Motorola Labs), Hsien-Hsin “Sean” Lee (Georgia Tech), Laura Falk (University of Michigan), Mrinmoy Ghosh (Georgia Tech)
Problem Statement
• Highly available, reliable, and revivable networked services.
• Explore new programming and usage models for multicore processors
• Provide “architectural support” for network services to be
  – Autonomic
  – Remote-exploits revivable
  – Self-recoverable
• Achieve high performance
Toward Self-recovery Network Services
[Figure: causes of network service loss and their solutions. Causes are grouped as Accidental and Intentional; labels shown: Aging, Transient, Heisenbugs, DoS, Damage, Buffer Overflow, Remote Exploit. Solutions shown: Replication, Rejuvenation, Checkpoint, Self-recovery.]
Multicore: An ideal platform
• Exploit insulation: each core of a multicore can be programmed to run at a different privilege level with a different OS.
• Tight coupling of cores compared with SMP: fine-grained processor state monitoring
• Concurrent monitoring, efficient state backup and recovery
• Massive multi-core will have many idle cores
[Figure: dual-core processor (Merom) with a Server Core, a Monitor Core, and a shared L2.]
INDRA: A Dependable and Revivable Architecture
[Figure: block diagram. Labels shown: Server Core (Network Apps) with IL1 Cache, DL1 Cache, Trace Filter, and Trace FIFO; Monitor Core with IL1 Cache and DL1 Cache, CFG check, and code origin check; control signals (Monitor, Insulation Control, Issue Recovery); L2 Cache; Memory Interface; Watch Dog; Physical Memory Space (used by service OS and applications); Protected Memory Space (monitor BIOS and SW).]
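The coupling between the two cores through the Trace FIFO can be illustrated with a toy queue model. This is only a sketch under stated assumptions, not the INDRA hardware: it assumes the server core pushes one trace entry per emitting cycle, stalls while the FIFO is full, and that the monitor core drains one entry every `drain_every` cycles. The function name and all parameters are hypothetical.

```python
from collections import deque


def simulate_trace_fifo(events, capacity, drain_every):
    """Count server-core stall cycles for a bounded trace FIFO.

    events: list of bools, True when the server core wants to emit a
            trace entry for that step (hypothetical workload).
    capacity: number of entries the FIFO can hold.
    drain_every: the monitor core removes one entry every N cycles.
    """
    fifo = deque()
    stalls = 0
    cycle = 0
    for emits in events:
        while True:
            cycle += 1
            # Monitor core consumes one entry on its drain cycles.
            if cycle % drain_every == 0 and fifo:
                fifo.popleft()
            # The server core proceeds if it has nothing to emit
            # or if there is room in the FIFO.
            if not emits or len(fifo) < capacity:
                break
            stalls += 1  # FIFO full: server core waits this cycle
        if emits:
            fifo.append(cycle)
    return stalls


if __name__ == "__main__":
    burst = [True] * 8
    print("small FIFO stalls:", simulate_trace_fifo(burst, 2, 4))
    print("large FIFO stalls:", simulate_trace_fifo(burst, 8, 4))
```

In this model a larger FIFO absorbs bursts of trace entries, so the server core stalls less often; the numbers it produces are not measurements of the real system.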
Monitor Core: Insulated Parallel Inspection [Kiriansky et al., USENIX 2002]
• Code Origin Check (data page vs. code page)
• Control Flow Graph Check
• Exception Handling
Example shown on the slide:
    FunctionA() { Vuln_func(); A = 3; }
    Vuln_func() { // Attack!! // Return address changed }
    Malicious_func() { }
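The two checks named on this slide can be sketched in a few lines. This is a minimal illustration, not the INDRA monitor software: the page map, the set of allowed control transfers, the trace entry format, and every name below are hypothetical and chosen only to mirror the slide's example, where a changed return address redirects control to `Malicious_func`.

```python
PAGE_SIZE = 4096

# Hypothetical page map: page number -> kind of page.
PAGE_KIND = {0x10: "code", 0x11: "code", 0x20: "data"}

# Hypothetical control flow graph: allowed (source, target) transfers.
ALLOWED_EDGES = {
    ("FunctionA", "Vuln_func"),   # call
    ("Vuln_func", "FunctionA"),   # legitimate return
}


def code_origin_ok(fetch_address):
    """Accept an instruction fetch only if it comes from a code page."""
    return PAGE_KIND.get(fetch_address // PAGE_SIZE) == "code"


def control_flow_ok(source, target):
    """Accept a control transfer only if the CFG contains that edge."""
    return (source, target) in ALLOWED_EDGES


def inspect(trace):
    """Return the first violation in a trace, or None if it is clean.

    Trace entries are ("fetch", address) or ("transfer", source, target).
    """
    for entry in trace:
        if entry[0] == "fetch" and not code_origin_ok(entry[1]):
            return ("code origin violation", entry)
        if entry[0] == "transfer" and not control_flow_ok(entry[1], entry[2]):
            return ("control flow violation", entry)
    return None


if __name__ == "__main__":
    attack = [
        ("transfer", "FunctionA", "Vuln_func"),
        ("transfer", "Vuln_func", "Malicious_func"),  # return address changed
    ]
    print(inspect(attack))
```

Running the inspection on a separate core is what lets it proceed in parallel with the server; the sketch only shows the decision logic.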
Server Core: Request Based Recovery
[Flowchart]
• Issue state backup request
• Read network request (e.g., request for page arch.ece.gatech.edu)
• Process network request
• Monitor signalled error?
  – No: continue with the next request
  – Yes: restore checkpointed state
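The flowchart above can be sketched as a request loop. This is a software stand-in under stated assumptions: the state is a plain dictionary, the backup is a `copy.deepcopy` (exactly the kind of slow copy that hardware support is meant to avoid), and `serve`, `handle`, and `monitor_flags_error` are hypothetical names.

```python
import copy


def serve(requests, state, handle, monitor_flags_error):
    """Process requests one at a time with a checkpoint before each.

    state: dict holding the service state (modified in place).
    handle(state, request): processes a request and returns a response.
    monitor_flags_error(state, request): True if the monitor signals an error.
    """
    responses = []
    for request in requests:
        checkpoint = copy.deepcopy(state)      # issue state backup request
        response = handle(state, request)      # read and process the request
        if monitor_flags_error(state, request):
            state.clear()                      # restore checkpointed state
            state.update(checkpoint)
            responses.append(None)             # the failed request gets no reply
        else:
            responses.append(response)
    return responses


if __name__ == "__main__":
    def handle(state, request):
        state["hits"] = state.get("hits", 0) + 1
        state["last"] = request
        return "ok:" + request

    def monitor_flags_error(state, request):
        return request == "exploit"

    state = {"hits": 0}
    print(serve(["a", "exploit", "b"], state, handle, monitor_flags_error))
    print(state)
```

Because the checkpoint is taken per request, a detected exploit discards only the work of the request that carried it.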
Comparison of Backup and Recovery
Approach                  Backup                             Recovery
Software checkpointing    Slow                               Fast, modify page translation
Memory Update Log         Fast                               Log-based undo, slow
Virtual Checkpointing     Copy dirty page on demand, slow    Fast, modify TLB entry
INDRA: Fast, no page copy
INDRA Backup Page Record
[Figure: TLB extension for backup and rollback, with processor and memory.]
• Global Timestamp Register (GT), shown with GT=4
• Modified TLB entry: Tag, Local Timestamp, Active Page (Physical Address), Backup Page (Physical Address), Dirty Block Bitvector, Rollback Valid
• Backup Page Record (in memory): Active Page, Backup Page (Physical Address), Local Timestamp (shown as 3), Dirty Block Bitvector, Rollback Bitvector, Rollback Valid
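The fields of the backup page record can be written down as a small data structure. This is a sketch for illustration only: the field names follow the slide, but the page and line sizes, the use of Python integers as bitvectors, and the `is_current` rule (a record belongs to the current request when its local timestamp equals GT) are assumptions.

```python
from dataclasses import dataclass

LINES_PER_PAGE = 64  # assumption: 4 KB page with 64-byte lines


@dataclass
class BackupPageRecord:
    active_page: int              # physical address of the active page
    backup_page: int              # physical address of the backup page
    local_timestamp: int          # value of GT when the record was last updated
    dirty_bitvector: int = 0      # one bit per line written since the backup
    rollback_bitvector: int = 0   # one bit per line still to be restored
    rollback_valid: bool = False

    def mark_dirty(self, line):
        """Set the dirty bit for one line of the page."""
        if not 0 <= line < LINES_PER_PAGE:
            raise ValueError("line index out of range")
        self.dirty_bitvector |= 1 << line

    def is_dirty(self, line):
        return bool((self.dirty_bitvector >> line) & 1)

    def is_current(self, global_timestamp):
        """Assumed rule: the record is current if its timestamp matches GT."""
        return self.local_timestamp == global_timestamp


if __name__ == "__main__":
    record = BackupPageRecord(active_page=0x1000, backup_page=0x9000,
                              local_timestamp=3)
    record.mark_dirty(7)
    print(record, record.is_current(4))
```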
INDRA Recovery Example
[Figure sequence: TLB extension for backup and rollback. Each frame shows the current operation, the Global Timestamp Register (GT), the modified TLB entry (Tag, Local Timestamp, Active Page (Physical Address), Backup Page (Physical Address), Dirty Block Bitvector, Rollback Valid), and the Backup Record with its Active Page and Backup Page.]
• REQUEST n, GT=5: write memory line 7
• REQUEST n, GT=5: write memory line 2
• REQUEST n, GT=5: failure signal; restore system resource allocation and process context
• REQUEST n+1, GT=5: read memory line 7
• REQUEST n+1, GT=5: write memory line 1
• REQUEST n+1: handle next request; GT advances from 5 to 6; record system resource allocation and process context
• REQUEST n+2, GT=6: write memory line 4
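The sequence above can be replayed with a toy model of one page. This is a simplified sketch, not the hardware described in the talk: it assumes a line is copied to the backup page on its first write within a request, that a failure marks the lines written during that request for rollback, and that a marked line is restored lazily the next time it is read or written. Sets stand in for the dirty and rollback bitvectors, and the class and method names are hypothetical.

```python
class TrackedPage:
    """Toy model of one page with per-line backup and lazy rollback."""

    def __init__(self, lines, global_timestamp):
        self.active = list(lines)
        self.backup = list(lines)
        self.local_timestamp = global_timestamp
        self.dirty = set()      # lines written during the current request
        self.rollback = set()   # lines waiting for a lazy restore

    def _restore_if_pending(self, line):
        if line in self.rollback:
            self.active[line] = self.backup[line]
            self.rollback.discard(line)

    def read(self, line):
        self._restore_if_pending(line)
        return self.active[line]

    def write(self, line, value, global_timestamp):
        if self.local_timestamp != global_timestamp:
            # First write of a new request: older dirty bits are stale.
            self.dirty.clear()
            self.local_timestamp = global_timestamp
        self._restore_if_pending(line)
        if line not in self.dirty:
            self.backup[line] = self.active[line]  # back up before first write
            self.dirty.add(line)
        self.active[line] = value

    def fail(self, global_timestamp):
        """Monitor signalled an error during the request stamped with GT."""
        if self.local_timestamp == global_timestamp:
            self.rollback |= self.dirty
        self.dirty.clear()


if __name__ == "__main__":
    page = TrackedPage(["old%d" % i for i in range(8)], global_timestamp=3)
    page.write(7, "bad7", 5)      # REQUEST n
    page.write(2, "bad2", 5)
    page.fail(5)                  # failure signal
    print(page.read(7))           # REQUEST n+1 sees the pre-request value
    page.write(1, "new1", 5)
    page.write(4, "new4", 6)      # REQUEST n+2 after GT advances
    print(page.read(2), page.read(1), page.read(4))
```

No page is copied at failure time in this model; only the lines touched by the failed request are restored, and only when they are next accessed.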
Test Bed (Bochs + TAXI [Vlaovic & Davidson, ICCD’02])
[Figure: labels shown are Linux Network Server, Monitor (stripped-down OS, security SW, 10 MB), Bochs + TAXI, Host OS, Network Requests, Server Response.]
• Run production OS with real service applications: httpd, ftpd, bind, sendmail, etc.
• Recoverability evaluated by applying real x86 remote exploits from security websites.
• Experiment with documented exploits
[Figure: Inter-Request Interval (# of Instructions)]
I-Cache Miss Rate
• Code Origin Check reads traces of code read from the L2 cache
• The number of instructions in the trace is proportional to the L1 I-cache miss rate
• The overhead of monitoring code origin depends on the L1 I-cache miss rate
[Figure: Monitoring Overhead]
[Figure: Sensitivity of Monitoring Queue Size (Queue Size vs. Performance); axes: Slowdown, Queue Size]
[Figure: Backup Overhead of Modified Lines]
[Figure: Performance of Recovery + Monitoring]
Conclusions
• Real-time exploit monitoring with autonomic recovery increases revivability and availability.
• Multicore architectures are an ideal candidate for a new type of revivable system.
• An INDRA-based multicore system can provide improved reliability and availability.
• More research is required to explore the trade-offs among availability, performance, architecture design, and cost.
Questions and Answers
Thank you!
http://arch.ece.gatech.edu