SpaceJMP: Programming with Multiple Virtual Address Spaces

SpaceJMP: Programming with Multiple Virtual Address Spaces
Izzat El Hajj, Alexander Merritt, Gerd Zellweger, Dejan Milojicic, Reto Achermann, Paolo Faraboschi, Wen-mei Hwu, Timothy Roscoe, Karsten Schwan

SpaceJMP: Programming with Multiple Virtual Address Spaces
• Serialization is costly
• Overcome insufficient virtual address bits
• Let applications manage address spaces

Enormous Demand for Data
In-memory real-time analytics is driving the trend.
[Chart: Venture Investments in Big Data Analytics Companies, 2004–2015 — No. of Deals (up to 600) and Invested Capital (up to $5 billion)²]
1. DOMO, Data Never Sleeps 3.0, 2015 ("How much data is generated every minute?")
2. SVB, Big Data Next: Capturing the Promise of Big Data, 2015

Memory-Centric Computing

Shared nothing — servers with CPU + private DRAM, connected only by the network:
• Private DRAM
• Network-only communication
• Data marshaling

Shared something¹ — SoCs with private DRAM plus a global NVM pool (3D XPoint, Memristor) reached by load/store over high-radix switches:
• Private DRAM
• Global NVM pool: byte-addressable, near-uniform latency

1. Faraboschi et al., Beyond Processor-Centric Operating Systems. HotOS'15

Sharing Pointer-Based Data

[Diagram: a symbol table maps names to pointers into a contiguous virtual region at 0x8000 (list → 0x8D40, tree → null), backed by a memory region; the region pointer 0x8D40 is absolute, or can be stored relative to the mapping base: 0x4000 + offset 0x0D40 = 0x4D40.]

Serialization via the file system:
• Marshaling costs
• Secondary representation

Region-based programming:
• Fixed base addresses — region conflicts!
• No control over the address space!
• Special pointers — map + swizzle, or use offsets

What About Large Memories?

Physical memory: 2^56 bytes = 64 PiB (or more). Memory-mapped region? No: not enough VA bits — only 2^48 bytes = 256 TiB of virtual address space.*

What to do? Awkward and inefficient designs:
• Single process — remap regions in and out; many remap operations
• Multiple processes — challenges: data partitioning, coordination

*Intel x86-64 processors.

Legacy Designs are Limiting

Virtual address space: code, globals, libraries, heap, stack, kernel — with fragmentation (holes). Process abstraction: PC, registers, one VAS.

void* mmap(...)   int munmap(...)
• Limited control — randomization due to ASLR; aliasing not prevented¹
• Limited granularity — files, ACLs
• Costly construction — mapping a 256 GiB range: map 11 sec, unmap 2.44 sec

[Chart: latency (µsec to msec) to construct and destroy a region and its page table vs. memory range size, 32 KiB–32 GiB in 4-KiB pages; 2-socket HSW Intel Xeon, 512 GB DRAM, GNU/Linux, not incl. page zeroing or hard faults]

Why not let applications manage address spaces?

1. FreeBSD has MAP_EXCL to detect aliased regions.

SpaceJMP: VAS as First-Class Citizen

Process A keeps its private virtual address space (globals, code, heap, lib, stack) plus PC, registers, and a set of VAS handles. Operations:
• create VAS (global)
• create segments; add segment to VAS
• attach VAS — per-thread copy of page-table translations
• switch VAS ("jumping") — explicit, arbitrary; and switch back (return)

SpaceJMP: Shared Address Spaces

[Diagram: Processes A and B each keep their own private virtual address space (globals, code, heap, lib, stack); both attach the same global VAS B, containing segments Q and S, as process-local instances B' and B''.]

SpaceJMP: Lockable Segments

Segment S is lockable. Process A switches into its instance B' of VAS B and acquires the lock on S. When Process B then switches into its instance B'', it blocks inside the kernel until the lock is released — the kernel forces processes to abide by the locking protocol.

Unobtrusive Implementation: DragonFly BSD v4.0.6
• Small derivative of FreeBSD; memory system based on the Mach µkernel; supports only the AMD64 arch.
• Segment — wrapper around a VM object (vm_object of type OBJT_PHYS with resident pages)
• VAS — instance of struct vmspace (vm_map_entry: start, end, offset, protection → vm_object*)
• Process modifications — a primary VAS plus a set of attached VASes
• VAS switch as a system call — look up the vmspace, overwrite CR3
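In sketch form, the switch path this slide describes looks roughly like the pseudocode below. The names are illustrative, not the actual DragonFly BSD patch:

```
// Pseudocode sketch of the VAS-switch system call described above.
sys_vas_switch(vas_handle h):
    vas = lookup_attached_vas(curproc, h)   // must have been attached first
    if vas == NULL:
        return EINVAL
    curthread->active_vmspace = vas->vmspace
    load_cr3(vas->vmspace->pmap_root)       // point the MMU at the new page table
    return 0
```

Because the switch only rewrites CR3 against an already-constructed vmspace, its cost is a page-table root swap (plus TLB effects), not a remapping of memory.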

Unobtrusive Implementation: Barrelfish OS
• Multikernel — per-core OS nodes with replicated state (x86, Xeon Phi, ARM) over an interconnect
• All memory is typed (frame, vnode, cnode), safe via kernel-enforced capabilities: a raw RAM capability is retyped into page-table types (x86 PML4 → PDPT → PD → PTE) and frames
• SpaceJMP is a user-level implementation
• No dynamic memory allocation in the kernel
• Flexible for experimenting with optimizations

Linux port at Hewlett Packard Labs.

Sharing Pointer-Rich Data: SAMTools Genomics Utilities
The baseline pipeline un-marshals data between stages (stage 1 → stage 2 → stage 3); with SpaceJMP, each stage simply switches VAS.
• No data marshaling
• Use of absolute pointers — no swizzling, no address conflicts

[Chart: normalized runtime (0–1.0) of SAMTools alignment operations — Flagstat, Qname Sort, Coordinate Sort, Index — for baseline vs. SpaceJMP; 2-socket 24-core Westmere, 92 GiB DRAM, DragonFly BSD]

Single-System Client-Server: Redis over UNIX Sockets
Client and server exchange data through kernel socket buffers, marshaling and unmarshaling on each side.
• Serialized data into sockets
• Buffer copying
• Scheduling coordination

[Chart: GETs per second; 2-socket 24-core Westmere, 92 GiB DRAM, DragonFly BSD]

Single-System Client-Server: Redis with SpaceJMP
Each client (C0, C1, C2) attaches the server VAS and switches into it to perform a GET directly, with no socket crossing.

[Chart: GETs per second; 2-socket 24-core Westmere, 92 GiB DRAM, DragonFly BSD]

Single-System Client-Server: Varying Read-Write Loads
A writer (set!) acquires the lock on the server segment; concurrent readers (get!) block in the kernel until it is released.
• Scalability depends on lock granularity — scalable locks (e.g. MCS), hardware transactional memory
• Typical read/write ratio for a KVS: ca. 10% writes

[Chart: requests per second vs. write ratio (%); 2-socket 24-core Westmere, 92 GiB DRAM, DragonFly BSD]

SpaceJMP – Summary

Takeaway
• Promote address spaces to first-class citizens
• Processes explicitly create, attach, and switch address spaces

Future Work
• Persistence – fast reboots
• Security – sandboxing
• Semantics – transactions
• Versioning – fast checkpointing

Backup Slides

Programs: How to Use SpaceJMP

    vas_create(NAME, PERMS)            // create a global VAS; returns a VAS handle
    seg_alloc(NAME, BASE, LEN, PERMS)  // create a segment S
    seg_attach(VAS#, SEG#)             // add segment S to VAS B
    vas_attach(VAS#)                   // attach VAS B to the calling process
    vas_switch(VAS#)                   // jump into VAS B via its handle

    List *items = /* lookup in symbol table */;
    append(items, malloc(new_item));

Programming Large Memories with a GUPS-like Workload
Three designs over the same physical memory: multiple processes (OpenMPI, busy-waiting), a single process with re-mapping, and SpaceJMP with one VAS per partition.

[Chart: updates per second (millions, 0–80) for Multi-Process, re-mapping, and SpaceJMP; 2-socket 36-core HSW, 512 GiB DRAM, DragonFly BSD]

Study: Implications for RPC-Based Communication
Can SpaceJMP support fast RPC?
• Unix domain sockets are ubiquitous
• How does it compare to faster published inter-machine RPC mechanisms?

Pointer Safety Issues

Risk of unsafe behavior: pointer dereferences in the wrong address space are undesirable.

Safe programming semantics:
    switch v1
    a = malloc       // a is valid in v1 only
    b = *a           // b is valid in v1 only
    c = vcast v2 b   // c is valid in v2 only
    d = alloca       // d is valid in any VAS
    *d = c
    e = *d           // e is valid wherever c was valid

Compiler-Enforced Pointer Safety

Analysis identifies potentially unsafe behavior:
• Analyze the active VASes at each program point
• Analyze which VAS each pointer may point to
• Identify dereferences with a mismatch between the current VAS and the points-to VAS (safety-ambiguous)

Transformation guards dereferences:
• Protect potentially unsafe dereferences with tag checks
• Tag pointers involved in potentially unsafe dereferences
• Tag pointers that escape visibility (e.g. external function invocation, stores, etc.)

Example:
    a = malloc
    b = malloc
    switch v
    *a    // safe dereference
    *b    // safety-ambiguous dereference

How Fast is Address Space Switching?

Switching costs — breakdown:
• The CR3 write cost increases with tags
• Overall switch latency is lower with tags

Impact of TLB tagging:
• Translations remain in the TLB across switches
• Diminishing returns with larger working sets

Concrete Systems Example: HP Superdome X¹
• 16 sockets, 288 physical cores
• 24 TiB DRAM
• Byte-addressable, cache-coherent
• $500K–$1M

Improvements to make:
• No NVM
• Non-uniform latencies
• Cache coherence wall

1. Source: Hewlett Packard Enterprise