
System Calls

Linux ABI
• System Calls
  – Everything distills into a system call
    • /sys, /dev, /proc: read() & write() syscalls
• What is a system call?
  – Special purpose function call
    • Elevates privilege
    • Executes function in kernel
  – But what is a function call?

What is a function call?
• Special form of jmp
  – Execute a block of code at a given address
  – Special instruction: call <fn-address>
  – Why not just use jmp?
• What do function calls need?
  – int foo(int arg1, char *arg2);
    • Location: foo()
    • Arguments: arg1, arg2, …
    • Return code: int
  – Must be implemented at hardware level

System Calls
• Function calls are not that special
  – Just an abstraction built on top of hardware
• System calls are basically function calls
  – With a few minor changes
    • Privilege elevation
    • Constrained entry points
  – Functions can call to any address
  – System calls must go through “gates”

Implementing system calls
• System calls are implemented as a single function call: syscall()
  – read() and write() actually just invoke syscall()
• What does syscall() do?
  – Enters the kernel at a known location
  – Elevates privilege
  – Instantiates kernel level environment
• Once inside the kernel, an appropriate system call handler is invoked based on the arguments to syscall()

x86 and Linux
• Number of different mechanisms for implementing syscall
  – Legacy: int 0x80
    • Invokes a single interrupt handler
  – 32 bit: SYSENTER
    • Special instruction that sets up preset kernel environment
  – 64 bit: SYSCALL
    • 64 bit version of SYSENTER
• All jump to a preconfigured execution environment inside kernel space
  – Either interrupt context or OS defined context
• What about arguments?
  – syscall(int syscall_num, args…)

Specific system calls
• Each system call has a number assigned to it
  – Index into a system call table
    • Function pointers referencing each syscall handler
• syscall(int syscall_num, args…)
  – Sets up kernel environment
  – Invokes syscall_table[syscall_num](args…);
  – Returns to user space:
    • Resets environment to state before call

man -s 2 write

WRITE(2)              Linux Programmer's Manual              WRITE(2)

NAME
       write - write to a file descriptor

SYNOPSIS
       #include <unistd.h>

       ssize_t write(int fd, const void *buf, size_t count);

DESCRIPTION
       write() writes up to count bytes from the buffer pointed buf to the
       file referred to by the file descriptor fd.

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                    size_t, count)
    {
            struct fd f = fdget_pos(fd);
            ssize_t ret = -EBADF;

            if (f.file) {
                    loff_t pos = file_pos_read(f.file);
                    ret = vfs_write(f.file, buf, count, &pos);
                    if (ret >= 0)
                            file_pos_write(f.file, pos);
                    fdput_pos(f);
            }

            return ret;
    }

    ssize_t __vfs_write(struct file *file, const char __user *p,
                        size_t count, loff_t *pos)
    {
            if (file->f_op->write)
                    return file->f_op->write(file, p, count, pos);
            else if (file->f_op->write_iter)
                    return new_sync_write(file, p, count, pos);
            else
                    return -EINVAL;
    }
    EXPORT_SYMBOL(__vfs_write);

    static ssize_t console_write(struct file *filp, const char __user *buf,
                                 size_t size, loff_t *offset)
    {
            char *tmp_buf = NULL;
            …
            /* copy_from_user() returns the number of bytes it could NOT copy */
            if (copy_from_user(tmp_buf, buf, size)) {
                    return -EFAULT;
            }
            return size;
    }

    static struct file_operations cons_fops = {
            .read  = console_read,
            .write = console_write,
    };

int 0x80
• Old style system call invocation
  – Vectors into kernel through IDT
  – Special interrupt (128) only used for system calls
• IDT switches CPU to kernel mode
  – Changes CS segment to kernel CS segment
    • Hard coded as __KERNEL_CS
  – Switches to kernel stack
• IRQ handler inspects register contents for syscall # and arguments
  – System call index goes in %eax
• Syscall handler invoked from syscall table
  – Like how IRQ handlers are invoked

Sysenter
• More modern approach to syscall invocation
  – Allows OS to configure a syscall execution context
  – Configured via writes to hardware MSRs
  – Achieves same effect as an IRQ handler, but faster
• Configured at boot time on each CPU
  – SYSENTER_CS_MSR
    • Stores kernel code segment
  – SYSENTER_EIP_MSR
    • Address of code to handle system calls
  – SYSENTER_ESP_MSR
    • Kernel mode stack pointer
• Application issues sysenter instruction
  – Instantiates system call context
  – After system call, control returned to process with sysexit instruction

SYSENTER/SYSEXIT
[Figure: SYSENTER operation and SYSEXIT operation]

Syscall
• Long mode version of sysenter
  – Separate set of MSRs for 64 bit mode
  – Assumes flat memory model (no segments)
• Configured at boot time on each CPU
  – SYSCALL_STAR_MSR
    • Stores code segment information
  – SYSCALL_LSTAR_MSR
    • Stores 64 bit instruction pointer
  – SYSCALL_FMASK_MSR
    • Masks for setting RFLAGS values
• Application issues syscall instruction
  – Instantiates system call context
  – After system call, control returned to process with sysret instruction

SYSCALL/SYSRET
[Figure: SYSCALL operation and SYSRET operation]

System call optimizations
• System calls can be invoked in multiple ways
  – Which one should a program use?
  – Do you need to support all options at compile time?
• System calls add overhead
  – Kernel <–> user mode switches are expensive
  – Some system calls are pretty simple and don’t modify state
    • e.g. getpid(), gettimeofday(), etc.
  – What if we can handle a syscall without invoking the kernel?

VDSO
• Kernel provided dynamic library for making system calls
  – Mapped into address space of each process
  – Links with standard C library
  – Automatically uses optimal system call mechanism
• Also provides optimized user space system calls
  – System calls executed without invoking kernel mode
    • __vdso_clock_gettime
    • __vdso_getcpu
    • __vdso_gettimeofday
    • __vdso_time

[Figure: address space of /bin/ls showing Code, Data, Heap, Stack, Libc.so (/libc.so.6) with fread() and read(), and the VDSO; int 0x80, sysenter, and syscall lead into the Linux kernel]

Kernel Environment
• The kernel is a C program
  – Compiled instructions collected in a single binary
  – Linked and loaded similar to a regular program
    • By the boot loader, not the OS
• Kernel executes in its own range of virtual addresses
  – This range is independent from the ranges used by processes
  – They do not intersect
    • Allows kernel and processes to coexist in the same virtual address space

Memory layout
• Traditional Unix (32 bit)
• Program contents on the bottom
• Kernel memory is on top
• Dynamic memory is in the middle
  – Heap grows up
  – Stack grows down
[Figure: address space from top to bottom: kernel virtual memory (invisible to user code), stack, memory mapped region for shared libraries, run-time heap (via malloc) ending at the “brk” ptr, uninitialized data (.bss), initialized data (.data), program text (.text), address 0]

Memory layout
• Modern Linux (64 bit)
  – Many more addresses
• Kernel is no longer top 1 GB
  – Sparsely mapped in at various addresses
• Memory mapped devices
• Balancing address use between stack and heap no longer an issue
  – Heap allocated using mmap()’s
  – brk can still be used
• VDSO region
  – User executable kernel code
  – User accessible kernel data
    • current time
• Plus much more…
[Figure: address space from top to bottom: kernel virtual memory, kernel physical memory, memory mapped devices, stack, VDSO, memory mapped region for shared libraries, run-time heap (via malloc), uninitialized data (.bss), initialized data (.data), program text (.text), address 0]

Memory management
• Address space of a process is virtual memory
  – What the process sees
• Virtual memory may or may not be backed by physical memory
  – Actual byte addressable memory devices on motherboard (DRAM, NVM, etc.)
• OS manages the mapping of virtual memory to physical memory
  – Memory grouped together as pages
    • Typically 4 KB of physically contiguous memory
  – OS allocates pages for each process
  – OS maps allocated pages into the virtual address space of each process
  – OS tracks current mapping of all processes
    • What memory is assigned to whom
  – OS can change mapping at any time
    • Move memory around
    • Move memory to disk (swapping)

Kernel layout

Physical Address Layout
[Figure: physical memory containing the boot loader and the Linux kernel]
• BIOS loads boot loader from startup disk
• Boot loader copies kernel to 1 MB boundary from root partition

Virtual Address Layouts (32 bit)
[Figure: address space with markers at 3 GB (0xc0000000) and 16 MB (0x01000000)]

Virtual Address Layout (64 bit)
Process
> cat /proc/self/maps
    00400000-0040c000 r-xp 00000000 fd:00 1189777        /usr/bin/cat
    0060b000-0060c000 r--p 0000b000 fd:00 1189777        /usr/bin/cat
    0060c000-0060d000 rw-p 0000c000 fd:00 1189777        /usr/bin/cat
    01a26000-01a47000 rw-p 00000000 00:00 0              [heap]
    3dd8600000-3dd8620000 r-xp 00000000 fd:00 1179937    /usr/lib64/ld-2.18.so
    3dd881f000-3dd8820000 r--p 0001f000 fd:00 1179937    /usr/lib64/ld-2.18.so
    3dd8820000-3dd8821000 rw-p 00020000 fd:00 1179937    /usr/lib64/ld-2.18.so
    3dd8821000-3dd8822000 rw-p 00000000 00:00 0
    3dd8e00000-3dd8fb4000 r-xp 00000000 fd:00 1179948    /usr/lib64/libc-2.18.so
    3dd8fb4000-3dd91b3000 ---p 001b4000 fd:00 1179948    /usr/lib64/libc-2.18.so
    3dd91b3000-3dd91b7000 r--p 001b3000 fd:00 1179948    /usr/lib64/libc-2.18.so
    3dd91b7000-3dd91b9000 rw-p 001b7000 fd:00 1179948    /usr/lib64/libc-2.18.so
    3dd91b9000-3dd91be000 rw-p 00000000 00:00 0
    7f3b66ba0000-7f3b6d0c9000 r--p 00000000 fd:00 1183411    /usr/lib/locale-archive
    7f3b6d0c9000-7f3b6d0cc000 rw-p 00000000 00:00 0
    7f3b6d0e6000-7f3b6d0e7000 rw-p 00000000 00:00 0
    7ffffed24000-7ffffed45000 rw-p 00000000 00:00 0      [stack]
    7ffffedb3000-7ffffedb5000 r--p 00000000 00:00 0      [vvar]
    7ffffedb5000-7ffffedb7000 r-xp 00000000 00:00 0      [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0    [vsyscall]

Virtual Address Layout (64 bit)

=========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
=========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|____________________________________________________________
                  |            |                  |         |
 0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
 ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
 ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
 ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 56-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |-2048    MB |                  |         |
 ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |  -16    MB |                  |         |
    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|____________________________________________________________

Both are contiguous ranges starting at physical address 0

Kernel System.map
    0000000001000000 A phys_startup_64
    ffffffff81000000 T _text
    ffffffff81000000 T startup_64
    ffffffff81000110 T secondary_startup_64
    ffffffff810001b0 T start_cpu0
    • • •
    ffffffff810b57f0 T vprintk
    ffffffff8118e650 T kfree
    ffffffff8118f780 T __kmalloc
    ffffffff8130a2a0 T memset
    ffffffff81309ff0 T memcpy
[Slide labels: “Boot loader jumps here”, “Kernel initialization”]

Spectre/Meltdown
• Kernel used to share virtual address space with process
  – Present in each process address space
  – Only accessible if hardware was in kernel mode
    • Protected by page table HW
  – Allowed system calls to be made without switching page tables
    • Performance optimization (just increase privilege level)
• Spectre/Meltdown changed that
  – Allowed hardware to speculatively access kernel memory
  – Result of access could be read via cache side channel
  – Location of access could be controlled by attacker
• Mitigations:
  – Kernel and processes are no longer mapped into the same page tables
  – Effect: lots of stuff you read is no longer accurate

Linked Lists

structs and memory layout
[Figure: two fox structs in memory, each with list.next and list.prev fields]

Linked lists in Linux
[Figure: a fox struct embedding a list node with .next and .prev pointers]

What about types?
• list_entry() calculates a pointer to the containing struct

    struct list_head fox_list;
    struct fox *fox_ptr = list_entry(fox_list.next, struct fox, node);

List access methods

    struct list_head some_list;

    list_add(struct list_head *new_entry, struct list_head *list);
    list_del(struct list_head *entry_to_remove);

    struct type *ptr;
    list_for_each_entry(ptr, &some_list, node) {
            …
    }

    struct type *ptr, *tmp_ptr;
    list_for_each_entry_safe(ptr, tmp_ptr, &some_list, node) {
            list_del(&ptr->node);   /* list_del() takes the embedded node */
            kfree(ptr);
    }