Concurrency and Race Conditions
Linux Kernel Programming CIS 4930/COP 5641

MOTIVATION: EXAMPLE PITFALL IN SCULL

Pitfalls in scull

- Race condition: the result of uncontrolled access to shared data

if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos]) {
                goto out;
        }
}

- If two threads both see dptr->data[s_pos] == NULL, each allocates a quantum; the second assignment overwrites the first pointer, which is then never freed: a memory leak (interleaving sketched below)
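The losing interleaving, as a sketch (the thread labels T1/T2 are illustrative, not from the original slides):

        /* T1: if (!dptr->data[s_pos])            -> sees NULL          */
        /* T2: if (!dptr->data[s_pos])            -> also sees NULL     */
        /* T1: dptr->data[s_pos] = kmalloc(...);  -> stores pointer A   */
        /* T2: dptr->data[s_pos] = kmalloc(...);  -> stores pointer B,  */
        /*     overwriting A; nothing ever frees A -> memory leak       */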

MANAGING CONCURRENCY

Concurrency and Its Management

- Sources of concurrency:
  - Multiple user-space processes
  - Multiple CPUs
  - Device interrupts
  - Timers

Some guiding principles

- Try to avoid concurrent access entirely
  - Avoid global variables
- Apply locking and mutual exclusion principles
- Implications for device drivers:
  - Use sufficient concurrency mechanisms (depending on context)
  - No object can be made available to the kernel until it can function properly
  - References to such objects must be tracked for proper removal
- Avoid "roll your own" solutions

Managing Concurrency

- Atomic operation: all or nothing from the perspective of other threads (see the sketch below)
- Critical section: code executed by only one thread at a time
- Not all critical sections are the same:
  - Access from interrupt handlers
  - Latency constraints
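To make "atomic" concrete, a hedged sketch (not from the slides) of the classic lost update on a shared counter:

        int shared_count;              /* shared data, unprotected: racy */

        void unsafe_increment(void)
        {
                shared_count++;        /* really three steps: load, add, store */
        }

        /* Two threads interleaving the three steps lose an update:
         *   T1: load  shared_count (0)
         *   T2: load  shared_count (0)
         *   T1: store 0 + 1  -> shared_count == 1
         *   T2: store 0 + 1  -> shared_count == 1 (one increment lost)
         */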

Lock Design Considerations

- Context
  - Can another thread be scheduled on the current processor?
- Assumptions of kernel operation
  - Breaking assumptions will break code that relies on them
- Time expected to wait for the lock
  - Considerations: the amount of time the lock is expected to be held, and the amount of expected contention
  - Long wait: other threads can make better use of the processor
  - Short wait: switching to another thread would take longer than just waiting briefly

Kernel Locking Implementations

- mutex
  - Sleeps if the lock cannot be acquired immediately
  - Allows other threads to use the processor
- spinlock
  - Continuously tries to grab the lock
  - Generally does not allow sleeping. Why? (A holder that sleeps leaves other CPUs spinning uselessly, and spinlocks are often taken in atomic context where sleeping is forbidden.)

MUTEX

Mutex Implementation

- Architecture-dependent code
- Optimizations
- Initialization:

DEFINE_MUTEX(name);
void mutex_init(struct mutex *lock);

- Various routines (basic usage sketched below):

void mutex_lock(struct mutex *lock);
int mutex_lock_interruptible(struct mutex *lock);
int mutex_lock_killable(struct mutex *lock);
void mutex_unlock(struct mutex *lock);
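A minimal usage sketch, assuming process context (the mutex and function names are illustrative):

        #include <linux/mutex.h>

        static DEFINE_MUTEX(demo_mutex);       /* statically initialized */

        static void demo_update(void)
        {
                mutex_lock(&demo_mutex);       /* may sleep: process context only */
                /* ... access data protected by demo_mutex ... */
                mutex_unlock(&demo_mutex);
        }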

Using mutexes in scull: scull_dev structure revisited

struct scull_dev {
        struct scull_qset *data;   /* Pointer to first quantum set */
        int quantum;               /* the current quantum size */
        int qset;                  /* the current array size */
        unsigned long size;        /* amount of data stored here */
        unsigned int access_key;   /* used by sculluid & scullpriv */
        struct mutex mutex;        /* mutual exclusion */
        struct cdev cdev;          /* Char device structure */
};

Using mutexes in scull_dev initialization

for (i = 0; i < scull_nr_devs; i++) {
        scull_devices[i].quantum = scull_quantum;
        scull_devices[i].qset = scull_qset;
        mutex_init(&scull_devices[i].mutex);   /* before cdev_add */
        scull_setup_cdev(&scull_devices[i], i);
}

Using mutexes in scull_write()

if (mutex_lock_interruptible(&dev->mutex))
        return -ERESTARTSYS;

- scull_write() ends with:

out:
        mutex_unlock(&dev->mutex);
        return retval;

mutex_lock_interruptible() returns nonzero

- If the system call can be resubmitted:
  - Undo any visible changes and restart (return -ERESTARTSYS)
- Otherwise return -EINTR
  - E.g., when the changes could not be undone

Restartable system call

- Automatic restarting of certain interrupted system calls
  - Retry with the same arguments (values)
  - Simplifies user-space programming for dealing with "interrupted system call"
  - The driver returns -ERESTARTSYS
- POSIX permits an implementation to restart system calls, but it is not required. SUS defines the SA_RESTART flag to provide a means by which an application can request that interrupted system calls be restarted.
- http://pubs.opengroup.org/onlinepubs/009604499/functions/sigaction.html

Restartable system call

- Arguments may need to be modified
  - Return -ERESTART_RESTARTBLOCK
  - Specify a callback function to modify the arguments
  - http://lwn.net/Articles/17744/

Userspace write() and kernelspace *_interruptible()

- From the POSIX man page:
  - If write() is interrupted by a signal before it writes any data, it shall return -1 with errno set to [EINTR].
  - If write() is interrupted by a signal after it successfully writes some data, it shall return the number of bytes written.
- http://pubs.opengroup.org/onlinepubs/009604499/functions/sigaction.html

mutex_lock_killable()

- With mutex_lock(), the process assumes that it cannot be interrupted by a signal
  - Breaking that assumption breaks the user/kernel-space interface
  - If the process receives a fatal signal and mutex_lock() never returns, the result is an immortal process
- With mutex_lock_killable(), the assumptions/expectations do not apply if the process receives a fatal signal
  - The process that made the system call will never return to user space
  - Does not break the assumption, since the process does not continue (usage sketched below)
- http://lwn.net/Articles/288056/
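A hedged sketch of the killable variant (struct demo_dev and the function are illustrative, not from the slides):

        #include <linux/mutex.h>

        struct demo_dev {
                struct mutex mutex;
                /* ... */
        };

        static int demo_op(struct demo_dev *dev)
        {
                if (mutex_lock_killable(&dev->mutex))
                        return -EINTR;  /* reached only on a fatal signal; the
                                           dying process never sees this value */
                /* ... critical section ... */
                mutex_unlock(&dev->mutex);
                return 0;
        }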

MUTEX USAGE AS COMPLETION (ERROR)
https://lkml.org/lkml/2013/12/2/997

General Pattern

- A refcount variable decides which thread performs cleanup
- Usage:
  - Initialize the shared object
  - Set refcount to the number of concurrent threads
  - Start the threads
  - The last thread cleans up:

/* do stuff */
mutex_lock(obj->lock);
dead = !--obj->refcount;
mutex_unlock(obj->lock);
if (dead)
        free(obj);

fs/pipe.c

__pipe_lock(pipe);
...
spin_lock(&inode->i_lock);
if (!--pipe->files) {
        inode->i_pipe = NULL;
        kill = 1;
}
spin_unlock(&inode->i_lock);
__pipe_unlock(pipe);
if (kill)
        free_pipe_info(pipe);

CPU 1:

        mutex_lock(obj->lock);
        dead = !--obj->refcount;   // refcount was 2, is now 1, dead = 0
        mutex_unlock(obj->lock);
        // __mutex_fastpath_unlock(): the fastpath fails
        // (because the mutex count is nonpositive), so:
        // __mutex_unlock_slowpath:
        //         if (__mutex_slowpath_needs_to_unlock())
        //                 atomic_set(&lock->count, 1);

CPU 2:

        mutex_lock(obj->lock);
        // blocks on obj->lock, goes to the slowpath;
        // the mutex is negative, so CPU 2 is in optimistic
        // spinning mode in __mutex_lock_common:
        //         if ((atomic_read(&lock->count) == 1) &&
        //             (atomic_cmpxchg(&lock->count, 1, 0) == 1)) { ...
        // ... and now CPU 2 owns the mutex, and goes on:
        dead = !--obj->refcount;   // refcount was 1, is now 0, dead = 1
        mutex_unlock(obj->lock);
        if (dead)
                free(obj);

But in the meantime, CPU 1 is still busy unlocking:

        if (!list_empty(&lock->wait_list)) { ...

CPU 1 is now touching the lock inside an object that CPU 2 has already freed.

Conclusion

- A mutex serializes what is inside the critical section, but not necessarily the lock ITSELF
- Use spinlocks and/or atomic reference counts
- "Don't use mutexes to implement completions"

COMPLETIONS

Completions

- Start an operation and wait for it to complete (outside the current thread)
  - A common pattern in kernel programming
  - E.g., wait for initialization to complete
- Reasons to use completions instead of mutexes:
  - Wake up multiple threads
  - More efficient
  - More meaningful syntax
  - Subtle races with mutex implementation code
    - Cleanup of the mutex itself
    - http://lkml.iu.edu//hypermail/linux/kernel/0107.3/0674.html
    - https://lkml.org/lkml/2008/4/11/323
- #include <linux/completion.h>

Completions

- To create a completion:

DECLARE_COMPLETION(my_completion);

- Or:

struct completion my_completion;
init_completion(&my_completion);

- To wait for the completion, call:

void wait_for_completion(struct completion *c);
int wait_for_completion_interruptible(struct completion *c);
unsigned long wait_for_completion_timeout(struct completion *c,
                                          unsigned long timeout);

Completions

- To signal a completion event, call one of the following:

/* wake up one waiting thread */
void complete(struct completion *c);

/* wake up all waiting threads */
/* call INIT_COMPLETION(struct completion c) to reuse
   the completion structure */
void complete_all(struct completion *c);

Completions

- Example: misc-modules/complete.c

DECLARE_COMPLETION(comp);

ssize_t complete_read(struct file *filp, char __user *buf,
                      size_t count, loff_t *pos)
{
        printk(KERN_DEBUG "process %i (%s) going to sleep\n",
               current->pid, current->comm);
        wait_for_completion(&comp);
        printk(KERN_DEBUG "awoken %i (%s)\n",
               current->pid, current->comm);
        return 0; /* EOF */
}

Completions

- Example (continued):

ssize_t complete_write(struct file *filp, const char __user *buf,
                       size_t count, loff_t *pos)
{
        printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
               current->pid, current->comm);
        complete(&comp);
        return count; /* succeed, to avoid retrial */
}

SPINLOCKS

Spinlocks

- Generally used in code that should not sleep (e.g., interrupt handlers)
- Usually implemented as a single bit
  - If the lock is available, the bit is set and the code continues
  - If the lock is taken, the code enters a tight loop
    - Repeatedly checks the lock until it becomes available

Spinlocks

- Actual implementation varies across architectures
- Protect a process from other CPUs and interrupts
- Usually do nothing on uniprocessor machines
  - Exception: changing the IRQ masking status

Introduction to Spinlock API

- #include <linux/spinlock.h>
- To initialize, declare:

spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
/* LDD3-era static initializer; newer kernels
   use DEFINE_SPINLOCK(my_lock) instead */

- Or call:

void spin_lock_init(spinlock_t *lock);

- To acquire a lock, call:

void spin_lock(spinlock_t *lock);

- Spinlock waits are uninterruptible
- To release a lock, call:

void spin_unlock(spinlock_t *lock);

Spinlocks and Atomic Context

- While holding a spinlock, be atomic
  - Do not sleep or relinquish the processor
  - Examples of calls that can sleep:
    - Copying data to or from user space (the user-space page may need to be brought in from disk)
    - Memory allocation (memory might not be available)
- Disable interrupts (on the local CPU) as needed
- Hold spinlocks for the minimum time possible

The Spinlock Functions

- Four functions to acquire a spinlock:

void spin_lock(spinlock_t *lock);

/* disables interrupts on the local CPU */
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);

/* only if no other code disabled interrupts */
void spin_lock_irq(spinlock_t *lock);

/* disables software interrupts (e.g., tasklets);
   leaves hardware interrupts enabled */
void spin_lock_bh(spinlock_t *lock);

The Spinlock Functions

- Four functions to release a spinlock (pairing sketched below):

void spin_unlock(spinlock_t *lock);

/* need to use the same flags variable used for locking */
/* need to call spin_lock_irqsave and spin_unlock_irqrestore
   in the same function, or your code may break on some
   architectures */
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);

void spin_unlock_irq(spinlock_t *lock);
void spin_unlock_bh(spinlock_t *lock);
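A minimal sketch of the save/restore pairing (lock and function names are illustrative):

        #include <linux/spinlock.h>

        static DEFINE_SPINLOCK(demo_lock);

        static void demo_touch_shared(void)
        {
                unsigned long flags;

                spin_lock_irqsave(&demo_lock, flags);      /* saves IRQ state,
                                                              disables IRQs */
                /* ... short, non-sleeping critical section ... */
                spin_unlock_irqrestore(&demo_lock, flags); /* same flags variable,
                                                              same function */
        }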

Locking Traps

- It is very hard to manage concurrency
- What can possibly go wrong?

Ambiguous Rules

- Shared data structure D, protected by lock L:

function A() {
        lock(&L);
        /* call function B(), which accesses D */
        unlock(&L);
}

- If function B() calls lock(&L), we have a deadlock

Ambiguous Rules

- Solution:
  - Have clear entry points to access data structures
  - Document assumptions about locking

Lock Ordering Rules

function A() {
        lock(&L1);
        lock(&L2);
        /* access D */
        unlock(&L2);
        unlock(&L1);
}

function B() {
        lock(&L2);
        lock(&L1);
        /* access D */
        unlock(&L1);
        unlock(&L2);
}

- Multiple locks should always be acquired in the same order
- Easier said than done

Lock Ordering Rules

function A() {
        lock(&L1);
        X();
        unlock(&L1);
}

function B() {
        lock(&L2);
        Y();
        unlock(&L2);
}

function X() {
        lock(&L2);
        /* access D */
        unlock(&L2);
}

function Y() {
        lock(&L1);
        /* access D */
        unlock(&L1);
}

- A() takes L1 then L2 (via X()), while B() takes L2 then L1 (via Y()); the inconsistent ordering can deadlock

Lock Ordering Rules of Thumb

- Take locks that are local to your code before taking a lock belonging to a more central part of the kernel
  - A lock in central kernel code likely has more users (more contention)
- Obtain the mutex first, before taking the spinlock
  - Grabbing a mutex (which can sleep) while holding a spinlock can lead to deadlocks

Fine- Versus Coarse-Grained Locking

- Coarse-grained locking
  - Poor concurrency
- Fine-grained locking
  - Need to know which lock to acquire
  - And in which order to acquire them
- At the device driver level:
  - Start with coarse-grained locking
  - Refine the granularity as contention arises
  - Enable lockstat to check lock hold times

BKL

- The kernel used to have a "big kernel lock"
  - A giant spinlock, introduced in Linux 2.0
  - Only one CPU could be executing locked kernel code at any time
- The BKL has been removed
  - https://lwn.net/Articles/384855/
  - https://www.linux.com/learn/tutorials/447301:whats-new-in-linux-2639-ding-dong-the-big-kernel-lock-is-dead

Alternatives to Locking

- Lock-free algorithms
- Atomic variables
- Bit operations
- seqlocks
- Read-copy-update (RCU)

Lock-Free Algorithms

- Circular buffer
  - Producer places data into one end of an array
    - When the end of the array is reached, the producer wraps back
  - Consumer removes data from the other end

Lock-Free Algorithms

- Producer and consumer can access the buffer concurrently without race conditions
  - Always store the value before updating the index into the array
  - Need to make sure that the producer/consumer indices do not overrun each other
- A generic circular buffer is available: see <linux/kfifo.h> (idea sketched below)
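A minimal single-producer/single-consumer sketch of the idea; struct ring, RING_SIZE, and the function names are illustrative, and real code should normally just use <linux/kfifo.h>:

        #include <linux/types.h>
        #include <linux/compiler.h>
        #include <asm/barrier.h>

        #define RING_SIZE 16                   /* must be a power of two */

        struct ring {
                unsigned int head;             /* written only by the producer */
                unsigned int tail;             /* written only by the consumer */
                int buf[RING_SIZE];
        };

        static bool ring_put(struct ring *r, int v)     /* producer side */
        {
                unsigned int head = r->head;

                if (head - smp_load_acquire(&r->tail) == RING_SIZE)
                        return false;                   /* full */
                r->buf[head & (RING_SIZE - 1)] = v;     /* store the value first */
                smp_store_release(&r->head, head + 1);  /* then publish the index */
                return true;
        }

        static bool ring_get(struct ring *r, int *v)    /* consumer side */
        {
                unsigned int tail = r->tail;

                if (smp_load_acquire(&r->head) == tail)
                        return false;                   /* empty */
                *v = r->buf[tail & (RING_SIZE - 1)];
                smp_store_release(&r->tail, tail + 1);
                return true;
        }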

ATOMIC VARIABLES

Atomic Variables

- If the shared resource is an integer value, locking is overkill (when the processor supports atomic operations)
- The kernel provides atomic types (usage sketched below):
  - atomic_t: integer
  - atomic64_t: long integer
- Both types must be accessed through special functions (see <asm/atomic.h>)
- SMP safe
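A minimal sketch of the atomic_t API (the counter and function names are illustrative):

        #include <linux/kernel.h>
        #include <asm/atomic.h>

        static atomic_t demo_users = ATOMIC_INIT(0);

        static void demo_get(void)
        {
                atomic_inc(&demo_users);                /* atomic increment */
        }

        static void demo_put(void)
        {
                if (atomic_dec_and_test(&demo_users))   /* decrement, true at zero */
                        printk(KERN_DEBUG "last user gone\n");
        }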

Atomic Variables

- Atomic operations:

atomic_sub(amount, &account1);
atomic_add(amount, &account2);

- Each call is atomic, but the pair is not: another thread can observe the state between the two calls, so if the whole transfer must be atomic, a higher-level locking must be used

Bit Operations

- Atomic bit operations (sketched below)
  - See <asm/bitops.h>
- SMP safe
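A minimal sketch (the flag word, bit number, and function names are illustrative):

        #include <linux/errno.h>
        #include <asm/bitops.h>

        #define DEMO_BUSY 0                  /* a bit number, not a mask */

        static unsigned long demo_flags;     /* bit ops work on unsigned long */

        static int demo_try_start(void)
        {
                /* atomically sets the bit and returns its old value */
                if (test_and_set_bit(DEMO_BUSY, &demo_flags))
                        return -EBUSY;       /* already set by someone else */
                return 0;
        }

        static void demo_finish(void)
        {
                clear_bit(DEMO_BUSY, &demo_flags);
        }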

OTHER SYNCHRONIZATION MECHANISMS

Read-Copy-Update (RCU)

- Assumptions:
  - Reads are common
  - Writes are rare
  - Resources are accessed via pointers
  - All references to those resources are held by atomic code

Read-Copy-Update

- Basic idea (sketched below):
  - The writing thread makes a copy
  - Makes changes to the copy
  - Switches a few pointers to commit the changes
  - Deallocates the old version when all references to it are gone
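A hedged sketch of both sides, assuming the standard RCU primitives; struct demo and demo_ptr are illustrative, and concurrent updaters would additionally need their own lock:

        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct demo {
                int value;
        };

        static struct demo __rcu *demo_ptr;

        static int demo_read(void)                     /* reader: cheap, lockless */
        {
                struct demo *p;
                int v;

                rcu_read_lock();
                p = rcu_dereference(demo_ptr);         /* safe pointer fetch */
                v = p ? p->value : -1;
                rcu_read_unlock();
                return v;
        }

        static void demo_write(int value)              /* updater: copy then switch */
        {
                struct demo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
                struct demo *oldp;

                if (!newp)
                        return;
                newp->value = value;
                oldp = rcu_dereference_protected(demo_ptr, 1);
                rcu_assign_pointer(demo_ptr, newp);    /* commit the new copy */
                synchronize_rcu();                     /* wait out old readers */
                kfree(oldp);                           /* free the old version */
        }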

EVEN MORE…

seqlocks

- Sequential lock
- Designed to protect a small, simple, and frequently accessed resource
- Write access is rare
  - Must obtain an exclusive lock (spinlock)
- Allows readers free access to the resource
  - Lockless operation
  - Readers check for collisions with writers and retry as needed
- Not for protecting pointers

seqlocks

- Expected non-blocking reader usage (writer side sketched below):

do {
        seq = read_seqbegin(&foo);
        ...
} while (read_seqretry(&foo, seq));
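For completeness, a hedged sketch of the writer side; the seqlock_t foo matches the reader example above, and its declaration here is an assumption:

        #include <linux/seqlock.h>

        static DEFINE_SEQLOCK(foo);

        static void foo_write(void)
        {
                write_seqlock(&foo);     /* takes the internal spinlock and
                                            bumps the sequence count */
                /* ... update the protected data ... */
                write_sequnlock(&foo);
        }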

lglock (local/global locks)

- Fast per-CPU access
- Allows access to other CPUs' data (slow)
- Implementation:
  - A per-CPU array of spinlocks
  - Can only be declared as global variables, to avoid overhead and keep things simple
- http://lwn.net/Articles/401738/

brlocks

- Removed from the kernel ("no need to keep brlock macros anymore...")
  - Commit 0f6ed63b170778b9c93fb0ae4017f110c9ee6416, Sat Oct 5 14:19:39 2013 -0400

Reader/Writer Semaphores

- Allow multiple concurrent readers
- Single writer (for infrequent writes)
  - Too many writers can lead to reader starvation (unbounded waiting)
- #include <linux/rwsem.h>
- Do not follow the usual return value convention
  - E.g., returns 1 if successful
- Not interruptible

Reader/Writer Spinlocks

- Analogous to the reader/writer semaphores
  - Allow multiple readers to enter a critical section
  - Provide exclusive access for writers
- #include <linux/spinlock.h>

Reader/Writer Spinlocks

- To declare and initialize, there are two ways:

/* static way */
rwlock_t my_rwlock = RW_LOCK_UNLOCKED;

/* dynamic way */
rwlock_t my_rwlock;
rwlock_init(&my_rwlock);

Reader/Writer Spinlocks

- Similar functions are available for readers:

void read_lock(rwlock_t *lock);
read_lock_irqsave(rwlock_t *lock, unsigned long flags);
read_lock_irq(rwlock_t *lock);
read_lock_bh(rwlock_t *lock);

void read_unlock(rwlock_t *lock);
void read_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
void read_unlock_irq(rwlock_t *lock);
void read_unlock_bh(rwlock_t *lock);

Reader/Writer Spinlocks

- Similar functions are available for writers (usage sketched below):

void write_lock(rwlock_t *lock);
write_lock_irqsave(rwlock_t *lock, unsigned long flags);
write_lock_irq(rwlock_t *lock);
write_lock_bh(rwlock_t *lock);

void write_unlock(rwlock_t *lock);
void write_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
void write_unlock_irq(rwlock_t *lock);
void write_unlock_bh(rwlock_t *lock);
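A minimal usage sketch (the lock, data, and function names are illustrative):

        #include <linux/spinlock.h>

        static DEFINE_RWLOCK(demo_rwlock);
        static int demo_value;

        static int demo_read_value(void)
        {
                int v;

                read_lock(&demo_rwlock);      /* many readers may hold this */
                v = demo_value;
                read_unlock(&demo_rwlock);
                return v;
        }

        static void demo_write_value(int v)
        {
                write_lock(&demo_rwlock);     /* excludes readers and other writers */
                demo_value = v;
                write_unlock(&demo_rwlock);
        }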