
The ABC’s of Atomics: An Introduction to std::atomic<T> and the C++11 Memory Model

The ABC’s of Atomics • C++11 atomic operations library • Header <atomic> • Atomic types • std::atomic<T> • std::atomic_flag • Functions • Standalone fences • Dozens of functions for C compatibility

ATOMICITY

Data Races A data race is a race condition that occurs if multiple threads concurrently access the same memory location, without synchronisation, and at least one of those accesses is a write.

Atomicity Atomic (a-tomos, undividable) == indivisible, data-race-free

Atomic Load & Store std::atomic<int> x; void f(int); x.store(1); x = 2; f(x.load()); f(x); int i = x; /* C interoperability: */ std::atomic_store(&x, 3); int j = std::atomic_load(&x);

Atomic Load & Store std::atomic<std::uint64_t> x; x = 1; x = ~0ul; assert(x == 1 || x == ~0ul);

Atomic Load & Store std::atomic<std::uint64_t> x; x = 1; assert(x == 0 || x == 1 || x == ~0ul); std::atomic<int> y(0); x = ~0ul;

Atomic Load & Store std::atomic<std::uint64_t> x; x = 1; x = ~0ul; auto r = x.load(); assert(r == 0 || r == 1 || r == ~0ul);

Atomic Load & Store std::atomic<std::uint64_t> x; x = 1; /* ... */ x = ~0ul; assert(x == 0 || x == 1 || x == ~0ul);

Atomic Load & Store For any (trivially copyable) type T std::atomic<int> x; std::atomic<long int> y; struct S { /* ... */ }; std::atomic<S> z; assert(x.is_lock_free()); /* most platforms */ assert(!z.is_lock_free());
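The load/store slides above can be tried out with a small sketch. The helper name observe_concurrent_store is hypothetical and not part of the talk; it assumes one writer and one reader thread. Whatever the interleaving, the reader can only observe the initial value or one of the two stored values, never a torn mix of both.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical helper: a writer stores two values while a reader loads once.
// Returns the value the reader observed.
std::uint64_t observe_concurrent_store() {
    std::atomic<std::uint64_t> x(0);
    std::uint64_t seen = 0;

    std::thread writer([&x] {
        x.store(1);               // explicit atomic store
        x = ~std::uint64_t(0);    // assignment is an atomic store as well
    });
    std::thread reader([&x, &seen] {
        seen = x.load();          // atomic load: never a half-written value
    });

    writer.join();
    reader.join();                // join makes 'seen' safe to read here
    return seen;
}
```

Calling the helper repeatedly may return different values from run to run, but always one of 0, 1 or all bits set.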

Atomic Exchange For any (trivially copyable) type T const auto before = x.exchange(123); /* Pseudocode, executed as one atomic step: */ T exchange(T newValue) { T oldValue = load(); store(newValue); return oldValue; }

Atomic Compare Exchange For any (trivially copyable) type T /* Pseudocode, executed as one atomic step: */ bool compare_exchange(T& expected, T desired) { const T currentValue = load(); if (currentValue == expected) { store(desired); return true; } else { expected = currentValue; return false; } }

Atomic Compare Exchange • For any (trivially copyable) type T • Powerful tool for lock-free data structures • Two flavors • compare_exchange_strong(T& expected, T desired) • For use as standalone operation • compare_exchange_weak(T& expected, T desired) • May “fail” spuriously from time to time, even if the value equals expected • For use in a loop
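As an illustration of "for use in a loop", here is a minimal sketch of a read-modify-write operation that std::atomic does not provide directly. The function name fetch_multiply is an assumption made for this example; only load and compare_exchange_weak are standard.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically multiply 'value' by 'factor'.
// Returns the value held just before the multiplication.
int fetch_multiply(std::atomic<int>& value, int factor) {
    int expected = value.load();
    // On failure (spurious or because another thread changed the value),
    // compare_exchange_weak writes the current value into 'expected',
    // so the loop simply retries with fresh data.
    while (!value.compare_exchange_weak(expected, expected * factor)) {
        // nothing to do: 'expected' has been refreshed
    }
    return expected;
}
```

The same loop shape works for any update that can be computed from the old value.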

Atomic Operations For any (trivially copyable) type T • Atomic load • Atomic store • Atomic exchange • Atomic compare_exchange_strong • Atomic compare_exchange_weak

Specialisations & Typedefs
std::atomic_bool      std::atomic<bool>
std::atomic_char      std::atomic<char>
std::atomic_schar     std::atomic<signed char>
std::atomic_uchar     std::atomic<unsigned char>
std::atomic_short     std::atomic<short>
std::atomic_ushort    std::atomic<unsigned short>
std::atomic_int       std::atomic<int>
std::atomic_uint      std::atomic<unsigned int>
std::atomic_long      std::atomic<long>
std::atomic_ulong     std::atomic<unsigned long>
std::atomic_llong     std::atomic<long long>
std::atomic_ullong    std::atomic<unsigned long long>
std::atomic_char16_t  std::atomic<char16_t>
std::atomic_char32_t  std::atomic<char32_t>
std::atomic_wchar_t   std::atomic<wchar_t>

Specialisations & Typedefs
std::atomic_int_least8_t    std::atomic<std::int_least8_t>
std::atomic_uint_least8_t   std::atomic<std::uint_least8_t>
std::atomic_int_least16_t   std::atomic<std::int_least16_t>
std::atomic_uint_least16_t  std::atomic<std::uint_least16_t>
std::atomic_int_least32_t   std::atomic<std::int_least32_t>
std::atomic_uint_least32_t  std::atomic<std::uint_least32_t>
std::atomic_int_least64_t   std::atomic<std::int_least64_t>
std::atomic_uint_least64_t  std::atomic<std::uint_least64_t>
std::atomic_int_fast8_t     std::atomic<std::int_fast8_t>
std::atomic_uint_fast8_t    std::atomic<std::uint_fast8_t>
std::atomic_int_fast16_t    std::atomic<std::int_fast16_t>
std::atomic_uint_fast16_t   std::atomic<std::uint_fast16_t>
std::atomic_int_fast32_t    std::atomic<std::int_fast32_t>
std::atomic_uint_fast32_t   std::atomic<std::uint_fast32_t>
std::atomic_int_fast64_t    std::atomic<std::int_fast64_t>

Specialised Atomic Operations • For integral types T • operator++, operator++(int) • operator+=, operator-=, operator|=, operator&=, operator^= • fetch_add, fetch_sub, fetch_or, fetch_and, fetch_xor • For pointer types T* • operator++, operator++(int) • operator+=, operator-= • fetch_add, fetch_sub
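The integral operations listed above can be illustrated with a shared counter. This is a sketch under simple assumptions; the helper name count_concurrently and its parameters are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: 'threads' workers each increment a shared counter
// 'increments' times. Returns the final counter value.
int count_concurrently(int threads, int increments) {
    std::atomic<int> counter(0);
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, increments] {
            for (int i = 0; i < increments; ++i)
                ++counter;        // atomic read-modify-write, like fetch_add(1)
        });
    }
    for (auto& worker : pool)
        worker.join();

    return counter.load();
}
```

With a plain int instead of std::atomic<int>, the increments would race and updates could be lost; with the atomic, the result is always threads * increments.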

std::atomic_flag • Guaranteed lock-free • No load, store, exchange, compare_exchange • Instead: • Initialisation with ATOMIC_FLAG_INIT (copy assignment is deleted) • bool test_and_set() // Set to true, return the previous value • void clear() // Set to false

Takeaways • Atomics offer data-race-free operations • Any type, integral and pointer types in particular • Load, store, compare exchange, increment, … • Portable • Efficient (lock-free)

AS-IF
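A classic use of test_and_set and clear is a spin lock. The sketch below is illustrative only: the SpinLock class and the spin_count helper are assumptions of this example, and a busy-waiting lock like this is rarely the right choice in production code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical minimal spin lock built on std::atomic_flag.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value: spin while it was already set.
        while (flag.test_and_set()) { /* busy-wait */ }
    }
    void unlock() { flag.clear(); }
};

// Hypothetical helper: two threads increment a plain int under the spin lock.
int spin_count(int increments) {
    SpinLock lock;
    int counter = 0;              // not atomic: protected by the lock
    auto work = [&lock, &counter, increments] {
        for (int i = 0; i < increments; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```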

The As-If Rule A conforming implementation is free to choose how it executes a well-formed program, as long as the program’s observable behaviour is as if it were executed as written.

Single-threaded Optimisations
Original: for (int r = 0; r < rows; ++r) for (int c = 0; c < cols; ++c) sum += array[c*rows+r];
Loops interchanged: for (int c = 0; c < cols; ++c) for (int r = 0; r < rows; ++r) sum += array[c*rows+r];
Local accumulator: int acc = sum; for (int c = 0; c < cols; ++c) for (int r = 0; r < rows; ++r) acc += array[c*rows+r]; sum = acc;

Single-threaded Optimisations
Original: for (int r = 0; r < rows; ++r) for (int c = 0; c < cols; ++c) sum += array[c*rows+r];
Running index: int acc = sum, i = 0; for (int c = 0; c < cols; ++c) for (int r = 0; r < rows; ++r) acc += array[i++]; sum = acc;
Or even: sum = 42;

Code Transformations bool x, y; x = true; if (y) cout << "y"; y = true; if (x) cout << "x";

Compiler Transformations bool x, y; if (y) cout << "y"; x = true; if (x) cout << "x"; y = true;

Real-life Examples

Code Transformations bool x, y; x = true; if (y) cout << "y"; y = true; if (x) cout << "x";

Compiler Transformations bool x, y; bool r = y; x = true; if (r) cout << "y"; bool r = x; y = true; if (r) cout << "x";

Code Transformations bool x, y; x = true; if (y) cout << "y"; y = true; if (x) cout << "x";

Store Buffer [Diagram: Processor 1 runs x = true; if (y) cout << "y"; Processor 2 runs y = true; if (x) cout << "x"; each processor writes through its own store buffer into the coherent cache / main memory.]

Code Transformations [Diagram: Source Code → Compiler (reordering, subexpression elimination, inlining, unrolling, …) → Processor (out-of-order execution, speculation, …) → Cache (store buffer, …) → Executed Code]

Takeaways 1. Implementation rarely executes what you wrote • Code is reordered, omitted, invented • Compiler, processor, cache: all equivalent • Critical for performance 2. Atomicity + As-if rule: not enough! • Need to restrict code transformations

BARRIERS

Critical Regions using Mutexes std::string x; std::mutex x_mutex; x_mutex.lock(); x = "Hello, world"; x_mutex.unlock(); x = "Hello, world"; x_mutex.lock(); x_mutex.unlock();

Critical Regions using Mutexes std::string x; std::mutex x_mutex; x_mutex.lock(); x = "Hello, world"; x_mutex.unlock(); x = "Hello, world"; x_mutex.lock(); x_mutex.unlock(); x = "Hello, world";

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; y_mutex.lock(); x = "Hell"; y = "o, w"; y_mutex.unlock(); z = "orld";

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; y_mutex.lock(); x = "Hell"; y = "o, w"; z = "orld"; y_mutex.unlock();

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; y_mutex.lock(); z = "orld"; y = "o, w"; x = "Hell"; y_mutex.unlock();

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; x = "Hell"; z = "orld"; y_mutex.lock(); y = "o, w"; y_mutex.unlock();

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; x = "Hell"; y_mutex.lock(); y = "o, w"; z = "orld"; y_mutex.unlock();

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; x = "Hell"; y_mutex.lock(); z = "orld"; y = "o, w"; y_mutex.unlock();

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; x = "Hell"; z = "orld"; y_mutex.lock(); y = "o, w"; y_mutex.unlock();

Critical Regions using Mutexes std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); x = "Hell"; z = "orld";

Memory Barriers std::string x, y, z; std::mutex y_mutex; x = "Hell"; y_mutex.lock(); /* Acquire barrier */ y = "o, w"; y_mutex.unlock(); /* Release barrier */ z = "orld";
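The lock/unlock pair above is usually written with std::lock_guard, which locks in its constructor and unlocks in its destructor. A minimal sketch, assuming two threads that append to the same string; the helper name guarded_append is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical helper: two threads append to a shared string inside a
// critical region. Returns the final length of the string.
std::size_t guarded_append() {
    std::string y;
    std::mutex y_mutex;

    auto append = [&y, &y_mutex] {
        // Constructor calls y_mutex.lock(): acts as an acquire barrier.
        std::lock_guard<std::mutex> guard(y_mutex);
        y += "o, w";
        // Destructor calls y_mutex.unlock(): acts as a release barrier.
    };

    std::thread a(append), b(append);
    a.join();
    b.join();
    return y.size();
}
```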

Atomic Barriers bool x, y; x = true; y = true; if (y) cout << "y"; if (x) cout << "x";

Atomic Barriers “synchronised” bool x, y; x = true; y = true; if (y) cout << "y"; if (x) cout << "x";

Atomic Barriers std::atomic<bool> x, y; Atomic Store == Release Barrier x = true; y = true; if (y) cout << "y"; if (x) cout << "x"; Atomic Load == Acquire Barrier

Takeaways • Acquire barriers • mutex::lock, atomic::load, … • Code may flow down, but not up • “Wait until acquired” • Release barriers • mutex::unlock, atomic::store, … • Code may flow up, but not down • “Finish before releasing” • Atomicity + acquire/release barriers: not enough!

CONSISTENCY

Acquire & Release Barriers std::atomic<bool> x, y; x = true; y = true; if (y) cout << "y"; if (x) cout << "x";

Sequential Consistency (SC) [Diagram comparing “plain” acquire & release barriers with SC acquire & release barriers]

Sequentially Consistent Barriers std::atomic<bool> x, y; Atomic Store == SC Release Barrier x = true; y = true; if (y) cout << "y"; if (x) cout << "x"; Atomic Load == SC Acquire Barrier
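The x/y example can be turned into a small experiment. The sketch below assumes the default (sequentially consistent) ordering of std::atomic<bool> and uses a hypothetical helper name. With sequential consistency, the two threads cannot both miss each other's store, so at least one of the loads returns true.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: each thread sets its own flag, then reads the other's.
// Returns true if at least one thread saw the other thread's store.
bool at_least_one_sees_other() {
    std::atomic<bool> x(false), y(false);
    bool saw_y = false, saw_x = false;

    std::thread t1([&] { x = true; saw_y = y; });   // SC store, then SC load
    std::thread t2([&] { y = true; saw_x = x; });   // SC store, then SC load

    t1.join();
    t2.join();
    return saw_y || saw_x;
}
```

With plain acquire/release ordering instead, both loads would be allowed to return false.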

Sequential Consistency std::string s; std::atomic<bool> ready; s = "Hello, world!"; ready = true; while (!ready) {} cout << s;
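A runnable sketch of the ready-flag pattern on this slide, under the assumption of one producer and one consumer thread. The helper name receive_greeting is made up for the example; the consumer copies the string instead of printing it so the result can be checked.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical helper: the producer writes a plain string and then sets an
// atomic flag; the consumer waits for the flag before reading the string.
std::string receive_greeting() {
    std::string s;                       // plain, non-atomic data
    std::atomic<bool> ready(false);      // explicitly initialised flag
    std::string received;

    std::thread consumer([&] {
        while (!ready) {}                // atomic load: acquire barrier
        received = s;                    // safe: happens after the store to s
    });
    std::thread producer([&] {
        s = "Hello, world!";
        ready = true;                    // atomic store: release barrier
    });

    producer.join();
    consumer.join();
    return received;
}
```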

Sequentially Consistent Pointers std::atomic<std::string*> s; s = new string("Hello"); while (!s) {} cout << *s;

Sequential Consistency std::atomic<std::string*> s; auto temp = new string("Hello"); s = temp; while (!s) {} cout << *s;

Double-Checked Locking is Unbroken class Singleton { /* ... */ }; std::atomic<Singleton*> instance; std::mutex m; Singleton* GetInstance() { if (instance == nullptr) { std::lock_guard<std::mutex> lock(m); if (instance == nullptr) instance = new Singleton(); } return instance; }
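For experimentation, the slide's double-checked locking can be completed into a self-contained sketch. The Singleton body, the local pointer p and the checking helper same_instance_everywhere are assumptions of this example; the instance is deliberately never deleted, as in the slide.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

class Singleton { public: int value = 42; };   // placeholder body

std::atomic<Singleton*> instance(nullptr);
std::mutex m;

Singleton* GetInstance() {
    Singleton* p = instance.load();            // first check, no lock taken
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(m);
        p = instance.load();                   // second check, under the lock
        if (p == nullptr) {
            p = new Singleton();
            instance.store(p);                 // publish the finished object
        }
    }
    return p;
}

// Hypothetical check: two threads must obtain the very same object.
bool same_instance_everywhere() {
    Singleton* first = nullptr;
    Singleton* second = nullptr;
    std::thread a([&first] { first = GetInstance(); });
    std::thread b([&second] { second = GetInstance(); });
    a.join();
    b.join();
    return first != nullptr && first == second;
}
```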

Sequential Consistency: Transitivity bool g; std::atomic<bool> x, y; g = true; x = true; if (x) y = true; if (y) assert(g);

Sequential Consistency: Total Order std::atomic<bool> x, y; x = true; if (x && !y) cout << "x first"; y = true; if (y && !x) cout << "y first";

Key Takeaway Don’t write race conditions, and use sequentially consistent atomics, and your code will do what you think.

DON’T DO IT, EXPERTS ONLY

Don’t Do It, Experts Only The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.

Memory Order std::atomic_bool x, y; x = true; if (y) cout << "y"; y = true; if (x) cout << "x";

Memory Order std::atomic_bool x, y; x.store(true); if (y.load()) cout << "y"; y.store(true); if (x.load()) cout << "x";

Memory Order std::atomic_bool x, y; x.store(true, std::memory_order_seq_cst); if (y.load(std::memory_order_seq_cst)) cout << "y"; y.store(true, std::memory_order_seq_cst); if (x.load(std::memory_order_seq_cst)) cout << "x";

Memory Order enum memory_order { memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst // default };

Relaxed Memory Order std::atomic_bool x, y; x.store(true, std::memory_order_relaxed); if (y.load(std::memory_order_relaxed)) cout << "y"; y.store(true, std::memory_order_relaxed); if (x.load(std::memory_order_relaxed)) cout << "x";

Acquire/Release Memory Order std::atomic_bool x, y; x.store(true, std::memory_order_release); if (y.load(std::memory_order_acquire)) cout << "y"; y.store(true, std::memory_order_release); if (x.load(std::memory_order_acquire)) cout << "x";
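One case where plain acquire/release ordering is sufficient is handing data from one thread to another. The sketch below is illustrative and assumes a single producer and a single consumer; the helper name receive_payload is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: pass a plain int from producer to consumer using an
// explicit release store and acquire load on a flag.
int receive_payload() {
    int payload = 0;                     // plain, non-atomic data
    std::atomic<bool> ready(false);
    int received = -1;

    std::thread consumer([&] {
        // The acquire load pairs with the release store below.
        while (!ready.load(std::memory_order_acquire)) {}
        received = payload;              // guaranteed to see 42
    });
    std::thread producer([&] {
        payload = 42;
        // Release: the write to payload cannot move below this store.
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return received;
}
```

This only covers the pairwise hand-off; it gives no single global order across several flags, which is what the x/y examples above rely on.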

Acquire/Release Memory Order std::atomic_int x, y; #define acquire std::memory_order_acquire #define release std::memory_order_release x.store(1, release); if (x.load(acquire) && !y.load(acquire)) cout << "x first"; y.store(1, release); if (y.load(acquire) && !x.load(acquire)) cout << "y first";

Don’t Do It, Experts Only The difference between acq_rel and seq_cst is generally whether the operation is required to participate in the single global order of sequentially consistent operations. This has subtle and unintuitive effects. The fences in the current standard may be the most experts-only construct [in C++].

Peterson’s Mutex (Bartosz Milewski) class PetersonMutexBM { std::atomic<bool> m_interested[2]; std::atomic<unsigned> m_victim; public: void lock() { const auto me = binary_thread_id(); // 0 or 1 const unsigned he = 1 - me; // 1 or 0 m_interested[me].exchange(true, acq_rel); m_victim.store(me, release); while (m_interested[he].load(acquire) && m_victim.load(acquire) == me); }

Peterson’s Mutex (Dmitriy V'jukov) class PetersonMutexDV { std::atomic<bool> m_interested[2]; std::atomic<unsigned> m_victim; public: void lock() { const auto me = binary_thread_id(); // 0 or 1 const unsigned he = 1 - me; // 1 or 0 m_interested[me].store(true, relaxed); m_victim.exchange(me, acq_rel); while (m_interested[he].load(acquire) && m_victim.load(relaxed) == me); }

Relaxed Double Checked Locking (Herb Sutter) • Questions 1. Is it correct? 2. Is it worth it?

Takeaway: Relaxed, Don’t Do it

FENCES

Standalone Fences • Fence == barrier • std::atomic_thread_fence(std::memory_order) & std::atomic_signal_fence(std::memory_order) • memory_order_relaxed // does nothing • memory_order_consume • memory_order_acquire • memory_order_release • memory_order_acq_rel • memory_order_seq_cst // (no default)

Fences bool x, y; using namespace std; x = true; atomic_thread_fence( memory_order_seq_cst ); if (y) cout << "y"; y = true; atomic_thread_fence( memory_order_seq_cst ); if (x) cout << "x";
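For comparison, here is a sketch of a fence-based hand-off. Unlike the slide, it uses a relaxed atomic flag rather than plain bools (an assumption of this example), so that the fences have atomic operations to pair with. The helper name receive_with_fences is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: hand a plain int from producer to consumer using
// relaxed atomic operations plus standalone fences.
int receive_with_fences() {
    int payload = 0;                     // plain, non-atomic data
    std::atomic<bool> ready(false);
    int received = -1;

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {}
        // Acquire fence after the load: later reads cannot move above it.
        std::atomic_thread_fence(std::memory_order_acquire);
        received = payload;
    });
    std::thread producer([&] {
        payload = 42;
        // Release fence before the store: earlier writes cannot move below it.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    producer.join();
    consumer.join();
    return received;
}
```

The same effect is obtained more simply with a release store and an acquire load on the flag itself, which is one reason standalone fences are seldom the better option.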

Takeaway • Standalone fences are suboptimal • Error-prone • Suboptimal performance • Cf. “Atomic<> Weapons” by Herb Sutter

LOCK-FREE PROGRAMMING

Lock-Free Programming Don’t do it, experts only! • New lock-free data structure == research article • Cf. “Lock-Free Programming (Or, Juggling Razor Blades)” by Herb Sutter

MAGIC

Double-Checked Locking class Singleton { /* ... */ }; std::atomic<Singleton*> instance; std::mutex m; Singleton* GetInstance() { if (instance == nullptr) { std::lock_guard<std::mutex> lock(m); if (instance == nullptr) instance = new Singleton(); } return instance; }

Magic Statics Work… class Singleton { /* ... */ }; /* Magic statics are thread-safe in C++11 ... */ Singleton& GetInstance() { static Singleton instance; return instance; } /* ... , but check your compiler documentation! */

THREADS
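The magic-statics version can be checked with a short sketch. The Singleton body and the helper magic_static_is_shared are assumptions of this example; the function-local static itself is initialised exactly once, even when several threads call GetInstance concurrently.

```cpp
#include <cassert>
#include <thread>

class Singleton { public: int value = 42; };   // placeholder body

Singleton& GetInstance() {
    static Singleton instance;   // thread-safe initialisation since C++11
    return instance;
}

// Hypothetical check: two threads must obtain the very same object.
bool magic_static_is_shared() {
    Singleton* first = nullptr;
    Singleton* second = nullptr;
    std::thread a([&first] { first = &GetInstance(); });
    std::thread b([&second] { second = &GetInstance(); });
    a.join();
    b.join();
    return first != nullptr && first == second;
}
```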

std::thread • Constructor: release barrier • Start of thread function: acquire barrier • End of thread function: release barrier • Join: acquire barrier • Essentially: • Everything written prior to the thread’s launch can safely be read from the function it executes • Everything written during the thread’s execution can safely be read after std::thread::join() • std::async & std::future are similar

VOLATILE
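The two guarantees above mean that plain variables are enough when data is only exchanged at thread launch and at join. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: the input is written before the thread starts and the
// result is read only after join(), so no atomics or mutexes are needed.
int sum_in_thread() {
    std::vector<int> input = {1, 2, 3, 4};   // written before launch
    int total = 0;

    std::thread worker([&input, &total] {
        for (int v : input)                  // safe: visible at thread start
            total += v;
    });

    worker.join();                           // acquire: worker's writes visible
    return total;                            // safe to read after join()
}
```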

Volatile in C++ • Unoptimisable variables for talking (I/O) to something outside the program • E.g. hardware registers etc. • Deliberately underspecified • Not necessarily atomic • Similar but different reordering constraints • No optimisation • Not even e.g. “v=1; v=2;” or “v=1; r=v;”

Volatile in MS VC++ (MSDN) “Visual Studio interprets the volatile keyword differently depending on the target architecture. For ARM, [the default is] /volatile:iso, [otherwise it is] /volatile:ms; [but] we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives […] when you are dealing with memory that is shared across threads. […] Microsoft Specific [/volatile:ms] • A write to a volatile object […] has Release semantics; […] • A read of a volatile object […] has Acquire semantics; […] Note [Code that relies on] the enhanced guarantee that's provided when the /volatile:ms […] is used, […] is non-portable.”

Volatile in Java / .NET volatile in Java / .NET ≈ std::atomic in C++ • Java • Main inspiration for C++11 memory model • Atomic load & store, sequentially consistent ordering • java.util.concurrent.atomic • Atomic arrays • Atomic increment, exchange, CAS, etc. • .NET • Plain acquire & release, no sequential consistency

Takeaway • volatile in C++ ≠ std::atomic in C++ • Don’t use MS-specific volatile • volatile in Java / .NET ≈ std::atomic in C++

NUTSHELL

Key Takeaway Don’t write race conditions, and don’t use relaxed atomics, and your code will do what you think.

QUESTIONS?

More Information • Herb Sutter • Atomic<> Weapons, C++ and Beyond 2012 talk (part 1, part 2) • Lock-Free Programming (Or, Juggling Razor Blades), CppCon 2014 talk (part 1, part 2) • … • Hans Boehm • Threads Basics (article) • A Less Formal Explanation of the Proposed C++ Concurrency Memory Model (C++11 standard proposal article) • … • Anthony Williams • C++ Concurrency in Action (book) • Just Software Solutions blog • …