Parallelism in the Standard C++: What to Expect in C++17

Parallelism in the Standard C++: What to Expect in C++17
Artur Laksberg, arturl@microsoft.com
Visual C++ Team, Microsoft
September 17, 2014

Agenda
The Fundamentals
Task regions
Parallel Algorithms
Parallelization
Vectorization

Part 1: The Fundamentals

Renderscript, OpenMP, CUDA, C++ AMP, PPL, TBB, MPI, OpenACC, OpenCL, Cilk Plus, GCD

Parallelism in C++11/14
Fundamentals: memory model, atomics
Basics: thread, mutex, condition_variable, async, future
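
As a quick illustration of these C++11/14 basics (my sketch, not code from the talk): std::async returns a std::future that carries the result, or an exception, back to the caller.

// Illustration only, not from the slides: std::async / std::future in action.
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(1000, 1);
    // Run the summation asynchronously; the future delivers the result
    // (or any exception thrown by the task) when we call get().
    std::future<int> sum = std::async(std::launch::async, [&v] {
        return std::accumulate(v.begin(), v.end(), 0);
    });
    std::cout << sum.get() << '\n';
}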

Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

Problem 1: spawning a thread per task is expensive.
Problem 2: fork-join is not enforced.
Problem 3: what about exceptions?

Andrzej Krzemieński: “Do not use naked threads in the program: use RAII-like wrappers instead”
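
A minimal sketch of such a RAII wrapper (my illustration, not code from the talk; C++20 later standardized the idea as std::jthread):

// RAII thread wrapper: joins on scope exit, even when an exception unwinds the stack.
// Illustration only; C++20's std::jthread provides this behaviour out of the box.
#include <thread>
#include <utility>

class scoped_thread {
    std::thread t_;
public:
    explicit scoped_thread(std::thread t) : t_(std::move(t)) {}
    scoped_thread(const scoped_thread&) = delete;
    scoped_thread& operator=(const scoped_thread&) = delete;
    ~scoped_thread() { if (t_.joinable()) t_.join(); }
};

// Usage: scoped_thread t{std::thread([] { /* work */ })};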

Quicksort: Fork-Join Parallelism

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        // parallel region: the two recursive calls are independent tasks
        quicksort(v, start, pivot - 1);  // task
        quicksort(v, pivot + 1, end);    // task
    }
}

Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        task_region([&](auto& r) {  // parallel region
            int pivot = partition(v, start, end);
            r.run([&] { quicksort(v, start, pivot - 1); });  // task
            r.run([&] { quicksort(v, pivot + 1, end); });    // task
        });
    }
}

Under The Hood…

Work Stealing Scheduling

[Diagram, animated over several slides: four processors (proc 1 through proc 4), each with its own deque of work items. A processor pushes and pops new items at one end of its own deque; a processor that runs out of work becomes a “thief” and steals old items from the other end of another processor’s deque.]
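
To make the diagram concrete, here is a heavily simplified sketch of the per-processor deque it depicts (my illustration, not from the talk; a production scheduler would use a lock-free deque rather than a mutex):

// Simplified work-stealing deque: the owner works LIFO at the back (new items),
// idle thieves steal FIFO from the front (old items). Illustration only.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

class work_queue {
    std::deque<std::function<void()>> items_;
    std::mutex m_;
public:
    void push(std::function<void()> task) {            // owner: newest items go to the back
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(std::move(task));
    }
    std::optional<std::function<void()>> pop() {       // owner: take the newest item
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        auto task = std::move(items_.back());
        items_.pop_back();
        return task;
    }
    std::optional<std::function<void()>> steal() {     // thief: take the oldest item
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        auto task = std::move(items_.front());
        items_.pop_front();
        return task;
    }
};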

Fork-Join Parallelism and Work Stealing

e();
task_region([](auto& r) {
    r.run(f);
    g();
});
h();

Q1: What thread runs f()?
Q2: What thread runs g()?
Q3: What thread runs h()?

Work Stealing Design Choices

What thread executes after a spawn?
  Child stealing
  Continuation (parent) stealing

What thread executes after a join?
  Stalling: the initiating thread waits
  Greedy: the last thread to reach the join continues

task_region([](auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});

Part 2: The Algorithms

Alex Stepanov: Start With The Algorithms

Inspiration

Intel Threading Building Blocks
Microsoft Parallel Patterns Library, C++ AMP
Nvidia Thrust
Performing Parallel Operations On Containers

Parallel STL

Just like STL, only parallel…
Can be faster, if you know what you’re doing.
Two execution policies: std::par and std::par_vec

Parallelization: What’s the Big Deal?

Why isn’t it already parallel?

std::sort(begin, end, [](int a, int b) { return a < b; });

User-provided closures must be thread-safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });

The same applies to special member functions, std::swap, etc.
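
One way to make that particular closure thread-safe (my sketch, not from the slides) is to make the counter a std::atomic:

// Thread-safe version of the counting comparator above. With the counter
// atomic, the closure could legally be used with a parallel sort as well.
// Illustration only, not from the slides.
#include <algorithm>
#include <atomic>
#include <vector>

void count_comparisons(std::vector<int>& v) {
    std::atomic<int> comparisons{0};
    std::sort(v.begin(), v.end(), [&](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);  // safe from any thread
        return a < b;
    });
}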

It’s a Contract

What the user can do vs. what the implementer can do.

Asymptotic guarantees: std::sort is O(n*log(n)), std::stable_sort is O(n*log^2(n)). What about a parallel sort? What is a valid implementation? (See the next slide.)

Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort(Iterator first, Iterator last, Compare comp) {
    auto n = last - first;
    std::vector<char> c(n);
    for (;;) {
        bool flag = false;
        for (size_t i = 1; i < n; ++i) {
            c[i] = comp(first[i], first[i-1]);
            flag |= c[i];
        }
        if (!flag) break;
        for (size_t i = 1; i < n; ++i)
            if (c[i]) std::swap(first[i-1], first[i]);
    }
}

Execution Policies

Built-in execution policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const parallel_vector_execution_policy par_vec;

Dynamic execution policy:

class execution_policy {
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};

Using Execution Policies to Write Parallel Code

std::vector<int> vec = ...

// standard sequential sort
std::sort(vec.begin(), vec.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, vec.begin(), vec.end());

// permitting parallel execution
sort(par, vec.begin(), vec.end());

// permitting vectorization as well
sort(par_vec, vec.begin(), vec.end());

Picking an Execution Policy Dynamically

size_t threshold = ...
execution_policy exec = seq;
if (vec.size() > threshold) {
    exec = par;
}
sort(exec, vec.begin(), vec.end());

Exception Handling

In the C++ philosophy, no exception is silently ignored.
exception_list: a container of exception_ptr objects.

try {
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), 0, func1, func2);
} catch (const exception_list& list) {
    for (auto& exptr : list) {
        // process the exception pointer exptr
    }
}

Vectorization: What’s the Big Deal?

int a[n] = ...;
int b[n] = ...;
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c;
}

The compiler vectorizes the loop with SSE instructions (movdqu: Move Unaligned Double Quadword), processing four elements at a time, i.e. a[i:i+3] = b[i:i+3] + c:

movdqu  xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu  xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd   xmm1, xmm2
paddd   xmm1, xmm0
movdqu  XMMWORD PTR _a$[esp+eax+132], xmm1

Vector Lane is not a Thread!

Taking locks: a thread with thread_id x takes a lock… then another “thread” with the same thread_id enters the lock… Deadlock!
Exceptions: can we unwind 1/4th of the stack?

Vectorization: Not So Easy Any More…

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

Aliasing? Side effects? Dependence? Exceptions? The compiler can no longer vectorize; it emits a scalar loop with a call:

mov   ecx, DWORD PTR _b$[esp+esi+140]
add   ecx, edi
mov   DWORD PTR _a$[esp+esi+140], ecx
call  func

How Do We Get This?

void f(float* a, float* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

Desired vectorized form:

for (int i = 0; i < n; i += 4) {
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j)
        func();
}

We need a helping hand from the programmer!

Vectorization Hazard: Locks

Consider: f takes a lock, g releases the lock:

for (int i = 0; i < n; ++i) {
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

A naive vectorization would be:

for (int i = 0; i < n; i += 4) {
    for (int j = 0; j < 4; ++j)
        lock.enter();
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j)
        lock.release();
}

This transformation is not safe!

But Wait, There Is One Little Problem…

Index-based algorithm:

void f(float* a, float* b) {
    for (int i = 0; i < n; ++i) {
        // OK:
        a[i] = b[i] + c;
        func();
    }
}

Element-based algorithm:

void f(float* a, float* b) {
    std::for_each(a, b, [&](float f) {
        // Oops, no ‘i’:
        a[i] = b[i] + c;
        func();
    });
}

Vector Loop with Parallel STL

void f(float* a, float* b) {
    integer_iterator begin{0};
    integer_iterator end{b - a};  // almost; see N3976
    std::for_each(std::par_vec, begin, end, [&](int i) {
        a[i] = b[i] + c;
        func();
    });
}
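
integer_iterator is not a standard type. As a rough, hypothetical stand-in (my sketch, only strong enough for a sequential std::for_each; a real counting iterator for parallel or vector use would be random-access, along the lines of N3976):

// Hypothetical minimal counting iterator, named to match the slide above.
// Illustration only: input-iterator strength, good enough for std::for_each.
#include <cstddef>
#include <iterator>

struct integer_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type        = int;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const int*;
    using reference         = int;

    int value = 0;

    int operator*() const { return value; }
    integer_iterator& operator++() { ++value; return *this; }
    integer_iterator operator++(int) { auto tmp = *this; ++value; return tmp; }
    friend bool operator==(integer_iterator a, integer_iterator b) { return a.value == b.value; }
    friend bool operator!=(integer_iterator a, integer_iterator b) { return a.value != b.value; }
};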

Parallelization vs. Vectorization

Parallelization: threads; each has its own stack; good for divergent code; relatively heavy-weight.
Vectorization: vector lanes; no stack; lock-step execution; very light-weight.

When To Vectorize

std::par: no race conditions.
std::par_vec: same as std::par, plus: no aliasing, no exceptions, no locks, no (or little) divergence.
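
As a small illustration (mine, not from the slides, and assuming a Parallelism TS implementation that provides the <experimental/algorithm> and <experimental/execution_policy> headers), a loop that meets the par_vec checklist:

// No aliasing (each lane writes only its own element), no exceptions,
// no locks, no divergence: a good candidate for par_vec. Illustration only.
#include <experimental/algorithm>
#include <experimental/execution_policy>
#include <vector>

void scale(std::vector<float>& v, float factor) {
    using namespace std::experimental::parallel;
    for_each(par_vec, v.begin(), v.end(), [factor](float& x) { x *= factor; });
}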

References

N3991: Task Region
N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing
N3724: A Parallel Algorithms Library
N3989: Working Draft, Technical Specification for C++ Extensions for Parallelism
N3976: Multidimensional bounds, index and array_view
parallelstl.codeplex.com