Assignment 3 - Stencil
NPRG051 2019/2020, Homework assignment #3 - Stencil

Motivation: LIFE by John Conway
  • The game of LIFE (1970)
  • A cellular automaton
      - 2D grid of cells
      - Two states: Dead, Alive
  • Each cell reproduces depending on its state and the state of its 8 neighbor cells
      - Born if exactly 3 neighbors alive
      - Survives if 2 or 3 neighbors alive
  • Turing complete!
      - Rendell, 2002
  • Author: John Conway
      - Born Dec 26, 1937
      - Died Apr 11, 2020, COVID-19
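The two rules above fit into a single function. The following is a minimal sketch; the name `life_next` and the convention of passing the number of live neighbors as an `int` are assumptions made for illustration, not part of the assignment.

```cpp
#include <cassert>

// Next state of one LIFE cell (hypothetical helper, not part of the assignment).
// `alive_neighbors` is the number of live cells among the 8 neighbors.
bool life_next(bool alive, int alive_neighbors)
{
    assert(alive_neighbors >= 0 && alive_neighbors <= 8);
    if (alive) {
        return alive_neighbors == 2 || alive_neighbors == 3;  // survives
    }
    return alive_neighbors == 3;                              // born
}
```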

Motivation: Rule 110 by Stephen Wolfram
  • Rule 110 (1985)
  • 1D grid of cells, two states
  • One of 256 rules possible for 1D-neighborhood two-state cells
  • a[i] = (110>>(4*a[i-1]+2*a[i]+a[i+1]))&1;
  • Turing complete!
      - Cook, 2005
  • Author: Stephen Wolfram
      - Born Aug 29, 1959
      - www.wolframalpha.com
  [Figure; axis label: time]
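A possible reading of the formula as one generation step is sketched below. It assumes a wrapped-around grid stored as `std::vector<int>` holding 0/1 values, and it writes into a separate output vector so that every new state is computed from old states only. The function name `rule110_step` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One generation of Rule 110 on a wrapped-around 1D grid of 0/1 cells.
// Hypothetical helper for illustration; not the assignment interface.
std::vector<int> rule110_step(const std::vector<int>& a)
{
    assert(!a.empty());
    const std::size_t n = a.size();
    std::vector<int> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int left  = a[(i + n - 1) % n];  // wraps at the left edge
        const int mine  = a[i];
        const int right = a[(i + 1) % n];      // wraps at the right edge
        // The three old states form a 3-bit index into the bits of 110.
        next[i] = (110 >> (4 * left + 2 * mine + right)) & 1;
    }
    return next;
}
```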

Motivation
  • Many mathematical models work as follows:
      - A (multi-dimensional) array of cells
          - Each cell has a state
      - A global clock
          - In each tick, each cell is updated based on the (previous) states of its neighbors
          - The update formula is usually independent of position and time
  • Examples:
      - Cellular automata
          - Ulam & von Neumann, 1948
          - Finite number of states of each cell
          - 1D: Rule 110; 2D: LIFE
      - Stencils
          - Emmons, 1944
          - Numeric solution of partial differential equations (finite difference method)
              - Heat transfer, hydrodynamics, ...
          - State of each cell consists of several real or complex numbers, depending on the problem

Serial evaluation of a stencil
  • Two alternating buffers required
      - Stencils propagate state in both directions
      - The new states must always be computed from the old states
  • Boundary cells may require special handling
      - Wrapping around is the simplest (but usually not physically correct)
  [Figure; axis label: time]
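The two alternating buffers can be sketched as follows. This is only an illustration of the serial scheme with wrap-around boundaries, not the required parallel solution. The name `run_serial` is hypothetical, and `std::vector<ET>` is assumed as storage.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Serial stencil evaluation with two alternating buffers (hypothetical helper).
// `sf(left, mine, right)` returns the new state of the middle cell.
template <typename ET, typename SF>
std::vector<ET> run_serial(std::vector<ET> cur, SF&& sf, std::size_t generations)
{
    const std::size_t n = cur.size();
    if (n == 0) {
        return cur;
    }
    std::vector<ET> next(n);
    for (std::size_t g = 0; g < generations; ++g) {
        for (std::size_t i = 0; i < n; ++i) {
            // Only old states (cur) are read; new states go to the other buffer.
            next[i] = sf(cur[(i + n - 1) % n], cur[i], cur[(i + 1) % n]);
        }
        std::swap(cur, next);  // the buffers exchange roles, no copying
    }
    return cur;
}
```

Swapping the vectors exchanges their internal pointers, so alternating the buffers costs nothing per generation.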

Evaluation of a part of the space
  • If the state is not known outside a part of the space, the result is valid only for a subset of the space
      - Shrinking with time
  [Figure; axis label: time]

Parallel evaluation of a stencil
  [Figure; labels: time, G, W, G+W+G]
  • A running thread cannot continuously observe the results produced by adjacent threads
      - Synchronization in every generation would be too expensive
  • An individual task is to compute G generations, producing W results
      - The input to such a task is G+W+G wide
      - A part of the work is duplicated in adjacent tasks (threads)

Parallel evaluation of a stencil
  [Figure; labels: time, G, W, G+W+G]
  • An individual task is to compute G generations, producing W results
  • The input to such a task is G+W+G wide
  • The same thread may continue with subsequent generations
  • Synchronization with adjacent threads is required every G generations
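One task of this kind might look like the sketch below: it takes G+W+G old states and returns the W states that are still valid after G generations, with the valid range shrinking by one cell on each side per generation. The name `advance_window` and the use of `std::vector<ET>` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Computes G generations on a window of G + W + G old states and returns
// the W valid results (hypothetical helper, not the assignment interface).
template <typename ET, typename SF>
std::vector<ET> advance_window(std::vector<ET> cur, SF&& sf, std::size_t G)
{
    assert(cur.size() >= 2 * G);
    std::vector<ET> next(cur.size());
    std::size_t lo = 0;
    std::size_t hi = cur.size();           // valid states are cur[lo, hi)
    for (std::size_t g = 0; g < G; ++g) {
        ++lo;                              // the valid range shrinks by one
        --hi;                              // cell on each side per generation
        for (std::size_t i = lo; i < hi; ++i) {
            next[i] = sf(cur[i - 1], cur[i], cur[i + 1]);
        }
        std::swap(cur, next);
    }
    return std::vector<ET>(cur.begin() + static_cast<std::ptrdiff_t>(lo),
                           cur.begin() + static_cast<std::ptrdiff_t>(hi));
}
```

The cells outside `[lo, hi)` hold stale values after each swap, but they are never read again, which mirrors the duplicated work at the borders of adjacent tasks.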

Data exchanged between threads
  [Figure; axis label: time]
  • The same thread may continue with subsequent generations
      - Synchronization with adjacent threads is required every G generations
  • Two buffers of size G are sent from each thread to its neighbors
  • Two buffers of size G are received by each thread from its neighbors

Data exchanged between threads
  [Figure; labels: G, G, W-2G, G, G]
  • Every G generations, synchronization is required
  • Before synchronization, each thread holds W valid elements
      - Plus two buffers of size G, containing out-of-date elements
  • Two buffers of size G must be copied from each thread to its neighbors
  • Two possible approaches
      - The receiving thread copies the data from adjacent threads to its own empty buffers
          - The adjacent threads may copy in parallel but they must not start computing further generations
          - Synchronization (rendezvous) required both before and after copying!
      - The sending thread copies the data and then sends the data including ownership of the buffer
          - The out-of-date buffers may be used, provided they are separable
          - Only one synchronization required (waiting for the messages)

Data exchanged between threads
  • Synchronization primitives in C++17
      - Starting and joining threads
          - Requires a master thread
          - Thread start is significantly slower than synchronization between running threads
      - Mutex
          - Not suitable for waiting for an event
      - Promise-future
          - Not reusable!
          - Additional thread-safe mechanism for dispatching promises (or futures) required
      - Condition variable
          - Difficult to understand, coupled with a mutex
          - Spurious wakeups
          - Reusable, does not involve allocation on each use

Data exchanged between threads
  • Emulating rendezvous (between two threads)
      - Corresponds to std::barrier in C++20
      - Could be emulated by a pair of std::binary_semaphore in C++20
          - But you only have C++17
  • With std::condition_variable:
      - Lock a mutex
      - Increment a shared counter
      - Notify a condition variable
      - While the counter is less than 2, wait for the condition variable
          - Waiting internally unlocks the mutex, allowing the other thread to increment
          - Spurious wakeups may occur; repeated testing of the counter is required
      - Unlock the mutex
  • BEWARE: A thread-safe way of resetting the counter is required
      - Resetting must appear after all the threads safely exited the rendezvous but before any of them enters the rendezvous again
      - In our case, we need a rendezvous both before and after the copying; resetting one of them may be done inside the other (if controlled by the same mutex)
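The steps above can be sketched in C++17 as shown below. Note the swapped-in technique: instead of resetting one rendezvous inside the other, this sketch handles the reset with a generation number, where the last arriving thread resets the counter under the same mutex. The class name `rendezvous` and the function `rendezvous_demo` are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Reusable rendezvous for a fixed number of threads (hypothetical class).
class rendezvous {
public:
    explicit rendezvous(std::size_t parties) : parties_(parties) { assert(parties > 0); }

    void arrive_and_wait()
    {
        std::unique_lock<std::mutex> lock(mtx_);
        const std::size_t my_generation = generation_;
        if (++count_ == parties_) {
            count_ = 0;        // reset under the lock, before anyone can re-enter
            ++generation_;
            cv_.notify_all();
        } else {
            // The predicate is re-tested after every wakeup, including spurious ones.
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    const std::size_t parties_;
    std::size_t count_ = 0;
    std::size_t generation_ = 0;
};

// Usage sketch: a rendezvous both before and after reading the neighbor's data.
bool rendezvous_demo(int rounds)
{
    rendezvous r(2);
    int slot[2] = {-1, -1};
    bool ok[2] = {true, true};
    auto worker = [&](int me) {
        for (int round = 0; round < rounds; ++round) {
            slot[me] = round;                            // produce own data
            r.arrive_and_wait();                         // both slots are written now
            if (slot[1 - me] != round) ok[me] = false;   // read the neighbor's data
            r.arrive_and_wait();                         // both reads done, slots reusable
        }
    };
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return ok[0] && ok[1];
}
```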

Data exchanged between threads
  • Emulating message passing
      - Easily emulated by a std::counting_semaphore in C++20
          - But you only have C++17
  • With std::condition_variable:
      - Send
          - With a mutex locked, push the message to a queue
          - Notify a condition variable
      - Receive
          - Lock the mutex
          - While the queue is empty, wait for the condition variable
          - Pop the message from the queue
          - Unlock the mutex
  • In our case
      - Each pair of adjacent threads needs two message channels
      - Each message channel may contain up to two messages
          - A general (dynamically allocating) queue is not needed
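A channel holding at most two messages might be sketched as follows. It assumes that the message type is default-constructible and movable, and that the surrounding protocol never has more than two messages pending (the sketch only asserts this, it does not block the sender). The names `channel2` and `channel_demo` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Message channel with a fixed capacity of two messages (hypothetical class).
template <typename M>
class channel2 {
public:
    void send(M msg)
    {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            assert(count_ < slots_.size());   // the protocol must guarantee this
            slots_[(head_ + count_) % slots_.size()] = std::move(msg);
            ++count_;
        }
        cv_.notify_one();
    }

    M receive()
    {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [&] { return count_ > 0; });   // re-tested on spurious wakeups
        M msg = std::move(slots_[head_]);
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return msg;
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::array<M, 2> slots_{};   // circular buffer, no dynamic allocation
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};

// Usage sketch: one buffer travels back and forth, ownership moves with it.
int channel_demo(int rounds)
{
    channel2<std::vector<int>> to_worker;
    channel2<std::vector<int>> to_main;
    std::thread worker([&] {
        for (int i = 0; i < rounds; ++i) {
            std::vector<int> buf = to_worker.receive();
            buf[0] += 1;
            to_main.send(std::move(buf));
        }
    });
    std::vector<int> buf{0};
    for (int i = 0; i < rounds; ++i) {
        to_worker.send(std::move(buf));
        buf = to_main.receive();
    }
    worker.join();
    return buf[0];
}
```

Moving the vector into the channel transfers the buffer itself, so no cell data is copied while the mutex is held.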

The assignment
  • Your job is to implement:
      - A generic structure used to hold the states of cells
          - 1-dimensional, wrapped-around = circle
      - A generic method which executes a given stencil function
          - For a given number of generations
          - In parallel, using only the C++17 standard library
  • Submit into recodex as "stencil1d.hpp"
  • The testing framework will apply two stencil functions:
      - Rule 110
          - State of a cell = bool
      - Lemmings
          - State of a cell is a structure containing some small numbers
  • The framework will test correctness by printing some states
      - Using recodex to compare against the desired output
      - Beware: Recodex will also apply some not-so-infinite time limits
  • The framework and the desired outputs are available outside Recodex
      - //www.ksi.mff.cuni.cz/teaching/nprg051-web/DU/du-1920-3.framework.zip

The required interface

    template< typename ET>
    class circle {
    public:
        circle(std::size_t s);
        std::size_t size() const;
        void set(std::ptrdiff_t x, const ET& v);
        REF get(std::ptrdiff_t x) const;
        template< typename SF>
        void run( SF&& sf, std::size_t g,
            std::size_t thrs = std::thread::hardware_concurrency());
    };

  • Class circle<ET>
      - ET is the type representing the status of one cell
  • The constructor allocates the space for the desired number of cells
      - also returned by size()
      - each cell initialized as ET()
  • The functions set/get serve for writing/reading individual cells
      - The cells are circularly arranged; index overflows are handled by wrapping
      - The set/get functions must support indexes (x) in the range <-size(), 2*size()-1>
      - BEWARE: the built-in operator% does not perform mathematical modulo for negative numbers
      - get may return by const reference but beware of std::vector<bool>
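The warning about operator% can be illustrated with a small helper. In C++, `-1 % 5` evaluates to `-1`, not `4`, so a negative remainder must be shifted back into range. The name `wrap_index` is hypothetical, and the sketch assumes `size` fits into `std::ptrdiff_t`.

```cpp
#include <cassert>
#include <cstddef>

// Maps a possibly negative index to the range [0, size) (hypothetical helper).
std::size_t wrap_index(std::ptrdiff_t x, std::size_t size)
{
    assert(size > 0);
    const auto n = static_cast<std::ptrdiff_t>(size);
    const std::ptrdiff_t r = x % n;   // has the sign of x, so it may be negative
    return static_cast<std::size_t>(r < 0 ? r + n : r);
}
```

For the required range <-size(), 2*size()-1>, a cheaper variant without division is also conceivable, since at most one addition or subtraction of size() is needed.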

The required interface

    template< typename SF>
    void run( SF&& sf, std::size_t g,
        std::size_t thrs = std::thread::hardware_concurrency());

  • The function run does the parallel evaluation of the stencil
  • The stencil is specified as a function/functor/lambda with the following interface:

        ET sf(ET left, ET mine, ET right);

      - The arguments may also be passed by const reference; therefore, the run function shall avoid copying the cell states when calling sf
  • The run function performs g generations, modifying the contents of the underlying class
      - Any auxiliary resources like threads or work buffers shall be freed upon return
  • The thrs argument specifies the desired number of threads, the default is as shown
      - Other parameters, like suitable values of W and G, must be determined internally
      - Beware of non-divisibility
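One way to cope with non-divisibility is to give the first `size % threads` threads one extra cell. The sketch below only computes the ranges; the name `chunk_range` is hypothetical, and how W and G are then derived from the chunk width is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open range [first, last) of cells owned by thread t (hypothetical helper).
std::pair<std::size_t, std::size_t> chunk_range(std::size_t size,
                                                std::size_t threads,
                                                std::size_t t)
{
    assert(threads > 0 && t < threads);
    const std::size_t base = size / threads;
    const std::size_t extra = size % threads;   // this many chunks get one more cell
    const std::size_t first = t * base + (t < extra ? t : extra);
    const std::size_t width = base + (t < extra ? 1 : 0);
    return {first, first + width};
}
```

Note that `std::thread::hardware_concurrency()` may return 0 when the value cannot be determined, so a caller would probably want to clamp the thread count to at least 1 before using such a helper.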

Frequent mistakes
  • Plagiarism

Frequent mistakes
  • Threads simultaneously writing into vector<T> - prohibited for T=bool
  • Wrong usage of condition_variable
      - Non-locked or wrongly locked manipulation on the notify side
      - Spurious wakeup problem ignored
  • Too complex synchronization mechanism
      - too many mutexes
      - two mutexes locked at once (how can you prove that it cannot deadlock?)
      - too many condition variables
      - useless atomic<...> (often a result of an attempt to patch another mistake)
      - Parallel programs cannot be verified by testing or debugging
      - The only viable approach is simplicity and clarity
      - Mishandled synchronization is usually penalized by recodex results
          - Deadlocks will manifest as timeouts: CPU Time = 30 ms, Wall Time = 60 s

Frequent mistakes - condition_variable
  • Wrong at notify side

        {
            var = something;
            cv.notify_all();
        }

  • Correct at wait side

        {
            unique_lock<mutex> g(mtx);
            while (var != something)
            {
                cv.wait(g);
            }
        }

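Frequent mistakes - condition_variable
  • Wrong at notify side, correct at wait side: a possible interleaving of the two threads

        wait side:      {
        wait side:          unique_lock<mutex> g(mtx);
        wait side:          while (var != something)
        wait side:          {
        notify side:    var = something;
        notify side:    cv.notify_all();
        wait side:              cv.wait(g);
        wait side:          }
        wait side:      }

      - Notify wakes up only threads that are already waiting
      - The wait side tested var before the change and called wait() after the notification, so it may wait forever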

Frequent mistakes - condition_variable
  • Notify side must be locked (by the same mutex as the wait side)

        {
            {
                lock_guard<mutex> g(mtx);
                var = something;
            }
            cv.notify_all();
        }

  • Correct at wait side

        {
            unique_lock<mutex> g(mtx);
            while (var != something)
            {
                cv.wait(g);
            }
        }

Correct usage of condition_variable
  • Notify side must always be locked
      - By the same mutex!
      - As a consequence, atomic variables are usually useless in the presence of condition_variables

        {
            {
                lock_guard<mutex> g(mtx);
                var = something;
            }
            cv.notify_all();
        }

      - The critical section may be entered either before the wait-side thread enters its critical section...
      - ...or when the other thread is inside its wait(), which temporarily unlocks the mutex
      - notify itself does not have to be locked
  • Correct at wait side

        {
            unique_lock<mutex> g(mtx);
            while (var != something)
            {
                cv.wait(g);
            }
        }

Frequent mistakes
  • Violations of C++ best practices
      - Missing const etc.
      - Missing move/forward etc.
      - 1-bit bitfields are not guaranteed to work modulo 2
  • Violations of software-engineering best practices
      - copy-paste code
      - using lock.unlock() just before destroying the lock is disturbing for the readers
          - RAII shall be kept as simple as possible

            { unique_lock<mutex> lock(mtx); /*...*/ lock.unlock(); }

      - void * and static_cast form an unjustified C-ism

Hints
  • Improving speed
      - remove serial bottlenecks, if possible
          - Do not copy data when locked (transfer ownership instead)
      - partial specialization for T=bool
      - the use of shared_ptr is often unnecessary
      - do not repeat allocation/deallocation pairs; recycle the objects and containers instead
  • Improving readability
      - final copying back may sometimes be moved to the master thread (instead of complex protection by a mutex)
      - when implementing buffers of size 2, generalization to a circular buffer within array<..., 2> is usually more readable and faster than handling the three possible states by a system of conditions
  • Minor details
      - emplace_back(std::thread(...)) may be replaced by emplace_back(...)
      - push_back(std::thread(...)) may be replaced by emplace_back(...)
      - x = 1 - x; is usually faster than x = (x ? 0 : 1);
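The emplace_back hint can be sketched as follows. The function `squares_in_parallel` is a hypothetical demo: it starts one thread per element and lets each thread write its own element of a `std::vector<int>`. The element type is deliberately `int`, because concurrent writes to distinct elements are not allowed for `std::vector<bool>`.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Starts n threads via emplace_back(args...) and joins them (hypothetical demo).
std::vector<int> squares_in_parallel(std::size_t n)
{
    std::vector<int> out(n);            // distinct int elements may be written concurrently
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        // The arguments are forwarded to the std::thread constructor,
        // so no temporary std::thread object has to be written out.
        threads.emplace_back([&out, i] { out[i] = static_cast<int>(i * i); });
    }
    for (auto& t : threads) {
        t.join();                       // joining makes the writes visible to the caller
    }
    return out;
}
```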