Performance Tuning [From Chapter 3 of Culler, Singh, Gupta]
MAINAK, CS 622


Agenda
• Partitioning for performance
• Data access and communication
• Summary
• Goal is to understand the simple trade-offs involved in writing a parallel program while keeping an eye on parallel performance
  – Getting good performance out of a multiprocessor is difficult
  – Programmers need to be careful
  – A little carelessness may lead to extremely poor performance

Partitioning for performance
• Partitioning plays an important role in parallel performance
  – This is where you essentially determine the tasks
• A good partitioning should achieve
  – Load balance
  – Minimal communication
  – Low overhead to determine and manage task assignment (sometimes called extra work)
• A well-balanced parallel program automatically has low barrier or point-to-point synchronization time
  – Ideally all threads arrive at a barrier at the same time

Load balancing
• Achievable speedup is bounded above by
  – Sequential execution time / Maximum time on any processor
  – Thus speedup is maximized when the maximum and minimum times across all processors are close (we want to minimize the variance of parallel execution time)
  – This translates directly to load balancing
• What leads to a high variance?
  – Ultimately all processors finish at the same time
  – But some do useful work throughout this period while others may spend significant time at synchronization points
  – This may arise from a bad partitioning
  – There may be other architectural reasons for load imbalance beyond the programmer's control, e.g., network congestion or unforeseen cache conflicts (which slow down a few threads)

Load balancing
• Effect of decomposition/assignment on load balancing
  – Static partitioning is good when the nature of the computation is predictable and regular
  – Dynamic partitioning normally provides better load balance, but has more runtime overhead for task management; it may also increase communication (see the sketch below)
  – Fine-grain partitioning (the extreme is one instruction per thread) leads to more overhead, but better load balance
  – Coarse-grain partitioning (e.g., large tasks) may lead to load imbalance if the tasks are not well balanced
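
To make the static/dynamic contrast concrete, here is a minimal sketch (not from the slides) of assigning N independent tasks to P threads. The static version gives each thread a fixed contiguous block; the dynamic version hands out chunks at run time through a shared atomic counter. The names do_work, N, P, and CHUNK are illustrative assumptions.

    #include <stdatomic.h>

    #define N      4096   /* total number of tasks (assumed) */
    #define P      16     /* number of threads (assumed) */
    #define CHUNK  32     /* tasks grabbed per request in dynamic mode */

    /* Hypothetical per-task computation. */
    void do_work(int task) { (void)task; /* real work goes here */ }

    /* Static assignment: thread 'id' owns a fixed contiguous block.
       Good when every task costs about the same. */
    void static_worker(int id)
    {
        int per_thread = (N + P - 1) / P;
        int begin = id * per_thread;
        int end   = (begin + per_thread < N) ? begin + per_thread : N;
        for (int t = begin; t < end; t++)
            do_work(t);
    }

    /* Dynamic assignment: all threads pull chunks from a shared counter.
       Better balance for irregular tasks, but every grab is a shared
       (potentially contended) atomic operation. */
    atomic_int next_task = 0;

    void dynamic_worker(int id)
    {
        (void)id;
        for (;;) {
            int begin = atomic_fetch_add(&next_task, CHUNK);
            if (begin >= N)
                break;
            int end = (begin + CHUNK < N) ? begin + CHUNK : N;
            for (int t = begin; t < end; t++)
                do_work(t);
        }
    }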

Dynamic task queues
• Introduced in the last lecture
• Normally implemented as part of the parallel program
• Two possible designs
  – Centralized task queue: a single queue of tasks; may lead to heavy contention because insertions and deletions to/from the queue must be critical sections
  – Distributed task queues: one queue per processor
• Issue with distributed task queues
  – When the queue of a particular processor is empty, what does it do? Task stealing

Task stealing
• A processor may choose to steal tasks from another processor's queue if its own queue is empty
  – How many tasks to steal? Whom to steal from?
  – The biggest question: how to detect termination? Really a distributed consensus!
  – Task stealing, in general, may increase overhead and communication, but a smart design may lead to excellent load balance (normally hard to design efficiently); a simple sketch follows below
  – This is a form of a more general technique called Receiver Initiated Diffusion (RID), where the receiver of the task initiates the task transfer
  – In Sender Initiated Diffusion (SID), a processor may choose to insert into another processor's queue if its own task queue is full above a threshold
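
Below is a minimal sketch of distributed task queues with stealing, assuming mutex-protected per-processor queues; the names Queue, pop_local, and steal_from_others are hypothetical. A production work stealer would use lock-free deques, steal more than one task at a time, and run a real termination-detection protocol.

    #include <pthread.h>
    #include <stdbool.h>

    #define P          16     /* processors / queues (assumed) */
    #define MAX_TASKS  1024

    typedef struct {
        pthread_mutex_t lock;   /* initialize each lock with pthread_mutex_init */
        int tasks[MAX_TASKS];
        int count;              /* number of tasks currently queued */
    } Queue;

    Queue queues[P];

    /* Pop from this processor's own queue; returns false if empty. */
    bool pop_local(int me, int *task)
    {
        bool ok = false;
        pthread_mutex_lock(&queues[me].lock);
        if (queues[me].count > 0) {
            *task = queues[me].tasks[--queues[me].count];
            ok = true;
        }
        pthread_mutex_unlock(&queues[me].lock);
        return ok;
    }

    /* Receiver-initiated diffusion: an idle processor scans the other
       queues and takes one task from the first non-empty queue it finds. */
    bool steal_from_others(int me, int *task)
    {
        for (int v = 0; v < P; v++) {
            if (v == me) continue;
            pthread_mutex_lock(&queues[v].lock);
            if (queues[v].count > 0) {
                *task = queues[v].tasks[--queues[v].count];
                pthread_mutex_unlock(&queues[v].lock);
                return true;
            }
            pthread_mutex_unlock(&queues[v].lock);
        }
        return false;   /* nothing found; a real scheme now runs termination detection */
    }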

Architect's job
• Normally load balancing is a responsibility of the programmer
  – However, an architecture may provide efficient primitives to implement task queues and task stealing
  – For example, the task queue may be allocated in a special shared memory segment, accesses to which may be optimized by special hardware in the memory controller
  – But this may expose some of the architectural features to the programmer
  – There are multiprocessors that provide efficient implementations for certain synchronization primitives; this may improve load balance
  – Sophisticated hardware tricks are possible: dynamic load monitoring and favoring slow threads dynamically

Partitioning and communication
• Need to reduce inherent communication
  – This is the part of communication determined by the assignment of tasks
  – There may be other communication traffic as well (more later)
• Goal is to assign tasks such that the data they access are mostly local to a process
  – Ideally we do not want any communication
  – But in life sometimes you need to talk to people to get some work done!

Domain decomposition
• Normally applications show a local bias in data usage
  – Communication is short-range, e.g., nearest neighbor
  – Even if it is long-range, it falls off with distance
  – View the dataset of an application as the domain of the problem, e.g., the 2-D grid in the equation solver
  – If you consider a point in this domain, in most applications it turns out that this point depends on points that are close by (see the stencil sketch below)
  – Partitioning can exploit this property by assigning contiguous pieces of data to each process
  – Exact shape of the decomposed domain depends on the application and load balancing requirements
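
As an illustration, here is a minimal sketch of the kind of nearest-neighbor update the equation solver performs on its assigned sub-block. The array name A, the block bounds, and the 0.2 weighting are assumptions in the spirit of the Culler-Singh solver, not copied from the slides.

    /* One sweep over the sub-block [row_lo..row_hi] x [col_lo..col_hi]
       assigned to this process.  Each point reads only its four nearest
       neighbors, so almost all accesses stay inside the local partition;
       only the boundary rows/columns touch data owned by neighbors. */
    void sweep_block(double **A, int row_lo, int row_hi, int col_lo, int col_hi)
    {
        for (int i = row_lo; i <= row_hi; i++)
            for (int j = col_lo; j <= col_hi; j++)
                A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j]
                                         + A[i][j-1] + A[i][j+1]);
    }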

Comm-to-comp ratio
• Surely, there could be many different domain decompositions for a particular problem
  – For the grid solver we may have a square block decomposition, block row decomposition, or cyclic row decomposition
  – How to determine which one is good? Communication-to-computation ratio
• Assume P processors and an N x N grid for the grid solver
  [Figure: square block decomposition for P = 16, processors P0 through P15]
  – Size of each block: N/√P by N/√P
  – Communication (perimeter): 4N/√P
  – Computation (area): N²/P
  – Comm-to-comp ratio = 4√P/N

Comm-to-comp ratio
• For block row decomposition
  – Each strip has N/P rows
  – Communication (boundary rows): 2N
  – Computation (area): N²/P (same as square block)
  – Comm-to-comp ratio: 2P/N
• For cyclic row decomposition
  – Each processor gets N/P isolated rows
  – Communication: 2N²/P
  – Computation: N²/P
  – Comm-to-comp ratio: 2
• Normally N is much larger than P
  – Asymptotically, square block yields the lowest comm-to-comp ratio
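
As an illustrative data point (numbers assumed, not from the slides), take N = 1024 and P = 16: square block gives 4√16/1024 = 16/1024 ≈ 0.016, block row gives 2·16/1024 = 32/1024 ≈ 0.031, and cyclic row stays at 2 regardless of N. The ordering square block < block row << cyclic row is exactly the asymptotic conclusion above.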

Comm-to-comp ratio
• Idea is to measure the volume of inherent communication per unit of computation
  – In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp ratio
  – But it depends on the application structure, i.e., picking the lowest comm-to-comp ratio may have other problems
  – Normally this ratio gives a rough estimate of the average communication bandwidth requirement of the application, i.e., how frequent communication is
  – But it does not tell you the nature of the communication, i.e., bursty or uniform
  – For the grid solver, communication happens only at the start of each iteration; it is not uniformly distributed over the computation
  – Thus the worst-case bandwidth requirement may exceed what the average comm-to-comp ratio suggests

Extra work
• Extra work in a parallel version of a sequential program may result from
  – Decomposition
  – Assignment techniques
  – Management of the task pool, etc.
• Speedup is bounded above by
  Sequential work / Max (Useful work + Synchronization + Communication cost + Extra work)
  where the Max is taken over all processors
• But this is still incomplete
  – We have only considered communication cost from the viewpoint of the algorithm and ignored the architecture completely

Data access and communication
• The memory hierarchy (caches and main memory) plays a significant role in determining communication cost
  – It may easily dominate the inherent communication of the algorithm
• For a uniprocessor, the execution time of a program is given by useful work time + data access time
  – Useful work time is normally called the busy time or busy cycles
  – Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality

Data access
• In multiprocessors
  – Every processor wants to see the memory interface as its own local cache and the main memory
  – In reality it is much more complicated
  – If the system has a centralized memory (e.g., SMPs), there are still the caches of other processors; if the memory is distributed, then some part of it is local and some is remote
  – For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit
  – View a multiprocessor as an extended memory hierarchy, where the extension includes the caches of other processors, remote memory modules, and the network topology

Artifactual communication
• Communication caused by artifacts of the extended memory hierarchy
  – Data accesses not satisfied in the cache or local memory cause communication
  – Inherent communication is caused by data transfers determined by the program
  – Artifactual communication is caused by poor allocation of data across distributed memories, unnecessary data in a transfer, unnecessary transfers due to system-dependent transfer granularity, redundant communication of data, and finite replication capacity (in cache or memory)
• Inherent communication assumes infinite capacity and perfect knowledge of what should be transferred

Capacity problem
• The most probable reason for artifactual communication
  – Due to the finite capacity of the cache, local memory, or remote memory
  – May view a multiprocessor as a three-level memory hierarchy for this purpose: local cache, local memory, remote memory
  – Communication due to cold or compulsory misses and inherent communication are independent of capacity
  – Capacity and conflict misses generate communication resulting from finite capacity
  – The generated traffic may be local or remote depending on the allocation of pages
  – General technique: exploit spatial and temporal locality to use the cache properly

Temporal locality
• Maximize reuse of data
  – Schedule tasks that access the same data in close succession
  – Many linear algebra kernels use blocking of matrices to improve temporal (and spatial) locality
  – Example: the transpose phase in the Fast Fourier Transform (FFT); to improve locality, the algorithm carries out a blocked transpose, i.e., it transposes a block of data at a time
  [Figure: block transpose]
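
A minimal sketch of a blocked transpose (the tile size BT and the square, dense matrix layout are assumptions): each BT x BT tile is read and written while it is still resident in the cache, instead of striding through an entire row or column at a time.

    #define BT 32   /* tile edge, chosen so two tiles fit in cache (assumed) */

    /* Out-of-place transpose of an n x n matrix, one tile at a time. */
    void blocked_transpose(int n, const double *src, double *dst)
    {
        for (int ii = 0; ii < n; ii += BT)
            for (int jj = 0; jj < n; jj += BT)
                /* transpose the tile whose top-left corner is (ii, jj) */
                for (int i = ii; i < ii + BT && i < n; i++)
                    for (int j = jj; j < jj + BT && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }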

Spatial locality
• Consider a square block decomposition of the grid solver and a C-like row-major layout, i.e., A[i][j] and A[i][j+1] occupy contiguous memory locations
  [Figure: memory allocation with pages and cache lines straddling partition boundaries]
  – A page that straddles a partition boundary is local to one processor while remote to the others; the same applies to straddling cache lines
  – Ideally, we want all pages within a partition to be local to a single processor
  – The standard trick is to convert the 2-D array to 4-D

2-D to 4-D conversion
• Essentially you need to change the way memory is allocated
  – The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
  – The first two dimensions of the new 4-D matrix are the block row and column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)
  – The next two dimensions hold the data elements within that partition
  – Thus the 4-D array may be declared as float B[√P][√P][N/√P][N/√P]
  – The element B[3][2][5][10] corresponds to the element in the 10th column, 5th row of the partition of P14
  – Now all elements within a partition have contiguous addresses
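
A minimal sketch of the idea for concrete (illustrative) sizes, N = 1024 and P = 16, so √P = 4 and each block is 256 x 256. With this declaration the whole block owned by one processor occupies one contiguous address range, so it maps onto whole pages that can be allocated locally.

    #define N    1024
    #define RP   4            /* RP = sqrt(P) for P = 16 processors */
    #define BLK  (N / RP)     /* 256: edge of the block owned by one processor */

    /* B[br][bc] is the BLK x BLK block owned by processor br*RP + bc. */
    static float B[RP][RP][BLK][BLK];

    /* Translate a global index (i, j) of the logical N x N matrix into the
       4-D layout: first pick the block, then the offset inside the block. */
    float get(int i, int j)
    {
        return B[i / BLK][j / BLK][i % BLK][j % BLK];
    }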

Transfer granularity
• How much data do you transfer in one communication?
  – For message passing it is explicit in the program
  – For shared memory this is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy)
• In shared memory you have to be careful
  – Since the minimum transfer size is a cache line, you may end up transferring extra data, e.g., in the grid solver the elements of the left and right neighbors for a square block decomposition (you need only one element, but must transfer the whole cache line): no good solution

Worse: false sharing
• If the algorithm is designed so poorly that
  – Two processors write to two different words within a cache line at the same time
  – The cache line keeps moving between the two processors
  – The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within one cache line: not true sharing, but false sharing
  – For shared memory programs, false sharing can easily degrade performance by a lot
  – Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance)
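
A minimal sketch of the padding trick for per-processor counters (the 64-byte line size is an assumption; use the actual line size of the target machine).

    #define CACHE_LINE 64   /* assumed cache line size in bytes */
    #define P 16

    /* Without padding, several counters share one cache line and every
       update by one thread invalidates the line in the other threads'
       caches.  Padding each counter out to a full line removes the false
       sharing at the cost of some wasted memory. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    struct padded_counter counters[P];   /* counters[i] is updated only by thread i */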

Communication cost
• Given the total volume of communication (in bytes, say), the goal is to reduce the end-to-end latency
• Simple model: T = f * (o + L + (n/m)/B + t_c - overlap), where
  – f = frequency of messages
  – o = overhead per message (at receiver and sender)
  – L = network delay per message (really the router delay)
  – n = total volume of communication in bytes
  – m = total number of messages
  – B = node-to-network bandwidth
  – t_c = contention-induced average latency per message
  – overlap = how much of the communication time is overlapped with useful computation
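
A minimal sketch of the model as code, useful for back-of-the-envelope comparisons; the parameter values themselves are machine- and program-specific.

    /* End-to-end communication cost according to the simple model above.
       Times are in seconds, n in bytes, B in bytes per second. */
    double comm_cost(double f, double o, double L, double n, double m,
                     double B, double t_c, double overlap)
    {
        double per_message = o + L + (n / m) / B + t_c - overlap;
        return f * per_message;
    }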

Communication cost
• The goal is to reduce T
  – Reduce o by communicating less: restructure the algorithm to reduce m, i.e., communicate larger messages (easy for message passing, but needs extra support in the memory controller for shared memory, e.g., block transfer)
  – Reduce L = average number of hops * time per hop
  – The number of hops can be reduced by mapping the algorithm onto the topology properly, e.g., nearest-neighbor communication is well suited to a ring (just left/right) or a mesh (grid solver example); however, L is not very important because today's routers are really fast (routing delay is ~10 ns); o and t_c are the dominant parts of T
  – Reduce t_c by not creating hot-spots in the system: restructure the algorithm to make sure a particular node does not get flooded with messages; distribute them uniformly

Contention
• It is very easy to ignore contention effects when designing algorithms
  – They can severely degrade performance by creating hot-spots
• Location hot-spot
  – Consider accumulating into a global variable; the accumulation takes place on a single node, i.e., all nodes access the variable allocated on that particular node whenever they try to increment it; the communication assist (CA) on this node becomes a bottleneck
  [Figure: flat accumulation on one node vs. scalable tree accumulation]
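
A minimal sketch of the tree idea in shared memory (P threads, P a power of two; the partial array and barrier names are assumptions): instead of P-1 updates all hitting one location, partial sums are combined pairwise in log2(P) rounds, so no single location or node is flooded with accesses.

    #include <pthread.h>

    #define P 16                        /* number of threads, a power of two */

    double partial[P];                  /* partial[i] holds thread i's local sum */
    pthread_barrier_t round_barrier;    /* initialize once with pthread_barrier_init(..., P) */

    /* Tree accumulation: after log2(P) rounds, partial[0] holds the total.
       In round 'stride', thread i adds partial[i + stride] into partial[i]
       only if i is a multiple of 2*stride, so different threads always
       touch different locations. */
    void tree_accumulate(int id)
    {
        for (int stride = 1; stride < P; stride *= 2) {
            pthread_barrier_wait(&round_barrier);   /* finish the previous round first */
            if (id % (2 * stride) == 0)
                partial[id] += partial[id + stride];
        }
    }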

Hot-spots
• Avoid a location hot-spot either by staggering accesses to the same location or by designing the algorithm to exploit a tree-structured communication
• Module hot-spot
  – Normally happens when a particular node saturates while handling too many messages (not necessarily to the same memory location) within a short amount of time
  – The normal solution again is to design the algorithm in such a way that these messages are staggered over time
• Rule of thumb: design the communication pattern so that it is not bursty; we want to distribute it uniformly over time

Overlap
• Increase overlap between communication and computation
  – Not much to do at the algorithm level unless the programming model and/or OS provide primitives to carry out prefetching, block data transfer, non-blocking receive, etc.
  – Normally, these techniques increase the bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit the overlap)
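
As one concrete illustration (an MPI-style message-passing sketch, not from the slides): non-blocking transfers are posted first, computation that does not need the remote data proceeds, and the wait happens only when the data are actually touched, so part of the communication latency is hidden behind useful work. The helpers compute_interior and compute_boundary are hypothetical.

    #include <mpi.h>

    void compute_interior(void);                 /* hypothetical: needs no remote data */
    void compute_boundary(double *row, int n);   /* hypothetical: uses the received halo */

    /* Hypothetical halo exchange step with a single neighbor. */
    void exchange_and_compute(double *send_row, double *recv_row, int n,
                              int neighbor, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Post the transfers, but do not wait yet. */
        MPI_Irecv(recv_row, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(send_row, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        compute_interior();                      /* overlapped with the messages in flight */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(recv_row, n);           /* only now is the halo row needed */
    }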

Summary
• Comparison of sequential and parallel execution
  – Sequential execution time = busy useful time + local data access time
  – Parallel execution time = busy useful time + busy overhead (extra work) + local data access time + remote data access time + synchronization time
  – Busy useful time in parallel execution is ideally equal to sequential busy useful time / number of processors
  – Local data access time in parallel execution is also smaller than in sequential execution because ideally each processor accesses less than 1/P-th of the data locally (some of the data now become remote)

Summary
• Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time
  – The goal of a good parallel program is to minimize these three terms
  – The goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms