Scalability Analysis: Performance and Scalability
Prof. Ioana Banicescu
Performance and Scalability of Parallel Systems
• Evaluation
– Sequential: run time Ts = T(input size).
– Parallel: run time Tp (from the start until the last PE finishes), Tp = T(input size, p, architecture); it cannot be evaluated in isolation from the parallel architecture.
• Parallel systems: parallel algorithm + parallel architecture combinations.
• Metrics: evaluate the performance of parallel systems.
• Scalability: the ability of a parallel algorithm to achieve performance gains proportional to the number of PEs.
Performance Metrics
1. Run-time: Ts, Tp.
2. Speedup: how much performance is gained by running the application on p identical processors: S = Ts/Tp, where Ts is the run time of the fastest sequential algorithm for solving the same problem.
– If the fastest sequential algorithm is not known yet (only a lower bound is known), or is known but its run-time constants are so large that it is impractical to implement, then take the fastest known sequential algorithm that can be practically implemented.
3. Speedup is a relative metric.
Figure 4.1: Computing the sum of 16 numbers on a 16-processor hypercube. (a) Initial data distribution and the first communication step; (b) second communication step; (c) third communication step; (d) fourth communication step; (e) accumulation of the sum at processor 0 after the final communication.
Performance Metrics
• Usually S <= p; if S > p, the speedup is superlinear.
• Example: adding n numbers on n processors (hypercube, n = p = 2^k):
– Ts = Θ(n)
– Tp = Θ(log n)
– S = Θ(n / log n)
• Efficiency: a measure of how effectively the problem is solved on p processors; it measures the fraction of time for which a processor is usefully employed.
– E = S/p, E in [0, 1]; if p = n, E = Θ(1 / log n).
• Cost:
– Cseq-fast = Ts
– Cpar = p·Tp
– If Cpar = Θ(Cseq-fast), the parallel system is cost-optimal.
Performance Metrics
• p = n, fine granularity: Cseq-fast = Θ(n), Cpar = Θ(n log n), E = Θ(1 / log n); not cost-optimal.
• p < n, coarse granularity: scaling down.
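A minimal numeric sketch (not from the slides; it assumes one time unit per addition and one per nearest-neighbor communication, as in the adding-n-numbers model used later in the deck) showing how speedup, efficiency, and cost are computed, and why the fine-grained p = n version is not cost-optimal:

    import math

    def metrics(t_seq, t_par, p):
        """Return (speedup, efficiency, parallel cost) for given run times."""
        s = t_seq / t_par          # S = Ts / Tp
        e = s / p                  # E = S / p
        cost = p * t_par           # Cpar = p * Tp
        return s, e, cost

    # Toy model of adding n numbers on p = n hypercube processors:
    # Ts ~ n - 1 additions; Tp ~ 2*log2(n) (one addition + one communication per level).
    n = 1024
    s, e, cost = metrics(n - 1, 2 * math.log2(n), n)
    print(f"S = {s:.1f}, E = {e:.3f}, cost = {cost:.0f}, Ts = {n - 1}")
    # cost ~ n*log(n) grows faster than Ts = Theta(n), so this version is not cost-optimal.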
Effects of Granularity on Cost-Optimality
• Assume an algorithm designed for n virtual PEs.
• If only p physical PEs are available, each physical PE simulates n/p virtual PEs.
• The computation at each PE increases by a factor of n/p.
• Note: even if p < n, this does not necessarily yield a cost-optimal algorithm.
Figure 4.2: Four processors simulating 16 processors. (a) Simulation of the first communication step; (b) simulation of the second communication step; (c) simulation of the third step in two substeps; (d) simulation of the fourth step; (e) final result.
Adding n Numbers on p Processors, Hypercube (p < n)
• n = 2^k, p = 2^m; example: n = 16, p = 4.
• The first log p steps of the original algorithm involve both computation and communication; they are simulated in Θ((n/p) log p) substeps (the first 8 substeps in the example).
• The remaining steps involve computation only: Θ(n/p) substeps (the last 4 substeps in the example).
• Parallel execution time: Tpar = Θ((n/p) log p).
• Cpar = p·Θ((n/p) log p) = Θ(n log p).
• Cseq-fast = Θ(n).
• As p increases asymptotically, Cpar grows faster than Cseq-fast: not cost-optimal.
• A cost-optimal algorithm is given on the following slide.
Figure 4.3: A cost-optimal way of computing the sum of 16 numbers on a four-processor hypercube. (a)-(d) show the local additions followed by the combining steps.
A Cost-Optimal Algorithm
• Local computation: Θ(n/p).
• Combining the p partial sums (computation + communication): Θ(log p).
• Total parallel time: Θ(n/p + log p), so the cost is Θ(n + p log p); therefore we need n > p log p.
• Then Tpar = Θ(n/p).
• Cpar = p·Tpar = Θ(n).
• Cseq-fast = Tseq-fast = Θ(n): cost-optimal.
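A runnable single-process sketch (an illustration, not the slides' hypercube implementation; the function name and structure are assumptions) of this cost-optimal scheme: sum each of the p blocks locally, then combine the p partial sums in log p pairwise steps, as in Figure 4.3:

    import random

    def cost_optimal_sum(values, p):
        """Local sums of n/p numbers, then log2(p) pairwise combining steps.

        Simulated sequentially; partial[i] plays the role of processor i.
        Assumes p is a power of two and divides n = len(values).
        """
        n = len(values)
        block = n // p
        # Theta(n/p) local additions on each "processor"
        partial = [sum(values[i * block:(i + 1) * block]) for i in range(p)]
        d = 1
        while d < p:                      # log2(p) combining steps
            for i in range(0, p, 2 * d):  # processor i receives from processor i + d
                partial[i] += partial[i + d]
            d *= 2
        return partial[0]

    vals = [random.randint(0, 9) for _ in range(16)]
    assert cost_optimal_sum(vals, 4) == sum(vals)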
A Cost-Optimal Algorithm
• If the algorithm is COST-OPTIMAL for p = n:
– With p physical PEs, each PE simulates n/p virtual PEs.
– If the overall communication does not grow by more than a factor of n/p (proper mapping), then computation grows to (n/p)·Tcomp, communication grows to (n/p)·Tcomm, and the total parallel run-time grows by at most a factor of n/p: T(n/p) = (n/p)·Tp.
– For p = n: Cpar = p·Tp.
– For p < n: Cpar = p·T(n/p) = p·(n/p)·Tp = n·Tp = Cpar(n = p).
– The new algorithm using n/p virtual PEs per processor is therefore cost-optimal for p < n.
• If the algorithm is NOT cost-optimal for p = n, then even if we increase the granularity, the new algorithm using p < n processors may still not be cost-optimal.
A Cost-Optimal Algorithm
• Example: adding n numbers on p processors, HYPERCUBE.
– n = 2^k (n = 16); p = 2^m (p = 4); therefore p < n.
– Each virtual PE #i is simulated by physical PE #(i mod p).
– The first log p (= 2) of the log n (= 4) steps in the original algorithm are simulated in (n/p)·log p (i.e., 16/4 × 2 = 8) steps on p = 4 processors.
– The remaining steps do not require communication (the PEs that continue to communicate in the original algorithm are simulated by the same physical PE here).
The Role of Mapping Computations onto Processors in Parallel Algorithm Design
• For a cost-optimal parallel algorithm, E = Θ(1).
• If a parallel algorithm on p = n processors is not cost-optimal (or is less efficient), it may still be possible to find a cost-optimal algorithm for p < n.
• Even if you find a cost-optimal algorithm for p < n, that does not mean you have found the algorithm with the best parallel run-time.
• Performance (parallel run-time) depends on:
– the number of processors;
– data mapping and assignment.
The Role of Mapping…
• The parallel run-time for the same problem (same problem size) depends on the mapping of virtual PEs onto physical PEs.
• Performance critically depends on the data mapping onto a coarse-grain parallel computer.
– Example: multiplication of an n×n matrix by a vector on a p-processor hypercube: mapping by p square blocks versus p slices of n/p rows each.
• Parallel FFT on a hypercube with cut-through routing:
– W computation steps; pmax = W.
– For p = pmax, each PE executes one step of the algorithm.
– For p < W, each PE executes a larger number of steps.
• The choice of the best algorithm to perform the local computations depends on the number of PEs (i.e., on how finely the problem is fragmented across PEs).
The Role of Mapping (continued)
• An optimal algorithm for solving a problem on an arbitrary number of PEs cannot be obtained from the most fine-grained parallel algorithm alone.
• The analysis of a fine-grained parallel algorithm may not reveal important facts that an analysis of the coarse-grained parallel algorithm shows, such as:
1. If the message is short (one word only), the transfer time between 2 PEs is the same for store-and-forward and cut-through routing.
2. If the message is long, cut-through routing is faster than store-and-forward.
3. Performance on a hypercube and on a mesh is identical with cut-through routing.
4. Performance on a mesh with store-and-forward routing is worse.
• Design steps:
1. Devise the parallel algorithm for the finest grain.
2. Map the data onto the PEs.
3. Describe the algorithm's implementation on an arbitrary number of PEs.
4. Variables: problem size, number of PEs.
Scalability
• S <= p; S = S(p), E = E(p).
• Example: adding n numbers on a p-processor hypercube.
– Assume 1 unit of time is spent adding 2 numbers and 1 unit of time is spent communicating with a connected PE.
– Adding n/p numbers locally takes n/p − 1 time.
– The p partial sums are added in log p steps (for each step: 1 addition + 1 communication), i.e., 2 log p time.
• Then:
– Tp = n/p − 1 + 2 log p ≈ n/p + 2 log p (ignoring the −1).
– Ts = n − 1 ≈ n (as n grows large).
• S = n / [n/p + 2 log p] = np / [n + 2p log p], i.e., S = S(n, p).
• E = S/p = n / [n + 2p log p], i.e., E = E(n, p).
• S and E can be computed for any pair of n and p.
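A short sketch (following the slides' model of one time unit per addition and per nearest-neighbor communication) that evaluates S(n, p) and E(n, p); it reproduces the efficiency table shown two slides later:

    import math

    def t_par(n, p):
        """Tp = n/p + 2*log2(p): local additions plus log p (add + communicate) steps."""
        return n / p + 2 * math.log2(p)

    def speedup(n, p):
        return n / t_par(n, p)          # S = Ts / Tp with Ts ~ n

    def efficiency(n, p):
        return speedup(n, p) / p        # E = n / (n + 2*p*log2(p))

    for n in (64, 192, 320, 512):
        print(n, [round(efficiency(n, p), 2) for p in (1, 4, 8, 16, 32)])
    # Each row matches the efficiency table later in the deck; along the diagonal
    # n = 8*p*log2(p) the efficiency stays at 0.80.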
Scalability
[Plot: speedup S versus the number of processors p (up to about 16) for n = 64, 512, and 1024, compared against the theoretical linear speedup.]
• As p increases, to keep increasing S we also need to increase n; otherwise S saturates and E decreases.
• For larger problem sizes, S and E are higher, but they still drop as p increases; the goal is to keep E constant.
• Scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors.
Efficiency of Adding n Numbers on a p-Processor Hypercube
• For the cost-optimal algorithm:
– S = np / [n + 2p log p]
– E = n / [n + 2p log p], i.e., E = E(n, p)

    n      p=1    p=4    p=8    p=16   p=32
    64     1.0    0.80   0.57   0.33   0.17
    192    1.0    0.92   0.80   0.60   0.38
    320    1.0    0.95   0.87   0.71   0.50
    512    1.0    0.97   0.91   0.80   0.62

• E stays constant at 0.80 along the diagonal where n = 8p log p:
– for n = 64 and p = 4, n = 8p log p;
– for n = 192 and p = 8, n = 8p log p;
– for n = 512 and p = 16, n = 8p log p.
Conclusions
• For adding n numbers on a p-processor hypercube with the cost-optimal algorithm:
– The algorithm is cost-optimal if n = Ω(p log p).
– The algorithm is scalable iff n increases proportionally to Θ(p log p) as p is increased.
• Problem size:
– For matrix multiplication: input size n gives Θ(n^3) work; n' = 2n gives Θ(n'^3) = Θ(8n^3).
– For matrix addition: input size n gives Θ(n^2) work; n' = 2n gives Θ(n'^2) = Θ(4n^2).
– So doubling the input size does not double the work: problem size should be measured by the amount of computation, not by the input size.
Conclusions (contd.)
• Doubling the size of the problem means performing twice the amount of computation.
• Computation step: assume it takes 1 unit of time.
– ts: message start-up time;
– tw: per-word transfer time;
– th: per-hop time;
– all can be normalized with respect to the unit computation time.
• W = Ts for the fastest sequential algorithm on a sequential computer.
Overhead Function
• Ideally E = 1 and S = p.
• In reality E < 1 and S < p, due to overhead (interprocessor communication, idling, extra computation, etc.).
• This overhead is captured by the overhead function.
Overhead Function
• The time collectively spent by all processors in addition to that required by the fastest known sequential algorithm to solve the same problem on a single PE.
– To = To(W, p)
– To = p·Tp − W
• For the cost-optimal algorithm of adding n numbers on a p-processor hypercube:
– Ts = W = n
– Tp = n/p + 2 log p
– To = p·[n/p + 2 log p] − n = 2p log p
– [To = 2p log p]
Isoefficiency Function
• Tp = Tp(W, To, p) and To = p·Tp − W, so Tp = [W + To(W, p)] / p.
• S = Ts/Tp = W/Tp = Wp / [W + To(W, p)].
• E = S/p = W / [W + To(W, p)] = 1 / [1 + To(W, p)/W].
• If W is kept constant and p is increased, then E decreases.
• If p is kept constant and W is increased, then E increases for scalable parallel systems.
• We need E to remain constant for scalable, effective systems.
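A small numerical check (illustrative; unit add/communication costs as before) that the overhead-based expression E = 1/(1 + To/W) agrees with the direct E = S/p for the adding example:

    import math

    def efficiency_direct(n, p):
        """E = S/p with S = Ts/Tp, Ts ~ n, Tp = n/p + 2*log2(p)."""
        return (n / (n / p + 2 * math.log2(p))) / p

    def efficiency_via_overhead(n, p):
        """E = 1 / (1 + To/W) with To = 2*p*log2(p) for this algorithm."""
        to = 2 * p * math.log2(p)
        return 1.0 / (1.0 + to / n)

    print(round(efficiency_direct(512, 16), 3),
          round(efficiency_via_overhead(512, 16), 3))   # both 0.8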
Isoefficiency Function
• Example 1: if p increases and keeping E constant requires W to grow exponentially with p, the problem is poorly scalable, since we must increase the problem size enormously to obtain good speedups.
• Example 2: if p increases and keeping E constant requires W to grow only linearly with p, the system is highly scalable, since the speedup is then proportional to the number of processors.
Isoefficiency Function
• E = 1 / [1 + To(W, p)/W].
• W = [E / (1 − E)] · To(W, p).
• E is constant iff E/(1 − E) is constant.
• Writing E/(1 − E) = K, we have W = K·To(W, p).
• This function dictates the growth rate of W required to keep E constant as p increases.
• The isoefficiency function does not exist for unscalable parallel systems, because in such systems E cannot be kept constant as p increases, no matter how much or how fast W increases.
Isoefficiency Function
• Overhead function (adding n numbers on a p-processor hypercube, cost-optimal version):
– Ts = n
– Tp = n/p + 2 log p
– To = p·Tp − Ts = p·(n/p + 2 log p) − n = 2p log p
• Isoefficiency function: W = K·To(W, p); with To = 2p log p, W = 2Kp log p.
• Note: here To = To(p) only.
• The asymptotic isoefficiency function is Θ(p log p).
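A brief sketch (illustrative; unit costs as before) that uses W = K·To to grow the problem size with p while holding E fixed, with K = E/(1 − E) as on the previous slides:

    import math

    def iso_w(p, e=0.8):
        """Problem size needed to hold efficiency e: W = K * 2*p*log2(p), K = e/(1-e)."""
        k = e / (1.0 - e)
        return 2 * k * p * math.log2(p)

    def efficiency(n, p):
        return n / (n + 2 * p * math.log2(p))

    for p in (4, 8, 16, 32):
        w = iso_w(p)                                    # grows as Theta(p log p)
        print(p, round(w), round(efficiency(w, p), 2))  # efficiency stays at 0.80
    # With e = 0.8, K = 4 and W = 8*p*log2(p), the diagonal of the efficiency table.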
Isoefficiency Function
• Meaning:
– If the number of PEs increases from p to p' > p, the problem size has to grow by a factor of (p' log p')/(p log p) to keep the same efficiency as with p processors.
– Equivalently, if the number of PEs increases by a factor of p'/p, the problem size has to grow by a factor of (p' log p')/(p log p) to increase the speedup by a factor of p'/p.
– Here the communication overhead is an exclusive function of p: To = To(p).
– In general, To = To(W, p) and W = K·To(W, p), which may involve many terms and can be hard to solve for W in terms of p.
– Keeping E constant requires the ratio To/W to stay fixed: as p increases, W must also increase to obtain a non-decreasing efficiency (E' >= E).
– To should not grow faster than W; none of To's terms should grow faster than W.
Isoefficiency Function
• If To has multiple terms, we balance W against each term of To and compute the respective isoefficiency function for each individual term.
• The component of To that requires the problem size to grow at the highest rate with respect to p determines the overall asymptotic isoefficiency function of the parallel system.
• Example:
– To = p^(3/2) + p^(3/4)·W^(3/4).
– Two balances: W = K·p^(3/2) and W = K·p^(3/4)·W^(3/4).
– From the first term: W = Θ(p^(3/2)).
– From the second term: W^(1/4) = K·p^(3/4), so W = K^4·p^3, i.e., W = Θ(p^3).
– Take the higher of the two rates: to ensure that E does not decrease, the problem size needs to grow as Θ(p^3) (the overall asymptotic isoefficiency).
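A numerical check (illustrative, not from the slides) of the balancing argument: solving W = K·To(W, p) directly by fixed-point iteration for To = p^(3/2) + p^(3/4)·W^(3/4) shows W growing with the dominant term, Θ(p^3):

    def iso_fixed_point(p, k=1.0, iters=100):
        """Solve W = K * (p**1.5 + p**0.75 * W**0.75) by fixed-point iteration."""
        w = 1.0
        for _ in range(iters):
            w = k * (p ** 1.5 + p ** 0.75 * w ** 0.75)
        return w

    for p in (8, 16, 32, 64):
        w = iso_fixed_point(p)
        print(p, round(w), round(w / p ** 3, 2))
    # The ratio W / p^3 settles toward a constant (K^4 = 1 here), confirming W = Theta(p^3).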
Isoefficiency Functions
• Capture the characteristics of the parallel system (parallel algorithm + parallel architecture).
• Predict the impact on performance as the number of PEs increases.
• Characterize the amount of parallelism in a parallel algorithm.
• Allow study of the behavior of the parallel system under hardware changes:
– PE speed;
– communication channels.
Cost-Optimality and Isoefficiency
• Cost-optimality: Ts / (p·Tp) = constant, i.e., p·Tp = Θ(W).
• Since To = p·Tp − W and Ts = W, this means W + To(W, p) = Θ(W).
• Hence To(W, p) = O(W), and W = Ω(To(W, p)).
• A parallel system is cost-optimal iff its overhead function does not grow asymptotically faster than the problem size.
Cost-Optimality and Isoefficiency Function Relationship
• Example: adding n numbers on a p-processor hypercube, non-cost-optimal version:
– W = Θ(n)
– Tp = Θ((n/p) log p)
– To = p·Tp − W = Θ(n log p)
– The isoefficiency equation W = K·Θ(n log p) = K·Θ(W log p) cannot hold for all K (and E): it would require K log p to stay constant as p grows.
• This algorithm:
– is not cost-optimal;
– does not have an isoefficiency function;
– is not scalable.
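A tiny illustrative sketch (assumed unit costs) showing why no problem-size growth can rescue the non-cost-optimal version: its efficiency depends only on p, so it stays pinned at 1/log p no matter how large n gets:

    import math

    def efficiency_simulated(n, p):
        """Non-cost-optimal version: Tp ~ (n/p)*log2(p), so E = W/(p*Tp) = 1/log2(p)."""
        tp = (n / p) * math.log2(p)
        return n / (p * tp)

    for n in (10**3, 10**6, 10**9):
        print(n, round(efficiency_simulated(n, 16), 2))  # stuck at 0.25 however large n is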
Cost-Optimality and Isoefficiency Function Relationship
• Example: adding n numbers on a p-processor hypercube, cost-optimal version:
– W = Ω(p log p), i.e., n >> p, is needed for cost-optimality.
– W = Θ(n), W ≈ n.
– Tp = Θ(n/p + log p).
– To = Θ(n + p log p) − Θ(n) = Θ(p log p).
– W = K·Θ(p log p): the problem size should grow at least as fast as p log p, and then the parallel system is scalable.
Isoefficiency Function
• Determines the ease with which a parallel system can maintain a constant efficiency and thus achieve speedups increasing in proportion to the number of processors.
• A small isoefficiency function means that small increments in the problem size are sufficient for the efficient utilization of an increasing number of processors: the parallel system is highly scalable.
• A large isoefficiency function indicates a poorly scalable parallel system.
• The isoefficiency function does not exist for unscalable parallel systems, because in such systems the efficiency cannot be kept at any constant value as p increases, no matter how fast the problem size is increased.
Lower Bound on the Isoefficiency Function
• A smaller isoefficiency function means higher scalability.
• For a problem of size W, pmax = Θ(W) for a cost-optimal system (if p > W, some PEs are idle).
• If W grows more slowly than Θ(p), then as p keeps increasing, at some point the number of PEs exceeds W and the efficiency E decreases.
• Asymptotically, W = Ω(p): the problem size must grow at least proportionally to p to maintain a fixed efficiency, and pmax = O(W): p should grow at most as fast as W.
• Θ(p) is therefore the asymptotic lower bound on the isoefficiency function.
• The isoefficiency function of an ideal parallel system is W = Θ(p).
Degree of Concurrency and the Isoefficiency Function
• Degree of concurrency:
– The maximum number of tasks that can be executed simultaneously at any time.
– Independent of the parallel architecture.
– Denoted C(W): no more than C(W) processors can be employed effectively.
• Effect of concurrency on the isoefficiency function; example, Gaussian elimination:
– W = Θ(n^3)
– p = Θ(n^2)
– C(W) = Θ(W^(2/3)), i.e., at most Θ(W^(2/3)) processors can be used efficiently.
– Given p, W = Ω(p^(3/2)): the problem size must be at least Θ(p^(3/2)) to use all p processors.
– The isoefficiency due to concurrency is therefore Θ(p^(3/2)).
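A short illustrative calculation (not from the slides) of this relationship: if the problem size grows as Θ(p^(3/2)), the degree of concurrency C(W) = W^(2/3) is exactly p, i.e., just enough to keep all p processors busy:

    def concurrency(w):
        """Degree of concurrency for Gaussian elimination: C(W) = W^(2/3), i.e. n^2 when W = n^3."""
        return w ** (2.0 / 3.0)

    for p in (16, 64, 256, 1024):
        w = p ** 1.5                      # problem size growing as Theta(p^(3/2))
        print(p, round(concurrency(w)))   # C(W) = p for every row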
Degree of Concurrency and the Isoefficiency Function
• The isoefficiency function due to concurrency is optimal, i.e., Θ(p), only if the degree of concurrency of the parallel algorithm is Θ(W).
• If the degree of concurrency of an algorithm is less than Θ(W), then the isoefficiency function due to concurrency is worse, i.e., greater than Θ(p).
• The overall isoefficiency function of a parallel system is:
– Isoeff_system = max(Isoeff_concurrency, Isoeff_communication, Isoeff_overhead)
Sources of Parallel Overhead
• The overhead function characterizes a parallel system.
• Given the overhead function To = To(W, p), we can express Tp, S, E, and p·Tp (cost) as functions of (W, p).
• The overhead function encapsulates all causes of inefficiency of a parallel system, due to:
– the algorithm;
– the architecture;
– algorithm-architecture interaction.
Sources of Overhead
• Interprocessor communication:
– Each PE spends tcomm on communication; the overall interprocessor communication is p·tcomm (architecture impact).
• Load imbalance:
– Idle versus busy PEs (idling contributes to overhead).
– Example: in a sequential part of the computation, 1 PE does useful work Ws while the other (p − 1) PEs idle, so (p − 1)·Ws contributes to the overhead function.
Sources of Overhead
• Extra computation:
– Redundant computation (e.g., Fast Fourier Transform).
– W: work of the best sequential algorithm; W': work of a sequential algorithm that is easily parallelizable; (W' − W) contributes to overhead.
– With W = Ws + Wp, where Ws is executed by one PE only, the remaining (p − 1)·Ws contributes to overhead.
– Overhead of scheduling.
Minimum Execution Time (assume p is not a constraint)
• In general Tp = Tp(W, p); for a given W, what is Tp_min?
– Solve dTp/dp = 0; the solution gives the p0 for which Tp = Tp_min.
• Example: adding n numbers on a hypercube.
– Tp = n/p + 2 log p.
– dTp/dp = 0 gives p0 = n/2, and Tp_min = 2 log n.
– Sequential cost: Θ(n).
– Parallel cost at p0: p0·Tp_min = (n/2)·(2 log n) = Θ(n log n).
– Running this algorithm at Tp_min is NOT cost-optimal, but the algorithm itself IS cost-optimal (for smaller p).
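A small numerical confirmation (same unit-cost model) of the slides' result: at p0 = n/2 the run time equals 2·log2(n), and no other processor count does asymptotically better:

    import math

    def t_par(n, p):
        """Tp = n/p + 2*log2(p) under the unit add/communication cost model."""
        return n / p + 2 * math.log2(p)

    n = 1024
    # The slides set d(Tp)/dp = 0 (treating the log term as differentiating to 1/p),
    # which gives p0 = n/2 and Tp_min = 2*log2(n); check that value:
    print(t_par(n, n // 2), 2 * math.log2(n))      # 20.0  20.0
    # Brute-force check that no p gives an asymptotically better time:
    best = min(t_par(n, p) for p in range(1, n + 1))
    print(round(best, 2))                          # ~19.8, still Theta(log n)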
Minimum Execution Time
• Derive a lower bound on Tp such that the parallel cost is optimal.
– Tp_cost-optimal: the parallel run time such that the cost is optimal, for a fixed W.
– If the isoefficiency function is Θ(f(p)), then a problem of size W can be executed cost-optimally only if W = Ω(f(p)).
– Equivalently, p = O(f^-1(W)) is required for a cost-optimal solution.
– For a cost-optimal algorithm, p·Tp = Θ(W), so Tp = Θ(W/p).
– With p = Θ(f^-1(W)): Tp_cost-opt = Θ(W / f^-1(W)).
Minimum Cost-Optimal Time for Adding n Numbers on a Hypercube
• Isoefficiency function:
– To = p·Tp − W; Tp = n/p + 2 log p; To = p·(n/p + 2 log p) − n = 2p log p.
– W = K·To = 2Kp log p, so the isoefficiency function is W = Θ(p log p).
• Inverting f(p) = p log p:
– If W = n = f(p) = p log p, then log n = log p + log log p ≈ log p (ignoring the log log p term).
– So p = n / log p ≈ n / log n.
– Hence f^-1(W) = f^-1(n) = Θ(n / log n).
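A quick numerical sanity check (illustrative) that p = n/log2(n) processors keep the cost-optimality condition n = Θ(p log p) and give a run time of Θ(log n):

    import math

    def t_par(n, p):
        return n / p + 2 * math.log2(p)

    n = 1 << 20                      # n = 2^20
    p = int(n / math.log2(n))        # p ~ n / log n, the maximum cost-optimal processor count
    print(round(p * math.log2(p) / n, 2))        # ~0.78, a constant: n = Theta(p log p)
    print(round(t_par(n, p) / math.log2(n), 2))  # ~2.57: Tp ~ 3*log(n) - 2*log(log(n)) = Theta(log n)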
Minimum Cost-Optimal Time for Adding n Numbers on a Hypercube
• The cost-optimal solution requires p = O(f^-1(W)):
– p = Θ(n / log n) is the maximum number of processors for a cost-optimal solution.
• For p = n / log n, Tp = Tp_cost-opt with Tp = n/p + 2 log p:
– Tp_cost-opt = log n + 2 log(n / log n) = 3 log n − 2 log log n
– Tp_cost-opt = Θ(log n).
• Note:
– Tp_min = Θ(log n) and Tp_cost-opt = Θ(log n).
– Here the cost-optimal solution is also the best asymptotic solution in terms of execution time.
– Tp_min is attained at p0 = n/2, which is larger than the p0 = n/log n used for Tp_cost-opt.
– Tp_cost-opt = Θ(Tp_min).
Minimum Cost-Optimal Time
• There are parallel systems for which Tp_cost-optimal > Tp_min.
• Example:
– To = p^(3/2) + p^(3/4)·W^(3/4).
– Tp = (W + To)/p = W/p + p^(1/2) + W^(3/4)/p^(1/4).
– dTp/dp = 0 gives p^(3/4) = Θ(W^(3/4)), i.e., p0 = Θ(W), and Tp_min = Θ(W^(1/2)); the cost at p0 is then Θ(W^(3/2)), so running at Tp_min is not cost-optimal.
• Isoefficiency function:
– W = K·To gives W = K^4·p^3 = Θ(p^3), i.e., pmax = Θ(W^(1/3)), the maximum number of PEs for which the algorithm is cost-optimal.
– Substituting p = Θ(W^(1/3)) into Tp = W/p + p^(1/2) + W^(3/4)/p^(1/4) gives Tp_cost-opt = Θ(W^(2/3)).
• Tp_cost-opt > Tp_min asymptotically.