Scalability of Threaded Applications
Intel Software College

Objectives
After completion of this module you will understand
• The need for designing multithreaded applications for scalability to take advantage of an increasing number of available cores
• What tools are available to measure and predict scalability
• How several different factors can inhibit scaling of applications on an increased number of cores

Scalability of Multithreaded Applications
Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda
Why focus on scalability?
• Measuring and estimating scalability
• Where would you start?
Tools for scalability analysis
Factors inhibiting scalability
• Serially Dominant Workloads
• Granularity and Parallel Overhead
• Load Imbalance
• Synchronization Issues
• Memory Related Issues
• I/O

What is scalability?
“What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added.” -- Werner Vogels, CTO, Amazon.com
Handle growing amounts of work in a graceful manner
What resources might be increased?
• Cores and threads
• Memory capacity
• Data, problem size
  • Not a resource, but likely to see increases as computation power increases

Evolutionary configurable architecture
Large, scalar cores for high single-thread performance
Scalar plus many core for highly threaded workloads
Evolution:
• Dual core
  • Symmetric multithreading
• Multi-core array
  • CMP with ~10 cores
• Many-core array
  • CMP with 10s-100s low power cores
  • Scalar cores
  • Capable of TFLOPS+
  • Full System-on-Chip
  • Servers, workstations, embedded…

Amdahl’s Law
Maximum theoretical speedup from Amdahl's Law:
Ψ(p) ≤ 1 / (s + (1 - s) / p)
where 0 ≤ s ≤ 1 is the fraction of serial operations and p is the number of cores
Speedup is limited by the amount of serial code
[Figure: speedup (0 to 8) versus number of cores (0 to 8), one curve each for %serial = 0, 10, 20, 30, 40, 50]
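The bound above is easy to tabulate for a given serial fraction. The following is a minimal sketch in C; the function name `amdahl_speedup` is an illustrative choice and does not come from the slides.

```c
#include <assert.h>

/* Upper bound on speedup from Amdahl's Law.
 * s: fraction of serial operations, 0 <= s <= 1
 * p: number of cores, p >= 1
 * (amdahl_speedup is a hypothetical helper name.) */
double amdahl_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0);
    assert(p >= 1);
    return 1.0 / (s + (1.0 - s) / (double)p);
}
```

Calling `amdahl_speedup(0.5, 8)` gives roughly 1.78, which matches the flattening of the %serial = 50 curve: adding cores only shrinks the (1 - s) / p term.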

Question 1
If an application is only 25% serial, what is the maximum speedup you can ever achieve, assuming an infinite number of processors? (Ignore parallel overhead.)
A: 1.25
B: 2.0
C: 4.0
D: No speedup

Scaled Speedup (Gustafson-Barsis’s Law)
Amdahl’s Law does not take into account
• overhead costs
• increases in problem size able to be computed with more cores
Increasing the number of cores enables…
• Increasing the problem size -> Decreasing the sequential fraction of computation -> Increasing speedup
Scaled Speedup estimates how much faster parallel execution is over the same computation on a single core
Given p cores and a parallel code solving a problem of size n, let s be the fraction of serial execution in the code.
Ψ ≤ p + (1 - p)s
Assumes problem size increases linearly with number of cores
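As a sketch of the scaled-speedup bound, written in the form Ψ ≤ p + (1 - p)s (the form the 64-core worked example in this deck evaluates), the C function below takes the serial fraction measured on the parallel run. The name `scaled_speedup` is an assumption made for illustration.

```c
#include <assert.h>

/* Scaled speedup (Gustafson-Barsis): p + (1 - p) * s.
 * s: fraction of the PARALLEL run's time spent in serial code, 0 <= s <= 1
 * p: number of cores, p >= 1
 * (scaled_speedup is a hypothetical helper name.) */
double scaled_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0);
    assert(p >= 1);
    return (double)p + (1.0 - (double)p) * s;
}
```

Note the difference from Amdahl's Law: here s is measured on the p-core run, so the bound degrades linearly in s rather than hyperbolically.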

Using Scaled Speedup
If an application runs on 64 cores in 220 seconds with 11 seconds devoted to serial execution, what is the scaled speedup?
Ψ = 64 + (1 - 64)(11/220) = 64 - 63 × 0.05 = 60.85
(Amdahl’s Law with 5% serial on 64 cores => 15.42)
Assuming fixed serial time, what is the single core execution time?
(220 - 11) × 64 + 11 = 13387 seconds
• Amdahl’s Law then yields a speedup of 60.85 on 64 cores with 0.0822% serial time (11 / 13387)
Would serial time be fixed? Would the problem fit on one core?

Question 2
What is the maximum fraction of serial execution time for a parallel application to achieve a scaled speedup of 7.5 on an eight-core system?
7.5 = 8 + (1 - 8)s
s = 0.5 / 7 = 0.071 => 7.1%
Using Amdahl’s Law, serial percentage must be ≤ 0.952%
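The Amdahl figure quoted for this question can be checked by solving Amdahl's Law for s with p = 8 and a target speedup of 7.5. The derivation below is a worked check, not part of the original slide:

```latex
\begin{align*}
7.5 &= \frac{1}{s + (1 - s)/8} \\
s + \frac{1 - s}{8} &= \frac{1}{7.5} \\
8s + 1 - s &= \frac{8}{7.5} \\
7s &= \frac{8}{7.5} - 1 = \frac{0.5}{7.5} \\
s &= \frac{0.5}{52.5} \approx 0.00952 \quad (0.952\%)
\end{align*}
```

The two laws give different percentages because they measure s against different baselines: the parallel run for scaled speedup, the single-core run for Amdahl's Law.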

Estimating potential scalability of serial applications
• Need to estimate serial vs. parallelizable execution times
  • Speedup estimate based on Amdahl’s law
• VTune sampling
  • Identify potential areas for parallelization
    • Example: loops
  • Use clock ticks to estimate parallel time
  • Serial time = Total run time - parallelizable run time
  • Compute scalability estimate
• VTune call graph
  • See potential call trees for parallelization
  • Use “Total time” (self + descendants) for parallelizable run time

Estimating scalability upper bound for parallel applications
Need to estimate serial vs. parallel execution times
• Speedup estimate based on Amdahl’s law
• Serial percentage for Gustafson-Barsis’s Law
Thread Profiler
• Use critical path information in Profile View
• Use information in Concurrency Level view
Experimental technique based on CPU utilization of all processors/cores

Finding Serial and Parallel Time
Thread Profiler (for parallel applications)
• Use Concurrency Level View
• Total Serial (CL: 0 and CL: 1) and Parallel (CL: 2 and up) times
• Under Utilized times counted as parallel time

Finding Serial and Parallel Time: CPU utilization
Experimental approaches (for parallel applications)
• Monitor utilization of all CPUs over time
• Parallel region is where all CPUs are active
• Perfmon* (Windows) or mpstat (Linux)
• Example: 76% serial, 24% parallel on DP
Perfmon* or mpstat does not capture sub-second behavior
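The experimental approach above can be sketched as a simple classification of sampling intervals. The C function below assumes the utilization log has already been reduced to one integer per interval, the number of CPUs judged busy in that interval; both that preprocessing step and the name `serial_fraction` are assumptions, not something Perfmon or mpstat produce directly.

```c
#include <assert.h>
#include <stddef.h>

/* busy[i] = number of CPUs considered busy during sampling interval i
 *           (how "busy" is thresholded is left to the caller).
 * n       = number of intervals, ncpus = CPUs in the system.
 * Returns the fraction of intervals in which NOT all CPUs were busy,
 * i.e. an estimate of the serial fraction; 1 minus this value is the
 * parallel estimate. (serial_fraction is a hypothetical helper name.) */
double serial_fraction(const int *busy, size_t n, int ncpus)
{
    size_t serial = 0;

    assert(busy != NULL && n > 0 && ncpus >= 1);
    for (size_t i = 0; i < n; i++) {
        if (busy[i] < ncpus)
            serial++;
    }
    return (double)serial / (double)n;
}
```

Because the samples are typically one second apart or coarser, short serial or parallel phases inside an interval are invisible to this estimate.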

How can speedup estimate help identify scalability issues?
• Different workloads can exercise different parts of application
• Estimates can point to workloads that need scalability analysis and improvement
• Compare measured vs. estimate
• Choose largest delta workloads for analysis
[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1); measured 2P and 4P speedup (0.0 to 4.0) for Workload 01 through Workload 13]
Workload 11 is predicted to have low scaling
Workloads 5 & 12 show significant difference between estimate and actual; focus tuning here

Quick Review: Measuring and Estimating Speedup
Estimate serial vs. parallel times in workloads
• Allows prediction of speedup upper bounds
Serial applications
• Estimate based on VTune Sampling or Callgraph runs
Parallel applications
• Use Thread Profiler
• Experimental techniques
  • Measuring CPU utilization over time for all processors

Approaching a serial application
1. Pick a workload
2. Establish a scalability target
   • Example: Must have at least 2.5x improvement from 1 core to 4 cores
3. Estimate amount of parallelization required
   • Dictated by Amdahl’s law
     • Example: 2.5x improvement from 1 core to 4 cores would require 80% of run time to be parallelized
   • Identify areas to parallelize
   • Cannot find areas to meet required amount of parallelization?
     • Reset scalability target and continue parallelization
4. Parallelize and measure speedup
5. Did you meet the scalability target?
   • If not, root cause and improve
Repeat for other workloads
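The 80% figure in step 3 comes from solving Amdahl's Law for the parallel fraction f: with target speedup S on p cores, 1/S = (1 - f) + f/p, so f = (1 - 1/S) / (1 - 1/p). A minimal C sketch follows; the function name `required_parallel_fraction` is illustrative.

```c
#include <assert.h>

/* Fraction of run time that must be parallelized to reach a target
 * speedup on p cores, from Amdahl's Law (no parallel overhead assumed).
 * target: desired speedup, target >= 1
 * p:      number of cores, p >= 2
 * A result above 1.0 means the target exceeds p and cannot be met.
 * (required_parallel_fraction is a hypothetical helper name.) */
double required_parallel_fraction(double target, int p)
{
    assert(target >= 1.0);
    assert(p >= 2);
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / (double)p);
}
```

For the slide's example, `required_parallel_fraction(2.5, 4)` evaluates to 0.6 / 0.75, about 0.8. If profiling cannot find that much parallelizable run time, the target has to be reset as the step describes.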

Approaching a parallel application
1. Pick a workload
2. Estimate expected speedup
   • Amdahl, Gustafson-Barsis
3. Measure speedup
4. Did you meet the expected scaling?
   • If not, root cause and improve
Repeat for other workloads

Question 3a: What is the best design for scalability?
Audio processing application
• Left channel computation
• Right channel computation

Question 3b: What is the best design for scalability?
Video stream encoding
• Thread intra-frame?
• Thread groups of pictures?

Question 3c: What is the best design for scalability?
Room Assignment Problem (Simulated Annealing)
Goal: Find most compatible roommate assignments
Method:
• Roomers take interest survey
• Roommates initially chosen at random
• Two people are swapped at random
• Does new assignment increase common interests in roommates (reduce conflict)?
  • If yes, keep new assignment
  • If no, undo swap; shrinking random chance to keep bad match
• Continue until solution stabilizes

Windows*: Perfmon*
Recommended first set of counters
• “Processor” performance object: % Processor Time, % Privileged Time (for each CPU)
• “System” performance object: Context Switches/sec, System Calls/sec
• “PhysicalDisk” performance object: Disk Read Bytes/sec, Disk Write Bytes/sec (for each disk)
• “Memory” performance object: Pages/sec
• “Network Interface” performance object: Bytes Total/sec (for each network card)
Windows command line tools available
• Logman
• Relog
• Typeperf

Windows*: Fixing Process to Core
Eliminate “noise” from context switches that abandon cache
Windows Task Manager
• “Process” tab: right click on process to set affinity
Windows APIs
• SetProcessAffinityMask
• SetThreadAffinityMask

VTune* call graph
Helps isolate call trees for potential threading

VTune Counter Monitor
Tracks operating system counters over time
Some relevant counters:
• Processor time
• Available memory
• Context switches

Intel Thread Profiler
Identifies
• Serial vs. parallel run times
• Lock contention areas
• Parallel overhead
• Load imbalance

Loop graph viewer
Being considered for VTune 9.0
View program as loop hierarchies
• Loop self times and total times (in terms of instructions retired)
  • Similar to call graph self and total times
• Loop counts
Helps identify loops for coarse grain threading
• Loop hierarchies can span functions and files
PIN tool based prototype
• Currently Linux-only

Quick Review: Tools
No single tool may give you all the answers for scalability issues
• Thread Profiler comes close
Simple tools can provide insight into scalability issues
• Perfmon
• Monitoring of CPU utilization of processors and application threads

Effects of serial domination
Serially dominated workloads do not scale well
• Amdahl’s Law
How to estimate serial time?
• VTune sampling, VTune Call graph
  • Serial applications
• Thread Profiler, experimental approaches
  • Parallel applications

Case Study 1
[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1); measured 2P and 4P speedup (0.0 to 4.0) for Workload 01 through Workload 13]
• 76% serial time on DP
• 53% non-concurrent time on UP
• 4P theoretical scaling 1.6X
• 4P measured scaling 1.1X
Parallelize serial sections for better scalability

Question 4
Profile shows 80% of runtime spent calculating multidimensional FFT
Assume calling sequence of fft2d -> fft1d -> fftcc
gprofile: Each sample counts as 0.01 seconds.

  %    cumulative    self                  self      total
 time    seconds    seconds      calls    ms/call   ms/call   name
64.71     93.43      93.43    23952910      0.00      0.00    fftcc_
11.47    110.00      16.57    23952910      0.00      0.00    fft1d_
11.41    126.47      16.47         100    164.70   1402.89    ssf_3dcs_
 4.94    133.59       7.13      151600      0.05      0.77    fft2d_
 2.94    137.84       4.25       37900      0.11      0.11    phaseshift3d_
 2.24    141.07       3.23         100     32.30     32.30    imaging_

Where would you thread for better scaling?
A: fftcc
B: fft1d
C: fft2d
D: All of the above

Top-Down Design
The iterative “hotspot” tuning process:
• Find and fix hot spot…
• Find and fix hot spot…
The top-down parallelization process:
• Find the highest level of natural parallelism…
Top-down approach considers the parallelism of the whole application rather than individual hotspots. The result is usually a more scalable, parallel application.

Granularity
Loosely defined as the ratio of computation to synchronization
Be sure there is enough work to merit parallel computation
Example: Working on the railroad. How many more workers can be added?

Activity 1: Workload-dependent Scaling
Lab shows a chosen number of spheres bouncing within an enclosed box
• Obey laws of physics for bouncing off walls and colliding with other spheres
User is able to control
• Number of spheres
• Amount of physics computation before rendering
• Whether to run with single thread or multithreaded
  • Load balance between threads will be explored in later lab
GUI frames per second displayed is the performance metric

Parallel Overhead
Parallel overhead impacts scalability
Thread creation/destruction
• Amount of work vs. overhead
• Thread Pool (Windows*) may be a good solution
Synchronization
• Call overhead
• Transition in and out of kernel space
Possible indicators
• High kernel time
• Thread Profiler
  • Critical Path view showing large overhead times

Example: Threading Quicksort
Algorithm:
• Pick pivot value from elements
• Partition data around pivot
  • Less-than or equal to pivot
  • Greater than pivot
• Quicksort the two partitions
[Figure: array spanning indices p to r with the pivot at q; "less-than or equal" elements to its left, "greater-than" elements to its right]

    void QuickSort(int p, int r)   // Assume global array of data
    {
        if (p < r) {
            int q = Partition(p, r);
            QuickSort(p, q-1);     // sort less-than
            QuickSort(q+1, r);     // sort greater-than
        }
    }

Example: Threading Quicksort
How about creating threads at each recursive call?

    typedef struct { int s, t; } qParams;   // packs both indices into the single thread parameter

    DWORD WINAPI QuickSort(LPVOID pr)
    {
        int p = ((qParams *)pr)->s;
        int r = ((qParams *)pr)->t;
        qParams lo, hi;
        HANDLE hLOHI[2];

        if (p < r) {
            int q = Partition(p, r);
            lo.s = p;   lo.t = q;
            hi.s = q+1; hi.t = r;
            // one new thread per partition, at every level of recursion
            hLOHI[0] = CreateThread(NULL, 0, QuickSort, (LPVOID) &lo, 0, NULL);
            hLOHI[1] = CreateThread(NULL, 0, QuickSort, (LPVOID) &hi, 0, NULL);
            WaitForMultipleObjects(2, hLOHI, TRUE, INFINITE);
            CloseHandle(hLOHI[0]);   // release thread handles once both finish
            CloseHandle(hLOHI[1]);
        }
        return 0;
    }

Quicksort Performance Results
[Figure: performance results for the thread-per-recursive-call Quicksort]

Is There a More Scalable Quicksort Implementation?
Thread pool to control number of threads
Producer/Consumer relationship with index pair queue
• Dequeue pair struct from queue and partition (Consumer)
• Recursive calls become enqueue of index struct (Producer)
Encapsulation of index pairs done in queue routines
Semaphore counts number of pairs in queue

    DWORD WINAPI QuickSort(LPVOID pArg)
    {
        int p, r, q;

        while (1) {
            WaitForSingleObject(hSem, INFINITE);   // wait until a pair is available
            dequeue(&p, &r);
            if (p < r) {
                q = Partition(p, r);
                enqueue(p, q);
                enqueue(q+1, r);
                ReleaseSemaphore(hSem, 2, NULL);   // two new pairs are now in the queue
            }
        }
        return 0;
    }
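The key transformation above, replacing recursive calls with enqueues of index pairs, can be seen in isolation in a portable, single-threaded sketch. The code below is an illustration under stated assumptions, not the slide's implementation: it uses one consumer and a plain array as the work list instead of a shared queue guarded by a semaphore, and it assumes a Lomuto-style partition (the slides do not show `Partition`), so the pivot index q is excluded from both sub-ranges. All names (`Range`, `partition`, `quicksort_worklist`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int lo, hi; } Range;   /* index pair, as in the slide's queue */

/* Lomuto partition (an assumption): pivot is a[hi]; returns its final index. */
static int partition(int *a, int lo, int hi)
{
    int pivot = a[hi];
    int i = lo;

    for (int j = lo; j < hi; j++) {
        if (a[j] <= pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}

/* Sorts a[0..n-1] by draining a work list of index pairs.
 * Returns 0 on success, -1 if the work list cannot be allocated. */
int quicksort_worklist(int *a, int n)
{
    assert(a != NULL && n >= 0);
    if (n < 2)
        return 0;

    /* Pending ranges are disjoint and hold at least two elements each,
     * so n entries is more than enough capacity. */
    Range *work = malloc((size_t)n * sizeof *work);
    if (work == NULL)
        return -1;

    int count = 0;
    work[count++] = (Range){0, n - 1};

    while (count > 0) {
        Range r = work[--count];                 /* "dequeue" (consumer) */
        int q = partition(a, r.lo, r.hi);

        if (r.lo < q - 1)                        /* "enqueue" (producer) */
            work[count++] = (Range){r.lo, q - 1};
        if (q + 1 < r.hi)
            work[count++] = (Range){q + 1, r.hi};
    }
    free(work);
    return 0;
}
```

In the threaded version each pool thread runs the body of the `while` loop, and the semaphore count plays the role of `count` here: a thread blocks when no pairs are pending instead of exiting the loop.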

Quicksort Thread Pool Performance
[Figure: performance results for the thread pool Quicksort]

Looking at Load Imbalances
Load imbalance reduces scalability
Why?
• Idle CPU
• Easier to spot on 4P or above
• On 2P, idle times might be mistaken as “serial” sections
How do you detect this?
• Windows* Perfmon
• Linux* mpstat
• Thread Profiler
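One way to see why idle CPUs cap scaling: if a fixed amount of work is split across threads and the parallel region ends only when the slowest thread finishes, the speedup over running all the work on one thread is the total work divided by the largest share. The C sketch below makes that simplifying assumption (no other overhead, no serial section); the name `imbalance_limited_speedup` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* work[i] = busy time of thread i for one parallel region.
 * Returns sum(work) / max(work): the speedup over one thread when the
 * region finishes only after the most loaded thread is done.
 * (imbalance_limited_speedup is a hypothetical helper name.) */
double imbalance_limited_speedup(const double *work, size_t n)
{
    double sum = 0.0, max = 0.0;

    assert(work != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        sum += work[i];
        if (work[i] > max)
            max = work[i];
    }
    assert(max > 0.0);
    return sum / max;
}
```

With four threads doing 10 units each the result is 4.0; with shares of 25, 5, 5, and 5 it drops to 1.6 even though all four cores are in use for part of the run.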

Case Study 2
Linux* mpstat CPU data
• 2P data not suggestive of load imbalance
• 4P data shows CPUs drop off
[Figure: mpstat CPU utilization over time for the 2P and 4P runs, with regions labeled by the number of active cores (1, 2, or 4)]

Case Study 2: Improved
Linux* mpstat CPU data
Second figure shows improvement with load balancing
• 1P-4P scaling improves from 2.1x to 2.7x
[Figure: mpstat CPU utilization after load balancing, with all 4 cores active]

Spotting Load Imbalance in Thread Profiler
First problem noticed is create/destroy threads for each iteration…
…but there is a difference in Active Thread state within pairs.
[Figure: Thread Profiler timeline with callouts marking differences in Active Thread state]

Activity 2: Effects of Load Balance in Multi-Threaded Implementation
Use Load Balance control within Basic Physics GUI to control number of spheres assigned to threads
How does this affect the FPS measure?

Synchronization
[Figure: timeline of Threads 0-3 over Time; legend: Busy, In Critical, Idle]
Lost time waiting for locks
Most likely scenario for high contention
• Work inside AND outside the protected region is very small
• "Threads pile up" on the lock
• Symptoms: high context switches/sec, high kernel times
Spotting highly contended synchronization objects
• Thread Profiler

Lock Contention Indicators in Thread Profiler
• Large amount of Impact time associated with object
• Large percentage of Locks time

Synchronization Primitives
Windows*
Choice of synchronization primitives
• Atomic increments/decrements
  • InterlockedIncrement
• Critical Section, Critical Section with spin count
  • EnterCriticalSection, LeaveCriticalSection, SetCriticalSectionSpinCount
  • Works within a single process
• Events
  • Signal that a condition has been changed/satisfied
• Mutex
  • Works across processes as well
• Semaphore
  • Works across processes as well

Activity 3: Measuring Synchronization Object Overhead
Determine overhead for using different synchronization objects
• InterlockedIncrement
• CRITICAL_SECTION with spin count
• Mutex
• Semaphore
CRITICAL_SECTION is used as baseline
• InterlockedIncrement is specialized functionality; others are more general

Synchronization Primitives Costs: Un-contended
[Figure: relative cost of un-contended synchronization primitives; callout: >50x]
Use the least expensive synchronization method possible

Lock Contention
Lock contention reduces scalability
The following factors combine to produce contention and reduce scalability
• Amount of work inside vs. outside the protected region
• Synchronization primitive costs
• OS context switches during lock contention
Possible indicators (without Thread Profiler)
• High context switches/sec
  • >10,000/s should be investigated
• And high kernel time
  • >20% should be investigated
Watch for high context switches/sec and kernel time

Reducing Lock Contention
Lock contention reduces scalability: how can it be fixed?
• Ideally, work inside << work outside
  • Redesign
• Explore use of "spin count" (Windows*)
  • InitializeCriticalSectionAndSpinCount, SetCriticalSectionSpinCount
  • #define _WIN32_WINNT 0x0403 // or higher
  • Spin count = 4000 recommended by Microsoft*
  • Not very portable

Case Study 3
[Figure: "2P and 4P Speedup (IBM* X440 .NET RC1)"; y-axis: Speedup, 0.0 to 4.0; x-axis: Workload 01 to Workload 13; series: Measured 2P Speedup, Measured 4P Speedup]
• 17% serial time on 2P
• 9% serial time on UP
• 4P theoretical scaling 3.1X
• Measured 4P scaling is 0.4X
Why should we think this is a synchronization issue?
• Pretty good load balance
Clues:
• 50% kernel time
• >40K context switches/sec on 4P
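The 3.1X theoretical figure appears consistent with Amdahl's law applied to the 9% serial fraction measured on UP: 1 / (0.09 + 0.91/4) is about 3.15. The sketch below is only an illustration of that arithmetic; the function name `amdahl_speedup` is hypothetical and is not part of the course material.

```c
#include <assert.h>

/* Amdahl's law: upper bound on speedup with n processors when a
 * fraction `serial` (between 0 and 1) of the one-processor run time
 * cannot be parallelized. Hypothetical helper for illustration. */
double amdahl_speedup(double serial, int n)
{
    assert(serial >= 0.0 && serial <= 1.0);
    assert(n >= 1);
    return 1.0 / (serial + (1.0 - serial) / n);
}
```

Calling `amdahl_speedup(0.09, 4)` gives roughly 3.15. A measured speedup far below that bound, as in this case study, suggests that something other than the serial fraction is limiting scaling.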

Case Study 3: Speedup
We have a negative scaling problem…
If adding more threads results in worse performance, there must be increased contention on some shared resource

Case Study 3: First Approach
Root cause: a class defined a critical section as a static member variable
Solution: have each instance of the class use a separate lock by removing the static declaration
Before: 4 threads randomly accessing 8 lights with 1 global lock
After: 4 threads randomly accessing 8 lights with 8 private locks

Case Study 3: Performance Observations
• 4P scaling has improved from 0.7x to 1.3x
• There is still much work to do:
  • Now 80,000 context switches/second
  • Utilization of each CPU near 75%

Case Study 3: Second Approach
Perfmon* observations (4P)
• Almost no serial execution, utilization of each CPU near 50%
• Almost 200,000 context switches/sec!

Case Study 3: Diagnosis
Root cause: poor choice of synchronization primitive
• Computation is incrementing a single variable
• Threads are contending on a single Critical Section object
Solution: use InterlockedIncrement
• Critical section with spin count is another possibility
Use the least expensive synchronization method possible
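As a rough illustration of the diagnosis, the sketch below replaces a lock-protected increment with an atomic one. It uses the portable C11 `atomic_fetch_add` rather than the Windows* InterlockedIncrement named on the slide, so it is an analog and not the case study's actual code; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Counter shared by all threads (hypothetical type for illustration). */
typedef struct {
    atomic_long value;
} shared_counter;

void counter_init(shared_counter *c)
{
    assert(c != NULL);
    atomic_init(&c->value, 0);
}

/* Atomically adds 1 and returns the new value, the same return
 * convention as InterlockedIncrement. No lock is acquired, so threads
 * never block or context-switch while waiting for each other. */
long counter_increment(shared_counter *c)
{
    assert(c != NULL);
    return atomic_fetch_add(&c->value, 1) + 1;
}

long counter_read(shared_counter *c)
{
    assert(c != NULL);
    return atomic_load(&c->value);
}
```

Any number of threads may call `counter_increment` on the same counter concurrently. The counter's cache line is still shared between cores, so heavy use can still limit scaling, but the OS-level waiting that drives context switches is avoided.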

How Much of Data Structure to be Locked?
Example: array of counts/buckets/pointers (random access)
• Enumeration sort, radix sort, bucket sort
• Hash table
Lock whole structure?
• Easy to implement
• Severely restricts access
Lock individual elements?
• Individual access by different threads
• Extra space in structure

Modulo Locks
Assuming little contention for individual elements
Create an array of locks, each protecting every Kth element
• Fixed number of locks, say 2
• Lock index determines which elements are protected
  • To access element Data[Q], a thread must hold LOCK[Q % 2]
• Works for 2-D and 3-D arrays
  • For example, with eight locks, accessing A[i, j] would use LOCK[(i+j) % 8]
Set number of locks equal to number of threads
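The modulo-lock scheme can be sketched as follows. This is a minimal illustration using POSIX mutexes rather than Windows* critical sections; the array sizes and all names (`lock_table`, `lock_index`, `locked_add`) are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_LOCKS 8     /* fixed number of locks (assumed value) */
#define N_DATA  1024  /* size of the protected array (assumed value) */

static pthread_mutex_t lock_table[N_LOCKS];
static long data[N_DATA];

/* Call once, before any thread touches the array. */
void modulo_locks_init(void)
{
    for (size_t k = 0; k < N_LOCKS; k++)
        pthread_mutex_init(&lock_table[k], NULL);
}

/* 1-D case: element q is protected by LOCK[q % N_LOCKS]. */
size_t lock_index(size_t q)
{
    return q % N_LOCKS;
}

/* 2-D case: element (i, j) is protected by LOCK[(i + j) % N_LOCKS]. */
size_t lock_index_2d(size_t i, size_t j)
{
    return (i + j) % N_LOCKS;
}

/* Adds delta to data[q] while holding only the one lock covering q,
 * so threads working on elements under other locks are not blocked. */
void locked_add(size_t q, long delta)
{
    assert(q < N_DATA);
    pthread_mutex_t *m = &lock_table[lock_index(q)];
    pthread_mutex_lock(m);
    data[q] += delta;
    pthread_mutex_unlock(m);
}
```

The space cost is N_LOCKS mutexes regardless of the array size, which sits between one global lock and one lock per element.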

Agenda
Why focus on scalability?
• Measuring and estimating scalability
• Where would you start?
Tools for scalability analysis
Factors inhibiting scalability
• Serially Dominant Workloads
• Granularity and Parallel Overhead
• Load Imbalance
• Synchronization Issue
• Memory Related Issues
• I/O

Frontside Bus (FSB) Bandwidth
Cores share the bus in current Intel® multi-core architectures
• Saturating the bus limits scalability
• Newer independent bus designs improve scalability
• Applies to current SMP platforms too
Good metric to monitor if
• CPU utilization is close to 100%
• Poor scaling to 4P or 8P
• Low context switches/sec
How do you measure this?
• VTune™ Performance Analyzer
• Compare 1-thread vs. multi-thread VTune runs
  • Look for areas where clock ticks show significant jumps
  • Code inspection

Case Study 4
IPF Madison 1.5 GHz/9M/400 MHz
• 1P to 2P scaling: 1.28
• 1P to 4P scaling: 1.27
2P close to FSB saturation
• ~5 GB/s
• Madison 400 MHz bus peak bandwidth is 6.4 GB/s
Solution?
• Change algorithm / data structures to keep data in cache more often
• Easier said than done

Frontside Bus Lab
Intent of this lab
• Observe impact of FSB saturation on scalability using the Stream benchmark
• Learn use of the appropriate VTune performance event to monitor bus utilization
[Figure: Woodcrest Socket 0 and Woodcrest Socket 1; Cores 0-7, each labeled 2M L2; Bus 0 and Bus 1 connect to the chipset MCH]

Computing FSB Data Bandwidth
Core 2™ Processor
Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB
• BDC.ta is the BUS_DRDY_CLOCKS.THIS_AGENT event count
  • Counts the number of bus cycles when data is sent on the bus (the DRDY [Data Ready] signal is asserted on the bus)
• CCU.b is the CPU_CLK_UNHALTED.BUS event count
  • Counts the number of bus cycles that occurred during measurement (bus cycles when the core is not halted)
• TB is the theoretical bandwidth of the bus in MB/s = 8 * bus frequency
  • Example: 2.6 GHz processor with 1333 MHz FSB
  • Theoretical bandwidth = 8 bytes/clock * 1333 = 10664 MB/s
Total bus bandwidth = ∑ "per core" bandwidths

Frontside Bus Lab: VTune Notes
• Where is the "BUS_DRDY_CLOCKS.THIS_AGENT" event?
  • Configure Sampling Events Tab -> Event Groups: "External Bus Events"
• View results as Table in VTune
  • Easier to compute bandwidth
• View results per CPU (Show/Hide CPU Info.)

Frontside Bus Lab: Example
Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB
• TB = 8 bytes/clock * 1333 MHz = 10664 MB/s
Processor 0 BW = (159,204,930 / 1,318,288,023) * 10664 MB/s = (0.159/1.318) * 10.7 GB/s ~ 1.29 GB/s
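The per-core formula and the total can be written as small helpers so the arithmetic does not have to be repeated by hand for each processor. This is a sketch only; the function names are hypothetical, and the event counts are assumed to have been copied out of the VTune table view.

```c
#include <assert.h>

/* TB: theoretical bus bandwidth in MB/s = 8 bytes/clock * bus MHz. */
double fsb_theoretical_mb_s(double bus_mhz)
{
    return 8.0 * bus_mhz;
}

/* Per-core bandwidth in MB/s = (BDC.ta / CCU.b) * TB, where
 * drdy_clocks is BUS_DRDY_CLOCKS.THIS_AGENT and
 * unhalted_bus_clocks is CPU_CLK_UNHALTED.BUS for that core. */
double fsb_core_bandwidth_mb_s(double drdy_clocks,
                               double unhalted_bus_clocks,
                               double bus_mhz)
{
    assert(unhalted_bus_clocks > 0.0);
    return drdy_clocks / unhalted_bus_clocks * fsb_theoretical_mb_s(bus_mhz);
}

/* Total bus bandwidth = sum of the per-core bandwidths. */
double fsb_total_bandwidth_mb_s(const double *drdy_clocks,
                                const double *unhalted_bus_clocks,
                                int n_cores, double bus_mhz)
{
    double total = 0.0;
    for (int i = 0; i < n_cores; i++)
        total += fsb_core_bandwidth_mb_s(drdy_clocks[i],
                                         unhalted_bus_clocks[i], bus_mhz);
    return total;
}
```

With the Processor 0 counts from the slide, `fsb_core_bandwidth_mb_s(159204930.0, 1318288023.0, 1333.0)` evaluates to roughly 1288 MB/s, in line with the ~1.29 GB/s shown.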

Activity 4: Measuring Frontside Bus Saturation
Intent of this activity
• Observe impact of FSB saturation on scalability using the Stream benchmark
• Learn use of the appropriate VTune performance event to monitor bus utilization
What you may see
• Scalabilitybus0.bat: ~5 sec / ~1.2 GB/s
• Scalabilitybus0123.bat: ~18 sec / ~4 GB/s

Frontside Bus Lab: 4 Core Examples
Processors 0, 1, 2, 3
• Processor 0 BW = (0.308/3.216) * 10.7 GB/s ~ 1.02 GB/s
• Processor 1 BW = (0.312/3.311) * 10.7 GB/s ~ 1.00 GB/s
• Processor 2 BW = (0.308/3.224) * 10.7 GB/s ~ 1.02 GB/s
• Processor 3 BW = (0.311/3.305) * 10.7 GB/s ~ 1.00 GB/s
• Total BW ~ 1.0 * 4 = 4 GB/s
Processors 0, 2, 4, 6
• Processor 0 BW = (0.309/4.922) * 10.7 GB/s ~ 0.67 GB/s
• Processor 2 BW = (0.313/4.989) * 10.7 GB/s ~ 0.67 GB/s
• Processor 4 BW = (0.315/5.027) * 10.7 GB/s ~ 0.67 GB/s
• Processor 6 BW = (0.315/5.023) * 10.7 GB/s ~ 0.67 GB/s
• Total BW ~ 0.67 * 4 = 2.68 GB/s

Frontside Bus Lab: 8 Core Example
Total BUS_DRDY_CLOCKS.THIS_AGENT = 2,524,773,727
Average Total CPU_CLK_UNHALTED.BUS = 52,382,932,480 / 8 = 6,547,866,560
Total Avg. BW = 2,524,773,727 / 6,547,866,560 * 10.7 GB/s ~ 4.11 GB/s

Frontside Bus Lab: Discussion
Almost 3x slower run time from 1 to 4 cores
• Same amount of data transferred by each thread
• Contention for the shared bus makes everything run slower
How do clock ticks compare in the first 2 runs?
• Notice that the clock ticks go up significantly in the 4-stream case in the source view as well
Why is there a difference in MB/s reported by Stream vs. what you calculated using VTune?

Frontside Bus Lab: Caution
Measuring BW on an underutilized system
• Process or thread migration can break the formula
• Example: single-thread Stream allowed to migrate in the lab
  • Is the bandwidth used equal to (3.9 * 4 =) 15.6 GB/s?
Best to tie threads to cores for bandwidth analysis

What is False Sharing?
Multiple threads repeatedly write to the same cache line shared by cores
• Usually different data
• Cache lines get invalidated
  • Forces additional reads from memory
• Severe performance impact in tight loops, in general
  • Threads read/write to the same cache line very rapidly
• Good metric to monitor if
  • CPU utilization of all processors is very high
  • Poor scaling to 4P or 8P
  • Low context switches/sec

Detecting False Sharing with VTune Analyzer
Core 2® processor-based events:
• MACHINE_NUKES.MEM_ORDER event
• Significant last level cache read misses
  • 2nd Level or 3rd Level Cache Read Misses
  • MEM_LOAD_RETIRED.L2_MISS
• Significant FSB activity
  • BUS_DRDY_CLOCKS.THIS_AGENT
Compare 1-thread vs. multi-thread VTune runs
• Look for areas where clock ticks show significant jumps
• Code inspection

False Sharing Example 1
    #include <omp.h>
    #define N_THREADS 16
    /* N, x, and y are defined elsewhere */
    double sum = 0.0, sum_local[N_THREADS];
    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        sum_local[me] = 0.0;
        #pragma omp for
        for (int i = 0; i < N; i++)
            sum_local[me] += x[i] * y[i];  /* adjacent slots share a cache line */
        #pragma omp atomic
        sum += sum_local[me];
    }
Callouts:
• No overlap of memory access; no sync needed
• Each thread can invalidate the cache line for others
• To fix, declare and use a true local sum variable for each thread
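The fix named on the slide, a true local variable per thread, might look like the sketch below. The function wrapper `dot_product` and its parameters are assumptions added so the fragment is self-contained; the slide shows only the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with a per-thread private accumulator. Because `local`
 * is declared inside the parallel region, each thread gets its own
 * copy (in a register or on its own stack), so the inner loop never
 * writes to a cache line that another thread is also writing. */
double dot_product(const double *x, const double *y, int n)
{
    assert(x != NULL && y != NULL && n >= 0);
    double sum = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;

        #pragma omp for
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];

        /* One synchronized update per thread instead of one per element. */
        #pragma omp atomic
        sum += local;
    }
    return sum;
}
```

OpenMP's `reduction(+:sum)` clause on the loop is a shorter way to get the same effect. If the code is compiled without OpenMP support the pragmas are ignored and the function simply runs serially.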

False Sharing Example 2
Normalization of an array of spatial vectors (double precision)
• 10,000 vectors (<256K size; fits in L2)
• 5 1/3 vectors per cache line
False sharing case
• Round-robin distribution
• Each thread works on "start index + i*Num_Threads"
[Figure: vectors V0, V1, V2, … V9, … assigned in turn to Thread 0 through Thread 3]
No false sharing case
• Each thread works on a block of data
• Block per thread = Array size / Num_Threads
[Figure: contiguous blocks: V0 … V2499 (Thread 0), V2500 … V4999 (Thread 1), V5000 … (Thread 2), … (Thread 3)]
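The two distributions differ only in how a thread maps its loop counter to an array index, which can be sketched as below. The function names are hypothetical, and giving the remainder to the last thread when the array size is not a multiple of the thread count is an assumption the slide does not address.

```c
#include <assert.h>

/* Round-robin: the i-th vector handled by a thread is
 * thread_id + i * num_threads. Neighboring vectors belong to different
 * threads, so several threads write into every cache line. */
int round_robin_index(int thread_id, int i, int num_threads)
{
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    assert(i >= 0);
    return thread_id + i * num_threads;
}

/* Block: each thread owns the contiguous range [begin, end), with
 * block size = n / num_threads. */
int block_begin(int thread_id, int n, int num_threads)
{
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    assert(n >= 0);
    return thread_id * (n / num_threads);
}

int block_end(int thread_id, int n, int num_threads)
{
    assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
    assert(n >= 0);
    /* Assumption: the last thread also takes any leftover elements. */
    if (thread_id == num_threads - 1)
        return n;
    return (thread_id + 1) * (n / num_threads);
}
```

For 10,000 vectors and 4 threads, thread 1 owns indices 2500 up to but not including 5000, matching the figure. With blocks, at most the cache line at each block boundary can still be shared between two threads, rather than every line in the array.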

False Sharing Example 2: Effects on Speedup (2S/2C Dempsey; HT off)
[Figure: speedup chart; y-axis: Speedup]

Activity 5: Identifying False Sharing
Intent of this activity
• Observe impact of false sharing on scalability
• Learn use of appropriate VTune performance events
• Compare and contrast false sharing vs. no false sharing

Activity 5: Discussion
Typical Results: sum of events on all cores, listed as 1T / 4T-FS / 4T-NOFS
• CPU_CLK_UNHALTED.CORE: 38.6E+09 / 173.2E+09 / 37.1E+09
• INST_RETIRED.ANY: 30.5E+09, 28.5E+09 (only two values present)
• MACHINE_NUKES.MEM_ORDER: 1.92E+06 / 105.48E+06 / 0.04E+06
• MEM_LOAD_RETIRED.L2_MISS: 0.020E+06 / 75.557E+06 / 0.185E+06
• BUS_DRDY_CLOCKS.THIS_AGENT: 8.4E+06 / 2454.5E+06 / 7.3E+06
No false sharing in single thread execution
MACHINE_NUKES.MEM_ORDER counts events most likely due to false sharing
• Cache misses can be an indication of problems

Effects of On-die Shared Cache
Last level cache (LLC) size, shared vs. not shared
• Dempsey: 2 MB L2 not shared
• Merom/Woodcrest: 4 MB L2 shared
• Clovertown: 8 MB L2 (4 MB per die shared)
A cache-sensitive application will run better with threads on cores that do not share a cache
[Figure: L2 cache and chipset layout for Dempsey, Woodcrest, and Clovertown]

Detecting Effects of On-die Shared Cache
VTune sampling
• LLC cache misses increase significantly when run on the same socket vs. different sockets
Experiments with single socket vs. multi-socket show differences in scaling
May require thread affinity to correct performance problems

Paying Attention to NUMA Issues
NUMA may affect scalability
• Non-Uniform Memory Access
• Adds an extra memory layer in which to locate data
  • Registers
  • Cache
  • Memory
  • "Far" Memory
[Figure: platform diagrams labeled FSB, Dual Independent Bus, and Cache-coherent Interconnect, each showing chipset and MEM blocks]

Paying Attention to NUMA Issues
NUMA-related scalability issues depend on
• Platform design
• Whether a NUMA-aware OS is used
  • NUMA-aware OSes: Windows* Server 2003 and Linux 2.6 kernel
• Whether the application is NUMA aware
Check for NUMA issues if
• Scaling falls off when going from SMP to NUMA
• Low context switches/sec
• Application is memory latency sensitive
How do you detect this?
• Knowledge of platform architecture
• Through experimentation
• Tie threads to different cores to measure performance
  • Measure memory latency ratio between "near" and "far" memory

OS and Application NUMA Support
Definition of node
• Own processors and memory
• Connected to the larger system through a cache-coherent interconnect
Role of NUMA-aware OS
• Schedule threads on processors in the same node as the memory being used
• Satisfy memory-allocation requests from within the node
  • But will allocate memory from other nodes if necessary
Role of NUMA-aware applications
• Use of NUMA APIs
  • Topology of nodes
  • Memory per node
• Use of Affinity Mask APIs
  • SetThreadAffinityMask, SetProcessAffinityMask
  • Keep threads sharing memory on the same node

Agenda
Why focus on scalability?
• Measuring and estimating scalability
• Where would you start?
Tools for scalability analysis
Factors inhibiting scalability
• Serially Dominant Workloads
• Granularity and Parallel Overhead
• Load Imbalance
• Synchronization Issue
• Memory Related Issues
• I/O

Watching I/O
I/O impacts scalability
• CPU likely to be idle
• Check for I/O to disk and network
How do you detect this?
• Windows* Perfmon
• Linux* vmstat, sar, iostat

Case Study 5
Linux* mpstat CPU data, sar I/O data
• Correlation between disk write peaks and CPU utilization troughs
When I/O is reduced using application configuration options
• 1P-4P scaling improves from 1.9x to 2.9x
Striped or RAID disk configurations could have helped
Overlapped I/O implementation in application

Factors Inhibiting Scalability: Summary
Application Domain
• Serially dominated workload
• Choice of synchronization primitives and lock contention
• Granularity
• Parallel overhead
• I/O (disk and network)
• Load imbalance
Platform/CPU
• High front side bus utilization
• Memory related
  • NUMA
  • False sharing
  • Shared cache effects


Backup

MESI protocol
Every cache line is marked with one of the following four states (coded in two additional bits):
• M - Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state.
• E - Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory.
• S - Shared: Indicates that this cache line may be stored in other caches of the machine.
• I - Invalid: Indicates that this cache line is invalid.
A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive state) to satisfy a read.
A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation.
A cache may discard a non-Modified line at any time, changing it to the Invalid state. A Modified line must be written back first.

MESI protocol (contd.)
A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other CPUs in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. to abort the memory bus transaction), then writing the data to main memory and changing the cache line to the Shared state.
A cache that holds a line in the Shared state must also snoop all invalidate broadcasts from other CPUs, and discard the line (by moving it into the Invalid state) on a match.
A cache that holds a line in the Exclusive state must also snoop all read transactions from all other CPUs, and move the line to the Shared state on a match.
The Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system. The Shared state may be imprecise: if another CPU discards a Shared line, and this CPU becomes the sole owner of that cache line, the line will not be promoted to the Exclusive state (because broadcasting all cache line replacements from all CPUs is not practical over a broadcast snoop bus).
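The transitions described on the two MESI slides can be summarized as a small state machine for a single cache line in a single cache. This is a sketch for study purposes, not a model of any particular processor. All names are hypothetical, and two cases the slides do not spell out are marked as assumptions in the comments: a write that starts from Invalid, and an invalidate that arrives while the line is Modified or Exclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    MESI_MODIFIED,
    MESI_EXCLUSIVE,
    MESI_SHARED,
    MESI_INVALID
} mesi_state;

/* Read by the local CPU. Any state except Invalid satisfies the read
 * unchanged; an Invalid line is fetched to Shared or Exclusive,
 * depending on whether another cache also holds it. */
mesi_state mesi_local_read(mesi_state s, bool others_have_copy)
{
    if (s == MESI_INVALID)
        return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
    return s;
}

/* Write by the local CPU. The line always ends up Modified.
 * *invalidate_others reports whether other copies must be invalidated
 * first: true from Shared (per the slide) and, as an assumption, also
 * from Invalid, where the line must be fetched for ownership. */
mesi_state mesi_local_write(mesi_state s, bool *invalidate_others)
{
    assert(invalidate_others != NULL);
    *invalidate_others = (s == MESI_SHARED || s == MESI_INVALID);
    return MESI_MODIFIED;
}

/* Read by another CPU, observed by snooping. A Modified line is
 * written back and becomes Shared; Exclusive becomes Shared. */
mesi_state mesi_snoop_read(mesi_state s, bool *write_back)
{
    assert(write_back != NULL);
    *write_back = (s == MESI_MODIFIED);
    return (s == MESI_INVALID) ? MESI_INVALID : MESI_SHARED;
}

/* Invalidate broadcast from another CPU. A Shared line is discarded.
 * Assumption: Modified or Exclusive lines are also invalidated, with a
 * write-back first if the line was Modified. */
mesi_state mesi_snoop_invalidate(mesi_state s, bool *write_back)
{
    assert(write_back != NULL);
    *write_back = (s == MESI_MODIFIED);
    return MESI_INVALID;
}
```

False sharing shows up in this model as two caches alternately calling `mesi_local_write` on the same line: each write forces the other copy to Invalid, so the other CPU's next access misses and has to fetch the line again.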

Linux*: vmstat
Quick way to watch for
• Disk I/O
• Overall CPU utilization
• Swap
• Context switches
    vmstat -n 1
• print header only once
• output data every 1 sec

Linux*: sar
May not be installed by default
• sysstat package (CD 3 RPMS directory, RHEL 3 U2)
Monitors
• Disk I/O, all CPUs, swap, network traffic, interrupts
    sar -U ALL -bWw -o <binfile> 1 0
Report statistics, 1 sec interval, forever
    -U ALL        Report on all CPUs
    -b            aggregated disk I/O (for more details use iostat)
    -W            swap statistics
    -w            context switches/s
    -o <binfile>  write to binary file
    -f <binfile>  read from binary file

Linux*: sar
Using sar in application launch scripts
    export PATH=$PATH:/sbin
    sar -U ALL -bWw -o app.sar 1 0 &
    <launch your app>
    kill -9 `pidof sar`     (could use: killall -9 sar)
    kill -9 `pidof sadc`    (could use: killall -9 sadc)
sar seems more expensive to run
• Time gaps in reports even if 1 sec output is requested
• Needs more investigation

Linux*: mpstat
    mpstat -P ALL 1
    -P ALL  all processors
Using mpstat in application launch scripts
    export PATH=$PATH:/sbin
    mpstat -P ALL 1 >mpstat.out &
    <launch your app>
    kill -9 `pidof mpstat`    (could use: killall -9 mpstat)
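The launch-script pattern shown for sar and mpstat (start the monitor in the background, run the application, stop the monitor) can also be written as a small wrapper. This is a sketch under stated assumptions: `run_with_monitor` is a hypothetical helper name, the monitor is stopped with SIGTERM first and killed only if it does not exit, and the `mpstat -P ALL 1` command line and the `mpstat.out` file name are taken from the slide.

```python
import subprocess
import sys


def run_with_monitor(monitor_cmd, app_cmd, out_path):
    """Run app_cmd while monitor_cmd logs to out_path; return the app's exit code.

    Hypothetical helper: both commands are argument lists, e.g.
    ["mpstat", "-P", "ALL", "1"].
    """
    with open(out_path, "w") as out:
        # Start the monitor in the background, output redirected to the file.
        monitor = subprocess.Popen(
            monitor_cmd, stdout=out, stderr=subprocess.DEVNULL
        )
        try:
            # Run the application in the foreground and wait for it.
            return subprocess.call(app_cmd)
        finally:
            # Ask the monitor to stop; fall back to a hard kill if it hangs.
            monitor.terminate()
            try:
                monitor.wait(timeout=5)
            except subprocess.TimeoutExpired:
                monitor.kill()
                monitor.wait()


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: launch.py <app> [args...]")
    sys.exit(
        run_with_monitor(["mpstat", "-P", "ALL", "1"], sys.argv[1:], "mpstat.out")
    )
```

Because the wrapper holds the monitor's process handle, it stops only the monitor it started, whereas `pidof` or `killall` would match every process with that name on the machine.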

Linux*: top
Useful interactive mode options
• Press the appropriate keys
    s  changes delay between updates
    u  selects only specified user's processes
    H  show threads & utilization (toggle) (shows CPU on which thread is scheduled)
    i  idle processes or threads (toggle)
Batch mode (-b on the command line)
• Does not report threads

Linux*: using processor affinity
schedutils package
• "taskset" command
Affinity system call APIs
• sched_setaffinity
• In 2.6 kernels
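As an illustration of the affinity call, the sketch below uses Python's `os.sched_setaffinity` and `os.sched_getaffinity`, which wrap the Linux system calls of the same name and are only available on platforms that provide them. The helper name `pin_to_cpus` is hypothetical, and pinning to the lowest-numbered allowed CPU is an arbitrary choice for the demo.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict pid (0 means the caller) to the given CPU numbers.

    Hypothetical helper; returns the affinity set reported afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("sched_setaffinity is not available on this platform")
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    print("allowed before:", sorted(allowed))
    # Pin to one CPU that is already allowed, chosen arbitrarily for the demo.
    print("allowed after: ", sorted(pin_to_cpus({min(allowed)})))
```

Pinning a thread keeps its working set in one core's cache, which can help when measuring scaling, but it also prevents the scheduler from moving work to an idle core, so it is best treated as an experiment rather than a default.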