Threaded Programming Methodology Intel Software College
Objectives
After completion of this module you will
• Be able to rapidly prototype and estimate the effort required to thread time-consuming regions

Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda
• A Generic Development Cycle
• Case Study: Prime Number Generation
• Common Performance Issues

What is Parallelism?
Two or more processes or threads execute at the same time
Parallelism for threading architectures
• Multiple processes
  – Communication through Inter-Process Communication (IPC)
• Single process, multiple threads
  – Communication through shared memory

Amdahl's Law
Describes the upper bound of parallel execution speedup
$T_{parallel} = \{(1-P) + P/n\} \, T_{serial}$, where $P$ is the parallelizable fraction and $n$ = number of processors
$Speedup = T_{serial} / T_{parallel}$
Example from the figure, with P = 0.5:
• n = 2: T_parallel = (1-P) + P/2 = 0.5 + 0.25 = 0.75, so speedup = 1.0/0.75 = 1.33
• n = ∞: T_parallel = (1-P) + P/∞ = 0.5 + 0.0 = 0.5, so speedup = 1.0/0.5 = 2.0
Serial code limits speedup

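The formula above can be checked numerically. The following is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the course material.

```cpp
#include <cassert>

// Amdahl's Law: predicted speedup when a fraction P (parallel_fraction)
// of the serial run time is divided evenly among n processors.
double amdahl_speedup(double parallel_fraction, int n) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(n >= 1);
    // T_parallel / T_serial = (1 - P) + P / n
    double relative_time = (1.0 - parallel_fraction) + parallel_fraction / n;
    return 1.0 / relative_time;
}

// Upper bound as n grows without limit: only the serial part (1 - P) remains.
double amdahl_limit(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

With P = 0.5 this gives about 1.33 for two processors and 2.0 in the limit, matching the figure's numbers.
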
Processes and Threads
[Figure: a process holding a code segment and a data segment, with a separate stack for main() and for each additional thread]
Modern operating systems load programs as processes
• Resource holder
• Execution
A process starts executing at its entry point as a thread
Threads can create other threads within the process
• Each thread gets its own stack
All threads within a process share code & data segments

Threads – Benefits & Risks
Benefits
• Increased performance and better resource utilization
  – Even on single processor systems, for hiding latency and increasing throughput
• IPC through shared memory is more efficient
Risks
• Increases complexity of the application
• Difficult to debug (data races, deadlocks, etc.)

Commonly Encountered Questions with Threading Applications
• Where to thread?
• How long would it take to thread?
• How much re-design/effort is required?
• Is it worth threading a selected region?
• What should the expected speedup be?
• Will the performance meet expectations?
• Will it scale as more threads/data are added?
• Which threading model to use?

Prime Number Generation
[Figure: table of candidates i = 61, 63, 65, ..., 79 with the factors tried for each]

    bool TestForPrime(int val)
    {
        // let's start checking from 3
        int limit, factor = 3;
        limit = (long)(sqrtf((float)val)+0.5f);
        while( (factor <= limit) && (val % factor) )
            factor++;
        return (factor > limit);
    }

    void FindPrimes(int start, int end)
    {
        int range = end - start + 1;
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }

Activity 1
Run Serial version of Prime code
• Locate PrimeSingle directory
• Compile with Intel compiler in Visual Studio
• Run a few times with different ranges

Development Methodology
Analysis
• Find computationally intense code
Design (Introduce Threads)
• Determine how to implement threading solution
Debug for correctness
• Detect any problems resulting from using threads
Tune for performance
• Achieve best parallel performance

Development Cycle
Analysis
 – VTune™ Performance Analyzer
Design (Introduce Threads)
 – Intel® Performance libraries: IPP and MKL
 – OpenMP* (Intel® Compiler)
 – Explicit threading (Win32*, Pthreads*)
Debug for correctness
 – Intel® Thread Checker
 – Intel Debugger
Tune for performance
 – Intel® Thread Profiler
 – VTune™ Performance Analyzer

Analysis - Sampling
Use VTune Sampling to find hotspots in the application
Let's use the project PrimeSingle for analysis
• Usage: PrimeSingle <start> <end>
• Example: ./PrimeSingle 1 1000000

    bool TestForPrime(int val)
    {
        // let's start checking from 3
        int limit, factor = 3;
        limit = (long)(sqrtf((float)val)+0.5f);
        while( (factor <= limit) && (val % factor) )
            factor++;
        return (factor > limit);
    }

    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }

Identifies the time-consuming regions

Analysis - Call Graph
Used to find the proper level in the call tree to thread
[Screenshot callout: this is the level in the call tree where we need to thread]

Analysis
Where to thread?
• FindPrimes()
Is it worth threading a selected region?
• Appears to have minimal dependencies
• Appears to be data-parallel
• Consumes over 95% of the run time
[Screenshot: baseline measurement]

Activity 2
Run code with '1 5000000' range to get baseline measurement
• Make note for future reference
Run VTune analysis on serial code
• What function takes the most time?

Foster's Design Methodology
From Designing and Building Parallel Programs by Ian Foster
Four Steps:
• Partitioning
  – Dividing computation and data
• Communication
  – Sharing data between computations
• Agglomeration
  – Grouping tasks to improve performance
• Mapping
  – Assigning tasks to processors/threads

Designing Threaded Programs
Partition
• Divide problem into tasks
Communicate
• Determine amount and pattern of communication
Agglomerate
• Combine tasks
Map
• Assign agglomerated tasks to created threads
[Figure: The Problem → Initial tasks → Communication → Combined Tasks → Final Program]

Parallel Programming Models
Functional Decomposition
• Task parallelism
• Divide the computation, then associate the data
• Independent tasks of the same problem
Data Decomposition
• Same operation performed on different data
• Divide data into pieces, then associate computation

Decomposition Methods
Functional Decomposition
• Focusing on computations can reveal structure in a problem
[Figure: Atmosphere Model, Hydrology Model, Ocean Model, Land Surface Model. Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC]
Domain Decomposition
• Focus on largest or most frequently accessed data structure
• Data Parallelism
  – Same operation applied to all data

Pipelined Decomposition
Computation done in independent stages
Functional decomposition
• Each thread is assigned a stage to compute
• Automobile assembly line
Data decomposition
• Each thread processes all stages of a single instance
• One worker builds an entire car

LAME Encoder Example
LAME MP3 encoder
• Open source project
• Educational tool used for learning
The goals of the project are
• To improve the psychoacoustic quality
• To improve the speed of MP3 encoding

LAME Pipeline Strategy
Pipeline stages:
• Prelude: Fetch next frame, Frame characterization, Set encode parameters
• Acoustics: Psycho Analysis, FFT long/short, Filter assemblage
• Encoding: Apply filtering, Noise Shaping, Quantize & Count bits
• Other: Add frame header, Check correctness, Write to disk
[Figure: timeline of threads T1 to T4 working on the Prelude, Acoustics, Encoding and Other stages of frames N, N+1, N+2 and N+3, separated by a hierarchical barrier]

Design
What is the expected benefit?
• Speedup(2P) = 100/(96/2+4) = ~1.92X
How do you achieve this with the least effort?
• Rapid prototyping with OpenMP
How long would it take to thread?
How much re-design/effort is required?

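The expected-benefit figure follows from Amdahl's Law. A short worked derivation, assuming the slide's 96 is the percentage of run time that can be threaded (so P = 0.96) and that two processors are used:

```latex
\[
\text{Speedup}(n) = \frac{T_{serial}}{T_{parallel}} = \frac{1}{(1-P) + P/n}
\]
\[
\text{Speedup}(2) = \frac{1}{0.04 + 0.96/2} = \frac{1}{0.52} \approx 1.92
\]
```

This is the same calculation as the slide's 100/(96/2+4), written as fractions of the serial run time instead of percentages.
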
OpenMP
Fork-join parallelism:
• Master thread spawns a team of threads as needed
• Parallelism is added incrementally
  – Sequential program evolves into a parallel program
[Figure: master thread forking and joining teams of threads at parallel regions]

Design
OpenMP

    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }

• parallel: create threads here for this parallel region
• for: divide iterations of the for loop

Activity 3
Run OpenMP version of code
• Locate PrimeOpenMP directory and solution
• Compile code
• Run with '1 5000000' for comparison
• What is the speedup?

Design
What is the expected benefit?
How do you achieve this with the least effort?
• Speedup of 1.40X (less than 1.92X)
How long would it take to thread?
How much re-design/effort is required?
Is this the best speedup possible?

Debugging for Correctness
Is this threaded implementation right?
No! The answers are different each time …

Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks
[Figure: within VTune™ Performance Analyzer, Intel® Thread Checker applies binary instrumentation to Primes.exe, producing Primes.exe (Instrumented) + DLLs (Instrumented); the runtime data collector writes threadchecker.thr (result file)]

Thread Checker

Activity 4
Use Thread Checker to analyze threaded application
• Create Thread Checker activity
• Run application
• Are any errors reported?

Debugging for Correctness
How much re-design/effort is required?
• Thread Checker reported only 2 dependencies, so effort required should be low
How long would it take to thread?

Debugging for Correctness

    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
    #pragma omp critical
            globalPrimes[gPrimesFound++] = i;   // will create a critical section for this reference
        ShowProgress(i, range);
    }

    // inside ShowProgress(): will create a critical section for both these references
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }

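For readers without the course sources, here is a minimal self-contained sketch of the critical-section fix. It is an assumption-laden rewrite, not the course's code: it collects results in a `std::vector` instead of the `globalPrimes` array, leaves out `ShowProgress`, and uses `std::sqrt` in place of `sqrtf`. If it is compiled without OpenMP enabled the pragmas are ignored and it runs serially with the same result.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Trial division by every integer from 3 up to about sqrt(val), as on the slide.
// Like the original, it is only meaningful for odd val (and it reports 1 as prime).
bool TestForPrime(int val) {
    int limit = static_cast<int>(std::sqrt(static_cast<double>(val)) + 0.5);
    int factor = 3;
    while (factor <= limit && (val % factor) != 0) ++factor;
    return factor > limit;
}

// Collects the odd primes in [start, end]; start is assumed to be odd.
std::vector<int> FindPrimes(int start, int end) {
    assert(start % 2 == 1 && start <= end);
    std::vector<int> primes;
    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) {
            // Only one thread at a time may append to the shared vector.
            #pragma omp critical
            primes.push_back(i);
        }
    }
    return primes;  // element order depends on thread timing
}
```

Calling `FindPrimes(3, 99)` should yield the 24 odd primes below 100, in an order that can vary from run to run when threads are active.
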
Activity 5
Modify and run OpenMP version of code
• Add critical region pragmas to code
• Compile code
• Run from within Thread Checker
  – If errors still present, make appropriate fixes to code and run again in Thread Checker
• Run with '1 5000000' for comparison
  – Compile and run outside Thread Checker
  – What is the speedup?

Correctness
Correct answer, but performance has slipped to ~1.33X
Is this the best we can expect from this algorithm?
No! From Amdahl's Law, we expect speedup close to 1.9X

Common Performance Issues
Parallel Overhead
• Due to thread creation, scheduling …
Synchronization
• Excessive use of global data, contention for the same synchronization object
Load Imbalance
• Improper distribution of parallel work
Granularity
• Not enough parallel work

Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications
[Figure: within VTune™ Performance Analyzer, Thread Profiler takes either Primes.c compiled with /Qopenmp_profile (source instrumentation) or Primes.exe with binary instrumentation, producing Primes.exe (Instrumented) + DLLs (Instrumented); the runtime data collector writes Bistro.tp/guide.gvs (result file)]

Thread Profiler for OpenMP

Thread Profiler for OpenMP
Speedup Graph
• Estimates threading speedup and potential speedup
  – Based on Amdahl's Law computation
• Gives upper and lower bound estimates

Thread Profiler for OpenMP
[Screenshot: timeline with regions labeled serial, parallel, serial]

Thread Profiler for OpenMP
[Screenshot: per-thread view for Thread 0, Thread 1, Thread 2, Thread 3]

Thread Profiler (for Explicit Threads)

Thread Profiler (for Explicit Threads)
Why so many transitions?

Performance
This implementation has implicit synchronization calls
This limits scaling performance due to the resulting context switches
Back to the design stage

Activity 6
Use Thread Profiler to analyze threaded application
• Use /Qopenmp_profile to compile and link
• Create Thread Profiler Activity (for explicit threads)
• Run application in Thread Profiler
• Find the source line that is causing the threads to be inactive

Performance
Is that much contention expected?
Original:

    void ShowProgress( int val, int range )
    {
        int percentDone;
    #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 )
            printf("\b\b%3d%%", percentDone);
    }

This algorithm has many more updates than the 10 needed for showing progress
Modified:

    void ShowProgress( int val, int range )
    {
        int percentDone;
        static int lastPercentDone = 0;
    #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
            printf("\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }

This change should fix the contention issue

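The idea of reporting only when a new 10% mark is reached can be isolated from the printing. The sketch below uses a hypothetical `ProgressTracker` type that is not part of the course code; it is single-threaded, so a threaded caller would still need to protect its state.

```cpp
#include <cassert>

// Hypothetical helper: step() returns true only when progress first reaches a
// new multiple of 10 percent, so a caller prints at most 10 times per run.
struct ProgressTracker {
    int total_steps;      // number of loop iterations expected
    int steps_done = 0;
    int last_decile = 0;  // highest 10% mark already reported

    explicit ProgressTracker(int total) : total_steps(total) { assert(total > 0); }

    bool step() {
        ++steps_done;
        int percent = static_cast<int>(100.0 * steps_done / total_steps);
        int decile = percent / 10;
        if (decile > last_decile) {
            last_decile = decile;
            return true;  // the caller would print the percentage here
        }
        return false;
    }
};
```

Over 500000 iterations, `step()` returns true 10 times rather than on every iteration whose percentage happens to be a multiple of 10.
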
Design
Goals
• Eliminate the contention due to implicit synchronization
Speedup is 2.32X! Is that right?

Performance
Our original baseline measurement had the "flawed" progress update algorithm
Is this the best we can expect from this algorithm?
Speedup is actually 1.40X (<<1.9X)!

Activity 7
Modify ShowProgress function (both serial and OpenMP) to print only the needed output

    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b%3d%%", percentDone);
        lastPercentDone++;
    }

• Recompile and run the code
  – Be sure no instrumentation flags are used
• What is speedup from serial version now?

Performance Re-visited
Still have 62% of execution time in locks and synchronization

Performance Re-visited
Let's look at the OpenMP locks…
Original (lock is in a loop):

    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
    #pragma omp critical
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }

Modified (the critical section is replaced by an atomic increment):

            if( TestForPrime(i) )
                // InterlockedIncrement returns the incremented value, so subtract 1
                // to get the same zero-based slot that gPrimesFound++ produced
                globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;

Performance Re-visited
Let's look at the second lock
Original (this lock is also being called within a loop):

    void ShowProgress( int val, int range )
    {
        int percentDone;
        static int lastPercentDone = 0;
    #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
            printf("\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }

Modified:

    void ShowProgress( int val, int range )
    {
        long percentDone, localProgress;
        static int lastPercentDone = 0;
        localProgress = InterlockedIncrement(&gProgress);
        percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
            printf("\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }

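`InterlockedIncrement` is specific to Windows. As a portable illustration of the same idea, the sketch below swaps in OpenMP's `atomic capture` construct (an assumption: it needs OpenMP 3.1 or later, which the 2006 course material predates). The name `FindPrimesAtomic` and the `std::vector` output are illustrative, and `TestForPrime` is repeated so the block stands alone.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Same trial-division test as on the slides; meaningful for odd val only.
bool TestForPrime(int val) {
    int limit = static_cast<int>(std::sqrt(static_cast<double>(val)) + 0.5);
    int factor = 3;
    while (factor <= limit && (val % factor) != 0) ++factor;
    return factor > limit;
}

// Stores the odd primes in [start, end] into `primes` and returns how many
// were found. No critical section: each thread claims a slot atomically.
int FindPrimesAtomic(int start, int end, std::vector<int>& primes) {
    assert(start % 2 == 1 && start <= end);
    primes.assign((end - start) / 2 + 1, 0);  // one slot per odd candidate
    int found = 0;                            // shared counter
    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) {
            int slot;
            // Atomically read the old count and increment it, so every
            // thread gets a distinct zero-based index.
            #pragma omp atomic capture
            slot = found++;
            primes[slot] = i;
        }
    }
    primes.resize(found);
    return found;
}
```

Because the captured value is the count before the increment, slots start at 0, which is the behavior of the original `gPrimesFound++`.
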
Activity 8
Modify OpenMP critical regions to use InterlockedIncrement instead
• Re-compile and run code
• What is speedup from serial version now?

Thread Profiler for OpenMP
Thread     Sample candidate   Factors to test   End of range
Thread 0   116747             342               250000
Thread 1   373553             612               500000
Thread 2   623759             789               750000
Thread 3   873913             934               1000000

Fixing the Load Imbalance
Distribute the work more evenly

    void FindPrimes(int start, int end)
    {
        // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for schedule(static, 8)
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                // InterlockedIncrement returns the incremented value, so subtract 1
                // to get the same zero-based slot that gPrimesFound++ produced
                globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
            ShowProgress(i, range);
        }
    }

Speedup achieved is 1.68X

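Why the small chunk size helps can be estimated without running any threads. The sketch below rests on two assumptions that are not from the slides: testing candidate v costs roughly sqrt(v) trial divisions, and a loop with no schedule clause is split into one contiguous block per thread (the default is implementation-defined, but this is a common choice). The function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Estimated work per thread when iteration k tests candidate 2k+1.
// chunk == 0: one contiguous block of iterations per thread.
// chunk > 0:  chunks of that size dealt out round-robin, as schedule(static, chunk) does.
std::vector<double> work_per_thread(int iterations, int threads, int chunk) {
    assert(iterations > 0 && threads > 0 && chunk >= 0);
    std::vector<double> work(threads, 0.0);
    int block = (iterations + threads - 1) / threads;  // block size when chunk == 0
    for (int k = 0; k < iterations; ++k) {
        int owner = (chunk == 0) ? k / block : (k / chunk) % threads;
        work[owner] += std::sqrt(2.0 * k + 1.0);       // assumed cost of one test
    }
    return work;
}

// Ratio of the busiest thread's work to the least busy thread's work.
double imbalance(const std::vector<double>& work) {
    assert(!work.empty());
    auto mm = std::minmax_element(work.begin(), work.end());
    return *mm.second / *mm.first;
}
```

For 500000 iterations on 4 threads, this model gives the last thread well over twice the work of the first under contiguous blocks, while chunks of 8 bring the ratio to nearly 1.
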
Activity 9
Modify code for better load balance
• Add schedule(static, 8) clause to OpenMP parallel for pragma
• Re-compile and run code
• What is speedup from serial version now?

Final Thread Profiler Run
Speedup achieved is 1.80X

Comparative Analysis
Threading an application requires multiple iterations through the software development cycle

Threading Methodology
What's Been Covered
Four-step development cycle for writing threaded code from serial code, and the Intel® tools that support each step
• Analysis
• Design (Introduce Threads)
• Debug for correctness
• Tune for performance
Threading an application requires multiple iterations of the design, debugging and performance tuning steps
Use tools to improve productivity

Backup Slides

Parallel Overhead
Thread Creation overhead
• Overhead increases rapidly as the number of active threads increases
Solution
• Use of re-usable threads and thread pools
  – Amortizes the cost of thread creation
  – Keeps number of active threads relatively constant

Synchronization
Heap contention
• Allocation from heap causes implicit synchronization
• Allocate on stack or use thread local storage
Atomic updates versus critical sections
• Some global data updates can use atomic operations (Interlocked family)
• Use atomic updates whenever possible
Critical Sections versus mutual exclusion
• Critical Section objects reside in user space
• Use CRITICAL SECTION objects when visibility across process boundaries is not required
• Introduces less overhead
• Has a spin-wait variant that is useful for some applications

Load Imbalance
Unequal work loads lead to idle threads and wasted time
[Figure: per-thread bars over time, showing busy and idle portions]

Granularity
[Figure: a serial portion followed by a parallelizable portion, split two ways]
• Coarse grain: scaling ~2.5X and ~3X
• Fine grain: scaling ~1.10X and ~1.05X
