Introduction Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Moore’s Law
(figure: transistor count still rising; clock speed flattening sharply)
Art of Multiprocessor Programming

Why do we care?
• Time no longer cures software bloat
  – The “free ride” is over
• When you double your program’s path length
  – You can’t just wait 6 months
  – Your software must somehow exploit twice as much concurrency

Nearly Extinct: the Uniprocessor
(figure: a single cpu attached to memory)

The New Boss: The Multicore Processor (CMP)
(figure: caches, bus, and shared memory, labeled “All on the same chip”; example: Sun T2000 Niagara)

Traditional Scaling Process
(figure: user code on a traditional uniprocessor; speedup 1.8x, 3.6x, 7x over time, following Moore’s law)

Ideal Scaling Process
(figure: user code on multicore; speedup 1.8x, 3.6x, 7x)
• Unfortunately, not so simple…

Actual Scaling Process
(figure: user code on multicore; speedup 1.8x, 2x, 2.9x)
• Parallelization and synchronization require great care…

Multicore Programming: Course Overview
• Fundamentals
  – Models, algorithms, impossibility
• Real-World programming
  – Architectures
  – Techniques

Sequential Computation
(figure: one thread operating on objects in memory)

Concurrent Computation
(figure: several threads operating on objects in memory)

Asynchrony
• Sudden unpredictable delays
  – Cache misses (short)
  – Page faults (long)
  – Scheduling quantum used up (really long)

Model Summary
• Multiple threads
  – Sometimes called processes
• Single shared memory
• Objects live in memory
• Unpredictable asynchronous delays

Road Map
• We are going to focus on principles first, then practice
  – Start with idealized models
  – Look at simplistic problems
  – Emphasize correctness over pragmatism
  – “Correctness may be theoretical, but incorrectness has practical impact”

Concurrency Jargon
• Hardware
  – Processors
• Software
  – Threads, processes
• Sometimes OK to confuse them, sometimes not.

Parallel Primality Testing
• Challenge
  – Print primes from 1 to 10^10
• Given
  – Ten-processor multiprocessor
  – One thread per processor
• Goal
  – Get ten-fold speedup (or close)

Load Balancing
(figure: the range 1 to 10^10 cut at 10^9, 2·10^9, …, and assigned to threads P0, P1, …, P9)
• Split the work evenly
• Each thread tests a range of 10^9 numbers

Procedure for Thread i
    void primePrint() {
        long i = ThreadID.get();        // IDs in {0..9}
        long block = 1_000_000_000L;    // 10^9 numbers per thread
        for (long j = i * block + 1; j <= (i + 1) * block; j++) {
            if (isPrime(j)) print(j);
        }
    }
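A scaled-down sketch of this static split may help. It assumes four threads and the range 1 to 1000 rather than ten threads and 10^10. The class name `StaticSplit`, the trial-division `isPrime`, and the per-thread tally are illustrative assumptions, not part of the slides.

```java
final class StaticSplit {
    // Simple trial division; an assumed stand-in for the slides' isPrime.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Each worker tests one fixed block of `block` consecutive numbers.
    // Returns how many primes each worker found in its own block.
    static int[] countPerThread(int threads, long block) {
        int[] found = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int i = t;
            workers[t] = new Thread(() -> {
                int count = 0;
                for (long j = i * block + 1; j <= (i + 1) * block; j++) {
                    if (isPrime(j)) count++;
                }
                found[i] = count; // each worker writes only its own slot
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join makes the worker's write visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        int[] found = countPerThread(4, 250);
        for (int i = 0; i < found.length; i++) {
            System.out.println("thread " + i + " found " + found[i] + " primes");
        }
    }
}
```

Printing the per-thread tallies shows how many primes fell in each block; the blocks are equal in size but not in the work they contain.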

Issues
• Higher ranges have fewer primes
• Yet larger numbers harder to test
• Thread workloads
  – Uneven
  – Hard to predict
• Static split rejected: need dynamic load balancing

Shared Counter
(figure: a counter handing out 17, 18, 19, …)
• Each thread takes a number

Procedure for Thread i
    Counter counter = new Counter(1);    // shared counter object
    void primePrint() {
        long j = 0;
        while (j < 10_000_000_000L) {    // 10^10
            j = counter.getAndIncrement();
            if (isPrime(j)) print(j);
        }
    }

Where Things Reside
(figure: the primePrint code from the static split, with its code and local variables marked; processors with caches on a bus; the shared counter in shared memory)

Procedure for Thread i (annotated)
    while (j < 10_000_000_000L) {        // stop when every value is taken
        j = counter.getAndIncrement();   // increment and return each new value
        if (isPrime(j)) print(j);
    }

Counter Implementation
    public class Counter {
        private long value;
        public long getAndIncrement() {
            return value++;
        }
    }
• OK for single thread, not for concurrent threads

Why?
    int a = 0;
    void increment() { a += 1; }

    clang -S -c increment.c

    movl _a(%rip), %eax
    addl $1, %eax
    movl %eax, _a(%rip)

What It Means
    public class Counter {
        private long value;
        public long getAndIncrement() {
            return value++;
        }
    }
• return value++ stands for three separate steps:
    temp = value;
    value = temp + 1;
    return temp;

Not so good…
(figure: timeline of two threads; the value goes 1, 2, 3, 2)
    one thread:     read 1, write 2, read 2, write 3
    another thread: read 1, (delayed), write 2
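The bad schedule can be replayed by hand. The sketch below uses no real threads: the two "threads" A and B are just local variables, so the outcome is deterministic. The class name and the exact order of steps are assumptions chosen to match the read/write labels on the slide.

```java
final class LostUpdateReplay {
    // Replays one unlucky interleaving of three getAndIncrement() calls,
    // each split into its read step and its write step.
    // Returns { A's first result, A's second result, B's result, final value }.
    static long[] replay() {
        long value = 1;

        long tempB = value;      // B: read 1, then B is delayed

        long tempA = value;      // A: read 1
        value = tempA + 1;       // A: write 2
        long firstA = tempA;

        tempA = value;           // A: read 2
        value = tempA + 1;       // A: write 3
        long secondA = tempA;

        value = tempB + 1;       // B: write 2, overwriting A's 3

        return new long[] { firstA, secondA, tempB, value };
    }

    public static void main(String[] args) {
        long[] r = replay();
        System.out.println("A got " + r[0] + " and " + r[1]
                + ", B got " + r[2] + ", final value " + r[3]);
    }
}
```

Two calls return the same number, and after three increments starting from 1 the counter holds 2 rather than 4.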

Is this problem inherent?
• If we could only glue reads and writes together…

Challenge
    public class Counter {
        private long value;
        public long getAndIncrement() {
            long temp = value;    // make these steps
            value = temp + 1;     // atomic (indivisible)
            return temp;
        }
    }

Hardware Solution
    public class Counter {
        private long value;
        public long getAndIncrement() {
            long temp = value;    // a ReadModifyWrite() instruction
            value = temp + 1;     // performs these steps as one
            return temp;
        }
    }
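In Java, one way to get an atomic read-modify-write is `java.util.concurrent.atomic.AtomicLong`, whose `getAndIncrement()` is atomic. Whether it maps to a single hardware instruction depends on the platform. The wrapper class `AtomicCounter` and the `hammer` helper below are hypothetical names used only for this sketch.

```java
import java.util.concurrent.atomic.AtomicLong;

final class AtomicCounter {
    private final AtomicLong value;

    AtomicCounter(long initial) {
        value = new AtomicLong(initial);
    }

    // Atomically returns the current value and adds one.
    long getAndIncrement() {
        return value.getAndIncrement();
    }

    long get() {
        return value.get();
    }

    // Starts `threads` workers that each increment a fresh counter
    // `perThread` times, then returns the final counter value.
    static long hammer(int threads, int perThread) {
        AtomicCounter counter = new AtomicCounter(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    counter.getAndIncrement();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(hammer(4, 10_000));
    }
}
```

Because no increment can be lost, the final value equals the number of calls made.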

An Aside: Java™
    public class Counter {
        private long value;
        public long getAndIncrement() {
            long temp;
            synchronized (this) {    // synchronized block: mutual exclusion
                temp = value;
                value = temp + 1;
            }
            return temp;
        }
    }
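Putting the pieces together, here is a small sketch of dynamic load balancing with a synchronized counter. It assumes four threads and a limit of 1000 instead of 10^10. The trial-division `isPrime`, the per-thread tally, and the class names are assumptions for illustration. Unlike the loop on the slide, each worker checks the limit right after taking a number, so no value past the limit is tested.

```java
final class SharedCounterPrimes {
    static final class Counter {
        private long value;

        Counter(long initial) {
            value = initial;
        }

        long getAndIncrement() {
            long temp;
            synchronized (this) { // only one thread at a time runs this block
                temp = value;
                value = temp + 1;
            }
            return temp;
        }
    }

    // Simple trial division; an assumed stand-in for the slides' isPrime.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Workers repeatedly take the next untested number from the shared
    // counter; returns the total number of primes found in 1..limit.
    static long countPrimes(int threads, long limit) {
        Counter counter = new Counter(1);
        long[] found = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int i = t;
            workers[t] = new Thread(() -> {
                long j = counter.getAndIncrement();
                while (j <= limit) {
                    if (isPrime(j)) found[i]++; // own slot only
                    j = counter.getAndIncrement();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        long total = 0;
        for (long c : found) total += c;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(4, 1000) + " primes up to 1000");
    }
}
```

How many numbers each worker ends up testing varies from run to run, but every number is tested exactly once.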

Why do we care?
• We want as much of the code as possible to execute concurrently (in parallel)
• A larger sequential part implies reduced performance
• Amdahl’s law: this relation is not linear…

Amdahl’s Law
$$\text{Speedup} = \frac{\text{1-thread execution time}}{\text{n-thread execution time}} = \frac{1}{(1 - p) + \frac{p}{n}}$$
• $p$: parallel fraction
• $1 - p$: sequential fraction
• $n$: number of threads
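As a quick check, the law can be written as a one-line function, assuming its usual form Speedup = 1 / ((1 - p) + p/n) with parallel fraction p and n threads. The class and method names are illustrative.

```java
final class Amdahl {
    // Speedup predicted by Amdahl's law for parallel fraction p (0..1)
    // on n threads.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double[] fractions = { 0.6, 0.8, 0.9, 0.99 };
        for (double p : fractions) {
            System.out.printf("p = %.2f, n = 10: speedup = %.2f%n", p, speedup(p, 10));
        }
        System.out.printf("p = 0.90, n = 100: speedup = %.2f%n", speedup(0.9, 100));
    }
}
```

Varying p for a fixed n shows how quickly a modest sequential fraction eats into the available speedup.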

Example
• Ten processors
• 60% concurrent, 40% sequential
• How close to 10-fold speedup?
• Speedup = $\frac{1}{1 - 0.6 + \frac{0.6}{10}}$ = 2.17

Example
• Ten processors
• 80% concurrent, 20% sequential
• How close to 10-fold speedup?
• Speedup = $\frac{1}{1 - 0.8 + \frac{0.8}{10}}$ = 3.57

Example
• Ten processors
• 90% concurrent, 10% sequential
• How close to 10-fold speedup?
• Speedup = $\frac{1}{1 - 0.9 + \frac{0.9}{10}}$ = 5.26

Example
• Hundred processors
• 90% concurrent, 10% sequential
• How close to 100-fold speedup?
• Speedup = $\frac{1}{1 - 0.9 + \frac{0.9}{100}}$ = 9.17

Example
• Ten processors
• 99% concurrent, 1% sequential
• How close to 10-fold speedup?
• Speedup = $\frac{1}{1 - 0.99 + \frac{0.99}{10}}$ = 9.17

Back to Real-World Multicore Scaling
(figure: user code on multicore; speedup 1.8x, 2x, 2.9x)
• Not reducing sequential % of code
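One way to see why the sequential percentage matters so much, assuming the usual statement of Amdahl's law with parallel fraction $p$ and $n$ threads (symbols chosen here for illustration): as $n$ grows, the $p/n$ term vanishes and the sequential fraction alone caps the speedup.

```latex
\lim_{n \to \infty} \frac{1}{(1 - p) + \frac{p}{n}} = \frac{1}{1 - p},
\qquad \text{e.g. } p = 0.9 \;\Rightarrow\; \text{Speedup} < 10 \text{ for every } n
```

Adding cores therefore cannot help beyond this bound unless the sequential fraction itself shrinks.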

Shared Data Structures
(figure: coarse-grained vs. fine-grained structures; 75% unshared, 25% shared)
• Why only 2.9 speedup
• Why fine-grained parallelism matters

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
• You are free:
  – to Share: to copy, distribute and transmit the work
  – to Remix: to adapt the work
• Under the following conditions:
  – Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
  – Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to
  – http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.