ECE 1747H Parallel Programming Lecture 1-2

ECE 1747H: Parallel Programming Lecture 1-2: Overview

ECE 1747H • Meeting time: Friday 3-5 PM • Instructor: Cristiana Amza, Associate Prof, http://www.eecg.toronto.edu/~amza, amza@eecg.toronto.edu, office BA 4142 • TA: Arnamoy Bhattacharyya, arnamoyb@ece.utoronto.ca

Material • Course notes • Web material (e.g., published papers) • No required textbook, some recommended

Prerequisites • Programming in C or C++ • Data structures • Basics of machine architecture • Basics of network programming • Please send e-mail to ecehelp@ece.toronto.edu to get an eecg account!! (name, student ID, class, instructor)

Other than that • No written homework, no exams • 10% for each small programming assignment (expect 1) • 10% class participation • Rest comes from the major course project

Programming Project • Parallelizing a sequential program, or improving the performance or the functionality of a parallel program • Project proposal and final report • In-class project proposal and final report presentation • “Sample” project presentation can be posted

Parallelism (1 of 2) • Ability to execute different parts of a single program concurrently on different machines • Goal: shorter running time • Grain of parallelism: how big are the parts? • Can be instruction, statement, procedure, … • Will mainly focus on relatively coarse grain

Parallelism (2 of 2) • Coarse-grain parallelism mainly applicable to long-running, scientific programs • Examples: weather prediction, prime number factorization, simulations, …

Lecture material (1 of 4) • Parallelism – What is parallelism? – What can be parallelized? – Inhibitors of parallelism: dependences

Lecture material (2 of 4) • Standard models of parallelism – shared memory (Pthreads) – message passing (MPI) – shared memory + data parallelism (OpenMP) • Classes of applications – scientific – servers

Lecture material (3 of 4) • Transaction processing – classic programming model for databases – now being proposed for scientific programs

Lecture material (4 of 4) • Perf. of parallel & distributed programs – architecture-independent optimization – architecture-dependent optimization

Course Organization • First 2-3 weeks of semester: – lectures on parallelism, patterns, models – small programming assignment done in teams of 2 or 3 • Rest of the semester: – major programming project, done individually or in a small group – research paper discussions

Parallel vs. Distributed Programming Parallel programming has matured: • Few standard programming models • Few common machine architectures • Portability between models and architectures

Bottom Line • Programmer can now focus on the program and use a suitable programming model • Reasonable hope of portability • Problem: much performance optimization is still platform-dependent – Performance portability is a problem

ECE 1747H: Parallel Programming Lecture 1-2: Parallelism, Dependences

Parallelism • Ability to execute different parts of a program concurrently on different machines • Goal: shorten execution time

Measures of Performance • To computer scientists: speedup, execution time. • To applications people: size of problem, accuracy of solution, etc.

Speedup of Algorithm • Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set). (Figure: speedup plotted against the number of processors p.)

Speedup on Problem • Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors. • A more honest measure of performance. • Avoids picking an easily parallelizable algorithm with poor sequential execution time.

What Speedups Can You Get? • Linear speedup – Confusing term: implicitly means a 1-to-1 speedup per processor. – (Almost always) as good as you can do. • Sub-linear speedup: the norm, due to overheads of startup, synchronization, communication, etc.

Speedup (Figure: speedup vs. number of processors p, showing the ideal linear speedup line and a typical actual, sub-linear curve below it.)

Scalability • No really precise definition. • Roughly speaking, a program is said to scale to a certain number of processors p if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).

Super-linear Speedup? • Due to cache/memory effects: – Subparts fit into cache/memory of each node. – Whole problem does not fit in cache/memory of a single node. • Nondeterminism in search problems. – One thread finds near-optimal solution very quickly => leads to drastic pruning of search space.

Cardinal Performance Rule • Don’t leave (too) much of your code sequential!

Amdahl’s Law • If 1/s of the program is sequential, then you can never get a speedup better than s. – (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1 – Best parallel execution time on p processors = 1/s + (1 - 1/s)/p – When p goes to infinity, parallel execution time = 1/s – Speedup = s.
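A small worked instance (numbers chosen for illustration, not from the slides): with a sequential fraction 1/s = 0.1 (so s = 10), the best possible speedup on p processors is 1 / (0.1 + 0.9/p). A minimal C sketch that tabulates this bound:

    #include <stdio.h>

    /* Amdahl's law bound: speedup on p processors when a fraction `seq`
       of the normalized execution time (seq = 1/s) is inherently sequential. */
    double amdahl_speedup(double seq, int p) {
        return 1.0 / (seq + (1.0 - seq) / p);
    }

    int main(void) {
        double seq = 0.1;   /* hypothetical 10% sequential part, i.e. s = 10 */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d  speedup = %5.2f\n", p, amdahl_speedup(seq, p));
        /* as p grows, the speedup approaches s = 1/seq = 10 but never exceeds it */
        return 0;
    }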

Why keep something sequential? • Some parts of the program are not parallelizable (because of dependences) • Some parts may be parallelizable, but the overhead dwarfs the increased speedup.

When can two statements execute in parallel? • On one processor: statement 1; statement 2; • On two processors: processor 1: statement 1; processor 2: statement 2;

Fundamental Assumption • Processors execute independently: no control over order of execution between processors

When can 2 statements execute in parallel? • Possibility 1 – Processor 1 executes statement 1; then Processor 2 executes statement 2. • Possibility 2 – Processor 2 executes statement 2; then Processor 1 executes statement 1.

When can 2 statements execute in parallel? • Their order of execution must not matter! • In other words, statement 1; statement 2; must be equivalent to statement 2; statement 1;
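A minimal Pthreads sketch of the two-processor case (illustrative only; the thread functions and variables are made up, not from the slides): each thread runs one statement, and since the runtime imposes no order between the threads, the program is correct only because the two statements commute.

    #include <pthread.h>
    #include <stdio.h>

    int a, b;

    /* statement 1 and statement 2, each executed by its own thread */
    void *stmt1(void *arg) { (void)arg; a = 1; return NULL; }
    void *stmt2(void *arg) { (void)arg; b = 2; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, stmt1, NULL);
        pthread_create(&t2, NULL, stmt2, NULL);  /* no ordering between t1 and t2 */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a = %d, b = %d\n", a, b);        /* same result for either order */
        return 0;
    }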

Example 1 a = 1; b = a; • Statements cannot be executed in parallel • Program modifications may make it possible.
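One possible modification (a sketch, not the only option): since a is known to hold 1 when the second statement runs, the read of a can be replaced by the constant, which removes the true dependence.

    a = 1;
    b = 1;   /* was b = a; rewritten with the known value of a, so the two
                statements no longer share data and can run in parallel */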

Example 2 a = f(x); b = a; • May not be wise to change the program (e.g., rewriting it as a = f(x); b = f(x); computes f(x) twice, so sequential execution would take longer).

Example 3 a = 1; a = 2; • Statements cannot be executed in parallel.

True dependence Given statements S1 and S2, S2 has a true dependence on S1 iff S2 reads a value written by S1.

Anti-dependence Given statements S1 and S2, S2 has an anti-dependence on S1 iff S2 writes a value read by S1.

Output Dependence Given statements S1 and S2, S2 has an output dependence on S1 iff S2 writes a variable written by S1.

When can 2 statements execute in parallel? S1 and S2 can execute in parallel iff there are no dependences between S1 and S2 – true dependences – anti-dependences – output dependences Some dependences can be removed.
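Anti- and output dependences arise from reusing the same storage location rather than from a flow of values, so they can often be removed by renaming. A small illustrative sketch (variable names are invented):

    /* original: S2 writes x, which S1 reads -> anti-dependence, no parallelism */
    y = x + 1;    /* S1: reads x  */
    x = 42;       /* S2: writes x */

    /* renamed: S2 writes a fresh variable x2, and later uses of x read x2 instead;
       S1 and S2 now touch different locations and can execute in parallel */
    y  = x + 1;   /* S1 */
    x2 = 42;      /* S2 */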

Example 4 • Most parallelism occurs in loops. for(i=0; i<100; i++) a[i] = i; • No dependences. • Iterations can be executed in parallel.
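Because the iterations are independent, the loop can be handed directly to one of the course's shared-memory models, e.g. OpenMP (a sketch, assuming a compiler with OpenMP support, e.g. built with -fopenmp):

    #include <omp.h>

    int a[100];

    void init(void) {
        /* each iteration writes a distinct a[i], so iterations may run in any order */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            a[i] = i;
    }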

Example 5 for(i=0; i<100; i++) { a[i] = i; b[i] = 2*i; } Iterations and statements can be executed in parallel.

Example 6 for(i=0; i<100; i++) a[i] = i; for(i=0; i<100; i++) b[i] = 2*i; Iterations and loops can be executed in parallel.

Example 7 for(i=0; i<100; i++) a[i] = a[i] + 100; • There is a dependence … of each iteration on itself (each iteration reads and writes its own a[i]). • No dependence between iterations, so the loop is still parallelizable.

Example 8 for( i=1; i<100; i++ ) a[i] = f(a[i-1]); • Dependence between a[i] and a[i-1]. • Loop iterations are not parallelizable.

Loop-carried dependence • A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop. • Otherwise, we call it a loop-independent dependence. • Loop-carried dependences prevent loop iteration parallelization.

Example 9 for( i=0; i<100; i++ ) for( j=1; j<100; j++ ) a[i][j] = f(a[i][j-1]); • Loop-independent dependence on i. • Loop-carried dependence on j. • Outer loop can be parallelized, inner loop cannot.
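Since the dependence is carried only by the j loop, the i iterations are independent and the outer loop can be parallelized, for instance with OpenMP (a sketch; a and f come from the slide's example, the pragma is the assumed parallelization):

    /* each thread handles whole rows a[i][*]; within a row the j loop runs
       sequentially, preserving the loop-carried dependence on j */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        for (int j = 1; j < 100; j++)
            a[i][j] = f(a[i][j-1]);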

Example 10 for( j=1; j<100; j++ ) for( i=0; i<100; i++ ) a[i][j] = f(a[i][j-1]); • Inner loop can be parallelized, outer loop cannot. • Less desirable situation. • Loop interchange is sometimes possible.
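A sketch of the interchange for this loop nest: the dependence is carried only by the j loop with distance 1, so swapping the loop headers is legal, and it moves the dependence-free i loop to the outside where it can be parallelized with coarse-grained work per thread.

    /* after interchange: the i loop (no carried dependence) is outermost */
    for (int i = 0; i < 100; i++)        /* parallelizable */
        for (int j = 1; j < 100; j++)    /* sequential: carries the dependence */
            a[i][j] = f(a[i][j-1]);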

Level of loop-carried dependence • Is the nesting depth of the loop that carries the dependence. • Indicates which loops can be parallelized.

Be careful … Example 11 printf(“a”); printf(“b”); Statements have a hidden output dependence due to the output stream.

Be careful … Example 12 a = f(x); b = g(x); Statements could have a hidden dependence if f and g update the same variable. Also depends on what f and g can do to x.

Be careful … Example 13 for(i=0; i<100; i++) a[i+10] = f(a[i]); • Dependence between a[10], a[20], … • Dependence between a[11], a[21], … • … • Some parallel execution is possible.
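One way to exploit the available parallelism (a sketch, not from the slides): the iterations split into 10 independent chains, one per index residue modulo 10, so the chains can run concurrently while each chain stays sequential.

    /* chain r touches only a[r], a[r+10], a[r+20], ...; different chains never
       share an element, so the 10 chains can run on different threads */
    #pragma omp parallel for
    for (int r = 0; r < 10; r++)
        for (int i = r; i < 100; i += 10)   /* sequential within a chain */
            a[i + 10] = f(a[i]);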

Be careful … Example 14 for( i=1; i<100; i++ ) { a[i] = …; … = a[i-1]; } • Dependence between a[i] and a[i-1] • Complete parallel execution impossible • Pipelined parallel execution possible

Be careful … Example 15 for( i=0; i<100; i++ ) a[i] = f(a[indexa[i]]); • Cannot tell for sure. • Parallelization depends on user knowledge of values in indexa[]. • User can tell, compiler cannot.

Optimizations: Example 16 for (i = 0; i < 100000; i++) a[i + 1000] = a[i] + 1; Cannot be parallelized as is. May be parallelized by applying certain code transformations.
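One such transformation is blocking (strip-mining) by the dependence distance of 1000 (a sketch, assuming OpenMP): within a block of 1000 consecutive iterations no element is both read and written, so a block's iterations can run in parallel, while the blocks themselves execute in order to preserve the dependence.

    /* blocks run one after another (a[i] is written 1000 iterations before it is
       read again); inside a block the iterations are independent */
    for (int start = 0; start < 100000; start += 1000) {
        #pragma omp parallel for
        for (int i = start; i < start + 1000; i++)
            a[i + 1000] = a[i] + 1;
    }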

An aside • Parallelizing compilers analyze program dependences to decide on parallelization. • In parallelization by hand, the user does the same analysis. • Compiler: more convenient and more reliable. • User: more powerful, can analyze more patterns.

To remember • Statement order must not matter. • Statements must not have dependences. • Some dependences can be removed. • Some dependences may not be obvious.