ECE 1747 Parallel Programming Basics of Parallel Architectures

ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines

Two Parallel Architectures • Shared memory machines. • Distributed memory machines.

Shared Memory: Logical View Shared memory space proc 1 proc 2 proc 3 proc. N

Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CCNUMA).

SMPs • • 2 - or 4 -processors PCs are now commodity. Good price/performance ratio. Memory sometimes bottleneck (see later). Typical price (8 -node): ~ $20 -40 k.

Physical Implementation Shared memory bus cache 1 cache 2 cache 3 cache. N proc 1 proc 2 proc 3 proc. N

Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CCNUMA).

CC-NUMA: Physical Implementation mem 1 mem 2 mem 3 mem. N interconnect cache 1 cache 2 cache 3 cache. N proc 1 proc 2 proc 3 proc. N

Caches in Multiprocessors • Suffer from the coherence problem: – same line appears in two or more caches – one processor writes word in line – other processors now can read stale data • Leads to need for a coherence protocol – avoids coherence problems • Many exist, will just look at simple one.

What is coherence? • What does it mean to be shared? • Intuitively, read last value written. • Notion is not well-defined in a system without a global clock.

The Notion of “last written” in a Multi-processor System r(x) P 0 P 1 w(x) P 2 w(x) P 3 r(x)

The Notion of “last written” in a Single-machine System w(x) r(x)

Coherence: a Clean Definition • Is achieved by referring back to the single machine case. • Called sequential consistency.

Sequential Consistency (SC) • Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.

Returning to our Example r(x) P 0 P 1 w(x) P 2 w(x) P 3 r(x)

Another Way of Defining SC • All memory references of a single process execute in program order. • All writes are globally ordered.

SC: Example 1 Initial values of x, y are 0. w(x, 1) w(y, 1) r(x) What are possible final values? r(y)

SC: Example 2 w(x, 1) w(y, 1) r(y) r(x)

SC: Example 3 w(x, 1) w(y, 1) r(y) r(x)

SC: Example 4 r(x) w(x, 1) w(x, 2) r(x)

Implementation • Many ways of implementing SC. • In fact, sometimes stronger conditions. • Will look at a simple one: MSI protocol.

Physical Implementation Shared memory bus cache 1 cache 2 cache 3 cache. N proc 1 proc 2 proc 3 proc. N

Fundamental Assumption • The bus is a reliable, ordered broadcast bus. – Every message sent by a processor is received by all other processors in the same order. • Also called a snooping bus – Processors (or caches) snoop on the bus.

States of a Cache Line • Invalid • Shared – read-only, one of many cached copies • Modified – read-write, sole valid copy

Processor Transactions • processor read(x) • processor write(x)

Bus Transactions • bus read(x) – asks for copy with no intent to modify • bus read-exclusive(x) – asks for copy with intent to modify

State Diagram: Step 0 I S M

State Diagram: Step 1 Pr. Rd/Bu. Rd I S M

State Diagram: Step 2 Pr. Rd/Bu. Rd I S M

State Diagram: Step 3 Pr. Wr/Bu. Rd. X Pr. Rd/Bu. Rd I S M

State Diagram: Step 4 Pr. Wr/Bu. Rd. X Pr. Rd/Bu. Rd Pr. Wr/Bu. Rd. X I S M

State Diagram: Step 5 Pr. Wr/Bu. Rd. X Pr. Rd/Pr. Wr/Pr. Rd/Bu. Rd Pr. Wr/Bu. Rd. X I S M

State Diagram: Step 6 Pr. Wr/Bu. Rd. X Pr. Rd/Pr. Wr/Pr. Rd/Bu. Rd Pr. Wr/Bu. Rd. X I S M Bu. Rd/Flush

State Diagram: Step 7 Pr. Wr/Bu. Rd. X Pr. Rd/Pr. Wr/Pr. Rd/Bu. Rd Pr. Wr/Bu. Rd. X I S M Bu. Rd/Flush Bu. Rd/-

State Diagram: Step 8 Pr. Wr/Bu. Rd. X Pr. Rd/Pr. Wr/Pr. Rd/Bu. Rd Pr. Wr/Bu. Rd. X I S Bu. Rd. X/- M Bu. Rd/Flush Bu. Rd/-

State Diagram: Step 9 Pr. Wr/Bu. Rd. X Pr. Rd/Pr. Wr/Pr. Rd/Bu. Rd Pr. Wr/Bu. Rd. X I S Bu. Rd. X/- M Bu. Rd/Flush Bu. Rd/Bu. Rd. X/Flush

In Reality • Most machines use a slightly more complicated protocol (4 states instead of 3). • See architecture books (MESI protocol).

Problem: False Sharing • Occurs when two or more processors access different data in same cache line, and at least one of them writes. • Leads to ping-pong effect.

False Sharing: Example (1 of 3) for( i=0; i<n; i++ ) a[i] = b[i]; • Let’s assume we parallelize code: –p=2 – element of a takes 4 words – cache line has 32 words

False Sharing: Example (2 of 3) cache line a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] Written by processor 0 Written by processor 1

False Sharing: Example (3 of 3) a[0] inv a[1] a[2] a[4] P 0 . . . data a[3] a[5] P 1

Summary • Sequential consistency. • Bus-based coherence protocols. • False sharing.