Parallel Processing Problems • Cache Coherence • False Sharing • Synchronization
Cache Coherence
P1, P2 are write-back caches ($$$) in front of a shared DRAM.
Current a value in:      P1$   P2$   DRAM
(initial)                *     *     7
1. P2: Rd a
2. P2: Wr a, 5
3. P1: Rd a
4. P2: Wr a, 3
5. P1: Rd a
Whatever are we to do? • Write-Invalidate • Write-Update
Write Invalidate
P1, P2 are write-back caches. DRAM values from step 3 on assume DRAM is updated whenever a cache puts the data on the bus.
Current a value in:      P1$   P2$   DRAM
(initial)                *     *     7
1. P2: Rd a              *     7     7
2. P2: Wr a, 5           *     5     7
3. P1: Rd a              5     5     5
4. P2: Wr a, 3           *     3     5
5. P1: Rd a              3     3     3
Write Update
P1, P2 are write-back caches. DRAM values from step 3 on assume DRAM is updated whenever a cache puts the data on the bus.
Current a value in:      P1$   P2$   DRAM
(initial)                *     *     7
1. P2: Rd a              *     7     7
2. P2: Wr a, 5           *     5     7
3. P1: Rd a              5     5     5
4. P2: Wr a, 3           3     3     3
5. P1: Rd a              3     3     3
Performance Considerations • Invalidate – Writing makes data exclusive – Receiving changed data is slower • Update – Once shared, always shared – Once shared, writes always go on the bus – Changed data arrives very quickly
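The five-step trace on a can be replayed in a few lines of code to compare the two protocols. The following is a minimal sketch, not a full protocol model: it tracks a single variable, uses None for "*", and assumes that DRAM picks up the value whenever a cache puts it on the bus. The names simulate and TRACE are hypothetical.

```python
def simulate(trace, protocol, dram=7):
    """Replay (processor, op, value) steps on one variable `a`.

    Returns the (P1$, P2$, DRAM) contents after each step and the number
    of read misses. None stands for '*' (no valid copy in that cache).
    """
    cache = {"P1": None, "P2": None}
    dirty = None      # cache holding a copy newer than DRAM, if any
    read_misses = 0
    history = []
    for proc, op, value in trace:
        other = "P2" if proc == "P1" else "P1"
        if op == "Rd":
            if cache[proc] is None:
                read_misses += 1
                if dirty == other:
                    dram = cache[other]   # assumption: supplier also updates DRAM
                    dirty = None
                cache[proc] = dram
        else:  # "Wr" into a write-back cache
            cache[proc] = value
            dirty = proc
            if cache[other] is not None:      # the line is shared
                if protocol == "invalidate":
                    cache[other] = None       # the other copy is dropped
                else:                         # "update"
                    cache[other] = value      # new value is broadcast
                    dram = value              # assumption: DRAM sees the broadcast
                    dirty = None
        history.append((cache["P1"], cache["P2"], dram))
    return history, read_misses


# The trace from the slides: P2 reads, P2 writes 5, P1 reads, P2 writes 3, P1 reads.
TRACE = [("P2", "Rd", None), ("P2", "Wr", 5), ("P1", "Rd", None),
         ("P2", "Wr", 3), ("P1", "Rd", None)]

if __name__ == "__main__":
    for protocol in ("invalidate", "update"):
        history, misses = simulate(TRACE, protocol)
        print(protocol, history, "read misses:", misses)
```

Under these assumptions the final read by P1 misses with write-invalidate and hits with write-update, which is the "receiving changed data" trade-off listed above.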
Cache Coherence: False Sharing
P1, P2 cache line size: 4 words
Current contents in:     P1$   P2$
(initial)                *     *
1. P2: Rd A[0]
2. P1: Rd A[1]
3. P2: Wr A[0], 5
4. P1: Wr A[1], 3
Look closely at the example • P1 and P2 do not access the same element • A[0] and A[1] are in the same cache block, so a cache that holds one also holds the other.
False Sharing • Different/same processors access different/same items in different/same cache block • Leads to ______ misses
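The A[0]/A[1] trace can be replayed at cache-line granularity to see the effect. This is a rough sketch assuming a write-invalidate protocol, 4-word lines, and an array that starts on a line boundary. The function name run_trace and the miss-counting scheme are illustrative only.

```python
LINE_WORDS = 4  # cache line size in words, as in the example


def run_trace(trace):
    """Count misses caused by another processor's write to the same line."""
    valid = {"P1": set(), "P2": set()}         # lines each cache holds
    invalidated = {"P1": set(), "P2": set()}   # lines lost to a remote write
    coherence_misses = 0
    for proc, op, index in trace:
        other = "P2" if proc == "P1" else "P1"
        line = index // LINE_WORDS
        if line not in valid[proc]:
            if line in invalidated[proc]:
                coherence_misses += 1          # we had it, a remote write took it
                invalidated[proc].discard(line)
            valid[proc].add(line)
        if op == "Wr" and line in valid[other]:
            valid[other].discard(line)         # write-invalidate
            invalidated[other].add(line)
    return coherence_misses


# Same block: P2 uses A[0], P1 uses A[1].
same_block = [("P2", "Rd", 0), ("P1", "Rd", 1), ("P2", "Wr", 0), ("P1", "Wr", 1)]
# Different blocks: P2 uses A[0], P1 uses A[4].
different_blocks = [("P2", "Rd", 0), ("P1", "Rd", 4), ("P2", "Wr", 0), ("P1", "Wr", 4)]

if __name__ == "__main__":
    print(run_trace(same_block), run_trace(different_blocks))
```

In this model the write to A[1] misses only because P2 wrote A[0] in the same line. Moving P1's element to another line removes the miss even though the processors touch the same number of elements.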
Cache Performance
// Pn = my processor number (rank)
// NumProcs = total active processors
// N = total number of elements
// NElem = N / NumProcs
for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);
Vs
for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);
Which is worse? • Both access the same number of elements • No processors access the same elements as each other
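One way to compare the two loops is to count how many cache lines end up being written by more than one processor. The sketch below assumes 4-word lines, an array aligned to a line boundary, and N divisible by the number of processors. The helper names (interleaved, blocked, shared_lines) are made up for illustration.

```python
from collections import defaultdict

LINE_WORDS = 4  # assumed cache line size in words


def interleaved(n, num_procs):
    """Indices each processor touches with A[NumProcs*i + Pn]."""
    n_elem = n // num_procs
    return {pn: [num_procs * i + pn for i in range(n_elem)]
            for pn in range(num_procs)}


def blocked(n, num_procs):
    """Indices each processor touches with A[Pn*NElem .. (Pn+1)*NElem - 1]."""
    n_elem = n // num_procs
    return {pn: list(range(pn * n_elem, (pn + 1) * n_elem))
            for pn in range(num_procs)}


def shared_lines(indices_by_proc):
    """Number of cache lines touched by more than one processor."""
    owners = defaultdict(set)
    for proc, indices in indices_by_proc.items():
        for index in indices:
            owners[index // LINE_WORDS].add(proc)
    return sum(1 for procs in owners.values() if len(procs) > 1)


if __name__ == "__main__":
    print("interleaved:", shared_lines(interleaved(16, 4)))
    print("blocked:    ", shared_lines(blocked(16, 4)))
```

With 16 elements and 4 processors, every line is shared in the interleaved layout and none is shared in the blocked layout, so only the first loop is exposed to false sharing under these assumptions.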
Synchronization • Sum += A[i]; • Two processors, i = 0, i = 50 • Before the action: – Sum = 5 – A[0] = 10 – A[50] = 33 • What is the proper result?
Synchronization • Sum = Sum + A[i]; • Assembly for this equation, assuming – A[i] is already in $t0 – &Sum is already in $s0
Synchronization: Ordering #1
    lw   $t1, 0($s0)
    add  $t1, $t1, $t0
    sw   $t1, 0($s0)
P1 inst   Effect            P2 inst   Effect
          Given $t0 = 10              Given $t0 = 33
lw        $t1 =             lw        $t1 =
add       $t1 =             add       $t1 =
sw        Sum =             sw        Sum =
Synchronization: Ordering #2
    lw   $t1, 0($s0)
    add  $t1, $t1, $t0
    sw   $t1, 0($s0)
P1 inst   Effect            P2 inst   Effect
          Given $t0 = 10              Given $t0 = 33
lw        $t1 =             lw        $t1 =
add       $t1 =             add       $t1 =
sw        Sum =             sw        Sum =
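The effect of an ordering can be checked by stepping through the three instructions with explicit registers. The sketch below is a toy interpreter, not MIPS. The two schedules are illustrative interleavings chosen here and are not necessarily the exact orderings drawn on the slides. Sum starts at 5, P1 holds A[0] = 10 in $t0, and P2 holds A[50] = 33.

```python
def run(schedule, sum_init=5, t0=None):
    """Execute (processor, instruction) pairs in the given order; return Sum."""
    if t0 is None:
        t0 = {"P1": 10, "P2": 33}      # A[i] already loaded into $t0
    mem_sum = sum_init                 # the word at address $s0
    t1 = {"P1": None, "P2": None}      # each processor's private $t1
    for proc, inst in schedule:
        if inst == "lw":               # lw  $t1, 0($s0)
            t1[proc] = mem_sum
        elif inst == "add":            # add $t1, $t1, $t0
            t1[proc] = t1[proc] + t0[proc]
        elif inst == "sw":             # sw  $t1, 0($s0)
            mem_sum = t1[proc]
        else:
            raise ValueError(f"unknown instruction: {inst}")
    return mem_sum


# P1 finishes all three instructions before P2 starts.
one_after_the_other = [("P1", "lw"), ("P1", "add"), ("P1", "sw"),
                       ("P2", "lw"), ("P2", "add"), ("P2", "sw")]
# Both processors load Sum before either one stores it.
interleaved = [("P1", "lw"), ("P2", "lw"), ("P1", "add"),
               ("P2", "add"), ("P1", "sw"), ("P2", "sw")]

if __name__ == "__main__":
    print(run(one_after_the_other), run(interleaved))
```

When both loads happen before either store, one of the two additions is lost, so the final Sum depends on which store lands last.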
Does Cache Coherence solve it? • Did load bring in an old value? • Sum += A[i] is ______ – Atomic – operation occurs in one unit, and nothing may interrupt it.
Synchronization Problem • Reading and writing memory is a non-atomic operation – You cannot read and write a memory location in a single operation • We need _________ that allow us to read and write without interruption
Solution • Software Solution – “lock” – – “unlock” – • Hardware – Provide primitives that read & write in order to implement lock and unlock
Software: Using lock and unlock
    lock
    Sum += A[i]
    unlock
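In a threaded program the same pattern can look like the sketch below, which uses Python's standard threading.Lock as the lock. The function name parallel_sum, the variable sum_lock, and the way the array is split between threads are assumptions made for the example.

```python
import threading


def parallel_sum(values, num_threads=2):
    """Sum `values` with several threads, guarding the shared total with a lock."""
    total = 0
    sum_lock = threading.Lock()    # hypothetical name for the lock on Sum

    def worker(start, stop):
        nonlocal total
        for i in range(start, stop):
            sum_lock.acquire()     # lock
            total += values[i]     # Sum += A[i]
            sum_lock.release()     # unlock

    chunk = len(values) // num_threads
    threads = []
    for t in range(num_threads):
        start = t * chunk
        # The last thread also takes any leftover elements.
        stop = len(values) if t == num_threads - 1 else start + chunk
        threads.append(threading.Thread(target=worker, args=(start, stop)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(100))))
```

Only the update of the shared total sits between lock and unlock. Keeping that region short limits how long the other thread has to wait.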
Hardware: Implementing lock & unlock • swap $1, 100($2) – Swap the contents of $1 and M[$2+100]
Hardware: Implementing lock & unlock with swap • If lock has 0, it is free • If lock has 1, it is held
Lock:    li    $t0, 1
Loop:    swap  $t0, 0($a0)
         bne   $t0, $0, Loop
Unlock:  sw    $0, 0($a0)
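The spin loop can be mimicked in software to see why an atomic swap is enough. Python has no hardware swap instruction, so the sketch below models one: the Memory class and its internal guard are stand-ins for the hardware guarantee that the read and the write happen as one step. All names here (Memory, LOCK_ADDR, lock, unlock, locked_sum) are hypothetical.

```python
import threading
import time


class Memory:
    """Toy word-addressed memory with an atomic swap.

    The internal guard only models the atomicity that hardware would
    provide; it is not part of the locking algorithm itself.
    """

    def __init__(self):
        self._words = {}
        self._guard = threading.Lock()

    def swap(self, addr, value):
        """Store `value` at `addr` and return the old contents in one step."""
        with self._guard:
            old = self._words.get(addr, 0)
            self._words[addr] = value
            return old

    def store(self, addr, value):
        with self._guard:
            self._words[addr] = value


LOCK_ADDR = 0  # address of the lock word: 0 = free, 1 = held


def lock(mem):
    # Loop: swap $t0, 0($a0) ; bne $t0, $0, Loop
    while mem.swap(LOCK_ADDR, 1) != 0:
        time.sleep(0)              # let the lock holder run


def unlock(mem):
    # sw $0, 0($a0)
    mem.store(LOCK_ADDR, 0)


def locked_sum(values, num_threads=2):
    mem = Memory()
    total = [0]                    # shared Sum
    chunk = len(values) // num_threads

    def worker(start, stop):
        for i in range(start, stop):
            lock(mem)
            total[0] += values[i]
            unlock(mem)

    threads = []
    for t in range(num_threads):
        start = t * chunk
        stop = len(values) if t == num_threads - 1 else start + chunk
        threads.append(threading.Thread(target=worker, args=(start, stop)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total[0]


if __name__ == "__main__":
    print(locked_sum(list(range(100))))
```

A thread gets past lock only when the swap returns 0, meaning it was the one that changed the lock word from free to held. Every other thread gets 1 back and keeps spinning.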
Summary • Cache coherence must be implemented for shared memory to work • False sharing causes bad cache performance • Hardware primitives necessary for synchronizing shared data