Lecture 9 Outline: MESI protocol, Dragon update-based protocol
Lecture 9 Outline
- MESI protocol
- Dragon update-based protocol
- Impact of protocol optimizations
Lecture 9, ECE/CSC 506, Summer 2006, E. F. Gehringer, based on slides by Yan Solihin
Lower-Level Protocol Choices
- BusRd observed in M state: what transition to make?
  - Change to S: assume I'll read again soon
    - good for mostly-read data
    - what about "migratory" data: I read and write, then you read and write, then X reads and writes ...
  - Change to I: assume another processor will write to it (Synapse)
  - Sequent Symmetry and MIT Alewife use adaptive protocols
MESI (4-state) Invalidation Protocol
- Problem with MSI protocol
  - Rd, Wr sequence incurs 2 transactions
    - even when no one is sharing (e.g., serial program!)
    - BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M)
    - In general, penalizing serial programs is unacceptable
- Add exclusive state:
  - Invalid
  - Modified (dirty)
  - Shared (two or more caches may have copies)
  - Exclusive (only this cache has a clean copy, same value as in memory)
- How to decide I -> E or I -> S?
  - Need to check whether someone else has a copy
  - "Shared" signal on bus: wired-OR line asserted in response to BusRd
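The cost of the Rd, Wr sequence can be counted directly. The sketch below is only an illustration of the argument on this slide: the function name `rd_wr_bus_transactions` and the boolean `others_have_copy` (standing in for the "Shared" bus line) are assumptions made for this example.

```python
def rd_wr_bus_transactions(protocol, others_have_copy=False):
    """Count bus transactions for a read miss followed by a write to the same block.

    `others_have_copy` models the wired-OR "Shared" line sampled on the BusRd.
    """
    if protocol not in ("MSI", "MESI"):
        raise ValueError("protocol must be 'MSI' or 'MESI'")

    transactions = 1  # the read miss always issues a BusRd
    if protocol == "MESI" and not others_have_copy:
        state = "E"   # Shared line not asserted: I -> E
    else:
        state = "S"   # MSI has no E state; MESI goes to S when the line is asserted

    if state == "S":
        transactions += 1  # the write needs BusRdX or BusUpgr (S -> M)
    # From E the write is a silent E -> M transition: no bus transaction.
    return transactions


if __name__ == "__main__":
    print("MSI, serial program: ", rd_wr_bus_transactions("MSI"))
    print("MESI, serial program:", rd_wr_bus_transactions("MESI"))
    print("MESI, block shared:  ", rd_wr_bus_transactions("MESI", others_have_copy=True))
```

For a serial program the MESI count drops from two transactions to one; when another cache already holds the block, MESI behaves like MSI.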
MESI: Processor-Initiated Transactions
[State diagram; transitions listed as processor event/bus action]
- I: PrRd/BusRd(~S) -> E; PrRd/BusRd(S) -> S; PrWr/BusRdX -> M
- S: PrRd/- (stay in S); PrWr/BusRdX -> M
- E: PrRd/- (stay in E); PrWr/- -> M
- M: PrRd/-, PrWr/- (stay in M)
MESI: Bus-Initiated Transactions
[State diagram; transitions listed as observed bus transaction/action]
- M: BusRd/Flush -> S; BusRdX/Flush -> I
- E: BusRd/Flush -> S; BusRdX/Flush -> I
- S: BusRd/Flush1 (stay in S); BusRdX/Flush1 -> I
- I: BusRd/-, BusRdX/- (stay in I)
MESI State Transition Diagram
- BusRd(S) means the shared line is asserted on the BusRd transaction
Flush vs. Flush1 (Flush' in textbook)
- Flush: mandatory
- Flush' (Flush1): happens only when
  - cache-to-cache sharing is used, and
  - only one cache flushes data
MESI Visualization
[Figure sequence: processors P1, P2, P3, each with a cache and snooper, on a shared bus with the memory controller and main memory. Steps shown:]
1. Initially: no cache holds X; memory has X=1.
2. P1 reads X (rd &X): BusRd; memory supplies X=1.
3. P1 holds X=1 in state E.
4. P1 writes X (X=2): P1's copy goes from 1 to 2, E -> M, with no bus transaction; memory still has X=1. One less bus request due to the Exclusive state, especially for serial programs.
5. P3 reads X: BusRd; P1 holds X=2 in M.
6. P1 flushes: P1 M -> S, P3 gets X=2 in S; memory is updated from 1 to 2.
7. P3 writes X (X=3): BusUpgr; P1 S -> I; P3's copy goes from 2 to 3, S -> M; memory still has X=2. Note: BusUpgr instead of BusRdX.
8. P1 reads X: BusRd; P3 flushes, M -> S; P1 I -> S with X=3; memory is updated from 2 to 3.
9. P3 reads X: hit on its own copy (X=3, state S).
10. P2 reads X: BusRd; P1 or P3 supplies the block (Flush1); P1, P2, and P3 all hold X=3 in S. Referred to as cache-to-cache transfer in the Illinois MESI protocol.
MESI Example (Cache-to-Cache Transfer)
Proc Action  State P1  State P2  State P3  Bus Action    Data From
R1           E         -         -         BusRd         Mem
W1           M         -         -         -             Own cache
R3           S         -         S         BusRd/Flush   P1 cache
W3           I         -         M         BusRdX        Mem
R1           S         -         S         BusRd/Flush   P3 cache
R3           S         -         S         -             Own cache
R2           S         S         S         BusRd/Flush1  P1/P3 cache*
* Data from memory if no cache-to-cache transfer (BusRd/-)
MESI Example (Cache-to-Cache Transfer + BusUpgr)
Proc Action  State P1  State P2  State P3  Bus Action    Data From
R1           E         -         -         BusRd         Mem
W1           M         -         -         -             Own cache
R3           S         -         S         BusRd/Flush   P1 cache
W3           I         -         M         BusUpgr       Own cache
R1           S         -         S         BusRd/Flush   P3 cache
R3           S         -         S         -             Own cache
R2           S         S         S         BusRd/Flush1  P1/P3 cache*
* Data from memory if no cache-to-cache transfer (BusRd/-)
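The example trace can be replayed with a small simulator. This is a minimal sketch, not the course's reference model: the names `mesi_access` and `run_mesi_trace` are invented here, not-present blocks are written as "I", and the cases the trace never exercises (a read miss served by an E holder, a write miss from I) follow common MESI conventions and should be read as assumptions.

```python
def mesi_access(states, proc, op, use_upgr=True):
    """Apply one access to a single block; return (bus_action, data_from).

    `states` maps processor name -> 'M', 'E', 'S' or 'I' and is updated in place.
    """
    me = states[proc]
    holders = [p for p in sorted(states) if p != proc and states[p] != "I"]

    if op == "R":
        if me != "I":
            return "-", "Own cache"              # read hit, no bus traffic
        if not holders:
            states[proc] = "E"                   # Shared line not asserted
            return "BusRd", "Mem"
        # Assumption: a holder in M or E answers with a mandatory Flush.
        exclusive = [p for p in holders if states[p] in ("M", "E")]
        for p in holders:
            states[p] = "S"
        states[proc] = "S"
        if exclusive:
            return "BusRd/Flush", exclusive[0] + " cache"
        return "BusRd/Flush1", "/".join(holders) + " cache"

    if op != "W":
        raise ValueError("op must be 'R' or 'W'")
    if me in ("M", "E"):
        states[proc] = "M"                       # silent E -> M, or hit in M
        return "-", "Own cache"
    dirty = [p for p in holders if states[p] == "M"]
    for p in holders:
        states[p] = "I"                          # invalidate all other copies
    states[proc] = "M"
    if me == "S" and use_upgr:
        return "BusUpgr", "Own cache"
    if dirty:                                    # assumption: not in the trace
        return "BusRdX/Flush", dirty[0] + " cache"
    return "BusRdX", "Mem"


def run_mesi_trace(trace, use_upgr=True):
    """Replay accesses such as 'R1' or 'W3'; return one row per access."""
    states = {"P1": "I", "P2": "I", "P3": "I"}
    rows = []
    for access in trace:
        bus, src = mesi_access(states, "P" + access[1], access[0], use_upgr)
        rows.append((access, states["P1"], states["P2"], states["P3"], bus, src))
    return rows


if __name__ == "__main__":
    for row in run_mesi_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
        print(row)
```

Calling `run_mesi_trace(..., use_upgr=False)` gives the BusRdX variant of the table, where the W3 row fetches the block from memory instead of upgrading in place.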
Lower-Level Protocol Choices
- Who supplies data on a miss when not in M state: memory or cache?
- Original, Illinois MESI: cache
  - assumes cache is faster than memory (cache-to-cache transfer)
  - Not necessarily true
- Adds complexity
  - How does memory know it should supply data? (must wait for caches)
  - Selection algorithm if multiple caches have valid data
- Valuable for distributed memory
  - May be cheaper to obtain from nearby cache than distant memory
  - Especially when constructed out of SMP nodes (Stanford DASH)
Dragon Writeback Update Protocol
- Four states
  - Exclusive-clean (E): I and memory have it
  - Shared-clean (Sc): I, others, and maybe memory, but I'm not the owner
  - Shared-modified (Sm): I and others but not memory, and I'm the owner
    - Sm and Sc can coexist in different caches, with at most one Sm
  - Modified or dirty (M): I and no one else
  - On replacement: Sc can silently drop, Sm has to flush
- No invalid state
  - If in cache, cannot be invalid
  - If not present in cache, can view as being in not-present or invalid state
- New processor events: PrRdMiss, PrWrMiss
  - Introduced to specify actions when block not present in cache
- New bus transaction: BusUpd
  - Broadcasts single word written on bus; updates other relevant caches
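The replacement rule can be written as a tiny helper. This is a sketch under stated assumptions: the name `dragon_replace` and the "-" marker for a not-present block are invented for this example, and the handling of E and M (E dropped silently because it is clean, M flushed because it is dirty) extends the Sc/Sm rule stated on the slide.

```python
def dragon_replace(states, proc):
    """Evict the block from `proc`'s cache; return the resulting bus action.

    `states` maps processor name -> 'E', 'Sc', 'Sm', 'M' or '-' (not present).
    """
    state = states[proc]
    states[proc] = "-"          # Dragon has no Invalid state; the block is just gone
    if state in ("Sm", "M"):
        return "Flush"          # the owner holds the only up-to-date copy
    return "-"                  # clean copies (E, Sc) are dropped silently


if __name__ == "__main__":
    states = {"P1": "Sc", "P2": "Sc", "P3": "Sm"}
    print("P1 replaces X:", dragon_replace(states, "P1"))  # silent drop
    print("P3 replaces X:", dragon_replace(states, "P3"))  # owner writes back
```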
Dragon State Transition Diagram
[Figure: combined state diagram over E, Sc, Sm, and M showing both processor-initiated and bus-initiated transitions; the next two slides show each set separately.]
Dragon: Processor-Initiated Transactions
[State diagram; transitions listed as processor event/bus action]
- Not present: PrRdMiss/BusRd(~S) -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(~S) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm
- E: PrRd/- (stay in E); PrWr/- -> M
- Sc: PrRd/- (stay in Sc); PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(~S) -> M
- Sm: PrRd/- (stay in Sm); PrWr/BusUpd(S) (stay in Sm); PrWr/BusUpd(~S) -> M
- M: PrRd/-, PrWr/- (stay in M)
Dragon: Bus-Initiated Transactions
[State diagram; transitions listed as observed bus transaction/action]
- E: BusRd/- -> Sc
- Sc: BusRd/-, BusUpd/Update (stay in Sc)
- Sm: BusRd/Flush (stay in Sm); BusUpd/Update -> Sc
- M: BusRd/Flush -> Sm
Dragon Visualization
[Figure sequence: processors P1, P2, P3, each with a cache and snooper, on a shared bus with the memory controller and main memory. Steps shown:]
1. Initially: no cache holds X; memory has X=1.
2. P1 reads X (rd &X): BusRd; memory supplies X=1.
3. P1 holds X=1 in state E.
4. P1 writes X (X=2): P1's copy goes from 1 to 2, E -> M, with no bus transaction; memory still has X=1. One less bus request due to the Exclusive state, especially for serial programs.
5. P3 reads X: BusRd; P1 holds X=2 in M.
6. P1 M -> Sm; P3 gets X=2 in Sc; memory still has X=1.
7. P3 writes X (X=3): BusUpd; P1's copy is updated from 2 to 3, Sm -> Sc; P3's copy goes from 2 to 3, Sc -> Sm; memory still has X=1. Note: BusUpd instead of BusUpgr (no invalidation is performed).
8. P1 reads X: hit on its own copy (X=3, state Sc). This is a miss in the MESI and MSI protocols.
9. P3 reads X: hit on its own copy (X=3, state Sm).
10. P2 reads X: BusRd; P3 supplies the block and P2 gets X=3 in Sc. Note: only the cache in Sm is responsible for the cache-to-cache transfer.
11. P1 replaces X: P1 (Sc), P2 (Sc), and P3 (Sm) all hold X=3; memory still has X=1.
12. P3 replaces X: the owner is responsible for writing back to memory, which is updated from 1 to 3. Compare MSI or MESI, where write-back occurs only when the line is in M state.
Dragon Example
Proc Action  State P1  State P2  State P3  Bus Action   Data From
R1           E         -         -         BusRd        Mem
W1           M         -         -         -            Own cache
R3           Sm        -         Sc        BusRd/Flush  P1 cache
W3           Sc        -         Sm        BusUpd/Upd   Own cache
R1           Sc        -         Sm        -            Own cache
R3           Sc        -         Sm        -            Own cache
R2           Sc        Sc        Sm        BusRd/Flush  P3 cache
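The Dragon trace can be replayed the same way. The sketch below is illustrative only: `dragon_access` and `run_dragon_trace` are names made up for this example, "-" marks a block that is not present, and the write-miss paths (which the trace never reaches) follow the processor-initiated transitions as an assumption.

```python
def dragon_access(states, proc, op):
    """Apply one access to a single block; return (bus_action, data_from).

    `states` maps processor name -> 'E', 'Sc', 'Sm', 'M' or '-' (not present).
    """
    if op not in ("R", "W"):
        raise ValueError("op must be 'R' or 'W'")
    me = states[proc]
    holders = [p for p in sorted(states) if p != proc and states[p] != "-"]
    shared = bool(holders)                       # models the Shared bus line

    if me == "-":                                # PrRdMiss / PrWrMiss: BusRd
        owner = [p for p in holders if states[p] in ("M", "Sm")]
        for p in holders:
            if states[p] == "E":
                states[p] = "Sc"                 # BusRd/-
            elif states[p] == "M":
                states[p] = "Sm"                 # BusRd/Flush
        bus = "BusRd/Flush" if owner else "BusRd"
        src = owner[0] + " cache" if owner else "Mem"
        if op == "R":
            states[proc] = "Sc" if shared else "E"
            return bus, src
        if not shared:
            states[proc] = "M"
            return bus, src
        for p in holders:                        # follow-up BusUpd: old owner -> Sc
            states[p] = "Sc"
        states[proc] = "Sm"
        return bus + "; BusUpd", src

    if op == "R":
        return "-", "Own cache"                  # every present block is valid
    if me in ("E", "M"):
        states[proc] = "M"                       # no other copies to update
        return "-", "Own cache"
    if shared:                                   # write hit in Sc or Sm
        for p in holders:
            states[p] = "Sc"                     # copies updated, none invalidated
        states[proc] = "Sm"
        return "BusUpd/Upd", "Own cache"
    states[proc] = "M"                           # BusUpd(~S): last remaining copy
    return "BusUpd", "Own cache"


def run_dragon_trace(trace):
    """Replay accesses such as 'R1' or 'W3'; return one row per access."""
    states = {"P1": "-", "P2": "-", "P3": "-"}
    rows = []
    for access in trace:
        bus, src = dragon_access(states, "P" + access[1], access[0])
        rows.append((access, states["P1"], states["P2"], states["P3"], bus, src))
    return rows


if __name__ == "__main__":
    for row in run_dragon_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
        print(row)
```

Compared with the MESI trace, the second R1 is a hit here because the write by P3 updated P1's copy instead of invalidating it.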
Lower-Level Protocol Choices
- Can shared-modified state be eliminated?
  - Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  - Dragon protocol doesn't (assumes DRAM memory slow to update)
- Should replacement of an Sc block be broadcast?
  - Would allow last copy to go to Exclusive state and not generate updates
  - Replacement bus transaction is not in critical path; later update may be
- Shouldn't update local copy on write hit before controller gets bus
  - Can mess up serialization
- Coherence, consistency considerations much like write-through case
- In general, many subtle race conditions in protocols
- But first, let's illustrate quantitative assessment at logical level
Assessing Protocol Tradeoffs
- Methodology:
  - Use simulator; choose parameters per earlier methodology (default 1 MB, 4-way cache, 64-byte block, 16 processors; 64 KB cache for some)
  - Focus on frequencies, not end performance for now
    - transcends architectural details, but not what we're really after
  - Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    - Cheap simulation: no need to model contention
Impact of Protocol Optimizations: MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)
[Figure: two bar charts of bus traffic (MB/s), split into address-bus and data-bus traffic. One chart covers Barnes, LU, Ocean, Radiosity, Radix, and Raytrace; the other covers Appl-Code, Appl-Data, OS-Code, and OS-Data. Each workload has three bars: Ill (Illinois MESI), 3St (MSI with BusUpgr), and 3St-RdEx (MSI with BusRdX).]
- MSI = MESI in bus traffic for these workloads
- Upgrades instead of read-exclusive helps
- Same story when working sets don't fit for Ocean, Radix, Raytrace
Impact of Cache-Block Size
- Multiprocessors add a new kind of miss to cold, capacity, conflict
  - Coherence misses: due to invalidations
    - True sharing: write to the same word
    - False sharing: write to different words in the same block
- Reducing misses architecturally in an invalidation protocol
  - Capacity: enlarge cache; increase block size (if spatial locality)
  - Conflict: increase associativity
  - Cold and coherence: only block size
- Increasing block size has advantages and disadvantages
  - Can reduce misses if spatial locality is good
  - Can hurt too
    - increases misses due to false sharing if spatial locality is not good
    - increases misses due to conflicts in a fixed-size cache
    - increases traffic due to fetching unnecessary data and due to false sharing
    - can increase miss penalty and perhaps hit cost
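The true/false sharing distinction depends only on whether two addresses fall in the same block and the same word. The sketch below is illustrative: the function name `classify_sharing`, the 4-byte word size, and the example addresses are assumptions chosen for this example.

```python
def classify_sharing(written_addr, accessed_addr, block_bytes, word_bytes=4):
    """Classify the miss one processor takes after another processor's write.

    `written_addr` is the byte address written by the other processor;
    `accessed_addr` is the byte address this processor touches afterwards.
    """
    if written_addr // block_bytes != accessed_addr // block_bytes:
        return "no coherence miss"     # different blocks: the write did not invalidate us
    if written_addr // word_bytes == accessed_addr // word_bytes:
        return "true sharing"          # same word: the new value is actually needed
    return "false sharing"             # same block, different word


if __name__ == "__main__":
    # Two variables 32 bytes apart, each used by a different processor.
    for block in (16, 32, 64, 128):
        print(block, classify_sharing(0x100, 0x120, block))
```

With these addresses the two variables are independent at 16- and 32-byte blocks but falsely shared at 64 bytes and above, which is the effect the slide attributes to larger blocks.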
Impact of Block Size on Miss Rate
- For default problem size: vary block/line size from 8 to 256 bytes
[Figure: stacked bar charts of miss rate (%) against block size (8 to 256 bytes) for Barnes, Lu, Radiosity, Ocean, Radix, and Raytrace. Each bar is broken into cold, capacity, true sharing, false sharing, and upgrade components.]
- Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
- Increases with larger lines: false sharing
- Working set doesn't fit: impact of capacity misses is large (Ocean, Radix)
Impact of Block Size on Traffic
- Traffic (bytes/instruction) affects performance indirectly through contention
[Figure: bar charts of traffic (bytes/instruction), split into address-bus and data-bus traffic, against block size (8 to 256 bytes) for Barnes, Radiosity, and Raytrace.]
- Results differ from those for miss rate: traffic almost always increases
- When the working set fits, overall traffic is still small, except for Radix
- Fixed overhead is a significant component
  - So total traffic is often minimized at a 16-32 byte block, not smaller
- Working set doesn't fit: even 128-byte blocks are good for Ocean due to capacity
  - Address bus traffic behaves in the opposite way to data bus traffic
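A simple traffic model shows why fixed overhead moves the minimum away from the smallest block. Everything numeric below is hypothetical: the miss rates and the 6-byte per-transaction overhead are made-up values for illustration, not measurements from these workloads, and the function names are invented for this sketch.

```python
def traffic_per_instruction(misses_per_instr, block_bytes, overhead_bytes):
    """Bytes per instruction: each miss moves one block plus a fixed overhead."""
    return misses_per_instr * (block_bytes + overhead_bytes)


def best_block_size(miss_rates, overhead_bytes):
    """Return the block size with the lowest modeled traffic."""
    return min(
        miss_rates,
        key=lambda b: traffic_per_instruction(miss_rates[b], b, overhead_bytes),
    )


# Hypothetical misses per instruction for each block size (bytes).
HYPOTHETICAL_MISS_RATES = {8: 0.020, 16: 0.011, 32: 0.007, 64: 0.005, 128: 0.004}

if __name__ == "__main__":
    for overhead in (0, 6):  # hypothetical address/command bytes per transaction
        best = best_block_size(HYPOTHETICAL_MISS_RATES, overhead)
        print("overhead", overhead, "bytes -> least traffic at", best, "byte blocks")
```

With no overhead the smallest block wins in this model; once every transaction carries a fixed cost, a slightly larger block amortizes it better, even though data traffic per miss keeps growing.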