Lecture on High Performance Processor Architecture CS 05162


Lecture on High Performance Processor Architecture (CS 05162)
Introduction to Shared Memory Model and Transactional Memory
Guo Rui (timmyguo@mail.ustc.edu.cn)
Fall 2007
Department of Computer Science and Technology, University of Science and Technology of China (CS of USTC)

Outline
• SMP/CMP & Shared Memory Model
• Synchronization: the Critical-Section Problem
  − Lock (Mutex)
• Transactional Memory

2020/9/9 CS of USTC

Natural Extensions of the Memory System
[Figure: three organizations. (1) Shared cache: P1…Pn connect through a switch to an interleaved first-level cache and interleaved main memory. (2) Centralized memory ("dance hall", UMA): P1…Pn with private caches on an interconnection network to shared memory. (3) Distributed memory (NUMA): a memory module attached to each processor node, all joined by an interconnection network.]

Interconnection
• Bus (shared medium)
  − Broadcast & snoop
  − Contention & arbitration
  − Cheap
• Routing network (2-D mesh, etc.)
  − Unicast communication
  − Multi-hop communication
  − Expensive

Example Cache Coherence Problem
[Figure: P1, P2, P3 with private caches over a bus to memory and I/O devices; u is initially 5 in memory. Events: (1) P1 reads u; (2) P3 reads u; (3) P3 writes u = 7; (4) P1 reads u; (5) P2 reads u.]
• Processors see different values for u after event 3.
• With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when.
  − Processes accessing main memory may see a very stale value.
• Unacceptable to programs, and frequent!

Caches and Cache Coherence
• Private processor caches create a problem:
  − Copies of a variable can be present in multiple caches.
  − A write by one processor may not become visible to others.
    · They will keep accessing the stale value in their caches. => the cache coherence problem.
• What do we do about it?
  − Nothing at all.
  − Organize the memory hierarchy to make it go away.
  − Detect the problem and take actions to eliminate it.

Intuitive Memory Model
[Figure: a processor with L1 (100: 67), L2 (100: 35), memory (100: 34), and disk; the levels can hold different values for the same address.]
• Reading an address should return the last value written to that address.
• Easy in uniprocessors (except for I/O).
• The cache coherence problem in multiprocessors is more pervasive and more performance-critical.

Example: Write-through Invalidate Protocol
[Figure: the same event sequence as before, but P3's write of u = 7 goes through to memory (u: 5 becomes 7) and invalidates the other cached copies, so the subsequent reads in events 4 and 5 return 7.]

Invalidate vs. Update
• Basic question of program behavior: is a block written by one processor later read by others before it is overwritten?
• Invalidate:
  − yes: readers will take a miss.
  − no: multiple writes without additional traffic.
    · Also clears out copies that will never be used again.
• Update:
  − yes: avoids misses on later references.
  − no: multiple useless updates.
    · Even to "pack rats" (caches that keep copies they no longer need).
=> Need to look at program reference patterns and hardware complexity, but first: correctness.

MSI Invalidate Protocol
• A read obtains the block in "shared" state, even if it is the only cached copy.
• Obtain exclusive ownership before writing:
  − BusRdX causes others to invalidate (demote).
  − If the block is M in another cache, that cache will flush it.
  − BusRdX is issued even on a hit in S: promote to M (upgrade).
• What about replacement?
  − S -> I and M -> I, as before.
[State diagram: states M, S, I. On M: PrRd/—, PrWr/—; BusRd/Flush takes M to S; BusRdX/Flush takes M to I. On S: PrRd/—, BusRd/—; PrWr/BusRdX takes S to M; BusRdX/— takes S to I. On I: PrRd/BusRd takes I to S; PrWr/BusRdX takes I to M.]

Setup for Memory Consistency
• Coherence => writes to a location become visible to all processors in the same order.
• But when does a write become visible?
• How do we establish an order between a write and a read by different processors?
  − Use event synchronization.
  − Typically involves more than one location!

Example

  P1:                P2:
  /* assume A and flag are initially 0 */
  A = 1;             while (flag == 0); /* spin idly */
  flag = 1;          print A;

• The intuitive result (P2 prints 1) is not guaranteed by coherence.
• We expect memory to respect the order between accesses to different locations issued by a given process.
• Coherence is not enough!
  − It pertains only to a single location.
  − It preserves orders only among accesses to the same location by different processes.

Memory Consistency Model
• Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another:
  − What orders are preserved?
  − Given a load, constrains the possible values it can return.
• Without one, we can't say much about a shared-address-space program's execution.
• Implications for both programmer and system designer:
  − The programmer uses it to reason about correctness and possible results.
  − The system designer uses it to decide how far accesses can be reordered by the compiler or hardware.
• A contract between programmer and system.

Sequential Consistency
[Figure: processors P1…Pn issue memory references in program order; a conceptual "switch" to a single memory is randomly reset after each memory reference.]
• A total order is achieved by interleaving accesses from different processes:
  − Program order is maintained, and memory operations from all processes appear to issue, execute, and complete atomically with respect to one another,
  − as if there were no caches and a single memory.
• "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

SC Example

  P1:                  P2:
  /* assume A and B are initially 0 */
  (1a) A = 1;          (2a) print B;
  (1b) B = 2;          (2b) print A;

• What matters is the order in which operations appear to execute, not the chronological order of events.
• Possible outcomes for (A, B): (0, 0), (1, 0), (1, 2).
• What about (0, 2)?
  − Program order => 1a -> 1b and 2a -> 2b.
  − A = 0 implies 2b -> 1a, which implies 2a -> 1b.
  − B = 2 implies 1b -> 2a, which leads to a contradiction.

Outline
• SMP/CMP & Shared Memory Model
• Synchronization: the Critical-Section Problem
  − Lock (Mutex)
• Transactional Memory

The Problem

for (i = 0; i < 100; i++, cnt++);   /* cnt == 0 initially */

Two threads A and B both execute cnt++, which compiles to a load, an increment, and a store. The interleaving below loses an update:

  A: Reg1 = cnt   // Reg1 = 0
  B: Reg1 = cnt   // Reg1 = 0
  A: Reg1++       // Reg1 = 1
  B: Reg1++       // Reg1 = 1
  A: cnt = Reg1   // cnt = 1
  B: cnt = Reg1   // cnt = 1

Critical-Section Problem
• Accessing shared resources:
  − Only one thread at a time can access the shared resources protected by the critical section.
• Correctness criteria:
  − Mutual exclusion
  − Progress
  − Bounded waiting

Software Mutex Algorithm (Peterson's algorithm, two threads)

bool flag[2] = {false, false};
int turn = 0;

void enterCriticalSection(int t) {
    int other = 1 - t;
    flag[t] = true;
    turn = other;
    while (flag[other] == true && turn == other)
        ;   /* waiting */
}

void leaveCriticalSection(int t) {
    flag[t] = false;
}

Strawman Busy-Wait Lock

lock:   ld   register, location   /* copy location to register */
        cmp  location, #0         /* compare with 0 */
        bnz  lock                 /* if not 0, try again */
        st   location, #1         /* store 1 to mark it locked */
        ret                       /* return control to caller */
unlock: st   location, #0         /* write 0 to location */
        ret                       /* return control to caller */

Why doesn't the acquire method work? What about the release method?

Hardware Support
• Atomic read-modify-write instructions:
  − IBM 370, SPARC: atomic compare&swap.
  − x86: any instruction can be prefixed with a lock modifier.
  − MIPS, PowerPC, Alpha: load-linked / store-conditional.
• Other forms of hardware support:
  − Lock locations in memory.
  − Lock registers (Cray X-MP).
  − Others…

Simple Test&Set Lock

lock:   t&s  register, location
        bnz  lock              /* if not 0, try again */
        ret                    /* return control to caller */
unlock: st   location, #0      /* write 0 to location */
        ret                    /* return control to caller */

• Other read-modify-write primitives:
  − Swap
  − Fetch&op
  − Compare&swap
    · Three operands: a location, a register to compare with, and a register to swap with.
    · Not commonly supported by RISC instruction sets.
• The lock variable may be cacheable or uncacheable.

Performance Criteria for Synchronization Operations
• Latency (time per operation)
  − especially under light contention
• Bandwidth (operations per second)
  − especially under high contention
• Traffic
  − load on critical resources
  − especially on failures under contention
• Storage
• Fairness

Enhancements to the Simple Lock
• Reduce the frequency of test&sets issued while waiting:
  − Test&set lock with backoff.
  − Don't back off too much, or you will still be backed off when the lock becomes free.
  − Exponential backoff works quite well empirically: ith backoff time = k * c^i.
• Busy-wait with read operations rather than test&set:
  − Test-and-test&set lock.
  − Keep testing with an ordinary load; the cached lock variable will be invalidated when a release occurs.
  − When the value changes (to 0), try to obtain the lock with test&set.
    · Only one attemptor will succeed; the others will fail and start testing again.

Improved Hardware Primitives: LL-SC
• Goals:
  − Test with reads.
  − Failed read-modify-write attempts don't generate invalidations.
  − Nice if a single primitive can implement a range of read-modify-write operations.
• Load-Locked (or load-linked), Store-Conditional:
  − LL reads the variable into a register.
  − Follow with arbitrary instructions to manipulate its value.
  − SC tries to store the result back to the location.
  − SC succeeds if and only if no other write to the variable occurred since this processor's LL (indicated by condition codes).
• If SC succeeds, all three steps happened atomically.
• If SC fails, it doesn't write or generate invalidations; the acquire must be retried.

Simple Lock with LL-SC

lock:   ll    reg1, location   /* LL location into reg1 */
        bnz   reg1, lock       /* if locked, try again */
        /* other operations */
        sc    location, reg2   /* SC reg2 into location */
        beqz  lock             /* if SC failed, start again */
        ret
unlock: st    location, #0     /* write 0 to location */
        ret

• Fancier atomic operations can be built by changing what sits between LL & SC:
  − But keep it small so the SC is likely to succeed.
  − Don't include instructions that would need to be undone (e.g. stores).
• SC can fail (without putting a transaction on the bus) if:
  − It detects an intervening write even before trying to get the bus.
  − It tries to get the bus but another processor's SC gets it first.
• LL and SC are not lock and unlock:
  − They only guarantee that no conflicting write to the lock variable occurred between them.
  − But they can be used directly to implement simple operations on shared variables.

Trade-offs So Far
• Latency? Bandwidth? Traffic? Storage? Fairness?
• What happens when several processors are spinning on a lock and it is released?
  − How much traffic for P processors' lock operations?

Ticket Lock
• Only one read-modify-write per acquire.
• Two counters per lock (next_ticket, now_serving):
  − Acquire: fetch&inc next_ticket; wait until now_serving equals your ticket.
    · The atomic op happens when a process arrives at the lock, not when the lock becomes free (so less contention).
  − Release: increment now_serving.
• Performance:
  − Low latency under low contention, if fetch&inc is cacheable.
  − O(p) read misses at release, since all waiters spin on the same variable.
  − FIFO order.
    · Like the simple LL-SC lock, but with no invalidation when SC succeeds, and fair.
  − Backoff?
• Wouldn't it be nice to poll different locations…

Array-based Queuing Locks
• Waiting processes poll different locations in an array of size p.
  − Acquire:
    · fetch&inc to obtain the address on which to spin (the next array element).
    · Ensure that these addresses are in different cache lines or memories.
  − Release:
    · Set the next location in the array, waking up the process spinning on it.
  − O(1) traffic per acquire with coherent caches.
  − FIFO ordering, as in the ticket lock, but O(p) space per lock.
  − Not so great for non-cache-coherent machines with distributed memory:
    · The array location I spin on is not necessarily in my local memory (solution later).

Lock Performance on the SGI Challenge
[Figure: microbenchmark "loop: lock; delay(c); unlock; delay(d);" measuring time (µs) for 1 to 15 processors, comparing array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket-with-proportional-backoff locks. Panels: (a) null (c = 0, d = 0); (b) critical section (c = 3.64 µs, d = 0); (c) delay (c = 3.64 µs, d = 1.29 µs).]

The Problems of Locks
• Performance:
  − Priority inversion.
  − Convoying.
• Productivity:
  − Deadlock: locks are not composable.
  − Performance vs. ease of use:
    · fine-grained locks
    · coarse-grained locks
• Key: locking is conservative.

Example (lock-order deadlock):
  Thread 1: Lock B; Lock A; … UnLock A; UnLock B;
  Thread 2: Lock A; Lock B; … UnLock B; UnLock A;

Outline
• SMP/CMP & Shared Memory Model
• Synchronization: the Critical-Section Problem
  − Lock (Mutex)
• Transactional Memory

Concept of Transactional Memory
• Borrowed from the database transaction (xact) concept:
  − ACID properties: Atomicity, Consistency, Isolation, Durability.
• A xact is an atomic group of instructions:
  − All of its results appear to the system if the xact succeeds.
  − None of its results appear to the system if the xact fails.
  − Temporary results of different xacts do not interfere with each other.
• Compare: locks.
• Compare: LL-SC instructions.

Interaction Between Xacts in the TCC Model

for (i = 0; i < 100; i++) {        for (i = 0; i < 100; i++) {
    xact_begin;                        Lock(l_counter);
    cnt++;                             cnt++;
    xact_commit;                       unLock(l_counter);
}                                  }

Nesting Transactions
• Flattening
• Closed nesting
  − Partial abort
• Open nesting
  − Partial commit

Works on Hardware Transactional Memory
• Basic model:
  − Basic HTM, 1993, Herlihy
  − SLE/TLR, 2002?, Rajwar & Goodman
  − TCC, 2004, Lance Hammond
  − LogTM, 2006, K. E. Moore
• Virtualization:
  − UTM, 2005, C. Scott Ananian
  − VTM, 2005, Rajwar & Herlihy
  − TTM, 2004, K. E. Moore
  − LogTM-SE, 2007
• And more…

TCC Abstract
• Lazy version management & conflict detection:
  − Old values are kept in place.
  − Conflicts are detected when one of the transactions commits.
• Bus-based.
• Write addresses are buffered in a write buffer.
• Modifications are broadcast on commit.
• Fast abort: discard the buffered new values.
• Ordered transaction support (phase numbers).
• Centralized arbitration needed.

TCC Architecture
[Figure: TCC node architecture.]

TCC Programming Model
Steps:
1. Divide the program into transactions.
2. Specify order:
   − unordered transaction model
   − ordered transaction model
3. Performance tuning.

Optimization guidelines:
(1) Maximize parallelism & minimize data dependencies.
(2) Large transactions are advisable.
(3) Choose small transactions when violations are frequent.

[Figure: timelines of numbered transactions illustrating unordered vs. ordered (phase-numbered) execution.]

Cache Structure Extension
• Read bit:
  − Set on a speculative load; checked against remote commits.
  − Per line or per word.
• Modified bit:
  − Set on a speculative store; lines with this flag set are discarded on abort.
  − Per line.
• Renamed bit:
  − Similar to the modified bit.
  − Per word.
  − Avoids false violations.

What If the Cache Is Full?
• Speculative lines cannot be replaced during a transaction.
• Solutions:
  − (1) Move the cache line to a victim buffer.
  − (2) Stall and request commit permission.

Rollback & Commit
• How to roll back? Checkpoints:
  − (1) By hardware or by software.
  − (2) A hardware scheme can be associated with register renaming.
• How to commit?
  − (1) Write buffer: a separate buffer that holds all speculative stores.
  − (2) Address buffer: keeps only the tags of the cache lines to be committed.

LogTM Abstract
• Eager version management & conflict detection:
  − New values are stored in place.
  − Conflicts are detected on each memory reference.
• Implemented on a modified directory-based MOESI protocol.
• Old values are logged in memory.
• Fast local commit; no arbitration needed.
• Slow local abort: roll back the log entries.
• Unordered transactions only.
• No transaction size limit.

LogTM Architecture
[Figure: LogTM node architecture.]

Transaction Log & Rollback
• Log region in memory:
  − Cacheable.
  − Specified by two registers.
• Add a log entry on the first modification to a block:
  − Check the W bit.
  − Minimizes duplication.
• Trap to a software handler for rollback:
  − Roll back the log entries in FILO order.

Problems of LogTM
• A transaction conflict may stall the requestor.
• False sharing cannot be easily eliminated.

Conflict Detection
[Figure: directory-based conflict detection in LogTM.]

Overflow Handling
• Speculative data can be swapped out like a normal cache line.
• How to track the speculation in this situation?
  − A modified directory protocol.
  − The overflow bit!

Overflow Bit
[Figure: use of the overflow bit in the directory protocol.]

TCC vs. LogTM

TCC:
• Buffers new data locally (the old value stays globally available).
• Central arbitration before commit.
• Broadcasts new data on commit.
• Conflicts detected upon commit.
• Ordering specified with phase numbers.
• Transaction size limited by cache size (stall to solve).
• Solves false sharing with masks.

LogTM:
• Logs old data in memory (the new value is visible).
• No arbitration.
• Rolls back on abort, locally.
• Detects conflicts as the transaction runs (deadlock detection needed).
• Only unordered transactions supported.
• No transaction size limit.
• Hard to solve false sharing.

LogTM-SE: Decoupling HTM from Caches
• Key benefits:
  − The implementation doesn't touch the cache structure.
  − Transactions are more easily virtualized:
    · No transaction size or time restrictions.
    · Unbounded transaction nesting.
    · Thread context switching, migration, and paging.

Key Differences from LogTM
• LogTM: uses R/W bits in cache tags to track the read/write sets of transactions.
• LogTM-SE: uses standalone signatures to track the read/write sets of transactions.

Signatures
• A hash of a series of addresses:
  − Example: logically OR together the decoded 10 least-significant bits of each address.
• Intersecting signatures detects conflicts:
  − False positives are allowed.
  − False negatives are forbidden.
• Software-visible.

Example Signature Implementations
[Figure: example signature hardware implementations.]

LogTM-SE Architecture
[Figure: LogTM-SE node architecture.]

Log Optimization
• How to minimize duplicated log entries?
  − No W bits are available now.
• Log filter:
  − Like a small cache/TLB.
  − Log on a miss in the log filter; skip logging on a hit.
  − Cleared out at transaction boundaries.

Nesting Transactions in LogTM
• Segment the transaction log into a stack of frames:
  − Fixed header:
    · register checkpoint
    · saved signature
  − Undo records.
• On a new nested transaction:
  − Save the signature into the log frame header.
  − Allocate a new header; save a register checkpoint.
  − Clear the log filter.

Nesting Transactions in LogTM (cont.)
• Closed commit:
  − Discard the inner transaction's header; restore the parent's log frame.
• Open commit:
  − First restore the signature from the parent's log frame.
• Partial abort:
  − Unroll the child transaction's log; restore the parent's signature.
  − Repeat until the conflict is resolved.

Summary
• Low latency and high bandwidth between processor cores add potential to CMPs.
• Locks are conservative:
  − Scalability.
  − Composability.
  − Performance vs. productivity.
• TM is an optimistic alternative to locks.
• The way to go for TM:
  − Transaction size & length (time) restrictions.
  − How to deal with uncancelable events (OS calls, I/O).
  − Implementing nesting effectively.
  − Binding to OS threads rather than hardware processors.
  − Performance pathologies.