CS 704 Advanced Computer Architecture Lecture 37 Multiprocessors
- Slides: 56
CS 704 Advanced Computer Architecture Lecture 37 Multiprocessors (Performance and Synchronization) Prof. Dr. M. Ashraf Chughtai
Today’s Topics Recap: Performance of Multiprocessors with – Symmetric Shared-Memory – Distributed Shared Memory Synchronization in Parallel Architecture Conclusion MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 2
Recap: Cache Coherence Problem So far we have discussed the sharing of caches for multi-processing in the: § symmetric shared-memory architecture § Distributed shared memory architecture We have studied cache coherence problem in symmetric and distributed sharedmemory multiprocessors; and have noticed that this problem is indeed performancecritical MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 3
Recap: Multiprocessor cache Coherence Last time we also studied the cache coherence protocols, which use different techniques to track the sharing status and maintain coherence without performance degrading These protocols are classified as: Snooping Protocols Directory-Based Protocols These protocols are implemented using a FSM controller MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 4
Recap: Snooping Protocols Snooping protocols employ write invalidate and write broadcast techniques Here, the block of memory is in one of the three states, and each cached-block tracks these three states; and the controller responds to the read/write request for a block of memory or cached block, both from the processor and from the bus MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 5
Recap: Implementation Complications of snoopy protocols The three states of the basic FSM are: Shared, Exclusive or Invalid However, the complications such as: write races, interventions and invalidation have been observed in the implementation of snoopy protocols; and to overcome these complications number of variations in the FSM controller have been suggested These variations are: MESI Protocol, Barkley Protocol and Illinois Protocol MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 6
Recap: Variations in snoopy protocols These variations resulted in four (4) states FSM controller – The states of MESI Protocol are: Modify, Exclusive, Shared and Invalid – The sates of Barkley Protocol are: Owned. Exclusive, Owned-Sheared, Shared and Invalid; and of – Illinois Protocol are: Private Dirty, Private clean, shared and Invalid MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 7
Recap: Directory based Protocols The larger multiprocessor systems employ distributed shared-memory , i. e. , a separate memory per processor is provided Here, the Cache Coherency is achieved using non-cached pages or directory containing information for every block in memory The directory-based protocol tracks state of every block in every cache and finds the …. . MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 8
Recap: Directory Based Protocol …… caches having copies of block being dirty or clean The directory-based protocol tracks state of every block in every cache and finds the caches having copies of block being dirty or clean Similar to the Snoopy Protocol, the directory-based protocol are implemented by FSM having three states: Shared, Uncached and Exclusive MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 9
Recap: Directory-based Protocol MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 10
Recap: Directory Based Protocols These protocols involve three processors or nodes, namely: local, home and remote nodes – Local node originates the request – Home node stores the memory location of an address – Remote node holds a copy of a cache block, whether exclusive or shared MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 11
Recap: Directory-based Protocol The transactions are caused by the messages such as: read misses, write misses, invalidates or data fetch requests These messages are sent to the directory to cause actions such as: update directory state and to satisfy requests The controller tracks all copies of memory block; and indicates an action that updates the sharing set MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 12
Example: Working of Finite State Machine Controller Now are going to discuss the state transition and messages generated by FSM controller in each state to implement the directory-based protocols. We consider an example distributed sharedmemory multiprocessor having two processors P 1 and P 2 where each processor has its own cache, memory and directory MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 13
Example: Working of Finite State Machine Controller Here, if the required data is not in the cache and is available in memory associated with the respective processor, then the state machine is said to be in Uncached state; and transition to other states is caused by messages such as: read miss, write miss, invalidates and data fetch request MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 14
Example: Dealing with read/write misses Processor 1 Processor 2 Interconnect Directory Memory P 2: Write 20 to A 1 and A 2 map to the same cache block MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 15
Example: Working of Finite State Machine Controller Let us assume that the initially the cache states are Uncached (i. e. , the block of data is in memory); and at the first step P 1 write 10 to address A 1, here the following three activities take place 1. The bus action is write miss and the processor P 1 places the address A 1 on the bus; 2. the data value reply message is sent to the controller, P 1 is inserted in the directory sharer-set {P 1}; and MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 16
Example: Working of Finite State Machine Controller 3. the state transition from Uncached to exclusive takes place – these operations are shown here in red color Processor 1 Processor 2 Interconnect Directory Memory P 2: Write 20 to A 1 MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 17
Example: Working of Finite State Machine Controller At Step 2 – P 1 reads A 1; CPU read HITs occurs, hence the FSM Stays in exclusive state Processor 1 Processor 2 Interconnect Directory Mem P 2: Write 20 to A 1 MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 18
Example: Working of Finite State Machine Controller At Step 3: P 2 reads A 1 i) read miss occurs on the bus as P 2 is initially in Uncached state; the controller states of P 1 and P 2 change from Uncached to Shared ii) P 1 being in Exclusive state, remote read write- back is asserted and the state changes from exclusive to Shared; and iii) the value (10) is read 1 from the shared- memory at address A 1, into P 1 and P 2 caches at A 1; and both P 1 and P 2 controllers are inserted in sharer-set {P 1, P 2} MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 19
Example: Working of FSM Controller Processor 1 Processor 2 Interconnect Directory Memory A 1 P 2: Write 20 to A 1 Write back A 1 and A 2 map to the same cache block MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 20
Example: Working of Finite State Machine Controller At Step 4: P 2 write 20 to A 2 i) As A 1 and A 2 maps to the same cache block; P 1 find a remote write, so the state of the controller changes from shared to Invalid ii) P 2 find a CPU write, so places write miss on the bus and changes the state from shared to exclusive and writes value 20 to A 1 iii) The director addresses to A 1 with sharer- set containing {P 2} MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 21
Example: working of FSM controller Step 4 Processor 1 Interconnect Processor 2 Memory Directory A 1 P 2: Write 20 to A 1 and A 2 map to the same cache block MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 22
Example: Working of Finite State Machine Controller At Step 5: P 2 write 40 to A 2 i) P 2 being in Exclusive state, P 2 write Miss at A 2 occurs ii) Director of A 2 is in exclusive state and places P 2 in the sharer-set {P 2} iii) P 2 write-back 20 at A 1 completes; the directory at A 1 is in Uncached state; the sharerset is empty and value 20 is placed in the memory iv) P 2 remains in Exclusive state, with address A 2 and value 40 MAC/VU-Advanced Computer Architecture Lec. 35 Multiprocessor (2) 23
Example . . Cont’d Processor 1 Processor 2 Interconnect Directory Memory A 1 P 2: Write 20 to A 1 and A 2 map to the same cache block MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 24
Performance of Multiprocessors Symmetric Shared-Memory Architecture In bus-based multiprocessor using an invalidation protocols, several phenomenon combine to determine performance: – Overall cache performance is combination of the behavior of the Uniprocessor cache miss-traffic and the traffic caused by the communication due to invalidation and subsequent cache miss – Changing processor count, cache size and block size effect these two components of miss rate MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 25
Performance of Multiprocessors Symmetric Shared-Memory Architecture. . Cont’d – The misses arising from inter-processor communication, called coherence misses, can be from two sources: – True Sharing – False sharing – True Sharing: The so-called true sharing misses arise from communication of data through cache -coherence mechanism Explanation: …… MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 26
Performance of Multiprocessors Symmetric Shared-Memory Architecture. . Cont’d Explanation – True Sharing: § The first write by processor to a shred cacheblock caused an invalidation to establish ownership of that block § When another processor attempts to read modified word, a miss occurs and the resultant block is transferred § Both the misses are classified as true-sharing misses, as they arise from the sharing of data MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 27
Performance of Multiprocessors Symmetric Shared-Memory Architecture. . Cont’d – False Sharing: it arise from the use of invalidationbase coherence algorithm with a single valid bit per cache block Explanation: § False sharing occurs when a block is invalidated and a subsequent reference causes a miss. i. e. , § the word being written and the word read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 28
Performance of Multiprocessors Symmetric Shared-Memory Architecture. . Cont’d – Here, the block is shared but no word in the block is shared and the miss would not occur is the block size were a single word Example of True and False Sharing: § Considering the previous example, assume the words A 1 and A 2 are in the same cache block, which is in the shared state in the caches of P 1 and P 2 § Let us identify the true-sharing miss and false sharing miss for the following sequence of events MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 29
Performance of Multiprocessors Symmetric Shared-Memory Architecture. . Cont’d Time 1 P 1 Write A 1 2 3 Read A 2 Write A 1 4 5 P 2 Write A 2 Read A 2 Explanation: ………………. MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 30
Performance of Multiprocessors Symmetric Shared-Memory Architecture. . Cont’d Explanation: Event 1: P 1 Write A 1 – is true sharing miss, since A 1 was read by P 2 and needs to be invalidated from P 2 Event 2: P 2 Read A 2 – is false sharing miss, since A 2 was invalidated by the write of A 1 in P 1, but the value of A 1 is not used in P 2 Event 3: P 1 Write A 1 – is false sharing miss; since the block containing A 1 is marked shared due to read in P 2; but P 2 did not read A 1 MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 31
Performance of Multiprocessors Symmetric Shared-Memory Architecture. . Cont’d Explanation Cont’d: Event 4: P 2 Write A 2 – is false sharing miss; since the block containing A 2 is marked shared due to read in P 2 (event 2); but P 2 did not Write A 2 Event 5: P 1 Read A 2 – is true sharing miss; since the value being read by P 2 was written by P 2 (in event 4) MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 32
Performance of Multiprocessors Distributed Shared-Memory Architecture The performance of directory-based multiprocessors depends on many of the same factors (such as processor count, cache size and block size etc. ) that influence the performance of bus-based multiprocessor In addition, the location of requested data item which depends on both the initial allocation and sharing pattern also influence the performance of distributed shared-memory architecture MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 33
Performance of Multiprocessors Distributed Shared-Memory Architecture. . Cont’d Here, the distribution of memory requests between local memory and remote memory is key to the performance, because it affects both the consumption of both global bandwidth and latency seen by the requests This can be visualized from these figures Here the cache misses are separated into the local and remote requests (Fig. 6. 31 – 6. 33 pp 585 - 587) MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 34
Performance of Multiprocessors Distributed Shared-Memory Architecture. . Cont’d The graphs for data miss rate vs. processor count obtained using benchmarks FFT, LU, Barnes and Ocean, show that miss rate is not affected much by the changes in processor count, except for Ocean where the miss rate rises at 64 processors Note that this rise is the result of increase in the local misses which is due to mapping conflicts and increase in the remote misses resulting from coherence misses (Fig. 6. 31 pp 585) MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 35
Performance of Multiprocessors Distributed Shared-Memory Architecture. . Cont’d The graphs for data miss rate vs. cache size, obtained using same benchmarks, show that miss rate decrease as cache size grow Note that there is a steady decrease in the local miss rate while the decline in the remote miss rate depend on coherence misses In all cases shown here, the decrease in the local miss rate is larger than the decrease in the remote miss rate (Fig. 6. 32 pp 586) MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 36
Performance of Multiprocessors Distributed Shared-Memory Architecture. . Cont’d The graphs for data miss rate vs. block size, obtained using same benchmarks, show that miss rate decrease as block size increases (Fig. 6. 33 pp 586) MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 37
Synchronization Why Synchronization? – While using multiprocessor architecture, we need to know when it is safe for different processes to use shared data – This is accomplished by using the synchronization mechanisms These mechanisms are built with user-level software routines that rely on the hardware supplied synchronization instructions MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 38
Synchronization For small multiprocessors Uninterruptable instruction are used to fetch and update memory which is referred to as the atomic operation For large scale multiprocessors, synchronization can be a bottleneck Several techniques have been proposed to reduce contention and latency of synchronization Here, we will examine the hardware primitives to implement synchronization and then construct synchronization routines MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 39
Hardware Primitives: Uninterruptable Instructions The basic requirement to implement synchronization in a multiprocessor is the set of hardware primitives with the ability to atomically read and modify a memory location, – i. e. , read and modify are performed in one step One typical operation that interchanges a value in a register for a value in memory is referred to as Atomic exchange MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 40
Hardware Primitives: Uninterruptable Instructions There are number of other atomic primitives that can be used to implement synchronization The key property of these atomic primitives is that they read and update a memory value atomically The other such operations used in many old multiprocessors is Test-and-Set and fetch-andincrement etc. Now let us understand how the atomic operation work? MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 41
Hardware Primitives: Uninterruptable Instructions – Atomic Exchange: To see how we can use this primitive to build synchronization, let us assume we want to build a simple lock where 0 indicates that lock is free; and 1 indicates that lock is unavailable To implement synchronization, a processor tries to set the lock by exchange of 1, which is in the register, with the memory address corresponding to the lock The value returned from the exchange instruction is 1 if some other processor had already claimed access, otherwise the value returned is 0; i. e. , MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 42
Hardware Primitives: Uninterruptable Instructions The synchronization is locked and unavailable if some other processor had already claimed access; otherwise the value returned is 0 In the later case, where the value returned is 0, the value is changed to 1, preventing any competing exchange from also retrieving 0 Example: – Consider two processors trying to exchange simultaneously – This race is broken when one of the processor exchange first and returns 0, and the second … MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 43
Hardware Primitives: Uninterruptable Instructions … processor will return 1 when it does the exchange Test-and-set: tests a value and sets it if the value passes the test Fetch-and-increment: it returns the value of a memory location and atomically increments it – Key to the atomic operations is that each operation is indivisible MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 44
Uninterruptable Instructions … Cont’d Implementing a single atomic instruction in hardware is complex and is hard to have read & write in one instruction; therefore In the recent multiprocessor pair of instructions is used – the two instructions are: – Load linked (or load locked) and – store conditional Here, the second instruction returns a value from which it can be deduced as if the instruction were executed as atomic MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 45
Uninterruptable Instructions … Cont’d Note that – Load linked (LL) returns the initial value – Store conditional (SC) returns 1 if it succeeds (no other store to same memory location since proceeding load) and 0 otherwise These instructions are used in sequence: – If the contents of memory location, specified by the LL are changed before the SC to the same address occurs, then the SC fails – The store conditional returns a value 1 or 0 indicating whether the SC was successful or not MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 46
Uninterruptable Instructions … Cont’d Let us consider an example program segment showing implementation of atomic exchange on memory location specified by the contents of register R 1 Example doing atomic swap with LL & SC: try: MOV R 3, R 4 ; mov exchange value ll R 2, 0(R 1) ; load linked sc R 3, 0(R 1) ; store conditional beqz R 3, try ; branch store fails (R 3 = 0) mov R 4, R 2 ; put load value in R 4 At the end of this sequence, the contents of R 4 and memory location specified by R 1 have been atomically exchanged MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 47
Uninterruptable Instructions … Cont’d The LL –SC primitive can be used to build other primitives, e. g. , the atomic fetch and increment can be constructed as: Example doing fetch & increment with LL & SC: try: ll addi sc beqz R 2, 0(R 1) R 2, #1 R 2, 0(R 1) R 2, try ; load linked ; increment (OK if reg–reg) ; store conditional ; branch store fails (R 2 = 0) As the SC instruction simply checks that its address matches that in the link register, therefore, register-register instructions can safely be place after the LL instruction; however, the number of instructions in between LL and SC must be kept small MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 48
Summary In this series of four lectures on multiprocessors we have studied how improvement in computer performance can be accomplished using Parallel Processing Architectures Parallel Architecture is a collection of processing elements that cooperate and communicate to solve larger problems fast Then we described the four categories of Parallel Architecture as: SISD, SIMD, MISD and MIMD architecture MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 49
Summary We noticed that based on the memory organization and interconnect strategy, the MIMD machines are classified as: – Centralized Shared Memory Architecture – Distributed Memory Architecture We also introduced the framework to describe parallel architecture as a two layer representation: Programming and Communication models MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 50
Summary We talked about sharing of caches for multiprocessing in the symmetric shared-memory architecture in details Here, we studied the cache coherence problem and introduced two methods, write invalidation and write broadcasting schemes, to resolve the problem We also discussed the finite state machine for the implementation of snooping algorithm MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 51
Summary Today we have discussed FSM controller to implement Directory Based Protocols which involve three processors or nodes, namely: local, home and remote nodes We discussed the state transition and messages generated by FSM controller in each state to implement the directory-based protocols We have also discussed in details the performance of distributed and centralized shared -memory architecture MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 52
Summary Concluding our discussion on the multiprocessor, we can say that multiprocessors are highly effective for multi-programmed work loads More recently, multiprocessors have proved very effective for commercial workloads such as web searching The centralized memory architecture, also known as Symmetric Multiprocessors (SMPs) maintain a single centralized memory with uniform access time; while …… MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 53
Conclusion In contrast, the Distributed Shared-Memory Multiprocessor (DSMs) have non uniform memory architecture and can achieve greater scalability The advantages of these two architecture, i. e. , maximizing uniform memory access while allowing greater scalability can be partially combined in the Sun Microsystems's Wildfire architecture, shown here – (Fig. 6. 48 pp 623) MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 54
Conclusion Here, note that large SMPs (such as E 6000) are used as nodes to maximize uniform memory access and greater scalability is achieved by using Wildfire Interface (WFI) Each E 6000 can accept up to 15 processors or I/O Boards on Giga-plane bus interconnect WFI can connect 2 or 4 E 6000 multiprocessors by replacing one I/O board with WFI board You may look into further details of the Sun Microsystems's Wildfire architecture from literature and study its performance MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 55
Thanks and Allah Hafiz MAC/VU-Advanced Computer Architecture Lec. 37 Multiprocessor (4) 56
- Computer architecture lecture notes
- Isa definition computer
- Characteristics of multiprocessors
- Characteristics of vector processing
- Uma multiprocessors using multistage switching networks
- Uma multiprocessors using crossbar switches
- Review of fundamentals of cpu
- Odbern
- Afi 36 704
- 704-631-1500
- Mc-338 cryogenic tankers
- Target allocation partnership
- Chemical grade
- Iso 704
- 704 kar 3:305
- 704 error
- 01:640:244 lecture notes - lecture 15: plat, idah, farad
- Advanced inorganic chemistry lecture notes
- Three bus architecture
- Difference between computer architecture and organization
- Basic computer design
- Computer security 161 cryptocurrency lecture
- Computer-aided drug design lecture notes
- Advanced topics in computer science
- Craig reinhart
- Advanced computer forensics
- Fastbloc se
- Software architecture definitions
- Call and return architecture
- Integral product architecture example
- Modular product architecture
- Computer organization and architecture 10th solution
- Fast desktop for architecture software
- Virtual labs iit kharagpur
- Introduction to computer organization and architecture
- Timing and control in computer architecture
- Evolution and interpretation of computer architecture
- I/o modules in computer architecture
- Fp adder
- Lmc addressing modes
- Chordal ring
- Smt in computer architecture
- Pseudo instructions
- 111011-100100
- Instruction format in computer architecture
- Nano programming
- Microprogrammed control unit in computer architecture
- Memory system
- Virtual memory tlb
- Linear pipeline
- Computer architecture definition
- Define parallelism in computer architecture
- Computer architecture number system
- Computer architecture definition
- Instruction set architecture
- What are the input devices
- Branch prediction in computer architecture