CS 152 Computer Architecture and Engineering Lecture 14
- Slides: 64
CS 152 Computer Architecture and Engineering, Lecture 14: Cache Design and Coherence. 2014-3-6. John Lazzaro (not a prof - "John" is always OK). TA: Eric Love. www-inst.eecs.berkeley.edu/~cs152/ CS 152 L14: Cache Design and Coherency UC Regents Spring 2014 © UCB
Today: Shared Cache Design and Coherence. CPU multi-threading: keeps the memory system busy. Crossbars and rings: how to do on-chip sharing. Concurrent requests: interfaces that don't stall. Coherency protocols: building coherent caches. [Diagram: CPUs with private caches above shared caches, DRAM, shared ports, and I/O.]
Multithreading: Sun Microsystems Niagara series
The case for multithreading: some applications spend their lives waiting for memory (C = compute, M = waiting). Amdahl's Law tells us that optimizing C is the wrong thing to do. Idea: create a design that can multiplex threads onto one pipeline. Goal: maximize the throughput of a large number of threads.
Multi-threading, assuming perfect caches: 4 CPUs running @ 1/4 clock (S. Cray, 1962). Labels show this state: T4 T3 T2 T1.
The bypass network is no longer needed. Result: the critical path shortens -- can trade for speed or power. [Pipeline diagram: ID (Decode), EX, MEM, and WB stages with IR registers; WE and Mem.To.Reg mux logic driven from WB.]
Multi-threading, supporting cache misses: a thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss penalty period.
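The scheduling policy above can be sketched in a few lines. This is a minimal, hypothetical model (class and method names are assumptions, not Niagara's actual design): a round-robin scheduler that skips any thread with an outstanding miss.

```python
class ThreadScheduler:
    """Toy round-robin thread scheduler: a thread leaves the rotation
    on a cache miss and rejoins when the miss data returns."""

    def __init__(self, n_threads):
        self.ready = [True] * n_threads  # False while a miss is outstanding
        self.next = 0
        self.n = n_threads

    def miss(self, tid):
        """Cache miss: take the thread off the pipeline."""
        self.ready[tid] = False

    def fill(self, tid):
        """Miss data returned: thread rejoins the rotation."""
        self.ready[tid] = True

    def pick(self):
        """Pick the next ready thread to issue, round-robin; None if all stalled."""
        for i in range(self.n):
            tid = (self.next + i) % self.n
            if self.ready[tid]:
                self.next = (tid + 1) % self.n
                return tid
        return None
```

With 8 threads per core, the goal is that `pick()` almost never returns `None`: some thread is ready to issue on every cycle even while others wait on memory.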
Sun Niagara II: how many threads per core? 8 threads/core: enough to keep one core busy, given the clock speed, memory system latency, and target application characteristics.
Crossbar Networks
Shared memory: CPUs share the lower levels of the memory system and I/O. Common address space, one operating system image. Communication occurs through the memory system (100 ns latency, 20 GB/s bandwidth). [Diagram: CPUs with private caches above shared caches, DRAM, shared ports, and I/O.]
Sun's Niagara II: single-chip implementation. SPC == SPARC Core. Only DRAM is not on chip.
Crossbar: like N ports on an N-register file. Flexible, but reads slow down as O(N²). Why? The number of loads on each write-data wire grows as O(N), and the wire length to each read-port mux also grows as O(N). [Register-file diagram: registers R0-R31 (R0 is the constant 0) with write-enable decode from sel(ws), and two 32-to-1 read-port muxes selected by sel(rs1) and sel(rs2), driving rd1 and rd2.]
Design challenge: a high-performance crossbar. Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Apps are locality-poor. Goal: saturate DRAM bandwidth. Each DRAM channel: 50 GB/s read, 25 GB/s write bandwidth. Crossbar bandwidth: 270 GB/s total (read + write).
Sun Niagara II 8x9 crossbar: a tri-state distributed mux, as in the microcode talk. Every crossing of blue and purple is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded).
Sun Niagara II 8x9 crossbar: 8 ports on the CPU side (one per core), plus 8 ports for the L2 banks and one for I/O. 100-200 wires/port (each way). 4-cycle latency (715 ps/cycle): cycles 1-3 are for arbitration; data transmits on cycle 4. Pipelined.
A complete switch transfer (4 epochs). Epoch 1: all input ports (that are ready to send data) request an output port. Epoch 2: the allocation algorithm decides which inputs get to write. Epoch 3: the allocation system informs the winning inputs and outputs. Epoch 4: the actual data transfer takes place. Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, each for a different set of requests.
Epoch 3: the allocation problem (4x4). Input ports (A, B, C, D) request output ports (W, X, Y, Z); a 1 codes that an input has data ready to send to an output. The allocator returns a matrix with at most one 1 in each row and column, used to set the switches. The algorithm should be "fair", so no port always loses, and should also "scale" to run large matrices fast. [Figure: a 4x4 request matrix and the resulting grant matrix.] (Slide from CS 152 L21: Networks and Routers, UC Regents Fall 2006 © UCB.)
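A minimal sketch of such an allocator (an assumption for illustration, much simpler than real hardware arbiters like wavefront or iSLIP allocators): a greedy one-pass match, with a rotating starting input so that no port always loses.

```python
def allocate(requests, start=0):
    """Greedy one-pass allocator. requests[i][j] == 1 means input i has
    data ready for output j. Returns a grant matrix with at most one 1
    in each row and column. Rotating `start` across cycles gives a
    crude form of fairness."""
    n, m = len(requests), len(requests[0])
    grants = [[0] * m for _ in range(n)]
    out_free = [True] * m
    for k in range(n):
        i = (start + k) % n          # rotate priority among inputs
        for j in range(m):
            if requests[i][j] and out_free[j]:
                grants[i][j] = 1     # grant this input-output pair
                out_free[j] = False
                break                # each input gets at most one output
    return grants
```

For example, if inputs A and C both request output W, only the one earlier in the current rotation wins; the loser retries in a later cycle, when the rotation favors it.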
Sun Niagara II crossbar notes: low latency, 4 cycles (less than 3 ns). Uniform latency between all port pairs. The crossbar defines the floorplan: all port devices should be equidistant to the crossbar.
Sun Niagara II energy facts: the crossbar is only 1% of total power.
Sun Niagara II crossbar notes, continued: the crossbar did not scale up for the 16-core Rainbow Falls, which keeps the 8x9 crossbar and shares each CPU-side port between two cores. Design alternatives to . . .
CLOS networks, from the telecom world: build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers fit naturally between tiles. Trades latency for scalability.
CLOS networks, an example route: numbers on the left and right are port numbers. Colors show the routing paths for an exchange. Arbitration is still needed to prevent blocking.
Ring Networks
Intel Xeon data center server chip: 20% of Intel's revenues, 40% of its profits. Why? The cloud is growing, and Xeon is dominant.
Compiled chips: Xeon is a chip family, varying by the # of cores and L3 cache size. The family's mask layouts are generated automatically, by adding core/cache slices. [Die photo: Ring Bus]
A bi-directional ring bus connects the cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but compared to the baseline latency it is small. [Die photo: Ring Stop]
A 2.5 MB L3 cache slice from the Xeon E5. Tiles along the x-axis are the 20 ways of the cache. The ring-stop interface lives in the Cache Control Box (CBOX).
Ring bus (perhaps 1024 wires), with address, data, and header fields (sender #, recipient #, command). [Diagram: Ring Stops #1, #2, #3; the Ring Stop #2 interface has Data Out, Data In, Empty, and Control signals.] Reading: sense Data Out to see if the message is for Ring Stop #2; if so, latch the data and mux Empty onto the ring. Writing: check if Data Out is Empty; if so, mux a message onto the ring via the Data In port.
In practice: "extreme EE" to co-optimize bandwidth and reliability.
Debugging: a "network analyzer" built into the chip captures ring messages of a particular kind, and sends them off chip via an aux port.
A derivative of this ring bus is also used on laptop and desktop chips.
Break
Hit-over-Miss Caches
Recall: a CPU-cache port that doesn't stall on a miss. The CPU makes a request by placing the following items in Queue 1 (From CPU): CMD: read, write, etc. MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit. TAG: a 9-bit number identifying the request. MADDR: the memory address of the first byte. STORE-DATA: for stores, the data to store.
This cache is used in an ASPIRE CPU (Rocket). When the request is ready, the cache places the following items in Queue 2 (To CPU): TAG: the identity of the completed command. LOAD-DATA: for loads, the requested data. The CPU saves info about requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order.
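The tagged-queue idea can be sketched as follows. This is a toy model with assumed field names (`cmd`, `tag`, `maddr`, `load_data`), not the Rocket interface itself: the CPU records each in-flight request by TAG, so responses can be matched up even when they return out of order.

```python
from collections import deque

queue1, queue2 = deque(), deque()  # Queue 1: CPU -> cache; Queue 2: cache -> CPU
in_flight = {}                     # TAG -> saved request info, kept by the CPU

def send_read(tag, addr, mtype="32-bit"):
    """CPU side: issue a read and remember it by TAG."""
    assert tag not in in_flight    # CPU promises not to reuse in-flight tags
    in_flight[tag] = {"addr": addr, "mtype": mtype}
    queue1.append({"cmd": "read", "tag": tag, "maddr": addr, "mtype": mtype})

def receive():
    """CPU side: match a completed response back to its request,
    regardless of the order responses arrive in."""
    resp = queue2.popleft()
    req = in_flight.pop(resp["tag"])
    return req["addr"], resp["load_data"]
```

Because the lookup is by TAG rather than by issue order, two misses can be outstanding at once and the second can complete first.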
Today: how a read request proceeds in the L1 D-cache. The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1. ("We" == the L1 D-cache controller.) We do a normal cache access; if there is a hit, we place the load result in Queue 2. In the case of a miss, we use the inverted Miss Status Holding Register.
Inverted MSHR (Miss Status Holding Register), part 1. To look up a memory address: associatively look up the block # of the memory address in the table. If there are no hits, do a memory request. [Diagram: a 512-entry table, so that every 9-bit TAG value has an entry; each entry holds a valid bit, the cache block #, MTYPE, and the 1st byte in the block, with a per-entry comparator (valid qualifies hit) and a Tag ID ROM.] Assumptions: 32-byte blocks, 48-bit physical address space.
Inverted MSHR, part 2. Index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V=0, because the CPU promises not to reuse in-flight tags. [Diagram: as in part 1.] Assumptions: 32-byte blocks, 48-bit physical address space.
Inverted MSHR, part 3. Whenever the memory system returns data, associatively look up the block # to find all pending transactions. Place the transaction data for all hits in Queue 2, and clear the valid bits. Also update the L1 cache. [Diagram: as in part 1.] Assumptions: 32-byte blocks, 48-bit physical address space.
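The three steps above can be sketched in software. This is a behavioral model under the slide's assumptions (512 entries, 32-byte blocks); the hardware does the `fill` search associatively with 512 comparators in parallel, not with a loop.

```python
BLOCK_BYTES = 32                     # 32-byte blocks, per the slide's assumptions

# 512-entry table, so that every 9-bit TAG value has an entry.
mshr = [{"valid": False} for _ in range(512)]

def miss(tag, maddr, mtype):
    """Parts 1-2: index by TAG and fill in the entry from MADDR and MTYPE."""
    entry = mshr[tag]                # always finds valid == False:
    assert not entry["valid"]        # the CPU never reuses in-flight tags
    entry.update(valid=True,
                 block=maddr // BLOCK_BYTES,   # cache block #
                 offset=maddr % BLOCK_BYTES,   # 1st byte in block
                 mtype=mtype)
    # ...a memory request for this block would be issued here if not already pending

def fill(block_number):
    """Part 3: memory returned this block; complete every pending transaction
    on it (associative search in hardware), clearing the valid bits."""
    done = []
    for tag, e in enumerate(mshr):
        if e["valid"] and e["block"] == block_number:
            done.append((tag, e["offset"], e["mtype"]))
            e["valid"] = False
    return done                      # these (TAG, ...) results feed Queue 2
```

Note how two misses to different bytes of the same block both complete on one fill, while a miss to another block stays pending.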
Inverted MSHR notes: structural hazards occur only when the TAG space is exhausted by the CPU. High cost (# of comparators + SRAM cells); see Farkas and Jouppi on the class website for low-cost designs that are often good enough. We will return to MSHRs to discuss CPI performance later in the semester.
Coherency Hardware
Cache Placement
Two CPUs, two caches, shared DRAM. CPU 0: LW R2, 16(R0). CPU 1: LW R2, 16(R0). Then CPU 1: SW R0, 16(R0). With write-through caches, the view of memory is no longer "coherent": loads of location 16 from CPU 0 and CPU 1 see different values! [Figure: CPU 0's cache still holds (addr 16, value 5), while shared main memory holds (addr 16, value 0) after CPU 1's store.]
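The lost coherence can be shown in a few lines, with each cache modeled as a dict over the shared memory (the concrete values 5 and 0 follow the figure; R0 is the constant 0, so the SW stores 0):

```python
# Two write-through caches over one shared memory, with no coherence protocol.
memory = {16: 5}
cache0, cache1 = {}, {}

cache0[16] = memory[16]   # CPU 0: LW R2, 16(R0)  -> caches the value 5
cache1[16] = memory[16]   # CPU 1: LW R2, 16(R0)  -> caches the value 5

# CPU 1: SW R0, 16(R0) -- write-through updates CPU 1's cache and memory...
cache1[16] = 0
memory[16] = 0

# ...but CPU 0's cached copy is now stale: the two CPUs disagree on location 16.
```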
The simplest solution: one cache! The CPUs do not have internal caches; with only one cache, different values for a memory address cannot appear in 2 caches! Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one . . . [Diagram: CPU 0 and CPU 1 above a memory switch, a shared multi-bank cache, and shared main memory.]
Not a complete solution, but good for L2 (Sequent Systems, 1980s). For modern clock rates, access to a shared cache through the switch takes 10+ cycles; using the shared cache as the L1 data cache is tantamount to slowing the clock down 10X for LWs. Not good. This approach was a complete solution in the days when DRAM row access time and the CPU clock period were well matched.
Modified form: private L1s, shared L2. Thus, we need to solve the cache coherency problem for the L1 caches. Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data/instructions. Disadvantage: the CPUs share bandwidth to the L2 cache.
IBM Power 4 (2001): dual core, with a shared multi-bank L2 cache, private L1 caches, and off-chip L3 caches.
Cache Coherency
Cache coherency goals: 1. Only one processor at a time has write permission for a memory location. 2. No processor can load a stale copy of a location after a write. [Figure: CPU 0's cache holds (addr 16, value 5) while the shared memory hierarchy holds (addr 16, value 0).]
Simple implementation: snoopy caches. Each cache has the ability to "snoop" on the memory bus transactions of the other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate the cache lines of other CPUs.
Writes from 10,000 feet, for write-through L1 caches: 1. The writing CPU takes control of the bus. 2. The address to be written is invalidated in all other caches, so reads will no longer hit in the cache and get stale data. 3. The write is sent to main memory; reads will cache miss and retrieve the new value from main memory. To first order, reads will "just work" if write-through caches implement this policy: a "two-state" protocol (cache lines are "valid" or "invalid").
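The two-state policy can be sketched directly (a toy model; `SnoopyCache`, `bus_write`, and `read` are illustrative names, and real snooping happens in hardware on the bus, not via function calls):

```python
class SnoopyCache:
    """A write-through cache with two line states: present (valid) or absent."""

    def __init__(self):
        self.lines = {}                 # addr -> value, for valid lines only

    def snoop_invalidate(self, addr):
        """Snooped another CPU's write to addr: drop our stale copy."""
        self.lines.pop(addr, None)

def bus_write(writer, caches, memory, addr, value):
    """Steps 1-3: the writer owns the bus, other caches invalidate,
    and the write goes through to main memory."""
    for c in caches:
        if c is not writer:
            c.snoop_invalidate(addr)    # step 2: invalidate other copies
    writer.lines[addr] = value
    memory[addr] = value                # step 3: write-through to memory

def read(cache, memory, addr):
    """A read after an invalidation misses and fetches the new value."""
    if addr not in cache.lines:
        cache.lines[addr] = memory[addr]
    return cache.lines[addr]
```

Rerunning the earlier two-CPU example under this policy, CPU 0's stale copy is invalidated by CPU 1's write, so its next load misses and sees the new value.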
Limitations of the write-through approach: every write goes to the bus, and total bus write bandwidth does not support more than 2 CPUs in modern practice. To scale further, we need to use write-back caches. The write-back big trick: add extra states. The simplest version is MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). The state definitions are subtle . . .
Figure 5.5, page 358, is the best starting point.
Read misses, for a MESI protocol (write-back caches): 1. A cache requests a cache-line fill for a read miss. 2. Another cache with an exclusive copy of this line responds with fresh data, so the read miss will not hit main memory and retrieve stale data. 3. The responding cache changes the line from exclusive to modified; future writes will go to the bus to be . . . These sketches are just to give you a sense of how coherency protocols work. Deep understanding requires understanding the complete "state . . .
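As a complement to the sketch, here is a snoop-side transition table for the common textbook MESI protocol (as in Figure 5.5 of Hennessy and Patterson). This is an assumption-laden simplification: real protocols add transient states, and details vary between descriptions.

```python
# (current state, snooped bus event) -> (next state, responder action)
# States: M = Modified, E = Exclusive, S = Shared, I = Invalid.
MESI = {
    ("M", "remote_read"):  ("S", "supply data, write back"),
    ("E", "remote_read"):  ("S", "supply data"),
    ("S", "remote_read"):  ("S", None),
    ("I", "remote_read"):  ("I", None),
    ("M", "remote_write"): ("I", "supply data, write back"),
    ("E", "remote_write"): ("I", None),
    ("S", "remote_write"): ("I", None),
    ("I", "remote_write"): ("I", None),
}

def snoop(state, event):
    """Next state and action when a cache snoops a bus event for a line."""
    return MESI[(state, event)]
```

The key property to notice: after any remote write the line is Invalid everywhere else, preserving the single-writer goal from the coherency-goals slide.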
The snoopy mechanism doesn't scale. Single-chip implementations have moved to a centralized "directory" service that tracks the status of each line of each private cache. Multi-socket systems use distributed directories.
Directories attached to on-chip cache network. . .
A 2-socket system: each socket is a multi-core chip, and each chip has its own bank of DRAM.
Distributed directories for multi-socket systems. [Diagram: L1 and L2 caches on each chip, with a directory for Chip 0's DRAM and a directory for Chip 1's DRAM.]
Figure 5.21, page 381, covers directory message basics. Conceptually similar to snoopy caches, but the different mechanisms require rethinking the protocol to get correct behaviors.
Other Machine Architectures
NUMA: Non-Uniform Memory Access. Each CPU has part of main memory attached to it; to access other parts of main memory, it uses the interconnection network. The network uses a coherent global address space, with directory protocols over fiber networking. For best results, applications take the non-uniform memory latency into account. [Diagram: CPU 0 . . . CPU 1023, each with a cache and local DRAM, joined by an interconnection network.]
Clusters: the supercomputing version of WSCs. Connect large numbers of 1-CPU or 2-CPU rack-mount computers together with high-end network technology (not normal Ethernet). Instead of using hardware to create a shared-memory abstraction, let the application build its own memory model. (University of Illinois: a 650-node, 2-CPU Apple Xserve cluster, connected with Myrinet; 3.5 μs ping time - low latency.)
On Tuesday we return to CPU design. Have a good weekend!