CS 152 Computer Architecture and Engineering Lecture 14

CS 152 Computer Architecture and Engineering, Lecture 14: Cache Design and Coherence. 2014-3-6. John Lazzaro (not a prof - "John" is always OK). TA: Eric Love. www-inst.eecs.berkeley.edu/~cs152/ UC Regents Spring 2014 © UCB

Today: Shared Cache Design and Coherence. CPU multi-threading: keeps the memory system busy. Crossbars and rings: how to do on-chip sharing. Concurrent requests: interfaces that don't stall. Coherency protocols: building coherent caches. [Diagram: CPUs with private caches, shared caches, DRAM, shared ports, and I/O.]

Multithreading: Sun Microsystems Niagara series

The case for multithreading: some applications spend their lives waiting for memory (C = compute, M = waiting on memory). Amdahl's Law tells us that optimizing C is the wrong thing to do. Idea: create a design that can multiplex threads onto one pipeline. Goal: maximize the throughput of a large number of threads.

Multi-threading, assuming perfect caches: 4 CPUs running @ 1/4 clock (S. Cray, 1962). [Pipeline diagram: labels T4, T3, T2, T1 show which thread occupies each stage in this state.]

The bypass network is no longer needed. Result: the critical path shortens, which can be traded for speed or power. [Datapath diagram: ID (Decode), EX, MEM, and WB stages with IR pipeline registers; WE and MemToReg mux logic driven from WB.]

Multi-threading: supporting cache misses. A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss penalty period, as sketched below. [Diagram: thread scheduler feeding the pipeline.]
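A minimal behavioral sketch of that scheduling idea, in Python rather than hardware. The round-robin policy and the miss/miss_resolved callbacks are illustrative assumptions, not the Niagara implementation.

```python
# Hypothetical sketch: round-robin thread scheduler that skips threads
# stalled on a cache miss. For intuition only, not the Niagara design.

class ThreadScheduler:
    def __init__(self, n_threads):
        self.ready = [True] * n_threads   # True = eligible to issue
        self.last = n_threads - 1         # last thread that issued

    def miss(self, tid):
        """Called when thread `tid` takes a cache miss: park it."""
        self.ready[tid] = False

    def miss_resolved(self, tid):
        """Called when the fill returns: the thread may issue again."""
        self.ready[tid] = True

    def next_thread(self):
        """Pick the next ready thread, round-robin; None if all are parked."""
        n = len(self.ready)
        for step in range(1, n + 1):
            tid = (self.last + step) % n
            if self.ready[tid]:
                self.last = tid
                return tid
        return None  # every thread is waiting on memory
```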

Sun Niagara II: how many threads per core? 8 threads/core: enough to keep one core busy, given the clock speed, memory system latency, and target application characteristics.
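A back-of-the-envelope way to see where a number like 8 comes from. The run-length and miss-penalty figures below are made-up assumptions chosen for illustration, not Niagara II measurements.

```python
# Hypothetical arithmetic: how many threads hide a memory latency?
cycles_between_misses = 25    # assumed average run length per thread
miss_penalty_cycles   = 175   # assumed DRAM round trip, in core cycles

# While one thread waits ~175 cycles, other threads can each run ~25
# cycles; you need roughly 1 + penalty/run_length threads to keep the
# pipeline busy.
threads_needed = 1 + miss_penalty_cycles / cycles_between_misses
print(threads_needed)   # 8.0 under these assumptions
```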

Crossbar Networks

Shared memory: CPUs (each with a private cache) share the lower levels of the memory system (shared caches, DRAM) and I/O through shared ports. Common address space, one operating system image. Communication occurs through the memory system (about 100 ns latency, 20 GB/s bandwidth).

Sun's Niagara II: single-chip implementation. SPC == SPARC Core. Only DRAM is not on chip.

Crossbar: like N ports on an N-entry register file. Flexible, but reads slow down as O(N²). Why? The number of loads on each write-data line grows as O(N), and the wire length to each read-port mux grows as O(N). [Register-file schematic: registers R0 (the constant 0) through R31, write-enable decode of sel(ws), and read-port muxes selected by sel(rs1) and sel(rs2).]

Design challenge: a high-performance crossbar. Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Apps are locality-poor, so the goal is to saturate DRAM bandwidth. Each DRAM channel: 50 GB/s read, 25 GB/s write bandwidth. Crossbar bandwidth: 270 GB/s total (read + write).

Sun Niagara II 8 x 9 crossbar: a tri-state distributed mux, as in the microcode talk. Every cross of a blue and a purple wire in the figure is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded).

Sun Niagara II 8 x 9 crossbar: 8 ports on the CPU side (one per core); 8 ports for the L2 banks, plus one for I/O. 100-200 wires per port (each way). 4-cycle latency (715 ps/cycle): cycles 1-3 are for arbitration, and data transmits on cycle 4. Pipelined.

A complete switch transfer (4 epochs). Epoch 1: all input ports that are ready to send data request an output port. Epoch 2: the allocation algorithm decides which inputs get to write. Epoch 3: the allocation system informs the winning inputs and outputs. Epoch 4: the actual data transfer takes place. Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, each working on a different set of requests.

Epoch 3: the allocation problem (4 x 4). Input ports (A, B, C, D) request output ports (W, X, Y, Z); a 1 in the request matrix codes that an input has data ready to send to an output. The allocator returns a grant matrix with at most one 1 in each row and column, which sets the switches. The algorithm should be "fair", so that no port always loses, and should also "scale" to run large matrices fast. A toy allocator along these lines is sketched below. [Figure: example 4 x 4 request and grant matrices. Slide reused from CS 152 L21: Networks and Routers, UC Regents Fall 2006.]
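To make the row/column constraint concrete, here is a toy greedy allocator in Python. The rotating-priority scheme is one simple way to get rough fairness; it is a sketch under those assumptions, not the Niagara II arbitration logic.

```python
# Hypothetical greedy allocator: grants at most one output per input and
# one input per output, rotating priority across calls for rough fairness.

def allocate(requests, start=0):
    """requests[i][j] == 1 if input i wants output j.
    Returns a grant matrix of the same shape, plus the next start offset."""
    n_in, n_out = len(requests), len(requests[0])
    grants = [[0] * n_out for _ in range(n_in)]
    out_taken = [False] * n_out
    for k in range(n_in):
        i = (start + k) % n_in            # rotate which input goes first
        for j in range(n_out):
            if requests[i][j] and not out_taken[j]:
                grants[i][j] = 1          # at most one 1 in row i
                out_taken[j] = True       # at most one 1 in column j
                break
    return grants, (start + 1) % n_in

# Example: inputs A..D requesting outputs W..Z
reqs = [[0, 1, 0, 1],
        [1, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 1]]
g, nxt = allocate(reqs, start=0)
for row in g:
    print(row)
```

Real hardware allocators (separable or wavefront designs, for example) make these choices with parallel arbiters in a few gate delays rather than a sequential scan.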

Sun Niagara II crossbar notes: low latency, 4 cycles (less than 3 ns); uniform latency between all port pairs. The crossbar defines the floorplan: all port devices should be equidistant to the crossbar.

Sun Niagara II energy facts: the crossbar accounts for only 1% of total power.

Sun Niagara II crossbar notes: low latency, 4 cycles (less than 3 ns); uniform latency between all port pairs. The crossbar defines the floorplan: all port devices should be equidistant to the crossbar. It did not scale up for the 16-core Rainbow Falls: Rainbow Falls keeps the 8 x 9 crossbar and shares each CPU-side port between two cores. Design alternatives to the crossbar are covered next.

CLOS networks, from the telecom world: build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers fit naturally between tiles. Trades latency for scalability.

CLOS networks, an example route: the numbers on the left and right are port numbers, and the colors show the routing paths for an exchange. Arbitration is still needed to prevent blocking.

Ring Networks

Intel Xeon data center server chip: 20% of Intel's revenues, 40% of its profits. Why? The cloud is growing, and Xeon is dominant.

Compiled chips: Xeon is a chip family, varying by number of cores and L3 cache size. The family's mask layouts are generated automatically, by adding core/cache slices. [Die plot with the Ring Bus labeled.]

The bi-directional ring bus connects cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the number of cores. Ring latency increases with the number of cores, but the increase is small compared to the baseline latency. [Figure label: Ring Stop]

2.5 MB L3 cache slice from a Xeon E5. Tiles along the x-axis are the 20 ways of the cache. The ring-stop interface lives in the Cache Control Box (CBOX).

Ring bus (perhaps 1024 wires), with address, data, and header fields (sender #, recipient #, command). [Diagram: Ring Stops #1, #2, and #3 on the ring; the Ring Stop #2 interface has Data Out, Data In, and Control ports.] Reading: sense Data Out to see if the message is for Ring Stop #2; if so, latch the data and mux Empty onto the ring. Writing: check if Data Out is Empty; if so, mux a message onto the ring via the Data In port. A behavioral sketch of these two rules appears below.
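Below is a small behavioral model of those two rules in Python. The slot format, the `dest` field name, and the single-slot-per-cycle framing are illustrative assumptions; the real interface is wires and muxes, not objects.

```python
# Hypothetical ring-stop model: one slot passes a stop per cycle.
# A slot is either None (Empty) or a dict like
# {"dest": 2, "src": 0, "cmd": "read", "data": ...}.

EMPTY = None

class RingStop:
    def __init__(self, stop_id):
        self.stop_id = stop_id
        self.inbox = []          # messages delivered to this stop
        self.outbox = []         # messages waiting to be sent

    def cycle(self, slot_in):
        """Apply the read rule, then the write rule; return the outgoing slot."""
        # Reading: if the arriving slot is addressed to us, latch it and
        # put Empty back on the ring.
        if slot_in is not EMPTY and slot_in["dest"] == self.stop_id:
            self.inbox.append(slot_in)
            slot_in = EMPTY
        # Writing: if the slot is Empty and we have something to send,
        # mux our message onto the ring.
        if slot_in is EMPTY and self.outbox:
            slot_in = self.outbox.pop(0)
        return slot_in

# Tiny usage example: stop 0 sends a message around to stop 2.
stops = [RingStop(i) for i in range(3)]
stops[0].outbox.append({"dest": 2, "src": 0, "cmd": "read", "data": 0x10})
slot = EMPTY
for _ in range(6):                      # circulate a single slot
    for s in stops:
        slot = s.cycle(slot)
print(stops[2].inbox)                   # the message arrives at stop 2
```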

In practice: "extreme EE" to co-optimize bandwidth and reliability.

Debugging: a "network analyzer" built into the chip captures ring messages of a particular kind and sends them off chip via an aux port.

A derivative of this ring bus is also used on laptop and desktop chips.

Break

Hit-over-Miss Caches

Recall: a CPU-cache port that doesn't stall on a miss. The CPU makes a request by placing the following items in Queue 1 (Queue 2 carries replies back to the CPU). CMD: read, write, etc. MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit. TAG: a 9-bit number identifying the request. MADDR: the memory address of the first byte. STORE-DATA: for stores, the data to store.

This cache is used in an ASPIRE CPU (Rocket). When the request is ready, the cache places the following items in Queue 2. TAG: the identity of the completed command. LOAD-DATA: for loads, the requested data. The CPU saves information about outstanding requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order. A sketch of the two queue records follows.
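A minimal sketch of the two queue records in Python, to make the field lists concrete. The encodings and the small out-of-order example are illustrative assumptions, not the Rocket interface definition.

```python
# Hypothetical request/response records for the non-stalling cache port.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheRequest:            # travels CPU -> cache on Queue 1
    cmd: str                   # "read", "write", ...
    mtype: int                 # access size in bits: 8, 16, 32, or 64
    tag: int                   # 9-bit request id, 0..511
    maddr: int                 # memory address of the first byte
    store_data: Optional[int] = None   # only for stores

@dataclass
class CacheResponse:           # travels cache -> CPU on Queue 2
    tag: int                   # which outstanding request completed
    load_data: Optional[int] = None    # only for loads

# The CPU keeps a table indexed by TAG, so replies may arrive in any order.
pending = {}
req = CacheRequest(cmd="read", mtype=32, tag=7, maddr=0x1000)
pending[req.tag] = req
resp = CacheResponse(tag=7, load_data=0xDEADBEEF)
original = pending.pop(resp.tag)       # match the reply to its request by TAG
print(original.maddr, hex(resp.load_data))
```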

Today: how a read request proceeds in the L1 D-Cache. The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1. ("We" below means the L1 D-Cache controller.) We do a normal cache access; if there is a hit, we place the load result in Queue 2. In the case of a miss, we use the Inverted Miss Status Holding Register.

Inverted MSHR (Miss Status Holding Register), step 1: to look up a memory address, associatively look up its cache block # in the table. If there are no hits, issue the memory request. [Table diagram: a 512-entry table, so that every 9-bit TAG value has an entry; each entry holds a cache block #, a valid bit, MTYPE, and the 1st byte in the block, and a Tag ID ROM supplies each entry's TAG. Assumptions: 32-byte blocks, 48-bit physical address space.]

Inverted MSHR, step 2: index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V=0, because the CPU promises not to reuse in-flight tags. [Same 512-entry table diagram; assumptions: 32-byte blocks, 48-bit physical address space.]

Inverted MSHR, step 3: whenever the memory system returns data, associatively look up its block # to find all pending transactions. Place transaction data for all hits in Queue 2, and clear their valid bits. Also update the L1 cache. The three steps are sketched in code below. [Same 512-entry table diagram; assumptions: 32-byte blocks, 48-bit physical address space.]
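A behavioral sketch of the three steps in Python. The table is modeled as a list indexed by TAG; the method names and the way block numbers are formed from the address are assumptions for illustration, not the Rocket hardware.

```python
# Hypothetical inverted-MSHR model: 512 entries, one per 9-bit TAG.
BLOCK_BYTES = 32          # assumption from the slide: 32-byte blocks
N_TAGS = 512

class InvertedMSHR:
    def __init__(self):
        # Each entry: dict(valid, block, mtype, offset), indexed by TAG.
        self.table = [{"valid": False} for _ in range(N_TAGS)]

    def lookup_block(self, maddr):
        """Step 1: associative search on block #; True if already pending."""
        block = maddr // BLOCK_BYTES
        return any(e["valid"] and e["block"] == block for e in self.table)

    def insert(self, tag, maddr, mtype):
        """Step 2: index by TAG (always finds valid=0 for in-flight tags)."""
        assert not self.table[tag]["valid"], "CPU reused an in-flight tag"
        self.table[tag] = {"valid": True, "block": maddr // BLOCK_BYTES,
                           "mtype": mtype, "offset": maddr % BLOCK_BYTES}

    def fill(self, block, queue2):
        """Step 3: memory returned `block`; wake every matching transaction."""
        for tag, e in enumerate(self.table):
            if e["valid"] and e["block"] == block:
                queue2.append({"tag": tag, "offset": e["offset"],
                               "mtype": e["mtype"]})
                e["valid"] = False       # clear the entry

# Usage: two misses to the same block share one memory request.
mshr, q2 = InvertedMSHR(), []
for tag, addr in [(3, 0x1000), (4, 0x1008)]:
    if not mshr.lookup_block(addr):
        print("issue DRAM fetch for block", addr // BLOCK_BYTES)
    mshr.insert(tag, addr, mtype=32)
mshr.fill(0x1000 // BLOCK_BYTES, q2)
print(q2)        # both TAG 3 and TAG 4 complete from one fill
```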

Inverted MSHR notes: structural hazards only occur when the TAG space is exhausted by the CPU. High cost (# of comparators + SRAM cells). See Farkas and Jouppi, on the class website, for low-cost designs that are often good enough. We will return to MSHRs when we discuss CPI performance later in the semester.

Coherency Hardware

Cache Placement

Two CPUs, two write-through caches, shared DRAM. CPU 0: LW R2, 16(R0). CPU 1: LW R2, 16(R0). Then CPU 1: SW R0, 16(R0). [Diagram: each CPU's cache and the shared main memory hold address 16; after the store, the cached copies disagree.] The view of memory is no longer "coherent": loads of location 16 from CPU 0 and CPU 1 see different values! A short trace of this example follows.
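A tiny trace of this scenario in Python, modeling each write-through cache as a dictionary. The initial value 5 at address 16 is an assumption read off the slide figure; R0 is the zero register, so the store writes 0.

```python
# Hypothetical incoherence trace: two write-through caches, no invalidation.
memory = {16: 5}                 # assumed initial value at address 16
cache = [{}, {}]                 # private caches for CPU 0 and CPU 1

def load(cpu, addr):
    if addr not in cache[cpu]:           # miss: fill from memory
        cache[cpu][addr] = memory[addr]
    return cache[cpu][addr]

def store_writethrough(cpu, addr, value):
    cache[cpu][addr] = value             # update own cache...
    memory[addr] = value                 # ...and write through to memory
    # Note: nothing invalidates the *other* CPU's cached copy.

load(0, 16)                              # CPU 0: LW R2, 16(R0) -> 5
load(1, 16)                              # CPU 1: LW R2, 16(R0) -> 5
store_writethrough(1, 16, 0)             # CPU 1: SW R0, 16(R0)
print(load(0, 16), load(1, 16))          # 5 0 -- incoherent views
```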

The simplest solution: one cache! The CPUs do not have internal caches; with only one cache, different values for a memory address cannot appear in two caches. Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank; in that case, one CPU must wait. [Diagram: CPU 0 and CPU 1 connect through a memory switch to a shared multi-bank cache and shared main memory.] A bank-selection sketch appears below.
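A small sketch of bank selection and conflict detection in Python. Selecting a bank with low-order block-address bits is a common choice assumed here for illustration, not a detail taken from the slide.

```python
# Hypothetical bank-conflict check for a multi-bank shared cache.
N_BANKS = 8
BLOCK_BYTES = 32

def bank_of(addr):
    # Interleave blocks across banks using low-order block-address bits.
    return (addr // BLOCK_BYTES) % N_BANKS

def conflict(addr_cpu0, addr_cpu1):
    """Both CPUs can be serviced in one epoch unless they hit one bank."""
    return bank_of(addr_cpu0) == bank_of(addr_cpu1)

print(conflict(0x0000, 0x0020))   # different banks -> False, both proceed
print(conflict(0x0000, 0x0100))   # same bank -> True, one CPU waits
```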

Not a complete solution, but good for L2 (Sequent Systems used it in the 1980s). At modern clock rates, access to a shared cache through the switch takes 10+ cycles, so using the shared cache as the L1 data cache is tantamount to slowing the clock down 10x for loads. Not good. This approach was a complete solution in the days when DRAM row access time and the CPU clock period were well matched.

Modified form: private L1s, shared L2. Thus, we need to solve the cache coherency problem for the L1 caches. Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data or instructions. Disadvantage: the CPUs share bandwidth to the L2 cache. [Diagram: CPU 0 and CPU 1 with private L1 caches, a memory switch or bus, a shared multi-bank L2 cache, and shared main memory.]

IBM Power4 (2001): dual core, with a shared, multi-bank L2 cache, private L1 caches, and off-chip L3 caches.

Cache Coherency

Cache coherency goals: 1. Only one processor at a time has write permission for a memory location. 2. No processor can load a stale copy of a location after a write. [Diagram: CPU 0 and CPU 1 caches above a shared memory hierarchy, each holding address 16.]

Simple implementation: snoopy caches. Each cache has the ability to "snoop" on the memory bus transactions of the other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs. [Diagram: each CPU's cache has a snooper attached to the memory bus, above the shared main memory hierarchy.]

Writes from 10,000 feet, for write-thru L1 caches: 1. The writing CPU takes control of the bus. 2. The address being written is invalidated in all other caches, so reads will no longer hit in those caches and get stale data. 3. The write is sent to main memory; reads of that address will now miss and retrieve the new value from main memory. To first order, reads will "just work" if write-thru caches implement this policy. It is a "two-state" protocol (cache lines are "valid" or "invalid"); a sketch appears below.
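A behavioral sketch of that two-state (valid/invalid) write-through invalidate policy in Python, continuing the earlier trace. The broadcast-invalidate loop stands in for the bus mechanism and is an illustrative assumption.

```python
# Hypothetical two-state snoopy protocol over write-through caches.
memory = {16: 5}
cache = [{}, {}]                 # valid lines only; absence == invalid

def load(cpu, addr):
    if addr not in cache[cpu]:
        cache[cpu][addr] = memory[addr]      # miss: fetch the current value
    return cache[cpu][addr]

def store(cpu, addr, value):
    # 1. The writing CPU takes the bus.  2. Invalidate the line everywhere else.
    for other in range(len(cache)):
        if other != cpu:
            cache[other].pop(addr, None)     # snoop-invalidate
    # 3. Write through to main memory (and keep our own copy).
    cache[cpu][addr] = value
    memory[addr] = value

load(0, 16); load(1, 16)         # both CPUs cache the value 5
store(1, 16, 0)                  # CPU 1 writes 0; CPU 0's copy is invalidated
print(load(0, 16), load(1, 16))  # 0 0 -- the stale copy is gone
```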

Limitations of the write-thru approach: every write goes to the bus, and total bus write bandwidth does not support more than 2 CPUs in modern practice. To scale further, we need to use write-back caches. The write-back big trick: add extra states. The simplest version is MSI (Modified, Shared, Invalid); more efficient versions add more states (MESI adds Exclusive). The state definitions are subtle; a rough MSI sketch follows.
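A rough sketch of MSI processor-side and snoop-side transitions in Python, to give the flavor of the extra states. This is a simplified teaching model (no data movement, no races or transient states), not a complete or verified protocol.

```python
# Hypothetical, simplified MSI transition tables for one cache line.
# States: "M" (modified), "S" (shared), "I" (invalid).

def cpu_event(state, op, bus):
    """op is 'read' or 'write'; bus collects transactions we must issue."""
    if op == "read":
        if state == "I":
            bus.append("BusRd")          # fetch a shared copy
            return "S"
        return state                     # S or M: hit, no state change
    if op == "write":
        if state != "M":
            bus.append("BusRdX")         # gain exclusive ownership (upgrade if S)
        return "M"

def snoop_event(state, bus_op):
    """Another cache's bus transaction, observed by our snooper."""
    if bus_op == "BusRd":
        return "S" if state == "M" else state   # M downgrades (and supplies data)
    if bus_op == "BusRdX":
        return "I"                              # someone else is writing
    return state

# Example: CPU A writes a line that CPU B holds shared.
bus = []
a, b = "I", "S"
a = cpu_event(a, "write", bus)          # A issues BusRdX, A -> M
b = snoop_event(b, bus[-1])             # B snoops BusRdX, B -> I
print(a, b, bus)                        # M I ['BusRdX']
```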

Figure 5.5, page 358, is the best starting point.

Read misses, for a MESI protocol with write-back caches: 1. A cache requests a cache-line fill for a read miss. 2. Another cache with an exclusive copy of this line responds with fresh data; the read miss will not hit main memory and retrieve stale data. 3. The responding cache changes the line from exclusive to modified; future writes will go to the bus to be... These sketches are just to give you a sense of how coherency protocols work; deep understanding requires working through the complete state diagrams.

The snoopy mechanism doesn't scale. Single-chip implementations have moved to a centralized "directory" service that tracks the status of each line of each private cache. Multi-socket systems use distributed directories. A sketch of the directory bookkeeping follows.
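A minimal sketch of the bookkeeping a directory keeps per cache line, in Python. The field names and the explicit sharer set are common textbook choices assumed here, not details of a particular Intel or IBM design.

```python
# Hypothetical directory entry: who caches each line, and in what state.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DirectoryEntry:
    state: str = "U"                 # "U" uncached, "S" shared, "M" modified
    sharers: set = field(default_factory=set)   # core ids holding a copy
    owner: Optional[int] = None      # valid when state == "M"

directory = {}                       # line address -> DirectoryEntry

def read_miss(addr, core):
    e = directory.setdefault(addr, DirectoryEntry())
    if e.state == "M":
        # Ask the owner to forward data / write back, then downgrade it.
        e.sharers = {e.owner, core}
        e.owner, e.state = None, "S"
    else:
        e.sharers.add(core)
        e.state = "S"

def write_miss(addr, core):
    e = directory.setdefault(addr, DirectoryEntry())
    # Invalidate all other sharers (point-to-point messages, not a bus broadcast).
    e.sharers, e.owner, e.state = {core}, core, "M"

read_miss(0x40, core=0); read_miss(0x40, core=1)
write_miss(0x40, core=1)
print(directory[0x40])   # state M, owner 1, sharers {1}
```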

Directories attached to the on-chip cache network...

A 2-socket system: each socket is a multi-core chip, and each chip has its own bank of DRAM.

Distributed directories for multi-socket systems: directories for Chip 0 and Chip 1. [Diagram: each chip's L1 and L2 caches; a directory for Chip 0's DRAM and a directory for Chip 1's DRAM.]

Figure 5.21, page 381: directory message basics. Conceptually similar to snoopy caches, but the different mechanisms require rethinking the protocol to get correct behavior.

Other Machine Architectures

NUMA: Non-Uniform Memory Access. Each CPU (CPU 0 ... CPU 1023) has part of main memory attached to it; to access other parts of main memory, it uses the interconnection network. The network implements a coherent global address space, using directory protocols over fiber networking. For best results, applications take the non-uniform memory latency into account.

Clusters: the supercomputing version of a WSC. Connect large numbers of 1-CPU or 2-CPU rack-mount computers together with high-end network technology (not normal Ethernet). Instead of using hardware to create a shared-memory abstraction, let each application build its own memory model. [Photo: University of Illinois, a cluster of 650 2-CPU Apple Xserves connected with Myrinet (3.5 μs ping time - low latency).]

On Tuesday we return to CPU design. Have a good weekend!