
Coarse-Grained Coherence
Mikko H. Lipasti, Associate Professor
Electrical and Computer Engineering, University of Wisconsin–Madison
Joint work with: Jason Cantin, IBM (Ph.D. '06); Natalie Enright Jerger; Prof. Jim Smith; Prof. Li-Shiuan Peh (Princeton)
http://www.ece.wisc.edu/~pharm

Motivation
• Multiprocessors are commonplace
  • Historically, glass-house servers
  • Now laptops, soon cell phones
• Most common multiprocessor: symmetric processors w/ coherent caches
  • Logical extension of time-shared uniprocessors
  • Easy to program and reason about
  • Not so easy to build
Aug 30, 2007 Mikko Lipasti-University of Wisconsin

Coherence Granularity
• Track each individual word
  • Too much overhead
• Track larger blocks
  • 32 B – 128 B common
  • Less overhead, exploit spatial locality
  • Large blocks cause false sharing
[Figure: eight processors P0–P7 sharing lines of one large block]
• Solution: use multiple granularities
  • Small blocks: manage local read/write permissions
  • Large blocks: track global behavior

Coarse-Grained Coherence
• Initially
  • Identify non-shared regions
  • Decouple obtaining coherence permission from data transfer
  • Filter snoops to reduce broadcast bandwidth
• Later
  • Enable aggressive prefetching
  • Optimize DRAM accesses
  • Customize protocol and interconnect to match

Coarse-Grained Coherence
• Optimizations lead to
  • Reduced memory miss latency
  • Reduced cache-to-cache miss latency
  • Reduced snoop bandwidth
  • Fewer exposed cache misses
  • Elimination of unnecessary DRAM reads
  • Power savings on bus, interconnect, caches, and in DRAM
  • World peace and an end to global warming

Coarse-Grained Coherence Tracking
• Memory is divided into coarse-grained regions
  • Aligned, power-of-two multiple of the cache line size
  • Can range from two lines to a physical page
• A cache-like structure is added to each processor to monitor coherence at the granularity of regions
  • Region Coherence Array (RCA)

Region Coherence Arrays
• Each entry has an address tag, a state, and a count of lines cached by the processor
• The region state indicates whether this processor and/or other processors are sharing or modifying lines in the region
• Policy, protocol, and interconnect can be customized to exploit the region state
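The entry layout above can be sketched as a small lookup structure. This is a minimal illustrative model, not the hardware design: the field and state names are assumptions, region size is assumed to be 4 KB, and the array is made fully associative for brevity (a real RCA is a set-associative tagged array).

```python
from dataclasses import dataclass

REGION_BITS = 12  # assumed: 4 KB regions (aligned, power-of-two multiple of line size)

@dataclass
class RCAEntry:
    tag: int         # region-aligned address tag
    state: str       # summary state, e.g. 'invalid', 'clean', 'externally-dirty'
    line_count: int  # lines from this region currently cached by this processor

class RegionCoherenceArray:
    def __init__(self):
        self.entries = {}  # tag -> RCAEntry; fully associative for simplicity

    def region_tag(self, addr):
        return addr >> REGION_BITS

    def lookup(self, addr):
        return self.entries.get(self.region_tag(addr))

    def install(self, addr, state):
        entry = RCAEntry(self.region_tag(addr), state, 0)
        self.entries[entry.tag] = entry
        return entry

rca = RegionCoherenceArray()
e = rca.install(0x12345678, 'clean')
e.line_count += 1                      # a line from the region enters the cache
assert rca.lookup(0x12345000) is e     # any address in the same 4 KB region maps here
```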

Talk Outline
þ Motivation
þ Overview of Coarse-Grained Coherence
• Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview

Unnecessary Broadcasts [figure]

Broadcast Snoop Reduction
• Identify requests that don't need a broadcast
• Send data requests directly to memory w/o broadcasting
  • Reduces broadcast traffic
  • Reduces memory latency
• Avoid sending non-data requests externally
• Example
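The filtering decision above can be sketched as a check on the region state returned by the RCA. The state names here are assumptions for illustration; the idea is simply that a miss to a region no other processor is caching can skip the broadcast and go straight to memory.

```python
# Sketch (assumed state names): decide whether an L2 miss must be broadcast.
def needs_broadcast(region_state, is_write):
    if region_state is None:           # region unknown to the RCA: must broadcast
        return True
    if region_state == 'exclusive':    # no other processor caches this region
        return False                   # request can go directly to memory
    if region_state == 'shared-clean' and not is_write:
        return False                   # reads of externally-clean data skip too
    return True                        # writes to shared regions need permission

assert needs_broadcast(None, False)
assert not needs_broadcast('exclusive', True)
assert needs_broadcast('shared-clean', True)
```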

Simulator Evaluation
• PHARMsim: near-RTL, but written in C
• Execution-driven simulator built on top of SimOS-PPC
• Four 4-way superscalar out-of-order processors
• Two-level hierarchy with split L1, unified 1 MB L2 caches, and 64 B lines
• Separate address/data networks, similar to Sun Fireplane

Workloads
• Scientific
  • Ocean, Raytrace, Barnes
• Multiprogrammed
  • SPECint2000_rate, SPECint95_rate
• Commercial (database, web)
  • TPC-W, TPC-B, TPC-H
  • SPECweb99, SPECjbb2000

Broadcasts Avoided [figure]

Execution Time [figure]

Summary
• Eliminates nearly all unnecessary broadcasts
• Reduces snoop activity by 65%
  • Fewer broadcasts
  • Fewer lookups
• Provides modest speedup

Talk Outline
þ Motivation
þ Overview of Coarse-Grained Coherence
• Techniques
  þ Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview

Prefetching in Multiprocessors
• Prefetching: anticipate a future reference, fetch into the cache
• Many prefetching heuristics possible
  • Current systems: next-block, stride
  • Proposed: skip pointer, content-based
  • Some/many prefetched blocks are not used
• Multiprocessor complications
  • Premature or unnecessary prefetches
  • Permission thrashing if blocks are shared
  • Separate study [ISPASS 2006]

Stealth Prefetching
• Lines from non-shared regions can be prefetched stealthily and efficiently
• Without disturbing other processors
  • Without downgrades or invalidations
  • Without preventing them from obtaining exclusive copies
  • Without broadcasting prefetch requests
• Fetched from DRAM with low overhead
• Example

Stealth Prefetching
• After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched
• These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
• After accessing the RCA, requests may obtain data from the buffer as they would from memory
  • To access data, the region must be in a valid state and a broadcast unnecessary for coherent access
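The trigger above can be sketched as follows. The threshold of 2 comes from the slide; the 8-lines-per-region ratio and the dictionary/set representation are assumptions made to keep the sketch self-contained.

```python
PREFETCH_THRESHOLD = 2   # from the slide: prefetch after the 2nd L2 miss
LINES_PER_REGION = 8     # assumed region/line ratio for this sketch

def on_l2_miss(region, buffer):
    """On an L2 miss to a non-shared region, count misses; at the
    threshold, pull the region's remaining lines into the prefetch
    buffer (fetched from DRAM, never broadcast)."""
    region['misses'] += 1
    if region['misses'] == PREFETCH_THRESHOLD and region['state'] == 'non-shared':
        for line in range(LINES_PER_REGION):
            if line not in region['cached_lines']:
                buffer.add(line)

region = {'misses': 0, 'state': 'non-shared', 'cached_lines': {0, 3}}
buf = set()
on_l2_miss(region, buf)   # first miss: below threshold, nothing prefetched
on_l2_miss(region, buf)   # second miss: prefetch the 6 lines not yet cached
assert buf == {1, 2, 4, 5, 6, 7}
```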

L2 Misses Prefetched [figure]

Speedup [figure]

Summary
Stealth Prefetching can prefetch data:
• Stealthily
  • Only non-shared data prefetched
  • Prefetch requests not broadcast
• Aggressively
  • Large regions prefetched at once, 80–90% timely
• Efficiently
  • Piggybacked onto a demand request
  • Fetched from DRAM in open-page mode

Talk Outline
þ Motivation
þ Overview of Coarse-Grained Coherence
• Techniques
  þ Broadcast Snoop Reduction [ISCA 2005]
  þ Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview

Power-Efficient DRAM Speculation
[Figure: timeline of Broadcast Req → Snoop Tags → Send Resp, overlapped with DRAM Read → Xmit Block]
• Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before the snoop response
  • Trading DRAM bandwidth for latency
  • Wasting power
• Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily

DRAM Operations [figure]

Power-Efficient DRAM Speculation
• Direct memory requests are non-speculative
• Lines from externally-dirty regions are likely to be sourced from another processor's cache
  • The region state can serve as a prediction
  • No need to access DRAM speculatively
• Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors' caches
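The prediction above can be sketched as a simple function of region state. State names are assumptions; the point is that an externally-dirty region predicts a cache-to-cache transfer, so the speculative DRAM read can be suppressed.

```python
# Sketch (assumed state names): should the controller read DRAM before the
# snoop response arrives?
def speculate_dram_read(region_state):
    if region_state == 'externally-dirty':
        return False   # data expected from another processor's cache: save power
    if region_state in ('exclusive', 'externally-clean'):
        return True    # memory will supply the data: speculate for latency
    return True        # region state unknown: default to speculating

assert not speculate_dram_read('externally-dirty')
assert speculate_dram_read('exclusive')
```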

Useless DRAM Reads [figure]

Useful DRAM Reads [figure]

DRAM Reads Performed/Delayed [figure]

Summary
Power-Efficient DRAM Speculation:
• Can reduce DRAM reads 20%, with less than 1% degradation in performance
  • (vs. a 7% slowdown with fully nonspeculative DRAM)
• Nearly doubles the interval between DRAM requests, allowing modules to stay in low-power modes longer

Talk Outline
þ Motivation
þ Overview of Coarse-Grained Coherence
• Techniques
  þ Broadcast Snoop Reduction [ISCA 2005]
  þ Stealth Prefetching [ASPLOS 2006]
  þ Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview

Chip Multiprocessor Interconnect
• Options
  • Buses: don't scale
  • Crossbars: too expensive
  • Rings: too slow
  • Packet-switched mesh
• Attractive for all the same 1990s DSM reasons
  • Scalable
  • Low latency
  • High link utilization

CMP Interconnection Networks
• But… cables/traces are now on-chip wires
  • Fast, cheap, plentiful
  • Short: 1 cycle per hop
• Router latency adds up
  • 3–4 cycles per hop
  • Store-and-forward
  • Lots of activity/power
• Is this the right answer?

Circuit-Switched Interconnects
• Communication patterns
  • Spatial locality to memory
  • Pairwise communication
• Circuit-switched links
  • Avoid switching/routing
  • Reduce latency
  • Save power?
  • Poor utilization! Maybe OK

Router Design
• Switches consist of
  • Configurable crossbar
  • Configuration memory
• 4-stage router pipeline exposes only 1 cycle if circuit-switched
• Can also act as a packet-switched network
• Design details in [CA Letters '07]

Protocol Optimization
• Initial 3-hop miss establishes a circuit-switched path
• Subsequent miss requests
  • Sent directly on the CS path to the predicted owner
  • Also sent in parallel to the home node
  • Predicted owner sources data early
  • Directory acks update to the sharing list
• Benefits
  • Reduced 3-hop latency
  • Less activity, less power

Hybrid Circuit Switching (1)
[Figure: normalized execution time (0.9–1.02) for HCS with 2 circuits and 4 circuits on Barnes, Ocean, Radiosity, Raytrace, SPECjbb, SPECweb, TPC-H, and TPC-W]
• Hybrid Circuit Switching improves performance by up to 7%

Hybrid Circuit Switching (2)
[Figure: normalized cycle counts (0.8–1.05) for PS + protocol optimization vs. HCS, 4 circuits + protocol optimization on the same workloads]
• Positive interaction in co-designed interconnect & protocol
• More circuit reuse => greater latency benefit

Summary
Hybrid Circuit Switching:
• Routing overhead eliminated
  • Still enables high bandwidth when needed
• Co-designed protocol
  • Optimizes cache-to-cache transfers
  • Substantial performance benefits
• To do: power analysis

Talk Outline
þ Motivation
þ Overview of Coarse-Grained Coherence
• Techniques
  þ Broadcast Snoop Reduction [ISCA 2005]
  þ Stealth Prefetching [ASPLOS 2006]
  þ Power-Efficient DRAM Speculation
  þ Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview

Server Consolidation on CMPs
• CMP as consolidation platform
• Simplifies system administration
  • Saves power, cost, and physical infrastructure
• Study combinations of individual workloads in a full-system environment
  • Micro-coded hypervisor schedules VMs
• See "An Evaluation of Server Consolidation Workloads for Multi-Core Designs" in IISWC 2007 for additional details
  • Nugget: shared LLC is a big win

Virtual Proximity
• Interactions between VM scheduling, placement, and interconnect
  • Goal: placement-agnostic scheduling
  • Best workload balance
• Evaluate 3 scheduling policies
  • Gang, Affinity, and Load Balanced
• HCS provides virtual proximity

Scheduling Algorithms
• Gang Scheduling
  • Co-schedules all threads of a VM
  • No idle-cycle stealing
• Affinity Scheduling
  • VMs assigned to neighboring cores
  • Can steal idle cycles across VMs sharing a core
• Load Balanced Scheduling
  • Ready threads assigned to any core
  • Any/all VMs can steal idle cycles
  • Over time, a VM fragments across the chip
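The contrast between the policies above can be sketched in a few lines. This is a toy model with assumed core counts and load values, not the hypervisor's actual scheduler: affinity keeps a VM's threads on neighboring cores, while load balancing sends each ready thread to the least-loaded core, so a VM's threads scatter.

```python
def affinity_schedule(nthreads, home_core):
    # affinity: a VM's threads stay on consecutive cores near its home core
    return [home_core + i for i in range(nthreads)]

def load_balanced_schedule(nthreads, core_load):
    # load balanced: each ready thread goes wherever current load is lowest,
    # regardless of which VM it belongs to
    placement = []
    for _ in range(nthreads):
        core = core_load.index(min(core_load))
        placement.append(core)
        core_load[core] += 1
    return placement

assert affinity_schedule(4, home_core=4) == [4, 5, 6, 7]   # stays contiguous
# With uneven pre-existing load on 8 cores, the same VM fragments:
assert load_balanced_schedule(4, [3, 0, 2, 0, 1, 3, 0, 2]) == [1, 3, 6, 1]
```

Fragmentation is why load balancing suffers on a slow interconnect, and why the low-latency circuits of HCS (virtual proximity) recover its advantage.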

Performance with Increasing Interconnect Latency
[Figure: normalized execution time (0.9–1.2) for Affinity vs. Load scheduling as per-hop delay grows from 1 to 10 cycles]
• Load balancing wins with a fast interconnect
• Affinity scheduling wins with a slow interconnect
• HCS creates virtual proximity

Virtual Proximity Performance
[Figure: normalized cycle counts (0.7–1.3) for Gang, Affinity, and Load scheduling under PS and HCS on Mix 1 (TPC-W + TPC-H), Mix 2 (TPC-W + JBB), and Mix 3 (TPC-H + JBB)]
• HCS is able to provide virtual proximity

Hop Count vs. Interconnect Latency
[Figure: average latency (0–20 cycles) vs. average hop count (0–6) for PS and HCS]
• As physical distance (hop count) increases, HCS provides significantly lower latency

Summary
Virtual Proximity [in submission]
• Enables a placement-agnostic hypervisor scheduler
• Results:
  • Up to 17% better than affinity scheduling
  • Idle-cycle reduction: 84% over gang and 41% over affinity
• Low-latency interconnect mitigates the increase in L2 cache conflicts from load balancing
  • L2 misses up by 10%, but execution time reduced by 11%
• A flexible, distributed address mapping combined with HCS outperforms a localized affinity-based memory mapping by an average of 7%

Talk Outline
þ Motivation
þ Overview of Coarse-Grained Coherence
• Techniques
  þ Broadcast Snoop Reduction [ISCA 2005]
  þ Stealth Prefetching [ASPLOS 2006]
  þ Power-Efficient DRAM Speculation
  þ Hybrid Circuit Switching
  þ Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview

Circuit-Switched Snooping (1)
• Scalable, efficient broadcasting on an unordered network
  • Removes the latency overhead of directory indirection
• Extend point-to-point circuit-switched links to trees
  • Low-latency multicast via circuit-switched tree
  • Helps provide performance isolation, since requests do not share the same communication medium

Circuit-Switched Snooping (2)
• Extend Coarse-Grain Coherence Tracking (CGCT)
  • Remove unnecessary broadcasts
  • Convert broadcasts to multicasts
• Effective in server consolidation workloads
  • Very few coherence requests to globally shared data

Snooping Interconnect
• Switches consist of
  • Configurable crossbar
  • Configuration memory
• Circuits span two or more nodes, based on the RCA
• Snooping occurs across circuits
  • All sharers in a region join the circuit
• Each link can physically accommodate multiple circuits

Circuit-Switched Snooping
• Use the RCA to identify subsets of nodes that share data
• Create shared circuits among these nodes
• Design challenges
  • Multi-drop, bidirectional circuits
  • Memory ordering
• Results: very much in progress

Talk Outline
þ Motivation
þ Overview of Coarse-Grained Coherence
• Techniques
  þ Broadcast Snoop Reduction [ISCA 2005]
  þ Stealth Prefetching [ASPLOS 2006]
  þ Power-Efficient DRAM Speculation
  þ Hybrid Circuit Switching
  þ Virtual Proximity
  þ Circuit-switched snooping
• Research Group Overview

Research Group Overview
• Faculty: Mikko Lipasti, since 1999
• Current MS/Ph.D. students
  • Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
• Graduates, current employment:
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  • IBM: Trey Cain, Jason Cantin, Brian Mestan
  • AMD: Kevin Lepak
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Current Focus Areas
• Multiprocessors
  • Coherence protocol optimization
  • Interconnection network design
  • Fairness issues in hierarchical systems
• Microprocessor design
  • Complexity-effective microarchitecture
  • Scalable dynamic scheduling hardware
  • Speculation reduction for power savings
  • Transparent clock gating
  • Domain-specific ISA extensions
• Software
  • Java Virtual Machine run-time optimization
  • Workload development and characterization

Funding
• National Science Foundation
• Intel Research Council
• IBM Faculty Partnership Awards
• IBM Shared University Research equipment
• Schneider ECE Faculty Fellowship
• UW Graduate School

Questions?
http://www.ece.wisc.edu/~pharm

Backup Slides


Region Coherence Arrays
• On cache misses, the region state is read to determine whether a broadcast is necessary
• On external snoops, the region state is read to provide a region snoop response
  • Piggybacked onto the conventional response
  • Used to update other processors' region state
• The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region

Coarse-Grain Coherence Tracking
[Animated example: Region Coherence Array added, two lines per region. P1 stores to binary address 10000 and misses; the RFO broadcast hits in P0's cache, P0's copy is invalidated and the region is no longer exclusively owned; the response and data transfer complete P1's store.]

Overhead
• Storage for the RCA
• Two bits in the snoop response for the region snoop response
  • Region Externally Clean/Dirty

Overhead
• The RCA maintains inclusion over the caches
  • The RCA must respond correctly to external requests if lines are cached
  • When regions are evicted from the RCA, their lines are evicted from the cache
  • The replacement algorithm uses the line count to favor regions with no lines cached
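The replacement bias above can be sketched as follows. The LRU tiebreak and the dictionary representation are assumptions; the point is that evicting a region with no cached lines costs nothing, while evicting a populated region forces line evictions to preserve inclusion.

```python
# Sketch: pick an RCA victim, preferring regions with no lines cached.
def choose_victim(entries):
    empty = [e for e in entries if e['line_count'] == 0]
    candidates = empty if empty else entries    # fall back to any region
    return min(candidates, key=lambda e: e['lru'])  # LRU tiebreak (assumed)

entries = [
    {'tag': 0xA, 'line_count': 3, 'lru': 0},  # oldest, but has cached lines
    {'tag': 0xB, 'line_count': 0, 'lru': 1},
    {'tag': 0xC, 'line_count': 0, 'lru': 2},
]
assert choose_victim(entries)['tag'] == 0xB   # oldest *empty* region wins
```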

Snoop Traffic – Peak [figure]

Snoop Traffic – Average [figure]

Snoop Traffic
• Peak snoop traffic is halved
• Average snoop traffic is reduced by nearly two thirds
• The system is more scalable, and may effectively support more processors

Tag Lookups Filtered
• Coarse-Grain Coherence Tracking can be used to filter external snoops
  • Send external requests to the RCA first
  • If the region is valid and the line count nonzero, send the external request to the cache
• Reduces power consumption in the cache tag arrays
• Increases broadcast snoop latency
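The filter described above reduces to a single check per external snoop; the state names and dictionary form here are illustrative assumptions.

```python
# Sketch: forward an external snoop to the cache tags only if the RCA shows
# a valid region with at least one line cached locally.
def forward_snoop_to_cache(rca_entry):
    return (rca_entry is not None
            and rca_entry['state'] != 'invalid'
            and rca_entry['line_count'] > 0)

assert not forward_snoop_to_cache(None)                                  # region miss
assert not forward_snoop_to_cache({'state': 'clean', 'line_count': 0})   # no lines here
assert forward_snoop_to_cache({'state': 'clean', 'line_count': 2})       # must snoop tags
```

The serialization (RCA first, tags second) is what trades a little snoop latency for tag-array power.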

Tag Lookups Filtered [figure]

Line Evictions for Inclusion [figure]

L2 Miss Ratio Increase [figure]

Stealth Prefetching
• Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2)
• A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data

Stealth Prefetching
• Prefetched lines are managed by a simple protocol

Prefetch Timeliness [figure]

Data Traffic [figure]

Period Between DRAM Requests [figure]

Switch design [figure]

Value-Aware Techniques
• Coherence misses in multiprocessors
  • Store Value Locality [Lepak '03]
• Ensuring consistency
  • Value-based checks [Cain '04]
• Reducing speculation
  • Operand significance
  • Create a (nearly) nonspeculative execution schedule
• Java Virtual Machine runtime optimization [Su]
  • Speculative optimizations [VEE '07]

Complexity-Effective Techniques
• Scalable dynamic scheduling hardware
  • Half-price architecture [Kim '03]
  • Macro-op scheduling [Kim '03]
  • Operand significance [Gunadi]
• Scalable snoop-based coherence
  • Coarse-grained coherence [Cantin '06]
  • Circuit-switched coherence [Enright]

Power-Efficient Techniques
• Reduced speculation [Gunadi]
• Clock gating [E. Hill]
  • Transparent pipelines need fine-grained stalls
  • Redistribute coarse-grained stall cycles
• Circuit-switched coherence [Enright]
  • Reduce the overhead of CMP cache coherence
  • Improve latency and power

Cache Coherence Problem
[Animated example: P0 executes Load A then Store A <= 1 while P1 executes Load A; initially both caches and memory hold A = 0.]

Cache Coherence Problem
[Animation continued: after P0's store, P0's copy of A is 1; without coherence, P1's load can still observe the stale value, illustrating the problem.]

Snoopy Cache Coherence
• All cache misses are broadcast on a shared bus
  • Processors and memory snoop and respond
• Cache block permissions enforced
  • Multiple readers allowed (shared state)
  • Only a single writer (exclusive state)
    • Must upgrade the block before writing to it
    • Other copies invalidated
• Read/write-shared blocks bounce from cache to cache
  • Migratory sharing
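The permission rules above can be sketched as a toy two-state (Modified/Shared) model. This is a minimal illustration of the invariants, not a full protocol: no bus, no pending states, and the cache contents are plain dictionaries.

```python
def write(caches, writer, addr):
    # a writer must upgrade: invalidate every other copy, then hold Modified
    for i, c in enumerate(caches):
        if i != writer:
            c.pop(addr, None)
    caches[writer][addr] = 'M'

def read(caches, reader, addr):
    # a read downgrades any remote Modified copy; multiple Shared readers OK
    for c in caches:
        if c.get(addr) == 'M':
            c[addr] = 'S'
    caches[reader][addr] = 'S'

caches = [{}, {}]
write(caches, 0, 0x100)
assert caches[0][0x100] == 'M'           # single writer
read(caches, 1, 0x100)                   # P1's read downgrades P0
assert caches[0][0x100] == 'S' and caches[1][0x100] == 'S'
```

Alternating writes by different processors under these rules make the block "bounce" between caches, which is exactly the migratory-sharing pattern named above.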

Example: Conventional Snooping
[Animated example: P0 loads binary address 10000 and misses; the read is broadcast on the network, all nodes snoop, a response is sent, and the data is transferred into P0's cache.]

Coarse-Grain Coherence Tracking
[Animated example: Region Coherence Array added, two lines per region. P0 loads binary address 10000 and misses; the broadcast snoop reports the region not shared, so P0 gains exclusive access to the entire region along with the data.]

Coarse-Grain Coherence Tracking
[Animation continued: P0 loads binary address 11000, misses in the cache but hits in the RCA with the region in an exclusive state, so a broadcast is unnecessary and a direct request is sent to memory.]

Impact on Execution Time [figure]

Stealth Prefetching
[Animated example, assuming 8-byte lines, 32-byte regions, and a 2-line threshold: P0 loads 0x28, misses in the cache but hits in the RCA; a direct request is sent, and the region's remaining lines are prefetched from memory into the Stealth Data Prefetch Buffer alongside the demand data.]

Stealth Prefetching
[Animation continued: P0 loads 0x30, misses in the cache but hits in the Stealth Data Prefetch Buffer, and the data is returned without an external request.]

Communication Latencies

  Latency (cycles)                 CC-NUMA             CMP
  Local cache access               12                  12
  Remote cache-to-cache transfer   12 + 21 * H * 3     12 + 4 * H * 3
  Local memory access              150                 150
  Remote memory access             150 + 21 * H * 2    150 + 4 * H * 2
  (H = hop count)

• Remote cache access is 2–5x faster in CMPs than in NUMA machines
• Lower communication latencies allow for more flexible thread placement
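The latency formulas in the table evaluate as follows; the constants come directly from the table, with only the hop counts chosen for illustration.

```python
def remote_cache_to_cache(H, per_hop):
    # request, forward, and data transfer: 3 network traversals of H hops each
    return 12 + per_hop * H * 3

def remote_memory(H, per_hop):
    # request and data: 2 network traversals of H hops each
    return 150 + per_hop * H * 2

# CC-NUMA uses 21 cycles/hop; the CMP uses 4 cycles/hop.
assert remote_cache_to_cache(2, 21) == 138   # CC-NUMA, 2 hops
assert remote_cache_to_cache(2, 4) == 36     # CMP, 2 hops: ~4x faster
assert remote_memory(2, 4) == 166
```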

Configuration

Simulation Parameters
• Cores: 16 single-threaded, light-weight, in-order
• Interconnect: 2-D packet-switched mesh, 3-cycle router pipeline (baseline); hybrid circuit-switched mesh, 4 circuits
• L1 cache: split I/D, 16 KB each (2 cycles)
• L2 cache: private, 128 KB (6 cycles)
• L3 cache: shared, 16 MB (16 × 1 MB banks), 12 cycles
• Memory latency: 150 cycles

Workload Mixes
• Mix 1: TPC-W (4) + TPC-H (4)
• Mix 2: TPC-W (4) + SPECjbb (4)
• Mix 3: TPC-H (4) + SPECjbb (4)

Effect of Memory Placement
[Figure: normalized cycle counts (0.6–1.4) for Mix 1 (TPC-W + TPC-H), Mix 2 (TPC-W + JBB), and Mix 3 (TPC-H + JBB) under Gang Distributed, Affinity Local, Gang Local, Load Distributed, and Affinity Distributed configurations]
• Load balancing with HCS outperforms local placement
• Virtual proximity to the memory home node