Power Efficient Cache Coherence
Craig Saldanha, Mikko Lipasti

Motivation
  • Power consumption is becoming a serious design constraint.
  • Market demand for faster and more complex servers.
  • Complex coherence protocol and interconnect.
  • fclock + Complexity = Pinterconnect

Motivation
  • Traditional power-saving methodologies are ineffective.
  • Minimize the number of transaction packets.
  • At the end-points: Jetty.
  • At the source: Serial Snoop.

OUTLINE
  • Overview of snoop-based protocols and the opportunity for power savings
  • Latency and power consumption of parallel snooping techniques
  • Serial Snooping
  • Results
  • Conclusions
  • Future Work

Snoop Based Coherence
  • Local Node: Tag Lookup → Bus Arbitration → Snoop Transmission
  • Remote Node: Tag Lookup → Snoop Response → Combination of Responses → Data Fetch → Data Transmit
[Figure: processors P1 to P4 and Memory; each node holds a Tag Array and a Data Array.]

Degrees of Speculation
  • 3 degrees of freedom to speculate: Snooping, Data Fetch, Data Transmit.
  • Resulting design points:
      Parallel Snoop, Spec DFetch, Spec DXmit
      Parallel Snoop, Spec DFetch, Non-Spec DXmit
      Parallel Snoop, Non-Spec DFetch, Non-Spec DXmit
      Serial Snoop, Spec DFetch, Spec DXmit
      Serial Snoop, Spec DFetch, Non-Spec DXmit
      Serial Snoop, Non-Spec DFetch, Non-Spec DXmit
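
The six design points on this slide can be generated from the three degrees of freedom. The sketch below is only an illustration: the function name and the pruning rule are assumptions, not taken from the slides. It assumes the two missing combinations are dropped because data that was not fetched speculatively cannot be transmitted speculatively.

```python
from itertools import product

SNOOP = ("Parallel", "Serial")
FETCH = ("Spec", "Non-Spec")
XMIT = ("Spec", "Non-Spec")


def design_points():
    """Enumerate the (snoop, fetch, transmit) combinations named on the slide."""
    points = []
    for snoop, fetch, xmit in product(SNOOP, FETCH, XMIT):
        # Assumption (not stated on the slide): a speculative transmit
        # requires a speculative fetch, so this pairing is skipped.
        if fetch == "Non-Spec" and xmit == "Spec":
            continue
        points.append(f"{snoop} Snoop, {fetch} DFetch, {xmit} DXmit")
    return points


if __name__ == "__main__":
    for point in design_points():
        print(point)
```

Under that assumption the enumeration yields six points, in the same order as the slide lists them.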

Latency and Power assumptions
  • Consider only load misses.
  • Tree of point-to-point connections.
  • Latencies:
      Link traversal: 1 bus cycle (7 ns)
      Tag lookup: 1 bus cycle (7 ns)
      D-Fetch: 2 bus cycles (14 ns)
      DRAM access: 10 bus cycles (70 ns)
[Figure: tree with MEMORY and the Root Node on the Backplane, Switch 1 and Switch 2, and processors P1 to P4 on boards (Board 1 labelled).]
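
The latency assumptions reduce to a bus-cycle count per event times the 7 ns bus cycle. A minimal sketch follows; the dictionary keys and the helper name `latency_ns` are hypothetical, and switch traversal is left out because the slide gives no figure for it.

```python
BUS_CYCLE_NS = 7  # one bus cycle, from the slide

# Bus cycles per event, as listed on the slide.
CYCLES = {
    "link": 1,
    "tag_lookup": 1,
    "data_fetch": 2,
    "dram_access": 10,
}


def latency_ns(*events):
    """Total latency in ns of a chain of serialized events."""
    return sum(CYCLES[event] for event in events) * BUS_CYCLE_NS


if __name__ == "__main__":
    print(latency_ns("dram_access"))               # 70
    print(latency_ns("tag_lookup", "data_fetch"))  # 21
```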

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Snoop Broadcast: cumulative power as the snoop propagates

  Latency (ns)   Power
  7              Plink
  14             Plink + Pswitch
  21             2 Plink + Pswitch
  28             2 Plink + 2 Pswitch
  35             5 Plink + 2 Pswitch
  42             5 Plink + 4 Pswitch
  49             8 Plink + 4 Pswitch

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Memory Access: Data Fetch
  Latency (ns): 35, 91, 105
  Power: Pmem

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Memory Access: Data Transmit
  Latency (ns): 35, 91, 105, 140
  Power: Pmem + 3 Plink + 2 Pswitch

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Remote Node: Tag Lookup
  Latency (ns): 49, 56
  Power: 3 Ptag

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Remote Node: Snoop Response
  Latency (ns): 49, 56, 105
  Power: 3 Ptag + 3*(Pswitch + 2 Plink) + Plink + 2 Pswitch + 2 Plink

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Remote Node: Data Fetch
  Latency (ns): 49, 63
  Power: 3 Pcache

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Remote Node: Data Transmit
  Latency (ns): 49, 63, 112
  Power: 3 Ptag + 3*(4 Plink + 3 Pswitch)

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Latency (ns)
  Local Node:  TL at 0; Snoop BRDCST 0 to 49
  Remote Node: TL 49 to 56; DF 49 to 63; RSP + CMB 56 to 105; Data Xmit 63 to 112
  Memory:      Memory Access 35 to 105 (91 also marked); Data Xmit 105 to 140
Power
  Remote Node supplies the data: 29 Plink + 18 Pswitch + 3 Ptag + 3 Pcache + Pmem
  Memory supplies the data:      20 Plink + 11 Pswitch + 3 Ptag + 3 Pcache + Pmem
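
The two PS/SF/ST totals can be reproduced by adding up the per-phase link and switch terms from the preceding slides. The sketch below is a consistency check, not the authors' model. The grouping is an assumption: the remote-supply case is taken to count all three speculative remote transmissions and no memory transmission, and the memory-supply case the reverse. The constant names are hypothetical.

```python
from collections import Counter

# Per-phase costs as component counts, read off the PS/SF/ST slides.
BROADCAST = Counter(Plink=8, Pswitch=4)
TAG_LOOKUPS = Counter(Ptag=3)
# 3*(Pswitch + 2 Plink) + Plink + 2 Pswitch + 2 Plink
SNOOP_RESPONSE = Counter(Plink=3 * 2 + 1 + 2, Pswitch=3 * 1 + 2)
DATA_FETCHES = Counter(Pcache=3)
# 3*(4 Plink + 3 Pswitch): three remote nodes transmit speculatively
REMOTE_XMIT = Counter(Plink=3 * 4, Pswitch=3 * 3)
MEMORY_FETCH = Counter(Pmem=1)
MEMORY_XMIT = Counter(Plink=3, Pswitch=2)

# Assumption: these phases are paid regardless of who supplies the data.
COMMON = BROADCAST + TAG_LOOKUPS + SNOOP_RESPONSE + DATA_FETCHES + MEMORY_FETCH

remote_supplies = COMMON + REMOTE_XMIT
memory_supplies = COMMON + MEMORY_XMIT

if __name__ == "__main__":
    print(dict(remote_supplies))
    print(dict(memory_supplies))
```

With this grouping the sums come to 29 Plink + 18 Pswitch and 20 Plink + 11 Pswitch, matching the totals on the slide.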

Parallel Snoop Speculative Fetch Non-Speculative Transmit (PS/SF/NT)
Latency (ns)
  Local Node:  TL at 0; Snoop BRDCST 0 to 49
  Remote Node: TL 49 to 56; DF 49 to 63; RSP + CMB 56 to 105; Data Xmit 105 to 154
  Memory:      Memory Access 35 to 105 (91 also marked); Data Xmit 105 to 140
Power
  Remote Node supplies the data: 21 Plink + 12 Pswitch + 3 Ptag + Pcache + Pmem
  Memory supplies the data:      20 Plink + 11 Pswitch + 3 Ptag + 3 Pcache + Pmem

Parallel Snoop Non-Speculative Fetch Non-Speculative Transmit (PS/NF/NT)
Latency (ns)
  Local Node:  TL at 0; Snoop BRDCST 0 to 49
  Remote Node: TL from 49; RSP + CMB to 105; DF 105 to 119; Data Xmit 119 to 168
  Memory:      Memory Access 91 to 161; Data Xmit 161 to 196
Power
  Remote Node supplies the data: 21 Plink + 12 Pswitch + 3 Ptag + Pcache
  Memory supplies the data:      20 Plink + 11 Pswitch + 3 Ptag + Pmem

Serial Snooping
  • Avoids speculative transmission of snoop packets.
  • Check the nearest neighbor.
  • Data supplied with minimum latency and power.
  • Forward snoop to next level.
  • Search other half of tree.
  • Search leaf nodes serially.
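
The search order above can be sketched as a loop that probes one node at a time and falls back to memory. This is only an illustration for the four-processor tree with P1 as requestor. The order P2, P3, P4 follows the nearest, next-nearest and farthest cases on the later slides. The table `SNOOP_ORDER` and the function `serial_snoop` are hypothetical names, and switch forwarding and speculative memory access are not modelled.

```python
# Hypothetical probe order for requestor P1: nearest neighbor first,
# then the leaves of the other half of the tree, one at a time.
SNOOP_ORDER = {"P1": ["P2", "P3", "P4"]}


def serial_snoop(requester, holders):
    """Return (supplier, tag_lookups) for one load miss.

    holders is the set of nodes whose caches hold the requested line.
    """
    lookups = 0
    for node in SNOOP_ORDER[requester]:
        lookups += 1  # one remote tag lookup per node probed
        if node in holders:
            return node, lookups  # stop at the first hit; no further snoops
    return "MEMORY", lookups  # no cache had the line


if __name__ == "__main__":
    print(serial_snoop("P1", {"P2"}))
    print(serial_snoop("P1", {"P4"}))
    print(serial_snoop("P1", set()))
```

The lookup counts agree with the Ptag terms on the power slides: one for the nearest node, two for the next-nearest, and three for the farthest node or memory.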

Serial Snooping: Features
  • Latency to satisfy a request depends on the distance from the requestor.
  • Data resident at the nearest neighbor is supplied with the lowest latency and power.
  • Requests are visible to the memory controller only at the root node.
  • Latency is adversely affected when the requested data is present at the farthest node.
  • Worst-case power consumption is still less than with parallel snooping.

Serial Snooping: Request satisfied by Nearest Node
Latency (ns)
  P1: TL at 0; SNP 0 to 21
  P2: TL 21 to 28; RSP 28 to 49; DF 21 to 35; Data Xmit 35 to 56
Power
  Xmit Snoop: 2 Plink + Pswitch
  P2 Tag access and snoop response: Ptag + 2 Plink + Pswitch
  P2 Data Fetch and Xmit: Pcache + 2 Plink + Pswitch
  Ptotal = 6 Plink + 3 Pswitch + Ptag + Pcache

Serial Snooping: Request satisfied by Next-Nearest Neighbor
[Timing diagram: P1 TL and SNP; P2 TL and RSP; P3 TL, DF, and Xmit; Memory Access and Xmit. Labelled end times: P3 Xmit at 140 ns, Memory Xmit at 168 ns.]
Power
  16 Plink + 10 Pswitch + 2 Ptag + Pcache

Serial Snooping: Request satisfied by farthest node
[Timing diagram: P1 TL and Xmit; P2, P3, and P4 TL and RSP in turn; P4 DF and Xmit; Memory Access and Xmit for the Spec-Memory and Non-Spec-Memory cases. Labelled end times: P4 Xmit at 168 ns, Spec-Memory Xmit at 182 ns, Non-Spec-Memory Xmit at 252 ns.]
Power
  Remote Node supplies the data: 18 Plink + 11 Pswitch + 3 Ptag + Pcache
  Memory supplies the data:      17 Plink + 10 Pswitch + 3 Ptag + Pmem
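
The claim that worst-case serial snooping still uses less power than parallel snooping can be checked term by term from the totals on the slides, with no numeric values for Plink, Pswitch, and the rest. The sketch below is a hedged check. The helper `no_more_than` is a hypothetical name, and the component-wise comparison is a sufficient condition chosen for illustration rather than the authors' method.

```python
from collections import Counter

# Worst-case serial totals (request satisfied by the farthest node or by memory).
SERIAL_WORST = {
    "remote": Counter(Plink=18, Pswitch=11, Ptag=3, Pcache=1),
    "memory": Counter(Plink=17, Pswitch=10, Ptag=3, Pmem=1),
}

# Parallel snooping totals, copied from the three parallel summary slides.
PARALLEL = {
    "PS/SF/ST": {
        "remote": Counter(Plink=29, Pswitch=18, Ptag=3, Pcache=3, Pmem=1),
        "memory": Counter(Plink=20, Pswitch=11, Ptag=3, Pcache=3, Pmem=1),
    },
    "PS/SF/NT": {
        "remote": Counter(Plink=21, Pswitch=12, Ptag=3, Pcache=1, Pmem=1),
        "memory": Counter(Plink=20, Pswitch=11, Ptag=3, Pcache=3, Pmem=1),
    },
    "PS/NF/NT": {
        "remote": Counter(Plink=21, Pswitch=12, Ptag=3, Pcache=1),
        "memory": Counter(Plink=20, Pswitch=11, Ptag=3, Pmem=1),
    },
}


def no_more_than(a, b):
    """True if every component count in a is at most the count in b."""
    return all(a[key] <= b[key] for key in a)


if __name__ == "__main__":
    for scheme, cases in PARALLEL.items():
        for supplier, total in cases.items():
            ok = no_more_than(SERIAL_WORST[supplier], total)
            saved = dict(total - SERIAL_WORST[supplier])
            print(scheme, supplier, ok, saved)
```

Against PS/NF/NT, the least speculative parallel scheme, the worst-case serial request still saves 3 Plink + 1 Pswitch in both the remote-supply and memory-supply cases.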

RESULTS : Load Miss Distributions

RESULTS: Average Latencies to satisfy load misses

RESULTS: Relative Power Savings

CONCLUSIONS
  • Reducing the degree of speculation has potential for significant power savings.
  • Performance degradation is minimal for the set of benchmarks studied.
  • Serial Snooping with speculative memory fetch provides optimal latency and power consumption.

Future Work
  • Develop a detailed execution-driven power model.
  • Explore different interconnect topologies.
  • Examine the viability of adaptive mechanisms for protocol policy.

Serial Snooping: Nearest Neighbor
  Latency (ns): 0, 7, 14, 21
  Power: 2 Plink + Pswitch

Questions

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
[Timing diagram: Local Node TL and Snoop BRDCST; Memory Access. Tick labels (ns): 0, 35, 49, 91.]