
Circuit-Switched Coherence
Natalie Enright Jerger*, Li-Shiuan Peh+, Mikko Lipasti*
*University of Wisconsin - Madison
+Princeton University
2nd IEEE International Symposium on Networks-on-Chip

Motivation
  • Network on Chip for general purpose multi-core
      • Replacing dedicated global wires
      • Efficient/scalable communication on-chip
  • Router latency overhead can be significant
      • Exploit application characteristics to lower latency
      • Co-design coherence protocol to match network functionality
6/11/2021 Natalie Enright Jerger - University of Wisconsin

Executive Summary
  • Hybrid Network
      • Interleaves circuit-switched and packet-switched flits
      • Optimize setup latency
      • Improve throughput over traditional circuit-switching
      • Reduce interconnect delay by up to 22%
  • Co-design cache coherence protocol
      • Improves performance by up to 17%

Switching Techniques
  • Packet Switching
      • Efficient bandwidth utilization
      • Router latency overhead
  • Circuit Switching
      • Poor bandwidth utilization
          • Stalled requests due to unavailable resources
      • Low latency
          • Avoids router overhead after circuit is established
  • Best of both worlds?
      • Efficient bandwidth utilization + low latency

Circuit-Switched Coherence
  • Two key observations
      • Commercial workloads are very sensitive to communication latency
      • Significant pair-wise sharing
  • Construct fast pair-wise circuits?
[Chart: Normalized Runtime (0.9 to 1.25) vs. Per Hop Delay (1, 3, 5, 7, 11) for Commercial and Scientific workloads]
Commercial Workloads: SPECjbb, SPECweb, TPC-H, TPC-W
Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace

Traditional Circuit Switching
[Chart: Normalized Cycle Counts (0.98 to 1.07) for SPECjbb, SPECweb, TPC-H, TPC-W, Barnes, Ocean, Radiosity, Raytrace]
  • Traditional circuit-switching hurts performance by up to ~7%
*Data collected for 16 in-order core chip multiprocessor

Circuit Switching Redesigned
  • Latency is critical
  • Utilize Circuit Switching for lower latency
      • A circuit connects resources across multiple hops to avoid router overhead
      • Traditional circuit-switching performs poorly
  • My contributions
      • Novel setup mechanism
      • Bandwidth stealing

Outline
  • Motivation
  • Router Design
      • Setup Mechanism
      • Bandwidth Stealing
  • Coherence Protocol Co-design
      • Pair-wise sharing
      • 3-hop optimization
      • Region prediction
  • Results
  • Conclusions

Traditional Circuit Switching Path Setup (with Acknowledgement)
[Diagram: node 0 sends a Configuration Probe to node 5; an Acknowledgement returns; Data then travels on the Circuit]
  • Significant latency overhead prior to data transfer
  • Other requests forced to wait for resources

Novel Circuit Setup Policy
[Diagram: node 0 sends a Configuration Packet toward node 5; Data A follows on the Circuit]
  • Overlap circuit setup with 1st data transfer
  • Reconfigure existing circuits if no unused links available
      • Allows piggy-backed request to always achieve low latency
      • Multiple circuit planes prevent frequent reconfiguration
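The difference between the two setup policies can be illustrated with a rough zero-load model. This is only a sketch: the per-hop cycle counts (`probe_hop`, `ack_hop`, `data_hop`, `config_hop`) and the one-cycle `lag` between the configuration packet and the data are hypothetical defaults chosen for illustration, not figures from the slides, and contention is ignored.

```python
def traditional_setup_latency(hops, probe_hop=4, ack_hop=4, data_hop=2):
    """Cycles until the first data arrives when the probe must be acknowledged.

    Assumed model: the probe crosses `hops` routers, the acknowledgement
    returns over the same distance, and only then does the data traverse
    the established circuit.
    """
    return hops * (probe_hop + ack_hop + data_hop)


def overlapped_setup_latency(hops, config_hop=4, lag=1):
    """Cycles until the first data arrives when setup overlaps the transfer.

    Assumed model: the data trails the configuration packet by `lag` cycles,
    so the first transfer costs about one configuration traversal.
    """
    return hops * config_hop + lag


if __name__ == "__main__":
    for hops in (1, 3, 5):
        print(hops, traditional_setup_latency(hops), overlapped_setup_latency(hops))
```

Under these assumed costs the acknowledged setup pays for three traversals before the first data arrives, while the overlapped setup pays for roughly one.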

Setup Network
  • Light-weight setup network
      • Narrow
          • Circuit plane identifier (2 bits) + Destination (4 bits)
      • Low Load
      • No virtual channels: small area footprint
  • Stores circuit configuration information
      • Multiple narrow circuit planes prevent frequent reconfiguration
  • Reconfiguration
      • Buffered, traverses packet-switched pipeline
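The setup message described on this slide carries a 2-bit circuit plane identifier and a 4-bit destination, which is enough for 4 planes and the 16 nodes of a 4x4 mesh. A minimal sketch of packing those two fields into 6 bits follows; the bit order (plane in the high bits) and the function names are assumptions made for illustration, since the slides do not give the wire format.

```python
PLANE_BITS = 2  # from the slide: circuit plane identifier
DEST_BITS = 4   # from the slide: destination node


def pack_config(plane, dest):
    """Pack (plane, dest) into a 6-bit word; field order is an assumption."""
    if not 0 <= plane < (1 << PLANE_BITS):
        raise ValueError("plane does not fit in 2 bits")
    if not 0 <= dest < (1 << DEST_BITS):
        raise ValueError("destination does not fit in 4 bits")
    return (plane << DEST_BITS) | dest


def unpack_config(word):
    """Recover (plane, dest) from a 6-bit word produced by pack_config."""
    plane = (word >> DEST_BITS) & ((1 << PLANE_BITS) - 1)
    dest = word & ((1 << DEST_BITS) - 1)
    return plane, dest


if __name__ == "__main__":
    word = pack_config(2, 5)
    print(format(word, "06b"), unpack_config(word))
```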

Packet-Switched Bandwidth Stealing
  • Remember: problem with traditional Circuit-Switching is poor bandwidth
      • Need to overcome this limitation
  • Hybrid Circuit-Switched Solution: Packet-switched messages snoop incoming links
      • When there are no circuit-switched messages on the link
      • A waiting packet-switched message can steal idle bandwidth
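Bandwidth stealing amounts to a per-cycle priority rule on each link: a circuit-switched flit always wins, and a buffered packet-switched flit uses the link only when the circuit is idle. The sketch below models one link under that rule; the function names and the flit representation (plain strings, `None` for an idle cycle) are hypothetical, and real routers would also account for credits and virtual channels.

```python
from collections import deque


def arbitrate_link(circuit_flit, packet_queue):
    """Pick the flit that uses the link this cycle.

    A circuit-switched flit has priority. If the circuit is idle
    (circuit_flit is None), the oldest waiting packet-switched flit
    steals the slot. Returns None when nothing is sent.
    """
    if circuit_flit is not None:
        return circuit_flit
    if packet_queue:
        return packet_queue.popleft()
    return None


def simulate(circuit_stream, packet_queue):
    """Run the arbiter over a sequence of per-cycle circuit inputs."""
    return [arbitrate_link(flit, packet_queue) for flit in circuit_stream]


if __name__ == "__main__":
    waiting = deque(["p0", "p1", "p2"])
    print(simulate(["c0", None, None, "c1"], waiting), list(waiting))
```

In this example the two idle circuit cycles are filled by `p0` and `p1`, while `p2` keeps waiting because the circuit reclaims the link.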

Hybrid Circuit-Switched Router Design
[Diagram: router with Inj, N, S, E, W input ports, each with a T block, feeding a Crossbar under control of the Allocators; output ports Ej, N, S, E, W]

HCS Pipeline
  • Circuit-switched messages: 1 stage
      Router: Switch Traversal | Link: Link Traversal
  • Packet-switched messages: 3 stages
      • Aggressive Speculation reduces stages
      Router: Buffer Write, Virtual Channel/Switch Allocation, Switch Traversal | Link: Link Traversal
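The stage counts on this slide translate directly into a zero-load latency estimate. The sketch below assumes one cycle per router stage and one cycle per link traversal (the slides give stage counts, not cycle times) and ignores contention, serialization, and injection/ejection; `zero_load_latency` is a name introduced here for illustration.

```python
CIRCUIT_ROUTER_STAGES = 1  # switch traversal only
PACKET_ROUTER_STAGES = 3   # buffer write, VC/switch allocation, switch traversal


def zero_load_latency(hops, router_stages, link_cycles=1):
    """Cycles to cross `hops` routers and links with no contention.

    Assumes each pipeline stage and each link traversal takes one cycle.
    """
    return hops * (router_stages + link_cycles)


if __name__ == "__main__":
    for hops in (1, 2, 4, 6):
        cs = zero_load_latency(hops, CIRCUIT_ROUTER_STAGES)
        ps = zero_load_latency(hops, PACKET_ROUTER_STAGES)
        print(f"{hops} hops: circuit={cs} cycles, packet={ps} cycles")
```

With these assumptions a circuit-switched flit spends 2 cycles per hop against 4 for a packet-switched flit, so the gap grows linearly with distance.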

Outline
  • Motivation
  • Router Design
      • Setup Mechanism
      • Bandwidth Stealing
  • Coherence Protocol Co-design
      • Pair-wise sharing
      • 3-hop optimization
      • Region prediction
  • Results
  • Conclusions

Sharing Characterization
  • Temporal sharing relationship: 67-76% of misses are serviced by the 2 cores most recently shared with
Commercial Workloads: SPECjbb, SPECweb, TPC-H, TPC-W
Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace

Directory Coherence
[Diagram: (1) node 1 sends Read A to the Directory; (2) the Directory sends Forward Read A to node 2; (3) node 2 sends Data Response A to node 1]
Directory
  Address   State                   Sharers
  A         Exclusive, then Shared  2, then 1, 2
  B         Shared                  1, 2

Coherence Protocol Co-Design
  • Goal: Better exploit circuits through coherence protocol
  • Modifications:
      • Allow a cache to send a request directly to another cache
      • Notify the directory in parallel
      • Prediction mechanism for pair-wise sharers
      • Directory is sole ordering point

Circuit-Switched Coherence Optimization
[Diagram: (1) node 1 sends Read A directly to node 2 and, in parallel, Update A to the Directory; (2) node 2 sends Data Response A to node 1; (3) the Directory sends Ack A]
Directory
  Address   State                   Sharers
  A         Exclusive, then Shared  2, then 1, 2
  B         Shared                  1, 2
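The benefit of sending the read directly to the predicted sharer can be illustrated by counting mesh links on the data path for both protocols. This is a sketch under stated assumptions: nodes are numbered row-major on the 4x4 mesh, routing is minimal (Manhattan distance), the prediction is correct, and the directory update and acknowledgement are treated as off the data path. All function names are hypothetical.

```python
def mesh_hops(src, dst, width=4):
    """Minimal link count between two nodes, assuming row-major numbering."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def baseline_data_path(requester, directory, owner):
    """Links crossed for requester -> directory -> owner -> requester."""
    return (mesh_hops(requester, directory)
            + mesh_hops(directory, owner)
            + mesh_hops(owner, requester))


def direct_data_path(requester, owner):
    """Links crossed when the read goes straight to the predicted owner."""
    return 2 * mesh_hops(requester, owner)


if __name__ == "__main__":
    # Requester at node 0, neighbouring owner at node 1, directory far away at node 15.
    print(baseline_data_path(0, 15, 1), direct_data_path(0, 1))
```

The example picks a case that favors the optimization (close sharer, distant directory); a misprediction, which this sketch does not model, would add a detour.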

Region Prediction
[Diagram: (1) node 1 misses on A[0] and sends the read to the Directory; (2) the Directory sends Forward Read A[0] to node 2; (3) node 2 sends Data Response A[0]; (4) Region A Update is recorded in node 1's Region Table; (5) node 1 sends Read A[1] directly to node 2]
Region Table
  Region    Predicted node
  A         2
  B         3
Directory
  Address   State     Sharers
  A[0]      Shared    1, 2
  A[1]      Shared    2
  • Each memory region spans 1 KB
      • Takes advantage of spatial and temporal sharing
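A region table of this kind can be sketched as a map from 1 KB region number to the node that last supplied data for that region. The class below is a simplified illustration: the name `RegionPredictor`, the unbounded dictionary, and the single predicted node per region are assumptions, since the slides state only the 1 KB region size and the use of spatial and temporal sharing.

```python
REGION_BYTES = 1024  # from the slide: each memory region spans 1 KB


class RegionPredictor:
    """Predicts which node to ask for a block, indexed by 1 KB region."""

    def __init__(self):
        self._table = {}  # region number -> node that last supplied data

    @staticmethod
    def region_of(address):
        """Region number of a byte address."""
        return address // REGION_BYTES

    def update(self, address, node):
        """Record that `node` supplied data for the region holding `address`."""
        self._table[self.region_of(address)] = node

    def predict(self, address):
        """Return the predicted node, or None to fall back to the directory."""
        return self._table.get(self.region_of(address))


if __name__ == "__main__":
    predictor = RegionPredictor()
    predictor.update(0x1000, node=2)   # data response for A[0] came from node 2
    print(predictor.predict(0x1040))   # A[1], same region
    print(predictor.predict(0x1400))   # different region, no prediction
```

A miss to another line in the same region is sent straight to the predicted node, while an address in an unseen region yields `None` and would go through the directory as usual.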

Simulation Methodology
  • PHARMSim
      • Full-system multi-core simulator
  • Detailed network level model
      • Cycle accurate router model
      • Flit-level contention modeled
  • More results in paper

Simulation Workloads
Commercial
  SPECjbb              Java server workload, 24 warehouse, 200 requests
  SPECweb              Web server, 300 requests
  TPC-W                Web e-commerce, 40 transactions
  TPC-H                Decision support system
Scientific
  Barnes-Hut           8k particles, full run
  Ocean                514 x 514, parallel phase
  Radiosity            Parallel phase
  Raytrace             Car input, parallel phase
Synthetic
  Uniform Random       Destination select with uniform random distribution
  Permutation Traffic  Each node communicates with one other node (pair-wise)

Simulation Configuration
Processors
  Cores                     16 in-order general purpose
Memory System
  L1 I/D Caches             32 KB, 2-way set associative, 1 cycle
  Private L2 Caches         512 KB, 4-way set associative, 6 cycles, 64 Byte lines
  Shared L3 Cache           16 MB (1 MB bank/tile), 4-way set associative, 12 cycles
  Main Memory Latency       100 cycles
Interconnect: 4x4 2-D Mesh
  Packet-switched baseline  Optimized 1-3 router stages, 4 Virtual channels with 4 Buffers each
  Hybrid Circuit Switching  1 router stage, 2 or 4 Circuit planes

Network Results
[Chart: Normalized Delay (0.7 to 1) for HCS with 2 Circuits and HCS with 4 Circuits, across Barnes, Ocean, Radiosity, Raytrace, SPECjbb, SPECweb, TPC-H, TPC-W]
  • Communication latency is key: shave off precious cycles in network latency

Flit breakdown
[Chart: per-workload split of flits into CS and Partial for Barnes, Ocean, Radiosity, Raytrace, SPECjbb, SPECweb, TPC-H, TPC-W, with label pairs ranging from 56.0%/44.0% to 82.3%/17.7%; y-axis: Cycles (0 to 9)]
  • Reduce interconnect latency for a significant fraction of messages

HCS + Protocol Optimization
[Chart: Performance Improvement (0.9 to 1.2), stacked as Interconnect and Protocol Optimization, with PS and HCS bars for Barnes, Ocean, Radiosity, Raytrace, SPECjbb, SPECweb, TPC-H, TPC-W]
  • Improvement of HCS + Protocol Optimization is greater than the sum of the improvements from HCS and Protocol Optimization alone
      • Protocol Optimization drives up circuit reuse, better utilizing HCS

Uniform Random Traffic
[Chart: Interconnect Latency (5 to 12) vs. Load (% Link Capacity, 0% to 50%) for HCS and PS]
  • HCS successfully overcomes bandwidth limitations associated with Circuit Switching

Related Work
  • Router optimizations
      • Express Virtual Channels [Kumar, ISCA 2007]
      • Single-cycle router [Mullins, ISCA 2004]
      • Many more…
  • Hybrid Circuit-Switching
      • Wave-switching [Duato, ICPP 1996]
      • SoCBus [Wiklund, IPDPS 2003]
  • Coherence Protocols
      • Significant research in removing overhead of indirection

Circuit-Switched Coherence Summary
  • Replace packet-switched mesh with hybrid circuit-switched mesh
      • Interleave circuit and packet switched flits
      • Reconfigurable circuits
      • Dedicated bandwidth for frequent pair-wise sharers
  • Low Latency and low power
      • Avoid switching/routing
  • Devise novel coherence mechanisms to take advantage of benefits of circuit switching

Thank you
www.ece.wisc.edu/~pharm
enrightn@cae.wisc.edu

Circuit Setup
  • Novel Setup Policy
      • Overlap circuit setup with first data transfer
          • Store circuit information at each router
      • Reconfigure existing circuits if no unused links available
          • Allows piggy-backed request to always achieve low latency
          • Multiple narrow circuit planes prevent frequent reconfiguration
  • Reconfiguration
      • Buffered, traverses packet-switched pipeline