The NoX Router: Mitchell Hayenga, Mikko Lipasti

The NoX Router
Mitchell Hayenga, Mikko Lipasti
Predictive High-Performance Architecture Research Mavens (PHARM), Department of ECE

Overview
• New low-latency router technique
  – Don't arbitrate or speculate! Encode.
• XOR Property: (A^B) ^ B = A
  – Hides arbitration latency
  – Eliminates dead cycles
• The NoX Router
  – Single-cycle/wormhole/mesh implementation
  – Frequency competitive with pure speculative
  – 2.7%-34.4% better ED² on application traces
  – Up to 9.9% better throughput on synthetic traffic
[Figure labels: Control, Input Channel, Switch Fabric]
The NoX Router, Micro '11

Motivation
• Modern On-Chip Networks
  – Bandwidth Plentiful, Latency Critical
  – Control
    • Complex, Speculative, Critical Path
  – Datapath
    • Fast, Simple, Wire-Dominated
• NoX Tradeoff
  – Marginal increase in datapath complexity
  – Hide control latency
[Figure: Virtual Channel Router Pipeline, Intel Teraflops, Router Evolution; pipeline stages shown as BW RC VA SA ST / BW NRC VA SA ST LT / VA NRC SA ST LT / LT]

Switch Arbitration Techniques
• Non-Speculative
  – Arbitration occurs before switch traversal
• Speculative Switch Traversal [Mullins ISCA 2004]
  – Assume contention doesn't happen
  – Wasted cycle in the event of contention
    • Arbiter decides what gets sent on the next cycle
[Figure: Control and Switch Fabric with inputs A and B; control output "B Wins"; fabric output "?" under contention]
[Timing diagram: signals clk, port 0, port 1, grant, valid out, data out over cycles 0-4; cases "No Contention" and "Contention"]

Switch Arbitration Techniques
• Non-Speculative
  – Arbitration occurs before switch traversal
• Speculative Switch Traversal [Mullins ISCA 2004]
  – Assume contention doesn't happen
  – Wasted cycle in the event of contention
    • Arbiter decides what gets sent on the next cycle
• Encoding
  – Blindly transmit, XOR within switch fabric
  – No contention - data sent unmodified
  – Contention - data sent XOR'd
    • Arbiter decides what was sent
[Figure: Control and Switch Fabric with inputs A and B; control output "B Wins"; fabric output A^B]
[Timing diagram: signals clk, port 0, port 1, grant, valid out, data out over cycles 0-4; data out carries B^A in the contended cycle; case "No Contention" labeled]
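
The encoding idea on this slide can be sketched in a few lines of Python. This is a minimal behavioral model, not the authors' hardware: the function name `xor_switch_output`, the integer flit payloads, and the fixed-priority grant order are assumptions made for illustration, and the model assumes no new flits arrive while a collision is being resolved.

```python
from functools import reduce
from operator import xor


def xor_switch_output(contenders):
    """Return the words driven on one output link, one per cycle.

    contenders: flit payloads (ints) requesting the same output in the same
    cycle, listed in the order a hypothetical fixed-priority arbiter would
    grant them (an assumption; the slides do not specify the arbiter).
    """
    pending = list(contenders)
    link_words = []
    while pending:
        # Every pending input drives the fabric, so the link carries the
        # XOR of all of them. A lone flit passes through unmodified.
        link_words.append(reduce(xor, pending))
        # The arbiter's grant only records which flit this word counts as;
        # the winner drops out and the losers transmit again next cycle.
        pending.pop(0)
    return link_words


if __name__ == "__main__":
    print([hex(w) for w in xor_switch_output([0xA5])])        # no contention
    print([hex(w) for w in xor_switch_output([0xA5, 0x3C])])  # contention
```

In this model a single requester goes out unchanged, while contenders `0xA5` and `0x3C` put `0x99` (their XOR) on the link followed by `0x3C`, so every cycle carries useful information.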

Receive Logic
• Works upon simple XOR property.
  – (A^B^C) ^ (B^C) = A
• Simple Decode
  – Always able to decode by XORing two sequential values
  – Maintains previous router's arbitration order/fairness
[Figure: Flit Buffer holding A^B^C, B^C, C with a "Coded" flag; decoded outputs A, B, C]
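
The decode rule on this slide can be illustrated with a short Python sketch. It assumes, hypothetically, that each received word comes with a one-bit "coded" flag telling the receiver whether the word is an XOR of several flits; the function name `decode_stream` and the list-based interface are illustrative only and do not come from the slides.

```python
def decode_stream(words, coded_flags):
    """Recover the original flits from the words seen on a link.

    words:       received link words, in arrival order (ints).
    coded_flags: coded_flags[i] is True when words[i] is an XOR of several
                 flits (an assumed sideband bit).
    """
    flits = []
    for i, word in enumerate(words):
        if coded_flags[i]:
            if i + 1 >= len(words):
                raise ValueError("coded word has no successor to decode against")
            # (A^B^C) ^ (B^C) = A: XOR with the next word strips the losers.
            flits.append(word ^ words[i + 1])
        else:
            # An uncoded word is already a plain flit.
            flits.append(word)
    return flits


if __name__ == "__main__":
    a, b, c = 0b001, 0b010, 0b100
    received = [a ^ b ^ c, b ^ c, c]
    print(decode_stream(received, [True, True, False]))
```

Flits come out in the order the upstream arbiter granted them, which is how this sketch reflects the slide's point about preserving arbitration order.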

Tradeoffs and Scaling
• Arbitration
  – O(log n) delay for most arbiters
• Decode logic
  – Constant with respect to # of ports
• Switch Fabric
  – XOR delay scales slightly worse than a mux/tristate-based solution
  – Maybe not an issue (control latency)
[Figure labels: Control, Input Channel, Switch Fabric]

The NoX Router
• Network of XORs
• Implementation Details
  – 8x8 Mesh, 2 mm long 64-bit links
  – Single Cycle (Router+Link)
  – Wormhole
  – Dimension ordered routing
  – Minimally buffered

Baseline Designs
• Non-Speculative
  – Serial arbitration & switch logic
  – Long cycle time
  – Efficient link utilization
• Speculative Techniques [Mullins ISCA 2004]
  – Hides arbitration latency
  – Potential for wasted link bandwidth
  – Spec-Fast & Spec-Accurate [Mullins ASP-DAC 2006]

Frequency Analysis
• Overheads present in all designs
  – 248 ps SRAM delay
  – 98 ps link latency

Architecture      Clock Period   %
Non-Speculative   0.92 ns        -
Spec-Fast         0.69 ns        33.3%
Spec-Accurate     0.72 ns        27.7%
NoX               0.76 ns        21.1%
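
The slide does not say how the "%" column is defined. One reading that appears to fit the numbers is the clock-frequency gain over the non-speculative baseline, i.e. (baseline period / design period - 1) x 100. The Python sketch below checks that reading; the names `BASELINE_NS`, `periods_ns` and `frequency_gain_pct` are illustrative, and the formula itself is an assumption rather than something stated in the slides.

```python
# Clock periods taken from the Frequency Analysis slide, in nanoseconds.
BASELINE_NS = 0.92  # Non-Speculative
periods_ns = {"Spec-Fast": 0.69, "Spec-Accurate": 0.72, "NoX": 0.76}


def frequency_gain_pct(period_ns, baseline_ns=BASELINE_NS):
    """Frequency gain over the baseline, in percent (assumed definition)."""
    return (baseline_ns / period_ns - 1.0) * 100.0


if __name__ == "__main__":
    for name, period in periods_ns.items():
        print(f"{name}: {frequency_gain_pct(period):.2f}% higher clock frequency")
```

Under this assumption the gains come out at roughly 33.3%, 27.8% and 21.1%. The Spec-Accurate figure differs from the slide's 27.7% only in the last digit, which would be consistent with truncation rather than rounding.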

Synthetic Traffic - Latency
[Figure: latency plots; x-axis: bandwidth (MB/s/node)]

Synthetic Traffic – ED²
[Figure: ED² plots; x-axis: bandwidth (MB/s/node)]

Application Traffic - Latency
[Figure: latency results for application traffic]

Application Traffic – ED²
[Figure: ED² results for application traffic]

Power @ Fixed Bandwidth
• Traffic Pattern
  – Uniform Random
  – 2 GB/s/node injection rate
    • Spec-Fast saturated
• Switch/Link glitching in speculative
• Marginal additional decode power
[Figure annotation: Decode negligible]

Area Floorplanning
• Standard Router
  – Ports 0-4: one 64 x 4 SRAM each
  – Crossbar
• NoX Router
  – Ports 0-4: one 64 x 4 SRAM each
  – XOR Switch
  – Decoding and Masking
• ~17% More Area
[Figure: two floorplans with dimension labels 140 µm, 70 µm, 28 µm, 101.0 µm, 102.2 µm, 161.2 µm]

Going Further
• Input Speedup
  – What if we could drive two values from an input buffer in a single cycle
  – Final decode step has 2 values available
    • Last packet sees no additional delay from contention at the previous router
• Multi-hop encoded forwarding
  – Requires additional sideband info
  – Allow new collisions with the “head” flit
  – Don't decode @ every hop, decode when packets diverge
[Figure: Switch Fabric and Flit Buffer with values A, B, A^B]

Conclusion
• New encoding-based low-latency router technique
  – Hides arbitration latency
  – Comparable frequency to speculative switch traversal techniques
  – Eliminates wasted interconnect bandwidth
  – Promising application to multiple router architectures

Thanks – Questions?

Virtual Channels
• Future Work
• Physical Channels vs. Virtual Channels
  – VC Router Benefits
    ✓ Dynamic bandwidth sharing (performance)
  – VC Router Negatives
    • Increased arbitration delay (performance)
    • Increased buffer energy (power)
    • Large unified crossbar (area, power)
• Possible but tradeoffs need to be re-evaluated
  – Structuring of input buffers/decode logic
  – VC credit accounting

Multi-Flit Support
• Current support is conservative
  – Performs similarly to speculative routers if multi-flit packets collide
  – Not all bad though
    • ~70% of packets are single-flit coherence packets
    • Only head-flit collisions matter
    • Requests all single-flit
• Alternatives
  – Fragment multi-flit packets
  – Provide sufficient buffering space