
Physical constraints (1/2)
• One of the physical constraints on the implementation of an interconnection network is the available wiring area.
• The bisection width is the minimum number of wires that must be cut when the network is divided into two equal sets of nodes.
• Even the failure of a single link can destroy the deadlock-freedom properties.
• A second constraint is the number of I/Os available per router, referred to as the node size.
Distributed Processing Theory 2 (No. 9)

Physical constraints (2/2)
Network | Bisection width | Node size
k-ary n-cube | $2Wk^{n-1}$ | $2Wn$
Binary n-cube | $NW/2$ | $nW$
n-D mesh | $Wk^{n-1}$ | $2Wn$
Omega network | $NW$ | $2tW$
Channels are W bits wide. The Omega network has N nodes and uses t×t switches. For the binary n-cube, $N = 2^n$.
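The table entries can be evaluated for concrete parameters. The following is a minimal Python sketch, not part of the original slides: the function name, the network labels, and the example parameters are assumptions chosen for illustration, and the binary n-cube bisection width is taken as NW/2 with N = 2^n.

```python
def bisection_and_node_size(network, W, k=None, n=None, N=None, t=None):
    """Return (bisection width, node size) in wires for W-bit channels.

    The network labels are hypothetical names for the rows of the table.
    """
    if network == "k-ary n-cube":
        return 2 * W * k ** (n - 1), 2 * W * n
    if network == "binary n-cube":
        # Assumption: N = 2**n nodes, so the bisection width is N*W/2.
        return (2 ** n) * W // 2, n * W
    if network == "n-D mesh":
        return W * k ** (n - 1), 2 * W * n
    if network == "omega":
        return N * W, 2 * t * W
    raise ValueError(f"unknown network: {network}")


if __name__ == "__main__":
    # Example values (k = 4, n = 3, W = 16) are illustrative only.
    print(bisection_and_node_size("k-ary n-cube", W=16, k=4, n=3))
    print(bisection_and_node_size("n-D mesh", W=16, k=4, n=3))
```

With the same k, n, and W, the mesh has half the bisection width of the k-ary n-cube because it lacks the wraparound channels.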

Bisection width (figure)

Hardware cost and speed model
• Chien [1993] proposed a cost and speed model for wormhole routers to compare their complexity and performance.
• He gave the gate count and delay of each router component, based on a canonical router model.
• Intrarouter delay can be parametrized by the number of ports on the crossbar switch and the routing freedom, with technology-dependent constants.

Canonical router model (figure)
Labels: input channels, injection channel, output channels, ejection channel, switch, routing and arbitration, LC.
LC: link controller

Buffer partitioning (figure)
Labels: input channels, injection channel, output channels, ejection channel, switch, routing and arbitration, LC, VC.
LC: link controller; VC: virtual channel controller

Alternative partitioning (figure)
Labels: input channels, injection channel, output channels, ejection channel, mux, switch, routing and arbitration, LC, VC.
LC: link controller; VC: virtual channel controller

Circular buffer (figure)
Buffer entries a to h arranged in a ring, with head and tail pointers.
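A circular buffer such as the one in the figure can be sketched as a fixed array with two wrapping indices. This is a minimal illustration, not taken from the slides: the class name is hypothetical, and it assumes that entries are written at the tail and read from the head.

```python
class CircularBuffer:
    """Fixed-capacity FIFO; head and tail indices wrap around the array."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # index of the oldest stored entry
        self.tail = 0    # index of the next free slot
        self.count = 0   # number of stored entries

    def push(self, item):
        if self.count == len(self.slots):
            raise OverflowError("buffer full")
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return item
```

Keeping an explicit count distinguishes a full buffer from an empty one, since head and tail are equal in both cases.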

A parametrization
Module | Parameter | Gate count | Delay
Crossbar | P (#ports) | $O(P^2)$ | $c_0 + c_1 \times \log P$
Flow control | - | $O(1)$ | $c_2$
Address decoder | - | $O(1)$ | $c_3$
Routing decision | F (freedom) | $O(F^2)$ | $c_4 + c_5 \times \log F$
Header selection | F (freedom) | $O(\log F)$ | $c_6 + c_7 \times \log F$
VC | V (#VCs) | $O(V)$ | $c_8 + c_9 \times \log V$
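The delay column can be evaluated once the constants are known. The sketch below is illustrative only: the function name is hypothetical, the constants c_0 to c_9 are technology dependent and must be supplied by the caller, and the logarithm is assumed to be base 2, which the table does not specify.

```python
import math


def router_delays(P, F, V, c):
    """Per-module delays from the parametrization table.

    P: crossbar ports, F: routing freedom, V: virtual channels,
    c: sequence of the ten constants c_0..c_9 (values are assumptions).
    """
    if len(c) != 10:
        raise ValueError("expected ten constants c_0..c_9")
    return {
        "crossbar": c[0] + c[1] * math.log2(P),
        "flow control": c[2],
        "address decoder": c[3],
        "routing decision": c[4] + c[5] * math.log2(F),
        "header selection": c[6] + c[7] * math.log2(F),
        "VC": c[8] + c[9] * math.log2(V),
    }


if __name__ == "__main__":
    # Placeholder constants 0..9, used only to show the call.
    for module, delay in router_delays(P=4, F=2, V=4, c=list(range(10))).items():
        print(module, delay)
```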

Pipelined routing
Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7
Head flit | RC | VA | SA | ST | | |
Body flit 1 | | | | SA | ST | |
Body flit 2 | | | | | SA | ST |
Tail flit | | | | | | SA | ST
RC: routing computation; VA: virtual channel allocation; SA: switch allocation; ST: switch traversal
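The stall-free timing of the four-stage pipeline can be generated programmatically. This is a minimal sketch under stated assumptions: the function name is hypothetical, each stage takes one cycle, and flits follow the head flit back to back.

```python
def stall_free_schedule(num_flits):
    """Map each flit index (0 = head flit) to the cycle it spends in each stage."""
    schedule = {0: {"RC": 1, "VA": 2, "SA": 3, "ST": 4}}
    for i in range(1, num_flits):
        # Body and tail flits reuse the route and virtual channel of the
        # head flit, so they only need switch allocation and traversal.
        schedule[i] = {"SA": 3 + i, "ST": 4 + i}
    return schedule


if __name__ == "__main__":
    for flit, stages in stall_free_schedule(4).items():
        print(flit, stages)
```

For a four-flit packet the tail flit traverses the switch in cycle 7, which matches the seven cycles shown on the slide.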

Pipeline stalls
• A pipeline stall occurs when a given pipeline stage cannot be completed in the current cycle.
• Stalls may occur at any stage, for example:
– No output port (virtual channel) is available
– The input buffer is empty
• The latency of an interconnection network is directly related to the pipeline depth.

Virtual-channel allocation stall (figure)
Timing diagram over cycles 1 to 9 showing the RC, VA, SA, and ST stages for head flit 1, body flit 1, tail flit 1, and tail flit 2.
Head flit 1 cannot be allocated a virtual channel until cycle 5, when the channel is reallocated.
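The effect of a virtual-channel allocation stall on packet timing can be sketched as follows. The function name is hypothetical, and the sketch assumes one cycle per stage, with the remaining flits following the head flit back to back once allocation succeeds.

```python
def schedule_with_va_stall(num_flits, va_grant_cycle):
    """Stage timing when the head flit wins a virtual channel in va_grant_cycle.

    The head flit retries VA every cycle from cycle 2 until it succeeds;
    "VA" records the cycle in which the allocation is finally granted.
    """
    if va_grant_cycle < 2:
        raise ValueError("VA cannot complete before cycle 2")
    head = {
        "RC": 1,
        "VA": va_grant_cycle,
        "SA": va_grant_cycle + 1,
        "ST": va_grant_cycle + 2,
    }
    schedule = {0: head}
    for i in range(1, num_flits):
        # Later flits are held back by the same number of cycles as the head.
        schedule[i] = {"SA": head["SA"] + i, "ST": head["ST"] + i}
    return schedule
```

With three flits and a grant in cycle 5, the tail flit traverses the switch in cycle 9, three cycles later than in the stall-free case.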

Physical channel (1/2)
• Half-duplex channels have the disadvantage that both sides of the link must arbitrate for the use of the link.
• Unidirectional channels double the average distance traveled by a message in tori.
• In full-duplex channels, the channel width in each direction is approximately halved, because bandwidth is statically allocated to each direction.

Physical channel (2/2)
• Lower dimensionality and communication traffic below network saturation tend to favor half-duplex channels.
• Cost considerations encourage low-cost packaging, which also favors half-duplex channels and the use of command/data encodings to reduce the overall pin count.
• Large systems with a higher number of dimensions (where wire delays dominate), higher clock speeds, communication-intensive applications, and pipelined links favor full-duplex channels.

A bidirectional half-duplex channel (figure)
Routers connected by data and control lines.
Fair arbitration requires status information (ownership) to be transmitted across the channel to indicate the availability of data to be transmitted.

A unidirectional channel (figure)
Routers connected by data and control lines.
Arbitration overhead is avoided, and therefore links can generally be run faster.

A bidirectional full-duplex channel (figure)
Routers connected by a data line and a control line in each direction.
When data are transmitted in only one direction across the channel, 50% of the pin bandwidth is unused.

A demand-driven mutual exclusion ring (figure)
Labels: ARB, VC, link control, ack, ready.
A signal representing the privilege to drive the physical channel circulates around the mutual exclusion ring.

Available buffer space
• Assume that the propagation delay is P ns per unit length and that the wire in the channel is L units long.
• When the receiver requests the sender to stop transmitting, there may be LPB phits in transit on a channel running at B Gphits/s.
• An additional LPB phits may be placed on the channel while the flow control signal propagates to the sender.
• If the flow control operation latency is F ns, the sender places another FB phits on the channel.
• Thus, the buffer space that must be available at the receiver is $2LPB + FB$ phits.
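The buffer requirement can be computed directly from the four parameters. The following sketch is illustrative: the function name and the example values are assumptions, and the result is rounded up to a whole number of phits.

```python
import math


def required_buffer_phits(L, P, B, F):
    """Buffer space (phits) needed when the receiver asks the sender to stop.

    L: wire length in units, P: propagation delay in ns per unit length,
    B: channel rate in Gphits/s (phits per ns), F: flow control latency in ns.
    """
    in_flight = L * P * B        # phits already on the wire
    during_signal = L * P * B    # phits sent while the stop signal propagates
    during_latency = F * B       # phits sent while the sender reacts
    return math.ceil(in_flight + during_signal + during_latency)


if __name__ == "__main__":
    # Example values are illustrative only.
    print(required_buffer_phits(L=2, P=5, B=1, F=10))
```

Because a rate of 1 Gphit/s is one phit per ns, the products LPB and FB are already in phits and no unit conversion is needed.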

Simultaneous bidirectional signaling
• It allows two routers to signal simultaneously across a single signal line (full-duplex bidirectional communication).
• A logic 1 (0) is transmitted as a positive (negative) current.
• The signal on the line is the superposition of the two signals transmitted from both sides of the channel.
• Each transmitter generates a reference signal, which is subtracted from the superimposed signal to recover the received signal.
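The subtraction step can be illustrated with an idealized model. This is a sketch only, not a circuit description: the function names are hypothetical, currents are normalized to +1 and -1, and noise, attenuation, and reflections are ignored.

```python
def to_current(bit):
    """Logic 1 is a positive unit current, logic 0 a negative unit current."""
    return 1 if bit else -1


def line_signal(bit_a, bit_b):
    """The shared line carries the superposition of both transmitted currents."""
    return to_current(bit_a) + to_current(bit_b)


def receive(line, own_bit):
    """Subtract the local reference (own transmitted current) to recover the remote bit."""
    remote_current = line - to_current(own_bit)
    return 1 if remote_current > 0 else 0


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            line = line_signal(a, b)
            print(a, b, line, receive(line, a), receive(line, b))
```

When the two sides send opposite values the line current is zero, yet each side still recovers the remote bit because it knows what it contributed.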

Separate crossbar (figure)
First crossbar: inputs X+, From node, and X-, each with AD and FC; outputs X+, To Yin, and X-.
Second crossbar: inputs Y+, Yin, and Y-, each with AD and FC; outputs Y+, To node, and Y-.

Planar-adaptive router (figure)
Labels: L1, L2, L3, L4, vc, crossbar, mux.

Cray T3D router (figure)
Labels: NI, X+, Y+, Z+, X-, Y-, Z-, xbar.

Message processing datapath on T3D (figure)
Labels: processor, local memory, network, data translation buffer, addressing, messaging support, msg queue control, input buf., output buf., interface, addressing and routing tag lookup.

Intel Cavallino router (figure)
Labels: X+, X-, Y+, Y-, Z+, Z-, VC0 to VC3, crossbar.

SGI SPIDER chip (figure)
Labels: I/F, crossbar, link I/F, VCs, msg cntl.
SPIDER is the first router to compute the route one step ahead.

Routing control for PCS (figure)
Labels: input VC, channel mappings, routing header decode, history store, decision unit, Inc/Dec banks, modified header, output VC.

Pipelined circuit switching (PCS)
• Routing protocols based on PCS are more complex to implement than those of wormhole-switched routers, because backtracking must be supported.
• Backtracking must use history information.
• When a routing header is backtracking, its history mask must be retrieved from the history store.
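One way to picture the history store is as a table of bitmasks, one per input virtual channel, recording which output channels a header has already searched. The sketch below is an assumption about how such a store could be organized, not a description of any actual PCS router: the class, its methods, and the bitmask encoding are all hypothetical.

```python
class HistoryStore:
    """Hypothetical history store: input VC id -> mask of searched output channels."""

    def __init__(self):
        self._masks = {}

    def mark_tried(self, vc, channel):
        """Record that the header on this VC has searched the given output channel."""
        self._masks[vc] = self._masks.get(vc, 0) | (1 << channel)

    def mask(self, vc):
        """Retrieve the history mask, as needed when a header backtracks."""
        return self._masks.get(vc, 0)

    def next_untried(self, vc, candidates):
        """Return the first candidate channel not yet searched, or None."""
        history = self.mask(vc)
        for channel in candidates:
            if not history & (1 << channel):
                return channel
        return None  # every candidate was searched: backtrack one more hop

    def release(self, vc):
        """Forget the history when the path is set up or torn down."""
        self._masks.pop(vc, None)
```

A header that returns to a router consults the mask so that it does not search the same output channel twice.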

Buffered wormhole (IBM SP family) (figure)
Labels: FIFO, routing, FC, deserializer, central queue, serializer, bypass crossbar; datapath widths 8 and 64.
Packets are buffered in the central queue when they fail to win access to the bypass crossbar (output port).