Physical constraints (1/2)
• One of the physical constraints facing the implementation of an interconnection network is the available wiring area.
• The minimum number of wires that must be cut when the network is divided into two equal sets of nodes is referred to as the bisection width.
• Even the failure of a single link can destroy the deadlock freedom properties.
• A second constraint is the number of I/Os available per router, which is referred to as the node size.
Distributed Processing Theory 2 (分散処理論2), No. 9

Physical constraints (2/2)
Network          Bisection width    Node size
k-ary n-cube     $2Wk^{n-1}$        $2Wn$
Binary n-cube    $NW/2$             $nW$
n-D mesh         $Wk^{n-1}$         $2Wn$
Omega net.       $NW$               $2tW$
W is the channel width in bits, N is the number of nodes, and the Omega network uses t×t switches.
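The table entries for the direct networks can be evaluated numerically. The sketch below is only an illustration: the function name and argument names are hypothetical, the binary n-cube entry is taken as NW/2 with N = 2^n nodes, and the Omega row is left out.

```python
def bisection_and_node_size(network, W, k=None, n=None):
    """Return (bisection_width, node_size) in wires.

    W: channel width in bits, k: radix, n: number of dimensions.
    The network names are labels chosen for this sketch.
    """
    if network == "k-ary n-cube":
        return 2 * W * k ** (n - 1), 2 * W * n
    if network == "binary n-cube":
        N = 2 ** n  # assumption: N = 2^n nodes
        return N * W // 2, n * W
    if network == "n-D mesh":
        return W * k ** (n - 1), 2 * W * n
    raise ValueError(f"unknown network: {network}")


if __name__ == "__main__":
    # A 256-node system built three ways with 16-bit channels.
    for name, args in [("k-ary n-cube", dict(k=16, n=2)),
                       ("binary n-cube", dict(n=8)),
                       ("n-D mesh", dict(k=16, n=2))]:
        print(name, bisection_and_node_size(name, W=16, **args))
```

For the same node count, the hypercube needs the widest bisection and the largest node size, which is the wiring and pin pressure the two constraints describe.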

Bisection width
[Figure: illustration of bisection width.]

Hardware cost and speed model
• Chien proposed a cost and speed model for wormhole routers to compare their complexity and performance [1993].
• He gave the gate count and delay of each router component, based on a canonical router model.
• Intrarouter delay can be parametrized by the number of ports on the crossbar switch and the routing freedom, with technology-dependent constants.

Canonical router model
[Figure: block diagram with input channels, injection channel, output channels, ejection channel, link controllers (LC), switch, and routing and arbitration unit.]
LC: link controller

Buffer partitioning
[Figure: block diagram with input channels, injection channel, output channels, ejection channel, link controllers (LC), virtual channel controllers (VC), switch, and routing and arbitration unit.]
LC: link controller, VC: virtual channel controller

Alternative partitioning
[Figure: block diagram with input channels, injection channel, output channels, ejection channel, link controllers (LC), multiplexers (mux), virtual channel controllers (VC), switch, and routing and arbitration unit.]
LC: link controller, VC: virtual channel controller

Circular buffer
[Figure: circular buffer holding entries a to h, with head and tail pointers.]
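A circular buffer with head and tail pointers can be sketched as follows. This is a minimal software illustration, not the hardware design on the slide: the class name, the method names, and the use of an explicit occupancy counter are assumptions.

```python
class CircularBuffer:
    """Fixed-size FIFO; head and tail wrap around the storage array."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # index of the oldest stored flit
        self.tail = 0    # index of the next free slot
        self.count = 0   # number of stored flits

    def push(self, flit):
        """Store a flit at the tail; return False if the buffer is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = flit
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def pop(self):
        """Remove and return the flit at the head, or None if empty."""
        if self.count == 0:
            return None
        flit = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return flit


if __name__ == "__main__":
    buf = CircularBuffer(4)
    for flit in "abcd":
        buf.push(flit)
    print(buf.pop())      # oldest flit leaves first
    buf.push("e")         # the tail wraps to slot 0
    print(buf.head, buf.tail)
```

Because both pointers only advance modulo the capacity, no flit is ever moved once it has been written.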

A parametrization
Module             Parameter     Gate count     Delay
Crossbar           P (#ports)    $O(P^2)$       $c_0 + c_1 \log P$
Flow control       -             $O(1)$         $c_2$
Address decoder    -             $O(1)$         $c_3$
Routing decision   F (freedom)   $O(F^2)$       $c_4 + c_5 \log F$
Header selection   F (freedom)   $O(\log F)$    $c_6 + c_7 \log F$
VC                 V (#VCs)      $O(V)$         $c_8 + c_9 \log V$
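The delay column can be evaluated once the constants are known. The sketch below assumes base-2 logarithms (the table only says "log") and uses placeholder constants of 1.0, which are not values from Chien's model; the function and key names are hypothetical.

```python
import math


def module_delays(P, F, V, c):
    """Per-module delay from the parametrization table.

    P: crossbar ports, F: routing freedom, V: virtual channels,
    c: ten technology-dependent constants c[0]..c[9].
    Assumption: logarithms are base 2.
    """
    return {
        "crossbar": c[0] + c[1] * math.log2(P),
        "flow_control": c[2],
        "address_decoder": c[3],
        "routing_decision": c[4] + c[5] * math.log2(F),
        "header_selection": c[6] + c[7] * math.log2(F),
        "virtual_channel": c[8] + c[9] * math.log2(V),
    }


if __name__ == "__main__":
    placeholder = [1.0] * 10  # illustrative only, not measured constants
    for module, delay in module_delays(P=8, F=2, V=4, c=placeholder).items():
        print(f"{module}: {delay}")
```

With equal constants the crossbar term grows fastest in this example because P is the largest parameter, which is why the port count matters for intrarouter delay.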

Pipelined routing
[Figure: pipeline timing diagram over cycles 1 to 7 for the head flit (RC, VA, SA, ST), body flits 1 and 2 (SA, ST each), and the tail flit.]
RC: routing computation, VA: virtual channel allocation, SA: switch allocation, ST: switch traversal
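The timing of an unstalled packet can be reproduced with a short sketch. It assumes one cycle per stage and that each following flit performs switch allocation one cycle after the flit ahead of it, which is consistent with the seven cycles in the diagram; the function name and flit labels are hypothetical.

```python
def flit_schedule(num_body_flits):
    """Return {flit: {stage: cycle}} for a packet that never stalls."""
    # Only the head flit needs routing computation and VC allocation.
    schedule = {"head": {"RC": 1, "VA": 2, "SA": 3, "ST": 4}}
    followers = [f"body {i}" for i in range(1, num_body_flits + 1)] + ["tail"]
    previous_sa = schedule["head"]["SA"]
    for flit in followers:
        sa = previous_sa + 1          # assumption: one flit per cycle
        schedule[flit] = {"SA": sa, "ST": sa + 1}
        previous_sa = sa
    return schedule


if __name__ == "__main__":
    for flit, stages in flit_schedule(2).items():
        print(flit, stages)
```

With two body flits, the tail flit traverses the switch in cycle 7, so the packet occupies the router for seven cycles.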

Pipeline stalls
• Pipeline stalls occur if a given pipeline stage cannot be completed in the current cycle.
• Stalls may occur at any stage.
  – No available output ports (virtual channels)
  – Input buffer is empty, etc.
• Latency of an interconnection network is directly related to pipeline depth.

Virtual-channel allocation stall
[Figure: pipeline timing diagram over cycles 1 to 9 with rows labeled head flit 1, body flit 1, tail flit 1, and tail flit 2.]
Head flit 1 is not able to allocate a virtual channel until cycle 5, when the channel is reallocated.

Physical channel (1/2)
• Half-duplex channels have the disadvantage that both sides of the link must arbitrate for the use of the link.
• Unidirectional channels double the average distance traveled by a message in tori.
• In full-duplex channels (one unidirectional channel per direction), channel widths are reduced, i.e., approximately halved, because bandwidth is statically allocated to each direction.
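The effect of unidirectional channels on distance can be checked on a single dimension of a torus, that is, a k-node ring. The sketch below is an illustration with a hypothetical function name; it averages over all destinations other than the source.

```python
def average_distance(k, bidirectional):
    """Average hop count from node 0 to every other node of a k-node ring."""
    total = 0
    for dest in range(1, k):
        forward = dest        # hops travelling in the + direction only
        backward = k - dest   # hops travelling in the - direction
        total += min(forward, backward) if bidirectional else forward
    return total / (k - 1)


if __name__ == "__main__":
    for k in (8, 101):
        uni = average_distance(k, bidirectional=False)
        bi = average_distance(k, bidirectional=True)
        print(k, uni, bi, uni / bi)
```

The ratio is below 2 for small rings and approaches 2 as k grows, so the doubling is best read as an approximation that tightens with ring size.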

Physical channel (2/2)
• Lower dimensionality and communication traffic below network saturation would tend to favor the use of half-duplex channels.
• Cost considerations would encourage the use of low-cost packaging, which would also favor half-duplex channels and the use of command/data encodings to reduce the overall pin count.
• Large systems with a higher number of dimensions (where wire delays dominate), higher clock speeds, communication-intensive applications, and pipelined links would favor the use of full-duplex channels.

A bidirectional half-duplex channel
[Figure: two routers connected by data and control lines.]
Fair arbitration requires status information (ownership) to be transmitted across the channel to indicate the availability of data to be transmitted.

A unidirectional channel
[Figure: two routers connected by data and control lines.]
Arbitration overhead is avoided, and therefore links can generally be run faster.

A bidirectional full-duplex channel
[Figure: two routers connected by a pair of data and control lines in each direction.]
When data are being transmitted in only one direction across the channel, 50% of the pin bandwidth is unused.

A demand-driven mutual exclusion ring
[Figure: ring of arbiters (ARB) attached to virtual channel controllers (VC) and the link control, with ready and ack signals.]
A signal representing the privilege to drive the physical channel circulates around the mutual exclusion ring.

Available buffer space
• Assume that the propagation delay is P ns per unit length and that the wire in the channel is L units long.
• When the receiver requests the sender to stop transmitting, there may be LPB phits in transit on a channel running at B Gphits/s.
• An additional LPB phits may be placed on the channel while the flow control signal propagates to the sender.
• If the flow control operation latency is F ns, the sender places another FB phits on the channel.
• Thus, the receiver needs available buffer space of at least $2LPB + FB$ phits.
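The bound can be worked through numerically. The sketch below uses the symbols from the bullets (L, P, B, F); the function name and the example values are assumptions chosen only for illustration, and the result is rounded up to whole phits.

```python
import math


def required_buffer_phits(L, P, B, F):
    """Buffer space (phits) needed when the receiver asks the sender to stop.

    L: wire length in units, P: propagation delay in ns per unit length,
    B: channel rate in Gphits/s (equal to phits per ns), F: flow control
    operation latency in ns.
    """
    in_transit = L * P * B            # phits already on the wire
    during_stop_signal = L * P * B    # phits sent while the stop signal travels
    during_latency = F * B            # phits sent while the sender reacts
    return math.ceil(in_transit + during_stop_signal + during_latency)


if __name__ == "__main__":
    # Hypothetical link: 10 units long, 0.5 ns per unit, 2 Gphits/s, F = 3 ns.
    print(required_buffer_phits(L=10, P=0.5, B=2, F=3))  # 2*10 + 6 = 26
```

Since ns multiplied by Gphits/s gives phits directly, no unit conversion is needed, and the requirement grows linearly with both wire length and channel rate.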

Simultaneous bidirectional signaling
• It allows simultaneous signaling between two routers across a single signal line (full-duplex bidirectional communication).
• A logic 1 is transmitted as a positive current and a logic 0 as a negative current.
• The signal on the line is the superposition of the two signals transmitted from both sides of the channel.
• Each transmitter generates a reference signal, which is subtracted from the superimposed signal to recover the signal sent by the other side.
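The superposition and subtraction steps can be illustrated with an idealized model. The sketch assumes that the two currents simply add on the line, with no noise or attenuation, and that the reference signal equals the transmitter's own drive current; the function names and the unit current are hypothetical.

```python
def drive(bit, current=1.0):
    """Current driven onto the line: positive for logic 1, negative for 0."""
    return current if bit else -current


def line_signal(bit_a, bit_b):
    """Superposition of the signals driven by both sides of the channel."""
    return drive(bit_a) + drive(bit_b)


def receive(line, own_bit):
    """Subtract the local reference signal to recover the remote bit."""
    remote = line - drive(own_bit)
    return 1 if remote > 0 else 0


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            line = line_signal(a, b)
            # Side A recovers b, side B recovers a, from the same line level.
            print(a, b, line, receive(line, a), receive(line, b))
```

The line takes three levels (both 1, both 0, or opposite bits), and each side resolves the ambiguous middle level using knowledge of what it sent.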

Separate crossbar
[Figure: inputs X+, X-, From node, Y+, Yin, and Y-, each with AD and FC blocks; one crossbar with outputs X+, X-, and To Yin, and a second crossbar with outputs Y+, Y-, and To node.]

Planar-adaptive router
[Figure: two crossbars with virtual channels (vc) and multiplexers (mux) on links L1 to L4.]

Cray T3D router
[Figure: crossbar (xbar) with ports X+, Y+, Z+, X-, Y-, Z- and NI blocks.]

Message processing datapath on T3D
[Figure: processor, local memory, data translation buffer, addressing and routing tag lookup, messaging support, message queue control, input and output buffers, and network interface.]

Intel Cavallino router
[Figure: crossbar with ports X+, X-, Y+, Y-, Z+, and Z-, with virtual channels VC0 to VC3.]

SGI SPIDER chip
[Figure: crossbar surrounded by interface (I/F) blocks, with link I/F, VCs, and message control.]
It is the first router to compute the route one step ahead.

Routing control for PCS
[Figure: input VC, channel mappings, routing header decode, history store, decision unit, Inc/Dec banks, modified header, and output VC.]

Pipelined circuit switching (PCS)
• Routers that implement PCS-based routing protocols are more complex than wormhole-switched routers because they must support backtracking.
• Backtracking must use history information.
• When a routing header is backtracking, its history mask must be retrieved from the history store.

Buffered wormhole (IBM SP family)
[Figure: FC, FIFO, routing, deserializer, central queue, serializer, and bypass crossbar blocks, with datapaths labeled 8 and 64.]
Packets are buffered in the central queue when they fail to win access to the bypass crossbar (output port).