CS 258 Parallel Computer Architecture
Lecture 3: Introduction to Scalable Interconnection Network Design
January 30, 2008
Prof. John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258

Scalable, High Performance Network
• At the core of parallel computer architecture
• Requirements and trade-offs at many levels
  – Elegant mathematical structure
  – Deep relationships to algorithm structure
  – Managing many traffic flows
  – Electrical / optical link properties
  – Interactions across levels
  – Performance metrics? Cost metrics? Workload?
• Little consensus => need holistic understanding

Requirements from Above
• Communication-to-computation ratio => bandwidth that must be sustained for a given computational rate
  – traffic localized or dispersed?
  – bursty or uniform?
• Programming model
  – protocol
  – granularity of transfer
  – degree of overlap (slackness)
• The job of a parallel machine network is to transfer information from source node to destination node in support of the network transactions that realize the programming model

Goals
• Latency as small as possible
• As many concurrent transfers as possible
  – operation bandwidth
  – data bandwidth
• Cost as low as possible

Outline
• Introduction
• Basic concepts, definitions, performance perspective
• Organizational structure
• Topologies

Basic Definitions
• Network interface
  – Processor's (or programmer's) interface to the network
  – Mechanism for injecting packets into / removing packets from the network
• Links
  – Bundle of wires or fibers that carries a signal
  – May have separate wires for clocking
• Switches
  – Connect a fixed number of input channels to a fixed number of output channels
  – Can have a serious impact on latency, saturation, deadlock

Links and Channels
[Figure: a stream of symbols flowing from Transmitter to Receiver across a link]
• Transmitter converts a stream of digital symbols into a signal that is driven down the link
• Receiver converts it back
  – transmitter/receiver share a physical protocol
• Transmitter + link + receiver form a channel for digital information flow between switches
• Link-level protocol segments the stream of symbols into larger units: packets or messages (framing)
• Node-level protocol embeds commands for the destination communication assist within the packet

Clock Synchronization?
• Receiver must be synchronized to transmitter
  – To know when to latch data
• Fully Synchronous
  – Same clock and phase: Isochronous
  – Same clock, different phase: Mesochronous
    » High-speed serial links work this way
    » Use of encoding (8B/10B) to ensure a sufficient high-frequency component for clock recovery
• Fully Asynchronous
  – No clock: Request/Ack signals
  – Different clock: need some sort of clock recovery?
[Figure: asynchronous handshake timing — transmitter asserts Data and Req, receiver returns Ack, over times t0–t5]

Formalism
• A network is a graph V = {switches and nodes} connected by communication channels C ⊆ V × V
• A channel has width w and signaling rate f = 1/τ (τ is the cycle time)
  – channel bandwidth b = w·f
  – phit (physical unit): data transferred per cycle
  – flit: basic unit of flow control
• Number of input (output) channels is the switch degree
• Sequence of switches and links followed by a message is a route
• Think streets and intersections
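
As a quick worked example of these definitions (the numbers below are my own illustration, not the lecture's), this sketch computes the channel bandwidth b = w·f for a hypothetical link:

```python
# Hypothetical link parameters (assumed, not from the slides).
w = 16          # channel width in bits = phit size
f = 500e6       # signaling rate in Hz (1 / cycle time)
b = w * f       # channel bandwidth b = w*f, in bits per second

print(f"phit = {w} bits, b = {b / 8 / 1e6:.0f} MB/s")   # 1000 MB/s
```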

What Characterizes a Network?
• Topology (what)
  – physical interconnection structure of the network graph
  – direct: node connected to every switch
  – indirect: nodes connected to a specific subset of switches
• Routing Algorithm (which)
  – restricts the set of paths that messages may follow
  – many algorithms with different properties
    » gridlock avoidance?
• Switching Strategy (how)
  – how data in a message traverses a route
  – circuit switching vs. packet switching
• Flow Control Mechanism (when)
  – when a message or portions of it traverse a route
  – what happens when traffic is encountered?

Topological Properties
• Routing distance - number of links on a route
• Diameter - maximum routing distance between any two nodes
• Average distance
• A network is partitioned by a set of links if their removal disconnects the graph
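
To make these definitions concrete, here is a small sketch (mine, not from the lecture) that computes diameter and average distance for any topology given as an adjacency list, using breadth-first search; the example topology is a 6-node ring:

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop count (routing distance along shortest routes) from src to every node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_and_average(adj):
    """Diameter = max shortest-path distance; average taken over all ordered node pairs."""
    nodes = list(adj)
    diam, total, pairs = 0, 0, 0
    for s in nodes:
        d = bfs_distances(adj, s)
        for t in nodes:
            if t != s:
                diam = max(diam, d[t])
                total += d[t]
                pairs += 1
    return diam, total / pairs

# Example: a 6-node ring.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(diameter_and_average(ring))   # (3, 1.8) -> diameter N/2, average distance ~N/4
```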

Interconnection Topologies
• Class of networks scaling with N
• Logical properties:
  – distance, degree
• Physical properties:
  – length, width
• Fully connected network
  – diameter = 1
  – degree = N
  – cost?
    » bus => O(N), but BW is O(1) - actually worse
    » crossbar => O(N²) for BW O(N)
• VLSI technology determines switch degree

Example: Linear Arrays and Rings
• Linear Array
  – Diameter? Average distance? Bisection bandwidth?
  – Route A -> B given by relative address R = B - A
• Torus?
• Examples: FDDI, SCI, Fibre Channel Arbitrated Loop, KSR-1
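
A small sketch of the closed-form answers to the questions above, plus routing by relative address on a ring (these are the standard formulas; the diameter and bisection values reappear in the summary table at the end of the lecture):

```python
def linear_array(N):
    """Closed-form properties of an N-node linear array (path)."""
    return {"diameter": N - 1,
            "avg_dist": (N + 1) / 3,           # ~= N/3
            "bisection_links": 1}

def ring(N):
    """Closed-form properties of an N-node ring (1-D torus), N even."""
    return {"diameter": N // 2,
            "avg_dist": N * N / (4 * (N - 1)),  # ~= N/4
            "bisection_links": 2}

def ring_route(A, B, N):
    """Relative address R = (B - A) mod N; take the shorter way around the ring."""
    R = (B - A) % N
    return R if R <= N // 2 else R - N          # positive = one direction, negative = the other

print(linear_array(16), ring(16), ring_route(2, 14, 16))   # ring route: -4, i.e. 4 hops backwards
```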

Example: Multidimensional Meshes and Tori
[Figure: examples of a 2D grid and a 3D cube]
• d-dimensional array
  – N = k_{d-1} × ... × k_0 nodes
  – described by a d-vector of coordinates (i_{d-1}, ..., i_0)
• d-dimensional k-ary mesh: N = k^d
  – k = d√N (i.e., k = N^{1/d})
  – described by a d-vector of radix-k coordinates
• d-dimensional k-ary torus (or k-ary d-cube)?
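
The sketch below (my own illustration) enumerates the N = k^d nodes of a d-dimensional k-ary mesh or torus by their d-vector of radix-k coordinates and lists each node's neighbors, showing the difference the wrap-around links make:

```python
from itertools import product

def neighbors(coord, k, d, torus=False):
    """Neighbors of a node in a d-dimensional k-ary mesh (or torus if torus=True)."""
    coord = list(coord)
    result = []
    for dim in range(d):
        for step in (-1, +1):
            c = coord.copy()
            c[dim] += step
            if torus:
                c[dim] %= k                      # wrap-around link
                result.append(tuple(c))
            elif 0 <= c[dim] < k:                # mesh: no links past the edges
                result.append(tuple(c))
    return result

# 2-dimensional 4-ary example: N = k^d = 16 nodes.
k, d = 4, 2
nodes = list(product(range(k), repeat=d))
print(len(nodes))                                # 16
print(neighbors((0, 0), k, d, torus=False))      # 2 neighbors at a mesh corner
print(neighbors((0, 0), k, d, torus=True))       # 4 neighbors in the torus
```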

Properties of Meshes and Tori
• Routing
  – relative distance: R = (b_{d-1} - a_{d-1}, ..., b_0 - a_0)
  – traverse r_i = b_i - a_i hops in each dimension
  – dimension-order routing? Adaptive routing?
• Average Distance? Wire Length?
  – d × 2k/3 for mesh
  – dk/2 for cube
• Degree?
• Bisection bandwidth? Partitioning?
  – k^(d-1) bidirectional links
• Physical layout?
  – 2D in O(N) space, short wires
  – higher dimension?
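
A minimal sketch of the routing bullets above: computing the relative distance R and then applying dimension-order (e-cube) routing on a mesh, which corrects one dimension completely before moving on to the next:

```python
def relative_distance(a, b):
    """R = (b_i - a_i) per dimension; traverse |r_i| hops in dimension i."""
    return tuple(bi - ai for ai, bi in zip(a, b))

def dimension_order_route(a, b):
    """Dimension-order routing on a mesh: fix dimension 0 fully, then dimension 1, etc.
    Returns the list of nodes visited."""
    path, cur = [tuple(a)], list(a)
    for dim, r in enumerate(relative_distance(a, b)):
        step = 1 if r > 0 else -1
        for _ in range(abs(r)):
            cur[dim] += step
            path.append(tuple(cur))
    return path

print(relative_distance((1, 3), (4, 1)))      # (3, -2)
print(dimension_order_route((1, 3), (4, 1)))  # 3 hops in dimension 0, then 2 hops in dimension 1
```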

On Chip: Embeddings in Two Dimensions
[Figure: a 6 x 3 x 2 logical array embedded in a two-dimensional layout]
• Embed multiple logical dimensions in one physical dimension using long wires
• When embedding a higher-dimensional network in a lower-dimensional one, either some wires are longer than others, or all wires are long
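
A minimal sketch of the point above, using one possible layout of my own choosing (place the two z-planes of a 6 x 3 x 2 array side by side in physical x): links in the folded dimension become long wires while the others stay short.

```python
# Hypothetical layout: fold logical dimension 2 into physical x.
KX, KY, KZ = 6, 3, 2

def physical_position(x, y, z):
    return (x + KX * z, y)          # z-planes stacked side by side

def wire_length(a, b):
    """Manhattan length of the physical wire implementing a logical link a-b."""
    (ax, ay), (bx, by) = physical_position(*a), physical_position(*b)
    return abs(ax - bx) + abs(ay - by)

print(wire_length((0, 0, 0), (1, 0, 0)))   # dimension-0 link: length 1
print(wire_length((0, 0, 0), (0, 1, 0)))   # dimension-1 link: length 1
print(wire_length((0, 0, 0), (0, 0, 1)))   # dimension-2 link: length 6 (a long wire)
```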

Real Machines
• Wide links, smaller routing delay
• Tremendous variation

Administrivia
• First set of readings posted for Monday
  – "The Future of Wires," Ron Ho, Kenneth W. Mai, and Mark A. Horowitz
  – "An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes," Daniel H. Linder and Jim C. Harden

Typical Packet Format
• Two basic mechanisms for abstraction
  – encapsulation
  – fragmentation
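
Illustrative only: a hypothetical packet format (not the slide's exact layout) showing the two mechanisms — fragmentation splits a long message into fixed-size payloads, and encapsulation wraps each payload in an envelope (header + trailer):

```python
MAX_PAYLOAD = 64   # bytes per packet payload (assumed value)

def fragment_and_encapsulate(dest, message: bytes):
    """Split a message into packets, each carrying a header, a payload chunk, and a trailer."""
    packets = []
    chunks = [message[i:i + MAX_PAYLOAD] for i in range(0, len(message), MAX_PAYLOAD)]
    for seq, chunk in enumerate(chunks):
        header = {"dest": dest, "seq": seq, "last": seq == len(chunks) - 1}
        trailer = {"checksum": sum(chunk) & 0xFFFF}   # stand-in for an error-check code
        packets.append((header, chunk, trailer))
    return packets

pkts = fragment_and_encapsulate(dest=7, message=bytes(200))
print(len(pkts), [p[0]["seq"] for p in pkts])   # 4 packets, sequence numbers 0..3
```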

Communication Perf: Latency per hop
• Time(n)_{s-d} = overhead + routing delay + channel occupancy + contention delay
• Channel occupancy = (n + n_e)/b, where n is the data size and n_e the envelope (header/trailer) overhead
• Routing delay?
• Contention?
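
A minimal sketch of this latency model with illustrative parameter values (the numbers are assumptions, not measurements):

```python
def latency(n, n_e, b, overhead, routing_delay, contention=0.0):
    """Time(n)_{s-d} = overhead + routing delay + channel occupancy + contention,
    with channel occupancy = (n + n_e) / b."""
    occupancy = (n + n_e) / b
    return overhead + routing_delay + occupancy + contention

# e.g. 128-byte payload, 16-byte envelope, 500 MB/s channel, 1 us overhead, 200 ns routing delay
t = latency(n=128, n_e=16, b=500e6, overhead=1e-6, routing_delay=200e-9)
print(f"{t * 1e6:.3f} us")   # ~1.488 us
```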

Store-and-Forward vs. Cut-Through Routing
• Time: h(n/b + Δ/f) vs. n/b + h·Δ/f
• OR (cycles): h(n/w + Δ) vs. n/w + h·Δ
  (h = hops, n = message size, b = bandwidth, w = link width, Δ = routing delay per hop in cycles)
• What if the message is fragmented?
• Wormhole vs. virtual cut-through
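
A minimal sketch comparing the two expressions in cycles, with illustrative numbers: cut-through wins because the per-hop routing delay is paid h times but the serialization time n/w is paid only once.

```python
def store_and_forward_cycles(n, w, h, delta):
    return h * (n / w + delta)       # whole packet is received and forwarded at every hop

def cut_through_cycles(n, w, h, delta):
    return n / w + h * delta         # header is routed ahead; body pipelines behind it

n, w, h, delta = 1024, 16, 10, 4     # assumed: bits, bits/cycle, hops, cycles per hop
print(store_and_forward_cycles(n, w, h, delta))  # 680 cycles
print(cut_through_cycles(n, w, h, delta))        # 104 cycles
```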

Contention
• Two packets trying to use the same link at the same time
  – limited buffering
  – drop?
• Most parallel machine networks block in place
  – link-level flow control
  – tree saturation
• Closed system - offered load depends on delivered load
  – source squelching

Bandwidth
• What affects local bandwidth?
  – packet density: b × n/(n + n_e)
  – routing delay: b × n/(n + n_e + wΔ)
  – contention
    » at endpoints
    » within the network
• Aggregate bandwidth
  – bisection bandwidth
    » sum of the bandwidth of the smallest set of links that partition the network
  – total bandwidth of all the channels: C·b
  – suppose N hosts issue a packet every M cycles with average distance h
    » each message occupies h channels for l = n/w cycles each
    » C/N channels available per node
    » link utilization for store-and-forward: ρ = (h·l/M channel cycles per node)/(C/N) = N·h·l/(M·C) < 1!
    » link utilization for wormhole routing?
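
A minimal sketch of the store-and-forward link-utilization bound ρ = N·h·l/(M·C), with illustrative numbers (all assumed):

```python
def store_and_forward_utilization(N, M, h, n, w, C):
    """N hosts inject a packet of n bits every M cycles over w-bit links; each message
    occupies h channels for l = n/w cycles; the network has C channels in total."""
    l = n / w                                  # channel cycles per hop
    demand_per_node = h * l / M                # channel-cycles demanded per cycle by one node
    supply_per_node = C / N                    # channels available per node
    return demand_per_node / supply_per_node   # rho = N*h*l / (M*C); must stay below 1

# Assumed: 64 nodes, a packet every 200 cycles, 6 hops, 256-bit packets,
# 16-bit links, 256 channels in total.
print(store_and_forward_utilization(N=64, M=200, h=6, n=256, w=16, C=256))  # 0.12
```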

Saturation

Organizational Structure
• Processors
  – datapath + control logic
  – control logic determined by examining register transfers in the datapath
• Networks
  – links
  – switches
  – network interfaces

Link Design/Engineering Space
• Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces
• Short: single logical value at a time  vs.  Long: stream of logical values at a time
• Narrow: control, data and timing multiplexed on wire  vs.  Wide: control, data and timing on separate wires
• Synchronous: source & dest on same clock  vs.  Asynchronous: source encodes clock in signal

Example: Cray MPPs
• T3D: short, wide, synchronous (300 MB/s)
  – 24 bits
    » 16 data, 4 control, 4 reverse-direction flow control
  – single 150 MHz clock (including processor)
  – flit = phit = 16 bits
  – two control bits identify flit type (idle and framing)
    » no-info, routing tag, packet, end-of-packet
• T3E: long, wide, asynchronous (500 MB/s)
  – 14 bits, 375 MHz, LVDS
  – flit = 5 phits = 70 bits
    » 64 bits data + 6 control
  – switches operate at 75 MHz
  – framed into 1-word and 8-word read/write request packets
• Cray BlackWidow High-Radix Clos Network: long, narrow, mesochronous (800 MHz)
  – 3 bits each direction/port, 6.25 Gbps per link, clock recovery
  – 64 ports
  – high-radix butterfly / fat-tree network
• Cost = f(length, width)?

Switches

Switch Components
• Output ports
  – transmitter (typically drives clock and data)
• Input ports
  – synchronizer aligns data signal with local clock domain
  – essentially a FIFO buffer
• Crossbar
  – connects each input to any output
  – degree limited by area or pinout
• Buffering
• Control logic
  – complexity depends on routing logic and scheduling algorithm
  – determine output port for each incoming packet
  – arbitrate among inputs directed at the same output

Properties of Routing Algorithms
• Deterministic
  – route determined by (source, dest), not intermediate state (i.e., traffic)
• Adaptive
  – route influenced by traffic along the way
• Minimal
  – only selects shortest paths
• Deadlock free
  – no traffic pattern can lead to a situation where no packets move forward

Deadlock Freedom
• How can it arise?
  – necessary conditions:
    » shared resource
    » incrementally allocated
    » non-preemptible
  – think of a channel as a shared resource that is acquired incrementally
    » source buffer, then destination buffer
    » channels along a route
• How do you avoid it?
  – constrain how channel resources are allocated
  – ex: dimension order
• How do you prove that a routing algorithm is deadlock free?
  – More next time: see the Linder and Harden reading
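
One standard way to approach the last question, anticipating the Linder and Harden reading (my own sketch, not the lecture's): build the channel dependency graph induced by the routes, with an edge from channel c1 to c2 whenever some route requests c2 while holding c1; if that graph is acyclic, the routing is deadlock free.

```python
def channel_dependencies(routes):
    """routes: lists of nodes; channels are the directed links (u, v) along each route."""
    deps = {}
    for route in routes:
        channels = list(zip(route, route[1:]))
        for c1, c2 in zip(channels, channels[1:]):
            deps.setdefault(c1, set()).add(c2)   # c1 is held while c2 is requested
            deps.setdefault(c2, set())
    return deps

def has_cycle(deps):
    """Depth-first search for a cycle in the channel dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GREY
        for nxt in deps[c]:
            if color[nxt] == GREY or (color[nxt] == WHITE and dfs(nxt)):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in deps)

# Four messages routed head-to-tail around a 4-node ring: the classic deadlock pattern.
ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
print(has_cycle(channel_dependencies(ring_routes)))   # True -> deadlock is possible
```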

Summary

Topology        Degree      Diameter        Ave Dist      Bisection    D (D ave) @ P=1024
1D Array        2           N-1             N/3           1            huge
1D Ring         2           N/2             N/4           2
2D Mesh         4           2(N^1/2 - 1)    2/3 N^1/2     N^1/2        63 (21)
2D Torus        4           N^1/2           1/2 N^1/2     2 N^1/2      32 (16)
k-ary n-cube    2n          nk/2            nk/4                       15 (7.5) @ n=3
Hypercube       n = log N   n               n/2           N/2          10 (5)
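
A small sketch (mine) that reproduces the last column of the table for P = 1024 nodes, using the diameter and average-distance formulas from the other columns:

```python
from math import isqrt, log2

P = 1024
rows = {
    "2D Mesh":            (2 * (isqrt(P) - 1),           2 / 3 * isqrt(P)),
    "2D Torus":           (isqrt(P),                     0.5 * isqrt(P)),
    "k-ary n-cube (n=3)": (3 * round(P ** (1 / 3)) / 2,  3 * round(P ** (1 / 3)) / 4),
    "Hypercube":          (log2(P),                      log2(P) / 2),
}
for name, (diam, ave) in rows.items():
    print(f"{name:20s} {diam:.0f} ({ave:.1f})")
# 2D Mesh              62 (21.3)   (the slide's table lists 63 (21); it rounds differently)
# 2D Torus             32 (16.0)
# k-ary n-cube (n=3)   15 (7.5)
# Hypercube            10 (5.0)
```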