Lecture 25: Interconnection Networks, Disks

• Topics: flow control, router microarchitecture, RAID

Virtual Channel Flow Control

• Each switch has multiple virtual channels per physical channel
• Each virtual channel keeps track of the output channel assigned to the head, and pointers to buffered packets
• A head flit must allocate the same three resources (virtual channel, buffer space, channel bandwidth) in the next switch before being forwarded
• With multiple virtual channels per physical channel, two different packets can share the channel, so the link is not wasted when one packet is idle

Example

• Wormhole: A is going from Node-1 to Node-4; B is going from Node-0 to Node-5; Node-5 is blocked (no free VCs/buffers), so B's stalled flits hold the channels and A sits idle behind them
• Virtual channel: A's and B's flits are interleaved on the shared physical channel, so A keeps making progress toward Node-4 even though B remains blocked
• Traffic analogy: B is trying to make a left turn; A is trying to go straight; there is no left-only lane with wormhole, but there is one with VC
• (Figures: the two cases drawn on a six-node network, Node-0 through Node-5)

Buffer Management

• Credit-based: keep track of the number of free buffers in the downstream node; the downstream node sends back signals to increment the count when a buffer is freed; need enough buffers to hide the round-trip latency (sketched below)
• On/Off: the downstream node sends back a signal when its buffers are close to being full – reduces upstream signaling and counters, but can waste buffer space
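A minimal sketch of the credit-based scheme above (hypothetical class and names, not from the lecture): the upstream port keeps a counter of free downstream buffers, decremented on every flit sent and incremented when a credit returns, and the buffer count must cover the credit round trip to keep the link busy.

```python
# Minimal sketch of credit-based flow control at one upstream output port.
# Assumption (not from the slides): one credit == one free flit buffer
# at the downstream router.

class CreditPort:
    def __init__(self, downstream_buffers):
        self.credits = downstream_buffers      # free buffers at the downstream node

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        assert self.can_send(), "stall: no free downstream buffers"
        self.credits -= 1                      # a downstream buffer is now occupied

    def credit_returned(self):
        self.credits += 1                      # downstream node freed a buffer

# To keep the link busy under full load, the downstream buffer count must
# cover the credit round trip: flit delivery time plus credit return time.
def min_buffers(round_trip_cycles, flits_per_cycle=1):
    return round_trip_cycles * flits_per_cycle
```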

Deadlock Avoidance with VCs

• VCs provide another way to number the links such that a route always uses ascending link numbers (a simple check for this condition is sketched below)
• (Figure: the same physical links numbered on multiple virtual-channel planes, e.g., 0–19 on the first plane and 100–119, 200–219 on the later planes)
• Alternatively, use West-first routing on the 1st plane and cross over to the 2nd plane in case you need to go West again (the 2nd plane uses North-last, for example)
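As a small illustration of the ascending-numbers condition (the link numbers here are hypothetical, not the exact ones in the figure), a routing function is deadlock-free under this scheme if every route it produces crosses links in strictly increasing order:

```python
# Hypothetical link numbers standing in for the per-plane numbering in the
# figure; the scheme is deadlock-free if every route crosses links in
# strictly ascending number order.

def uses_ascending_links(route_link_numbers):
    return all(a < b for a, b in zip(route_link_numbers, route_link_numbers[1:]))

print(uses_ascending_links([1, 2, 17, 18]))   # True: allowed on this plane
print(uses_ascending_links([18, 2, 17, 1]))   # False: must cross to another plane
```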

Router Functions

• Crossbar, buffer, arbiter, VC state and allocation, buffer management, ALUs, control logic
• Typical on-chip network power breakdown:
  § 30% link
  § 30% buffers
  § 30% crossbar

Virtual Channel Router

• Buffers and channels are allocated per flit
• Each physical channel is associated with multiple virtual channels – the virtual channels are allocated per packet and the flits of various VCs can be interleaved on the physical channel
• For a head flit to proceed, the router has to first allocate a virtual channel on the next router
• For any flit to proceed (including the head), the router has to allocate the following resources: buffer space in the next router (credits indicate the available space) and access to the physical channel
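A minimal sketch (hypothetical helper, not from the lecture) of the per-flit checks just listed: a head flit additionally needs a VC at the next router, and every flit needs a downstream buffer credit and access to the physical channel.

```python
# Hypothetical helper (not from the lecture) summarizing the per-flit checks:
# a head flit must first hold a VC at the next router; every flit also needs
# a buffer credit at the next router and access to the physical channel.

def can_advance(is_head, vc_allocated, credits, channel_granted):
    if is_head and not vc_allocated:
        return False           # head flit is still waiting on VC allocation
    if credits == 0:
        return False           # no buffer space at the next router
    return channel_granted     # must also win the switch/physical channel

print(can_advance(is_head=True, vc_allocated=False, credits=3, channel_granted=True))   # False
print(can_advance(is_head=False, vc_allocated=True, credits=1, channel_granted=True))   # True
```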

Router Pipeline

• Four typical stages:
  § RC routing computation: the head flit indicates the VC that it belongs to, the VC state is updated, the headers are examined, and the next output channel is computed (note: this is done for all head flits arriving on the various input channels)
  § VA virtual-channel allocation: the head flits compete for the available virtual channels on their computed output channels
  § SA switch allocation: a flit competes for access to its output physical channel
  § ST switch traversal: the flit is transmitted on the output channel
• A head flit goes through all four stages; the other flits do nothing in the first two stages (this is an in-order pipeline and flits cannot jump ahead); a tail flit also de-allocates the VC

Router Pipeline

• Four typical stages:
  § RC routing computation: compute the output channel
  § VA virtual-channel allocation: allocate VC for the head flit
  § SA switch allocation: compete for output physical channel
  § ST switch traversal: transfer data on output physical channel

  Cycle         1    2    3    4    5    6    7
  Head flit     RC   VA   SA   ST
  Body flit 1   --   --   --   SA   ST
  Body flit 2   --   --   --   --   SA   ST
  Tail flit     --   --   --   --   --   SA   ST

• A second example on the slide shows a stall: the head flit fails switch allocation once and repeats SA (RC VA SA SA ST), which delays every subsequent flit by a cycle
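The no-stall table can be regenerated with a few lines; this is a hypothetical sketch, not code from the lecture:

```python
# Hypothetical sketch that regenerates the no-stall table above: the head
# flit takes RC, VA, SA, ST, and each following flit only does SA and ST,
# crossing the switch one cycle behind its predecessor.

def pipeline(num_body_flits):
    rows = {"Head flit": ["RC", "VA", "SA", "ST"]}
    st_cycle = 4                                    # cycle in which the head flit does ST
    names = [f"Body flit {i + 1}" for i in range(num_body_flits)] + ["Tail flit"]
    for name in names:
        st_cycle += 1                               # one flit traverses the switch per cycle
        rows[name] = ["--"] * (st_cycle - 2) + ["SA", "ST"]
    return rows

for name, stages in pipeline(num_body_flits=2).items():
    print(f"{name:12s}", "  ".join(stages))
```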

Speculative Pipelines

• Perform VA and SA in parallel
• Note that SA only requires knowledge of the output physical channel, not the VC
• If VA fails, the successfully allocated channel goes unused

  Cycle         1    2       3    4    5    6
  Head flit     RC   VA+SA   ST
  Body flit 1   --   --      SA   ST
  Body flit 2   --   --      --   SA   ST
  Tail flit     --   --      --   --   SA   ST

• Perform VA, SA, and ST in parallel (can cause collisions and re-tries) – the head flit then needs only RC followed by VA+SA+ST
• Typically, VA is the critical path – can possibly perform SA and ST sequentially
• Router pipeline latency is a greater bottleneck when there is little contention
• When there is little contention, speculation will likely work well!
• Single-stage pipeline?

Recent Intel Router

• Used for a 6 x 6 mesh
• 16 B, > 3 GHz
• Wormhole with VC flow control
• Source: Partha Kundu, "On-Die Interconnects for Next-Generation CMPs", talk at On-Chip Interconnection Networks Workshop, Dec 2006

Recent Intel Router (additional figures)

• Source: Partha Kundu, "On-Die Interconnects for Next-Generation CMPs", talk at On-Chip Interconnection Networks Workshop, Dec 2006

Magnetic Disks

• A magnetic disk consists of 1-12 platters (metal or glass disks covered with magnetic recording material on both sides), with diameters between 1 and 3.5 inches
• Each platter is comprised of concentric tracks (5-30 K) and each track is divided into sectors (100-500 per track, each about 512 bytes)
• A movable arm holds the read/write heads for each disk surface and moves them all in tandem – a cylinder of data is accessible at a time
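A quick capacity calculation using hypothetical values drawn from the ranges above (4 platters, 20,000 tracks per surface, 300 sectors of 512 bytes each):

```python
# Worked capacity example with hypothetical values inside the ranges above:
# 4 platters (both sides used), 20,000 tracks per surface, 300 sectors per
# track, 512 bytes per sector.

platters = 4
surfaces = platters * 2                 # recording material on both sides
tracks_per_surface = 20_000
sectors_per_track = 300
sector_bytes = 512

capacity_bytes = surfaces * tracks_per_surface * sectors_per_track * sector_bytes
print(capacity_bytes / 1e9, "GB")       # ~24.6 GB under these assumptions
```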

Disk Latency

• To read/write data, the arm has to be placed on the correct track – this seek time usually takes 5 to 12 ms on average – can take less if there is spatial locality
• Rotational latency is the time taken to rotate the correct sector under the head – the average is typically more than 2 ms (15,000 RPM)
• Transfer time is the time taken to transfer a block of bits out of the disk – transfer rates are typically 3-65 MB/second
• A disk controller maintains a disk cache (spatial locality can be exploited) and sets up the transfer on the bus (controller overhead)
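Putting the components together, a worked example with hypothetical but representative numbers (7 ms average seek, 15,000 RPM, 40 MB/s transfer, 4 KB block, 0.5 ms controller overhead):

```python
# Worked access-time example with hypothetical values consistent with the
# ranges above: 7 ms average seek, 15,000 RPM, 40 MB/s transfer, 4 KB block,
# 0.5 ms controller overhead.

avg_seek_ms = 7.0
rpm = 15_000
avg_rotational_ms = (60_000 / rpm) / 2            # half a rotation on average = 2 ms
transfer_ms = (4 * 1024) / 40e6 * 1e3             # 4 KB at 40 MB/s ~ 0.1 ms
controller_ms = 0.5

total_ms = avg_seek_ms + avg_rotational_ms + transfer_ms + controller_ms
print(round(total_ms, 2), "ms")                   # ~9.6 ms, dominated by seek + rotation
```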

RAID

• Reliability and availability are important metrics for disks
• RAID: redundant array of inexpensive (independent) disks
• Redundancy can deal with one or more failures
• Each sector of a disk records check information that allows it to determine if the disk has an error or not (in other words, redundancy already exists within a disk)
• When the disk read flags an error, we turn elsewhere for correct data

RAID 0 and RAID 1

• RAID 0 has no additional redundancy (misnomer) – it uses an array of disks and stripes (interleaves) data across the array to improve parallelism and throughput
• RAID 1 mirrors or shadows every disk – every write happens to two disks
• Reads to the mirror may happen only when the primary disk fails – or, you may try to read both together and accept the quicker response
• Expensive solution: high reliability at twice the cost
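A minimal sketch of the two layouts (hypothetical helpers, not a real driver): RAID 0 maps each logical block to one disk, while RAID 1 writes every block to a primary disk and its mirror.

```python
# Hypothetical block-placement helpers (names and layout are illustrative,
# not a real driver): RAID 0 stripes blocks round-robin; RAID 1 here mirrors
# the first half of the array onto the second half.

def raid0_location(block, num_disks):
    return block % num_disks, block // num_disks        # (disk, offset on that disk)

def raid1_write_targets(block, num_disks):
    primary = block % (num_disks // 2)
    return primary, primary + num_disks // 2            # write both copies

print(raid0_location(block=10, num_disks=4))            # (2, 2)
print(raid1_write_targets(block=10, num_disks=4))       # (0, 2)
```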

RAID 3

• Data is bit-interleaved across several disks and a separate disk maintains parity information for a set of bits
• For example: with 8 disks, bit 0 is on disk-0, bit 1 is on disk-1, …, bit 7 is on disk-7; disk-8 maintains parity for all 8 bits
• For any read, 8 disks must be accessed (as we usually read more than a byte at a time) and for any write, 9 disks must be accessed as parity has to be re-calculated
• High throughput for a single request, low cost for redundancy (overhead: 12.5%), low task-level parallelism
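The parity mechanism can be illustrated in a few lines; this sketch uses byte granularity for readability even though RAID 3 interleaves at the bit level, and the values are made up:

```python
# Sketch of the parity idea, shown at byte granularity for readability
# (RAID 3 actually interleaves at the bit level): 8 data disks plus one
# parity disk, and the parity can rebuild any single failed disk.
from functools import reduce

data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]   # one byte per data disk
parity = reduce(lambda a, b: a ^ b, data)                  # stored on disk-8

# If disk-3 fails, XOR-ing the parity with the surviving disks recovers it.
recovered = reduce(lambda a, b: a ^ b, data[:3] + data[4:], parity)
assert recovered == data[3]
```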

RAID 4 and RAID 5

• Data is block-interleaved – this allows us to get all our data from a single disk on a read – in case of a disk error, read all 9 disks
• Block interleaving reduces throughput for a single request (as only a single disk drive is servicing the request), but improves task-level parallelism as other disk drives are free to service other requests
• On a write, we access the disk that stores the data and the parity disk – parity information can be updated simply by checking if the new data differs from the old data
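A sketch of that parity-update rule (illustrative values, not from the lecture): XOR-ing the old parity with the old and new data flips exactly the parity bits that need to change.

```python
# Sketch of the small-write parity update described above: the new parity
# follows from the old data, new data, and old parity alone, so only the
# data disk and the parity disk are touched. Values are illustrative.

def updated_parity(old_data, new_data, old_parity):
    return old_parity ^ old_data ^ new_data    # flip exactly the bits that changed

print(bin(updated_parity(old_data=0b1010, new_data=0b0110, old_parity=0b1111)))  # 0b11
```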

RAID 5

• If we have a single disk for parity, multiple writes cannot happen in parallel (as all writes must update parity info)
• RAID 5 distributes the parity blocks across the disks to allow simultaneous writes
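A sketch of one possible rotated-parity layout (a common left-symmetric placement; the lecture does not specify a particular mapping):

```python
# Sketch of rotated parity placement (one possible left-symmetric layout,
# not necessarily the one in the lecture): each stripe stores its parity
# block on a different disk, so independent writes can proceed in parallel.

def parity_disk(stripe, num_disks):
    return (num_disks - 1 - stripe) % num_disks

for stripe in range(5):
    print(f"stripe {stripe}: parity on disk {parity_disk(stripe, num_disks=5)}")
# stripe 0 -> disk 4, stripe 1 -> disk 3, ..., stripe 4 -> disk 0
```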

RAID Summary

• RAID 1-5 can tolerate a single fault – mirroring (RAID 1) has a 100% overhead, while parity (RAID 3, 4, 5) has modest overhead
• Can tolerate multiple faults by having multiple check functions – each additional check can cost an additional disk (RAID 6)
• RAID 6 and RAID 2 (memory-style ECC) are not commercially employed
