Introduction to Multi-Processor Architectures Dr. Konstantinos Tatas
- Slides: 40
Outline • Why multiprocessor architectures? • Challenges • The communication problem • The cache coherence problem
Technology Process Evolution Node years: 2007/65 nm, 2010/45 nm, 2013/33 nm, 2016/23 nm
Why didn’t that happen?
THE MANY CORES ERA Source: International Technology Roadmap for Semiconductors, 2007 edition (http://www.itrs.net/)
CHALLENGES • Communication • Data coherence • Programming
SHARED ADDRESS SPACE COMMUNICATIONS
SYSTEM BUS
CROSS-BAR
MULTI-STAGES NETWORK ON CHIP
NoC-based MPSoC • Nodes – Processing Elements (PEs), such as CPUs, custom IPs, DSPs, etc. – Storage elements (embedded memory blocks) • Routers • Links • Network Interfaces (NIs) • Often a switch together with its host node is referred to as a tile.
NOC VS. “OFF-CHIP” NETWORKS What is Different? • Routers on Planar Grid Topology • Short Point-To-Point Links between routers • Unique VLSI Cost Sensitivity: – Area: Routers and Links – Power
NOC VS. “OFF-CHIP” NETWORKS • No legacy protocols to be compliant with … • No software: simple and hardware-efficient protocols • Different operating environment (no dynamic changes and failures)
AN NOC EXAMPLE • Source: ossum, Intel @ MPSoC’07
NOC TOPOLOGIES Regular topologies: general-purpose on-chip multiprocessors Custom topologies:
NOC VS. “OFF-CHIP” NETWORKS • Custom Network Design: you design what you need! – Example 1: Replace modules – Example 2: Adapt links
NOC COST SCALABILITY VS. ALTERNATIVES • Compare the cost of: – NoC – Non-Segmented Bus (NS-Bus) – Segmented Bus (S-Bus) – Point-To-Point (PTP)
NOC ROUTER
NoC Topologies • Regular/irregular • Direct/indirect – In a direct topology, each node has a direct point-to-point link to a subset of other nodes in the system, called neighboring nodes
2D Mesh • Simplest and most popular topology for NoCs. • Every switch, except those at the edges, is connected to four neighboring switches and one node.
2D Torus • Layout of a regular mesh, except that nodes at the edges are connected to switches at the opposite edge via wrap-around routing channels. • Every switch has five ports • The limitation of this topology is the long wrap-around connections, which can add delay
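The effect of the wrap-around channels can be illustrated by comparing minimum hop counts. The following is a minimal sketch, assuming switches are addressed by (x, y) grid coordinates; the function names are hypothetical and not from the slides.

```python
def mesh_hops(src, dst):
    """Minimum hop count between two switches in a 2D mesh (Manhattan distance)."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def torus_hops(src, dst, cols, rows):
    """Minimum hop count in a 2D torus: each dimension may take the wrap-around link."""
    (sx, sy), (dx, dy) = src, dst
    step_x = abs(sx - dx)
    step_y = abs(sy - dy)
    return min(step_x, cols - step_x) + min(step_y, rows - step_y)


# Corner to corner on a 4x4 grid
print(mesh_hops((0, 0), (3, 3)))         # 6
print(torus_hops((0, 0), (3, 3), 4, 4))  # 2
```

The hop count drops, but each wrap-around hop crosses a physically long wire, which is the limitation noted on the slide.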
Octagon • Well-established direct topology found in NoCs. • Ring of 8 nodes connected by 12 bi-directional links. • Links provide at most two-hop communication between any pair of nodes in the ring • Simple algorithms for fast yet efficient shortest-path routing. • In case a platform consists of more than eight nodes, the octagon is extended to multidimensional space
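One simple shortest-path scheme for a single octagon can be sketched as below. It assumes nodes are numbered 0 to 7 around the ring and that the four extra links join opposite nodes; the numbering and function names are illustrative assumptions, not taken from the slides.

```python
N = 8  # nodes in one octagon


def neighbors(node):
    """Two ring neighbours plus the node directly across."""
    return [(node + 1) % N, (node - 1) % N, (node + 4) % N]


def count_links():
    """Count distinct bi-directional links: 8 ring links + 4 cross links."""
    return len({frozenset((n, m)) for n in range(N) for m in neighbors(n)})


def next_hop(current, dest):
    """Choose the output link from the relative address (dest - current) mod 8."""
    rel = (dest - current) % N
    if rel == 0:
        return current            # already at the destination
    if rel in (1, 2):
        return (current + 1) % N  # clockwise
    if rel in (6, 7):
        return (current - 1) % N  # counter-clockwise
    return (current + 4) % N      # across the octagon


def route(src, dest):
    """List of nodes visited from src to dest, inclusive."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path


print(count_links())  # 12
print(route(0, 3))    # [0, 4, 3]
```

Each router needs only a subtraction and a small comparison, which is what makes the routing logic cheap in hardware.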
Fat-tree and butterfly fat-tree • Nodes are connected to external switches, and switches have point-to-point links to other switches. • Processing units and memory modules are assigned to the leaves of the tree. • Switches are placed at the vertices. • Communication involves climbing up and down some part of the tree. • A pair of coordinates $(l, p)$ is used to label each node, where $l$ denotes the node's level and $p$ gives its position within this level.
Polygon • widely accepted topology • packets travel in a loop from one router to the next. • We can add chords to the circle • if chords are inserted only between opposite routers, the topology is called a spidergon.
Star • central router in the middle of the star, • computational resources, or subnetworks, in the spikes of the star. • The capacity requirements of the central router are quite large, • significant possibility of congestion in the middle of the star
Flow Control • intra-switch • switch-to-switch – Buffered – Bufferless • end-to-end
ACK/NACK • Handshaking protocol • When a sender puts data on the link, it activates a VALID signal. • When the receiver is ready to consume the valid data, it activates the corresponding ACK signal. • If the data is corrupt or there is no buffer space to store it, a NACK signal is activated instead. • Upon receipt of a NACK, the sender resends flits starting from the one that was not acknowledged. • Inherently supports fault tolerance • Additional buffer space is required to keep sent flits in case retransmission is required.
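The retransmission behaviour can be sketched as a small simulation. This is a simplified model under stated assumptions: the sender transmits up to `window` flits before the responses arrive, the receiver answers in order, and it discards everything after the first rejected flit. The names `window` and `accepts` are hypothetical, and the loop only terminates if every flit is eventually accepted.

```python
def send_with_ack_nack(flits, window, accepts):
    """Return the sequence of flits that travel over the link.

    accepts(index, attempt) -> True means ACK, False means NACK
    (corrupt data or no buffer space at the receiver).
    """
    link_log = []
    attempts = {}
    base = 0  # oldest flit not yet acknowledged; flits[base:] stay buffered
    while base < len(flits):
        # Send a burst without waiting for responses.
        burst = range(base, min(base + window, len(flits)))
        for i in burst:
            link_log.append(flits[i])
        # Process the responses in order; stop at the first NACK.
        for i in burst:
            attempts[i] = attempts.get(i, 0) + 1
            if accepts(i, attempts[i]):
                base = i + 1   # ACK: the sender may release this flit
            else:
                break          # NACK: resend starting from flit i
    return link_log


# Flit 1 is rejected the first time it arrives.
log = send_with_ack_nack(list("abcde"), 3, lambda i, n: not (i == 1 and n == 1))
print(log)  # ['a', 'b', 'c', 'b', 'c', 'd', 'e']
```

The sender-side buffer holding `flits[base:]` is the extra storage cost mentioned on the slide.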
Stall/go • requires just two control wires • one going forward, signifying data availability, • one going backward and signaling either a condition of buffers filled ("STALL") or of buffers free ("GO")
Credit-based • The transmitter has a "credit" counter initialized to the number of empty buffer slots in the receiver, and decrements it every time a flit is sent. • The credit counter must be updated when the receiver consumes or forwards a flit and therefore frees buffer space. • A credit value is sent back to the transmitter to be added to the current value of the credit counter. • The transmitter stalls when the credit value is zero and resumes when its value increases again.
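The credit counter can be sketched as follows. This is a minimal model that ignores link latency and returns one credit per consumed flit; the class and method names are hypothetical.

```python
from collections import deque


class CreditLink:
    """Transmitter-side credit counter plus a model of the receiver buffer."""

    def __init__(self, receiver_slots):
        self.credits = receiver_slots  # initialised to the receiver's empty slots
        self.receiver_buffer = deque()

    def try_send(self, flit):
        """Send a flit if a credit is available; otherwise the transmitter stalls."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.receiver_buffer.append(flit)
        return True

    def receiver_consume(self):
        """Receiver consumes or forwards a flit and returns one credit."""
        flit = self.receiver_buffer.popleft()
        self.credits += 1
        return flit


link = CreditLink(receiver_slots=2)
print(link.try_send("f0"), link.try_send("f1"), link.try_send("f2"))  # True True False
link.receiver_consume()
print(link.try_send("f2"))  # True
```

Because the transmitter never sends without a credit, the receiver buffer cannot overflow and no flit has to be dropped or retransmitted for lack of space.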
NI Design • Logic required to connect the nodes to the NoC. • NIs can differ significantly depending on the nature of the node • Using an NI allows IPs and communication infrastructure to be designed independently • One end of an NI is connected to a router using the selected flow control protocol • The other end is connected to the node IP • Since most IPs are designed to communicate through a bus, the NI uses a bus interface • An NI is not simply a protocol adapter from a processor bus to a router port. • Ideally, the NI must offer the processing cores the view of a shared memory system, and the network itself should be transparent.
NI services • Adaptation services – packetization/depacketization – protocol conversion and clock domain crossing – the absolute minimum services required of the NI so that data can be sent and received on the NoC • Transaction reordering services • Error and flow control services – error detection and/or correction – request retransmission when required • Route computation services – Source routing • Upper layer services – Cache coherence
Typical NoC Packet Format • Header – routing and network control information – in the case of distributed routing, the information required is the destination and source addresses – in the case of source routing, the complete routing information is written – in the case of variable packet size, a length field is required • Payload • Tail – sequence number – error control fields such as Hamming code or CRC fields
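The header/payload/tail layout, together with the NI's packetization and depacketization, might look like the sketch below. The field widths, the use of CRC-32 from Python's `zlib`, and the function names are all assumptions made for illustration; the slides do not specify them.

```python
import struct
import zlib

# Hypothetical field widths, chosen only for this example.
HEADER = struct.Struct(">BBH")  # destination, source, payload length
TAIL = struct.Struct(">HI")     # sequence number, CRC-32


def packetize(dest, src, seq, payload):
    """Build header + payload + tail; the CRC covers header and payload."""
    header = HEADER.pack(dest, src, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + TAIL.pack(seq, crc)


def depacketize(packet):
    """Split a packet and check the CRC; raise if the packet is corrupt."""
    dest, src, length = HEADER.unpack_from(packet, 0)
    body_end = HEADER.size + length
    payload = packet[HEADER.size:body_end]
    seq, crc = TAIL.unpack_from(packet, body_end)
    if zlib.crc32(packet[:body_end]) != crc:
        raise ValueError("CRC mismatch: retransmission should be requested")
    return dest, src, seq, payload


pkt = packetize(dest=5, src=2, seq=7, payload=b"hello")
print(depacketize(pkt))  # (5, 2, 7, b'hello')
```

With source routing, the header would carry the full list of output ports instead of only the destination address.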
Source vs Distributed Routing • In source routing, the entire routing path is computed at the source and added to the packet header. – The routers do not make any routing decisions. • In distributed routing, the routing path is decided on a hop-by-hop basis at each router, even for deterministic routing algorithms. – The only information that must be present in the packet is the destination address. • The advantage of source routing is that it requires simple routers and can easily support irregular architectures. Its disadvantage is that it does not provide adaptiveness and requires more complex NIs and larger packets.
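The difference can be sketched with one deterministic algorithm used in both styles. The example assumes dimension-ordered XY routing on a 2D mesh with (x, y) addresses; this choice, the port names, and the function names are illustrative assumptions rather than something the slides prescribe.

```python
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_port(current, dest):
    """One router's decision in distributed routing: needs only the destination."""
    (cx, cy), (dx, dy) = current, dest
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "N" if dy > cy else "S"
    return "LOCAL"


def move(node, port):
    return (node[0] + STEP[port][0], node[1] + STEP[port][1])


def source_route(src, dest):
    """Source routing: the NI computes the full port list for the header."""
    ports, node = [], src
    while node != dest:
        port = xy_port(node, dest)
        ports.append(port)
        node = move(node, port)
    return ports


def forward_source_routed(src, ports):
    """Routers just follow the next port in the header; no decision is made."""
    node = src
    for port in ports:
        node = move(node, port)
    return node


def forward_distributed(src, dest):
    """Every router on the path recomputes the output port."""
    node = src
    while True:
        port = xy_port(node, dest)
        if port == "LOCAL":
            return node
        node = move(node, port)


ports = source_route((0, 0), (2, 1))
print(ports)                                 # ['E', 'E', 'N']
print(forward_source_routed((0, 0), ports))  # (2, 1)
print(forward_distributed((0, 0), (2, 1)))   # (2, 1)
```

The source-routed header grows with path length, while the distributed version keeps the packet small at the cost of routing logic in every router.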
Source vs Distributed Routing
Cache Coherence • Each processor has its own L1 cache • Main memory is shared • What happens if a processor modifies data (store) and another processor has an old (stale) copy in its L1 cache?
MESI cache coherence protocol • Modified: The cache has the only valid copy in the whole system. The data in main memory is invalid (out of date). A write-back operation will change this state to Exclusive. • Exclusive: The cache has the only valid copy of the block, but it has not been modified. The data in main memory is valid. A read operation from another processor will change the state to Shared. • Shared: Another processor can have the data in its cache memory, and both copies are up to date. • Invalid: The data in the cache is not valid. Either the data the processor requests is not in the cache (miss), or the local copy of this data is not valid because another processor has updated the corresponding memory position.
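The state changes for a single memory block can be sketched as below. This is a simplified model using the common textbook transitions: one state per processor, no data values, and no bus timing. The function names are hypothetical.

```python
MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"


def read(states, cpu):
    """Processor `cpu` loads the block."""
    if states[cpu] != INVALID:
        return  # read hit: no state change
    others_hold_copy = False
    for other, state in enumerate(states):
        if other != cpu and state != INVALID:
            others_hold_copy = True
            # A Modified holder writes back first; all holders end up Shared.
            states[other] = SHARED
    states[cpu] = SHARED if others_hold_copy else EXCLUSIVE


def write(states, cpu):
    """Processor `cpu` stores to the block: all other copies are invalidated."""
    for other in range(len(states)):
        if other != cpu:
            states[other] = INVALID
    states[cpu] = MODIFIED


def write_back(states, cpu):
    """Main memory is updated, so a Modified copy becomes Exclusive."""
    if states[cpu] == MODIFIED:
        states[cpu] = EXCLUSIVE


states = [INVALID, INVALID]  # two processors, one block
read(states, 0);       print(states)  # ['E', 'I']
read(states, 1);       print(states)  # ['S', 'S']
write(states, 0);      print(states)  # ['M', 'I']
write_back(states, 0); print(states)  # ['E', 'I']
```

The third step is the case raised on the Cache Coherence slide: the store by processor 0 forces processor 1's stale copy to Invalid, so its next load misses and fetches the current data.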