Introduction to Multi-Processor Architectures Dr. Konstantinos Tatas

Outline
• Why multiprocessor architectures?
• Challenges
• The communication problem
• The cache coherence problem

Technology Process Evolution Node years: 2007/65 nm, 2010/45 nm, 2013/33 nm, 2016/23 nm

Why didn’t that happen?

THE MANY CORES ERA
Source: International Technology Roadmap for Semiconductors, 2007 edition (http://www.itrs.net/)

CHALLENGES
• Communication
• Data coherence
• Programming

SHARED ADDRESS SPACE COMMUNICATIONS

SYSTEM BUS

CROSS-BAR

MULTI-STAGES NETWORK ON CHIP

NoC-based MPSoC
• Nodes
  – Processing elements (PEs), such as CPUs, custom IPs, DSPs, etc.
  – Storage elements (embedded memory blocks)
• Routers
• Links
• Network interfaces (NIs)
A switch together with its host node is often referred to as a tile.

NOC VS. “OFF-CHIP” NETWORKS
What is different?
• Routers on a planar grid topology
• Short point-to-point links between routers
• Unique VLSI cost sensitivity:
  – Area (routers and links)
  – Power

NOC VS. “OFF-CHIP” NETWORKS
• No legacy protocols to be compliant with
• No software, so protocols are simple and hardware-efficient
• Different operating environment (no dynamic changes and failures)

AN NOC EXAMPLE
• Source: ossum, Intel @ MPSoC’07

NOC TOPOLOGIES
• Regular topologies: general-purpose on-chip multiprocessors
• Custom topologies:

NOC VS. “OFF-CHIP” NETWORKS
Custom network design: you design what you need!
• Example 1: Replace modules
• Example 2: Adapt links

NOC COST SCALABILITY VS. ALTERNATIVES
• Compare the cost of:
  – NoC
  – Non-segmented bus (NS-Bus)
  – Segmented bus (S-Bus)
  – Point-to-point (PTP)

NOC ROUTER

NoC Topologies
• Regular/irregular
• Direct/indirect
  – In a direct topology, each node has a direct point-to-point link to a subset of other nodes in the system, called neighboring nodes.

2D Mesh
• Simplest and most popular topology for NoCs.
• Every switch, except those at the edges, is connected to four neighboring switches and one node.

2D Torus
• Layout of a regular mesh, except that nodes at the edges are connected to switches at the opposite edge via wrap-around routing channels.
• Every switch has five ports (four to neighboring switches and one to its node).
• The limitation of this topology is the long wrap-around connections.
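The difference between the mesh and the torus can be sketched by listing the neighbors of a switch. This is a minimal illustration, assuming switches are addressed by hypothetical (x, y) grid coordinates; the function names are not from the slides.

```python
def mesh_neighbors(x, y, cols, rows):
    """Switches adjacent to (x, y) in a 2D mesh (no wrap-around links)."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < cols and 0 <= cy < rows]


def torus_neighbors(x, y, cols, rows):
    """Same grid, but edge switches wrap around to the opposite edge."""
    return [((x + 1) % cols, y), ((x - 1) % cols, y),
            (x, (y + 1) % rows), (x, (y - 1) % rows)]


# Corner switch of a 4x4 grid: fewer links in the mesh, four in the torus.
print(mesh_neighbors(0, 0, 4, 4))
print(torus_neighbors(0, 0, 4, 4))
```

Adding the local node to the four switch-to-switch links gives the five ports per switch mentioned for the torus.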

Octagon
• Well-established direct topology found in NoCs.
• Ring of 8 nodes connected by 12 bi-directional links.
• The links provide communication between any pair of nodes in at most two hops.
• Simple algorithms for fast yet efficient shortest-path routing.
• If a platform has more than eight nodes, the octagon is extended to multidimensional space.
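The link count and the two-hop property can be checked with a small graph search. The sketch below assumes that the four links beyond the eight ring links join opposite nodes; the slides give only the totals, so this wiring is an assumption.

```python
from collections import deque


def octagon_links(n=8):
    """Ring links plus one link from each node to the opposite node."""
    ring = {frozenset((i, (i + 1) % n)) for i in range(n)}
    diagonals = {frozenset((i, (i + n // 2) % n)) for i in range(n)}
    return ring | diagonals


def hop_counts(links, source, n=8):
    """Breadth-first search: minimum hops from source to every node."""
    adjacency = {i: set() for i in range(n)}
    for link in links:
        a, b = tuple(link)
        adjacency[a].add(b)
        adjacency[b].add(a)
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


links = octagon_links()
worst_case = max(max(hop_counts(links, s).values()) for s in range(8))
print(len(links), worst_case)
```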

Fat-tree and butterfly fat-tree
• Nodes are connected to external switches, and switches have point-to-point links to other switches.
• Processing units and memory modules are assigned to the leaves of the tree.
• Switches are placed at the vertices.
• Communication involves climbing up and down some part of the tree.
• Each node is labeled with a pair of coordinates $(l, p)$, where $l$ denotes the node's level and $p$ gives its position within this level.
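The "climb up, then down" pattern can be sketched with the $(l, p)$ labels. The parent rule used here is an assumption for illustration only: leaves sit at level 0 and the parent of $(l, p)$ is $(l + 1, p // 4)$, which describes a plain 4-ary tree. A real fat-tree or butterfly fat-tree gives each switch more than one parent.

```python
def route_through_tree(src_pos, dst_pos, arity=4):
    """Return the (level, position) labels visited between two leaves.

    Assumption (not from the slides): leaves are at level 0 and the parent
    of (l, p) is (l + 1, p // arity), i.e. one parent per node.
    """
    up = [(0, src_pos)]
    down = [(0, dst_pos)]
    # Climb from both leaves until the two paths meet at a common ancestor.
    while up[-1] != down[-1]:
        level, pos = up[-1]
        up.append((level + 1, pos // arity))
        level, pos = down[-1]
        down.append((level + 1, pos // arity))
    # Skip the shared ancestor once, then walk the destination side downwards.
    return up + down[-2::-1]


print(route_through_tree(0, 3))   # leaves under the same switch
print(route_through_tree(0, 5))   # leaves under different level-1 switches
```

Leaves that share a low-level switch only climb one level, while distant leaves climb further before descending.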

Polygon
• Widely accepted topology.
• Packets travel in a loop from one router to the next.
• Chords can be added to the circle.
• If chords are inserted only between opposite routers, the topology is called a spidergon.
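The effect of the chords can be estimated with a hop-count formula. This is a rough sketch, assuming an even number of routers numbered 0 to n-1 around the loop and bidirectional links; the function names are hypothetical.

```python
def ring_hops(src, dst, n):
    """Hops between two routers on a plain polygon (ring) of n routers."""
    d = (dst - src) % n
    return min(d, n - d)


def spidergon_hops(src, dst, n):
    """Hops when every router also has a chord to the opposite router."""
    d = ring_hops(src, dst, n)
    # Either stay on the ring, or take one chord and finish on the ring.
    return min(d, 1 + (n // 2 - d))


n = 16
print(max(ring_hops(0, d, n) for d in range(n)))
print(max(spidergon_hops(0, d, n) for d in range(n)))
```

For 16 routers the worst-case distance drops from 8 hops on the plain ring to 4 hops with the chords.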

Star
• Central router in the middle of the star.
• Computational resources, or subnetworks, in the spikes of the star.
• The capacity requirements of the central router are quite large.
• Significant possibility of congestion in the middle of the star.

Flow Control
• Intra-switch
• Switch-to-switch
  – Buffered
  – Bufferless
• End-to-end

ACK/NACK
• Handshaking protocol.
• When a sender puts data on the link, it activates a VALID signal.
• When the receiver is ready to consume the valid data, it activates the corresponding ACK signal.
• If the data is corrupt or there is no buffer space to store it, a NACK signal is activated instead.
• Upon receipt of a NACK, the sender resends flits starting from the one that was not acknowledged.
• Inherently supports fault tolerance.
• Additional buffer space is required to keep sent flits in case retransmission is required.
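The retransmission behaviour can be sketched as a simple batch model. It assumes the sender puts up to `in_flight` flits on the link before any reply returns, and that a NACK discards the rejected flit and everything sent after it. The parameter `nack_attempts` (the transmission numbers the receiver rejects) is a hypothetical stand-in for corruption or a full buffer.

```python
def go_back_n(flits, nack_attempts, in_flight=3):
    """Return (flits accepted by the receiver, total transmissions)."""
    received = []
    transmissions = 0
    base = 0  # index of the oldest flit that has not been acknowledged
    while base < len(flits):
        # The sender keeps these flits buffered until they are ACKed.
        window = flits[base:base + in_flight]
        first_attempt = transmissions
        transmissions += len(window)
        accepted = 0
        for offset, flit in enumerate(window):
            if first_attempt + offset in nack_attempts:
                break  # NACK: this flit and all later ones must be resent
            received.append(flit)  # ACK: the receiver consumed the flit
            accepted += 1
        base += accepted
    return received, transmissions


print(go_back_n(list("ABCDE"), nack_attempts={1}))
```

The flits that were sent after the rejected one are transmitted twice, which is the bandwidth cost of the scheme; the sender-side buffer holding the window is the extra storage mentioned above.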

Stall/go
• Requires just two control wires:
  – one going forward, signaling data availability;
  – one going backward, signaling either that the buffers are full ("STALL") or that buffer space is free ("GO").
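A cycle-by-cycle sketch of one stall/go link follows. It assumes zero wire latency (the sender sees STALL in the same cycle the buffer fills), and `drain_cycles` is a hypothetical set of cycles on which the receiver removes one flit from its buffer.

```python
from collections import deque


def run_stall_go(flits, buffer_slots, drain_cycles, max_cycles=100):
    """Return (backward-wire trace, flits delivered by the receiver)."""
    pending = deque(flits)   # flits waiting at the sender
    buffer = deque()         # receiver input buffer
    delivered = []
    trace = []
    for cycle in range(max_cycles):
        if not pending and not buffer:
            break
        # Backward wire: STALL when the receiver buffer is full, GO otherwise.
        go = len(buffer) < buffer_slots
        trace.append("GO" if go else "STALL")
        # Forward wire: data is offered only if the sender has a flit and sees GO.
        if pending and go:
            buffer.append(pending.popleft())
        # Receiver consumes one buffered flit on the chosen cycles.
        if cycle in drain_cycles and buffer:
            delivered.append(buffer.popleft())
    return trace, delivered


trace, delivered = run_stall_go([1, 2, 3], buffer_slots=2, drain_cycles={3, 4, 5})
print(trace, delivered)
```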

Credit-based
• The transmitter has a "credit" counter, initialized to the number of empty buffer slots at the receiver, and decrements it every time a flit is sent.
• The credit counter must be updated when the receiver consumes or forwards a flit and therefore frees buffer space.
• A credit value is sent back to the transmitter and added to the current value of the credit counter.
• The transmitter stalls when the credit value is zero and resumes when the value increases again.
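The counter logic on the transmitter side can be sketched as a small class. The class and method names are hypothetical; only the counting rule comes from the description above.

```python
class CreditSender:
    """Transmitter-side credit counter for one link."""

    def __init__(self, receiver_slots):
        # Start with one credit per empty buffer slot at the receiver.
        self.credits = receiver_slots

    def can_send(self):
        # The transmitter stalls while the counter is zero.
        return self.credits > 0

    def send(self):
        if not self.can_send():
            raise RuntimeError("stalled: no credits left")
        self.credits -= 1  # one receiver slot is now occupied

    def add_credits(self, count):
        # Called when the receiver reports that it freed `count` slots.
        self.credits += count


tx = CreditSender(receiver_slots=2)
tx.send()
tx.send()
print(tx.can_send())   # counter is zero, so the transmitter stalls
tx.add_credits(1)
print(tx.can_send())   # a returned credit lets it resume
```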

NI Design
• The NI is the logic required to connect the nodes to the NoC.
• NIs can differ significantly depending on the nature of the node.
• Using an NI allows IPs and the communication infrastructure to be designed independently.
• One end of an NI is connected to a router using the selected flow control protocol; the other end is connected to the node IP.
• Since most IPs are designed to communicate through a bus, the NI uses a bus interface.
• The NI is not simply a protocol adapter from a processor bus to a router port.
• Ideally, the NI must offer the processing cores the view of a shared memory system, and the network itself should be transparent.

NI services
• Adaptation services
  – Packetization/depacketization
  – Protocol conversion and clock domain crossing
  – The absolute minimum services required of the NI so that data can be sent and received on the NoC
• Transaction reordering services
• Error and flow control services
  – Error detection and/or correction
  – Request retransmission when required
• Route computation services
  – Source routing
• Upper layer services
  – Cache coherence

Typical NoC Packet Format
• Header
  – Routing and network control information.
  – In the case of distributed routing, the information required is the destination and source addresses.
  – In the case of source routing, the complete routing information is written.
  – In the case of variable packet size, a length field is required.
• Payload
• Tail
  – Sequence number
  – Error control fields, such as Hamming code or CRC fields
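A header/payload/tail layout for distributed routing with variable-size packets might look like the sketch below. The field widths (8-bit addresses, 16-bit length, 16-bit sequence number) and the use of CRC-32 are assumptions chosen for illustration, not values from the slides.

```python
import struct
import zlib

HEADER = struct.Struct(">BBH")  # destination, source, payload length (assumed widths)
TAIL = struct.Struct(">HI")     # sequence number, CRC-32 (assumed widths)


def packetize(dst, src, seq, payload):
    """Build header + payload + tail; the CRC covers header and payload."""
    header = HEADER.pack(dst, src, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + TAIL.pack(seq, crc)


def depacketize(packet):
    """Split a packet again and verify its CRC field."""
    dst, src, length = HEADER.unpack_from(packet, 0)
    body_end = HEADER.size + length
    payload = packet[HEADER.size:body_end]
    seq, crc = TAIL.unpack_from(packet, body_end)
    if zlib.crc32(packet[:body_end]) != crc:
        raise ValueError("CRC mismatch: packet is corrupt")
    return dst, src, seq, payload


packet = packetize(dst=3, src=1, seq=7, payload=b"data")
print(depacketize(packet))
```

A receiver that gets the `ValueError` would be the point at which a retransmission request is issued.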

Source vs Distributed Routing
• In source routing, the entire routing path is computed at the source and appended to the packet.
  – The routers do not make any routing decisions.
• In distributed routing, the routing path is decided on a hop-by-hop basis at each router, even for deterministic routing algorithms.
  – The only information that must be present in the packet is the destination address.
• The advantage of source routing is that it requires simple routers and can easily support irregular architectures.
• Its disadvantage is that it does not provide adaptiveness and requires more complex NIs and packets.
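The two schemes can be contrasted on a 2D mesh. The sketch below uses dimension-ordered (XY) routing as the deterministic algorithm; XY routing is an assumption picked for illustration and is not named in the slides, as are the (x, y) addresses and function names.

```python
def xy_next_hop(current, dest):
    """Decision made inside one router: needs only the destination address."""
    (x, y), (dx, dy) = current, dest
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    if y != dy:
        return (x, y + (1 if dy > y else -1))
    return current  # already at the destination


def source_route(src, dest):
    """Source routing: the sending NI precomputes every hop for the header."""
    path, node = [], src
    while node != dest:
        node = xy_next_hop(node, dest)
        path.append(node)
    return path


def follow_source_route(src, route):
    """Routers make no decision; each one just reads the next header entry."""
    visited = [src]
    remaining = list(route)
    while remaining:
        visited.append(remaining.pop(0))
    return visited


def route_distributed(src, dest):
    """Each router on the way recomputes the next hop from the destination."""
    visited = [src]
    while visited[-1] != dest:
        visited.append(xy_next_hop(visited[-1], dest))
    return visited


print(source_route((0, 0), (2, 1)))
print(route_distributed((0, 0), (2, 1)))
```

Both variants visit the same switches here; the difference is where the computation happens and how much routing information the packet header has to carry.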

Source vs Distributed Routing

Cache Coherence
• Each processor has its own L1 cache.
• Main memory is shared.
• What happens if a processor modifies data (store) and another processor has an old (invalid) copy in its L1 cache?
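The problem can be reproduced with a toy model of two private caches and no coherence protocol. The `L1Cache` class is hypothetical, and write-through stores are assumed so that main memory itself is always up to date.

```python
class L1Cache:
    """Private cache with no coherence mechanism (toy model)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:          # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: use the local copy

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value           # write-through to main memory


def stale_read_demo():
    memory = {"x": 0}
    cache0, cache1 = L1Cache(memory), L1Cache(memory)
    cache0.load("x")
    cache1.load("x")         # both processors now cache x = 0
    cache0.store("x", 1)     # processor 0 modifies x
    return memory["x"], cache1.load("x")


print(stale_read_demo())
```

Main memory holds the new value, yet the second processor still reads its old copy, because nothing told its cache that the copy had become invalid.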

MESI cache coherence protocol
• Modified: The cache has the only valid copy in the whole system. The data in main memory are invalid (out of date). A write-back operation will change this state to Exclusive.
• Exclusive: The cache has the only cached copy of the block, and it has not been modified. The data in main memory are valid. A read operation from another processor will change the state to Shared.
• Shared: Another processor can have the data in its cache, and both copies are up to date.
• Invalid: The data in the cache are not valid. Either the data the processor requests are not in the cache (miss), or the local copy is not valid because another processor has updated the corresponding memory location.
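The state changes can be sketched for a single memory block tracked across several caches. The Modified-to-Exclusive write-back and the Exclusive-to-Shared transition follow the descriptions above; the remaining transitions follow the usual textbook MESI behaviour and should be read as assumptions, as should the function names.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def read(states, cpu):
    """Processor `cpu` loads the block; returns the new per-cache states."""
    states = list(states)
    if states[cpu] is State.I:  # read miss
        holders = [i for i, s in enumerate(states)
                   if i != cpu and s is not State.I]
        for i in holders:
            states[i] = State.S  # a Modified copy is written back first
        states[cpu] = State.S if holders else State.E
    return states


def write(states, cpu):
    """Processor `cpu` stores to the block: all other copies are invalidated."""
    states = [State.I] * len(states)
    states[cpu] = State.M
    return states


def write_back(states, cpu):
    """A Modified block is written to memory and becomes Exclusive."""
    states = list(states)
    if states[cpu] is State.M:
        states[cpu] = State.E
    return states


s = [State.I, State.I]
for step in (lambda x: read(x, 0), lambda x: read(x, 1),
             lambda x: write(x, 0), lambda x: write_back(x, 0)):
    s = step(s)
    print([state.name for state in s])
```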