Networks-on-Chip
Ben Abdallah Abderazek
The University of Aizu, Graduate School of Computer Science and Engineering, Adaptive Systems Laboratory
E-mail: benab@u-aizu.ac.jp
03/01/2010
Hong Kong University of Science and Technology, March 2010
Part I
• Application requirements
• Network on Chip: a paradigm shift in VLSI
• Critical problems addressed by NoC
• Traffic abstractions
• Data abstraction
• Network delay modeling
Application Requirements
• Signal processing (typically on DSPs)
  - Hard real time
  - Very regular load
  - High quality
• Media processing (SoC/media processors)
  - Hard real time
  - Irregular load
  - High quality
• Multimedia (PC/desktop)
  - Soft real time
  - Irregular load
  - Limited quality
• Very challenging!
What Does the Internet Need?
• An increasingly huge amount of packets, plus routing, packet classification, encryption, QoS, new applications and protocols, etc.
• ASIC: large, expensive to develop, not flexible
• General-purpose RISC: not capable enough
• SoC, MCSoC?
  - High processing power
  - Support for wire speed
  - Programmable
  - Scalable
  - Specialized for network applications
Example - Network Processor (NP): IBM PowerNP
• 16 pico-processors and 1 PowerPC
• Each pico-processor
  - Supports 2 hardware threads
  - 3-stage pipeline: fetch/decode/execute
• Dyadic Processing Unit
  - Two pico-processors
  - 2 KB shared memory
  - Tree search engine
• Focus is layers 2-4
• PowerPC 405 for control plane operations
  - 16 K I and D caches
• Target is OC-48
Example - Network Processor (NP)
• An NP can be applied in various network layers and applications
  - Traditional apps: forwarding, classification
  - Advanced apps: transcoding, URL-based switching, security, etc.
  - New apps
Telecommunication Systems and NoC Paradigm
• The trend nowadays is to integrate telecommunication systems on complex multicore SoCs (MCSoC):
  - Network processors
  - Multimedia hubs
  - Baseband telecom circuits
• These applications have tight time-to-market and performance constraints
Telecommunication Systems and NoC Paradigm
• A telecommunication multicore SoC is composed of 4 kinds of components:
  1. Software tasks
  2. Processors executing software
  3. Specific hardware cores
  4. Global on-chip communication network (this is the most challenging part)
Technology & Architecture Trends
• Technology trends:
  - Vast transistor budgets
  - Relatively poor interconnect scaling
  - Need to manage complexity and power
  - Build flexible designs (multi-/general-purpose)
• Architectural trends:
  - Go parallel!
  - Keep core complexity constant or simplify
• Result is lots of modules (cores, memories, off-chip interfaces, specialized IP cores, etc.)
Wire Delay vs. Logic Delay
Operation | Delay (0.13 µm) | Delay (0.05 µm)
32-bit ALU operation | 650 ps | 250 ps
32-bit register read | 325 ps | 125 ps
Read 32 bits from 8 KB RAM | 780 ps | 300 ps
Transfer 32 bits across chip (10 mm) | 1400 ps | 2300 ps
Transfer 32 bits across chip (20 mm) | 2800 ps | 4600 ps
• Global on-chip communication to operation delay is 2:1 at 0.13 µm and 9:1 in 2010
Ref: W. J. Dally, HPCA panel presentation, 2002
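The 2:1 and 9:1 figures on this slide can be reproduced from the table as a small worked calculation. The sketch below assumes the ratio is taken between the 10 mm cross-chip transfer and the 32-bit ALU operation; the dictionary keys and the function name are illustrative only.

```python
# Delays in picoseconds, copied from the table on this slide.
DELAY_PS = {
    "0.13um": {"alu_32bit": 650, "cross_chip_10mm": 1400},
    "0.05um": {"alu_32bit": 250, "cross_chip_10mm": 2300},
}


def comm_to_op_ratio(node):
    """Ratio of a 10 mm cross-chip transfer to one 32-bit ALU operation."""
    d = DELAY_PS[node]
    return d["cross_chip_10mm"] / d["alu_32bit"]


if __name__ == "__main__":
    for node in DELAY_PS:
        print(node, round(comm_to_op_ratio(node), 1))
```

Under that assumption the ratios come out at roughly 2.2 and 9.2, which matches the rounded 2:1 and 9:1 on the slide: logic gets faster with scaling while the cross-chip wire gets slower.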
Communication Reliability
• Information transfer is inherently unreliable at the electrical level, due to:
  - Timing errors
  - Cross-talk
  - Electro-magnetic interference (EMI)
  - Soft errors
• The problem will get increasingly worse as technology scales down
Evolution of on-chip communication
Traditional SoC nightmare
• Variety of dedicated interfaces
• Design and verification complexity
• Unpredictable performance
• Many underutilized wires
[Figure: bus-based SoC; labels: DMA, CPU, DSP, control signals, CPU bus, modules A/B/C, bridge, peripheral bus, IO]
Network on Chip: A Paradigm Shift in VLSI
• From: dedicated signal wires
• To: shared network
[Figure: modules attached to a network of switches; labels: computing module, network switch, point-to-point link]
NoC essentials
• Communication by packets of bits
• Routing of packets through several hops, via switches
• Efficient sharing of wires
• Parallelism
Characteristics of a paradigm shift
• Solves a critical problem (we will look at the problem addressed by NoC)
• Step-up in abstraction
• Design is affected:
  - Design becomes more restricted
  - New tools
  - The changes enable higher complexity and capacity
  - Jump in design productivity
Origins of the NoC concept
• The idea was talked about in the 90's, but actual research came in the new millennium.
• Some well-known early publications:
  - Guerrier and Greiner (2000), "A generic architecture for on-chip packet-switched interconnections"
  - Hemani et al. (2000), "Network on chip: An architecture for billion transistor era"
  - Dally and Towles (2001), "Route packets, not wires: on-chip interconnection networks"
  - Wingard (2001), "MicroNetwork-based integration of SoCs"
  - Rijpkema, Goossens and Wielage (2001), "A router architecture for networks on silicon"
  - Kumar et al. (2002), "A Network on chip architecture and design methodology"
  - De Micheli and Benini (2002), "Networks on chip: A new paradigm for systems on chip design"
Don't we already know how to design interconnection networks?
• Many network topologies, router designs, and much theory have already been developed for high-end supercomputers and telecom switches
• Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs!
Critical problems addressed by NoC
1) Global interconnect design problem: delay, power, noise, scalability, reliability
2) System integration productivity problem
3) Chip Multi Processors (key to power-efficient computing)
1(a): NoC and global wire delay
• Long wire delay is dominated by resistance
• Add repeaters
• Repeaters become latches (with clock frequency scaling)
• Latches evolve to NoC routers
1(b): Wire design for NoC
• NoC links:
  - Regular
  - Point-to-point (no fanout tree)
  - Can use transmission-line layout
  - Well-defined current return path
• Can be optimized for noise / speed / power
  - Low swing, current mode, ...
1(c): NoC scalability
• For the same performance, compare the wire area and power:
  - NoC: O(n)
  - Simple bus: O(n^3 √n), O(n√n)
  - Point-to-point and segmented bus: O(n^2 √n), O(n√n), O(n√n)
1(d): NoC and communication reliability
• Fault tolerance & error correction
[Figure: routers connected through UMODEM link interfaces over the interconnect; UMODEM blocks: input buffer, error correction, synchronization, ISI reduction, parallel-to-serial converter, modulation]
A. Morgenshtein, E. Bolotin, I. Cidon, A. Kolodny, R. Ginosar, "Micro-modem - reliability solution for NOC communications", ICECS 2004
1(e): NoC and GALS
• Modules in a NoC system use different clocks
  - May use different voltages
• NoC can take care of synchronization
• NoC design may be asynchronous
  - No waste of power when the links and routers are idle
2: NoC and engineering productivity
• NoC eliminates ad-hoc global wire engineering
• NoC separates computation from communication
  - NoC supports modularity and reuse of cores
• NoC is a platform for system integration, debugging and testing
3: NoC and CMP
[Figure: uniprocessor dynamic power split into interconnect, gate, and diff. components (Magen et al., SLIP 200…)]
[Figure: uniprocessor performance vs. die area (or power)]
3: NoC and CMP
• Uniprocessors cannot provide power-efficient performance growth
  - Interconnect dominates dynamic power
  - Global wire delay doesn't scale
  - Instruction-level parallelism is limited
• Power efficiency requires many parallel local computations
  - Chip Multi Processors (CMP)
  - Thread-Level Parallelism (TLP)
• Network is a natural choice for CMP!
Why is now the time for NoC?
• Difficulty of DSM wire design
• Productivity pressure
• CMPs
Traffic abstractions
• Traffic models are generally captured from actual traces of functional simulation
• A statistical distribution is often assumed for messages
[Figure: traffic graph among processing elements PE1 to PE12]
Data abstractions
Layers of abstraction in network modeling
• Software layers
  - Application, OS
• Network & transport layers
  - Network topology: e.g. crossbar, ring, mesh, torus, fat tree, ...
  - Switching: circuit / packet switching (SAF, VCT), wormhole
  - Addressing: logical/physical, source/destination, flow, transaction
  - Routing: static/dynamic, distributed/source, deadlock avoidance
  - Quality of Service: e.g. guaranteed-throughput, best-effort
  - Congestion control, end-to-end flow control
• Data link layer
  - Flow control (handshake)
  - Handling of contention
  - Correction of transmission errors
• Physical layer
  - Wires, drivers, receivers, repeaters, signaling, circuits, ...
How to select an architecture?
• Architecture choices depend on system needs.
• A large range of solutions!
[Figure: reconfiguration rate (at design time, at boot time, during run time) vs. flexibility (single application to general purpose or embedded systems); points from low to high: ASIC, FPGA, ASSP, CMP/multicore]
Example: OASIS
• ASIC assumed
  - Traffic requirements are known a priori
• Features
  - Packet switching: wormhole
  - Quality of service
  - Mesh topology
K. Mori, A. Ben Abdallah, and K. Kuroda, "Design and Evaluation of a Complexity Effective Network-on-Chip Architecture on FPGA", The 19th Intelligent System Symposium (FAN 2009), pp. 318-321, Sep. 2009.
S. Miura, A. Ben Abdallah, and K. Kuroda, "PNoC - Design and Preliminary Evaluation of a Parameterizable NoC for MCSoC Generation and Design Space Exploration", The 19th Intelligent System Symposium (FAN 2009), pp. 314-317, Sep. 2009.
Perspective 1: NoC vs. Bus
• NoC
  - Aggregate bandwidth grows
  - Link speed unaffected by N
  - Concurrent spatial reuse
  - Pipelining is built-in
  - Distributed arbitration
  - Separate abstraction layers
  - However: no performance guarantee, extra delay in routers, area and power overhead?, modules need an NI, unfamiliar methodology
• Bus
  - Bandwidth is limited and shared
  - Speed goes down as N grows
  - No concurrency
  - Pipelining is tough
  - Central arbitration
  - No layers of abstraction (communication and computation are coupled)
  - However: fairly simple and familiar
Perspective 2: NoC vs. Off-chip Networks
• NoC
  - Sensitive to cost: area, power
  - Wires are relatively cheap
  - Latency is critical
  - Traffic may be known a priori
  - Design-time specialization
  - Custom NoCs are possible
• Off-chip networks
  - Cost is in the links
  - Latency is tolerable
  - Traffic/applications unknown
  - Changes at runtime
  - Adherence to networking standards
VLSI CAD problems
• Application mapping
• Floorplanning / placement
• Routing
• Buffer sizing
• Timing closure
• Simulation
• Testing
VLSI CAD problems in NoC
• Application mapping (map tasks to cores)
• Floorplanning / placement (within the network)
• Routing (of messages)
• Buffer sizing (size of FIFO queues in the routers)
• Timing closure (link bandwidth capacity allocation)
• Simulation (network simulation, traffic/delay/power modeling)
• Other NoC design problems (topology synthesis, switching, virtual channels, arbitration, flow control, ...)
Typical NoC design flow
• Place modules
• Determine routing and adjust link capacities
Timing closure in NoC
• Flow: define inter-module traffic -> place modules -> increase link capacities -> QoS satisfied? If no, increase link capacities again; if yes, finish
  - Too low a capacity results in poor QoS
  - Too high a capacity wastes area
  - Uniform link capacities are a waste in an ASIP system
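The capacity-allocation loop on this slide could be sketched roughly as follows. This is only an illustration: the QoS test used here (link utilization below a threshold), the growth factor `step`, and all names are assumptions of the sketch, not part of the slide.

```python
def close_timing(link_load, capacity, max_utilization=0.5, step=1.25, max_iters=100):
    """Raise the capacity of overloaded links until every link passes the QoS proxy.

    link_load: dict link -> offered traffic, fixed once modules are placed.
    capacity:  dict link -> initial capacity, in the same unit as link_load.
    The QoS proxy (load <= max_utilization * capacity) is a hypothetical stand-in
    for a real delay or throughput requirement.
    """
    capacity = dict(capacity)  # work on a copy, leave the caller's dict untouched
    for _ in range(max_iters):
        violating = [link for link, load in link_load.items()
                     if load > max_utilization * capacity[link]]
        if not violating:
            return capacity  # QoS satisfied: finish
        for link in violating:
            capacity[link] *= step  # increase only the links that need it
    raise RuntimeError("QoS not satisfied within max_iters")
```

Because only the violating links grow, the result is non-uniform: lightly loaded links keep their small capacity, which is the point the slide makes about uniform capacities wasting area.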
Network delay modeling
• Analysis of mean packet delay in a wormhole network
  - Multiple virtual channels
  - Different link capacities
  - Different communication demands
NoC design requirements
• High-performance interconnect
  - High throughput, latency, power, area
• Complex functionality (performance again)
  - Support for virtual channels
  - QoS
• Synchronization
  - Reliability, high throughput, low latency
ISO/OSI network protocol stack model
Part II
• NoC topologies
• Switching strategies
• Routing algorithms
• Flow control schemes
• Clocking schemes
• QoS
• Basic building blocks
• Status and open problems
NoC Topology
• The connection map between PEs
• Adopted from large-scale networks and parallel computing
• Topology classifications:
  - Direct topologies
  - Indirect topologies
Direct topologies
• Each switch (SW) is connected to a single PE
• As the number of nodes in the system increases, the total bandwidth also increases
[Figure: PEs each attached to their own SW; one PE is connected to only a single SW]
Direct topologies: Mesh
• 2D mesh is most popular
  - All links have the same length, which eases physical design
  - Area grows linearly with the number of nodes
[Figure: 4 x 4 mesh]
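As a small illustration of the mesh properties listed on this slide, the sketch below enumerates the links of a 2D mesh and computes the minimal hop count between two nodes. The (row, column) node naming and the function names are assumptions made for the example.

```python
def mesh_links(rows, cols):
    """Return the set of bidirectional links of a rows x cols 2D mesh."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                links.add(((r, c), (r, c + 1)))  # link to the east neighbour
            if r + 1 < rows:
                links.add(((r, c), (r + 1, c)))  # link to the south neighbour
    return links


def mesh_hops(src, dst):
    """Minimal hop count between two mesh nodes (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])
```

Every link joins two adjacent tiles, so all links have the same unit length, and a 4 x 4 mesh has 24 links with a worst-case distance of 6 hops between opposite corners.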
Direct topologies: Torus and Folded Torus
• Torus
  - Similar to a regular mesh
  - Excessive delay problem due to long end-around connections
• Folded torus
  - Overcomes the long-link limitation of a 2-D torus
  - Links have the same size
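The effect of the end-around links and of folding can be checked with a short sketch. It assumes one common folding scheme (nodes are laid out going outward on even slots and returning on odd slots); the slot function and all names are illustrative, not taken from the slides.

```python
def ring_hops(a, b, k):
    """Shortest hop count between positions a and b on a k-node ring."""
    d = abs(a - b)
    return min(d, k - d)


def torus_hops(src, dst, rows, cols):
    """Minimal hop count in a rows x cols torus; nodes are (row, col)."""
    return ring_hops(src[0], dst[0], rows) + ring_hops(src[1], dst[1], cols)


def folded_slot(i, k):
    """Physical slot of logical ring node i after folding a k-node ring."""
    half = (k + 1) // 2
    return 2 * i if i < half else 2 * (k - 1 - i) + 1


def longest_link(k, slot):
    """Longest physical link, in slot units, over all ring neighbours."""
    return max(abs(slot(i) - slot((i + 1) % k)) for i in range(k))
```

For an 8-node row, the plain layout has an end-around link spanning 7 slots, while the folded layout under this scheme has no link longer than 2 slots. The hop count is unchanged by folding: corner to corner in a 4 x 4 torus takes 2 hops instead of the 6 of a mesh.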
Direct topologies: Octagon topology
• Messages sent between any 2 nodes require at most two hops
• More octagons can be tiled together to accommodate larger designs
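The two-hop bound can be verified by brute force. The sketch assumes the usual octagon wiring, in which each of the 8 nodes is linked to its two ring neighbours and to the node directly across; node numbering 0 to 7 and the function names are illustrative.

```python
from collections import deque


def octagon_neighbors(i):
    """Assumed wiring: ring neighbours plus the node directly across."""
    return [(i + 1) % 8, (i - 1) % 8, (i + 4) % 8]


def hops(src, dst):
    """Breadth-first search hop count between two octagon nodes."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in octagon_neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("unreachable")


worst_case = max(hops(s, d) for s in range(8) for d in range(8))
```

With this wiring `worst_case` evaluates to 2, in line with the claim on the slide.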
Indirect topologies
• A set of PEs is connected to a switch (router)
• Fat tree topology
  - Nodes are connected only to the leaves of the tree
  - More links near the root, where bandwidth requirements are higher
Indirect topologies: k-ary n-fly butterfly network
• Blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs
• Example: 2-ary 3-fly butterfly network
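A short sketch of how a k-ary n-fly is sized and routed may help here. It uses destination-tag routing, the standard scheme for butterflies, in which the output port taken at each stage is one radix-k digit of the destination address; the function names are illustrative.

```python
def butterfly_size(k, n):
    """Terminal, stage, and switch counts of a k-ary n-fly."""
    return {"terminals": k ** n, "stages": n, "switches_per_stage": k ** (n - 1)}


def destination_tag_route(dst, k, n):
    """Output port used at each of the n stages: the radix-k digits of dst,
    most significant digit first."""
    digits = []
    for _ in range(n):
        digits.append(dst % k)
        dst //= k
    return digits[::-1]
```

For the 2-ary 3-fly example this gives 8 terminals and 3 stages of 4 switches, and a packet for terminal 5 (binary 101) takes ports 1, 0, 1. The route depends only on the destination, so each source/destination pair has a single path, which is why contention on a shared link blocks packets.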
Indirect topologies: (m, n, r) symmetric Clos network
• 3-stage network in which each stage is made up of a number of crossbar switches
  - m: number of middle-stage switches
  - n: number of input/output nodes on each input/output switch
  - r: number of input and output switches
• Example: (3, 3, 4) Clos network
• Non-blocking network
• Expensive (several full crossbars)
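The cost and blocking behaviour of an (m, n, r) Clos network can be explored with a small calculation. The sketch counts crosspoints assuming every switch is a full crossbar, and applies the classic Clos conditions (strict-sense non-blocking when m >= 2n - 1, rearrangeably non-blocking when m >= n); the function names are illustrative.

```python
def clos_crosspoints(m, n, r):
    """Crosspoints of an (m, n, r) Clos network built from full crossbars:
    r input switches (n x m), m middle switches (r x r), r output switches (m x n)."""
    return 2 * r * n * m + m * r * r


def clos_properties(m, n, r):
    """Size, cost, and blocking class of an (m, n, r) Clos network."""
    return {
        "terminals": n * r,
        "crosspoints": clos_crosspoints(m, n, r),
        "single_crossbar_crosspoints": (n * r) ** 2,
        "strictly_nonblocking": m >= 2 * n - 1,
        "rearrangeable": m >= n,
    }
```

For the (3, 3, 4) example this gives 12 terminals and 120 crosspoints, against 144 for one 12 x 12 crossbar. With m = n = 3 the network meets the rearrangeable condition; the strict-sense condition would need m >= 5.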
Indirect topologies: Benes network
• Rearrangeable network in which paths may have to be rearranged to provide a connection, requiring an appropriate controller
• Clos topology composed of 2 x 2 switches
• Example: (2, 2, 4) rearrangeable Clos network constructed using two (2, 2, 2) Clos networks with 4 x 4 middle switches
Irregular Topologies: Customized
• Customized for an application
• Usually a mix of shared bus, direct, and indirect network topologies
[Figure, example 1: reduced mesh]
[Figure, example 2: cluster-based hybrid topology]
Example 1: Partially irregular 2D mesh topology
• Contains oversized, rectangularly shaped PEs
Example 2: Irregular Mesh
• This kind of chip does not limit the shape of the PEs or the placement of the routers. It may be considered a "custom" NoC
How to Select a Topology?
• The application decides the topology type
• If PEs = a few tens: star and mesh topologies are recommended
• If PEs = 100 or more: hierarchical star and mesh are recommended
• Some topologies are better for certain designs than others
• Most of the time, when one topology is better in performance, it is worse in …
NoC Switching Strategies
• Switching determines how flits and packets flow through routers in the network
• There are two basic modes:
  - Circuit switching
  - Packet switching
Circuit Switching
• Network resources (channels) are reserved before a packet is sent
• The entire path must be reserved first
• The packets do not contain routing information, but rather data and information about the data
• Circuit-switched networks require no overhead for packetisation, packet header processing or packet buffering
Circuit Switching
[Figure: time-space diagram across routers R1, R2, R3; the header travels forward and the ACK returns during the setup time (routing + switching delay, router delay), then data is sent during the transfer time]
Circuit Switching
• Once the circuit is set up, router latency and control overheads are very low
• Very poor use of channel bandwidth if lots of short packets must be sent to many different destinations
  - More commonly seen in embedded SoC applications where traffic patterns may be static and involve streaming large amounts of data between different IP blocks
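Why short packets use a circuit poorly can be seen from a rough timing model. The sketch below is only an approximation under stated assumptions: setup costs one header trip plus one ACK trip, each paying a fixed per-hop router delay, and data then streams at a fixed rate with no further per-hop cost. All names and the cycle-based units are illustrative.

```python
def circuit_times(hops, router_delay, payload_flits, flits_per_cycle=1.0):
    """Return (setup_time, transfer_time) in cycles for one circuit.

    Assumes setup = header forward + ACK back, each costing
    hops * router_delay, and a contention-free transfer.
    """
    setup = 2 * hops * router_delay
    transfer = payload_flits / flits_per_cycle
    return setup, transfer


def channel_efficiency(hops, router_delay, payload_flits, flits_per_cycle=1.0):
    """Fraction of the circuit's lifetime spent moving payload."""
    setup, transfer = circuit_times(hops, router_delay, payload_flits, flits_per_cycle)
    return transfer / (setup + transfer)
```

With 3 hops and a 2-cycle router delay, a 4-flit message keeps the reserved channels busy only 25% of the time, while a 1000-flit stream exceeds 98%, which is consistent with circuit switching suiting long streaming transfers.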
Packet Switching
• We can aim to make better use of channel resources by buffering packets. We then arbitrate for access to network resources dynamically.
• We distinguish between different approaches by the granularity at which we reserve resources (e.g. channels and buffers) and the conditions that must be met for a packet to advance to the next node
Packet Switching (L: packet length)
• Packet-buffer flow control
  - Store-and-forward (SaF): advance when the entire packet is buffered and L free flit buffers are available at the next node
  - Cut-through: advance when L free flit buffers are available at the next node
• Flit-buffer flow control
  - Wormhole: can advance when at least one flit buffer is available
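The three advance conditions on this slide can be written as a single predicate. This is a sketch only: the mode strings, parameter names, and the requirement that at least the head flit has arrived for cut-through and wormhole are assumptions made for the example.

```python
def can_advance(mode, packet_len, flits_buffered_here, free_flit_buffers_next):
    """Decide whether a packet (or its head flit) may move to the next node.

    packet_len is L, the packet length in flits.
    """
    if mode == "store_and_forward":
        # Whole packet received here AND room for the whole packet downstream.
        return (flits_buffered_here == packet_len
                and free_flit_buffers_next >= packet_len)
    if mode == "cut_through":
        # Head may leave at once, but downstream must be able to hold L flits.
        return flits_buffered_here >= 1 and free_flit_buffers_next >= packet_len
    if mode == "wormhole":
        # A single free flit buffer downstream is enough.
        return flits_buffered_here >= 1 and free_flit_buffers_next >= 1
    raise ValueError("unknown switching mode: " + mode)
```

For a 4-flit packet whose head flit has just arrived, with 4 free buffers downstream, cut-through and wormhole may proceed while store-and-forward must wait for the remaining 3 flits; with only 1 free buffer downstream, only wormhole may proceed.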
Packet Switching: Store and Forward (SAF)
• A packet is sent from one router to the next only if the receiving router has buffer space for the entire packet
• Buffer size in the router is at least equal to the size of a packet
[Figure: store-and-forward switching; packets made of a header flit and data flits are buffered whole and forwarded packet by packet from switch to switch]
Packet Switching: Wormhole (WH)
• A flit is forwarded to a router if space exists for that flit
• Parts of the packet can be distributed among two or more routers
• Buffer requirements are reduced to one flit, instead of an entire packet
[Figure: WH switching technique; the header flit and data flits are forwarded flit by flit from switch to switch]
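The latency difference between store-and-forward and wormhole switching can be illustrated with the usual contention-free (zero-load) model. The formulas below are the standard textbook approximations, not figures from the slides; parameter names and cycle-based units are illustrative.

```python
def saf_latency(hops, router_delay, packet_flits, flits_per_cycle=1.0):
    """Store-and-forward: the whole packet is serialized again at every hop."""
    return hops * (router_delay + packet_flits / flits_per_cycle)


def wormhole_latency(hops, router_delay, packet_flits, flits_per_cycle=1.0):
    """Wormhole: the head flit pays the per-hop delay, the body is pipelined
    behind it, so serialization is paid only once."""
    return hops * router_delay + packet_flits / flits_per_cycle


def min_buffer_flits(mode, packet_flits):
    """Smallest per-router buffer, in flits, that each scheme needs."""
    return packet_flits if mode == "store_and_forward" else 1
```

For an 8-flit packet crossing 3 hops with a 2-cycle router delay, this model gives 30 cycles for store-and-forward and 14 cycles for wormhole, with a minimum buffer of 8 flits versus 1 flit per router.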
Packet Switching: Virtual Channel (VC)
• Improves the performance of WH routing by preventing a single packet from blocking a free channel
  - e.g. if the green packet is blocked, the red packet may still make progress through the network
  - Flits from different packets can be interleaved over the same channel