18-742 Fall 2012 Parallel Computer Architecture Lecture 21: Interconnects IV

18-742 Fall 2012 Parallel Computer Architecture Lecture 21: Interconnects IV. Prof. Onur Mutlu, Carnegie Mellon University, 10/29/2012

New Review Assignments
Were due: Sunday, October 28, 11:59 pm: Das et al., "Aergia: Exploiting Packet Latency Slack in On-Chip Networks," ISCA 2010.
Due: Tuesday, October 30, 11:59 pm: Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974; Arvind and Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE TC 1990.
Due: Thursday, November 1, 11:59 pm: Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985; Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985.

Other Readings
Dataflow: Gurd et al., "The Manchester prototype dataflow computer," CACM 1985; Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994.
Restricted Dataflow: Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985; Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985; Sankaralingam et al., "Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture," ISCA 2003; Burger et al., "Scaling to the End of Silicon with EDGE Architectures," IEEE Computer 2004.

Project Milestone I Meetings. Please come to office hours for feedback on your progress and your presentation.

Last Lectures: Transactional Memory (brief); Interconnect wrap-up; Project Milestone I presentations

Today: More on Interconnects Research; Start Dataflow

Research in Interconnects

Research Topics in Interconnects. Plenty of topics in interconnection networks. Examples:
Energy/power-efficient and proportional design
Reducing complexity: simplified router and protocol designs
Adaptivity: ability to adapt to different access patterns
QoS and performance isolation: reducing and controlling interference, admission control
Co-design of NoCs with other shared resources: end-to-end performance, QoS, power/energy optimization
Scalable topologies for many cores and heterogeneous systems
Fault tolerance
Request prioritization, priority inversion, coherence, ...
New technologies (optical, 3D)

Packet Scheduling. Which packet to choose for a given output port? The router needs to prioritize between competing flits: which input port? which virtual channel? which application's packet? Common strategies: round robin across virtual channels; oldest packet first (or an approximation); prioritize some virtual channels over others. Better policies in a multi-core environment use application characteristics.
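
As a rough illustration of these strategies (my own sketch, not from the slides or any specific router), here is a minimal Python model of an output-port arbiter choosing among the head flits waiting in several virtual channels; the flit fields and the two policy functions are hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class HeadFlit:
    app_id: int          # application that injected the packet
    inject_cycle: int    # cycle the packet entered the network
    vc_id: int           # virtual channel holding the flit

def oldest_first(candidates):
    """Pick the flit whose packet has waited longest (age-based policy)."""
    return min(candidates, key=lambda f: f.inject_cycle)

def round_robin(candidates, last_vc):
    """Pick the next occupied virtual channel after the one served last."""
    vcs = sorted(f.vc_id for f in candidates)
    next_vc = next((v for v in vcs if v > last_vc), vcs[0])
    return next(f for f in candidates if f.vc_id == next_vc)

if __name__ == "__main__":
    flits = [HeadFlit(app_id=1, inject_cycle=40, vc_id=0),
             HeadFlit(app_id=2, inject_cycle=10, vc_id=1),
             HeadFlit(app_id=1, inject_cycle=25, vc_id=2)]
    print(oldest_first(flits))            # the cycle-10 packet wins
    print(round_robin(flits, last_vc=1))  # VC 2 is served next
```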

Application-Aware Packet Scheduling. Das et al., "Application-Aware Prioritization Mechanisms for On-Chip Networks," MICRO 2009.

The Problem: Packet Scheduling. [Figure: applications App 1 through App N on processors (P) communicate over the Network-on-Chip with L2 cache banks, memory controllers, and an accelerator.] The Network-on-Chip is a critical resource shared by multiple applications.

The Problem: Packet Scheduling. [Figure: a mesh of routers (R) connecting processing elements (PE: cores, L2 banks, memory controllers, etc.); router microarchitecture with buffered input ports (virtual channels VC 0-2 for the East, West, North, South, and PE inputs), control logic (routing unit RC, VC allocator VA, switch allocator SA), and a 5x5 crossbar driving the East, West, North, South, and PE outputs.]

The Problem: Packet Scheduling. [Figure: close-up of the router input ports (from East, West, North, South, and PE), each with virtual channels VC 0-2, feeding the routing unit (RC), VC allocator (VA), and switch allocator (SA).]

The Problem: Packet Scheduling. [Figure: conceptual view of the same router; the virtual channels across the East, West, North, South, and PE input ports hold packets from different applications (App 1 through App 8).]

The Problem: Packet Scheduling. [Figure: the router's scheduler must pick among the buffered packets of App 1 through App 8 across all input virtual channels.] Which packet to choose?

The Problem: Packet Scheduling. Existing scheduling policies: round robin, age. Problem 1: they are local to a router, which leads to contradictory decision making between routers: packets from one application may be prioritized at one router, only to be delayed at the next. Problem 2: they are application-oblivious: they treat all applications' packets equally, but applications are heterogeneous. Solution: application-aware global scheduling policies.

Motivation: Stall Time Criticality. Applications are not homogeneous: they have different criticality with respect to the network. Some applications are network-latency sensitive; some are network-latency tolerant. An application's stall time criticality (STC) can be measured by its average network stall time per packet (i.e., NST/packet), where network stall time (NST) is the number of cycles the processor stalls waiting for network transactions to complete.
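
A back-of-the-envelope reading of the NST/packet metric (my own sketch, not the paper's measurement infrastructure): accumulate the cycles a core stalls on network transactions and divide by the packets it injected.

```python
def stall_time_criticality(network_stall_cycles, packets_injected):
    """STC estimate: average network stall time per packet (NST/packet)."""
    if packets_injected == 0:
        return 0.0
    return network_stall_cycles / packets_injected

# Hypothetical numbers: a latency-sensitive app stalls 50,000 cycles over 1,000 packets
# (STC = 50), while a latency-tolerant app stalls 5,000 cycles over 1,000 packets (STC = 5).
print(stall_time_criticality(50_000, 1_000), stall_time_criticality(5_000, 1_000))
```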

Motivation: Stall Time Criticality. Why do applications have different network stall time criticality (STC)? Memory-level parallelism (MLP): lower MLP leads to higher STC. Shortest-job-first principle (SJF): lower network load leads to higher STC. Average memory access time: higher memory access time leads to higher STC.

STC Principle 1 (MLP). [Figure: execution timeline of an application with high MLP; the latencies of its outstanding packets overlap, so the stall caused by the highlighted packet is 0.] Observation 1: packet latency != network stall time.

STC Principle 1 (MLP). [Figure: execution timelines of an application with high MLP, whose overlapping packet latencies yield a stall of 0 for the highlighted packet, and an application with low MLP, whose packet latencies are exposed as stalls.] Observation 1: packet latency != network stall time. Observation 2: a low-MLP application's packets have higher criticality than a high-MLP application's.

STC Principle 2 (Shortest-Job-First). [Figure: timelines of a heavy application and a light application running alone and sharing the network.] Under baseline round-robin (RR) scheduling, the light application suffers a 4X network slowdown and the heavy application 1.3X; under SJF scheduling, the light application's slowdown drops to 1.2X while the heavy application's grows to 1.6X. Overall system throughput (weighted speedup) increases by 34%.

Solution: Application-Aware Policies. Idea: identify stall-time-critical applications (i.e., network-sensitive applications) and prioritize their packets in each router. Key components of the scheduling policy: application ranking and packet batching. The proposed solution has low hardware complexity.

Component 1: Ranking. Ranking distinguishes applications based on stall time criticality (STC): periodically rank applications by STC. Many heuristics for quantifying STC were explored (details and analysis in the paper); a heuristic based on the outermost private cache's misses per instruction (L1-MPI) is the most effective: low L1-MPI => high STC => higher rank. Why misses per instruction (L1-MPI)? It is easy to compute (low complexity) and a stable metric (unaffected by interference in the network).
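
A minimal sketch of this ranking heuristic as I read it from the slide (the counter names and tie-breaking order are my assumptions): compute each application's L1-MPI over an interval and rank applications with lower L1-MPI higher.

```python
def l1_mpi(l1_misses, instructions_retired):
    """Misses per instruction in the outermost private cache (L1 here)."""
    return l1_misses / max(instructions_retired, 1)

def rank_applications(per_app_counters):
    """Return app ids ordered from highest rank (lowest L1-MPI) to lowest.

    per_app_counters: {app_id: (l1_misses, instructions_retired)} for one interval.
    """
    mpi = {app: l1_mpi(m, i) for app, (m, i) in per_app_counters.items()}
    # Lower L1-MPI => higher stall time criticality => higher rank.
    return sorted(mpi, key=lambda app: mpi[app])

counters = {0: (2_000, 1_000_000),    # light, latency-sensitive application
            1: (40_000, 1_000_000),   # heavy application
            2: (9_000, 1_000_000)}
print(rank_applications(counters))    # [0, 2, 1]
```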

Component 1: How to Rank? Execution time is divided into fixed "ranking intervals"; the ranking interval is 350,000 cycles. At the end of an interval, each core calculates its L1-MPI and sends it to the Central Decision Logic (CDL), located in the central node of the mesh. The CDL forms a ranking order and sends each core back its rank: two control packets per core every ranking interval. The ranking order is a "partial order." Rank formation is not on the critical path: the ranking interval is significantly longer than the rank computation time, and cores use the older rank values until the new ranking is available.

Component 2: Batching. Problem: starvation; prioritizing a higher-ranked application can lead to starvation of lower-ranked applications. Solution: packet batching; network packets are grouped into finite-sized batches, and packets of older batches are prioritized over younger batches. Alternative batching policies are explored in the paper. Time-based batching: new batches are formed in a periodic, synchronous manner across all nodes in the network, every T cycles.
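
A sketch of time-based batching under stated assumptions (the interval value, the 3-bit wrap-around handling, and the function names are mine): every node derives the current batch ID from the global cycle count, so batches form synchronously without extra communication.

```python
BATCHING_INTERVAL = 16_000   # "T" cycles; an assumed value, not from the slides
BATCH_ID_BITS = 3            # batch IDs wrap around (matching the 3-bit tag)

def batch_id(cycle):
    """Batch that a packet injected at `cycle` belongs to."""
    return (cycle // BATCHING_INTERVAL) % (1 << BATCH_ID_BITS)

def older_batch_first(batch_a, batch_b, current_batch):
    """True if batch_a should be prioritized over batch_b, accounting for wrap-around."""
    age_a = (current_batch - batch_a) % (1 << BATCH_ID_BITS)
    age_b = (current_batch - batch_b) % (1 << BATCH_ID_BITS)
    return age_a > age_b

now = 100_000
print(batch_id(now), older_batch_first(batch_id(20_000), batch_id(90_000), batch_id(now)))
```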

Putting It All Together. Before a packet is injected into the network, it is tagged with a Batch ID (3 bits) and a Rank ID (3 bits). Three-tier priority structure at routers: oldest batch first (prevents starvation); highest rank first (maximizes performance); local round robin (final tie breaker). Simple hardware support: priority arbiters. Globally coordinated scheduling: the ranking order and batching order are the same across all routers.
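
The three tiers can be expressed as one sort key (a sketch under the assumption that smaller numbers mean older batches and higher ranks; the slides give only the priority order, not an encoding):

```python
from dataclasses import dataclass

@dataclass
class TaggedPacket:
    batch_id: int   # 3-bit tag; assume smaller = older (wrap-around ignored for brevity)
    rank_id: int    # 3-bit tag; assume 0 = highest-ranked (most stall-time-critical) app
    rr_slot: int    # local round-robin position, the final tie breaker

def stc_priority_key(pkt):
    """Oldest batch first, then highest rank, then local round robin."""
    return (pkt.batch_id, pkt.rank_id, pkt.rr_slot)

contenders = [TaggedPacket(batch_id=2, rank_id=0, rr_slot=1),
              TaggedPacket(batch_id=1, rank_id=5, rr_slot=0),
              TaggedPacket(batch_id=1, rank_id=2, rr_slot=3)]
winner = min(contenders, key=stc_priority_key)
print(winner)   # batch 1 beats batch 2; within batch 1, rank 2 beats rank 5
```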

STC Scheduling Example. [Figure: packet injection order at the processors of Core 1, Core 2, and Core 3 over injection cycles 1-8; with a batching interval length of 3 cycles, the packets fall into Batch 0, Batch 1, and Batch 2; the ranking order of the cores is also shown.]

STC Scheduling Example. [Figure: the same packets, labeled with their injection cycles and batch IDs, arrive at a router's scheduler from the three applications.]

STC Scheduling Example. [Figure: the order in which a round-robin scheduler drains the packets over time.] Resulting stall cycles under round robin: 8, 6, and 11 for the three applications (average 8.3).

STC Scheduling Example. [Figure: the order in which an age-based (oldest-first) scheduler drains the same packets.] Stall cycles: round robin 8, 6, 11 (average 8.3); age 4, 6, 11 (average 7.0).

STC Scheduling Example. [Figure: the order in which the STC scheduler drains the same packets, following the ranking order.] Stall cycles: round robin 8, 6, 11 (average 8.3); age 4, 6, 11 (average 7.0); STC 1, 3, 11 (average 5.0).

Qualitative Comparison. Round robin and age: local and application-oblivious; age is biased towards heavy applications, since they flood the network and an older packet is therefore more likely to belong to a heavy application. Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]: provides bandwidth fairness at the expense of system performance and penalizes heavy and bursty applications; each application gets an equal, fixed quota of flits (credits) in each batch, so a heavy application quickly runs out of credits after injecting into all active batches and stalls until the oldest batch completes and frees up fresh credits.

System Performance. STC provides a 9.1% improvement in weighted speedup over the best existing policy (averaged across 96 workloads); detailed case studies are in the paper. [Charts: normalized system speedup and network unfairness for LocalRR, LocalAge, GSF, and STC.]

Slack-Driven Packet Scheduling. Das et al., "Aergia: Exploiting Packet Latency Slack in On-Chip Networks," ISCA 2010.

Packet Scheduling in NoC. Existing scheduling policies: round robin, age. Problem: they are application-oblivious and treat all packets equally, but all packets are not the same! Packets have different criticality: a packet is critical if its latency affects the application's performance. Criticality differs due to memory-level parallelism (MLP).

MLP Principle. [Figure: execution timeline in which the latencies of outstanding packets overlap, so the stall caused by one packet is 0 while another packet's latency is exposed.] Packet latency != network stall time. Different packets have different criticality due to MLP: the packet whose latency is exposed is more critical than the packet whose latency is hidden.

Outline: Introduction; Packet Scheduling; Memory-Level Parallelism; Aergia; Concept of Slack; Estimating Slack; Evaluation; Conclusion

What is Aergia? Aergia is the spirit of laziness in Greek mythology. Some packets can afford to slack!

Outline: Introduction; Packet Scheduling; Memory-Level Parallelism; Aergia; Concept of Slack; Estimating Slack; Evaluation; Conclusion

Slack of Packets. What is the slack of a packet? The slack of a packet is the number of cycles it can be delayed in a router without (significantly) reducing the application's performance (local network slack). Source of slack: memory-level parallelism (MLP); the latency of an application's packet is hidden from the application due to overlap with the latencies of pending cache miss requests. Idea: prioritize packets with lower slack.

Concept of Slack. [Figure: two load misses issue from the instruction window and traverse the network-on-chip; the younger packet returns earlier than necessary because the older miss is still outstanding and causes the stall.] Slack(packet) = Latency(longest-latency predecessor) - Latency(packet) = 26 - 6 = 20 hops. The packet can be delayed for its available slack cycles without reducing performance!
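
The slide's arithmetic, written out as a tiny helper (my own framing; latencies here are in hops, as in the example): slack is how much longer the oldest overlapping miss will take than this packet.

```python
def slack(packet_latency, predecessor_latencies):
    """Slack = max latency among overlapping (predecessor) packets - this packet's latency."""
    if not predecessor_latencies:
        return 0
    return max(0, max(predecessor_latencies) - packet_latency)   # clamp at zero

# The slide's example: a 6-hop packet issued while a 26-hop miss is outstanding.
print(slack(packet_latency=6, predecessor_latencies=[26]))   # 20 hops of slack
print(slack(packet_latency=26, predecessor_latencies=[]))    # the older miss itself has no slack
```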

Prioritizing using Slack. [Figure: a load miss at Core A produces packets with latencies of 13 and 3 hops, so the 3-hop packet has 10 hops of slack; a load miss at Core B produces packets with latencies of 10 and 4 hops, so the 4-hop packet has 6 hops of slack; the two short packets interfere at a router 3 hops from their sources.] Slack(Core A's packet) > Slack(Core B's packet) => prioritize Core B's packet.

Slack in Applications. [Chart: cumulative percentage of all packets vs. slack in cycles for Gems.] 50% of Gems' packets have 350+ slack cycles (non-critical), while 10% of its packets have fewer than 50 slack cycles (critical).

Slack in Applications. [Chart: cumulative percentage of all packets vs. slack in cycles for Gems and art.] 68% of art's packets have zero slack cycles.

Diversity in Slack. [Chart: cumulative percentage of all packets vs. slack in cycles for Gems, omnet, tpcw, mcf, bzip2, sjbb, sap, sphinx, deal, barnes, astar, calculix, art, libquantum, sjeng, and h264ref.]

Diversity in Slack. [Chart: the same slack distributions.] Slack varies between packets of different applications, and slack varies between packets of a single application.

Outline: Introduction; Packet Scheduling; Memory-Level Parallelism; Aergia; Concept of Slack; Estimating Slack; Evaluation; Conclusion

Estimating Slack Priority. Slack(P) = max(latencies of P's predecessors) - latency of P, where the predecessors of P are the packets of the cache miss requests outstanding when P is issued. Packet latencies are not known at issue time, so the latency of any packet Q is predicted: higher latency if Q corresponds to an L2 miss, and higher latency if Q has to travel a larger number of hops.

Estimating Slack Priority. Slack of P = maximum predecessor latency - latency of P. Slack(P) is encoded as PredL2 (2 bits), MyL2 (1 bit), HopEstimate (2 bits). PredL2: set if any predecessor packet is servicing an L2 miss. MyL2: set if P is NOT servicing an L2 miss. HopEstimate: max(# of hops of predecessors) - hops of P.
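
A sketch of how these fields could be packed into a small slack tag (bit widths from the slide; the field ordering is my assumption, and the slide describes PredL2 as a set/clear condition while allotting it 2 bits, so treating it as a saturating count of L2-miss predecessors is also an assumption). Larger tag value means more estimated slack and therefore lower scheduling priority.

```python
def slack_tag(pred_l2_misses, my_l2_miss, pred_max_hops, my_hops):
    """Pack a slack estimate as PredL2 (2 bits) | MyL2 (1 bit) | HopEstimate (2 bits)."""
    pred_l2 = min(pred_l2_misses, 3)                 # assumed: saturating count of L2-miss predecessors
    my_l2 = 0 if my_l2_miss else 1                   # set if P is NOT servicing an L2 miss
    hop_estimate = max(0, min(pred_max_hops - my_hops, 3))   # saturate to 2 bits
    return (pred_l2 << 3) | (my_l2 << 2) | hop_estimate

# A short L2-hit packet issued under a far-away L2-miss predecessor has lots of slack...
print(slack_tag(pred_l2_misses=1, my_l2_miss=False, pred_max_hops=10, my_hops=2))   # 15
# ...while an L2 miss with no slower predecessors has none.
print(slack_tag(pred_l2_misses=0, my_l2_miss=True, pred_max_hops=3, my_hops=6))     # 0
```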

Estimating Slack Priority. How to predict an L2 hit or miss at the core? Global-branch-predictor-based L2 miss predictor: uses a pattern history table and 2-bit saturating counters. Threshold-based L2 miss predictor: if the number of L2 misses among the last M misses >= threshold T, then the next load is predicted to be an L2 miss. Number of miss predecessors? Taken from the list of outstanding L2 misses. Hops estimate? Hops = ΔX + ΔY distance; the predecessor list is used to calculate the slack hop estimate.
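
A sketch of the threshold-based predictor as described above (the values of M and T are placeholders, not from the slides): track the last M miss outcomes and predict an L2 miss if at least T of them went to memory.

```python
from collections import deque

class ThresholdL2MissPredictor:
    """Predict an L2 miss if >= T of the last M misses were L2 misses."""

    def __init__(self, m=8, t=4):
        self.history = deque(maxlen=m)   # recent outcomes: True = L2 miss
        self.t = t

    def record(self, was_l2_miss):
        self.history.append(was_l2_miss)

    def predict_next_is_l2_miss(self):
        return sum(self.history) >= self.t

pred = ThresholdL2MissPredictor(m=8, t=4)
for outcome in [True, True, False, True, True]:
    pred.record(outcome)
print(pred.predict_next_is_l2_miss())   # True: 4 of the recent misses went past L2
```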

Starvation Avoidance. Problem: starvation; prioritizing packets can lead to starvation of lower-priority packets. Solution: time-based packet batching; new batches are formed every T cycles, and packets of older batches are prioritized over younger batches.

Putting It All Together. The packet header is tagged with priority bits before injection: Priority(P) = Batch (3 bits), PredL2 (2 bits), MyL2 (1 bit), HopEstimate (2 bits). Priority order: P's batch (highest priority), then P's slack, then local round robin (final tie breaker).
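
The resulting arbitration can be sketched as a lexicographic comparison (my framing; the slide gives only the ordering batch > slack > round robin): the oldest batch wins, then the lowest estimated slack, then the local round-robin position.

```python
def aergia_priority_key(batch_age, slack_tag, rr_slot):
    """Sort key for competing packets: oldest batch, then least slack, then local round robin.

    batch_age: how many batches old the packet's batch is (larger = older).
    slack_tag: the slack estimate from the header (larger = more slack).
    rr_slot:   local round-robin position (final tie breaker).
    """
    return (-batch_age, slack_tag, rr_slot)

packets = [
    {"name": "A", "batch_age": 1, "slack_tag": 12, "rr_slot": 0},
    {"name": "B", "batch_age": 2, "slack_tag": 15, "rr_slot": 1},   # oldest batch wins outright
    {"name": "C", "batch_age": 1, "slack_tag": 3,  "rr_slot": 2},
]
winner = min(packets, key=lambda p: aergia_priority_key(p["batch_age"], p["slack_tag"], p["rr_slot"]))
print(winner["name"])   # "B"
```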

Outline: Introduction; Packet Scheduling; Memory-Level Parallelism; Aergia; Concept of Slack; Estimating Slack; Evaluation; Conclusion

Evaluation Methodology.
64-core system: x86 processor model based on Intel Pentium M; 2 GHz processor, 128-entry instruction window; 32 KB private L1 and 1 MB per-core shared L2 caches, 32 miss buffers; 4 GB DRAM, 320-cycle access latency, 4 on-chip DRAM controllers.
Detailed network-on-chip model: 2-stage routers (with speculation and lookahead routing); wormhole switching (8-flit data packets); virtual-channel flow control (6 VCs, 5-flit buffer depth); 8x8 mesh (128-bit bi-directional channels).
Benchmarks: multiprogrammed scientific, server, and desktop workloads (35 applications); 96 workload combinations.

Qualitative Comparison. Round robin and age: local and application-oblivious; age is biased towards heavy applications. Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]: provides bandwidth fairness at the expense of system performance; penalizes heavy and bursty applications. Application-aware prioritization policies (SJF) [Das et al., MICRO 2009]: shortest-job-first principle; packet scheduling policies that prioritize network-sensitive applications, which inject lower load.

System Performance. In weighted speedup, SJF provides an 8.9% improvement, Aergia improves system throughput by 10.3%, and Aergia+SJF improves system throughput by 16.1%. [Chart: normalized system speedup for RR, Age, GSF, Aergia, and SJF+Aergia.]

Network Unfairness. SJF does not imbalance network fairness; Aergia improves network unfairness by 1.5X, and SJF+Aergia improves network unfairness by 1.3X. [Chart: network unfairness for RR, Age, GSF, Aergia, and SJF+Aergia.]

Conclusions & Future Directions. Packets have different criticality, yet existing packet scheduling policies treat all packets equally. We propose a new approach to packet scheduling in NoCs: we define slack as a key measure that characterizes the relative importance of a packet, and we propose Aergia, a novel architecture to accelerate low-slack, critical packets. Result: system performance improves by 16.1% and network fairness by 30.8%.

Express-Cube Topologies. Grot et al., "Express Cube Topologies for On-Chip Interconnects," HPCA 2009.

2-D Mesh. [Figure: a 2-D mesh topology.]

2-D Mesh. Pros: low design and layout complexity; simple, fast routers. Cons: large diameter; energy and latency impact.

Concentration (Balfour & Dally, ICS '06). Multiple terminals attached to a router node. Pros: fast nearest-neighbor communication via the crossbar; hop count reduction proportional to the concentration degree. Cons: benefits limited by crossbar complexity.

Concentration Side-Effects. Fewer channels; greater channel width.

Replication (CMesh-X2). Benefits: restores bisection channel count; restores channel width; reduced crossbar complexity.

Flattened Butterfly (Kim et al., MICRO '07). Objectives: improve connectivity; exploit the wire budget.

Flattened Butterfly (Kim et al., MICRO '07). [Figures: step-by-step construction of the flattened butterfly topology, adding express links within each row and column.]

Flattened Butterfly (Kim et al., MICRO '07). Pros: excellent connectivity; low diameter: 2 hops. Cons: high channel count: k²/2 per row/column; low channel utilization; increased control (arbitration) complexity.

Multidrop Express Channels (MECS). Objectives: connectivity; a more scalable channel count; better channel utilization.

Multidrop Express Channels (MECS). [Figures: step-by-step construction of the MECS topology, in which each router drives a multidrop channel reaching all routers in each direction of its row and column.]

Multidrop Express Channels (MECS). A one-to-many topology. Pros: low diameter: 2 hops; k channels per row/column. Cons: asymmetric; increased control (arbitration) complexity.

Partitioning: a GEC Example. [Figure: channel partitioning illustrated on MECS-X2, partitioned MECS, and the flattened butterfly.]

Analytical Comparison (values for 64 / 256 terminals):
CMesh: radix (concentrated) 4 / 8; diameter 6 / 14; channel count 2 / 2; channel width 576 / 1152; router inputs 4 / 4; router outputs 4 / 4
FBfly: radix (concentrated) 4 / 8; diameter 2 / 2; channel count 8 / 32; channel width 144 / 72; router inputs 6 / 14; router outputs 6 / 14
MECS: radix (concentrated) 4 / 8; diameter 2 / 2; channel count 4 / 8; channel width 288 / 288; router inputs 6 / 14; router outputs 4 / 4

Experimental Methodology.
Topologies: Mesh, CMesh-X2, FBfly, MECS-X2
Network sizes: 64 & 256 terminals
Routing: DOR, adaptive
Messages: 64 & 576 bits
Synthetic traffic: uniform random, bit complement, transpose, self-similar
PARSEC benchmarks: Blackscholes, Bodytrack, Canneal, Ferret, Fluidanimate, Freqmine, Vip, x264
Full-system config: M5 simulator, Alpha ISA, 64 OOO cores
Energy evaluation: Orion + CACTI 6

64 Nodes: Uniform Random. [Chart: results for the evaluated topologies under uniform random traffic at 64 nodes.]

256 Nodes: Uniform Random. [Chart: results for the evaluated topologies under uniform random traffic at 256 nodes.]

Energy (100K packets, Uniform Random). [Chart: energy consumed by the evaluated topologies for 100K packets of uniform random traffic.]

64 Nodes: PARSEC. [Chart: results for the evaluated topologies on the PARSEC benchmarks at 64 nodes.]

Summary. MECS: a new one-to-many topology; a good fit for planar substrates; excellent connectivity; effective wire utilization. Generalized Express Cubes: a framework and taxonomy for NoC topologies; an extension of the k-ary n-cube model; useful for understanding and exploring on-chip interconnect options. Future: expand and formalize.

Kilo-NOC: Topology-Aware QoS. Grot et al., "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees," ISCA 2011.

Motivation. Extreme-scale chip-level integration: cores, cache banks, accelerators, I/O logic, network-on-chip (NOC). 10-100 cores today; 1000+ assets in the near future.

Kilo-NOC Requirements. High efficiency: area, energy. Good performance. Strong service guarantees (QoS).

Topology-Aware QoS. Problem: QoS support in each router is expensive (in terms of buffering, arbitration, bookkeeping); e.g., Grot et al., "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip," MICRO 2009. Goal: provide QoS guarantees at low area and power cost. Idea: isolate shared resources in a region of the network and support QoS within that area; design the topology so that applications can access the region without interference.

Baseline QOS-Enabled CMP. [Figure: multiple VMs sharing a die, with shared resources (e.g., memory controllers), VM-private resources (cores, caches), and a QOS-enabled router (Q) at every node.]

Conventional NOC QOS. Contention scenarios: shared resources (memory access); intra-VM traffic (shared cache access); inter-VM traffic (VM page sharing).

Conventional NOC QOS. Contention scenarios: shared resources (memory access); intra-VM traffic (shared cache access); inter-VM traffic (VM page sharing). Goal: network-wide guarantees without network-wide QOS support.

Kilo-NOC QOS. Insight: leverage rich network connectivity to naturally reduce interference among flows and to limit the extent of hardware QOS support. This requires a low-diameter topology; this work uses Multidrop Express Channels (MECS) [Grot et al., HPCA 2009].

Topology-Aware QOS. Dedicated, QOS-enabled regions; the rest of the die is QOS-free. Richly connected topology: traffic isolation. Special routing rules: manage interference. [Figure: die layout with a QOS-enabled shared-resource region and a QOS-free remainder.]


Kilo-NOC View. Topology-aware QOS support: limits QOS complexity to a fraction of the die. Optimized flow control: reduces buffer requirements in the QOS-free regions.

Evaluation Methodology.
Technology: 15 nm; Vdd: 0.7 V
System: 1024 tiles: 256 concentrated nodes (64 shared resources)
Networks:
MECS+PVC: VC flow control, QOS support (PVC) at each node
MECS+TAQ: VC flow control, QOS support only in shared regions
MECS+TAQ+EB: EB flow control outside of SRs, separate request and reply networks
K-MECS: proposed organization: TAQ + hybrid flow control

Area Comparison. [Chart: NOC area for the evaluated network organizations.]

Energy Comparison. [Chart: NOC energy for the evaluated network organizations.]

Summary. Kilo-NOC: a heterogeneous NOC architecture for kilo-node substrates. Topology-aware QOS limits QOS support to a fraction of the die and leverages low-diameter topologies; it improves NOC area- and energy-efficiency while providing strong guarantees.