Throughput-Effective On-Chip Networks for Manycore Accelerators
Ali Bakhoda, John Kim¹ and Tor M. Aamodt
¹KAIST, Korea

Manycore Accelerators and NoC
§ Manycore accelerators
  § Prevalent example: high-end GPUs
  § 10s of thousands of threads running at the same time
  § Bulk Synchronous Parallel programming style
  § 3 of the top 5 supercomputers
    § Based on the Nov. 2010 Top 500 list
  § Primary goal: higher application-level throughput
§ NoC in accelerators
  § Needs a different perspective from CPUs
  § Not very well studied in this context

The Need for Throughput-Effective NoCs
§ Throughput-Effective design: improves application-level performance per unit chip area
[Figure: (Chip Area)⁻¹ [1/mm²] versus Average Throughput [IPC], with constant IPC/mm² contours from 0.30 to 0.55. The Ideal NoC point is marked; arrows indicate the directions of less area and higher throughput.]
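A minimal sketch of the metric on this slide, reading throughput-effectiveness as average IPC divided by chip area. The function name and the two design points are hypothetical, chosen only to fall inside the plotted axis ranges; they are not measured results.

```python
def throughput_effectiveness(ipc: float, chip_area_mm2: float) -> float:
    """Application-level throughput per unit chip area, in IPC/mm^2."""
    if chip_area_mm2 <= 0:
        raise ValueError("chip area must be positive")
    return ipc / chip_area_mm2


# Hypothetical design points (illustrative values only, not from the talk):
# name -> (average IPC, chip area in mm^2)
designs = {
    "design_a": (200.0, 500.0),
    "design_b": (230.0, 625.0),
}

for name, (ipc, area) in designs.items():
    print(f"{name}: {throughput_effectiveness(ipc, area):.3f} IPC/mm^2")
```

In this made-up example design_b has the higher raw IPC, yet design_a delivers more throughput per mm², which is the comparison the contours in the figure make visible.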

Contributions
§ Study impact of NoC on application-level performance
  § Traditional improvements (router latency reduction): minimal impact on application-level performance
  § Increasing channel width: high performance gain + high area cost
  Ø Consider application-level throughput per unit area of NoC
  § Throughput correlated with injection rate of few nodes
  § Many-to-few-to-many traffic pattern
§ Propose Throughput-Effective NoC design
  § Checkerboard network
  § Multi-port router structure

Outline
§ Introduction
§ Baseline architecture
§ NoC properties in accelerators
§ Throughput-Effective NoC design
§ Experimental results
§ Conclusion

Accelerator Overview
[Figure: accelerator overview diagram.]

Baseline Network
§ Mesh with MCs at periphery of the chip
  § Similar to Tilera's TILE64 or Intel's 80-core Teraflops chip
§ Simple and scalable
§ Dimension Order Routing
§ Virtual Channel Flow Control
§ 4-cycle routers
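To make the baseline's dimension order routing concrete, here is a small sketch that walks a packet across a mesh one hop at a time. It assumes XY order (X dimension first, then Y); the function name and coordinate convention are illustrative and not taken from the simulator.

```python
def xy_route(src, dst):
    """Return the (x, y) nodes visited from src to dst, X dimension first."""
    x, y = src
    path = [(x, y)]
    # Travel along the row until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the column to the destination row.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

The route makes at most one turn, at the node that shares the destination's column and the source's row.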

Finding a Balanced Design
[Figure: Application-Level Throughput/Area versus Bandwidth Limit of Ideal Interconnect [fraction of off-chip DRAM bandwidth]. The bisection bandwidth of the baseline mesh is marked.]

Gap between Balanced Mesh and Ideal NoC
[Figure: (Chip Area)⁻¹ [1/mm²] versus Average Throughput [IPC], with constant IPC/mm² contours. Points marked: Ideal NoC and Balanced Mesh; arrows indicate the directions of less area and higher throughput.]

NoC properties in Manycore Accelerators
§ Router latency has minimal impact on application-level throughput
  § Aggressive 1-cycle routers instead of 4-cycle routers
  § Only 2.3% application-level speedup
§ Channel bandwidth is very important
  § 27% speedup by doubling BW
  § But quadratic area increase
[Figure: bar chart of HM speedup (0% to 30%) for 1-cycle routers and for 2x BW.]

2x Channel Bandwidth
[Figure: (Chip Area)⁻¹ [1/mm²] versus Average Throughput [IPC], with constant IPC/mm² contours. Points marked: Ideal NoC, Balanced Mesh and 2x BW; arrows indicate the directions of less area and higher throughput.]

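Many-to-Few-to-Many Traffic Pattern
[Figure: cores C0, C1, C2, ..., Cn send over the request network to memory controllers MC0, MC1, ..., MCm, which send over the reply network back to cores C0, C1, C2, ..., Cn. The MC injection bandwidth is labeled.]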

Throughput-Effective Network Design
§ Reduce area
  § Checkerboard routing
  § Channel slicing
§ Increase performance
  § Checkerboard placement
  § Multi-port routers at MCs

Checkerboard Routing: Half-Routers
§ Half-routers:
  § No turns allowed at half-routers
  § Limited connectivity
  § Saves ~50% of router crossbar area
§ Full-routers:
  § Normal routers w/ complete connectivity
§ Use half-routers at every other node
[Figure: half-router connectivity.]

Routing Restriction (1)
• Routing from a full-router to a half-router that is:
  – An odd number of columns away
  – Not in the same row
• Solution: use YX routing instead of XY routing in this case

Routing Restriction (2)
§ Routing from a half-router to a half-router that is:
  § An even number of columns away
  § Not in the same row
§ Solution: needs two turns
  (1) To an intermediate full-router using YX
  (2) To the destination using XY
§ Requires an extra VC to avoid deadlock

Routing Restriction (3)
§ Routing between full-routers that are an odd number of columns away
§ We avoid this case by using a different MC placement (see Placement of MCs)
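The three restriction cases can be checked with a short sketch. It assumes a checkerboard in which node (x, y) is a half-router when x + y is odd; that parity convention, the function names and the returned labels are all hypothetical. It only decides which turn pattern yields a minimal route with no turn at a half-router, and it does not model virtual channels or deadlock avoidance.

```python
def is_half_router(node):
    """Assumed layout: nodes with odd x + y are half-routers (no turns)."""
    x, y = node
    return (x + y) % 2 == 1


def choose_route(src, dst):
    """Pick a minimal route whose turns all happen at full-routers.

    Returns (kind, intermediate), where kind is one of "straight", "XY",
    "YX", "YX-then-XY" or "none", and intermediate is the full-router
    used by the two-turn route (None otherwise).
    """
    (sx, sy), (dx, dy) = src, dst
    if sx == dx or sy == dy:
        return ("straight", None)          # same row or column: no turn
    if not is_half_router((dx, sy)):
        return ("XY", None)                # XY turns at (dx, sy)
    if not is_half_router((sx, dy)):
        return ("YX", None)                # YX turns at (sx, dy)
    # Two turns: YX to a full-router on an intermediate row, then XY.
    step = 1 if dy > sy else -1
    for iy in range(sy + step, dy, step):
        if not is_half_router((sx, iy)) and not is_half_router((dx, iy)):
            return ("YX-then-XY", (dx, iy))
    return ("none", None)


print(choose_route((0, 0), (1, 2)))   # full -> half, odd columns away
print(choose_route((1, 0), (3, 2)))   # half -> half, even columns away
print(choose_route((0, 0), (1, 1)))   # full -> full, odd columns away
```

Under this parity assumption the first call falls back to YX, the second needs the two-turn route, and the third finds no legal minimal route, which matches the case that the MC placement is meant to avoid.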

Placement of MCs
§ Exploit Many-to-Few
§ Place the MCs at half-router nodes
  § Half-routers can communicate with all nodes with no penalty
  § Common case for BSP: compute cores communicate with MCs, not with each other
[CMP-MSI'08] "Extending the Scalability of Single Chip Stream Processors with On-chip Caches", Bakhoda et al.
[ISCA'09] "Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs", Abts et al.

Multi-port routers at MCs
• Reduce the bottleneck at the few nodes
• Increase terminal BW of the few nodes
  – Increase the injection ports of MC routers
  – Minimal area overhead (~1% in total NoC area)
  – Speedups of up to 25%
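A back-of-the-envelope sketch of why MC injection ports matter in a many-to-few-to-many pattern. It assumes every reply flit enters the network through an MC injection port that accepts one flit per cycle, and that replies are spread evenly over the cores; the function name and the 28-core, 8-MC configuration are illustrative assumptions, not figures from the talk.

```python
def per_core_reply_bound(num_cores, num_mcs, inject_ports_per_mc=1,
                         flits_per_port_per_cycle=1.0):
    """Upper bound on reply flits/cycle per core, limited only by MC injection."""
    total_injection = num_mcs * inject_ports_per_mc * flits_per_port_per_cycle
    return total_injection / num_cores


# Hypothetical configuration: 28 compute cores served by 8 MCs.
single_port = per_core_reply_bound(28, 8, inject_ports_per_mc=1)
dual_port = per_core_reply_bound(28, 8, inject_ports_per_mc=2)
print(f"1 injection port per MC:  {single_port:.3f} flits/cycle per core")
print(f"2 injection ports per MC: {dual_port:.3f} flits/cycle per core")
```

In this simplified model the bound scales linearly with the number of injection ports at the few MC routers, while the many core routers are left unchanged.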

Methodology
§ Compute simulation: GPGPU-Sim (2.2.1b)
§ NoC simulation: Booksim-2
  § Integrated into GPGPU-Sim as network simulator
§ Area estimations: Orion 2.0
§ Benchmarks: 24 CUDA applications including the Rodinia benchmarks

Results
§ Combination of
  § Checkerboard routing and placement
  § Channel slicing
  § Multi-port routers at MCs
§ Overall HM speedup of 17% across 24 benchmarks over balanced baseline
§ Total NoC area reduction of 43%
[Figure: per-benchmark speedup (-20% to 80%), grouped by traffic and speedup class (Low Traffic; Low Speedup, High Traffic; High Speedup, High Traffic). Bars: AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD, HM.]

Throughput-Effective NoC
[Figure: (Chip Area)⁻¹ [1/mm²] versus Average Throughput [IPC], with constant IPC/mm² contours. Points marked: Ideal NoC, Thr. Eff., Balanced Mesh and 2x BW; arrows indicate the directions of less area and higher throughput.]

Summary
§ Throughput-Effective design: consider system-level performance impact + area impact of NoC
§ Observations
  § NoC BW is more important than latency in accelerators
  § Many-to-Few-to-Many traffic pattern
§ Throughput-Effective NoC for accelerators
  § Checkerboard
  § Multi-port MC routers
  § Channel slicing

Thank you

§ Backups…

Channel Slicing – Double networks
§ Divide the single network into two physical networks
  § Each new network: half the bisection BW of the original network
  § Overall bisection BW: constant
§ Saves area
  § Quadratic dependency of crossbar area on channel BW
§ Increases serialization latency
  § But compute accelerators are not sensitive to latency
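The area and latency trade-off of channel slicing can be worked through with the quadratic crossbar model stated on the slide. The 128-bit channel width, the 512-bit packet and the function names are hypothetical; the model covers only the crossbar term, so it is not expected to reproduce a total NoC area figure.

```python
import math


def crossbar_area(channel_width_bits, k=1.0):
    """Relative crossbar area under the quadratic model: k * width^2."""
    return k * channel_width_bits ** 2


def serialization_cycles(packet_bits, channel_width_bits):
    """Cycles needed to push one packet through a channel of this width."""
    return math.ceil(packet_bits / channel_width_bits)


width = 128    # hypothetical single-network channel width, in bits
packet = 512   # hypothetical packet size, in bits

single = crossbar_area(width)
sliced = 2 * crossbar_area(width // 2)   # two networks, each half as wide

print(sliced / single)   # 0.5
print(serialization_cycles(packet, width),
      serialization_cycles(packet, width // 2))   # 4 8
```

Two half-width networks keep the total channel width constant but halve the modeled crossbar area, at the cost of doubling the serialization latency of each packet.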

Results
§ Memory controller placement
  § HM speedup of 13% over balanced baseline design

Results
• Checkerboard routing
  – Less than 1% performance loss compared to DOR with same resources
  – Reduces total router area by 14.2%

Results
§ Channel slicing
  § Average change in performance < 1%
  § NoC area reduction of 37%

Top 5 systems
§ TOP 5 Systems - 11/2010
  § 1 Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, Nvidia GPU, FT1000 8C
  § 2 Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz
  § 3 Nebulae - Dawning TC3600 Blade, Intel X5650, Nvidia Tesla C2050 GPU
  § 4 TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
  § 5 Hopper - Cray XE6 12-core 2.1 GHz

Alternative MC placement example
[Figure: alternative MC placement.]

Many-to-Few-to-Many Traffic Pattern
[Figure: cores C0, C1, C2, ..., Cn send over the request network to memory controllers MC0, MC1, ..., MCm, which send over the reply network back to cores C0, C1, C2, ..., Cn. Labeled bandwidths: core output, MC input, MC output and core input.]