Orchestrated Scheduling and Prefetching for GPGPUs
Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

Multithreading: Parallelize your code! Launch more threads!
Caching: Improve replacement policies
Main memory: Improve memory scheduling policies
Prefetching: Improve the prefetcher (look deep into the future, if you can!)
Is the warp scheduler aware of these techniques?

Warp schedulers aware of the other techniques already exist:
- Multithreading: Two-Level Scheduling (MICRO'11)
- Caching: Cache-Conscious Scheduling (MICRO'12)
- Main memory: Thread-Block-Aware Scheduling (OWL) (ASPLOS'13)
- Prefetching: ?

Our Proposal
- Prefetch-Aware Warp Scheduler
- Goals:
  - Make a simple prefetcher more capable
  - Improve system performance by orchestrating scheduling and prefetching mechanisms
- 25% average IPC improvement over Prefetching + Conventional Warp Scheduling Policy
- 7% average IPC improvement over Prefetching + Best Previous Warp Scheduling Policy

Outline
- Proposal
- Background and Motivation
- Prefetch-Aware Scheduling
- Evaluation
- Conclusions

High-Level View of a GPU
[Figure: threads are grouped into warps within Cooperative Thread Arrays (CTAs), or thread blocks; each Streaming Multiprocessor (SM) contains a warp scheduler, L1 caches, a prefetcher, and ALUs; SMs connect through an interconnect to the L2 cache and DRAM]

Warp Scheduling Policy
- Equal scheduling priority: Round-Robin (RR) execution
- Problem: warps stall at roughly the same time
[Figure: RR timeline; warps W1-W8 compute together (Compute Phase 1), then all issue DRAM requests D1-D8 at once, and the SIMT core stalls until the requests return and Compute Phase 2 begins]

Two-Level (TL) Scheduling
[Figure: TL timeline; warps are split into Group 1 (W1-W4) and Group 2 (W5-W8); Group 2's compute phase overlaps with Group 1's DRAM requests D1-D4, so cycles are saved relative to RR]
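
To make the contrast concrete, here is a minimal sketch of how an RR picker and a TL picker choose the next warp to issue. This is an illustration, not the papers' or GPGPU-Sim's implementation; the `Warp` stub and its `stalled` flag are simplified placeholders for a warp waiting on a DRAM request.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Warp:
    wid: int
    stalled: bool = False   # placeholder for "waiting on a DRAM request"

def rr_pick(warps: deque) -> Optional[Warp]:
    """Round-robin: rotate through all warps with equal priority, so all
    warps progress in lock-step and reach their memory stalls together."""
    for _ in range(len(warps)):
        w = warps[0]
        warps.rotate(-1)          # move head to tail: equal priority
        if not w.stalled:
            return w
    return None                   # every warp is stalled

def tl_pick(groups: deque) -> Optional[Warp]:
    """Two-level: issue only from the head group; once all of its warps
    stall on memory, demote the group so the next group's compute phase
    overlaps with the stalled group's DRAM accesses."""
    for _ in range(len(groups)):
        ready = [w for w in groups[0] if not w.stalled]
        if ready:
            return ready[0]
        groups.rotate(-1)         # whole head group stalled: demote it
    return None

warps = [Warp(i) for i in range(8)]
rr_queue = deque(warps)
tl_groups = deque([warps[0:4], warps[4:8]])   # TL: consecutive warps per group
```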

[Figure: memory addresses mapped to two DRAM banks; under RR, all eight warps access both banks concurrently (high bank-level parallelism, high row buffer locality); under TL, only one group accesses DRAM at a time, leaving banks idle for a period (low bank-level parallelism, high row buffer locality)]

Warp Scheduler Perspective (Summary)
(Bank-level parallelism and row buffer locality together determine DRAM bandwidth utilization.)

Warp Scheduler    | Forms Multiple Warp Groups? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR)  | ✖                           | ✔                      | ✔
Two-Level (TL)    | ✔                           | ✖                      | ✔

Evaluating RR and TL Schedulers
[Figure: IPC improvement factor with a perfect L1 cache, per benchmark (SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG); geometric mean 2.20X for RR and 1.88X for TL. Can we further reduce this gap via prefetching?]

(1) Prefetching: Saves More Cycles
[Figure: timelines; (A) under TL, Group 2's Compute Phase 2 cannot start until its demand requests D5-D8 return; (B) with prefetching, prefetch requests P5-P8 are issued alongside Group 1's demands D1-D4, so Compute Phase 2 (Group 2) can start earlier, saving cycles over both RR and TL]

(2) Prefetching: Improves DRAM Bandwidth Utilization
[Figure: without prefetching, banks are idle for a period while one group computes; prefetch requests for the other group's addresses fill that idle period (no idle period, high bank-level parallelism, high row buffer locality)]

Challenge: Designing a Prefetcher
[Figure: Group 1 (W1-W4) accesses addresses X, X+1, X+2, X+3 while Group 2 (W5-W8) accesses Y, Y+1, Y+2, Y+3 in the two banks; predicting Y from X requires a sophisticated prefetcher]

Our Goal
- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher
- To this end, we design a prefetch-aware warp scheduling policy
- Why? A simple prefetcher does not improve performance with existing scheduling policies

Simple Prefetching + RR Scheduling
[Figure: timeline; the prefetcher issues P2, P4, P6, P8 alongside demands D1, D3, D5, D7, but each prefetch overlaps in time with the demand it targets (P2 with D2, P4 with D4: late prefetches), so Compute Phase 2 starts no earlier and no cycles are saved]

Simple Prefetching + TL Scheduling
[Figure: timeline; within each TL group, prefetches again overlap with the demands they target (P2 with D2, P4 with D4: late prefetches), so there are no saved cycles over TL, although TL still saves cycles over RR]

Let's Try…
[Figure: a simple prefetcher; on an access to address X, it prefetches X+4]
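
Stated as code, the slide's simple prefetcher is just a fixed-distance scheme: a demand to cache line X generates a prefetch for line X+4. A minimal sketch follows; the 128-byte line size is an assumption (the slide only fixes the distance).

```python
LINE_SIZE = 128   # assumed cache-line size in bytes (not given on the slide)
DISTANCE = 4      # prefetch distance in cache lines, per the slide: X -> X+4

def simple_prefetch_addr(demand_addr: int) -> int:
    """On a demand access to cache line X, generate a prefetch for line
    X + 4. Cheap to build, but only useful if some warp soon touches
    X + 4 -- which depends entirely on how the scheduler groups warps."""
    line = demand_addr // LINE_SIZE
    return (line + DISTANCE) * LINE_SIZE

# Group 1 (W1-W4) demands lines X..X+3, so the prefetcher fetches X+4..X+7.
# Under TL, Group 2 actually needs Y..Y+3, making those prefetches useless.
```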

Simple Prefetching with TL Scheduling
[Figure: Group 1's accesses to X…X+3 trigger prefetches of X+4…X+7, but X+4 may not be equal to Y, the address Group 2 actually needs; the prefetches are useless (UP1-UP4), and the banks still sit idle for a period]

Simple Prefetching with TL Scheduling
[Figure: timeline; useless prefetches U5-U8 are issued alongside Group 1's demands D1-D4, but Group 2's demands D5-D8 still have to be fetched, so no cycles are saved over TL]

Warp Scheduler Perspective (Summary)

Warp Scheduler    | Forms Multiple Warp Groups? | Simple-Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR)  | ✖                           | ✖                           | ✔                      | ✔
Two-Level (TL)    | ✔                           | ✖                           | ✖                      | ✔

Our Goal
- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher
- To this end, we will design a prefetch-aware warp scheduling policy
- A simple prefetcher does not improve performance with existing scheduling policies

[Figure: Prefetch-Aware (PA) Warp Scheduler + Simple Prefetcher stands in for a Sophisticated Prefetcher]

Prefetch-Aware (PA) Warp Scheduling
[Figure: warp-to-group assignments; Round-Robin scheduling treats all warps as one pool; Two-Level scheduling forms consecutive groups (W1-W4 and W5-W8); Prefetch-Aware scheduling forms groups of non-consecutive warps (W1, W3, W5, W7 and W2, W4, W6, W8)]
- Non-consecutive warps are associated with one group
- See the paper for the generalized algorithm of the PA scheduler
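
The grouping difference is easy to state in code. Here is a minimal sketch of the idea only; the paper's generalized PA algorithm handles arbitrary group counts and sizes.

```python
def tl_grouping(warps: list, group_size: int = 4) -> list:
    """Two-level grouping: consecutive warps go in the same group."""
    return [warps[i:i + group_size] for i in range(0, len(warps), group_size)]

def pa_grouping(warps: list, num_groups: int = 2) -> list:
    """Prefetch-aware grouping: deal warps round-robin into groups, so
    consecutive warps -- whose accesses are spatially adjacent -- land in
    different groups and can prefetch for each other."""
    return [warps[g::num_groups] for g in range(num_groups)]

warps = [f"W{i}" for i in range(1, 9)]
print(tl_grouping(warps))  # [['W1','W2','W3','W4'], ['W5','W6','W7','W8']]
print(pa_grouping(warps))  # [['W1','W3','W5','W7'], ['W2','W4','W6','W8']]
```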

Simple Prefetching with PA Scheduling
[Figure: Group 1 (W1, W3, W5, W7) accesses addresses X, X+2, Y, Y+2 while Group 2 (W2, W4, W6, W8) owns X+1, X+3, Y+1, Y+3; an access to X triggers a prefetch of X+1]
The reasoning behind non-consecutive warp grouping: the groups can prefetch for each other using a simple prefetcher (Group 1's demands prefetch the lines Group 2 will need).

Simple Prefetching with PA Scheduling
[Figure: when Group 2 (W2, W4, W6, W8) later runs, its accesses to X+1, X+3, Y+1, Y+3 hit in the cache. Cache hits!]

Simple Prefetching with PA Scheduling
[Figure: timelines; (A) under PA without prefetching, Group 2's compute phase still waits on demands D2, D4, D6, D8, matching TL's savings over RR; (B) with prefetching, prefetches P2, P4, P6, P8 issue alongside Group 1's demands D1, D3, D5, D7, so Group 2's compute phase starts even earlier. Saved cycles over TL!]

DRAM Bandwidth Utilization
[Figure: under PA with prefetching, demands and prefetches from both groups keep both banks busy; high bank-level parallelism and high row buffer locality]
- 18% increase in bank-level parallelism (over TL)
- 24% decrease in row buffer locality (over TL)

Warp Scheduler Perspective (Summary)

Warp Scheduler       | Forms Multiple Warp Groups? | Simple-Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR)     | ✖                           | ✖                           | ✔                      | ✔
Two-Level (TL)       | ✔                           | ✖                           | ✖                      | ✔
Prefetch-Aware (PA)  | ✔                           | ✔                           | ✔ (with prefetching)   | ✔ (with prefetching)

Outline
- Proposal
- Background and Motivation
- Prefetch-Aware Scheduling
- Evaluation
- Conclusions

Evaluation Methodology
- Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
- Baseline architecture:
  - 30 SMs, 8 memory controllers, crossbar connected
  - 1300 MHz, SIMT width = 8, max. 1024 threads/core
  - 32 KB L1 data cache, 8 KB texture and constant caches
  - L1 data cache prefetcher, GDDR3 @ 1100 MHz
- Applications chosen from:
  - MapReduce applications
  - Rodinia (heterogeneous applications)
  - Parboil (throughput-computing-focused applications)
  - NVIDIA CUDA SDK (GPGPU applications)

Spatial Locality Detector Based Prefetching
[Figure: a macro-block of four cache lines; X and X+2 have been demanded (D), so the detector prefetches the not-yet-accessed lines X+1 and X+3 (P); D = demand, P = prefetch]
- The prefetch-aware scheduler improves the effectiveness of this simple prefetcher
- See the paper for more details
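
A minimal sketch of the macro-block idea shown on the slide; the actual spatial locality detector is described in the paper, and only the four-line macro-block size comes from the slide.

```python
MACRO_BLOCK_LINES = 4   # macro-block size from the slide: 4 cache lines

def prefetch_candidates(demanded_lines: set) -> set:
    """For every macro-block that has seen a demand, prefetch the lines in
    it that have not been demanded yet (the D -> P pattern on the slide)."""
    candidates = set()
    for line in demanded_lines:
        base = (line // MACRO_BLOCK_LINES) * MACRO_BLOCK_LINES
        for candidate in range(base, base + MACRO_BLOCK_LINES):
            if candidate not in demanded_lines:
                candidates.add(candidate)
    return candidates

# Demands to lines X and X+2 of one macro-block => prefetch X+1 and X+3.
X = 0
print(prefetch_candidates({X, X + 2}))   # {1, 3}
```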

Improving Prefetching Effectiveness
[Figure: three bar charts comparing RR+Prefetching, TL+Prefetching, and PA+Prefetching; fraction of late prefetches: 89%, 86%, 69%; prefetch accuracy: 85%, 89%, 90%; reduction in L1D miss rate: 2%, 4%, 16%]

Performance Evaluation
[Figure: IPC normalized to RR scheduling, per benchmark; geometric means: RR+Prefetching 1.01, TL 1.16, TL+Prefetching 1.19, PA 1.20, PA+Prefetching 1.26; see the paper for additional results]
- 25% IPC improvement over Prefetching + RR Warp Scheduling Policy (commonly used)
- 7% IPC improvement over Prefetching + TL Warp Scheduling Policy (best previous)

Conclusions
- Existing warp schedulers in GPGPUs cannot take advantage of simple prefetchers
  - Consecutive warps have good spatial locality and can prefetch well for each other
  - But existing schedulers schedule consecutive warps close by in time, so prefetches arrive too late
- We proposed prefetch-aware (PA) warp scheduling
  - Key idea: place consecutive warps into different groups
  - Enables a simple prefetcher to be timely, since warps in different groups are scheduled at separate times
- Evaluations show that PA warp scheduling improves performance over combinations of conventional (RR) and the best previous (TL) warp scheduling and prefetching policies
  - Better orchestrates warp scheduling and prefetching decisions

THANKS! QUESTIONS?

BACKUP

Effect of Prefetch-Aware Scheduling
[Figure: percentage of DRAM requests (averaged over a group) with 1 miss, 2 misses, or 3-4 misses to a macro-block; Two-Level scheduling issues many high-spatial-locality requests (multiple misses per macro-block), and under Prefetch-Aware scheduling those requests are recovered by prefetching]

Working (With Two-Level Scheduling)
[Figure: macro-blocks X and Y; all four lines of each (X…X+3 and Y…Y+3) arrive as demands (D): high-spatial-locality requests]

Working (With Prefetch-Aware Scheduling)
[Figure: macro-blocks X and Y; X, X+2, Y, Y+2 arrive as demands (D) while X+1, X+3, Y+1, Y+3 are prefetched (P): the high-spatial-locality requests are covered by prefetches]

Working (With Prefetch-Aware Scheduling)
[Figure: the second group's later demands to X+1, X+3, Y+1, Y+3 hit in the cache]

Effect on Row Buffer Locality
[Figure: row buffer locality for TL, TL+Prefetching, PA, and PA+Prefetching, per benchmark; 24% decrease in row buffer locality over TL]

Effect on Bank-Level Parallelism
[Figure: bank-level parallelism for RR, TL, and PA, per benchmark; 18% increase in bank-level parallelism over TL]

Simple Prefetching + RR Scheduling
[Figure: under RR, all eight warps' addresses (X…X+3 and Y…Y+3) stream to both banks together]

Simple Prefetching with TL Scheduling
[Figure: Group 1 (W1-W4) accesses X…X+3 while Group 2 waits, then Group 2 (W5-W8) accesses Y…Y+3; each bank is idle for a period while the other group computes]

CTA-Assignment Policy (Example)
[Figure: a multi-threaded CUDA kernel's CTAs are distributed across cores; CTA-1 and CTA-2 go to SIMT Core-1, CTA-3 and CTA-4 to SIMT Core-2; each core has its own warp scheduler, L1 caches, and ALUs]
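
A minimal sketch of the assignment in this example, assuming an even block split of CTAs across cores; the slide does not specify the hardware's actual policy, which also respects per-core resource limits.

```python
def assign_ctas(num_ctas: int, num_cores: int) -> dict:
    """Split CTAs evenly across SIMT cores in blocks, matching the example:
    CTA-1, CTA-2 -> Core-1 and CTA-3, CTA-4 -> Core-2."""
    per_core = num_ctas // num_cores
    return {
        core + 1: [core * per_core + c + 1 for c in range(per_core)]
        for core in range(num_cores)
    }

print(assign_ctas(4, 2))   # {1: [1, 2], 2: [3, 4]}
```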