Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Authors: M. Hashemi, O. Mutlu, Y. N. Patt
Presented at MICRO 2016
ETH Zürich – Computer Architecture Seminar HS 2020
Leandra Maisch

Problem Statement
• For many applications we would like to process large amounts of data
• Frequent memory accesses lead to a lot of wait time
• Runahead techniques aim to reduce this wait time by prefetching and executing memory requests during the wait time

Quick Summary
Continuous Runahead explores a method to speculatively execute address-generating instructions while a program is running, triggering future cache misses early so the corresponding memory loads complete ahead of time. This leads to fewer cache misses during normal execution and therefore to lower wait times on memory.

Overview
• Runahead Execution
• Continuous Runahead
  – Choosing and Storing Dependence Chains
  – CRE
• Performance Evaluation
• Critique
• Discussion

RUNAHEAD EXECUTION

Runahead Execution
• What is Runahead Execution?
• Prefetching methods
  – Stream prefetcher
  – Global History Buffer
• Current Limitations of Runahead Execution

Runahead Execution
• Memory accesses can cause full pipeline stalls
• Stalls account for around 50% of a program's execution time
• Runahead uses the instruction window to fetch and execute upcoming instructions during a stall
• Result: fewer cache misses
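
To make the mechanism concrete, here is a minimal toy model (not the paper's implementation; the trace format and the ROB_SIZE/MISS_LATENCY values are invented for illustration). While the oldest load is stalled on memory, the core pre-executes the instructions behind it, turning their future misses into prefetches:

```python
# Toy model of runahead execution (illustrative only).

ROB_SIZE = 4       # assumed instruction-window size
MISS_LATENCY = 10  # assumed cycles a load miss stalls retirement

def simulate(trace, runahead=False):
    """trace: list of ('load', addr) or ('alu',) tuples, in program order."""
    cache, cycles = set(), 0
    for i, op in enumerate(trace):
        if op[0] == 'load' and op[1] not in cache:
            if runahead:
                # Window is full behind the miss: pre-execute the next
                # ROB_SIZE instructions; their loads become prefetches,
                # their results are discarded.
                for spec in trace[i + 1 : i + 1 + ROB_SIZE]:
                    if spec[0] == 'load':
                        cache.add(spec[1])
            cycles += MISS_LATENCY
            cache.add(op[1])
        cycles += 1
    return cycles

trace = [('load', a) for a in (0x100, 0x140, 0x180, 0x1c0)]
print(simulate(trace, runahead=False))  # 44 cycles: every load misses
print(simulate(trace, runahead=True))   # 14 cycles: later misses were prefetched
```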

Stream Prefetcher
• Defines a stream of cache misses by looking at addresses close together in memory
• Looks only in a defined direction
• Prefetches blocks of memory in that direction
• More in “Memory Prefetching using Adaptive Stream Detection” by I. Hur and C. Lin: https://www.cs.utexas.edu/~lin/papers/micro06.pdf
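
A minimal sketch of the idea, assuming a single tracked stream (real stream prefetchers train and confirm several streams in parallel; BLOCK and DEGREE are assumed values):

```python
# Simplified single-stream prefetcher sketch (illustrative only).

BLOCK = 64   # assumed cache-block size in bytes
DEGREE = 4   # assumed number of blocks prefetched ahead

class StreamPrefetcher:
    def __init__(self):
        self.last_miss = None
        self.direction = 0  # +1 ascending, -1 descending, 0 untrained

    def on_miss(self, addr):
        prefetches = []
        if self.last_miss is not None:
            delta = addr - self.last_miss
            # Train on misses to nearby addresses and lock in a direction.
            if 0 < abs(delta) <= 2 * BLOCK:
                self.direction = 1 if delta > 0 else -1
            if self.direction:
                prefetches = [addr + self.direction * BLOCK * k
                              for k in range(1, DEGREE + 1)]
        self.last_miss = addr
        return prefetches

pf = StreamPrefetcher()
for miss in (0x1000, 0x1040, 0x1080):
    print([hex(a) for a in pf.on_miss(miss)])  # trains, then prefetches ahead
```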

Global History Buffer
• Holds the most recent miss addresses in FIFO order
• The ordered table allows unused data to be discarded
• Gives a complete picture of the cache-miss history
• Small table size
• More in “Data Cache Prefetching Using a Global History Buffer” by K. J. Nesbit and J. E. Smith: https://www.eecg.utoronto.ca/~steffan/carg/readings/ghb.pdf
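
A sketch of one GHB variant, global delta correlation (G/DC): the buffer holds the global miss stream, and the two most recent miss deltas are matched against history to replay what followed them. The real design walks a linked index table rather than the flat search below; buffer size and prefetch degree are assumed:

```python
# Global History Buffer prefetcher sketch, G/DC style (illustrative only).

from collections import deque

class GHBPrefetcher:
    def __init__(self, size=256, degree=2):
        self.history = deque(maxlen=size)  # FIFO of global miss addresses
        self.degree = degree

    def on_miss(self, addr):
        self.history.append(addr)
        hist = list(self.history)
        deltas = [b - a for a, b in zip(hist, hist[1:])]
        if len(deltas) < 3:
            return []
        key = (deltas[-2], deltas[-1])  # the two most recent deltas
        # Find an earlier occurrence of the same delta pair and replay
        # the deltas that followed it as prefetch predictions.
        for i in range(len(deltas) - 3, 0, -1):
            if (deltas[i - 1], deltas[i]) == key:
                out, base = [], addr
                for d in deltas[i + 1 : i + 1 + self.degree]:
                    base += d
                    out.append(base)
                return out
        return []

pf = GHBPrefetcher()
for miss in (0, 8, 24, 32, 48, 56):  # repeating delta pattern 8, 16
    predictions = pf.on_miss(miss)
print(predictions)  # [72, 80]: the pattern is recognized and continued
```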

Limitations of Prefetching
• Runahead only runs for the short duration of a full-window stall
• Prioritisation of memory accesses

CONTINUOUS RUNAHEAD

Key Ideas
• Dynamically filter incoming dependence chains
  – Filter for dependence chains generating memory accesses
• Execute dependence chains in a loop
• The loop is executed on the Continuous Runahead Engine (CRE)

DEFINITIONS

Dependence Chain
• Set of dependent instructions leading up to a key instruction
• Generated by backtracking the data flow
• Example of a dependence chain: the instructions computing the address for a memory access
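
A small illustration of that backtracking step; the trace format and register names are invented (the actual mechanism operates on micro-ops in hardware):

```python
# Dependence-chain extraction by backtracking the data flow (illustrative).

# Each retired instruction: (pc, destination register, source registers).
trace = [
    (0x400, 'r1', []),            # r1 = base pointer
    (0x404, 'r2', []),            # r2 = index
    (0x408, 'r3', ['r2']),        # r3 = r2 * 8
    (0x40c, 'r4', ['r1', 'r3']),  # r4 = r1 + r3  (address computation)
    (0x410, 'r7', []),            # unrelated work, not part of the chain
    (0x414, 'r5', ['r4']),        # r5 = load [r4]  <- key instruction
]

def dependence_chain(trace, key_index):
    """Collect every instruction that (transitively) produced a source
    register of the key instruction, walking the trace backwards."""
    needed = set(trace[key_index][2])
    chain = [trace[key_index]]
    for inst in reversed(trace[:key_index]):
        pc, dest, srcs = inst
        if dest in needed:
            chain.append(inst)
            needed.discard(dest)  # the most recent writer satisfies the use
            needed.update(srcs)
    return list(reversed(chain))

for pc, dest, srcs in dependence_chain(trace, key_index=5):
    print(hex(pc), dest, srcs)  # the unrelated 0x410 instruction is filtered out
```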

Full-Window Stall
• Instructions are retired in program order
• Long-latency instructions can block the pipeline
• Meanwhile the instruction window fills with incoming instructions
• The combination of a blocked instruction window and a stalled pipeline is called a full-window stall
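
A toy model of the effect (window size and miss latency are made-up numbers): younger instructions finish long before the head, but in-order retirement keeps them in the window until the miss returns, so the window stays full:

```python
# Toy reorder-buffer model of a full-window stall (illustrative only).

from collections import deque

ROB_SIZE = 4
rob = deque()       # entries: [name, cycles_until_done]
stall_cycles = 0

# The first instruction is a 20-cycle memory miss; the rest are 1-cycle ops.
incoming = [['miss', 20]] + [[f'op{i}', 1] for i in range(10)]

for cycle in range(40):
    # Retire in program order: only finished heads may leave the window.
    while rob and rob[0][1] <= 0:
        rob.popleft()
    # Fetch fills the window with incoming instructions.
    while incoming and len(rob) < ROB_SIZE:
        rob.append(incoming.pop(0))
    # Full window + unfinished head = full-window stall.
    if len(rob) == ROB_SIZE and rob[0][1] > 0:
        stall_cycles += 1
    for entry in rob:  # execution proceeds out of order underneath
        entry[1] -= 1

print(stall_cycles)  # 21: the window is fully stalled for ~the whole miss latency
```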

IMPLEMENTATION

Dependence Chain Selection
• Base Policy
  – Select the next memory access in the buffer
• PC-based Policy
  – Lists all PCs that caused LLC misses
  – Depends on which operation is blocking retirement
• Maximum-Misses Policy
  – Finds and selects the PC causing the most cache misses
• Stall Policy
  – Tracks PCs causing full-window stalls
  – Selects the chain causing the most full-window stalls
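
The miss- and stall-tracking policies boil down to per-PC counters. Here is a sketch with invented data structures and function names (in hardware these are small counter tables, not Python dicts):

```python
# Counter-based sketch of the Maximum-Misses and Stall policies (illustrative).

from collections import Counter

llc_miss_counts = Counter()  # PC -> LLC misses caused by that PC
stall_counts = Counter()     # PC -> full-window stalls caused by that PC

def on_llc_miss(pc):
    llc_miss_counts[pc] += 1

def on_full_window_stall(blocking_pc):
    stall_counts[blocking_pc] += 1

def select_maximum_misses():
    """Maximum-Misses policy: pick the chain of the most-missing PC."""
    return llc_miss_counts.most_common(1)[0][0] if llc_miss_counts else None

def select_stall():
    """Stall policy: pick the chain of the most-stalling PC."""
    return stall_counts.most_common(1)[0][0] if stall_counts else None

# Example: one PC dominates the full-window stalls, so its chain is chosen.
for pc in (0x400, 0x400, 0x420, 0x400):
    on_full_window_stall(pc)
print(hex(select_stall()))  # 0x400
```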

Evaluation of the Policies
• The policies were evaluated on a single-core system using Runahead
• Policies that track the most misses give improved performance on most workloads
[Figure: Comparison of the policies]

Selecting Instructions
• A small number of instructions causes over 90% of full-window stalls
• Only a handful of instructions need to be looped to be effective
[Figure: Instructions causing full-window stalls]

Continuous Runahead Engine
• Strongly based on an enhanced memory controller; see “Accelerating Dependent Cache Misses with an Enhanced Memory Controller” by M. Hashemi et al.: http://eimanebrahimi.com/pub/hashemi_isca16.pdf
• Sits on the memory controller to reduce latency on memory loads

Architecture of the CRE
• 32-uop buffer to hold full dependence chains
• 32-entry physical register file
• 4 KB cache with a 32-entry TLB
[Figure: Data path of the CRE]
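
The stated hardware budget, written out as a configuration sketch (the field names are invented; the values come from the slide):

```python
# CRE hardware budget as a config sketch (field names are assumptions).

from dataclasses import dataclass

@dataclass(frozen=True)
class CREConfig:
    uop_buffer_entries: int = 32       # holds one full dependence chain
    physical_registers: int = 32
    data_cache_bytes: int = 4 * 1024   # small private data cache
    tlb_entries: int = 32

print(CREConfig())
```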

Handling Dependence Chains
• Upon chain generation, the required load translations are sent from the TLB to the CRE
• CRE TLB misses are sent to the CPU core to resolve
• Dependence chains are executed continuously
• The running dependence chain is replaced on every full-window stall
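
A minimal sketch of that runtime behavior (all interfaces are invented; in hardware the chain is a buffered uop sequence, not a Python callable): the CRE keeps re-executing the current chain to generate prefetches until the core ships a replacement:

```python
# Continuous Runahead Engine runtime loop, sketched (illustrative only).

class ContinuousRunaheadEngine:
    def __init__(self):
        self.chain = None  # callable: state -> (new_state, load_address)

    def update_chain(self, chain, initial_state):
        """Called by the core on a full-window stall with a fresh chain."""
        self.chain = chain
        self.state = initial_state

    def step(self, memory):
        """One loop iteration: re-execute the chain, prefetch its load."""
        if self.chain is not None:
            self.state, addr = self.chain(self.state)
            memory.prefetch(addr)

class Memory:
    def __init__(self):
        self.prefetched = []
    def prefetch(self, addr):
        self.prefetched.append(addr)

# Example chain: an address stream advancing 64 bytes per iteration
# (stands in for re-executing the chain's address-generating uops).
def example_chain(state):
    addr = state + 64
    return addr, addr

cre, mem = ContinuousRunaheadEngine(), Memory()
cre.update_chain(example_chain, initial_state=0x1000)
for _ in range(4):
    cre.step(mem)
print([hex(a) for a in mem.prefetched])  # ['0x1040', '0x1080', '0x10c0', '0x1100']
```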

PERFORMANCE EVALUATION

Simulation Environment
• Execution-driven, cycle-level x86 simulator
• Single-core system with
  – 256-entry reorder buffer
  – 32 KB instruction/data caches
  – 1 MB LLC
• Combined with
  – GHB prefetcher
  – Stream prefetcher
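
The evaluated configuration written out as a sketch (field names are invented; the values come from the slide):

```python
# Simulated system parameters as a config sketch (field names are assumptions).

SIMULATED_SYSTEM = {
    "core": {"isa": "x86", "reorder_buffer_entries": 256},
    "caches": {
        "l1_instruction_bytes": 32 * 1024,
        "l1_data_bytes": 32 * 1024,
        "llc_bytes": 1 * 1024 * 1024,
    },
    # The CRE is evaluated alone and combined with these prefetchers.
    "prefetchers": ["GHB", "stream"],
}
```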

RESULTS

CRE Alone
• 34.4% performance gain over the no-prefetching baseline
• 11.9% performance gain over the GHB prefetcher
[Figure: Performance comparisons]

CRE + GHB Prefetching
• 36.4% performance gain over the no-prefetching baseline
• 11.9% performance gain over the GHB prefetcher
[Figure: Performance comparisons]

Memory Bandwidth Consumption
• Stream prefetching and GHB increase memory bandwidth consumption considerably on some applications
• This overhead is drastically reduced with the CRE
[Figure: Comparison of memory bandwidth consumption]

CONCLUSION

Points to Take Home
• Solves the limit on runahead distance by
  – Dynamically identifying critical dependence chains
  – Executing these chains in a loop
• Cheap, low-complexity hardware solution
• Significant performance gain on a variety of workloads

CRITIQUE

Formal Critique
• Positives
  – Written in an understandable way
  – Well structured
• Negatives
  – Relies heavily on the reader's understanding of specific previous work

Positives Regarding Content
• New idea for handling the specified problem
• Efficient solution using few additional resources
• Explores a variety of ways to combine previous solutions with the described one

Negatives Regarding Content
• Potentially only few workloads profit from this
• Potential negative side effects of placing a CRE on the memory controller are not explored
• Solution only for independent cache misses

QUESTIONS

DISCUSSION

Topics
• Alternatives for implementation
• Could/should we implement this in general-purpose computers?
• Performance on multicore systems
• Energy consumption

Alternatives for Implementation
• What do we need to be able to implement it?
• Is the CRE the only way to implement Continuous Runahead?
  – Simultaneous multithreading
  – Idle cores

Performance on Multicore Systems