Continuous Runahead Transparent Hardware Acceleration for Memory Intensive
- Slides: 39
Continuous Runahead Transparent Hardware Acceleration for Memory Intensive Workloads Authors: M. Hashemi, O. Mutlu, Y. N. Patt Presented at MICRO 2016 ETH Zürich –Computer Architecture Seminar HS 2020 Leandra Maisch
Problem Statement For various applications we would like to process large amounts of data p Frequent memory accesses lead to a lot of wait time p Runahead techniques want to reduce this wait time by prefetching and executing memory requests during wait time p 2
Quick Summary Continuous Runahead explores a method to prefetch and execute instructions while a program is running to generate cache misses and subsequent memory loads. This leads to fewer cache misses while a program is executed and therefore to lower wait times on memory. 3
Overview Runahead Execution p Continuous Runahead p n n Choosing and Storing Dependence Chains CRE Performance evaluations p Critic p Discussion p 4
RUNAHEAD EXECUTION 5
Runahead Execution What is Runahead Execution? p Prefetching methods p n n p Stream prefetcher Global History buffer Current Limitations of Runahead Execution 6
Runahead Execution Memory accesses can cause full pipeline stalls p Stalls use around 50% of execution time of a program p Runahead uses instruction window to fetch and execute upcoming instructions p Fewer cache misses 7
Stream Prefetcher Defines stream of cache misses by looking at addresses close in memory p Looks only in a defined direction p Prefetches blocks of memory in said direction p More in “Memory Prefetching using Adaptive Stream Detection” by I. Hur and C. Lin https: //www. cs. utexas. edu/~lin/papers/micro 06. pdf 8
Global History Buffer Holds most recent miss addresses in FIFO order p Ordered table allows to discard unused data p Complete picture of cache miss history p Small sized table p More in “Data Cache Prefetching Using a Global History Buffer” by K. J. Nesbit and J. E. Smith https: //www. eecg. utoronto. ca/~steffan/carg/readings/ghb. pdf 9
Limitations of Prefetching Short duration of full-window stall p Prioritisation of memory accesses p 10
CONTINUOUS RUNAHEAD 11
Key Ideas p Dynamically filter incoming dependence chains n Filter dependence chains generating memory accesses Execute dependence chains in a loop p Loop executed on the Continuous Runahead Engine (CRE) p 12
DEFINITIONS 13
Dependence Chain Set of dependent instructions leading up to a key instruction p Generated by backtracking the data flow p Example of a dependence chain: Computing the address for a memory access 14
Full-Window Stall Instructions are retired in program order p Long-latency instructions can block pipeline p Instruction window is filled with incoming instructions p Both instruction window is blocked and pipeline stalled is called full-window stall p 15
IMPLEMENTATION 16
Dependence Chain Selection p Base Policy n p PC based Policy n n p Lists all PCs that caused LLC misses Dependent on operation which is blocking retirement Maximum-Misses Policy n p Select next memory access in buffer Finds and selects PC causing most cache misses Stall Policy n n Tracks PCs causing full-window stalls Selects chain causing most full-window stalls 17
Evaluation of the Policies p p Evaluation of the policies on a single core system using Runahead Using policies tracking most misses gives improved performance on most workloads Comparisions of the policies 18
Selecting Instructions p Small amount of instructions cause over 90% of full window stalls Instructions causing full window stalls Only a handful instructions need to be looped to be effective 19
Continuous Runahead Engine p Strongly based on an enhanced memory controller See paper “Accelerating Dependent Cache Misses with an Enhanced Memory Controller” by M. Hashemi et al. http: //eimanebrahimi. com/pub/hashemi_isca 16. pdf p Sits on the memory controller to reduce latency on memory loads 20
Architecture of the CRE 32 -uop buffer to hold full dependence chains p 32 -entry physical register p 4 k. B cache with 32 entry TLB p Data path of the CRE 21
Handling Dependence Chains Upon generation TLB sends required load to the CRE p TLB misses are sent to core of the CPU to resolve p Dependence chains are continuously executed p The running dependence chain is relaced every full-window stall p 22
PERFORMANCE EVALUATION 23
Simulation Environment Execution-driven, cycle-level x 86 simulator p Single core system with p n n n p 256 -entry reorder buffer 32 KB of instruction/data cache 1 MB LLC Combined with n n GBH prefetcher Stream prefetcher 24
RESULTS 25
CRE alone p p 34. 4% performance gain over the no-prefetching baseline 11. 9% performance gain over GHB prefetcher Performance comparisons 26
CRE + GHB Prefetching 36. 4% performance gain over the no-prefetching baseline p 11. 9% performance gain over GHB prefetcher p Performance comparisons 27
Memory Bandwidth Consumption Increased memory bandwidth consumption for stream prefetching and GHB on some applications p Overhead drastically reduced with CRE p Comparisions on memory bandwidth consumption 28
CONCLUSION 29
Points to take Home p Solves limit on runahead distance by n n Dynamically identifying critical dependence chains Executing these in a loop Cheap and low-complexity hardware solution p Significant performance gain on a variety of workloads p 30
CRITIQUE 31
Formal Critique p Positives n n p Written in an understandable way Well structured Negatives n Relying heavily on the readers understanding of specific previous work 32
Positives regarding Content New idea on handling the specified problem p Efficient solution using few additional resources p Exploring variety of ways to combine previous solutions with described solution p 33
Negatives regarding Content Potentially few workloads profiting from this p Potential negative side effects caused by placing a CRE on the memory controller not explored p Solution only for independent cache misses p 34
QUESTIONS 35
DISCUSSION 36
Topics Alternatives for Implementation p Could/Should we implement this in general purpose computers p Performance on Multicore Systems p Energy consumption p 37
Alternatives for Implementation What do we need to be able to p Is the CRE the only way to implement Continous Runahead? p n n Simulations multi threading Idle cores 38
Performance on Multicore Systems 39
- Runahead execution
- Runahead execution
- Runahead execution
- Tmo: transparent memory offloading in datacenters
- Centripital acceleration
- Angular and linear quantities
- Radial acceleration definition
- Linear acceleration vs tangential acceleration
- Centripetal acceleration tangential acceleration
- External components of computer
- Past and future continuous tense
- Present simple, past simple, future simple
- Hardware transactional memory
- Hardware transactional memory
- Hardware transactional memory
- Prototype in semantics
- Excplicit memory
- Long term memory vs short term memory
- Internal memory and external memory
- Primary memory and secondary memory
- Logical address
- Which memory is the actual working memory?
- Virtual memory and cache memory
- Virtual memory in memory hierarchy consists of
- Eidetic memory vs iconic memory
- Shared memory vs distributed memory
- Fspos vägledning för kontinuitetshantering
- Typiska novell drag
- Nationell inriktning för artificiell intelligens
- Ekologiskt fotavtryck
- Varför kallas perioden 1918-1939 för mellankrigstiden?
- En lathund för arbete med kontinuitetshantering
- Kassaregister ideell förening
- Personlig tidbok fylla i
- A gastrica
- Förklara densitet för barn
- Datorkunskap för nybörjare
- Stig kerman
- Mall för debattartikel
- Delegerande ledarskap