Improving Memory Bank-Level Parallelism in the Presence of Prefetching
Chang Joo Lee, Veynu Narasiman, Onur Mutlu*, Yale N. Patt
Electrical and Computer Engineering, The University of Texas at Austin
*Electrical and Computer Engineering, Carnegie Mellon University
12/14/2021
Main Memory System • Crucial to high-performance computing • Made of DRAM chips • Multiple banks → each bank can be accessed independently
Memory Bank-Level Parallelism (BLP) [Figure: two requests in the DRAM request buffer, Req B0 to bank 0 and Req B1 to bank 1; the DRAM controller overlaps their bank accesses in time, so the data for Req B0 and Req B1 returns back-to-back on the data bus → DRAM throughput increased]
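The benefit of BLP can be illustrated with a toy service-time model; all latency numbers below are illustrative assumptions, not real DDR timings.

```python
from collections import Counter

BANK_LATENCY = 100   # cycles a bank is busy per request (assumed)
BUS_TRANSFER = 10    # cycles the shared data bus is held per request (assumed)

def service_time(request_banks):
    """Total cycles to service requests, given the bank each request maps to.

    Requests to the same bank serialize on that bank; requests to different
    banks overlap their bank access latencies, so only the data-bus
    transfers serialize.
    """
    per_bank = Counter(request_banks)
    # The busiest bank determines how long the bank accesses take in total...
    bank_time = max(per_bank.values()) * BANK_LATENCY
    # ...plus each request still needs the shared data bus.
    bus_time = len(request_banks) * BUS_TRANSFER
    return bank_time + bus_time

print(service_time([0, 0]))  # same bank: accesses serialize -> 220
print(service_time([0, 1]))  # different banks: accesses overlap -> 120
```

With the same two requests, spreading them across banks nearly halves the service time in this model, which is exactly the throughput gain the slide's timeline depicts.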
Memory Latency-Tolerance Mechanisms • Out-of-order execution, prefetching, runahead execution, etc. • Increase the number of outstanding memory requests on the chip – Memory-Level Parallelism (MLP) [Glew ’98] • The hope is that many requests will be serviced in parallel in the memory system • Higher performance can be achieved when BLP is exposed to the DRAM controller
Problems • On-chip buffers, e.g., Miss Status Holding Registers (MSHRs), are limited in size – Limit the BLP exposed to the DRAM controller – E.g., requests to the same bank fill up the MSHRs • In CMPs, memory requests from different cores are mixed together in the DRAM request buffers – Destroys the BLP of each application running on the CMP → Request issue policies are critical to the BLP exploited by the DRAM controller
Goals and Proposal • Goal 1: Maximize the BLP exposed from each core to the DRAM controller → increase DRAM throughput for useful requests – BLP-Aware Prefetch Issue (BAPI): decides the order in which prefetches are sent from the prefetcher to the MSHRs • Goal 2: Preserve the BLP of each application in CMPs → increase system performance – BLP-Preserving Multi-core Request Issue (BPMRI): decides the order in which memory requests are sent from each core to the DRAM request buffers
DRAM BLP-Aware Request Issue Policies • BLP-Aware Prefetch Issue (BAPI) • BLP-Preserving Multi-core Request Issue (BPMRI)
What Can Limit DRAM BLP? • Miss Status Holding Registers (MSHRs) are NOT large enough to handle many memory requests [Tuck, MICRO ’06] – MSHRs keep track of all outstanding misses for a core → total number of demand/prefetch requests ≤ total number of MSHR entries – Complex, latency-critical, and power-hungry → not scalable → The request issue policy to the MSHRs affects the level of BLP exploited by the DRAM controller
What Can Limit DRAM BLP? [Figure: with a FIFO issue policy (Intel Core), prefetches to bank 1 fill the MSHRs behind a demand to bank 0, so DRAM service is serialized; a BLP-aware issue policy interleaves requests to banks 0 and 1 so their accesses overlap, saving DRAM service time. Increasing the number of requests ≠ high DRAM BLP] A simple issue policy improves DRAM BLP
BLP-Aware Prefetch Issue (BAPI) • Sends prefetches to the MSHRs based on the BLP currently exposed in the memory system – Sends the prefetch mapped to the least busy DRAM bank • Adaptively limits the issue of prefetches based on prefetch accuracy estimation – Low prefetch accuracy → fewer prefetches issued to the MSHRs – High prefetch accuracy → maximize BLP
Implementation of BAPI • FIFO prefetch request buffer per DRAM bank – Stores the prefetches mapped to the corresponding DRAM bank • MSHR occupancy counter per DRAM bank – Keeps track of the number of outstanding requests to the corresponding DRAM bank • Prefetch accuracy register – Stores the periodically estimated prefetch accuracy
BAPI Policy
Every prefetch issue cycle:
1. Mark the oldest prefetch to each bank valid only if that bank’s MSHR occupancy counter ≤ prefetch send threshold
2. Among valid prefetches, select the one to the bank with the minimum MSHR occupancy counter value
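As a sketch, the two-step policy above can be modeled in Python as follows. The class and method names are ours, not from the talk, and MSHR deallocation on request completion is not modeled.

```python
from collections import deque

class BAPI:
    """Sketch of BLP-Aware Prefetch Issue: per-bank FIFOs + occupancy counters."""

    def __init__(self, num_banks, send_threshold):
        self.fifos = [deque() for _ in range(num_banks)]  # per-bank prefetch FIFOs
        self.occupancy = [0] * num_banks                  # outstanding requests per bank
        self.send_threshold = send_threshold              # set from prefetch accuracy

    def enqueue_prefetch(self, bank, addr):
        self.fifos[bank].append(addr)

    def select(self):
        """One prefetch-issue cycle: pick which prefetch (if any) goes to the MSHRs."""
        # Step 1: the oldest prefetch to a bank is valid only if that bank's
        # MSHR occupancy counter is within the prefetch send threshold.
        valid = [b for b, f in enumerate(self.fifos)
                 if f and self.occupancy[b] <= self.send_threshold]
        if not valid:
            return None
        # Step 2: among valid prefetches, pick the one to the least busy bank,
        # maximizing the BLP exposed to the DRAM controller.
        bank = min(valid, key=lambda b: self.occupancy[b])
        self.occupancy[bank] += 1   # an MSHR entry is now allocated
        return bank, self.fifos[bank].popleft()

bapi = BAPI(num_banks=2, send_threshold=7)
bapi.enqueue_prefetch(0, 'A'); bapi.enqueue_prefetch(0, 'B'); bapi.enqueue_prefetch(1, 'C')
print(bapi.select())  # (0, 'A'): both banks idle, oldest wins the tie
print(bapi.select())  # (1, 'C'): bank 1 is now the least busy bank
```

Note how the second issue goes to bank 1 even though bank 0 has an older prefetch waiting: that is the BLP-aware part of the decision.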
Adaptivity of BAPI • Prefetch send threshold – Reserves MSHR entries for prefetches to different banks – Adjusted based on prefetch accuracy • Low prefetch accuracy → low prefetch send threshold • High prefetch accuracy → high prefetch send threshold
DRAM BLP-Aware Request Issue Policies • BLP-Aware Prefetch Issue (BAPI) • BLP-Preserving Multi-core Request Issue (BPMRI)
BLP Destruction in CMP Systems • DRAM request buffers are shared by multiple cores – To exploit a core’s BLP, that BLP must be exposed to the DRAM request buffers – The BLP potential of a core can be destroyed by interference from other cores’ requests → The request issue policy from each core to the DRAM request buffers affects the BLP of each application
Why is DRAM BLP Destroyed? [Figure: with round-robin issue, requests from cores A and B are interleaved in the DRAM request buffers, which serializes each core’s requests to banks 0 and 1 and stalls both cores; BLP-preserving issue keeps each core’s requests together so they are serviced in parallel, saving cycles for core A at the cost of a few added cycles for core B] The issue policy should preserve DRAM BLP
BLP-Preserving Multi-core Request Issue (BPMRI) • Consecutively sends requests from one core to the DRAM request buffers • Limits the maximum number of consecutive requests sent from one core – Prevents starvation of memory non-intensive applications • Prioritizes memory non-intensive applications – The impact of delaying requests from a memory non-intensive application > the impact of delaying requests from a memory-intensive application
Implementation of BPMRI • Last-level (L2) cache miss counter per core – Stores the number of L2 cache misses from the core • Rank register per core – Fewer L2 cache misses → higher rank – More L2 cache misses → lower rank
BPMRI Policy
Every request issue cycle:
  if consecutive requests from the selected core ≥ request send threshold then
    selected core ← highest-ranked core
  issue the oldest request from the selected core
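The policy above can be sketched in Python as follows; the class shape and names are ours, and one simplification is added over the slide's pseudocode: a core is also re-selected when the current core runs out of requests.

```python
from collections import deque

class BPMRI:
    """Sketch of BLP-Preserving Multi-core Request Issue."""

    def __init__(self, l2_miss_counts, request_send_threshold=10):
        self.l2_misses = l2_miss_counts                   # per-core L2 miss counters
        self.queues = [deque() for _ in l2_miss_counts]   # per-core pending requests
        self.threshold = request_send_threshold
        self.selected = 0
        self.consecutive = request_send_threshold         # force a selection first

    def enqueue(self, core, req):
        self.queues[core].append(req)

    def issue(self):
        """One request-issue cycle: send one request to the DRAM request buffers."""
        # Re-select a core after `threshold` consecutive requests, or when the
        # currently selected core has nothing left to send.
        if self.consecutive >= self.threshold or not self.queues[self.selected]:
            ready = [c for c, q in enumerate(self.queues) if q]
            if not ready:
                return None
            # Rank: fewer L2 misses (memory non-intensive) -> higher rank.
            self.selected = min(ready, key=lambda c: self.l2_misses[c])
            self.consecutive = 0
        self.consecutive += 1
        return self.selected, self.queues[self.selected].popleft()

bpmri = BPMRI(l2_miss_counts=[100, 5], request_send_threshold=2)
bpmri.enqueue(0, 'a0'); bpmri.enqueue(0, 'a1'); bpmri.enqueue(1, 'b0')
print(bpmri.issue())  # (1, 'b0'): the non-intensive core goes first
print(bpmri.issue())  # (0, 'a0'): then core 0's requests issue consecutively
```

Issuing core 0's requests back-to-back is what keeps them adjacent in the DRAM request buffers, so the controller can service them in parallel across banks.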
Simulation Methodology
• x86 cycle-accurate simulator
• Baseline processor configuration
– Per core:
• 4-wide issue, out-of-order, 128-entry ROB
• Stream prefetcher (prefetch degree: 4, prefetch distance: 64)
• 32-entry MSHRs
• 512KB 8-way L2 cache
– Shared:
• On-chip, demand-first FR-FCFS memory controller(s)
• 1, 2, 4 DRAM channels for the 1-, 4-, 8-core systems
• 64, 128, 512-entry DRAM request buffers for the 1-, 4-, and 8-core systems
• DDR3-1600 DRAM, 15-15-15 ns, 8KB row buffer
Simulation Methodology
• Workloads
– 14 most memory-intensive SPEC CPU 2000/2006 benchmarks for the single-core system
– 30 and 15 multiprogrammed SPEC 2000/2006 workloads for the 4- and 8-core CMPs, pseudo-randomly chosen
• BAPI’s prefetch send threshold:
Prefetch accuracy (%): 0–40 | 40–85 | 85–100
Threshold:                1 |     7 |      27
• BPMRI’s request send threshold: 10
• Prefetch accuracy estimation and rank decisions are made every 100K cycles
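The accuracy-to-threshold table can be written as a small helper; the function name is ours, and the behavior at the exact boundary values (40%, 85%) is an assumption, since the slide's ranges overlap there.

```python
def prefetch_send_threshold(accuracy_pct):
    """Map estimated prefetch accuracy (%) to BAPI's prefetch send threshold."""
    if accuracy_pct < 40:
        return 1     # inaccurate prefetcher: reserve MSHR entries for demands
    elif accuracy_pct < 85:
        return 7
    else:
        return 27    # accurate prefetcher: let prefetches expose maximum BLP

print(prefetch_send_threshold(20))  # 1
print(prefetch_send_threshold(95))  # 27
```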
Performance of BLP-Aware Issue Policies [Figure: performance improvement of 8.5% on the 1-core system, 13.8% on the 4-core system, and 13.6% on the 8-core system]
Hardware Storage Cost for 4-core CMP
Cost (bits): BAPI 94,368 | BPMRI 72 | Total 94,440
• Total storage: 94,440 bits (~11.5KB)
– 0.6% of the L2 cache data storage
• Logic is not on the critical path
– The issue decision can be made more slowly than the processor cycle
Conclusion • Uncontrolled memory request issue policies limit the level of BLP exploited by the DRAM controller • BLP-Aware Prefetch Issue – Increases the BLP of useful requests from each core exposed to the DRAM controller • BLP-Preserving Multi-core Request Issue – Ensures that requests from the same core can be serviced in parallel by the DRAM controller • Simple, low storage cost • Significantly improve DRAM throughput and performance on both single- and multi-core systems • Applicable to other memory technologies
Questions?