Coordinated Control of Multiple Prefetchers in Multi-Core Systems

Eiman Ebrahimi*, Onur Mutlu‡, Chang Joo Lee*, Yale N. Patt*
* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University

Motivation
  • Aggressive prefetching improves the memory latency tolerance of many applications when they run alone.
  • Prefetching for concurrently executing applications on a CMP can lead to significant system performance degradation and bandwidth waste.
  • Problem: prefetcher-caused inter-core interference
      • Prefetches of one application contend with the prefetches and demands of other applications.

Potential Performance
  • System performance improvement from ideally removing all prefetcher-caused inter-core interference in shared resources: 56%
  • Exact workload combinations can be found in the paper.

Outline
  • Background
  • Shortcoming of Prior Approaches to Prefetcher Control
  • Hierarchical Prefetcher Aggressiveness Control
  • Evaluation
  • Conclusion

Increasing Prefetcher Accuracy
  • Increasing prefetcher accuracy can reduce prefetcher-caused inter-core interference:
      • Single-core prefetcher aggressiveness throttling (e.g., Srinath et al., HPCA ’07)
      • Filtering inaccurate prefetches (e.g., Zhuang and Lee, ICPP ’03)
      • Dropping inaccurate prefetches at the memory controller (Lee et al., MICRO ’08)
  • All such techniques operate independently on the prefetches of each application.

Feedback-Directed Prefetching (FDP) (Srinath et al., HPCA ’07)
  • Uses prefetcher feedback information local to the prefetcher’s core:
      • Prefetch accuracy
      • Prefetch timeliness
      • Prefetch cache pollution
  • Dynamically adapts the prefetcher’s aggressiveness.
      • Stream prefetcher aggressiveness is set by prefetch distance and prefetch degree.
  • [Figure: an access stream (A, A+1) and prefetched lines (P, P+1, P+2, P+3, P+4), annotated with prefetch distance and prefetch degree]
  • Shown to perform better and consume less bandwidth than static aggressiveness configurations.
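
The roles of prefetch distance and prefetch degree can be sketched in a few lines of Python. This is a simplified model written for illustration, not the FDP implementation: the function name next_prefetches and its parameters are hypothetical, and addresses are treated as cache-line numbers in an ascending stream.

```python
def next_prefetches(demand_line, last_prefetched, distance, degree):
    """Return the cache-line numbers to prefetch after a demand access.

    Simplified, hypothetical model of an ascending stream prefetcher:
    - distance: how far ahead of the demand access the prefetcher may run
    - degree:   how many prefetches it may issue per triggering access
    """
    limit = demand_line + distance            # farthest line allowed ahead of the demand
    first = last_prefetched + 1               # continue where the stream left off
    last = min(last_prefetched + degree, limit)
    return list(range(first, last + 1))       # empty when already at the limit


# Degree 4 and distance 64 match the baseline stream prefetcher configuration
# listed on the Evaluation Methodology slide.
print(next_prefetches(demand_line=100, last_prefetched=120, distance=64, degree=4))
```

Under this model, throttling up corresponds to raising distance and degree, and throttling down to lowering them.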

High Interference Caused by Accurate Prefetchers
  • [Figure: several cores send demand and prefetch requests through the shared cache and the memory controller to DRAM banks 0 and 1 and their row buffers; the requests being serviced include row-buffer hits. Legend: "Dem X, Addr: Y" is a demand request from core X for address Y.]
  • In a CMP system, accurate prefetchers can cause significant interference with concurrently executing applications.

Shortcoming of Per-Core (Local-Only) Prefetcher Aggressiveness Control
  • [Figure: cores with FDP-controlled prefetchers (prefetch degree 2 or 4, labeled "FDP Throttle Up") fill sets 0 to 2 of the shared cache with lines labeled Dem, Pref, and Used_P from different cores]
  • Local-only prefetcher control techniques have no mechanism to detect inter-core interference.

Shortcoming of Local-Only Prefetcher Control
  • 4-core workload example: lbm_06 + swim_00 + crafty_00 + bzip2_00
  • Our approach: use both global and per-core feedback to determine each prefetcher’s aggressiveness.

Hierarchical Prefetcher Aggressiveness Control (HPAC)
  • Local control’s goal: maximize the prefetching performance of core i independently.
  • Global control’s goal: keep track of and control prefetcher-caused inter-core interference in the shared memory system.
  • Global control accepts or overrides decisions made by local control to improve overall system performance.
  • [Figure: local control for core i uses accuracy feedback from prefetcher i to make a local throttling decision. Global control receives bandwidth feedback from the memory controller and cache pollution feedback from the shared cache, and produces the final throttling decision.]

Terminology
  • Global feedback metrics used in our mechanism, for each core i:
      • Acc(i): core i’s prefetcher accuracy
      • Pol(i): core i’s prefetcher-caused inter-core cache pollution
          • Demand cache lines of other cores that are evicted by this core’s prefetches and requested subsequent to eviction
      • BW(i): bandwidth consumed by core i
          • Accounts for how long requests from this core tie up DRAM banks
      • BWNO(i): bandwidth needed by other cores j != i
          • Accounts for how long requests from other cores have to wait for DRAM banks because of requests from this core
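
One way to picture BW(i) and BWNO(i) is as per-core counters updated every DRAM cycle. The sketch below only illustrates the two definitions on this slide and is not the hardware described in the paper: the Bank record, the account_cycle helper, and the per-cycle accounting granularity are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Bank:
    """Hypothetical snapshot of one DRAM bank during a single cycle."""
    serving_core: int | None = None                        # core whose request occupies the bank
    waiting_cores: set[int] = field(default_factory=set)   # cores with requests queued for this bank


def account_cycle(banks, bw, bwno):
    """Update the BW(i) and BWNO(i) counters for one cycle (illustrative only)."""
    for bank in banks:
        i = bank.serving_core
        if i is None:
            continue                     # idle bank: nobody consumes or waits
        bw[i] += 1                       # core i ties up this bank for one cycle
        for j in bank.waiting_cores:
            if j != i:
                bwno[i] += 1             # core j waits because of core i's request


# 4-core example: core 1 occupies a bank that cores 0 and 2 are waiting for,
# and core 3 occupies another bank with no one waiting.
bw, bwno = [0] * 4, [0] * 4
banks = [Bank(serving_core=1, waiting_cores={0, 2}), Bank(serving_core=3)]
account_cycle(banks, bw, bwno)
print(bw, bwno)
```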

Calculating Inter-Core Cache Pollution
  • A prefetch from core i evicts a demand line of core j from the shared cache.
      • The evicted line’s address is hashed to index the pollution filter of core i; each filter entry holds a core id and a pollution bit.
  • Core j later experiences a demand cache miss.
      • The missing address from core j is hashed into the same filter; if the entry is marked for core j, Pol(i) is incremented.
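
The pollution-filter bookkeeping can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not the paper's hardware design: the class and method names, the 4096-entry size, the modulo hash, and clearing the bit after a hit are all hypothetical choices.

```python
class PollutionFilter:
    """Hypothetical pollution filter belonging to core `owner`."""

    def __init__(self, owner, num_entries=4096):
        self.owner = owner
        self.num_entries = num_entries           # assumed size, not from the slides
        self.bits = [False] * num_entries        # pollution bit per entry
        self.core_ids = [None] * num_entries     # core whose demand line was evicted
        self.pol = 0                             # Pol(owner)

    def _index(self, line_addr):
        # Stand-in hash function; the slides do not specify the real one.
        return line_addr % self.num_entries

    def on_prefetch_eviction(self, evicted_addr, victim_core):
        """A prefetch from `owner` evicted a demand line of `victim_core`."""
        if victim_core == self.owner:
            return                               # same core: not inter-core pollution
        idx = self._index(evicted_addr)
        self.bits[idx] = True
        self.core_ids[idx] = victim_core

    def on_demand_miss(self, miss_addr, missing_core):
        """`missing_core` had a demand miss in the shared cache on `miss_addr`."""
        idx = self._index(miss_addr)
        if self.bits[idx] and self.core_ids[idx] == missing_core:
            self.pol += 1                        # the evicted line was needed again
            self.bits[idx] = False               # assumed: count each eviction once


f = PollutionFilter(owner=0)
f.on_prefetch_eviction(0x2A40, victim_core=2)
f.on_demand_miss(0x2A40, missing_core=2)
f.on_demand_miss(0x2A40, missing_core=2)         # bit already cleared, not counted again
print(f.pol)
```

Because the filter is indexed by a hash, two addresses can share an entry, so Pol(i) computed this way is an approximation.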

Hierarchical Prefetcher Aggressiveness Control (HPAC)
  • Example: a prefetcher with high accuracy, high pollution, and high bandwidth consumption while other cores need bandwidth.
  • Local control sees high Acc(i) and makes the local throttling decision to throttle up.
  • Global control sees high Pol(i) from pollution filter i at the shared cache, and high BW(i) and high BWNO(i) from the memory controller.
  • Final throttling decision: global control enforces throttle down.

Heuristics for Global Control
  • Classification of global control heuristics based on interference severity:
      • Severe interference
          • Action: reduce the aggressiveness of the interfering prefetcher.
      • Borderline interference
          • Action: prevent the prefetcher from transitioning into severe interference by allowing local control only to throttle down.
      • No interference, or moderate interference from an accurate prefetcher
          • Action: allow local control to maximize local benefits from prefetching.
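
The three heuristic classes can be read as a rule for combining the local decision with the global one. The Python sketch below is one possible reading, not the authors' implementation: the enum and function names are hypothetical, and mapping a blocked throttle-up to "keep the current level" in the borderline case is an assumption.

```python
from enum import Enum


class Interference(Enum):
    SEVERE = "severe"
    BORDERLINE = "borderline"
    NONE_OR_MODERATE = "none or moderate"


class Throttle(Enum):
    DOWN = -1
    KEEP = 0
    UP = 1


def final_decision(interference, local_decision):
    """Combine the global interference class with local control's decision."""
    if interference is Interference.SEVERE:
        return Throttle.DOWN                     # global control overrides local control
    if interference is Interference.BORDERLINE:
        # Local control may only throttle down; anything else is held (assumed).
        if local_decision is Throttle.DOWN:
            return Throttle.DOWN
        return Throttle.KEEP
    return local_decision                        # accept the local decision as is


print(final_decision(Interference.BORDERLINE, Throttle.UP))
```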

HPAC Control Policies
  • Each policy case combines four inputs and maps them to an interference class and an action:
      • Pol(i): causing low pollution or causing high pollution
      • Acc(i): inaccurate or highly accurate
      • BW(i): low BW consumption or high BW consumption
      • BWNO(i): others’ low BW need or others’ high BW need
  • Example case: causing low pollution, inaccurate, high BW consumption, others’ high BW need
      • Interference class: severe interference
      • Action: throttle down
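
A threshold-based lookup for the policy cases might look like the Python sketch below. It encodes only two cases that the slides state explicitly (the severe-interference case on this slide and the high-accuracy, high-pollution example on the earlier HPAC slide) and returns None for everything else. The names, the use of >= comparisons, and the units of the counters are assumptions; the threshold values are the ones listed on the Evaluation Methodology slide.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    acc: float   # Acc(i)
    pol: int     # Pol(i)
    bw: int      # BW(i)
    bwno: int    # BWNO(i)


# HPAC thresholds from the Evaluation Methodology slide.
ACC_T, BW_T, POL_T, BWNO_T = 0.6, 50_000, 90, 75_000


def lookup_action(fb):
    """Return "throttle down" for the two cases the slides spell out, else None."""
    accurate = fb.acc >= ACC_T            # comparison direction is an assumption
    high_pol = fb.pol >= POL_T
    high_bw = fb.bw >= BW_T
    others_need_bw = fb.bwno >= BWNO_T

    # Inaccurate, low pollution, high BW consumption, others' high BW need.
    if not accurate and not high_pol and high_bw and others_need_bw:
        return "throttle down"
    # Highly accurate, high pollution, high BW consumption, others' high BW need.
    if accurate and high_pol and high_bw and others_need_bw:
        return "throttle down"
    return None                           # other cases are not covered by this sketch


print(lookup_action(Feedback(acc=0.3, pol=10, bw=60_000, bwno=80_000)))
```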

Hardware Cost (4-Core System)
  • Total hardware cost (local control and global control): 15.14 KB
  • Additional cost on top of FDP: only 1.55 KB
  • HPAC does not require any structures or logic that are on the processor’s critical path.

Evaluation Methodology
  • x86 cycle-accurate simulator
  • Baseline processor configuration
      • Per core
          • 4-wide issue, out-of-order, 256-entry ROB
          • Stream prefetcher with 32 streams, prefetch degree: 4, prefetch distance: 64
      • Shared
          • 2 MB, 16-way L2 cache (4 MB, 32-way for 8-core)
          • DDR3 1333 MHz
          • 8 B wide core-to-memory bus
          • 128, 256 L2 MSHRs for 4-, 8-core
          • Latency of 15 ns per command (tRP, tRCD, CL)
  • HPAC thresholds used
      • Acc: 0.6
      • BW: 50k
      • Pol: 90
      • BWNO: 75k

Performance Results
  • [Figure: performance results for workload Class 1, Class 2, Class 3, and Class 4, annotated with 15%]
  • Exact workload combinations can be found in the paper.

Summary of Other Results
  • Further results and analysis are presented in the paper:
      • Results with different types of memory controllers
          • Prefetch-Aware DRAM Controllers (PADC)
          • First-Ready First-Come-First-Served (FR-FCFS)
      • Effect of HPAC on system fairness
      • HPAC performance on 8-core systems
      • Multiple types of prefetchers per core and different local-control policies
      • Sensitivity to system parameters

Conclusion
  • Prefetcher-caused inter-core interference can destroy the potential performance of prefetching.
      • It arises when prefetching for concurrently executing applications in CMPs.
      • It did not exist in single-application environments.
  • We develop one low-cost hierarchical solution that throttles different cores’ prefetchers in a coordinated manner.
  • The key is to take global feedback into account to determine the aggressiveness of each core’s prefetcher.
      • Improves system performance by 15% compared to no throttling on a 4-core system.
      • Enables performance improvement from prefetching that is not possible without it on many workloads.

Thank you! Questions?