
Multi-Core Systems: Unfairness
[Figure: a multi-core chip in which CORE 0 through CORE 3 share an L2 cache and a DRAM memory controller, which connects to a shared DRAM memory system of DRAM Bank 0 through Bank 7.]


DRAM Bank Operation
[Figure: a DRAM bank with a row decoder, column decoder, and row buffer. An access to (Row 0, Column 0) loads Row 0 from the empty bank into the row buffer; a subsequent access to the same row is a row HIT served directly from the buffer, while an access to a different row (e.g., Row 1) is a row CONFLICT that must first load the new row into the buffer.]
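To make the hit/conflict distinction concrete, here is a toy Python model of one bank's row buffer. The latencies are taken from the evaluation slide later in this deck; the class itself is an illustrative assumption, not the paper's hardware design:

```python
ROW_HIT_NS = 50        # access to the row already in the row buffer
ROW_CONFLICT_NS = 100  # access that must first load a different row

class DRAMBank:
    def __init__(self):
        self.open_row = None  # empty row buffer at start

    def access(self, row):
        if self.open_row == row:
            return ROW_HIT_NS   # row hit: served from the row buffer
        self.open_row = row     # load the requested row into the buffer
        return ROW_CONFLICT_NS  # row conflict (or empty buffer): slower
```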


DRAM Controllers
- A row-conflict memory access takes much longer than a row-hit access (100 ns vs. 50 ns).
- Current controllers take advantage of the row buffer.
- Commonly used scheduling policy (FR-FCFS):
  - Row-hit first: service row-hit memory accesses first
  - Oldest-first: then service older accesses first
- This scheduling policy aims to maximize DRAM throughput, but it yields itself to a new form of denial of service when multiple threads share the controller. (A minimal sketch of the policy follows below.)
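Here is a minimal Python sketch of the FR-FCFS ordering rule; the `Request` record and function name are assumptions for illustration, not the controller's actual interface:

```python
from collections import namedtuple

# Illustrative request record: the DRAM row it targets and its arrival time.
Request = namedtuple("Request", ["row", "arrival_time"])

def fr_fcfs_next(requests, open_row):
    """Pick the next request under FR-FCFS: row hits first
    (False sorts before True), then oldest first."""
    return min(requests, key=lambda r: (r.row != open_row, r.arrival_time))

# Example: with row 0 open, the younger row-hit request wins over
# the older row-conflict request.
buffered = [Request(row=5, arrival_time=0), Request(row=0, arrival_time=3)]
assert fr_fcfs_next(buffered, open_row=0) == Request(row=0, arrival_time=3)
```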


Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems
Thomas Moscibroda and Onur Mutlu
Computer Architecture Group, Microsoft Research


Outline
- The Problem
  - Memory Performance Hogs
- Preventing Memory Performance Hogs
- Our Solution
  - Fair Memory Scheduling
- Experimental Evaluation
- Conclusions


The Problem
- Multiple threads share the DRAM system.
- DRAM controllers are designed to maximize DRAM throughput.
- DRAM scheduling policies are thread-unaware and unfair:
  - Row-hit first: unfairly prioritizes threads with high row buffer locality (streaming threads; threads that keep accessing the same row)
  - Oldest-first: unfairly prioritizes memory-intensive threads
- The system is therefore vulnerable to denial-of-service attacks.


Memory Performance Hog (MPH)
- A thread that exploits unfairness in the DRAM controller to deny memory service (for long time periods) to threads co-scheduled on the same chip.
- Can severely degrade other threads' performance, while maintaining most of its own performance!
- Easy to write an MPH:
  - An MPH can be designed intentionally (maliciously)
  - Or created unintentionally!
- Implications for commodity multi-core systems:
  - Widespread performance degradation and unpredictability
  - User discomfort and loss of productivity
  - Vulnerable hardware could lose market share


An Example Memory Performance Hog: STREAM
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive


A Co-Scheduled Application: RDARRAY
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive


What Does the MPH Do?
[Figure: the request buffer fills with T0 (stream) requests to Row 0 interleaved with T1 (rdarray) requests to Rows 5, 111, and 16. Because Row 0 sits in the row buffer, the row-hit-first policy services all of T0's requests before any of T1's.]
Row size: 8 KB, data size: 64 B; 128 requests of stream are serviced before rdarray.
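The 128-request figure is simple row arithmetic, using the row and request sizes given on the slide: each 8 KB row holds 128 of stream's sequential 64 B requests, all of which are row hits that the row-hit-first rule services ahead of rdarray's row-conflict request:

```latex
\frac{\text{row size}}{\text{request size}} = \frac{8192\,\text{B}}{64\,\text{B}} = 128 \ \text{row-hit requests per open row}
```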


Already a Problem in Real Dual-Core Systems
[Figure: rdarray suffers a 2.82X slowdown, while stream suffers only a 1.18X slowdown.]
Results on an Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora).


More Severe Problem with 4 and 8 Cores
Simulated configuration:
- 512 KB private L2 caches
- 3-wide processors
- 128-entry instruction window


Outline
- The Problem
  - Memory Performance Hogs
- Preventing Memory Performance Hogs
- Our Solution
  - Fair Memory Scheduling
- Experimental Evaluation
- Conclusions


Preventing Memory Performance Hogs
- MPHs cannot be prevented in software:
  - It is fundamentally hard to distinguish between malicious and unintentional MPHs.
  - Software has no direct control over the DRAM controller's scheduling: the OS/VMM can adjust applications' virtual-to-physical page mappings, but has no control over the DRAM controller's hardwired page-to-bank mappings.
- MPHs should be prevented in hardware:
  - Hardware also cannot distinguish malicious from unintentional MPHs.
  - Goal: contain and limit MPHs by providing fair memory scheduling.


Fairness in Shared DRAM Systems
- A thread's DRAM performance depends on its inherent row-buffer locality and bank parallelism.
- Interference between threads can destroy either or both, so well-studied notions of fairness (bandwidth, latency, or request fairness) are inadequate.
- Fair scheduling cannot simply:
  - Equalize memory bandwidth among threads:
    - Unfairly penalizes threads with good bandwidth utilization.
    - DRAM scheduling is not a pure bandwidth allocation problem: unlike a network wire, DRAM is a stateful system.
  - Equalize the latency or number of requests among threads:
    - Unfairly penalizes threads with good row-buffer locality: row-hit requests are less costly than row-conflict requests.


Our Definition of DRAM Fairness
- A DRAM system is fair if it slows down each thread equally, compared to when the thread is run alone.
- Experienced Latency EL(i): cumulative DRAM latency of thread i when running with other threads.
- Ideal Latency IL(i): cumulative DRAM latency of thread i if it were run alone on the same hardware resources.
- DRAM Slowdown(i) = EL(i) / IL(i).
- The goal of a fair scheduling policy is to equalize Slowdown(i) for all threads i.
  - This considers the inherent DRAM performance of each thread.
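In equation form, using the notation from this slide (the max/min unfairness index is introduced on the algorithm slides that follow):

```latex
\mathrm{Slowdown}(i) = \frac{EL(i)}{IL(i)}, \qquad
\text{Unfairness} = \frac{\max_i \mathrm{Slowdown}(i)}{\min_i \mathrm{Slowdown}(i)} \;\geq\; 1
```

A perfectly fair schedule drives the unfairness index to 1.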


Outline
- The Problem
  - Memory Performance Hogs
- Preventing Memory Performance Hogs
- Our Solution
  - Fair Memory Scheduling
- Experimental Evaluation
- Conclusions


Fair Memory Scheduling Algorithm (1)
- During each time interval (whose length in cycles is a system-configurable parameter), for each thread, the DRAM controller:
  - Tracks EL (Experienced Latency)
  - Estimates IL (Ideal Latency)
- At the beginning of a scheduling cycle, the DRAM controller:
  - Computes Slowdown = EL / IL for each thread
  - Computes the unfairness index = MAX Slowdown / MIN Slowdown
- If unfairness is below the maximum tolerable threshold:
  - Use the baseline scheduling policy to maximize DRAM throughput: (1) row-hit first, (2) oldest-first


Fair Memory Scheduling Algorithm (2)
- If unfairness is at or above the threshold, use the fairness-oriented scheduling policy:
  - (1) Requests from the thread with MAX Slowdown first (if any)
  - (2) Row-hit first
  - (3) Oldest-first
  - This still maximizes DRAM throughput whenever it cannot improve fairness.
- Two parameters govern the algorithm:
  - The maximum tolerable unfairness
  - A lever between short-term and long-term fairness
- System software determines both parameters and conveys them to the hardware via privileged instructions.
A sketch of the combined decision logic from these two slides follows below.
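Putting the two algorithm slides together, here is a minimal Python sketch of the scheduling decision. The data structures, names, and parameter handling are illustrative assumptions; the paper specifies the actual hardware design:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    thread_id: int
    row: int
    arrival_time: int

def fairmem_next(requests, open_row, el, il, max_unfairness):
    """el/il: per-thread Experienced / estimated Ideal latency counters
    (assumed nonzero here). max_unfairness: the maximum tolerable
    unfairness, set by system software."""
    slowdown = {t: el[t] / il[t] for t in el}
    unfairness = max(slowdown.values()) / min(slowdown.values())

    if unfairness < max_unfairness:
        # Baseline policy (maximize throughput): row-hit first, then oldest.
        return min(requests, key=lambda r: (r.row != open_row, r.arrival_time))

    # Fairness-oriented policy: requests from the most slowed-down thread
    # first (if any), then row-hit first, then oldest first.
    victim = max(slowdown, key=slowdown.get)
    return min(requests, key=lambda r: (r.thread_id != victim,
                                        r.row != open_row,
                                        r.arrival_time))
```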


How Does FairMem Prevent an MPH?
[Figure: with FairMem, T0 (stream, Row 0) and T1 (rdarray, Rows 5, 111, 16, 0) requests are serviced interleaved rather than T0 monopolizing the row buffer. Across the animation steps, T0's slowdown stays between 1.03 and 1.10 and T1's between 1.00 and 1.14, so the unfairness index stays between 1.00 and 1.06.]


Hardware Implementation
- Tracking Experienced Latency:
  - Relatively easy.
  - Update a per-thread latency counter as long as there is an outstanding memory request for that thread.
- Estimating Ideal Latency:
  - More involved, because the thread is not actually running alone.
  - Need to estimate inherent row buffer locality:
    - Keep track of the row that would have been in the row buffer if the thread were running alone.
    - For each memory request, update a per-thread latency counter based on this "would-have-been" row buffer state (sketched below).
  - An efficient implementation is possible; details are in the paper.
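A minimal sketch of the "would-have-been" row-buffer idea for estimating IL, assuming the 50 ns / 100 ns latencies from the evaluation slide; the names and structure are illustrative, and the paper gives the efficient hardware design:

```python
ROW_HIT_NS = 50        # row-hit latency (evaluation slide)
ROW_CONFLICT_NS = 100  # row-conflict latency

class ShadowRowBuffer:
    """Per-bank shadow state: for each thread, the row that *would* be
    open in this bank if that thread were running alone."""

    def __init__(self):
        self.would_be_open = {}  # thread_id -> row

    def charge_ideal_latency(self, thread_id, row):
        # A request is an ideal row hit only against the thread's own
        # previous row, regardless of what other threads did to the bank.
        hit = self.would_be_open.get(thread_id) == row
        self.would_be_open[thread_id] = row
        return ROW_HIT_NS if hit else ROW_CONFLICT_NS
```

For each serviced request, the controller would add `charge_ideal_latency(...)` to that thread's IL counter, while the EL counter accrues the latency the request actually experienced, including interference from other threads.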


Outline
- The Problem
  - Memory Performance Hogs
- Preventing Memory Performance Hogs
- Our Solution
  - Fair Memory Scheduling
- Experimental Evaluation
- Conclusions


Evaluation Methodology
- Instruction-level performance simulation
- Detailed x86 processor model based on the Intel Pentium M
- 4 GHz processor, 128-entry instruction window
- 512 KB per-core private L2 caches
- Detailed DRAM model:
  - 8 banks, 2 KB row buffer
  - Row-hit latency: 50 ns (200 cycles)
  - Row-conflict latency: 100 ns (400 cycles)
- Benchmarks:
  - stream (the MPH) and rdarray
  - SPEC CPU2000 and Olden benchmark suites


Dual-Core Results
- MPH with non-intensive VPR:
  - Fairness improvement: 7.9X (down to 1.11)
  - System throughput improvement: 4.5X
- MPH with intensive MCF:
  - Fairness improvement: 4.8X (down to 1.08)
  - System throughput improvement: 1.9X


4-Core Results
- Non-memory-intensive applications suffer the most (VPR)
- Low row-buffer locality hurts (MCF)
- Fairness improvement: 6.6X
- System throughput improvement: 2.2X


8-Core Results
- Fairness improvement: 10.1X
- System throughput improvement: 1.97X


Effect of Row Buffer Size
- Vulnerability increases with row buffer size
- It also increases with memory latency (shown in the paper)


Outline
- The Problem
  - Memory Performance Hogs
- Preventing Memory Performance Hogs
- Our Solution
  - Fair Memory Scheduling
- Experimental Evaluation
- Conclusions


Conclusions
- Introduced the notion of memory performance attacks in shared DRAM memory systems:
  - Unfair DRAM scheduling is the cause of the vulnerability.
  - The problem becomes more severe in future many-core systems.
- We provide a novel definition of DRAM fairness: threads should experience equal slowdowns.
- A new DRAM scheduling algorithm enforces this definition and effectively prevents memory performance attacks.
- Preventing attacks also improves system throughput!


Questions?