
Seminar in Computer Architecture
Meeting 2: Logistics and Examples

Prof. Onur Mutlu
ETH Zürich
Spring 2019
28 February 2019

Recap: Key Goal
(Learn how to) rigorously analyze, present, and discuss papers and ideas in computer architecture

Recap: Some Goals of This Course
Teach/enable/empower you to:
- Think critically
- Think broadly
- Learn how to understand, analyze, and present papers and ideas
- Get familiar with key first steps in research
- Get familiar with key research directions

Recap: Steps to Achieve the Key Goal
Steps for the Presenter:
- Read
- Absorb; read more (other related works)
- Critically analyze; think; synthesize
- Prepare a clear and rigorous talk
- Present
- Answer questions
- Analyze and synthesize (in meeting, after, and at course end)
Steps for the Participants:
- Discuss
- Ask questions
- Analyze and synthesize (in meeting, after, and at course end)

Course Logistics
Requirements:
- 1: Submit HW 0
- 2: Explicitly provide paper preferences
  https://safari.ethz.ch/architecture_seminar/spring2019
2 presentations each week we meet. Each presentation:
- One student presents one paper and leads discussion
- Max 30-minute summary + analysis
- Max 20-minute discussion + brainstorming + feedback
- Should follow the suggested guidelines
No class next week.

Course Requirements and Expectations
- Attendance required for all meetings
  - Sign-in sheet
- Each student presents one paper
  - Prepare for presentation with engagement from the mentor(s)
  - Full presentation + questions + discussion
- Non-presenters participate during the meeting
  - Ask questions, contribute thoughts/ideas
  - Better if you read/skim the paper beforehand
- Everyone comments on papers in the online review system
  - After presentation
- Write synthesis report at the end of semester
  - Sample synthesis report online

Grading Rubric
- Quality of your presentation (60%)
  - How well did you understand the material?
  - How well did you present it?
  - How well did you answer the questions? Be prepared to explain technical terms.
  - We will take into account the difficulty of the paper and the time you had to prepare.
- Quality of the final synthesis paper (30%)
  - How well did you understand some of the papers presented during the seminar?
- Attendance (10%)
- Participation (during class and online) (BONUS 10%)
  - Did you ask good questions?
  - Did you attend all sessions?

Algorithm for Presentation Preparation
- Study Lecture 1 again for presentation guidelines
- Read and analyze your paper thoroughly
  - Discuss with anyone you wish + use any resources
- Prepare a draft presentation based on guidelines
- Meet mentor(s) and get feedback
  - Revise the presentation and delivery
- Meet mentor(s) again and get further feedback
  - Meetings are mandatory; you have to schedule them with your assigned mentor(s). We may suggest meeting times.
- Practice, practice

Example Paper Presentations

Learning by Example
- A great way of learning
- We already did one example last time
  - Memory Channel Partitioning
- We will do at least one more today

Structure of the Presentation
- Background, Problem & Goal
- Novelty
- Key Approach and Ideas
- Mechanisms (in some detail)
- Key Results: Methodology and Evaluation
- Summary
- Strengths
- Weaknesses
- Thoughts and Ideas
- Takeaways
- Open Discussion

Background, Problem & Goal

Novelty

Key Approach and Ideas

Mechanisms (in some detail)

Key Results: Methodology and Evaluation

Summary

Strengths

Weaknesses

Thoughts and Ideas

Takeaways

Open Discussion

Example Paper Presentation

Let’s Review This Paper
Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry,
"RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization"
Proceedings of the 46th International Symposium on Microarchitecture (MICRO), Davis, CA, December 2013.
[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)]

RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri
Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Background, Problem & Goal

Memory Channel: A Bottleneck
[Figure: cores with caches connect through the memory controller (MC) and memory channel to memory; the channel has limited bandwidth and high energy per transfer]

Goal: Reduce Memory Bandwidth Demand
Reduce unnecessary data movement over the memory channel.

Bulk Data Copy and Initialization
- Bulk data copy: src → dst
- Bulk data initialization: val → dst

Bulk Data Copy and Initialization
- memmove & memcpy: 5% of cycles in Google's datacenter [Kanev+, ISCA'15]
- Zero initialization (e.g., security), forking, checkpointing, VM cloning, deduplication, page migration, and many more

Shortcomings of Today's Systems
1) High latency
2) High bandwidth utilization
3) Cache pollution
4) Unwanted data movement
1046 ns, 3.6 µJ (for a 4 KB page copy via DMA)
[Figure: the copied data crosses the memory channel and moves through the MC and the L3/L2/L1 caches to the CPU and back]

Novelty, Key Approach, and Ideas

RowClone: In-Memory Copy
1) Low latency
2) Low bandwidth utilization
3) No cache pollution
4) No unwanted data movement
1046 ns, 3.6 µJ → 90 ns, 0.04 µJ
[Figure: the copy happens inside memory, without crossing the channel or the cache hierarchy]

RowClone: In-DRAM Row Copy
Idea: two consecutive ACTivates; negligible HW cost
- Step 1: Activate row A (transfer row to the row buffer)
- Step 2: Activate row B (transfer row from the row buffer)
[Figure: a DRAM subarray with a 4 KB row buffer; the 4 KB row is copied entirely inside the subarray, while the data bus is only 8 bits wide]

Mechanisms (in some detail)

DRAM Chip Organization
[Figure: memory channel → chip I/O → bank I/O → banks → subarrays; a subarray is an array of rows of DRAM cells with a row buffer]

RowClone Types
- Intra-subarray RowClone (row granularity)
  - Fast Parallel Mode (FPM)
- Inter-bank RowClone (byte granularity)
  - Pipelined Serial Mode (PSM)
- Inter-subarray RowClone

RowClone: Fast Parallel Mode (FPM)
[Figure: src row and dst row in the same subarray, both connected to the shared row buffer]
1. Source row to row buffer
2. Row buffer to destination row

RowClone: Intra-Subarray (I)
[Figure: bitlines are precharged to VDD/2; activating the source cell perturbs them to VDD/2 + δ; the sense amplifier (row buffer) amplifies the difference to full VDD, and the data gets copied into the connected cells]
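As background for why this works (a standard DRAM charge-sharing relation, textbook material rather than something stated on the slide): connecting a cell to a bitline precharged to VDD/2 shifts the bitline by an amount δ set by the capacitance ratio, which the sense amplifier then amplifies to full rail.

```latex
% Standard DRAM charge-sharing relation (textbook background, not from
% the slide): a cell at voltage V_cell connected to a bitline precharged
% to V_DD/2 perturbs the bitline by
\[
  \delta \;=\; \Bigl(V_{\mathrm{cell}} - \tfrac{V_{DD}}{2}\Bigr)
  \cdot \frac{C_{\mathrm{cell}}}{C_{\mathrm{cell}} + C_{\mathrm{bitline}}}
\]
% The sense amplifier amplifies V_DD/2 + delta to V_DD (or 0), restoring
% the source cell and, in RowClone-FPM, overwriting the destination cell.
```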

RowClone: Intra-Subarray (II)
[Figure: src row and dst row sharing one row buffer]
1. Activate src row (copy data from src to row buffer)
2. Activate dst row (disconnect src from the row buffer, connect dst; copy data from the row buffer to dst)
A command-sequence sketch follows below.
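To make the two-step mechanism concrete, here is a minimal sketch of RowClone-FPM as a memory-controller command sequence. The `ctrl.issue` interface, the subarray helper, and the `ROWS_PER_SUBARRAY` constant are illustrative assumptions; the paper defines the mechanism, not this API.

```python
# Minimal sketch of RowClone-FPM as a DRAM command sequence.
# ACTIVATE/PRECHARGE are standard DRAM commands; issuing two ACTIVATEs
# back to back to the same subarray is the RowClone extension.

ROWS_PER_SUBARRAY = 512  # assumption: consecutive row IDs per subarray

def same_subarray(row_a: int, row_b: int) -> bool:
    return row_a // ROWS_PER_SUBARRAY == row_b // ROWS_PER_SUBARRAY

def rowclone_fpm(ctrl, bank: int, src_row: int, dst_row: int) -> None:
    """Copy an entire DRAM row to another row in the same subarray."""
    assert same_subarray(src_row, dst_row), "FPM needs same-subarray rows"
    ctrl.issue("ACTIVATE", bank, src_row)  # 1) src row -> row buffer
    ctrl.issue("ACTIVATE", bank, dst_row)  # 2) row buffer -> dst row
    ctrl.issue("PRECHARGE", bank)          # close the row when done
```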

Fast Parallel Mode: Benefits
- Bulk data copy latency: 11x reduction (1046 ns → 90 ns)
- Bulk data copy energy: 74x reduction (3600 nJ → 40 nJ)
- No bandwidth consumption
- Very few changes to the DRAM chip

Fast Parallel Mode: Constraints
- Location of source/destination
  - Both should be in the same subarray
- Size of the copy
  - Copies all the data from the source row to the destination row

RowClone: Inter-Bank
- Banks on a chip share an internal bus
- Overlap the latency of the read and the write
- 1.9x latency reduction, 3.2x energy reduction
[Figure: data moves between two banks through the chip's internal bus, without crossing the memory channel]

Generalized RowClone
[Figure: chip I/O, bank I/O, banks, and subarrays, with the path taken by each copy type]
- Intra-subarray copy: 2 ACTs
- Inter-bank copy: pipelined internal RD/WR
- Inter-subarray copy: use inter-bank copy twice
- 0.01% area cost
A dispatch sketch combining the three mechanisms follows below.
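Here is a sketch of how the three copy types might be dispatched in one routine. The `RowAddr` record, `LINES_PER_ROW`, and the temporary-row helper are illustrative assumptions; the fallbacks follow the slide (inter-subarray = inter-bank copy used twice).

```python
# Sketch: dispatch a bulk row copy to the cheapest RowClone mechanism.
from dataclasses import dataclass

LINES_PER_ROW = 128  # assumption: 8 KB row / 64 B cache lines

@dataclass
class RowAddr:
    bank: int
    subarray: int
    row: int

def rowclone_copy(ctrl, src: RowAddr, dst: RowAddr) -> None:
    if src.bank == dst.bank and src.subarray == dst.subarray:
        # Intra-subarray: two back-to-back ACTIVATEs (FPM).
        ctrl.issue("ACTIVATE", src.bank, src.row)
        ctrl.issue("ACTIVATE", src.bank, dst.row)
        ctrl.issue("PRECHARGE", src.bank)
    elif src.bank != dst.bank:
        # Inter-bank: pipelined reads/writes over the internal bus (PSM).
        for line in range(LINES_PER_ROW):
            ctrl.issue("READ", src.bank, src.row, line)
            ctrl.issue("WRITE", dst.bank, dst.row, line)
    else:
        # Same bank, different subarray: per the slide, use the
        # inter-bank path twice through a row in another bank
        # (temp_row is an assumed helper).
        tmp = ctrl.temp_row(avoid_bank=src.bank)
        rowclone_copy(ctrl, src, tmp)
        rowclone_copy(ctrl, tmp, dst)
```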

RowClone: Fast Row Initialization
Fix a row at zero (0.5% loss in capacity)
[Figure: one row in the subarray is hardwired to all zeros]

RowClone: Bulk Initialization
- Initialization with arbitrary data
  - Initialize one row
  - Copy the data to other rows
- Zero initialization (most common)
  - Reserve a row in each subarray (always zero)
  - Copy data from reserved row (FPM mode)
  - 6.0x lower latency, 41.5x lower DRAM energy
  - 0.2% loss in capacity

RowClone: Latency & Energy Benefits
[Chart: latency and energy reduction over baseline for copy and zero operations]
- Copy, intra-subarray: 11.6x latency reduction, 74.4x energy reduction
- Copy, inter-bank: 1.9x latency reduction, 3.2x energy reduction
- Copy, inter-subarray: 1.0x latency reduction, 1.5x energy reduction
- Zero, intra-subarray: 6.0x latency reduction, 41.5x energy reduction
Very low cost: 0.01% increase in die area

System Design to Enable RowClone

End-to-End System Design
[Figure: the stack from Application through Operating System, ISA, and Microarchitecture down to DRAM (RowClone)]
- How to communicate occurrences of bulk copy/initialization across layers?
- How to ensure cache coherence?
- How to maximize latency and energy savings?
- How to handle data reuse?

1. Hardware/Software Interface
- Two new instructions
  - memcopy and meminit
  - Similar instructions present in existing ISAs
- Microarchitecture implementation
  - Checks whether instructions can be sped up by RowClone
  - Exports instructions to the memory controller

2. Managing Cache Coherence
- RowClone modifies data in memory
  - Need to maintain coherence of cached data
- Similar to DMA
  - Source and destination in memory
  - Can leverage hardware support for DMA
- Additional optimizations

3. Maximizing Use of the Fast Parallel Mode
- Make the operating system subarray-aware
- Primitives amenable to use of FPM (see the allocator sketch below):
  - Copy-on-Write: allocate the destination in the same subarray as the source; use FPM to copy
  - Bulk zeroing: use FPM to copy data from the reserved zero row
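Below is a sketch of what a subarray-aware allocation path for copy-on-write might look like. The paper makes the OS subarray-aware; this particular per-subarray free-list design and its method names are our assumptions.

```python
# Sketch: per-subarray free lists so a copy-on-write destination can be
# placed in the same subarray as the source (making the copy FPM-eligible).
from collections import defaultdict

class SubarrayAwarePageAllocator:
    def __init__(self):
        self.free_frames = defaultdict(list)  # subarray_id -> free frames

    def free(self, frame: int, subarray_id: int) -> None:
        self.free_frames[subarray_id].append(frame)

    def alloc_near(self, src_subarray_id: int):
        """Prefer a frame in the source page's subarray; else fall back."""
        if self.free_frames[src_subarray_id]:
            return self.free_frames[src_subarray_id].pop(), True  # FPM copy
        for frames in self.free_frames.values():
            if frames:
                return frames.pop(), False  # copy degrades to slower modes
        raise MemoryError("no free physical frames")
```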

4. Handling Data Reuse After Zeroing
- Data reuse after zero initialization
  - Phase 1: OS zeroes out the page
  - Phase 2: Application uses cache lines of the page
- RowClone
  - Avoids misses in Phase 1
  - But incurs misses in Phase 2
- RowClone-Zero-Insert (RowClone-ZI)
  - Insert clean zero cache lines into the cache

Key Results: Methodology and Evaluation

Methodology
- Out-of-order multi-core simulator; 1 MB/core last-level cache
- Cycle-accurate DDR3 DRAM simulator
- 6 copy/initialization-intensive applications
  - Plus SPEC CPU2006 for multi-core runs
- Performance metrics
  - Instruction throughput for single-core
  - Weighted speedup for multi-core

Copy/Initialization-Intensive Applications
- System bootup (booting the Debian OS)
- Compile (GNU C compiler, executing cc1)
- Forkbench (a fork microbenchmark)
- Memcached (inserting a large number of objects)
- MySQL (loading a database)
- Shell script (find with ls on each subdirectory)

Copy and Initialization in Workloads
[Chart: fraction of memory traffic split into Read, Write, Copy, and Zero for bootup, compile, forkbench, mcached, mysql, and shell]

Single-Core: Performance and Energy
[Chart: IPC improvement and memory energy reduction over the baseline, up to ~70%, for bootup, compile, forkbench, mcached, mysql, and shell]
Improvements correlate with the fraction of memory traffic due to copy/initialization.

Multi-Core Systems
- Reduced bandwidth consumption benefits all applications
- Run copy/initialization-intensive applications together with memory-intensive SPEC applications
  - Half the cores run copy/initialization-intensive applications; the remaining half run SPEC applications

Multi-Core Results: Summary
[Chart: improvement over baseline in system performance and memory energy efficiency for 2-, 4-, and 8-core systems, up to ~30%]
- Performance improvement increases with increasing core count
- Consistent improvement in energy/instruction

Summary

Executive Summary
- Bulk data copy and initialization
  - Unnecessarily move data on the memory channel
  - Degrade system performance and energy efficiency
- RowClone: perform copy in DRAM with low cost
  - Uses the row buffer to copy a large quantity of data
  - Source row → row buffer → destination row
  - 11x lower latency and 74x lower energy for a bulk copy
- Accelerates Copy-on-Write and Bulk Zeroing
  - Forking, checkpointing, zeroing (security), VM cloning
- Improves performance and energy efficiency at low cost
  - 27% and 17% for 8-core systems (0.01% DRAM chip area)

Strengths

Strengths of the Paper
- Simple, novel mechanism to solve an important problem
- Effective and low hardware overhead
- Intuitive idea!
- Greatly improves performance and efficiency (assuming data is mapped nicely)
- Seems like a clear win for data initialization (without mapping requirements)
- Makes the software designer's life easier
  - If copies are 10x-100x cheaper, how to design software?
- Paper tackles many low-level and system-level issues
- Well-written, insightful paper

Weaknesses

Weaknesses
- Requires data to be mapped in the same subarray to deliver the largest benefits
  - Helps less if data movement is not within a subarray
  - Does not help if data movement is across DRAM channels
- Inter-subarray copy is very inefficient
- Causes many changes in the system stack
  - End-to-end design spans applications to circuits
  - Software-hardware cooperative solution might not always be easy to adopt
- Cache coherence and data reuse cause real overheads
- Evaluation is done solely in simulation
- Evaluation does not consider multi-chip systems
- Are these the best workloads to evaluate?

Recall: Avoid Rat Holes
Source: https://www.cse.wustl.edu/~jain/iucee/ftp/k_10adp.pdf

Raj Jain, “The Art of Computer Systems Performance Analysis,” Wiley, 1991.

Aside: A Recommended Book
Raj Jain, “The Art of Computer Systems Performance Analysis,” Wiley, 1991.

Thoughts and Ideas

Extensions
- Can this be improved to do faster inter-subarray copy?
  - Yes, see the LISA paper [Chang et al., HPCA 2016]
- Can this be extended to move data at smaller granularities?
- Can we have more efficient solutions to
  - Cache coherence (minimize overhead)
  - Data reuse after copy and initialization
- Can this idea be evaluated on a real system? How?
- Can similar ideas and DRAM properties be used to perform computation on data?
  - Yes, see the Ambit paper [Seshadri et al., MICRO 2017]

LISA: Fast Inter-Subarray Data Movement
Kevin K. Chang, Prashant J. Nair, Saugata Ghose, Donghyuk Lee, Moinuddin K. Qureshi, and Onur Mutlu,
"Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM"
Proceedings of the 22nd International Symposium on High-Performance Computer Architecture (HPCA), Barcelona, Spain, March 2016.
[Slides (pptx) (pdf)] [Source Code]

In-DRAM Bulk AND/OR
Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry,
"Fast Bulk Bitwise AND and OR in DRAM"
IEEE Computer Architecture Letters (CAL), April 2015.

Ambit: Bulk-Bitwise In-DRAM Computation
Vivek Seshadri et al., "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017.

Efficient Data Coherence Support
Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu,
"LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory"
IEEE Computer Architecture Letters (CAL), June 2016.

Takeaways

Key Takeaways
- A novel method to accelerate data copy and initialization
- Simple and effective
- Hardware/software cooperative
- Good potential for work building on it to extend it
  - To different granularities
  - To make things more efficient and effective
  - Multiple works have already built on the paper (see LISA, Ambit, and many other works in Google Scholar)
- Easy-to-read and easy-to-understand paper

Open Discussion

Discussion Starters
- Thoughts on the previous ideas?
- How practical is this?
- Will the problem become bigger and more important over time?
- Will the solution become more important over time?
- Are other solutions better? Is this solution clearly advantageous in some cases?

More on RowClone
Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry,
"RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization"
Proceedings of the 46th International Symposium on Microarchitecture (MICRO), Davis, CA, December 2013.
[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)]

RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri
Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Seminar in Computer Architecture
Meeting 2: Logistics and Examples

Prof. Onur Mutlu
ETH Zürich
Spring 2019
28 February 2019

We Did Not Cover the Following Slides. They Are For Your Benefit.

Example Paper Presentation II

PAR-BS
Onur Mutlu and Thomas Moscibroda,
"Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems"
Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008.
[Summary] [Slides (ppt)]

We Will Do This Differently
- I will give a “conference talk”
- You can ask questions and analyze what I described

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Outline
- Background and Goal
- Motivation
  - Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

The DRAM System
[Figure: BANK 0-3, each an array of rows and columns with a row buffer, connected over the DRAM bus to the DRAM controller]
FR-FCFS policy: 1) row-hit first, 2) oldest first

Multi-Core Systems
[Figure: a multi-core chip with CORE 0-3 and L2 caches sharing a DRAM memory controller and DRAM banks 0-7; threads' requests interfere in the shared DRAM memory system]

Inter-thread Interference in the DRAM System
- Threads delay each other by causing resource contention:
  - Bank, bus, row-buffer conflicts [MICRO 2007]
- Threads can also destroy each other's DRAM bank parallelism
  - Otherwise-parallel requests can become serialized
- Existing DRAM schedulers are unaware of this interference
- They simply aim to maximize DRAM throughput
  - Thread-unaware and thread-unfair
  - No intent to service each thread's requests in parallel
  - FR-FCFS policy: 1) row-hit first, 2) oldest first
    - Unfairly prioritizes threads with high row-buffer locality

Consequences of Inter-Thread Interference in DRAM
[Chart: normalized memory stall-time; DRAM is the only shared resource, yet a "memory performance hog" slows down even high-priority cores, and the low-priority cores make very slow progress]
- Unfair slowdown of different threads [MICRO 2007]
- System performance loss [MICRO 2007]
- Vulnerability to denial of service [USENIX Security 2007]
- Inability to enforce system-level thread priorities [MICRO 2007]

Our Goal
- Control inter-thread interference in DRAM
- Design a shared DRAM scheduler that
  - provides high system performance
    - preserves each thread's DRAM bank parallelism
  - provides fairness to threads sharing the DRAM system
    - equalizes memory slowdowns of equal-priority threads
  - is controllable and configurable
    - enables different service levels for threads with different priorities

Outline
- Background and Goal
- Motivation
  - Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

The Problem
- Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests
  - Memory-Level Parallelism (MLP)
  - Out-of-order execution, non-blocking caches, runahead execution
- Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks
- Multiple threads share the DRAM controller
- DRAM controllers are not aware of a thread's MLP
  - Can service each thread's outstanding requests serially, not in parallel

Bank Parallelism of a Thread
[Figure: a single thread issues 2 DRAM requests, to Bank 0 Row 1 and Bank 1 Row 1; execution goes compute, stall, compute]
- Bank access latencies of the two requests are overlapped
- Thread stalls for ~ONE bank access latency

Bank Parallelism Interference in DRAM
[Figure: baseline scheduler; Thread A requests Bank 0 Row 1 and Bank 1 Row 1, Thread B requests Bank 1 Row 99 and Bank 0 Row 99; the scheduler interleaves them so each thread's two requests are serviced one after the other]
- Bank access latencies of each thread are serialized
- Each thread stalls for ~TWO bank access latencies

Parallelism-Aware Scheduler
[Figure: with the baseline scheduler, each thread stalls for ~two bank access latencies; a parallelism-aware scheduler services each thread's two requests in parallel, saving cycles]
Average stall time: ~1.5 bank access latencies

Outline
- Background and Goal
- Motivation
  - Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling (PAR-BS)
  - Request Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

Parallelism-Aware Batch Scheduling (PAR-BS)
- Principle 1: Parallelism-awareness
  - Schedule requests from a thread (to different banks) back to back
  - Preserves each thread's bank parallelism
  - But, this can cause starvation...
- Principle 2: Request Batching
  - Group a fixed number of oldest requests from each thread into a "batch"
  - Service the batch before all other requests
  - Form a new batch when the current one is done
  - Eliminates starvation, provides fairness
  - Allows parallelism-awareness within a batch
[Figure: requests from threads T0-T3 queued at a bank; the oldest ones form a batch]

PAR-BS Components
- Request batching
- Within-batch scheduling
  - Parallelism-aware

Request Batching
- Each memory request has a bit (marked) associated with it
- Batch formation (sketched below):
  - Mark up to Marking-Cap oldest requests per bank for each thread
  - Marked requests constitute the batch
  - Form a new batch when no marked requests are left
- Marked requests are prioritized over unmarked ones
- No reordering of requests across batches: no starvation, high fairness
- How to prioritize requests within a batch?
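Here is a sketch of batch formation under the Marking-Cap rule. The request record (thread_id, bank, arrival_time, marked) is our representation; the marking policy itself is the slide's.

```python
# Sketch of PAR-BS batch formation.
from collections import defaultdict

MARKING_CAP = 5  # marked requests allowed per thread per bank (tunable)

def form_batch(request_buffer) -> None:
    """Mark up to MARKING_CAP oldest requests per (thread, bank) pair."""
    groups = defaultdict(list)
    for req in request_buffer:
        groups[(req.thread_id, req.bank)].append(req)
    for reqs in groups.values():
        reqs.sort(key=lambda r: r.arrival_time)  # oldest first
        for req in reqs[:MARKING_CAP]:
            req.marked = True                    # joins the current batch

def batch_done(request_buffer) -> bool:
    """A new batch forms only when no marked requests remain."""
    return not any(r.marked for r in request_buffer)
```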

Within-Batch Scheduling
- Can use any existing DRAM scheduling policy
  - FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
- But, we also want to preserve intra-thread bank parallelism
  - Service each thread's requests back to back
- HOW? The scheduler computes a ranking of threads when the batch is formed
  - Higher-ranked threads are prioritized over lower-ranked ones
  - Improves the likelihood that requests from a thread are serviced in parallel by different banks
  - Different threads prioritized in the same order across ALL banks

How to Rank Threads within a Batch
- Ranking scheme affects system throughput and fairness
- Maximize system throughput
  - Minimize average stall-time of threads within the batch
- Minimize unfairness (equalize the slowdown of threads)
  - Service threads with inherently low stall-time early in the batch
  - Insight: delaying memory non-intensive threads results in high slowdown
- Shortest stall-time first (shortest job first) ranking
  - Provides optimal system throughput [Smith, 1956]*
  - Controller estimates each thread's stall-time within the batch
  - Ranks threads with shorter stall-time higher

* W. E. Smith, "Various optimizers for single stage production," Naval Research Logistics Quarterly, 1956.

Shortest Stall-Time First Ranking
- Maximum number of marked requests to any bank (max-bank-load)
  - Rank the thread with lower max-bank-load higher (~ low stall-time)
- Total number of marked requests (total-load)
  - Breaks ties: rank the thread with lower total-load higher
Example (marked requests of T0-T3 queued at Banks 0-3), as computed in the sketch below:

  Thread   max-bank-load   total-load
  T0       1               3
  T1       2               4
  T2       2               6
  T3       5               9

Ranking: T0 > T1 > T2 > T3
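The ranking rule is easy to express in code; here is a sketch that reproduces the example above, representing each marked request as a (thread_id, bank) pair (a simplification of ours, not the hardware's data layout).

```python
# Sketch of shortest stall-time first thread ranking.
from collections import Counter

def rank_threads(marked_requests):
    """Rank best-first: lower max-bank-load wins; ties broken by total-load."""
    per_thread_bank = Counter(marked_requests)          # (thread, bank) -> n
    total_load = Counter(t for t, _ in marked_requests)
    max_bank_load = {}
    for (t, _), n in per_thread_bank.items():
        max_bank_load[t] = max(max_bank_load.get(t, 0), n)
    return sorted(total_load, key=lambda t: (max_bank_load[t], total_load[t]))

# The example above: T0 (max-bank-load 1, total-load 3), T1 (2, 4),
# T2 (2, 6), T3 (5, 9)  ->  ranking [T0, T1, T2, T3].
```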

Example Within-Batch Scheduling
[Figure: the same batch scheduled on Banks 0-3 in arrival order vs. with the ranking T0 > T1 > T2 > T3]
- Baseline scheduling (arrival order): stall times T0 = 4, T1 = 4, T2 = 5, T3 = 7; AVG: 5 bank access latencies
- PAR-BS scheduling: stall times T0 = 1, T1 = 2, T2 = 4, T3 = 7; AVG: 3.5 bank access latencies

Putting It Together: PAR-BS Scheduling Policy
- PAR-BS scheduling policy (see the sketch below):
  1. Marked requests first (batching)
  2. Row-hit requests first
  3. Higher-rank thread first (shortest stall-time first; parallelism-aware within-batch scheduling)
  4. Oldest first
- Three properties:
  - Exploits row-buffer locality and intra-thread bank parallelism
  - Work-conserving
    - Services unmarked requests to banks without marked requests
  - Marking-Cap is important
    - Too small a cap: destroys row-buffer locality
    - Too large a cap: penalizes memory non-intensive threads
- Many more trade-offs analyzed in the paper
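The four rules compose naturally into a single priority key; this sketch uses our own request representation and assumes the ranking list produced by the previous sketch.

```python
# Sketch: PAR-BS request prioritization as a sort key (lower sorts first).
def parbs_key(req, rank_of, open_rows):
    return (
        not req.marked,                            # 1) marked first
        req.row != open_rows.get(req.bank),        # 2) row-hit first
        rank_of.get(req.thread_id, len(rank_of)),  # 3) higher rank first
        req.arrival_time,                          # 4) oldest first
    )

def next_request(requests, ranking, open_rows):
    """Pick the highest-priority request; unranked threads sort last."""
    rank_of = {t: i for i, t in enumerate(ranking)}
    return min(requests, key=lambda r: parbs_key(r, rank_of, open_rows))
```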

Hardware Cost
- <1.5 KB storage cost for an 8-core system with a 128-entry memory request buffer
- No complex operations (e.g., divisions)
- Not on the critical path
  - Scheduler makes a decision only every DRAM cycle

Outline
- Background and Goal
- Motivation
  - Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

System Software Support
- OS conveys each thread's priority level to the controller
  - Levels 1, 2, 3, ... (highest to lowest priority)
- Controller enforces priorities in two ways (marking rule sketched below):
  - Mark requests from a thread with priority X only every Xth batch
  - Within a batch, higher-priority threads' requests are scheduled first
- Purely opportunistic service
  - Special very low priority level L
  - Requests from such threads are never marked
- Quantitative analysis in the paper
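The marking-frequency rule is simple enough to state directly in code; the batch-numbering convention below is our assumption.

```python
# Sketch of priority enforcement via marking frequency: a thread at
# priority level X is marked only every Xth batch (level 1 = highest);
# the opportunistic level "L" is never marked.
def eligible_for_marking(priority, batch_number: int) -> bool:
    if priority == "L":
        return False                      # purely opportunistic service
    return batch_number % priority == 0   # e.g., level 3 -> batches 3, 6, 9

# A priority-1 thread is marked in every batch (n % 1 == 0 always).
```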

Outline
- Background and Goal
- Motivation
  - Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

Evaluation Methodology
- 4-, 8-, 16-core systems
  - x86 processor model based on Intel Pentium M
  - 4 GHz processor, 128-entry instruction window
  - 512 KB per-core private L2 caches, 32 L2 miss buffers
  - 128-entry memory request buffer
- Detailed DRAM model based on Micron DDR2-800
  - 8 banks, 2 KB row buffer
  - 40 ns (160 cycles) row-hit round-trip latency
  - 80 ns (320 cycles) row-conflict round-trip latency
- Benchmarks
  - Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  - 100, 16, 12 program combinations for 4-, 8-, 16-core experiments

Comparison with Other DRAM Controllers
- Baseline FR-FCFS [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000]
  - Prioritizes row-hit requests, older requests
  - Unfairly penalizes threads with low row-buffer locality and memory non-intensive threads
- FCFS [Intel Pentium 4 chipsets]
  - Oldest-first; low DRAM throughput
  - Unfairly penalizes memory non-intensive threads
- Network Fair Queueing (NFQ) [Nesbit et al., MICRO 2006]
  - Equally partitions DRAM bandwidth among threads
  - Does not consider inherent (baseline) DRAM performance of each thread
  - Unfairly penalizes threads with high bandwidth utilization [MICRO 2007]
  - Unfairly prioritizes threads with bursty access patterns [MICRO 2007]
- Stall-Time Fair Memory Scheduler (STFM) [Mutlu & Moscibroda, MICRO 2007]
  - Estimates and balances thread slowdowns relative to when run alone
  - Unfairly treats threads with inaccurate slowdown estimates
  - Requires multiple (approximate) arithmetic operations

Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
[Chart: unfairness of the schedulers across 4-, 8-, and 16-core systems; PAR-BS achieves the lowest unfairness, with improvement factors of 1.11x and 1.08x marked]

System Performance (Hmean Speedup)
[Chart: PAR-BS improves system performance by 8.3%, 6.1%, and 5.1% on 4-, 8-, and 16-core systems, respectively]

Outline
- Background and Goal
- Motivation
  - Destruction of Intra-thread DRAM Bank Parallelism
- Parallelism-Aware Batch Scheduling
  - Batching
  - Within-batch Scheduling
- System Software Support
- Evaluation
- Summary

Summary
- Inter-thread interference can destroy each thread's DRAM bank parallelism
  - Serializes a thread's requests; reduces system throughput
  - Makes techniques that exploit memory-level parallelism less effective
  - Existing DRAM controllers are unaware of intra-thread bank parallelism
- A new approach to fair and high-performance DRAM scheduling
  - Batching: eliminates starvation, allows fair sharing of the DRAM system
  - Parallelism-aware thread ranking: preserves each thread's bank parallelism
  - Flexible and configurable: supports system-level thread priorities and QoS policies
- PAR-BS provides better fairness and system performance than previous DRAM schedulers

Thank you. Questions?

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Backup Slides

Multiple Memory Controllers (I)
- Local ranking: each controller uses PAR-BS independently
  - Computes its own ranking based on its local requests
- Global ranking: a meta-controller computes a global ranking across all controllers based on global information
  - Only needs to track bookkeeping info about each thread's requests to the banks in each controller
- The difference between the rankings computed by each scheme depends on the balance of the distribution of requests to each controller
  - Balanced → local and global rankings are similar

Multiple Memory Controllers (II)
[Charts: 16-core system with 4 memory controllers; unfairness improvements of 1.18x and 1.33x, and normalized hmean-speedup improvements of 7.4% and 11.5%]

Example with Row Hits
[Table: per-thread stall times under three schedules; Thread 1: 4, 5.5, 1; Thread 2: 4, 3, 2; Thread 3: 5, 4, –; Thread 4: 7, 4.5, 5.5; the two averages shown are 4.375 and 3.125 bank access latencies]

End of Backup Slides

Now Your Turn to Analyze...
- Background, Problem & Goal
- Novelty
- Key Approach and Ideas
- Mechanisms (in some detail)
- Key Results: Methodology and Evaluation
- Summary
- Strengths
- Weaknesses
- Thoughts and Ideas
- Takeaways
- Open Discussion

PAR-BS Pros and Cons
- Upsides:
  - First scheduler to address bank parallelism destruction across multiple threads
  - Simple mechanism (vs. STFM)
  - Batching provides fairness
  - Ranking enables parallelism awareness
- Downsides:
  - Does not always prioritize the latency-sensitive applications
  - Deadline guarantees?
  - Complexity?
- Some ideas implemented in real SoC memory controllers

More on PAR-BS
Onur Mutlu and Thomas Moscibroda,
"Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems"
Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008.
[Summary] [Slides (ppt)]