Seminar in Computer Architecture Meeting 2 Logistics and
- Slides: 129
Seminar in Computer Architecture Meeting 2: Logistics and Examples Prof. Onur Mutlu ETH Zürich Fall 2019 28 February 2019
Recap: Key Goal (Learn how to) rigorously analyze, present, discuss papers and ideas in computer architecture 2
Recap: Some Goals of This Course n Teach/enable/empower you to: q Think critically q Think broadly q Learn how to understand, analyze and present papers and ideas q Get familiar with key first steps in research q Get familiar with key research directions 3
Recap: Steps to Achieve the Key Goal n Steps for the Presenter q q q q n Read Absorb, read more (other related works) Critically analyze; think; synthesize Prepare a clear and rigorous talk Present Answer questions Analyze and synthesize (in meeting, after, and at course end) Steps for the Participants q q q Discuss Ask questions Analyze and synthesize (in meeting, after, and at course end) 4
Course Logistics n Requirements: q q q 1: Submit HW 0 2: Explicitly provide paper preferences https: //safari. ethz. ch/architecture_seminar/spring 2019 n 2 presentations each week we meet n Each presentation q q n One student presents one paper and leads discussion Max 30 -minute summary+analysis Max 20 -minute discussion+brainstorming+feedback Should follow the suggested guidelines No class next week 5
Course Requirements and n Attendance required for all meetings Expectations Sign in sheet q n Each student presents one paper q q n Non-presenters participate during the meeting q q n Ask questions, contribute thoughts/ideas Better if you read/skim the paper beforehand Everyone comments on papers in the online review system q n Prepare for presentation with engagement from the mentor(s) Full presentation + questions + discussion After presentation Write synthesis report at the end of semester q Sample synthesis report online 6
Grading Rubric n Quality of your presentation (60%) q q q n Quality of the final synthesis paper (30%) q n n How well did you understand the material? How well did you present it? How well did you answer the questions? Be prepared to explain technical terms We will take into account the difficulty of the paper and the time you had to prepare. How well did you understand some of the papers presented during the seminar? Attendance (10%) Participation (during class and online) (BONUS 10%) q q Did you ask good questions? Did you attend all sessions? 7
Algorithm for Presentation Preparation n Study Lecture 1 again for presentation guidelines n Read analyze your paper thoroughly q Discuss with anyone you wish + use any resources n Prepare a draft presentation based on guidelines n Meet mentor(s) and get feedback q n Meet mentor(s) again and get further feedback q n n Revise the presentation and delivery Meetings are mandatory – you have to schedule them with your assigned mentor(s). We may suggest meeting times. Practice, practice 8
Example Paper Presentations 9
Learning by Example n A great way of learning n We already did one example last time q n Memory Channel Partitioning We will do at least one more today 10
Structure of the Presentation n n Background, Problem & Goal Novelty Key Approach and Ideas Mechanisms (in some detail) Key Results: Methodology and Evaluation Summary Strengths Weaknesses Thoughts and Ideas Takeaways Open Discussion 11
Background, Problem & Goal 12
Novelty 13
Key Approach and Ideas 14
Mechanisms (in some detail) 15
Key Results: Methodology and Evaluation 16
Summary 17
Strengths 18
Weaknesses 19
Thoughts and Ideas 20
Takeaways 21
Open Discussion 22
Example Paper Presentation 23
Let’s Review This Paper n Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry, "Row. Clone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization" Proceedings of the 46 th International Symposium on Microarchitecture (MICRO), Davis, CA, December 2013. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] 24
Row. Clone Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization Vivek Seshadri Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry
Background, Problem & Goal 26
Memory Channel – Bottleneck Core MC Channel High Energy Memory Core Cache Limited Bandwidth
Core MC Channel Memory Core Cache Goal: Reduce Memory Bandwidth Demand Reduce unnecessary data movement
Bulk Data Copy and Initialization Bulk Data Copy src dst Bulk Data Initialization val dst
Bulk Data Copy and Initialization Bulk Data Copy src dst Bulk Data Initialization val dst
Bulk Data Copy and Initialization memmove & memcpy: 5% cycles in Google’s datacenter [Kan ISCA’ 15] 00000 Zero initialization Forking (e. g. , security) Checkpointing Many more VM Cloning Deduplication Page Migration 31
Shortcomings of Today’s Systems 1) High latency 3) Cache pollution CPU L 1 Memory L 2 L 3 MC 2) High bandwidth utilization 4) Unwanted data movement 1046 ns, 3. 6 u. J (for 4 KB page copy via DMA) 32
Novelty, Key Approach, and Ideas 33
Row. Clone: In-Memory Copy 3) No cache pollution 1) Low latency Memory CPU L 1 L 2 L 3 MC 2) Low bandwidth utilization 4) No unwanted data movement 1046 ns, 3. 6 u. J 90 ns, 0. 04 u. J 34
Row. Clone: In-DRAM Row Copy 4 Kbytes Idea: Two consecutive ACTivates Negligible HW cost Step 1: Activate row A Step 2: Activate row B Transfer row DRAM subarray Transfer row Row Buffer (4 Kbytes) 8 bits Data Bus
Mechanisms (in some detail) 36
Subarray Chip I/O Memory Channel DRAM Chip Organization Bank I/O Row of DRAM Cells Row Buffer 37
Row. Clone Types n Intra-subarray Row. Clone (row granularity) q n Inter-bank Row. Clone (byte granularity) q n Fast Parallel Mode (FPM) Pipelined Serial Mode (PSM) Inter-subarray Row. Clone 38
Row. Clone: Fast Parallel Mode (FPM) s r c r o w ds sr ct r o w s r c r o w Row Buffer � 1. Source row to row buffer ? 2. Row buffer to destination row
Row. Clone: Intra-Subarray (I) VDD/2 V +DD δ src 0 dst 0 V VDD DD/2 + δ Amplify the difference Data gets copied Sense Amplifier (Row Buffer) VDD/2 0
Row. Clone: Intra-Subarray (II) s r c r o w ds sr ct r o w s r c r o w Row Buffer 1. Activate src row (copy data from src to row buffer) 2. Activate dst row (disconnect src from row buffer, connect dst – copy data from row buffer to dst)
Fast Parallel Mode: Benefits Bulk Data Copy Latency 11 x 1046 ns to 90 ns Energy 74 x 3600 n. J to 40 n. J No bandwidth consumption Very little changes to the DRAM chip 42
Fast Parallel Mode: Constraints n Location of source/destination q n Both should be in the same subarray Size of the copy q Copies all the data from source row to destination
Bank Chip I/O Memory Channel Row. Clone: Inter-Bank Shared internal bus Overlap the latency of the read and the write 1. 9 X latency reduction, 3. 2 X energy reduction
Generalized Row. Clone Chip I/O Memory Channel Inter Subarray Copy (Use Inter-Bank Copy Twice) Inter Bank Copy (Pipelined Internal RD/WR) Bank Intra Subarray Copy (2 ACTs) 0. 01% area cost Subarray Bank I/O
Row. Clone: Fast Row Initialization 0 0 0 Fix a row at Zero (0. 5% loss in capacity) 46
Row. Clone: Bulk Initialization n Initialization with arbitrary data q q n Initialize one row Copy the data to other rows Zero initialization (most common) q q Reserve a row in each subarray (always zero) Copy data from reserved row (FPM mode) 6. 0 X lower latency, 41. 5 X lower DRAM energy 0. 2% loss in capacity 47
Row. Clone: Latency & Energy Benefits Energy Reduction 74. 4 x Latency Reduction 40 20 3. 2 x 1. 5 x y ar ra ub an -B In te r ub tra -S ar ra y y ar ra an ar ra y k 0 y 1. 9 x 1. 0 x 41. 5 x 60 tra -S 6. 0 x Copy In In te r tra -S In ub ub -S te r In In te r -B Very low cost: 0. 01% increase in die area ub tra -S In 80 ub ar ra 11. 6 x -S 14 12 10 8 6 4 2 0 Zero 48
System Design to Enable Row. Clone 49
End-to-End System Design Application Operating System ISA How to communicate occurrences of bulk copy/initialization across layers? How to ensure cache coherence? Microarchitecture How to maximize latency and energy savings? DRAM (Row. Clone) How to handle data reuse? 50
1. Hardware/Software Interface Two new instructions • memcopy and meminit • Similar instructions present in existing ISAs Microarchitecture Implementation • Checks if instructions can be sped up by Row. Clone • Export instructions to the memory controller 51
2. Managing Cache Coherence Row. Clone modifies data in memory • Need to maintain coherence of cached data Similar to DMA • Source and destination in memory • Can leverage hardware support for DMA Additional optimizations 52
3. Maximizing Use of the Fast Parallel Mode Make operating system subarray-aware Primitives amenable to use of FPM • Copy-on-Write Allocate destination in same subarray as source Use FPM to copy • Bulk Zeroing Use FPM to copy data from reserved zero row 53
4. Handling Data Reuse After Zeroing Data reuse after zero initialization • Phase 1: OS zeroes out the page • Phase 2: Application uses cachelines of the page Row. Clone • Avoids misses in phase 1 • But incurs misses in phase 2 Row. Clone-Zero-Insert (Row. Clone-ZI) • Insert clean zero cachelines 54
Key Results: Methodology and Evaluation 55
Methodology n Out-of-order multi-core simulator 1 MB/core last-level cache n Cycle-accurate DDR 3 DRAM simulator n 6 Copy/Initialization intensive applications n +SPEC CPU 2006 for multi-core n Performance q q Instruction throughput for single-core Weighted Speedup for multi-core
Copy/Initialization Intensive Applications n System bootup (Booting the Debian OS) n Compile (GNU C compiler – executing cc 1) n Forkbench (A fork microbenchmark) n Memcached (Inserting a large number of objects) n My. Sql (Loading a database) n Shell script (find with ls on each subdirectory)
Copy and Initialization in Workloads Fraction of Memory Traffic Zero Copy Write Read 1 0. 8 0. 6 0. 4 0. 2 0 bootup compile forkbench mcached mysql shell 58
Single-Core – Performance and Energy Compared to Baseline IPC Improvement Memory Energy Reduction 70% 60% 50% 40% 30% Improvements correlate with fraction of 20% memory traffic due to copy/initialization 10% 0% bootup compile forkbench mcached mysql shell 59
Multi-Core Systems Reduced bandwidth consumption benefits all applications. Run copy/initialization intensive applications with memory intensive SPEC applications. Half the cores run copy/initialization intensive applications. Remaining half run SPEC applications. 60
Multi-Core Results: Summary Improvement over Baseline System Performance Memory Energy Efficiency 30% 25% 20% 15% 10% 5% Performance Consistent improvement increases in 0% with energy/instruction increasing 4 -Core count 8 -Core 2 -Core 61
Summary 62
Executive Summary Bulk data copy and initialization • Unnecessarily move data on the memory channel • Degrade system performance and energy efficiency Row. Clone – perform copy in DRAM with low cost • Uses row buffer to copy large quantity of data • Source row → row buffer → destination row • 11 X lower latency and 74 X lower energy for a bulk copy Accelerate Copy-on-Write and Bulk Zeroing • Forking, checkpointing, zeroing (security), VM cloning Improves performance and energy efficiency at low cost • 27% and 17% for 8 -core systems (0. 01% DRAM chip area) 63
Strengths 64
Strengths of the Paper n n n Simple, novel mechanism to solve an important problem Effective and low hardware overhead Intuitive idea! Greatly improves performance and efficiency (assuming data is mapped nicely) Seems like a clear win for data initialization (without mapping requirements) Makes software designer’s life easier q n n If copies are 10 x-100 x cheaper, how to design software? Paper tackles many low-level and system-level issues Well-written, insightful paper 65
Weaknesses 66
Weaknesses n Requires data to be mapped in the same subarray to deliver the largest benefits q q n n Inter-subarray copy is very inefficient Causes many changes in the system stack q q n n Helps less if data movement is not within a subarray Does not help if data movement is across DRAM channels End-to-end design spans applications to circuits Software-hardware cooperative solution might not always be easy to adopt Cache coherence and data reuse cause real overheads Evaluation is done solely in simulation Evaluation does not consider multi-chip systems Are these the best workloads to evaluate? 67
Recall: Avoid Rat Holes Source: https: //www. cse. wustl. edu/~jain/iucee/ftp/k_10 adp. pdf 68
Raj Jain, “The Art of Computer Systems Performance Analysis, ” Wiley, 1991. 69
Raj Jain, “The Art of Computer Systems Performance Analysis, ” Wiley, 1991. 70
Aside: A Recommended Book Raj Jain, “The Art of Computer Systems Performance Analysis, ” Wiley, 1991. 71
Thoughts and Ideas 72
Extensions n Can this be improved to do faster inter-subarray copy? q Yes, see the LISA paper [Chang et al. , HPCA 2016] n Can this be extended to move data at smaller granularities? n Can we have more efficient solutions to q q n n Cache coherence (minimize overhead) Data reuse after copy and initialization Can this idea be evaluated on a real system? How? Can similar ideas and DRAM properties be used to perform computation on data? q Yes, see the Ambit paper [Seshadri et al. , MICRO 2017] 73
LISA: Fast Inter-Subarray Data Movement Kevin K. Chang, Prashant J. Nair, Saugata Ghose, Donghyuk Lee, n Moinuddin K. Qureshi, and Onur Mutlu, "Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM" Proceedings of the 22 nd International Symposium on High. Performance Computer Architecture (HPCA), Barcelona, Spain, March 2016. [Slides (pptx) (pdf)] [Source Code] 74
In-DRAM Bulk AND/OR n Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry, "Fast Bulk Bitwise AND and OR in DRAM" IEEE Computer Architecture Letters (CAL), April 2015. 75
Ambit: Bulk-Bitwise in-DRAM Computation n Vivek Seshadri et al. , “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology, ” MICRO 2017. 76
Efficient Data Coherence Support n Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu, "Lazy. PIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory" IEEE Computer Architecture Letters (CAL), June 2016. 77
Takeaways 78
Key Takeaways n A novel method to accelerate data copy and initialization n Simple and effective n Hardware/software cooperative n Good potential for work building on it to extend it q q q n To different granularities To make things more efficient and effective Multiple works have already built on the paper (see LISA, Ambit, and many other works in Google Scholar) Easy to read and understand paper 79
Open Discussion 80
Discussion Starters n Thoughts on the previous ideas? n How practical is this? n n Will the problem become bigger and more important over time? Will the solution become more important over time? Are other solutions better? Is this solution clearly advantageous in some cases? 81
More on Row. Clone n Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry, "Row. Clone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization" Proceedings of the 46 th International Symposium on Microarchitecture (MICRO), Davis, CA, December 2013. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] 82
Row. Clone Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization Vivek Seshadri Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry
Seminar in Computer Architecture Meeting 2: Logistics and Examples Prof. Onur Mutlu ETH Zürich Fall 2019 28 February 2019
We Did Not Cover the Following Slides. They Are For Your Benefit.
Example Paper Presentation II 86
PAR-BS n Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems" Proceedings of the 35 th International Symposium on Computer Architecture (ISCA), pages 63 -74, Beijing, China, June 2008. [Summary] [Slides (ppt)] 87
We Will Do This Differently n I will give a “conference talk” n You can ask questions and analyze what I described 88
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture Group Microsoft Research
Outline n n Background and Goal Motivation q n Parallelism-Aware Batch Scheduling q q n n n Destruction of Intra-thread DRAM Bank Parallelism Batching Within-batch Scheduling System Software Support Evaluation Summary 90
The DRAM System Columns Rows BANK 0 BANK 1 BANK 2 BANK 3 Row Buffer DRAM Bus FR-FCFS policy 1) Row-hit first 2) Oldest first DRAM CONTROLLER 91
Multi-Core Systems threads’ requests interfere CORE 0 CORE 1 CORE 2 CORE 3 L 2 CACHE DRAM MEMORY CONTROLLER DRAM Bank 0 Bank 1 Bank 2 Multi-Core Chip Shared DRAM Memory System . . . DRAM Bank 7 92
Inter-thread Interference in the DRAM System n Threads delay each other by causing resource contention: q n Threads can also destroy each other’s DRAM bank parallelism q n n Bank, bus, row-buffer conflicts [MICRO 2007] Otherwise parallel requests can become serialized Existing DRAM schedulers are unaware of this interference They simply aim to maximize DRAM throughput q q q Thread-unaware and thread-unfair No intent to service each thread’s requests in parallel FR-FCFS policy: 1) row-hit first, 2) oldest first n Unfairly prioritizes threads with high row-buffer locality 93
Normalized Memory Stall-Time Consequences of Inter-Thread Interference in DRAM is the only shared resource n n High priority Memory Low performance priority hog Cores make very slow progress Unfair slowdown of different threads [MICRO 2007] System performance loss [MICRO 2007] Vulnerability to denial of service [USENIX Security 2007] Inability to enforce system-level thread priorities [MICRO 2007] 94
Our Goal n Control inter-thread interference in DRAM n Design a shared DRAM scheduler that q provides high system performance n q provides fairness to threads sharing the DRAM system n q preserves each thread’s DRAM bank parallelism equalizes memory-slowdowns of equal-priority threads is controllable and configurable n enables different service levels for threads with different priorities 95
Outline n n Background and Goal Motivation q n Parallelism-Aware Batch Scheduling q q n n n Destruction of Intra-thread DRAM Bank Parallelism Batching Within-batch Scheduling System Software Support Evaluation Summary 96
The Problem n Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests q q n n n Memory-Level Parallelism (MLP) Out-of-order execution, non-blocking caches, runahead execution Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks Multiple threads share the DRAM controllers are not aware of a thread’s MLP q Can service each thread’s outstanding requests serially, not in parallel 97
Bank Parallelism of a Thread Bank 0 2 DRAM Requests Bank 1 Single Thread: Thread A : Compute Stall Compute Bank 0 Bank 1 Thread A: Bank 0, Row 1 Thread A: Bank 1, Row 1 Bank access latencies of the two requests overlapped Thread stalls for ~ONE bank access latency 98
Bank Parallelism Interference in Baseline Scheduler: DRAM Bank 0 Bank 1 2 DRAM Requests A : Compute Stall Compute Bank 0 Bank 1 2 DRAM Requests B: Compute Stall Bank 1 Bank 0 Thread A: Bank 0, Row 1 Stall Compute Thread B: Bank 1, Row 99 Thread B: Bank 0, Row 99 Thread A: Bank 1, Row 1 Bank access latencies of each thread serialized Each thread stalls for ~TWO bank access latencies 99
Parallelism-Aware Scheduler Baseline Scheduler: Bank 0 2 DRAM Requests A : Compute Bank 0 Compute Bank 1 2 DRAM Requests B: Compute Stall Bank 1 Thread A: Bank 0, Row 1 Stall Compute Bank 1 Thread B: Bank 1, Row 99 Thread B: Bank 0, Row 99 Bank 0 Thread A: Bank 1, Row 1 Parallelism-aware Scheduler: 2 DRAM Requests A : Compute Stall Compute Bank 0 Bank 1 Saved Cycles 2 DRAM Requests B: Compute Stall Compute Average stall-time: ~1. 5 bank access latencies Bank 0 Bank 1 100
Outline n n Background and Goal Motivation q n Parallelism-Aware Batch Scheduling (PAR-BS) q q n n n Destruction of Intra-thread DRAM Bank Parallelism Request Batching Within-batch Scheduling System Software Support Evaluation Summary 101
Parallelism-Aware Batch Scheduling (PAR-BS) n Principle 1: Parallelism-awareness q Schedule requests from a thread (to different banks) back to back q q n Preserves each thread’s bank parallelism But, this can cause starvation… Principle 2: Request Batching q Group a fixed number of oldest requests from each thread into a “batch” q Service the batch before all other requests q q q Form a new batch when the current one is done Eliminates starvation, provides fairness Allows parallelism-awareness within a batch T 1 T 2 T 0 T 2 T 3 T 2 T 0 T 3 T 2 T 1 T 0 Bank 1 Batch 102
PAR-BS Components n Request batching n Within-batch scheduling q Parallelism aware 103
Request Batching n Each memory request has a bit (marked) associated with it n Batch formation: q q q n Marked requests are prioritized over unmarked ones q n Mark up to Marking-Cap oldest requests per bank for each thread Marked requests constitute the batch Form a new batch when no marked requests are left No reordering of requests across batches: no starvation, high fairness How to prioritize requests within a batch? 104
Within-Batch Scheduling n Can use any existing DRAM scheduling policy q n FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality But, we also want to preserve intra-thread bank parallelism q Service each thread’s requests back to back HOW? n Scheduler computes a ranking of threads when the batch is formed q q Higher-ranked threads are prioritized over lower-ranked ones Improves the likelihood that requests from a thread are serviced in parallel by different banks n Different threads prioritized in the same order across ALL banks 105
How to Rank Threads within a Batch n Ranking scheme affects system throughput and fairness n Maximize system throughput q n Minimize unfairness (Equalize the slowdown of threads) q q n Minimize average stall-time of threads within the batch Service threads with inherently low stall-time early in the batch Insight: delaying memory non-intensive threads results in high slowdown Shortest stall-time first (shortest job first) ranking q q q Provides optimal system throughput [Smith, 1956]* Controller estimates each thread’s stall-time within the batch Ranks threads with shorter stall-time higher * W. E. Smith, “Various optimizers for single stage production, ” Naval Research Logistics Quarterly, 1956. 106
Shortest Stall-Time First Ranking n Maximum number of marked requests to any bank (max-bank-load) q n Rank thread with lower max-bank-load higher (~ low stall-time) Total number of marked requests (total-load) q Breaks ties: rank thread with lower total-load higher T 3 max-bank-load total-load T 3 T 2 T 3 T 0 1 3 T 1 T 0 T 2 T 0 T 1 2 4 T 2 T 1 T 2 2 6 T 3 T 1 T 0 T 3 T 1 T 3 T 2 T 3 5 9 Bank 0 Bank 1 Bank 2 Bank 3 Ranking: T 0 > T 1 > T 2 > T 3 107
T 3 6 7 6 T 3 T 2 T 3 T 1 T 0 T 2 T 2 T 1 T 2 T 3 T 1 T 0 T 3 5 4 3 2 T 1 T 3 T 2 T 3 1 Bank 0 Bank 1 Bank 2 Bank 3 Time T 3 T 3 T 2 T 2 T 2 T 3 T 1 T 1 T 2 5 4 3 2 T 1 T 0 T 0 1 Bank 0 Bank 1 Bank 2 Bank 3 Ranking: T 0 > T 1 > T 2 > T 3 Stall times T 0 T 1 T 2 T 3 4 4 5 7 AVG: 5 bank access latencies Stall times T 0 T 1 T 2 T 3 1 2 4 7 AVG: 3. 5 bank access latencies 108 Time Example Within-Batch Scheduling Baseline Scheduling PAR-BS Scheduling 7 Order (Arrival order) Order
Putting It Together: PAR-BS Scheduling Policy n PAR-BS Scheduling Policy Batching (1) Marked requests first (2) Row-hit requests first Parallelism-aware (3) Higher-rank thread first (shortest stall-time first) within-batch scheduling (4) Oldest first n Three properties: q q Exploits row-buffer locality and intra-thread bank parallelism Work-conserving n q Marking-Cap is important n n n Services unmarked requests to banks without marked requests Too small cap: destroys row-buffer locality Too large cap: penalizes memory non-intensive threads Many more trade-offs analyzed in the paper 109
Hardware Cost n <1. 5 KB storage cost for q 8 -core system with 128 -entry memory request buffer n No complex operations (e. g. , divisions) n Not on the critical path q Scheduler makes a decision only every DRAM cycle 110
Outline n n Background and Goal Motivation q n Parallelism-Aware Batch Scheduling q q n n n Destruction of Intra-thread DRAM Bank Parallelism Batching Within-batch Scheduling System Software Support Evaluation Summary 111
System Software Support n OS conveys each thread’s priority level to the controller q n Controller enforces priorities in two ways q q n Mark requests from a thread with priority X only every Xth batch Within a batch, higher-priority threads’ requests are scheduled first Purely opportunistic service q q n Levels 1, 2, 3, … (highest to lowest priority) Special very low priority level L Requests from such threads never marked Quantitative analysis in paper 112
Outline n n Background and Goal Motivation q n Parallelism-Aware Batch Scheduling q q n n n Destruction of Intra-thread DRAM Bank Parallelism Batching Within-batch Scheduling System Software Support Evaluation Summary 113
Evaluation Methodology n 4 -, 8 -, 16 -core systems q q q n Detailed DRAM model based on Micron DDR 2 -800 q q n x 86 processor model based on Intel Pentium M 4 GHz processor, 128 -entry instruction window 512 Kbyte per core private L 2 caches, 32 L 2 miss buffers 128 -entry memory request buffer 8 banks, 2 Kbyte row buffer 40 ns (160 cycles) row-hit round-trip latency 80 ns (320 cycles) row-conflict round-trip latency Benchmarks q q Multiprogrammed SPEC CPU 2006 and Windows Desktop applications 100, 16, 12 program combinations for 4 -, 8 -, 16 -core experiments 114
Comparison with Other DRAM Controllers n Baseline FR-FCFS [Zuravleff and Robinson, US Patent 1997; Rixner et al. , ISCA 2000] q q n FCFS q q n [Intel Pentium 4 chipsets] Oldest-first; low DRAM throughput Unfairly penalizes memory non-intensive threads Network Fair Queueing (NFQ) q q n Prioritizes row-hit requests, older requests Unfairly penalizes threads with low row-buffer locality, memory non-intensive threads [Nesbit et al. , MICRO 2006] Equally partitions DRAM bandwidth among threads Does not consider inherent (baseline) DRAM performance of each thread Unfairly penalizes threads with high bandwidth utilization [MICRO 2007] Unfairly prioritizes threads with bursty access patterns [MICRO 2007] Stall-Time Fair Memory Scheduler (STFM) q q q [Mutlu & Moscibroda, MICRO 2007] Estimates and balances thread slowdowns relative to when run alone Unfairly treats threads with inaccurate slowdown estimates Requires multiple (approximate) arithmetic operations 115
Unfairness on 4 -, 8 -, 16 -core Systems Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007] 1. 11 X 1. 08 X 116
System Performance (Hmean-speedup) 8. 3% 6. 1% 5. 1% 117
Outline n n Background and Goal Motivation q n Parallelism-Aware Batch Scheduling q q n n n Destruction of Intra-thread DRAM Bank Parallelism Batching Within-batch Scheduling System Software Support Evaluation Summary 118
Summary n Inter-thread interference can destroy each thread’s DRAM bank parallelism q q q n A new approach to fair and high-performance DRAM scheduling q q q n Serializes a thread’s requests reduces system throughput Makes techniques that exploit memory-level parallelism less effective Existing DRAM controllers unaware of intra-thread bank parallelism Batching: Eliminates starvation, allows fair sharing of the DRAM system Parallelism-aware thread ranking: Preserves each thread’s bank parallelism Flexible and configurable: Supports system-level thread priorities Qo. S policies PAR-BS provides better fairness and system performance than previous DRAM schedulers 119
Thank you. Questions?
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture Group Microsoft Research
Backup Slides
Multiple Memory Controllers (I) n Local ranking: Each controller uses PAR-BS independently q n Global ranking: Meta controller that computes a global ranking across all controllers based on global information q n Computes its own ranking based on its local requests Only needs to track bookkeeping info about each thread’s requests to the banks in each controller The difference between the ranking computed by each scheme depends on the balance of the distribution of requests to each controller q Balanced Local and global rankings are similar 123
1. 18 X 1. 33 X Nomalized Hmean-Speedup Unfairness Multiple Memory Controllers (II) 7. 4% 11. 5% 16 -core system, 4 memory controllers 124
Example with Row Hits Stall time Thread 1 4 Thread 1 5. 5 Thread 1 1 Thread 2 4 Thread 2 3 Thread 2 2 Thread 3 5 Thread 3 4 Thread 4 7 Thread 4 4. 5 Thread 4 5. 5 AVG 4. 375 AVG 3. 125
End of Backup Slides
Now Your Turn to Analyze… n n n Background, Problem & Goal Novelty Key Approach and Ideas Mechanisms (in some detail) Key Results: Methodology and Evaluation Summary Strengths Weaknesses Thoughts and Ideas Takeaways Open Discussion 127
PAR-BS Pros and Cons n Upsides: q q n Downsides: q q q n First scheduler to address bank parallelism destruction across multiple threads Simple mechanism (vs. STFM) Batching provides fairness Ranking enables parallelism awareness Does not always prioritize the latency-sensitive applications Deadline guarantees? Complexity? Some ideas implemented in real So. C memory controllers 128
More on PAR-BS n Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems" Proceedings of the 35 th International Symposium on Computer Architecture (ISCA), pages 63 -74, Beijing, China, June 2008. [Summary] [Slides (ppt)] 129
- Difference between computer architecture and organization
- What is meeting and types of meeting
- What is meeting and types of meeting
- Bus architecture in computer architecture
- Simple computer design
- For todays meeting
- Meeting objective
- Seminar about computer software
- Computer training seminar memo
- Architecture meeting agenda
- Architecture meeting agenda
- Komponen togaf
- Call and return architecture in software engineering
- Computer organization and architecture 10th solution
- Iit kharagpur virtual lab
- Introduction to computer organization and architecture
- Timing and control in computer architecture
- Evolution of computer architecture
- Digital design and computer architecture: arm edition
- Spec rating formula in computer organization
- Computer organization and architecture 10th edition
- Linear pipeline processors in computer architecture
- Digital design and computer architecture
- What is decoder expansion in computer architecture
- Digital design and computer architecture
- Digital design and computer architecture
- Assembly language and computer architecture
- Computer arithmetic
- Computer organisation and architecture
- Hazard detection and resolution
- Datapath pipeline
- What is 1s complement
- Cs341 umb
- Bubble pushing example
- Digital logic and computer architecture
- Apollo guidance computer manual
- Digital design and computer architecture arm edition
- Difference between a computer and computer system
- What is computer organization
- The large program that controls how the cpu communicates
- Architecture business cycle
- Modular vs integral product architecture example
- Product architecture example
- What is logistics management
- Global production outsourcing and logistics
- Fashion and lifestyle logistics
- E-crm and scm
- Vha procurement and logistics office
- Homeward bound moving solutions and logistics
- International logistics and supply chain outsourcing
- Global production outsourcing and logistics
- Distribution channel and logistics
- Global logistics and distribution
- Elearning adveti
- Third party vs fourth party
- Adaptive logistics network
- Logistics and channel management
- Ford logistics and supply chain management
- Global distribution and logistics
- Distribution customer service and logistics
- Distribution customer service and logistics
- Tvsilp
- Global production, outsourcing and logistics
- Global logistics and supply chain management
- Strategic and financial logistics
- Vision statement for supply chain management
- Logistics and information technology
- Principles of marketing distribution channels
- Defense and logistics leaders
- 1pl 2pl 3pl and 4pl logistics
- Fourth party payment
- Components of physical distribution
- Va office of acquisition and logistics
- Sense and response logistics
- Intel pentium
- I/o modules in computer architecture
- Fp adder
- Addressing mode in computer architecture
- Chordal ring
- Smt computer architecture
- Pseudoinstructions
- 111011-100100
- Instruction format in computer architecture
- Nano control memory
- A microprogram sequencer
- Memory system design
- Virtual memory tlb
- Computer architecture definition
- Define parallelism in computer architecture
- Architect of the concept of number system
- What is computer architecture
- Isa in computer
- Write any three input devices
- Branch prediction in computer architecture
- Computer architecture david patterson
- Example of sisd
- What is guard bit in computer architecture
- The computer system and its interconnection structures
- Basic mips implementation in computer architecture
- Explain virtual memory in computer architecture
- Computer architecture definition
- Dynamic interconnection network in computer architecture
- Computer buses
- Computer architecture
- Gustafson's law in computer architecture
- Instruction cycle in computer architecture
- Advanced dram organization
- Memory hierarchy in computer architecture
- 430830
- Mips instruction format
- Eight ideas in computer architecture
- Computer architecture performance evaluation methods
- Scalar pipeline in computer architecture
- Cmp in computer architecture
- Dependability in computer architecture
- Computer architecture crash course
- Thread-level parallelism
- Tlb computer architecture
- Computer architecture tutorial
- Risc v sw instruction
- Computer system architecture m. morris mano
- Static instruction scheduling
- Simd in computer architecture
- Flip flops in computer organization
- Memory organization in computer architecture
- Branch prediction in computer architecture
- Latency in computer architecture
- Computer architecture
- Performance equation in computer architecture
- Branch prediction in computer architecture