Remember Memory?
Mark D. Hill, Univ. of Wisconsin-Madison
5/2016 @ David A. Patterson Celebration

I. Million-fold Memory Growth & Virtual Memory
II. General-Purpose GPUs & Memory Consistency
III. Non-Volatile Memory’s Fusing Memory & Storage

Mark D. Hill, No Change 30 Years? 10/7/2020

I. Million-fold Memory Growth
[Figure: Memory capacity for $10,000*, memory size (MB, GB, TB) by year, 1980 to 2010]
Commercial servers with 16 TB memory
Interactive services need to access TB of data at low latency
*Inflation-adjusted 2011 USD, from: jcmit.com

How is Paged Virtual Memory used?
E.g.: memcached servers
[Figure: Client sends Key X to memcached server #n (in-memory hash table, network state) and receives Value Y]
• But TLB sizes hardly scaled

Year   L1-DTLB entries
1999   72 (Pent. III)
2008   96 (Nehalem)
2012   100 (Ivy Bridge)
2015   (Broadwell)
[ISCA 2013]
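To make "TLB sizes hardly scaled" concrete, here is a back-of-the-envelope sketch of TLB reach (entries times page size). The 4 KiB page size is an assumption (the usual x86-64 base page), not something stated on the slide; the entry counts are the ones listed above, and the 16 TB figure comes from the memory-growth slide.

```python
# Assumption: 4 KiB base pages; the slide does not state a page size.
PAGE_BYTES = 4 * 1024

# L1-DTLB entry counts as listed on the slide.
l1_dtlb_entries = {
    "Pent. III (1999)": 72,
    "Nehalem (2008)": 96,
    "Ivy Bridge (2012)": 100,
}

def tlb_reach_bytes(entries, page_bytes=PAGE_BYTES):
    """Memory covered by a fully populated TLB: entries x page size."""
    return entries * page_bytes

for cpu, entries in l1_dtlb_entries.items():
    print(f"{cpu}: {tlb_reach_bytes(entries) // 1024} KiB of reach")

# Fraction of a 16 TB server memory that 100 entries can map at once.
server_bytes = 16 * 2**40
fraction = tlb_reach_bytes(100) / server_bytes
print(f"100 entries cover {fraction:.2e} of 16 TB")
```

Under these assumptions the L1-DTLB covers well under a megabyte at a time while memory is measured in terabytes, which is the mismatch the slide points at. Larger pages change the numbers but were not part of this slide.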

Execution Time Overhead: TLB Misses
Q: “Virtual Memory was invented in a time of scarcity. Is it still a good idea?” -- Charles Thacker, 2010 Turing Award Lecture
A: As we see it, OFTEN but not ALWAYS.

A View of Computer Layers
Problem
Algorithm
Application
Middleware / Compiler
Operating System
(small) Instrn Set Architecture
Microarchitecture
Logic Design
Transistors, etc.
Punch Thru (annotation on the layer stack)
See 21st Century Computer Architecture [CCC 2012]

Bypass Paging (Often)
1. Conventional Paging: guard page, COW, mapped files
2. Direct Segment: Heap w/o swapping
[Figure: a VA between BASE and LIMIT maps to a PA by adding OFFSET]
Direct Segment [ISCA 2013] but more-general ideas now
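A minimal sketch of the BASE / LIMIT / OFFSET idea, assuming addresses in [BASE, LIMIT) are translated by adding OFFSET and everything else falls back to conventional paging. The register values, the toy single-level page table, and the function name `translate` are hypothetical and only illustrate the split; real hardware would do the range check alongside the TLB lookup rather than in sequence.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def translate(va, base, limit, offset, page_table):
    """Return the physical address for va.

    Addresses in [base, limit) use the direct segment: PA = VA + OFFSET.
    Everything else goes through a toy page table (vpn -> frame number).
    """
    if base <= va < limit:
        return va + offset
    vpn, page_off = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault at {va:#x}")
    return page_table[vpn] * PAGE_SIZE + page_off

# Hypothetical values, for illustration only.
BASE, LIMIT, OFFSET = 0x1000_0000, 0x5000_0000, 0x2000_0000
page_table = {0x2: 0x7}  # virtual page 2 -> physical frame 7

heap_pa = translate(0x1234_5678, BASE, LIMIT, OFFSET, page_table)   # direct segment
paged_pa = translate(0x2010, BASE, LIMIT, OFFSET, page_table)       # conventional paging
print(hex(heap_pa), hex(paged_pa))
```

The point of the split is that the heap path needs no per-page state, so it cannot miss in the TLB, while guard pages, COW, and mapped files keep the flexibility of paging.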

Execution Time Overhead: TLB Misses
Non-Volatile Memory to explode address space & sharing?
[ISCA 2013]

II. Graphics Processing Units (GPUs)
• GPUs = Throughput
• Hierarchical “scoped” programming model
• Share memory to expand viable programs
  – Rich data structures (w/o copying)
  – “Pointer is a pointer”
  – Coherence? Scopes?
[Figure: OpenCL Execution Hierarchy]

GPU Memory Hierarchy = Throughput
[Figure: CPU side (CPU0, CPU1, L1, L2) and GPU side (CU0 ... CU15 with L1s, L2) sharing an LLC Directory / Memory]
• Poor match: CPU coherence w/ writeback caches
• Coherence is means; memory consistency the end

Sequential Consistency (SC)
Thread 1            Thread 2
R1 = TOS            R1 = TOS
R2 = R1 - 1         R2 = R1 - 1
TOS = R2            TOS = R2
Data1 = *R1         Data2 = *R1
[Figure: operations of both threads merged into a Total Memory Order]
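The two-thread pop above can be explored by brute force. The sketch below enumerates every total order that respects each thread's program order (which is what SC allows) and collects the outcomes. The stack contents, the initial TOS, and the helper names are made up for illustration; the register-only step R2 = R1 - 1 is folded into the store because it does not touch memory.

```python
from itertools import combinations

STACK = {1: "a", 2: "b"}   # hypothetical stack contents: address -> element
INITIAL_TOS = 2            # TOS holds the address of the top element

def run(schedule):
    """Execute one interleaving; schedule lists which thread (0 or 1) steps next."""
    tos = INITIAL_TOS
    r1 = [None, None]      # per-thread register R1
    data = [None, None]    # Data1, Data2
    pc = [0, 0]            # next memory operation of each thread
    for t in schedule:
        if pc[t] == 0:
            r1[t] = tos                # R1 = TOS
        elif pc[t] == 1:
            tos = r1[t] - 1            # R2 = R1 - 1; TOS = R2
        else:
            data[t] = STACK[r1[t]]     # Data = *R1
        pc[t] += 1
    return data[0], data[1], tos

def all_schedules(steps=3):
    """Every merge of two 3-step threads that keeps each thread's program order."""
    slots = range(2 * steps)
    for chosen in combinations(slots, steps):
        yield tuple(0 if i in chosen else 1 for i in slots)

outcomes = {run(s) for s in all_schedules()}
lost_pop = [s for s in all_schedules() if run(s)[0] == run(s)[1]]
print(outcomes)
print(len(lost_pop), "interleavings hand the same element to both threads")
```

Even under SC, any interleaving in which both threads read TOS before either writes it pops the same element twice. SC fixes the order of memory operations but does not make the pop atomic.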

Sequential Consistency (SC) w/ Locks
Thread 1            Thread 2
Lock(Stack)         Lock(Stack)
R1 = TOS            R1 = TOS
R2 = R1 - 1         R2 = R1 - 1
TOS = R2            TOS = R2
Data1 = *R1         Data2 = *R1
Unlock(Stack)       Unlock(Stack)
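A runnable sketch of the locked version using Python's `threading.Lock`. The stack contents and variable names are illustrative assumptions; the body of `pop` follows the four steps on the slide, with the `with` block standing in for Lock(Stack) / Unlock(Stack).

```python
import threading

stack = {1: "a", 2: "b"}       # hypothetical stack contents: address -> element
state = {"TOS": 2}             # TOS holds the address of the top element
stack_lock = threading.Lock()
results = {}

def pop(name):
    with stack_lock:               # Lock(Stack)
        r1 = state["TOS"]          # R1 = TOS
        r2 = r1 - 1                # R2 = R1 - 1
        state["TOS"] = r2          # TOS = R2
        results[name] = stack[r1]  # Data = *R1
    # Unlock(Stack) happens on leaving the with-block

threads = [threading.Thread(target=pop, args=(n,)) for n in ("Thread 1", "Thread 2")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results, state)
```

Which thread gets which element depends on who acquires the lock first, but the two threads always receive different elements and TOS ends two below its starting value.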

SC for Data Race Free
Thread 1            Thread 2
Lock(Stack)         Lock(Stack)
R1 = TOS            R1 = TOS
R2 = R1 - 1         R2 = R1 - 1
TOS = R2            TOS = R2
Data1 = *R1         Data2 = *R1
Unlock(Stack)       Unlock(Stack)

CPU History & GPU Future
• CPUs 3 Decades!
  – SC [ToC 1979]
  – SC for Data Race Free [ISCA 1990]
  – SC for DRF Java/C++ [PLDI 2005] [PLDI 2008]
• GPUs Faster?

GPU Memory Hierarchy = Throughput
[Figure: CPU side (CPU0, CPU1, L1, L2) and GPU side (CU0 ... CU15 with L1s, L2) sharing an LLC Directory / Memory]
• GPU has “scopes” – nearer in faster

CPU History & GPU Future
• CPUs 3 Decades!
  – SC [ToC 1979]
  – SC for Data Race Free [ISCA 1990]
  – SC for DRF Java/C++ [PLDI 2005] [PLDI 2008]
• GPUs Faster?
  – SC for Heterogeneous Race Free [ASPLOS 2014]
  – No data races & synchronization of “enough” scope
  – In Heterogeneous System Architecture [2015]
Whither System on a Chip w/ many accelerators?
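One way to read synchronization of “enough” scope: the scope named by a synchronization operation must be wide enough to contain both communicating threads. The sketch below is a deliberately simplified illustration of that reading, not the formal Heterogeneous Race Free definition. The three-level hierarchy, the `ThreadLoc` type, and both function names are assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical scope hierarchy, narrowest first.
SCOPES = ["work-group", "device", "system"]

@dataclass(frozen=True)
class ThreadLoc:
    device: str       # e.g. "gpu0" or "cpu"
    work_group: int   # work-group id within the device

def smallest_common_scope(a, b):
    """Narrowest scope that contains both threads."""
    if a.device != b.device:
        return "system"
    if a.work_group != b.work_group:
        return "device"
    return "work-group"

def scope_is_enough(a, b, scope):
    """True if synchronizing at `scope` covers both threads."""
    return SCOPES.index(scope) >= SCOPES.index(smallest_common_scope(a, b))

t0 = ThreadLoc("gpu0", 0)
t1 = ThreadLoc("gpu0", 0)   # same work-group as t0
t2 = ThreadLoc("gpu0", 5)   # same device, different work-group
t3 = ThreadLoc("cpu", 0)    # different device

print(scope_is_enough(t0, t1, "work-group"))  # nearby threads can use the cheap scope
print(scope_is_enough(t0, t2, "work-group"))  # too narrow for threads in different groups
print(scope_is_enough(t0, t3, "system"))
```

This matches the “nearer in faster” intuition from the hierarchy slide: narrow scopes are cheap but only order threads that share them.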

III. Non-Volatile Memory (NVM)
[Figure: Compute, Memory, Storage; Convergence/hype]
Off by (a) surprise or (b) design

III(a) Power Off by Surprise (Crash)
STORE value = 0xC02
STORE valid = 1
[Figure: stores in the Total Memory Order pass through a Write-back Cache to Non-Volatile Memory]
Write-back Cache: value = 0xC02, valid = 1
Non-Volatile Memory: value = 0xDEADBEEF, valid = 0, then 1
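The hazard on this slide can be enumerated directly. The sketch below lists the NVM images a crash could leave behind, first when the write-back cache may persist any subset of the two stores, then when persists are forced to follow program order. It assumes `value` and `valid` sit in different cache lines and persist independently; the function names and the `ordered` flag are illustrative, not part of any real API.

```python
from itertools import combinations

INITIAL = {"value": 0xDEADBEEF, "valid": 0}   # NVM contents before the stores
STORES = [("value", 0xC02), ("valid", 1)]     # program order

def crash_states(ordered):
    """NVM images possible at a crash.

    ordered=False: the cache may have written back any subset of the stores.
    ordered=True: persists follow program order, so only prefixes reach NVM.
    """
    if ordered:
        persisted_sets = [STORES[:k] for k in range(len(STORES) + 1)]
    else:
        persisted_sets = [list(c) for k in range(len(STORES) + 1)
                          for c in combinations(STORES, k)]
    states = []
    for persisted in persisted_sets:
        nvm = dict(INITIAL)
        nvm.update(persisted)
        states.append(nvm)
    return states

def is_consistent(nvm):
    """valid == 1 must imply that the new value is durable."""
    return nvm["valid"] == 0 or nvm["value"] == 0xC02

bad_unordered = [s for s in crash_states(ordered=False) if not is_consistent(s)]
bad_ordered = [s for s in crash_states(ordered=True) if not is_consistent(s)]
print(bad_unordered)
print(bad_ordered)
```

Without ordering, one reachable image has valid = 1 next to the stale 0xDEADBEEF, which is exactly the state shown on the slide. With persists in program order that image disappears.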

Seek Consistent Durable State on Crash
[Figure: Total Memory Order. Persistency Order?]
Persistency Model [Pelley et al ISCA 2014]
– Strict persistency: Strong as (relaxed) memory model
– Relaxed persistency: Even weaker

More Persistency Work Needed
• Industry not there yet: “If PCOMMIT is executed after a store to a persistent memory range is accepted to memory, the store becomes persistent when the PCOMMIT becomes globally visible.”
• While all store-to-memory operations are eventually accepted to memory, the following items specify the actions software can take to ensure that they are accepted: Non-temporal stores to write-back (WB) memory and all stores to uncacheable (UC), write-combining (WC), and write-through (WT) memory are accepted to memory as soon as they are globally visible.
• If, after an ordinary store to write-back (WB) memory becomes globally visible, CLFLUSH, CLFLUSHOPT, or CLWB is executed for the same cache line as the store, the store is accepted to memory when the CLFLUSH, CLFLUSHOPT, or CLWB execution itself becomes globally visible.
• IMHO Need
  – Deeper & more formal models (e.g., happens-before)
  – Better understanding of app durable state
  – Implement HW that orders; not gratuitously flushes

III(b) Power off by Design (Prediction)
• Greatly improve energy-efficiency
• Especially when (briefly) doing nothing
• Or advice Patterson never gave me… Do Nothing Well!
Need work: circuits, architecture, system SW
[Figure: intervals labeled “idle”]

Summary
I. Million-fold Memory Growth & Virtual Memory
II. General-Purpose GPUs & Memory Consistency
III. Non-Volatile Memory’s Fusing Memory & Storage

Backup