Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation
Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu
Executive Summary • Our Goal: Accelerate pointer chasing inside main memory • Challenges: Parallelism challenge and address translation challenge • Our Solution: In-Memory PoInter Chasing Accelerator (IMPICA) • Address-access decoupling: enables parallelism in the accelerator at low cost • IMPICA page table: low-cost page table structure • Key Results: 1.2X – 1.9X speedup for pointer chasing operations, +16% database throughput; 6% – 41% reduction in energy consumption
Linked Data Structures • Linked data structures are widely used in many important applications • Linked data structures are connected by pointers • Examples: B-Trees in databases, hash tables in key-value stores
The Problem: Pointer Chasing • Traversing linked data structures requires chasing pointers • Serialized and irregular memory access pattern • 6X cycles per instruction in real workloads • [Figure: Find(A) traverses nodes H, E, A; each access depends on the data returned by the previous one]
Our Goal • Accelerating pointer chasing inside main memory • [Figure: the CPU sends Find(A) to the memory stack; the traversal runs in the logic layer below the DRAM layers and Data(A) is returned]
Outline • Motivation and Our Approach • Parallelism Challenge • IMPICA Core Architecture • Address Translation Challenge • IMPICA Page Table • Evaluation • Conclusion
Parallelism Challenge • [Figure: execution timelines of a CPU core vs. an in-memory accelerator, each alternating Comp and Memory access phases] • The in-memory accelerator is faster for one operation, but slower for two operations
Parallelism Challenge and Opportunity • A simple in-memory accelerator can still be slower than multiple CPU cores • Opportunity: a pointer-chasing accelerator spends a long time waiting for memory (memory access takes 10 – 15X as long as computation)
Our Solution: Address-Access Decoupling • [Figure: the CPU core's serialized Comp/Memory access timeline vs. IMPICA's Address Engine and Access Engine overlapping their work] • Address-access decoupling enables parallelism in both engines at low cost
IMPICA Core Architecture • [Figure: DRAM dies/layers stacked on a logic layer; the logic layer holds the Request Queue, Address Engine, Access Queue, Access Engine, IMPICA Cache, Memory Controller, and Response Queue, with an interface to/from the CPU; Traversal 1 and Traversal 2 are in flight at the same time]
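To make the decoupling concrete, below is a toy software model of how the two engines might hand work to each other through the queues in the figure. It is only a sketch under assumed names and data layouts (the FIFO, the array standing in for DRAM, and the message format are all illustrative), not the actual IMPICA hardware or simulator code; its point is that the access engine can have requests outstanding while the address engine computes, so multiple traversals overlap instead of serializing.

```c
/* Toy software model of the address-access decoupling above. All names,
 * queue sizes, and the array standing in for DRAM are assumptions made
 * for illustration; this is not the IMPICA hardware or simulator code. */
#include <stdint.h>
#include <stdio.h>

#define N_NODES 8
/* "DRAM": each node stores the index of the next node (0 = end of list). */
static uint64_t next_of[N_NODES] = {0, 0, 5, 6, 0, 1, 4, 0};

typedef struct { int traversal_id; uint64_t addr; } msg_t;
typedef struct { msg_t buf[16]; unsigned head, tail; } fifo_t;
static int   fifo_empty(fifo_t *q)          { return q->head == q->tail; }
static void  fifo_push (fifo_t *q, msg_t m) { q->buf[q->tail++ % 16] = m; }
static msg_t fifo_pop  (fifo_t *q)          { return q->buf[q->head++ % 16]; }

static fifo_t access_q, response_q;   /* access queue and response queue */

/* Address engine: computation only. It consumes fetched nodes and
 * produces the address of the next node for the access engine. */
static void address_engine(void) {
    while (!fifo_empty(&response_q)) {
        msg_t m = fifo_pop(&response_q);
        printf("traversal %d visited node %llu\n",
               m.traversal_id, (unsigned long long)m.addr);
        if (next_of[m.addr] != 0)
            fifo_push(&access_q, (msg_t){ m.traversal_id, next_of[m.addr] });
    }
}

/* Access engine: memory only. It issues outstanding requests; the
 * immediate "DRAM access" here would take hundreds of cycles in
 * hardware, which is exactly the time another traversal can use. */
static void access_engine(void) {
    while (!fifo_empty(&access_q))
        fifo_push(&response_q, fifo_pop(&access_q));
}

int main(void) {
    fifo_push(&access_q, (msg_t){ 1, 2 });  /* traversal 1 starts at node 2 */
    fifo_push(&access_q, (msg_t){ 2, 3 });  /* traversal 2 starts at node 3 */
    for (int step = 0; step < 8; step++) { access_engine(); address_engine(); }
    return 0;
}
```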
Outline • Motivation and Our Approach • Parallelism Challenge • IMPICA Core Architecture • Address Translation Challenge • IMPICA Page Table • Evaluation • Conclusion
Address Translation Challenge • The page table walk requires multiple memory accesses • There is no TLB/MMU on the memory side • Duplicating the TLB/MMU is costly and creates a compatibility issue
Our Solution: IMPICA Page Table • Completely decouple the IMPICA page table from the CPU page table • Map linked data structures into IMPICA regions of the virtual address space • The IMPICA page table is a partial-to-any mapping: it covers only the IMPICA regions, but can map them to any physical page
IMPICA Page Table: Mechanism • Virtual address bits [47:41] index the region table, bits [40:21] index a flat page table (2 MB), bits [20:12] index a small page table (4 KB), and bits [11:0] are the page offset into the physical address • The flat page table saves one memory access • The tiny region table is almost always in the cache
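The lookup implied by these bit fields can be sketched as a short address-translation routine. Only the bit ranges come from the slide; the entry structs, pointer-based table links, and function name below are assumptions made to keep the example self-contained.

```c
#include <stdint.h>

/* Assumed entry formats for the three levels; only the bit ranges
 * below come from the slide, the structs and names are illustrative. */
typedef struct { uint64_t phys_page; }             small_entry_t;  /* 4 KB page  */
typedef struct { const small_entry_t *small_tab; } flat_entry_t;   /* 2 MB range */
typedef struct { const flat_entry_t  *flat_tab;  } region_entry_t;

/* Translate a virtual address with the IMPICA page table:
 * region table -> flat page table -> small page table -> page offset. */
uint64_t impica_translate(uint64_t va, const region_entry_t *region_table) {
    uint64_t region = (va >> 41) & 0x7F;     /* bits [47:41], 7-bit index  */
    uint64_t flat   = (va >> 21) & 0xFFFFF;  /* bits [40:21], 20-bit index */
    uint64_t small  = (va >> 12) & 0x1FF;    /* bits [20:12], 9-bit index  */
    uint64_t offset =  va        & 0xFFF;    /* bits [11:0], page offset   */

    /* The tiny region table is almost always cached, and the flat page
     * table collapses two levels of a conventional walk into one access. */
    const flat_entry_t  *flat_table  = region_table[region].flat_tab;
    const small_entry_t *small_table = flat_table[flat].small_tab;

    return (small_table[small].phys_page << 12) | offset;
}
```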
Outline • Motivation and Our Approach • Parallelism Challenge • IMPICA Core Architecture • Address Translation Challenge • IMPICA Page Table • Evaluation • Conclusion
Evaluated Workloads • Microbenchmarks: linked list (from the Olden benchmark suite), hash table (from Memcached), B-tree (from DBx1000) • Application: DBx1000 (with the TPC-C benchmark)
Evaluation Methodology • Simulator: gem5 • System configuration: CPU with 4 OoO cores at 2 GHz, 32 KB L1 and 1 MB L2 caches; IMPICA with 1 core at 500 MHz and a 32 KB cache; memory bandwidth of 12.8 GB/s for the CPU and 51.2 GB/s for IMPICA • Our simulator code will be released in December
Result – Microbenchmark Performance • [Chart: speedup of IMPICA over the baseline and over the baseline with an extra 128 KB of L2] • IMPICA achieves 1.9X speedup on Linked List, 1.3X on Hash Table, and 1.2X on B-Tree
Result – Database Performance • Database throughput: +2% with an extra 128 KB L2, +5% with an extra 1 MB L2, +16% with IMPICA • Database latency: -0% with an extra 128 KB L2, -4% with an extra 1 MB L2, -13% with IMPICA
Energy Consumption • [Chart: energy normalized to the baseline, comparing the baseline with an extra 128 KB L2 and IMPICA] • IMPICA reduces energy consumption by 41% on Linked List, 24% on Hash Table, 10% on B-Tree, and 6% on DBx1000
More in the Paper • Interface and design considerations: CPU interface and programming model, page table management, cache coherence • Area and power overhead analysis • Sensitivity to the IMPICA page table design
Conclusion • Performing pointer chasing inside main memory can greatly speed up the traversal of linked data structures • Challenges: Parallelism challenge and address translation challenge • Our Solution: In-Memory PoInter Chasing Accelerator (IMPICA) • Address-access decoupling: enables parallelism at low cost • IMPICA page table: low-cost page table structure • Key Results: 1.2X – 1.9X speedup for pointer chasing operations, +16% database throughput; 6% – 41% reduction in energy consumption
Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu
Microarchitecture Metrics • [Charts: TLB MPKI and normalized cache miss latency for Linked List, Hash Table, and B-Tree]
Sensitivity to IMPICA TLB Size & Page Table Design • [Chart: address translation speedup for Linked List, Hash Table, B-Tree, and DBx1000 under four configurations: 32 TLB, 64 TLB, 32 TLB + RPT, 64 TLB + RPT]
Full IMPICA Core Architecture • [Figure: DRAM dies/layers above the logic layer; the logic layer holds the Data RAM, Inst RAM, Request Queue, Address Engine, Access Queue, Access Engine, IMPICA Cache, Memory Controller, and Response Queue, with an interface to/from the CPU]
CPU Interface • We use a packet-based interface between the CPU and IMPICA • Execution steps: the CPU sends a function call and its parameters to IMPICA; the packet is written to the IMPICA data RAM; IMPICA loads the function into its inst RAM; IMPICA writes the results to the data RAM, from which the CPU polls them
Programming Model • An IMPICA program is written as a function in the application code, marked with a compiler directive • The compiler compiles these functions into IMPICA instructions and wraps the function calls with communication code, as sketched below
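A hypothetical example of what this might look like at the source level: the traversal function is marked with an assumed directive, and the host-side wrapper follows the packet-write and polling steps from the CPU-interface slide. None of the names here (the pragma, the packet layout, the function ID) are the actual IMPICA toolchain API; they only make the flow concrete.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical linked-list node stored in an IMPICA region. */
typedef struct node { uint64_t key; uint64_t value; struct node *next; } node_t;

/* The function the compiler would translate into IMPICA instructions.
 * "#pragma impica" is an assumed directive, not the real toolchain syntax. */
#pragma impica
uint64_t list_find(node_t *head, uint64_t key) {
    for (node_t *n = head; n != NULL; n = n->next)
        if (n->key == key)
            return n->value;
    return 0;
}

/* Assumed call-packet layout in the IMPICA data RAM. */
typedef struct {
    uint32_t          func_id;
    uint64_t          args[2];
    uint64_t          result;
    volatile uint32_t done;
} impica_packet_t;

/* Communication wrapper the compiler might emit around the call:
 * write the packet into the data RAM, then poll for the result. */
uint64_t list_find_on_impica(volatile impica_packet_t *data_ram,
                             node_t *head, uint64_t key) {
    data_ram->func_id = 1;                         /* assumed ID for list_find */
    data_ram->args[0] = (uint64_t)(uintptr_t)head; /* pointer parameter        */
    data_ram->args[1] = key;
    data_ram->done    = 0;                         /* IMPICA sets this flag    */
    while (!data_ram->done)
        ;                                          /* CPU polls the data RAM   */
    return data_ram->result;
}
```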
Page Table Management • The application allocates the memory for its linked data structures with a special API • The OS reserves a portion of the virtual address space as IMPICA regions • The OS maintains coherence between the CPU page table and the IMPICA page table in the page fault handler
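For illustration, such an allocation API might be used as follows; impica_malloc and its stand-in implementation are hypothetical names invented for this sketch, not the real interface.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocation API for IMPICA regions; the name and the
 * stand-in definition below are assumptions made for illustration. */
void *impica_malloc(size_t size);

typedef struct node { uint64_t key; struct node *next; } node_t;

/* Nodes are placed in an IMPICA region, so the OS page fault handler
 * can keep the IMPICA page table coherent with the CPU page table. */
node_t *list_insert(node_t *head, uint64_t key) {
    node_t *n = impica_malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->key  = key;
    n->next = head;
    return n;
}

/* Stand-in definition: a real system would reserve part of the virtual
 * address space as an IMPICA region and allocate from it here. */
void *impica_malloc(size_t size) { return malloc(size); }
```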
IMPICA Page Table Size • Region table: 4 entries (covers a 2 TB memory range), 68 B • Flat page table (each): 2^20 entries, 8 MB • Small page table (each): 2^9 entries, 4 KB
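As a sanity check, these sizes are consistent with 8-byte page table entries (an assumed entry size, not stated on the slide): 2^20 entries × 8 B = 8 MB per flat page table, and 2^9 entries × 8 B = 4 KB per small page table.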
Handling of Multiple Memory Stacks • The OS knows each IMPICA region because of our page table management • The OS always maps the IMPICA regions of the same application into the same memory stack, including the corresponding IMPICA page table
Cache Coherence • We execute every function that operates on the IMPICA regions in the accelerator • This can be extended with more advanced cache coherence mechanisms
Limit of Parallelism • The parallelism of IMPICA is limited by the data RAM size (for call stacks), the ratio of memory access time to address computation time, and the sizes of the queues • Each IMPICA core can easily parallelize 10 – 15 pointer chasing requests
Area and Power Overhead • CPU (Cortex-A57): 5.85 mm² per core • L2 cache: 5 mm² per MB • Memory controller: 10 mm² • IMPICA (+32 KB cache): 0.45 mm² • Power overhead: average power increases by 5.6%