A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling
A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling
Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan
Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)
ISPASS 2012, April 2, 2012
Background
• Memory behavior is a key factor in program performance.
• Understanding memory behavior is important for identifying bottlenecks in both the architecture and the application.
• For example:
– The TLB is an essential component of the memory system
– Applications' working sets keep growing larger, leading to more TLB misses
– Study 1: TLB misses can degrade system performance by 5~14% [Bhargava '08]
– Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee '08]
• Such studies are done by memory profiling
Memory Profiling
• Memory profiling collects memory behavior information during program execution.
• Profiling can be performed
– for different hardware components: TLB / Cache / DRAM
– at different software levels: objects (array, list, etc.), function, application, whole system
Object Memory Profiling
• An object refers to a group of data stored as a unit [Wu '04]
– Distinguish regular patterns from mixed and irregular traces
[Figure: whole-system traces, application traces, and object traces, ranging from irregular to regular]
• Valuable for optimization
– Memory trace compression
– Data layout
– Object-level prefetching
– Cache partition [Soft-OLP, PACT 2009]
Current Profiling Approaches
• Existing approaches
– Compiler-driven: requires re-compiling/re-linking and access to source code
– Instrumentation: heavy overhead
– Simulation: accuracy problems, slow
– Performance counters: lack of detailed information
• None of them can observe page table walks caused by TLB misses
• We propose a hybrid hardware/software approach for object memory profiling
– Accurate: real application & real system
– Lightweight
– Tracks page table walks at object level
Outline
• Background
• Design and Implementation
• Experimental Results
• Conclusion
An Overview
[Figure: three stages of the profiling pipeline]
• Physical Address Trace: 0x398f24a, 0x398f24b, 0x398f24c, ... 0x1af4aa, 0x1af4a6, 0x1af4a8, ... 0x38d2cfc, 0x38d2cfd, ...
• Virtual Address Trace: 0x1f05000, 0x1f06000, 0x1f07000, ... 0x1f15000, 0x1f16000, 0x1f17000, ... 0x1f25000, 0x1f26000, ...
• Object Access Pattern: Matrix (VA: 0x1f05000)
HMTT
• Hybrid Memory Trace Toolkit
– A DDR3 SDRAM compatible memory trace monitoring system
– Adopts hardware snooping technology
• Memory trace record: <time_stamp, r/w, phy_addr>
• Advantages:
– Platform independent
– Negligible overhead
– Full-system real memory traces, including OS and page table walks
[Photo labels: PCIE cable connector; DIMM plugged on the other side]
Challenge (1)
• How to translate the physical address trace into the virtual address trace of a specific process?
– Modify the OS kernel to obtain the page table
– Look up each phy_addr in the dumped page table
– Generate the virtual trace of each process
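The lookup step above can be illustrated with a minimal sketch. It assumes 4 KB pages, a dumped page table represented as a plain dictionary per process, and one owner per physical frame (shared pages are ignored). The names `build_reverse_map` and `phys_to_virt` are hypothetical and are not taken from the slides or from HMTT.

```python
PAGE_SHIFT = 12                 # assumption: 4 KB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def build_reverse_map(dumped_tables):
    """Invert dumped page tables.

    dumped_tables: {pid: {virtual_page_number: physical_frame_number}}
    Returns {physical_frame_number: (pid, virtual_page_number)}.
    Assumes each frame is mapped by at most one process page.
    """
    reverse = {}
    for pid, table in dumped_tables.items():
        for vpn, pfn in table.items():
            reverse[pfn] = (pid, vpn)
    return reverse


def phys_to_virt(reverse, phy_addr):
    """Translate one physical address to (pid, virtual address), or None."""
    pfn = phy_addr >> PAGE_SHIFT
    offset = phy_addr & (PAGE_SIZE - 1)
    hit = reverse.get(pfn)
    if hit is None:
        return None             # frame not mapped in any dumped table
    pid, vpn = hit
    return pid, (vpn << PAGE_SHIFT) | offset


# Hypothetical dump: process 42 maps virtual page 0x1f05 to frame 0x398f.
reverse = build_reverse_map({42: {0x1F05: 0x398F}})
result = phys_to_virt(reverse, 0x398F24A)
```

Here `result` is `(42, 0x1f0524a)`: the frame number selects the owning process and virtual page, and the page offset is carried over unchanged.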
Challenge (2)
• How to synchronize hardware and software when a page table update occurs in the kernel?
– Physical page allocation/free in the kernel
– Trigger annotations in the OS VM module
– Update the dumped page table
– Send a sync_tag to the hardware
Challenge (3)
• How to translate virtual addresses to objects without modifying source code?
[Figure: the virtual address space, an object (matrix==mymalloc(0x1000)), and the Object-VA Mapping Table]
• The role of malloc() is to map VA to object
• Use dynamic library overriding to replace malloc()
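One way to picture the Object-VA Mapping Table is as a sorted list of allocation ranges that a replaced malloc() would populate. The sketch below is only an illustration under that assumption. The class `ObjectVAMap` and its methods are hypothetical names, and the allocation records are fed in by hand rather than captured from a real malloc() wrapper.

```python
import bisect


class ObjectVAMap:
    """Maps virtual addresses to object names via recorded allocations."""

    def __init__(self):
        self._starts = []    # sorted start addresses of live objects
        self._objects = {}   # start address -> (size, name)

    def record_alloc(self, name, start, size):
        # Called where a wrapped malloc() would log its return value.
        if start not in self._objects:
            bisect.insort(self._starts, start)
        self._objects[start] = (size, name)

    def record_free(self, start):
        # Called where a wrapped free() would log its argument.
        if start in self._objects:
            del self._objects[start]
            self._starts.remove(start)

    def lookup(self, vaddr):
        """Return the name of the object containing vaddr, or None."""
        i = bisect.bisect_right(self._starts, vaddr) - 1
        if i < 0:
            return None
        start = self._starts[i]
        size, name = self._objects[start]
        return name if vaddr < start + size else None


# Mirrors the slide's example: a 0x1000-byte object named "matrix".
table = ObjectVAMap()
table.record_alloc("matrix", 0x1F05000, 0x1000)
```

With this table, `table.lookup(0x1F05800)` returns `"matrix"`, while an address just past the object, such as `0x1F06000`, returns `None`.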
Put them all together
[Figure: the physical address trace (0x398f24a, 0x398f24b, 0x398f24c, ..., sync_tag, page walk, 0x1af4aa, 0x1af4a6, 0x1af4a8, ..., sync_tag, 0x38d2cfc, 0x38d2cfd, page walk, ...) is translated through the Dumped Page Table into the virtual address trace (0x1f05000, 0x1f06000, 0x1f07000, ..., 0x1f15000, 0x1f16000, 0x1f17000, ..., 0x1f25000, 0x1f26000, ...) and then through the Object-VA Mapping Table into the object access pattern, e.g. Matrix (VA: 0x1f05000)]
• Use the page table to distinguish three types of memory access:
– Sync_tag → update page table
– Access to the page table itself → page table walk due to TLB miss
– Other memory access → virtual address
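The three-way distinction can be sketched as a single pass over the trace. The record layout used here is an assumption made for illustration: sync_tag records are modeled as `("sync", pfn, pid, vpn)` tuples and ordinary records as `("access", phy_addr)`. The actual HMTT encoding is not described on the slides. The set `page_table_frames` (frames that hold page table pages) is likewise a hypothetical input.

```python
PAGE_SHIFT = 12                          # assumption: 4 KB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def classify(trace, reverse, page_table_frames):
    """Classify trace records in order.

    reverse: {pfn: (pid, vpn)}, updated in place by sync records.
    page_table_frames: set of frames that contain page table pages.
    Returns a list of (kind, payload) tuples.
    """
    out = []
    for record in trace:
        if record[0] == "sync":
            # Type 1: sync_tag, so update the dumped page table first.
            _, pfn, pid, vpn = record
            reverse[pfn] = (pid, vpn)
            out.append(("sync", pfn))
            continue
        _, phy_addr = record
        pfn = phy_addr >> PAGE_SHIFT
        if pfn in page_table_frames:
            # Type 2: access to the page table itself, i.e. a page walk.
            out.append(("page_walk", phy_addr))
        elif pfn in reverse:
            # Type 3: ordinary access, translated to a virtual address.
            pid, vpn = reverse[pfn]
            vaddr = (vpn << PAGE_SHIFT) | (phy_addr & OFFSET_MASK)
            out.append(("virtual", (pid, vaddr)))
        else:
            out.append(("unmapped", phy_addr))
    return out


trace = [
    ("access", 0x398F24A),               # seen before its mapping is known
    ("sync", 0x398F, 7, 0x1F05),         # kernel maps frame 0x398f for pid 7
    ("access", 0x398F24A),
    ("access", 0x1AF4AA),                # falls in a page table frame
]
events = classify(trace, {}, {0x1AF})
```

Processing sync records in trace order is the point of the design: an access is translated with the mapping that was valid at the time it occurred, not with the final state of the page table.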
Evaluation Methodology
• Processor: Intel Xeon E5504, 2.0 GHz, 2 sockets, 4 cores per socket (8 cores in total)
• Private cache
– L1 D-Cache: 32 KB, 8-way, 64 Byte/line
– L1 I-Cache: 32 KB, 4-way, 64 Byte/line
– L2: 256 KB, 8-way, 64 Byte/line
• Shared cache
– L3: 4 MB, 16-way, 64 Byte/line
• TLB (private)
– DTLB0: 64 entries for 4-KByte pages, 32 entries for huge pages (2 MByte)
– TLB1: 512 entries for 4-KByte pages
• Memory: DDR3-800 RDIMM, dual-rank, plugged into Socket 0, 4 GB
– 0.25 GB reserved for HMTT configuration and buffer
– 3.75 GB available to the system
• Operating system: CentOS 5.3, Linux kernel 2.6.32.18
• Benchmarks
– Multithreaded PARSEC 2.1
– A custom hybrid MPI/pthread implementation of BFS from Graph 500-1.2
Validation
• For the SpMV benchmark (CSR): y = ax * xhost
– Our system is able to distinguish regular access patterns from irregular ones
• Micro-benchmark:
– The error is less than 2%
Overhead
• Two main sources of overhead:
– Dumping page table traces: +dump_pt
– Dumping object-VA mapping: +dump_obj
[Figure: normalized overhead (y-axis, 0.97 to 1.07) of Origin, +dump_pt, and +dump_obj for the PARSEC benchmarks, bfs, and the mean; annotations: <2%, <1%]
• Only objects >= 4 KB are monitored: they account for most memory references
Case Study 1: BFS (Breadth-First Search)
• The column object accounts for about 71% of page walks, which makes it the key object
• Optimization: use huge pages for the column object
– Speedup: about 12% for 8 threads, 8% for 128 threads
[Figure: percentage of page walks per object (rowstarts, column, pred, oldq, newq, visited) vs. number of threads]
[Figure: normalized speedup, w/o hugetlb vs. w/ hugetlb, for 1 to 128 threads; one data label reads 8.18%]
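A rough way to see why huge pages help is to compare TLB reach, the amount of memory covered by a full TLB. The calculation below is a back-of-the-envelope sketch, not a result from the slides. It uses the TLB entry counts listed on the Evaluation Methodology slide, and the helper name `tlb_reach` is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps a distinct page."""
    return entries * page_size


# Entry counts taken from the evaluation platform's TLB description.
reach_dtlb0_4k = tlb_reach(64, 4 * KB)     # first-level TLB, 4 KB pages
reach_dtlb0_2m = tlb_reach(32, 2 * MB)     # first-level TLB, 2 MB huge pages
reach_tlb1_4k = tlb_reach(512, 4 * KB)     # second-level TLB, 4 KB pages

print(reach_dtlb0_4k // KB, "KB with 4 KB pages in DTLB0")
print(reach_dtlb0_2m // MB, "MB with 2 MB pages in DTLB0")
print(reach_tlb1_4k // MB, "MB with 4 KB pages in TLB1")
```

Under these assumptions the first-level reach grows from 256 KB to 64 MB when an object is backed by huge pages, which is consistent with targeting the object that causes the most page walks.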
Case Study 2: Canneal (PARSEC)
• Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design
• Two objects contribute most of the memory accesses: _elements and _locations
[Figure: Main Objects in Canneal; number of memory requests (0 to 1E+09) for _elements, _locations, and others at 1, 2, 4, and 8 threads]
• The number of memory accesses barely changes as the thread count increases.
Case Study 2: Canneal
• The _elements object contributes most of the increased page walks
• Place the _elements object in huge pages to reduce TLB misses
– Speedup: about 5% for 8 threads
[Figure: number of page walks (total, _elements, _locations) vs. number of threads (1, 2, 4, 8)]
[Figure: normalized speedup, w/o hugetlb vs. w/ hugetlb, vs. number of threads (1, 2, 4, 8)]
A Visual Demo of the HMTT
Conclusion
• We have designed and implemented a hybrid hardware/software approach to conduct object-relative memory profiling.
– Accurate: real application & real system
– Lightweight
– Tracks page table walks at object level
• We present two case studies showing that the approach helps users better understand memory behavior and optimize performance.
• We intend to use this approach to analyze virtual machines on real machines.
Thanks! & Questions?
Extra Slides
Memory Profiling Approaches
Approach              Accurate   Detailed   Low overhead   Page walks
Instrument            √          √          ×              ×
Simulator             *          √          ×              ×
Performance Counter   √          ×          √              *
Compiler              √          √          √              ×
Hybrid H/S            √          √          √              √
Note: √ = Yes, × = No, * = Maybe
Reverse Page Table
• Physical address → (pid, virtual address)
Validation
Access objects with different patterns:
• a0: all read accesses, forward
• a1: 3/4 read and 1/4 write accesses, forward
• a2: 2/4 read and 2/4 write accesses, forward
• a3: 1/4 read and 3/4 write accesses, backward
• a4: all write accesses, backward
Size: 256 MB, access step: 64 B, requests: 4 M
Obj     a0          a1          a2          a3          a4
Read    4,194,370   4,194,310   4,194,369   4,194,303   4,194,436
Write   0           1,048,576   2,096,927   3,087,379   4,149,586
Rate    4:0         4:1         4:2         4:2.94      4:3.96
Per     4:0         4:1         4:2         4:3         4:4
Error   0%          0%          0%          2.04%       1.01%
[Figure: access patterns of a0 and a4]
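The slide does not state how the Error row is computed. The sketch below shows one reading that reproduces the reported values: the measured write share is normalized to "4 : x" and rounded to two decimals, and the error is the gap to the expected ratio taken relative to the measured value. This formula is an assumption inferred from the numbers, and the function names are hypothetical.

```python
def measured_rate(reads, writes):
    """Writes per 4 reads, rounded to two decimals as on the slide."""
    return round(4 * writes / reads, 2)


def relative_error(measured, expected):
    """Percentage gap between expected and measured, relative to measured."""
    if measured == expected:
        return 0.0
    if measured == 0:
        return float("inf")
    return abs(expected - measured) / measured * 100


# (reads, writes, expected writes per 4 reads), copied from the table.
rows = {
    "a0": (4194370, 0, 0),
    "a1": (4194310, 1048576, 1),
    "a2": (4194369, 2096927, 2),
    "a3": (4194303, 3087379, 3),
    "a4": (4194436, 4149586, 4),
}

errors = {}
for name, (reads, writes, expected) in rows.items():
    rate = measured_rate(reads, writes)
    errors[name] = round(relative_error(rate, expected), 2)
    print(f"{name}: rate 4:{rate:g}, error {errors[name]:.2f}%")
```

Under this reading, a3 and a4 come out at 2.04% and 1.01%, and the three forward-access objects at 0%.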
HMTT Configuration Space
• A reserved physical memory region
• Can be accessed from both source code and binary code