Memory Performance Profiling via Sampled Performance Monitor Event Traces
Diana Villa, Jaime Acosta, Patricia J. Teller
The University of Texas at El Paso, Department of Computer Science
Bret Olszewski, Trevor Morgan
IBM Corporation – Austin, TX
Exxon/Mobil

Outline
§ Motivation
§ Data Collection Environment
  • Workload & Platform
  • Monitored Events
§ Sampled Event Traces
§ Performance Evaluation Framework
§ Data Analysis & Results
§ Conclusions and Future Work

Motivation
§ Modern Systems
  • Performance governed by memory subsystem
§ SMPs
  • Deeper and larger memory hierarchies
  • Performance analysis considerations: time to results and size of data set
§ Goal
  • Develop a new performance analysis methodology

Data Collection Environment
§ Workload
  • TPC-C benchmark (commercial, OLTP)
§ Platform
  • IBM eServer pSeries 690 architecture (p690)
  • 8- and 32-processor configurations

Platform: 8-processor p690 configuration
[Diagram: multi-chip modules MCM 0 and MCM 1; components labeled P, X, L2, and L3]

Platform: 32-processor p690 configuration
[Diagram: multi-chip modules labeled MCM 0, MCM 1, and MCM 3; components labeled P, L2, and L3]

Monitored Events
§ L2-cache data-load misses
  • L2.5
  • L2.75
  • L3.5
  • MEM
§ L1-cache data-load miss
  • L2

Load Latencies
  L2      12 cycles
  L2.5    73 cycles
  L2.75   96 cycles
  L3     112 cycles
  L3.5   143 cycles
  MEM    320 cycles
[Diagram, built up over six slides: 8-processor p690 configuration (MCM 0, MCM 1; components labeled P, X, L2, L3) marking each data source in turn]
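One way to read the latency table is as a set of weights: a sampled load satisfied from memory costs far more cycles than one satisfied from L2. The sketch below is a minimal illustration of that weighting, not the authors' tool. The latencies are the ones listed on the slide; the function names and the sample counts in the usage note are hypothetical.

```python
# Load latencies in processor cycles, as listed on the slide.
LOAD_LATENCY_CYCLES = {
    "L2": 12,
    "L2.5": 73,
    "L2.75": 96,
    "L3": 112,
    "L3.5": 143,
    "MEM": 320,
}


def estimated_load_cycles(sample_counts):
    """Sum count * latency over every data source (hypothetical helper)."""
    return sum(LOAD_LATENCY_CYCLES[src] * n for src, n in sample_counts.items())


def cycle_share(sample_counts):
    """Fraction of the estimated load cycles attributable to each source."""
    total = estimated_load_cycles(sample_counts)
    if total == 0:
        return {src: 0.0 for src in sample_counts}
    return {
        src: LOAD_LATENCY_CYCLES[src] * n / total
        for src, n in sample_counts.items()
    }
```

For example, with made-up counts of 10 loads from L2 and 1 load from MEM, `estimated_load_cycles({"L2": 10, "MEM": 1})` gives 440 cycles, of which the single memory load accounts for 320.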

Data Collection
§ 10-minute observation interval
§ Performance Monitoring Unit (PMU)
  • Special-purpose registers
  • Programming interface: kernel extension
§ eprof
  • PMU configuration
  • Event-based sampling

Sampled Event Traces
§ Sampling
  • Record periodic occurrences of an event
  • 100 events/sec/CPU
§ Event record
  PID      TID      Timestamp     Effective Instruction Address   Effective Data Address
  372872   184469   0.328104637   000000A8C4                      00000218880
§ Average number of samples collected/event
  • 238,448 for 8-processor data
  • 212,396 for 32-processor data
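The event record on this slide has five fields: PID, TID, timestamp, effective instruction address, and effective data address. A minimal parsing sketch follows. It assumes, without confirmation from the slides, that a trace is a text file with one whitespace-separated record per line and that both addresses are hexadecimal; `EventRecord` and `parse_record` are illustrative names, not part of eprof.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    pid: int
    tid: int
    timestamp: float   # seconds, as printed in the trace
    instr_addr: int    # effective instruction address
    data_addr: int     # effective data address


def parse_record(line):
    """Parse one trace line (assumed layout: PID TID timestamp iaddr daddr)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    pid, tid, timestamp, instr_addr, data_addr = fields
    return EventRecord(
        pid=int(pid),
        tid=int(tid),
        timestamp=float(timestamp),
        instr_addr=int(instr_addr, 16),  # assumption: hexadecimal
        data_addr=int(data_addr, 16),    # assumption: hexadecimal
    )
```

Calling `parse_record("372872 184469 0.328104637 000000A8C4 00000218880")` yields a record with `pid` 372872 and `tid` 184469; a line with a missing field raises `ValueError` rather than being silently skipped.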

Performance Framework
[Diagram: Data Collection Environment (TPC-C on p690) -> Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.) -> Database Load (Java tool) -> DB -> Report Generation (Java tool) -> Reports and Graphs]
Sample report rows shown in the diagram (column headings not recoverable):
  5   BufferPool        56893   29384
  6   Data, BSS, Heap    8799    4855
  1   Kernel            23485    9840
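The framework loads sampled event traces into a database and then generates reports from it. The slides state that both steps are Java tools; the sketch below only imitates the flow in Python with an in-memory SQLite database. The table name, column names, and the per-PID report are assumptions made for illustration, not the authors' schema.

```python
import sqlite3


def load_samples(rows):
    """Load (pid, tid, ts, instr_addr, data_addr) tuples into a new database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples ("
        "pid INTEGER, tid INTEGER, ts REAL, "
        "instr_addr INTEGER, data_addr INTEGER)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def samples_per_pid(conn):
    """Report step: number of sampled events per process, busiest first."""
    cur = conn.execute(
        "SELECT pid, COUNT(*) AS n FROM samples "
        "GROUP BY pid ORDER BY n DESC, pid ASC"
    )
    return cur.fetchall()
```

Keeping the load and report steps separate mirrors the diagram: once the samples are in a table, further reports (per thread, per instruction address, per data address) are additional queries rather than new passes over the trace files.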

Analysis
• Identify application-specific sources of performance degradation associated with data references
[Diagram: data references broken down by address space region (kernel, text, data/bss/heap, buffer pool), segment, page, page offset/cache line, level of memory hierarchy, and instruction/data structure]
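The breakdown by segment, page, and cache line amounts to slicing each sampled effective data address at different bit positions and counting samples per slice. The sketch below illustrates this under assumed sizes that the slides do not state: 256 MB segments, 4 KB pages, and 128-byte cache lines. Adjust the shift constants if the platform differs; all names are hypothetical.

```python
from collections import Counter

SEGMENT_SHIFT = 28  # assumption: 256 MB segments
PAGE_SHIFT = 12     # assumption: 4 KB pages
LINE_SHIFT = 7      # assumption: 128-byte cache lines


def decompose(addr):
    """Split an effective address into the units used in the analysis."""
    return {
        "segment": addr >> SEGMENT_SHIFT,
        "page": addr >> PAGE_SHIFT,
        "page_offset": addr & ((1 << PAGE_SHIFT) - 1),
        "cache_line": addr >> LINE_SHIFT,
    }


def hottest(addrs, level, top=3):
    """Return the `top` most frequently sampled units at the given level."""
    counts = Counter(decompose(a)[level] for a in addrs)
    return counts.most_common(top)
```

With these constants, the addresses `0x218880` and `0x2188F0` fall in the same cache line and the same page, so `hottest([0x218880, 0x2188F0, 0x300000], "cache_line", top=1)` reports that line with a count of 2.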

Results

32-Processor Results: Memory Regions [chart]

32-Processor Results: L3 Caches [chart]

32-Processor Results: Segments [chart]

32-Processor Results: Pages [chart]

32-Processor Results: Cache Lines [chart]

32-Processor Results: Instructions
  Lock Operations      Atomic Operations
  simple_lock          fetch_and_add
  simple_lock_ppc      fetch_and_add_h
  simple_unlock        fetch_and_addlp
  disable_lock         fetch_and_or
  unlock_enable        fetch_and_orlp
  simple_unlock_mem    fetch_and
  unlock_enable_mem    fetch_andlp

Conclusions
§ Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
  • buffer pool
  • data, bss, heap
§ TPC-C lock instructions are not key to performance degradation
§ The 8- and 32-processor data show the same reference pattern, so a model of TPC-C memory access may be possible

Future Work
§ Suggest ways to improve p690 application performance
§ Enhance performance evaluation framework
§ Quantify representativeness of sampled event traces
§ Expand study of application data load behavior
  • Process characterization
  • Process migration
  • Other performance issues: compulsory vs. capacity/conflict misses, false sharing, contention for resources
§ Develop synthetic applications
  • Mimic the behavior of key p690 applications
  • Use these to study application behavior and experiment with modifications to applications that may affect performance

Thank You. Questions?