HOMME Trace Analysis
Fabrice Mizero
Mentor: Dr. John Dennis
Collaborators: Prof. Malathi Veeraraghavan (University of Virginia), Prof. Robert D. Russell (University of New Hampshire), Qian Liu (University of New Hampshire)
Aug 1, 2014
Roadmap
• Motivation
• Background
• Methodology
• Results
• Conclusion and Solutions
• Future Work
Big Picture
• Understanding the causes of poor performance of CESM on Yellowstone: a 5-step approach
  1. Experimental execution and data collection
  2. HOMME trace analysis
  3. IBMgtSim: routing study
  4. Network simulation
  5. Integrated simulation
[Figure: panels labeled 2-hop, 4-hop, and 6-hop. Credit: Dr. John Dennis, Zhengyang Liu]
Suspected Causes
“…OS noise, shape of the allocated partition, and interference from other jobs.” (Abhinav Bhatele et al., SC13)
• Network congestion
  - Head-of-line blocking
  - Credit-based flow control
• OS jitter
  - Kernel interrupts
• Application interference
  - Self-interference
  - Interference with other jobs (neighborhood effect)
Congestion
• Head-of-line (HOL) blocking
[Figure: network diagram with hosts H1, H3-H7 and switches S1, S2; annotations: "Victim Flow", "Out of Buffer Space!!", "Stuck!!!"]
• Worst-case scenario: congestion spreading due to HOL blocking
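A toy model can make the victim-flow effect concrete. The sketch below is not taken from the slides: it assumes a single switch input, one packet forwarded per cycle, and a destination (`blocked`) whose downstream buffer is full. The host names are borrowed from the figure labels only for illustration, and both function names are hypothetical.

```python
from collections import deque

def fifo_delivered(packets, blocked, cycles):
    """Single FIFO queue: only the packet at the head may leave in a cycle."""
    queue = deque(packets)
    delivered = []
    for _ in range(cycles):
        if not queue:
            break
        if queue[0] == blocked:
            continue  # head is stuck, so every packet behind it waits too
        delivered.append(queue.popleft())
    return delivered

def per_destination_delivered(packets, blocked, cycles):
    """One queue per destination: a blocked destination stalls only its own queue."""
    queues = {}
    for dest in packets:
        queues.setdefault(dest, deque()).append(dest)
    delivered = []
    for _ in range(cycles):
        for dest, queue in queues.items():
            if dest != blocked and queue:
                delivered.append(queue.popleft())
                break  # still at most one packet per cycle
    return delivered

# Packets are identified by destination; H3 is assumed to be out of buffer space.
packets = ["H3", "H4", "H4"]
print(fifo_delivered(packets, blocked="H3", cycles=5))             # []
print(per_destination_delivered(packets, blocked="H3", cycles=5))  # ['H4', 'H4']
```

In the FIFO case the packets for H4 form the victim flow: their destination is free, yet they never leave because the head packet cannot.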
OS Jitter
• Each compute node runs its own OS (RHEL)
• Interference caused by OS routines
  - Timer interrupts
  - OS daemons
  - Hardware interrupts
• Competition for CPU resources
  - Example: line printer daemon
3 Questions
• How does congestion impact network latency?
• How important is OS jitter to network latency?
• Which has a bigger impact on message latency: OS jitter or congestion?
Experimental Set-Up
• Congestion
  - 2 platforms
    * Jellystone: non-production machine
    * Yellowstone: production machine
  - Different message sizes and hop distances
• OS jitter
  - Linux Transparent Huge Pages (THP)
Methodology
Extrae trace collection → Clock skew correction → Grouping by hop count and message size → Wilcoxon rank-sum test
Extrae
• Tracing tool developed at BSC
• Chronological records of events, states, and communications
• One-way communication delays
  - Visualized with Paraver
[Figure: timeline of an MPI_Isend with its start and end times]
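Clock Skew
[Figure: Host A records timestamp $C_a(t_1)$; Host B records timestamp $C_b(t_2)$]
• Ideally, the one-way delay is $C_{AB} = C_b(t_2) - C_a(t_1)$
• In reality:
  - Offset $= C_a(t) - C_b(t) \neq 0$
  - Skew $= C_a'(t) - C_b'(t) \neq 0$
• Correction at the host-pair level, for the same message size and the same hop count
  - Min delay: best approximation of the offset
  - Corrected delay $= C_{AB}(t) - \min(C_{AB}(t)) + \text{minpingpong}$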
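The correction above (subtract the minimum observed delay within a group, then add the minimum ping-pong delay) can be sketched as follows. This is a minimal illustration, not the project's code: the record fields (`src`, `dst`, `size`, `hops`, `delay`) are hypothetical names, and `min_pingpong` is assumed to be a single minimum delay obtained from a separate ping-pong measurement. The delay values are invented.

```python
from collections import defaultdict

def correct_delays(records, min_pingpong):
    """Shift each group's raw delays so its smallest delay equals min_pingpong."""
    def key(rec):
        # Group at the host-pair level, for the same size and hop count.
        return (rec["src"], rec["dst"], rec["size"], rec["hops"])

    # The minimum raw delay per group is used as the estimate of the clock offset.
    group_min = defaultdict(lambda: float("inf"))
    for rec in records:
        group_min[key(rec)] = min(group_min[key(rec)], rec["delay"])

    return [rec["delay"] - group_min[key(rec)] + min_pingpong for rec in records]

# Raw delays include an unknown clock offset, so they can even be negative.
records = [
    {"src": "A", "dst": "B", "size": 488, "hops": 4, "delay": 105.0},
    {"src": "A", "dst": "B", "size": 488, "hops": 4, "delay": 103.5},
    {"src": "B", "dst": "A", "size": 488, "hops": 4, "delay": -100.0},
    {"src": "B", "dst": "A", "size": 488, "hops": 4, "delay": -98.0},
]
print(correct_delays(records, min_pingpong=1.5))  # [3.0, 1.5, 1.5, 3.5]
```

Note that this removes only a constant offset per group. It does not model drift between the two clocks over the duration of the trace.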
Statistical Methods
• Wilcoxon rank-sum test
  - Non-parametric significance test
  - Compares two independent samples: tests whether values from one population tend to be larger than values from the other (it compares ranks, not means)
  - Tests:
    * OS jitter?
      Jellystone: no THP vs. with THP
    * Congestion?
      Yellowstone: 0-Hop delays vs. 4-Hop delays
      Jellystone: THP vs. Yellowstone: THP
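For readers unfamiliar with the test, here is a self-contained sketch of a two-sided rank-sum test using the normal approximation, without tie or continuity correction in the variance. It is an illustration only: the slides do not say which implementation was used for the analysis, the function names are hypothetical, and the sample values are invented rather than taken from the traces.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (z, two-sided p). Negative z means x tends to hold smaller values."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / std
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Invented delay samples (arbitrary units) for two configurations.
delays_a = [1.0, 2.0, 3.0, 4.0, 5.0]
delays_b = [6.0, 7.0, 8.0, 9.0, 10.0]
z, p = rank_sum_test(delays_a, delays_b)
print(z < 0, p < 0.05)  # True True
```

With sample sizes in the tens of thousands, as in the results that follow, the normal approximation is reasonable; for very small samples an exact test would be more appropriate.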
Perfquery
• Perfquery: InfiniBand (IB) performance counter query tool
• PortXmitWait: port congestion monitoring
  - Credit-based flow control
[Figure: Host A sending to a TOR switch. Decision "Credits?": on "Yes" the data is sent; on "No" the port waits and PortXmitWait increases.]
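The relationship between credits and the wait counter can be illustrated with a toy model. This sketch is an assumption-laden simplification, not how perfquery or InfiniBand hardware is implemented: it assumes one packet per tick, one credit per packet, and a per-tick list of credits returned by the receiver. The function and parameter names are hypothetical.

```python
def simulate_port(initial_credits, credit_returns, backlog):
    """Return (packets_sent, xmit_wait_ticks) for a toy credit-based sender.

    credit_returns holds the number of credits handed back by the receiver
    at the start of each tick; backlog is the number of packets waiting.
    """
    credits = initial_credits
    sent = 0
    xmit_wait = 0
    for returned in credit_returns:
        credits += returned
        if backlog == 0:
            continue  # nothing to send, so this tick is not counted as waiting
        if credits > 0:
            credits -= 1
            backlog -= 1
            sent += 1
        else:
            xmit_wait += 1  # data is ready but there are no credits
    return sent, xmit_wait

# Congested receiver: credits come back late, so the sender stalls for 2 ticks.
print(simulate_port(2, [0, 0, 0, 0, 1, 1], backlog=4))  # (4, 2)
# Uncongested receiver: a credit returns every tick, so the sender never waits.
print(simulate_port(2, [1, 1, 1, 1, 1, 1], backlog=4))  # (4, 0)
```

A wait counter that keeps growing on a port therefore suggests that the receiving side of the link is not freeing buffer space fast enough.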
Results
• How important is OS jitter to network latency?
  - Jellystone::0-Hop::NoTHP vs. Jellystone::0-Hop::THP

Msg size   Sample size        p-value     Interpretation
488 B      54624 :: 45727     <0.001, 1   NoTHP is faster than with THP
1952 B     9503 :: 7950       <0.001, 1   NoTHP is faster than with THP
2440 B     102120 :: 85468    <0.001, 1   NoTHP is faster than with THP
2928 B     47504 :: 39764     <0.001, 1   NoTHP is faster than with THP

  - Intranode communication delays are longer with THP enabled than without THP.
Results
• Which has a bigger impact on message latency: OS jitter or congestion?
  - Comparing Yellowstone 0-Hop delays with 4-Hop delays

Msg size   Sample size        p-value     Interpretation
488 B      54325 :: 23621     <0.001, 1   4-Hop is faster than 0-Hop
2440 B     101581 :: 16529    <0.001, 1   4-Hop is faster than 0-Hop
2928 B     47243 :: 21259     <0.001, 1   4-Hop is faster than 0-Hop
4880 B     49603 :: 4720      <0.001, 1   4-Hop is faster than 0-Hop

  - For all message sizes considered, intranode communication delays can exceed internode delays.
Conclusion
• OS jitter can cause performance degradation or variability.
• Inter-job interference can lead to application performance variability.
Solutions
• Congestion
  - Dynamic allocation of Virtual Lanes to redirect victim flows around congested ports
• OS jitter
  - Linux tickless kernel
  - MPI-3 for better control over shared-memory communications
Future Work
• Further study of the dynamic Virtual Lane assignment solution
• Plan and collect new HOMME traces with PortXmitWait monitored and LSF logs saved
• Study intra-job interference
• Develop a more efficient algorithm for correcting clock skew
Thank You
Fabrice Mizero
fm9ab@virginia.edu