Implementing an OpenMP Execution Environment on InfiniBand Clusters

Jie Tao ¹, Wolfgang Karl ¹, and Carsten Trinitis ²
¹ Institut für Technische Informatik, Universität Karlsruhe
² Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München

Outline
• Motivation
• ViSMI: Virtual Shared Memory for InfiniBand clusters
• Omni/Infini: towards OpenMP execution on cluster systems
• Initial experimental results
• Conclusions

© J. Tao, IWOMP 2005

Motivation
• InfiniBand
  - Point-to-point, switched I/O interconnect architecture
  - Low latency, high bandwidth
  - Basic connection: serial, at a data rate of 2.5 Gbps
  - Bundled: e.g. 4X connection, 10 Gbps; 12X, 30 Gbps
  - Special feature: Remote Direct Memory Access (RDMA)
• The InfiniBand cluster at TUM
  - Configuration: 6 Xeon (2-way), 4 Itanium 2 (4-way), 36 Opteron
  - MPI available
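The bundled link rates above follow directly from the 2.5 Gbps per-lane rate. A minimal sketch of that arithmetic is given below. The function names are hypothetical, and the 8b/10b encoding factor (8 data bits per 10 line bits, as used by first-generation InfiniBand links) is not stated on the slide; it is included only to show why the usable payload rate is lower than the quoted figure.

```python
# Illustrative arithmetic only. The per-lane rate comes from the slide;
# the 8b/10b encoding efficiency is an assumption added for illustration.
LANE_SIGNALING_GBPS = 2.5


def signaling_rate_gbps(lanes):
    """Raw signaling rate of a bundled link with the given number of lanes."""
    return lanes * LANE_SIGNALING_GBPS


def payload_rate_gbps(lanes, encoding_efficiency=0.8):
    """Approximate usable data rate after line encoding (assumed 8b/10b)."""
    return signaling_rate_gbps(lanes) * encoding_efficiency


for lanes in (1, 4, 12):
    print(f"{lanes}X: {signaling_rate_gbps(lanes):g} Gbps signaling, "
          f"about {payload_rate_gbps(lanes):g} Gbps payload")
```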

Software Distributed Shared Memory
• Virtually global address space on cluster architectures
• Memory consistency models
  - Sequential consistency
    - Any execution has the same result as if operations were executed in a sequential order
  - Relaxed consistency
    - Consistency through explicit synchronization: acquire & release
  - Lazy Release Consistency (LRC)
  - Home-based Lazy Release Consistency (HLRC)
    - Multiple writable copies of the same page
    - A home for a page
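To make the acquire/release idea concrete, the toy model below shows how a write on one node becomes visible on another only after the writer releases and the reader acquires. It is a simplified sketch, not the ViSMI protocol: the classes `SharedStore` and `Node` are hypothetical, a single dictionary stands in for all home pages, and locks, page granularity, and write notices are left out.

```python
class SharedStore:
    """Stand-in for the home copies of all shared data (hypothetical)."""

    def __init__(self):
        self.data = {}


class Node:
    """One cluster node with a local, possibly stale, cache (hypothetical)."""

    def __init__(self, home):
        self.home = home
        self.cache = {}     # local copies of shared values
        self.dirty = set()  # addresses written since the last release

    def read(self, addr):
        # Fetch from the home only on a local miss; a cached copy may be stale.
        if addr not in self.cache:
            self.cache[addr] = self.home.data.get(addr)
        return self.cache[addr]

    def write(self, addr, value):
        # Writes stay local until the next release.
        self.cache[addr] = value
        self.dirty.add(addr)

    def release(self):
        # Make local modifications visible at the home.
        for addr in self.dirty:
            self.home.data[addr] = self.cache[addr]
        self.dirty.clear()

    def acquire(self):
        # Invalidate clean cached copies so later reads refetch from the home.
        self.cache = {a: v for a, v in self.cache.items() if a in self.dirty}


home = SharedStore()
home.data["x"] = 0
writer, reader = Node(home), Node(home)

print(reader.read("x"))  # initial value from the home
writer.write("x", 1)
writer.release()
print(reader.read("x"))  # still the cached value: no acquire yet
reader.acquire()
print(reader.read("x"))  # now the released value
```

Under sequential consistency the second read would already have to return the new value; relaxed models trade that guarantee for fewer messages between synchronization points.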

ViSMI: Software DSM for InfiniBand Clusters
• HLRC implementation
  - Home node: first-touch
  - Diff-based mechanism
    - Clean copy before first write
    - Diffs propagated by invalidations
  - InfiniBand hardware-based multicast
• Programming interface
  - A set of annotations
    - HLRC_Malloc
    - HLRC_Myself
    - HLRC_InitParallel
    - HLRC_Barrier
    - HLRC_Aquire
    - HLRC_Release
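The diff-based mechanism can be sketched as follows: a node keeps a clean copy (often called a twin) of a page before its first write, later compares the page against that copy, and sends only the changed bytes to the home. This is a generic illustration of the technique, not ViSMI code; the helper names `make_twin`, `compute_diff`, and `apply_diff` and the 8-byte page are assumptions chosen for brevity.

```python
def make_twin(page):
    """Clean copy of a page, taken before the first write."""
    return bytes(page)


def compute_diff(twin, page):
    """Map each modified byte offset to its new value."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}


def apply_diff(home_page, diff):
    """Merge a diff into the home copy of the page."""
    for offset, value in diff.items():
        home_page[offset] = value


# Home copy of one tiny page, plus two nodes holding writable copies of it.
home = bytearray(8)
copy_a, copy_b = bytearray(home), bytearray(home)

twin_a = make_twin(copy_a)
copy_a[1] = 7  # node A writes offset 1

twin_b = make_twin(copy_b)
copy_b[5] = 9  # node B writes offset 5 of the same page

# Each node sends only its modified bytes; the home merges both diffs.
apply_diff(home, compute_diff(twin_a, copy_a))
apply_diff(home, compute_diff(twin_b, copy_b))
print(list(home))  # [0, 7, 0, 0, 0, 9, 0, 0]
```

Because the two nodes touch disjoint offsets, both updates survive the merge, which is what allows multiple writable copies of the same page.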

Omni/Infini: Compiler & Runtime for OpenMP Execution on InfiniBand
• Omni for SMP as the basis
• A new runtime library: adapting to the ViSMI interface
  - Scheduling
    - ompc_static_bschd()
    - ompc_get_num_threads()
  - Parallelization
    - ompc_do_parallel()
  - Synchronization
    - ompc_lock()
    - ompc_unlock()
    - ompc_barrier()
  - Reduction operation
    - ompc_reduction()
  - Specific code region
    - ompc_do_single()
    - ompc_is_master()
    - ompc_critical()
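As an illustration of what a static block schedule computes, the sketch below splits a loop range into one contiguous chunk per thread. It shows the general technique only and is not the Omni runtime code; the function `static_block_bounds` and its signature are hypothetical and do not mirror `ompc_static_bschd()`.

```python
def static_block_bounds(lb, ub, num_threads, tid):
    """Half-open iteration range [start, end) assigned to thread `tid`.

    The range [lb, ub) is cut into contiguous chunks of ceil(n / num_threads)
    iterations; trailing threads may receive a shorter or empty chunk.
    """
    total = max(ub - lb, 0)
    chunk = -(-total // num_threads)  # ceiling division
    start = min(lb + tid * chunk, ub)
    end = min(start + chunk, ub)
    return start, end


# Ten iterations over four threads.
for tid in range(4):
    print(tid, static_block_bounds(0, 10, 4, tid))
```

Each thread (or, on a cluster, each process) derives its bounds from its own ID and the thread count alone, so no communication is needed to distribute the iterations.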

Omni/Infini (cont.)
• Shared data allocation
  - Static data pointer
  - HLRC_Malloc into shared region
• Adapting process structure to thread structure
  - ViSMI: process structure
  - OpenMP: thread structure

Thread-level & Process-level Parallelization
[Figure: two panels, "Thread-level parallelization" (master, T1, T2, T3) and "Process-level parallelization" (P1, P2, P3, P4), each showing sequential regions, parallel regions, and IDLE phases]
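One plausible reading of the process-level panel is an SPMD layout: every process runs the same program, sequential regions are executed by the master only while the others wait, and all processes share the work in parallel regions. The sketch below simulates this under stated assumptions: Python threads stand in for cluster processes, a dictionary stands in for the DSM shared region, and the function `run_spmd` is hypothetical rather than part of Omni/Infini.

```python
import threading


def run_spmd(num_procs, n):
    """Simulate master-only sequential regions around one parallel region."""
    barrier = threading.Barrier(num_procs)
    shared = {}                # stand-in for the shared memory region
    partial = [0] * num_procs  # one result slot per process
    log = []                   # (rank, region) records of who executed what

    def program(rank):
        if rank == 0:
            # Sequential region: only the master executes it.
            shared["data"] = list(range(n))
            log.append((rank, "sequential"))
        barrier.wait()  # the other processes idle here until the master is done

        # Parallel region: every process handles its share of the data.
        partial[rank] = sum(shared["data"][rank::num_procs])
        log.append((rank, "parallel"))
        barrier.wait()  # end of the parallel region

        if rank == 0:
            # Next sequential region: the master combines the partial sums.
            shared["total"] = sum(partial)
            log.append((rank, "sequential"))

    workers = [threading.Thread(target=program, args=(r,)) for r in range(num_procs)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return shared["total"], log


total, log = run_spmd(4, 100)
print(total)
print(sorted(log))
```

In the thread-level model the extra threads exist only inside the parallel region; here all processes exist for the whole run, and the guards plus barriers reproduce the same visible behaviour.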

Experimental Results
• Platform and setup
  - Initial: 6 Xeon nodes
  - Most recently: Opteron nodes
• Applications
  - NAS parallel benchmark suite
  - OpenMP version of SPLASH-2
  - Small codes
    - SMP programming course
    - Self-coded

Experimental Results
• Platform

Speedup

Normalized Execution Time Breakdown

CG Time Breakdown

Comparison with Omni/SCASH

Scalability

Conclusions
• Summary
  - An OpenMP execution environment for InfiniBand
  - Built on top of a software DSM
  - Omni compiler as basis
  - Speedup on 6 nodes: up to 5.22
• Future work
  - Optimization: barrier, page operation
  - Test with realistic applications
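To put the best reported speedup in context, the short calculation below derives two standard metrics from it: parallel efficiency and the Karp-Flatt experimentally determined serial fraction. Neither metric appears on the slides; they are computed here only from the quoted figures (speedup 5.22 on 6 nodes), and the function names are hypothetical.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes


def karp_flatt_serial_fraction(speedup, nodes):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), defined for p > 1."""
    return (1.0 / speedup - 1.0 / nodes) / (1.0 - 1.0 / nodes)


best_speedup, nodes = 5.22, 6  # figures quoted in the summary
print(f"efficiency: {parallel_efficiency(best_speedup, nodes):.2f}")
print(f"serial fraction: {karp_flatt_serial_fraction(best_speedup, nodes):.3f}")
```

For the best case this gives an efficiency of about 0.87 and a serial fraction of about 0.03, a figure that lumps together sequential code and DSM overheads such as barriers and page operations.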