A Multi-platform Co-array Fortran Compiler for High-performance Computing


www.hipersoft.rice.edu/caf
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey, Daniel Chavarría-Miranda
{ccristi, dotsenko, johnmc, danich}@cs.rice.edu
Sponsor logo: NSF

Co-Array Fortran Language
  • SPMD programming model
    – fixed number of images during execution
    – images operate asynchronously
  • Both private and shared data
    – private: a 20 x 20 array in each image
    – shared: a 20 x 20 array in each image, declared with a co-dimension: real a(20, 20)[*]
  • Simple one-sided communication (PUT & GET)
    – x(:, j:j+2) = a(r, :)[p:p+2]
    – copy rows from p:p+2 into local columns
  • Flexible explicit synchronization
    – sync_team(team [, wait])
      • team = a vector of process ids to synchronize with
      • wait = a vector of processes to wait for
  • Pointers and dynamic allocation

Example: copies from left neighbor
  integer :: a(10, 20)[*]
  if (this_image() > 1) a(1:10, 1:2) = a(1:10, 19:20)[this_image()-1]
  [Diagram: image 1, image 2, ..., image N each hold their own a(10, 20)]

CAF Model Refinements
  • Point-to-point synchronization
    – sync_notify(p)
    – sync_wait(p)
  • Less restrictive memory fences at call site
  • Collective operations

Rice CAF Compiler
  • Source-to-source code generation
  • Open source compiler
  • Built on Open64/SL infrastructure
  • Support for core language features
  • Code generation:
    – library-based communication: widely-portable ARMCI communication library and array descriptor CHASM library
    – load/store communication: on shared-memory platforms
  • Operating systems:
    – Linux IA64/IA32
    – Alpha Tru64
    – SGI IRIX64
  • Interconnects & Platforms:
    – Quadrics QSNet (Elan3), QSNet II (Elan4)
    – Myrinet 2000
    – Ethernet
    – SGI Altix 3000, SGI Origin 2000

Current Optimizations
  • Procedure Splitting
  • Hints for non-blocking communication
  • Library-based and load/store communication
  • Packing strided communication

Planned Optimizations
  • Communication vectorization
  • Synchronization strength-reduction
  • Automatic split-phase communication
  • Platform-driven communication optimizations
    – transform communication from one-sided into two-sided and collective, if useful
    – multi-model code for hierarchical architectures
    – convert GETs into PUTs
  • Multi-buffer co-arrays for asynchrony tolerance
  • Employ virtualization for latency tolerance
  • Interoperability with other programming models

CAF Applications and Benchmarks
  • Sweep3D – wave-front parallelism
  • Spark98 – sparse matrix vector multiply
  • NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU
  • Random Access
  • STREAM

Performance Results on Cluster-based Platforms
  [Figure panels: Sweep3D 300³, NAS SP Class C, NAS MG Class C]

Performance Results on SGI Altix 3000
  [Figure panels: Sweep3D 300³, NAS SP Class C, NAS MG Class C]
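
The left-neighbor copy from the poster, a(1:10, 1:2) = a(1:10, 19:20)[this_image()-1], can be tried as a complete program. The following is a minimal sketch, not the authors' code. It assumes a compiler that supports standard Fortran 2008 coarrays, so it uses the standard `sync all` barrier in place of the poster's `sync_team`, `sync_notify` and `sync_wait` primitives, which belong to the Rice CAF dialect. The program name, the fill values and the final check are illustrative assumptions.

```fortran
program left_neighbor_copy
  implicit none
  integer :: a(10, 20)[*]   ! one 10 x 20 block per image (shared co-array)
  integer :: me             ! private scalar: a separate copy on each image

  me = this_image()
  a = me                    ! fill the local block with this image's index

  ! Assumption: a full barrier stands in for the poster's finer-grained
  ! synchronization; it ensures every image has initialized a before any GET.
  sync all

  ! One-sided GET: read the last two columns of the left neighbor's block
  ! into the first two columns of the local block. Image 1 has no left
  ! neighbor and skips the copy.
  if (me > 1) a(1:10, 1:2) = a(1:10, 19:20)[me - 1]

  sync all

  ! Illustrative check: the copied columns now hold the neighbor's index.
  if (me > 1) then
    if (any(a(:, 1:2) /= me - 1)) error stop "halo mismatch"
  end if
  if (me == 1) print *, "left-neighbor copy done on", num_images(), "images"
end program left_neighbor_copy
```

The GET reads columns 19:20 of the neighbor while that neighbor writes only its own columns 1:2, so the two accesses never touch the same elements and a single barrier before the copy is enough. Replacing the barriers with pairwise notify/wait synchronization between neighbors, as the model refinements on the poster suggest, would let images that are not adjacent proceed independently.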