An Emerging, Portable Co-Array Fortran Compiler for High-Performance Computing

Daniel Chavarría-Miranda, Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
{danich, ccristi, dotsenko, johnmc}@cs.rice.edu

Programming Models for High-Performance Computing

MPI
• Portable and widely used
• The programmer has explicit control over data locality and communication
• Using MPI can be difficult and error prone
• Most of the burden for communication optimization falls on application developers; compiler support is underutilized

HPF
• The compiler is responsible for communication and data locality
• Annotated sequential code (semiautomatic parallelization)
• Requires heroic compiler technology
• The model limits the application paradigms: extensions to the standard are required for supporting irregular computation

Co-Array Fortran
A sensible alternative to these extremes
Simple and expressive models for high performance programming based on extensions to widely used languages
• Performance: users control data and computation partitioning
• Portability: same language for SMPs, MPPs, and clusters
• Programmability: global address space for simplicity

Co-Array Fortran Language
• SPMD process images
  – number of images fixed during execution
  – images operate asynchronously
• Both private and shared data
  – real a(20,20)       private: a 20 x 20 array in each image
  – real a(20,20)[*]    shared: a 20 x 20 array in each image
• Simple one-sided shared memory communication
  – x(:,j:j+2) = a(r,:)[p:p+2]    copy rows from p:p+2 into local columns
• Flexible synchronization
  – sync_team(team [, wait])
    • team = a vector of process ids to synchronize with
    • wait = a vector of processes to wait for (a subset of team)
• Pointers and dynamic allocation
• Parallel I/O

Explicit Data and Computation Partitioning

    integer A(10,10)[*]

[Figure: A(10,10) on image 0, image 1, ..., image N]

    A(1:10,1:10)[2] = A(1:10,1:10)[2]

Finite Element Example

    subroutine assemble(start, prin, ghost, neib, x)
      integer :: start(:), prin(:), ghost(:), neib(:)
      integer :: k1, k2, p
      real :: x(:)[*]

      call sync_all(neib)
      do p = 1, size(neib)    ! Update from ghost regions
        k1 = start(p); k2 = start(p+1)-1
        x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
      enddo
      call sync_all(neib)
      do p = 1, size(neib)    ! Update the ghost regions
        k1 = start(p); k2 = start(p+1)-1
        x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
      enddo
      call sync_all
    end subroutine assemble

Co-Array Fortran enables simple expression of complicated communication patterns

Research Focus
• Compiler-directed optimization of communication tailored for target platform communication fabric
  – Transform as useful from 1-sided to 1.5-sided, two-sided and collective communication
  – Generate both fine-grain load/store and calls to communication libraries as necessary
  – Multi-model code for hierarchical architectures
• Platform-driven optimization of computation
• Compiler-directed parallel I/O with UIUC
• Enhancements to Co-Array Fortran synch. model

Current Implementation Status
• Source-to-source code generation for wide portability
• Open source compiler will be available
• Working prototype for a subset of the language
• Initial compiler implementation performs no optimization
  – each co-array access is transformed into a get/put operation at the same point in the code
• Code generation for the widely-portable ARMCI communication library
• Front-end based on production-quality Open64 front end, modified to support source-to-source compilation
• Successfully compiled and executed NAS MG on SGI Origin; performance similar to hand coded MPI

Sum Reduction Example

Original Co-Array Program

    program eCafSum
      integer, save :: caf2d(10,10)[*]
      integer :: sum2d(10,10)
      integer :: me, num_imgs, i

      ! what is my image number
      me = this_image()
      ! how many images are running
      num_imgs = num_images()

      ! initial data assignment
      caf2d(1:10,1:10) = me
      call sync_all()

      ! compute the sum for 2d co-array
      if (me .eq. 1) then
        sum2d(1:10,1:10) = 0
        do i = 1, num_imgs
          sum2d(1:10,1:10) = sum2d(1:10,1:10) &
                           + caf2d(1:10,1:10)[i]
        end do
        write(*,*) 'sum2d = ', sum2d
      endif
      call sync_all()
    end program eCafSum

Resulting Fortran 90 parallel program

    program eCafSum
      < Co-array Fortran initialization >
      ecafsum_caf2d%ptr(1:10,1:10) = me
      call CafArmciSynchAll()
      if (me .eq. 1) then
        sum2d(1:10,1:10) = 0
        do i = 1, num_imgs, 1
          allocate( cafTemp_2%ptr(1:10,1:10) )
          cafTemp_4%ptr => ecafsum_caf2d%ptr(1:10,1:10)
          call CafArmciGetS(ecafsum_caf2d%handle, i, cafTemp_4, cafTemp_2)
          sum2d(1:10,1:10) = cafTemp_2%ptr(1:10,1:10) + sum2d(1:10,1:10)
          deallocate( cafTemp_2%ptr )
        end do
        write(*,*) 'sum2d = ', sum2d(1:10,1:10)
      endif
      call CafArmciSynchAll()
      call CafArmciFinalize()
    end program eCafSum
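
Sum Reduction in Standard Fortran 2008 Syntax (illustrative sketch)

The sum reduction above is written in the original Co-Array Fortran dialect, where synchronization is a subroutine call (call sync_all()). For readers who want to try the example with a current compiler, the following is a minimal sketch of the same program using the coarray syntax later standardized in Fortran 2008, where synchronization is the sync all statement. It is not output of the compiler described on this poster. The program name caf_sum_f2008 and the final self-check are assumptions added for illustration.

```fortran
! Sketch only: Fortran 2008 rendering of the poster's sum reduction.
! Assumes a compiler with coarray support; the name caf_sum_f2008 and
! the self-check at the end are illustrative additions.
program caf_sum_f2008
  implicit none
  integer :: caf2d(10, 10)[*]   ! one 10 x 10 array per image, remotely accessible
  integer :: sum2d(10, 10)      ! private to each image
  integer :: me, num_imgs, i

  me = this_image()             ! index of this image (1-based)
  num_imgs = num_images()       ! total number of images

  caf2d = me                    ! initial data assignment
  sync all                      ! every image has written caf2d before anyone reads it

  if (me == 1) then
    sum2d = 0
    do i = 1, num_imgs
      ! one-sided read (a "get") of the whole array from image i
      sum2d = sum2d + caf2d(:, :)[i]
    end do
    write(*, *) 'sum2d(1,1) = ', sum2d(1, 1)
    ! every element should equal 1 + 2 + ... + num_imgs
    if (any(sum2d /= num_imgs * (num_imgs + 1) / 2)) error stop 'unexpected sum'
  end if

  sync all                      ! keep all images alive until image 1 has finished reading
end program caf_sum_f2008
```

With gfortran, for example, this can be built as a single image using the -fcoarray=single option; running on several images requires a coarray runtime such as OpenCoarrays (-fcoarray=lib). Each remote reference caf2d(:, :)[i] marks a point where, according to the implementation status above, the prototype compiler would emit a get operation.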