Co-Array Fortran
Open-source compilers and tools for scalable global address space computing
John Mellor-Crummey
Rice University

Outline
• Co-array Fortran
  • language overview
  • CAF compiler status and preliminary results
  • language and compiler research issues
  • interactions
• OpenMP
  • compiler and runtime strategies for improving scalability
  • Dragon tool
  • hybrid MPI + OpenMP
• Open64 infrastructure
  • source-to-source and source-to-object code infrastructure
Center for Programming Models for Scalable Parallel Computing Review, March 13, 2003

Co-Array Fortran (CAF)
• Explicitly-parallel extension of Fortran 90/95 (Numrich & Reid)
• Global address space SPMD parallel programming model
  • one-sided communication
• Simple, two-level model that supports locality management
  • local vs. remote memory
• Programmer control over performance-critical decisions
  • data partitioning
  • communication
• Suitable for mapping to a range of parallel architectures
  • shared memory, message passing, hybrid, PIM
• Much in common with UPC

CAF Programming Model Features
• SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
• Both private and shared data
  • real x(20, 20)       a private 20 x 20 array in each image
  • real y(20, 20)[*]    a shared 20 x 20 array in each image
• Simple one-sided shared-memory communication
  • x(:, j:j+2) = y(r, :)[p:p+2]    copy row r from images p:p+2 into local columns j:j+2
• Flexible synchronization
  • sync_team(notify [, wait])
    • notify = a vector of process ids to signal
    • wait = a vector of process ids to wait for
• Pointers and (perhaps asymmetric) dynamic allocation
• Parallel I/O
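The one-sided copy on this slide can be sketched as follows, assuming standard Fortran 2008 coarray syntax rather than the CAF dialect shown above. Fortran 2008 requires scalar cosubscripts, so the triplet co-dimension `[p:p+2]` is written as an explicit loop over images. The names `gather_rows` and `rows_to_columns` are hypothetical, and the caller is assumed to guarantee `p+2 <= num_images()`.

```fortran
! Sketch only: Fortran 2008 coarray syntax, not the 2003-era CAF dialect.
module gather_rows
  implicit none
contains
  ! For k = 0..2, copy row r of y on image p+k into column j+k of the local x.
  ! Assumption: every image calls this routine and p+2 <= num_images().
  subroutine rows_to_columns(x, y, r, j, p)
    real, intent(inout) :: x(20, 20)      ! private: an independent copy per image
    real, intent(in)    :: y(20, 20)[*]   ! co-array: readable from any image
    integer, intent(in) :: r, j, p
    integer :: k

    sync all                              ! y is initialized on every image
    do k = 0, 2
      x(:, j + k) = y(r, :)[p + k]        ! one-sided get from image p+k
    end do
    sync all                              ! all gets finish before y may change
  end subroutine rows_to_columns
end module gather_rows
```

Only the image selector in square brackets marks remote access; references to `x`, and to `y` without brackets, stay local.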

One-sided Communication with Co-Arrays
    integer a(10, 20)[*]
    if (thisimage() > 1) a(1:5, 1:10) = a(1:5, 1:10)[thisimage()-1]
[Figure: the co-array a(10, 20) as held on image 1, image 2, ..., image N]
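The slide's assignment can be sketched as a small module, assuming standard Fortran 2008 coarray syntax (`this_image()`, `sync all`) in place of the slide's `thisimage()`. The temporary buffer and the two barriers are additions of this sketch, not part of the slide: without them, one image could overwrite its block while its right neighbor is still reading it. The names `neighbor_copy` and `pull_from_left` are hypothetical.

```fortran
! Sketch only: Fortran 2008 coarray syntax, not the 2003-era CAF dialect.
module neighbor_copy
  implicit none
contains
  ! Copy the upper-left 5 x 10 block of a from the left neighbor (image me-1)
  ! into the same block of the local a. Image 1 has no left neighbor.
  ! Assumption: every image calls this routine.
  subroutine pull_from_left(a)
    integer, intent(inout) :: a(10, 20)[*]
    integer :: tmp(5, 10)
    integer :: me

    me = this_image()
    sync all                                  ! all images have initialized a
    if (me > 1) tmp = a(1:5, 1:10)[me - 1]    ! one-sided get from the left
    sync all                                  ! every get has completed
    if (me > 1) a(1:5, 1:10) = tmp            ! now safe to overwrite locally
  end subroutine pull_from_left
end module neighbor_copy
```

With gfortran, code like this should build with `-fcoarray=single` for a one-image run, or with `-fcoarray=lib` and a coarray runtime such as OpenCoarrays for multiple images.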

Finite Element Example (Numrich)
    subroutine assemble(start, prin, ghost, neib, x)
      integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
      real :: x(:)[*]
      call sync_all(neib)
      do p = 1, size(neib)    ! Add contributions from ghost regions
        k1 = start(p); k2 = start(p+1)-1
        x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
      enddo
      call sync_all(neib)
      do p = 1, size(neib)    ! Update the ghosts
        k1 = start(p); k2 = start(p+1)-1
        x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
      enddo
      call sync_all
    end subroutine assemble

Portable CAF Compiler
• Compile CAF to Fortran 90 + runtime support library
  • source-to-source code generation for wide portability
  • expect best performance by leveraging vendor F90 compiler
• Co-arrays
  • access data in generated code using F90 pointers
  • allocate storage with dope vector initialization outside F90
• Porting to a new compiler / architecture
  • synthesize compatible dope vectors for co-array storage
  • tailor communication to architecture

CAF Compiler Status
• Near production-quality F90 front end from Open64
• Working prototype for a CAF subset
  • allocate co-arrays using static constructor-like strategy
  • co-array access
    • remote data access uses ARMCI get/put
    • process local data access uses load/store
  • sync_all, sync_team synchronization
  • multi-dimensional array section operations
• Successfully compiled and executed NAS MG
  • platforms: SGI Origin, IA64 Myrinet
  • performance similar to hand-coded MPI

NAS MG Efficiency (Class C)
IA64/Myrinet 2000
[Figure: efficiency chart not recoverable from the transcript]

CAF Compiler Coming Attractions
• Co-arrays as procedure arguments
• Triplet notation for co-dimensions
• Co-arrays of user-defined types
  • types can contain pointers
• Dynamic allocation of co-arrays
• Compiler support for parallel I/O

CAF Language Research Issues
• Synchronization
  • locks instead of critical sections
  • split-phase primitives
  • sync_team/sync_all semantics can require pairwise notification
  • may need synchronization matching hints to enable optimization
• Language support for efficient reductions
  • manually-coded reductions unlikely to yield portable performance
• Memory consistency model for co-array data
• Controlling process-to-processor mapping
• Support for hierarchical locality domains
  • support work sharing on SMPs?
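To illustrate the reduction concern, here is a minimal sketch of a hand-coded global sum, assuming standard Fortran 2008 coarray syntax. The names `manual_reduction` and `global_sum` are hypothetical. Image 1 gathers all contributions with a serial chain of one-sided gets; that choice may suit one machine and not another, which is the portability problem the slide points at. (Fortran 2018 later standardized collective subroutines such as `co_sum` for this purpose.)

```fortran
! Sketch only: a deliberately naive, hand-coded reduction.
module manual_reduction
  implicit none
contains
  ! Sum one value per image and return the total on every image.
  ! Assumption: every image calls this routine.
  subroutine global_sum(partial, total)
    real, intent(in)  :: partial
    real, intent(out) :: total
    real, save :: part[*]     ! staging slot for this image's contribution
    real, save :: tot[*]      ! only the copy on image 1 is used
    integer :: p

    part = partial
    sync all                  ! every contribution is published
    if (this_image() == 1) then
      tot = 0.0
      do p = 1, num_images()
        tot = tot + part[p]   ! P one-sided gets, serialized on image 1
      end do
    end if
    sync all                  ! the total on image 1 is complete
    total = tot[1]            ! every image fetches the result
  end subroutine global_sum
end module manual_reduction
```

A tree-shaped or hardware-assisted reduction could replace the loop, but picking the right shape per platform is the kind of decision that language-level reduction support would move into the compiler and runtime.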

CAF Compiler Research Issues
Aim for performance transparency
• Compiler optimization of communication and I/O
  • multi-mode communication: direct load/store + RDMA
  • combine synchronization with communication
    • put/get with flag
    • one-sided to two-sided communication
  • transform from get to put communication
  • exploit split-phase communication and synchronization
  • communication vectorization
  • latency hiding for communication and parallel I/O
  • platform-tailored optimization
  • synchronization strength reduction
• Interoperability with other parallel programming models
• Optimizations to improve node performance
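The get-to-put transformation can be sketched by writing the same boundary exchange both ways, assuming standard Fortran 2008 coarray syntax. The names `get_to_put`, `halo_pull`, and `halo_push` are hypothetical, and this hand-written pair only illustrates what such a compiler transformation would need to preserve; it is not the compiler's actual output. Each routine moves the whole boundary as one array section, which is also the shape that communication vectorization aims for.

```fortran
! Sketch only: the same data movement expressed as a get and as a put.
module get_to_put
  implicit none
  integer, parameter :: n = 8    ! boundary length (arbitrary for this sketch)
contains
  ! Pull form: each image fetches its left neighbor's boundary.
  subroutine halo_pull(edge, halo)
    real, intent(in)    :: edge(n)[*]
    real, intent(inout) :: halo(n)[*]
    integer :: me
    me = this_image()
    sync all                                   ! neighbors' edge data is ready
    if (me > 1) halo(:) = edge(:)[me - 1]      ! one get of the whole section
    sync all                                   ! gets done before edge may change
  end subroutine halo_pull

  ! Push form: each image stores its boundary into its right neighbor.
  subroutine halo_push(edge, halo)
    real, intent(in)    :: edge(n)[*]
    real, intent(inout) :: halo(n)[*]
    integer :: me
    me = this_image()
    sync all                                   ! neighbors are ready to receive
    if (me < num_images()) halo(:)[me + 1] = edge(:)   ! one put of the section
    sync all                                   ! halo has arrived everywhere
  end subroutine halo_push
end module get_to_put
```

Both routines leave image k (for k > 1) holding the boundary of image k-1 in `halo`; which form is cheaper depends on the interconnect, which is why the slide lists the choice as a platform-tailored optimization.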

CAF Interactions
• Working with CAF code from Numrich and Wallcraft (NRL)
• Refining ARMCI synchronization with Nieplocha
• Designing parallel I/O for CAF with UIUC
• Exploring language design with Numrich and Nieplocha
• Coordinating with Rasmussen (LANL) on Fortran 90 array dope vector interface library
• Planning a fall CAF workshop at PSC
  • coordinating with Ralph Roskies, Sergiu Sanielevici
  • encouragement from Rich Hirsch, Fred Johnson