A Multi-platform Co-Array Fortran Compiler (Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey)

A Multi-platform Co-Array Fortran Compiler
Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey
Department of Computer Science, Rice University, Houston, TX, USA

Motivation: Parallel Programming Models
• MPI: de facto standard, but difficult to program
• OpenMP: inefficient to map onto distributed-memory platforms (lack of locality control)
• HPF: hard to obtain high performance (heroic compilers needed!)
• Global address space languages (CAF, Titanium, UPC): an appealing middle ground

Co-Array Fortran
• Global address space programming model
  – one-sided communication (GET/PUT)
• Programmer has control over performance-critical factors
  – data distribution
  – computation partitioning
  – communication placement
• Data movement and synchronization as language primitives
  – amenable to compiler-based communication optimization

CAF Programming Model Features
• SPMD process images
  – fixed number of images during execution
  – images operate asynchronously
• Both private and shared data
  – real x(20, 20): a private 20 x 20 array in each image
  – real y(20, 20)[*]: a shared 20 x 20 array in each image
• Simple one-sided shared-memory communication
  – x(:, j:j+2) = y(:, p:p+2)[r] copies columns from image r into local columns
• Synchronization intrinsic functions
  – sync_all: a barrier and a memory fence
  – sync_mem: a memory fence
  – sync_team([team members to notify], [team members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation

One-sided Communication with Co-Arrays

integer a(10, 20)[*]

if (this_image() > 1) a(1:10, 1:2) = a(1:10, 19:20)[this_image()-1]

(Each of images 1 through N holds a local a(10, 20); the statement copies the last two columns of the previous image's co-array into the first two columns of the local one.)

Rice Co-Array Fortran Compiler (cafc)
• First multi-platform CAF compiler
  – previous compiler ran only on Cray shared-memory systems
• Implements the core of the language
  – currently lacks support for derived-type and dynamic co-arrays
• Core sufficient for non-trivial codes
• Performance comparable to that of hand-tuned MPI codes
• Open source

Outline
• CAF programming model
• cafc
  – Core language implementation
  – Optimizations
• Experimental evaluation
• Conclusions

Implementation Strategy
• Source-to-source compilation of CAF codes
  – uses Open64/SL Fortran 90 infrastructure
  – CAF -> Fortran 90 + communication operations
• Communication
  – ARMCI library for one-sided communication on clusters
  – load/store communication on shared-memory platforms
• Goals
  – portability
  – high performance on a wide range of platforms

Co-Array Descriptors

real :: a(10, 10)[*]

type CAFDesc_real_3
  integer(ptrkind) :: handle       ! opaque handle to CAF runtime representation
  real, pointer    :: ptr(:, :)    ! Fortran 90 pointer to local co-array data
end type CAFDesc_real_3

type(CAFDesc_real_3) :: a

• Initialize and manipulate Fortran 90 dope vectors

Allocating COMMON and SAVE Co-Arrays
• Compiler
  – generates a static initializer for each common/save variable
• Linker
  – collects calls to all initializers
  – generates a global initializer that calls all others
  – compiles the global initializer and links it into the program
• Launch
  – invokes the global initializer before the main program begins
    • allocates co-array storage outside the Fortran 90 runtime system
    • associates co-array descriptors with the allocated memory
Similar to the handling of C++ static constructors

Parameter Passing
• Call-by-value convention (copy-in, copy-out)
    call f((a(i)[p]))
  – passes remote co-array data to procedures only as values
• Call-by-co-array convention*
  – argument declared as a co-array by the callee
      subroutine f(a)
        real :: a(10)[*]
  – enables access to local and remote co-array data
• Call-by-reference convention* (cafc)
  – argument declared as an explicit-shape array
      real :: x(10)[*]
      call f(x)
      subroutine f(a)
        real :: a(10)
  – enables access to local co-array data only
  – enables reuse of existing Fortran code
* requires an explicit interface

Multiple Co-dimensions
Managing processors as a logical multi-dimensional grid:

integer a(10, 10)[5, 4, *]   ! 3D processor grid 5 x 4 x ...

• Support co-space reshaping at procedure calls
  – change the number of co-dimensions
  – co-space bounds as procedure arguments

Implementing Communication

x(1:n) = a(1:n)[p] + ...

• Use a temporary buffer to hold off-processor data
  – allocate buffer
  – perform GET to fill buffer
  – perform computation: x(1:n) = buffer(1:n) + ...
  – deallocate buffer
• Optimizations
  – no temporary storage for co-array-to-co-array copies
  – load/store communication on shared-memory systems

Synchronization
• Original CAF specification: team synchronization only
  – sync_all, sync_team
  – limits performance on loosely coupled architectures
• Point-to-point extensions
  – sync_notify(q)
  – sync_wait(p)
• Point-to-point synchronization semantics
  – when a notify from p is delivered to q, all communication from p to q issued before the notify has also been delivered to q

Outline
• CAF programming model
• cafc
  – Core language implementation
  – Optimizations
    • procedure splitting
    • supporting hints for non-blocking communication
    • packing strided communications
• Experimental evaluation
• Conclusions

An Impediment to Code Efficiency
• Original reference
    rhs(1, i, j, k, c) = ... + u(1, i-1, j, k, c) - ...
• Transformed reference
    rhs%ptr(1, i, j, k, c) = ... + u%ptr(1, i-1, j, k, c) - ...
• The Fortran 90 pointer-based co-array representation does not convey
  – the lack of co-array aliasing
  – co-array contiguity
  – co-array bounds
• This lack of knowledge inhibits important code optimizations

Procedure Splitting: CAF-to-CAF preprocessing

Before:
  subroutine f(...)
    real, save :: c(100)[*]
    ... = c(50) ...
  end

After:
  subroutine f(...)
    real, save :: c(100)[*]
    interface
      subroutine f_inner(..., c_arg)
        real :: c_arg[*]
      end subroutine f_inner
    end interface
    call f_inner(..., c)
  end

  subroutine f_inner(..., c_arg)
    real :: c_arg(100)[*]
    ... = c_arg(50) ...
  end subroutine f_inner

Benefits of Procedure Splitting
• Generated code conveys
  – lack of co-array aliasing
  – co-array contiguity
  – co-array bounds
• Enables the back-end compiler to generate better code

Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication
  – use of indexed subscripts in co-dimensions
  – lack of whole-program analysis
• Approach: support hints for non-blocking communication
  – overcome conservative compiler analysis
  – enable sophisticated programmers to achieve good performance today

Hints for Non-blocking PUTs
• Hints for the CAF run-time system to issue non-blocking PUTs
    region_id = open_nb_put_region()
    ...
    Put_Stmt_1
    ...
    Put_Stmt_N
    ...
    call close_nb_put_region(region_id)
• Complete non-blocking PUTs:
    call complete_nb_put_region(region_id)
• Open problem: exploiting non-blocking GETs?

Strided vs. Contiguous Transfers
• Problem: a CAF remote reference might induce many small data transfers
    a(i, 1:n)[p] = b(j, 1:n)
• Solution: pack strided data on the source and unpack it on the destination

Pragmatics of Packing
Who should implement packing?
• The CAF programmer
  – difficult to program
• The CAF compiler
  – unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)
• The communication library
  – the most natural place
  – ARMCI currently performs packing on Myrinet

CAF Compiler Targets (Sept 2004)
• Processors
  – Pentium, Alpha, Itanium 2, MIPS
• Interconnects
  – Quadrics, Myrinet, Gigabit Ethernet, shared memory
• Operating systems
  – Linux, Tru64, IRIX

Outline
• CAF programming model
• cafc
  – Core language implementation
  – Optimizations
• Experimental evaluation
• Conclusions

Experimental Evaluation
• Platforms
  – Alpha + Quadrics QSNet (Elan3)
  – Itanium 2 + Quadrics QSNet II (Elan4)
  – Itanium 2 + Myrinet 2000
• Codes
  – NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

NAS BT Efficiency (Class C)

NAS SP Efficiency (Class C)
• lack of a non-blocking notify implementation blocks CAF comm/comp overlap

NAS MG Efficiency (Class C)
• ARMCI communication is efficient
• point-to-point synchronization boosts CAF performance by 30%

NAS CG Efficiency (Class C)

NAS LU Efficiency (Class C)

Impact of Optimizations: Assorted Results
• Procedure splitting
  – 42-60% improvement for BT on the Itanium 2 + Myrinet cluster
  – 15-33% improvement for LU on Alpha + Quadrics
• Non-blocking communication generation
  – 5% improvement for BT on the Itanium 2 + Quadrics cluster
  – 3% improvement for MG on all platforms
• Packing of strided data
  – 31% improvement for BT on the Alpha + Quadrics cluster
  – 37% improvement for LU on the Itanium 2 + Quadrics cluster
See the paper for more details

Conclusions
• CAF boosts programming productivity
  – simplifies the development of SPMD parallel programs
  – shifts details of managing communication to the compiler
• cafc delivers performance comparable to hand-tuned MPI
• cafc implements effective optimizations
  – procedure splitting
  – non-blocking communication
  – packing of strided communication (in ARMCI)
• Vectorization needed to achieve true performance portability on machines like the Cray X1

http://www.hipersoft.rice.edu/caf