The HDF Group
Parallel HDF5 Design and Programming Model
May 30-31, 2012, HDF5 Workshop at PSI

Advantage of parallel HDF5
• Recent success story: trillion particle simulation
  - 120,000 cores
  - 30 TB file
  - 23 GB/sec average speed with 35 GB/sec peaks (out of 40 GB/sec max for the system)
• Parallel HDF5 rocks! (when used properly)

Outline
• Overview of Parallel HDF5 design
• Parallel environment requirements
• Performance analysis
• Parallel tools
• PHDF5 programming model
• Examples

OVERVIEW OF PARALLEL HDF5 DESIGN

PHDF5 requirements
• PHDF5 should allow multiple processes to perform I/O to an HDF5 file at the same time
  - Single file image to all processes
  - Compare with a one-file-per-process design:
    - Expensive post-processing
    - Not usable by a different number of processes
    - Too many files produced for the file system
• PHDF5 should use a standard parallel I/O interface
  - Must be portable to different platforms

PHDF5 requirements
• Support Message Passing Interface (MPI) programming
• PHDF5 files compatible with serial HDF5 files
  - Shareable between different serial or parallel platforms

Parallel environment requirements
• MPI with POSIX I/O (HDF5 MPI-POSIX driver)
  - POSIX-compliant file system
• MPI with MPI-IO (HDF5 MPI I/O driver)
  - MPICH2 ROMIO
  - Vendor's MPI-IO
• Parallel file system
  - GPFS (General Parallel File System)
  - Lustre
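
The driver is chosen on the file access property list before the file is created or opened. The sketch below shows both choices; it assumes the HDF5 1.8-era C API of these slides (H5Pset_fapl_mpio for the MPI-IO driver, H5Pset_fapl_mpiposix for the MPI-POSIX driver; the latter is not available in some newer HDF5 releases).

    #include <mpi.h>
    #include <hdf5.h>

    /* Build a file access property list that selects the parallel driver. */
    hid_t make_parallel_fapl(int use_mpio)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        if (use_mpio) {
            /* MPI-IO driver: all file I/O goes through MPI-IO */
            H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        } else {
            /* MPI-POSIX driver: MPI for coordination, POSIX calls for I/O.
               Third argument (GPFS hints) is left off in this sketch. */
            H5Pset_fapl_mpiposix(fapl, MPI_COMM_WORLD, 0);
        }
        return fapl;   /* pass to H5Fcreate/H5Fopen, then H5Pclose it */
    }

The returned property list is then passed to H5Fcreate or H5Fopen, as in the programming model shown later in the deck.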

PHDF5 implementation layers (from application to disk)
• HDF5 Application (compute nodes)
• HDF5 I/O Library
• MPI I/O Library
• HDF5 file on a Parallel File System (switch network / I/O servers)
• Disk architecture and layout of data on disk

PHDF5 CONSISTENCY SEMANTICS

Consistency semantics
• Consistency semantics are rules that define the outcome of multiple, possibly concurrent, accesses to an object or data structure by one or more processes in a computer system.

PHDF5 consistency semantics
• The PHDF5 library defines a set of consistency semantics to let users know what to expect when processes access data managed by the library:
  - when the changes a process makes are actually visible to itself (if it tries to read back that data) or to other processes that access the same file with independent or collective I/O operations
• The consistency semantics vary depending on the driver used:
  - MPI-POSIX
  - MPI I/O

HDF5 MPI-POSIX consistency semantics
• Same as POSIX semantics

    Process 0          Process 1
    write()
    MPI_Barrier()      MPI_Barrier()
                       read()

• POSIX I/O guarantees that Process 1 will read what Process 0 has written: the atomicity of the read and write calls and the synchronization using the barrier ensure that Process 1 calls the read function only after Process 0 is finished with the write function.
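
To make that guarantee concrete, here is a minimal sketch of the slide's pattern using plain POSIX calls plus an MPI barrier; the file name and buffer are invented, and a POSIX-compliant file system is assumed.

    #include <mpi.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Rank 0 writes, all ranks synchronize, rank 1 reads the same bytes. */
    void posix_write_barrier_read(int rank)
    {
        char buf[16];

        if (rank == 0) {
            int fd = open("shared.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            strcpy(buf, "hello rank 1");
            write(fd, buf, sizeof(buf));
            close(fd);
        }

        MPI_Barrier(MPI_COMM_WORLD);     /* write happens before the read */

        if (rank == 1) {
            int fd = open("shared.dat", O_RDONLY);
            read(fd, buf, sizeof(buf));  /* sees what rank 0 wrote */
            close(fd);
        }
    }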

HDF5 MPI-I/O consistency semantics
• Same as MPI-I/O semantics

    Process 0              Process 1
    MPI_File_write_at()
    MPI_Barrier()          MPI_Barrier()
                           MPI_File_read_at()

• Default MPI-I/O semantics do not guarantee atomicity or the ordering implied by this sequence of calls!
• Problems may occur (although we haven't seen any) when writing/reading HDF5 metadata or raw data
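
For comparison, here is a hedged MPI-IO sketch of the same exchange. With default semantics the read is not guaranteed to see the write, so the sync-barrier-sync sequence mentioned on the next slide is added; file name and offset are invented.

    #include <mpi.h>

    void mpiio_write_then_read(int rank)
    {
        MPI_File fh;
        int value = 42;

        MPI_File_open(MPI_COMM_WORLD, "shared.mpio",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        if (rank == 0)
            MPI_File_write_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);

        /* sync-barrier-sync: flush the writer's data, synchronize,
           then refresh every process's view of the file */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);

        if (rank == 1)
            MPI_File_read_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }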

HDF5 MPI-I/O consistency semantics
• MPI I/O provides atomicity and sync-barrier-sync features to address the issue
• PHDF5 follows MPI I/O:
  - H5Fset_mpio_atomicity function to turn on MPI atomicity
  - H5Fsync function to transfer written data to the storage device (in implementation now)
• We are currently working on a reimplementation of the metadata cache for PHDF5 (metadata server)
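
A minimal sketch of turning MPI atomic mode on and off through HDF5 once the file has been opened with the MPI-IO driver; error checking is omitted.

    #include <hdf5.h>

    /* file_id must come from H5Fcreate/H5Fopen with the MPI-IO driver. */
    void with_atomic_mode(hid_t file_id)
    {
        /* Turn MPI-IO atomic mode on for this file (collective call) */
        H5Fset_mpio_atomicity(file_id, 1);

        /* ... reads/writes that need strict consistency ... */

        /* Atomic mode usually costs performance, so turn it back off */
        H5Fset_mpio_atomicity(file_id, 0);
    }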

HDF5 MPI-I/O consistency semantics
• For more information see “Enabling a strict consistency semantics model in parallel HDF5”, linked from the H5Fset_mpio_atomicity reference manual page

MPI-I/O VS. PHDF5

MPI-IO vs. HDF5
• MPI-IO is an Input/Output API
• It treats the data file as a “linear byte stream”, and each MPI application needs to provide its own file view and data representations to interpret those bytes

MPI-IO vs. HDF5
• All data stored are machine dependent except the “external32” representation
  - External32 is defined in big-endian byte order
  - Little-endian machines have to do data conversion in both read and write operations
  - 64-bit sized data types may lose information

MPI-IO vs. HDF5
• HDF5 is data management software
• It stores data and metadata according to the HDF5 data format definition
  - HDF5 file is self-describing
  - Each machine can store the data in its own native representation for efficient I/O without loss of data precision
  - Any necessary data representation conversion is done by the HDF5 library automatically

PERFORMANCE ANALYSIS

Performance analysis
• Some common causes of poor performance
• Possible solutions

My PHDF5 application I/O is slow
• Use larger I/O data sizes
• Independent vs. collective I/O
• Specific I/O system hints
• Increase parallel file system capacity

Write speed vs. block size (chart: write throughput as a function of I/O block size)

My PHDF5 application I/O is slow
• Use larger I/O data sizes
• Independent vs. collective I/O
• Specific I/O system hints
• Increase parallel file system capacity

Independent vs. collective access
• A user reported that independent data transfer mode was much slower than collective data transfer mode
• The data array was tall and thin: 230,000 rows by 6 columns

Collective vs. independent calls
• MPI definition of collective calls: all processes of the communicator must participate in calls, in the right order. E.g.,

    Process 1              Process 2
    call A(); call B();    call A(); call B();    **right**
    call A(); call B();    call B(); call A();    **wrong**

• Independent means not collective
• Collective is not necessarily synchronous
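
Applied to PHDF5, the rule means every rank must issue the collective calls in the same order. A hedged sketch with invented dataset names:

    #include <hdf5.h>

    /* Every rank must create the datasets in the same order. */
    void create_two_datasets(hid_t file_id, hid_t space_id)
    {
        hid_t dset_a = H5Dcreate(file_id, "A", H5T_NATIVE_INT, space_id,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dset_b = H5Dcreate(file_id, "B", H5T_NATIVE_INT, space_id,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        /* Creating "B" before "A" on only some ranks would be the **wrong**
           case from the slide: collective calls issued in different orders. */
        H5Dclose(dset_b);
        H5Dclose(dset_a);
    }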

Debug slow parallel I/O speed (1)
• Writing to one dataset
  - Using 4 processes == 4 columns
  - Data type is 8-byte doubles
  - 4 processes, 1000 rows == 4 x 1000 x 8 = 32,000 bytes
• % mpirun -np 4 ./a.out i t 1000
  - Execution time: 1.783798 s.
• % mpirun -np 4 ./a.out i t 2000
  - Execution time: 3.838858 s.
• Difference of 2 seconds for 1000 more rows = 32,000 bytes
• Speed of 16 KB/sec!!! Way too slow.

Debug slow parallel I/O speed (2)
• Build a version of PHDF5 with
  - ./configure --enable-debug --enable-parallel …
• This allows the tracing of MPI-IO calls in the HDF5 library
• E.g., to trace MPI_File_read_xx and MPI_File_write_xx calls:
  - % setenv H5FD_mpio_Debug "rw"

Debug slow parallel I/O speed (3)

    % setenv H5FD_mpio_Debug 'rw'
    % mpirun -np 4 ./a.out i t 1000    # Indep.; contiguous.
    in H5FD_mpio_write mpi_off=0    size_i=96
    in H5FD_mpio_write mpi_off=2056 size_i=8
    in H5FD_mpio_write mpi_off=2048 size_i=8
    in H5FD_mpio_write mpi_off=2072 size_i=8
    in H5FD_mpio_write mpi_off=2064 size_i=8
    in H5FD_mpio_write mpi_off=2088 size_i=8
    in H5FD_mpio_write mpi_off=2080 size_i=8
    …

• Total of 4000 of these little 8-byte writes == 32,000 bytes

Independent calls are many and small
• Each process writes one element of one row, skips to the next row, writes one element, and so on
• Each process issues 230,000 writes of 8 bytes each

Debug slow parallel I/O speed (4)

    % setenv H5FD_mpio_Debug 'rw'
    % mpirun -np 4 ./a.out i h 1000    # Indep., chunked by column.
    in H5FD_mpio_write mpi_off=0     size_i=96
    in H5FD_mpio_write mpi_off=3688  size_i=8000
    in H5FD_mpio_write mpi_off=11688 size_i=8000
    in H5FD_mpio_write mpi_off=27688 size_i=8000
    in H5FD_mpio_write mpi_off=19688 size_i=8000
    in H5FD_mpio_write mpi_off=96    size_i=40
    in H5FD_mpio_write mpi_off=136   size_i=544
    in H5FD_mpio_write mpi_off=680   size_i=120
    in H5FD_mpio_write mpi_off=800   size_i=272
    …
    Execution time: 0.011599 s.

Use collective mode or chunked storage
• Collective I/O will combine many small independent calls into a few bigger calls
• Chunking by column speeds things up too
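
As a hedged sketch of both remedies for the tall-and-thin case (dataset name and buffer handling are invented, error checking omitted): the dataset is created chunked by column so each process's data is contiguous in the file, and the transfer is requested as collective.

    #include <hdf5.h>

    #define NROWS 230000
    #define NCOLS 6

    void create_and_write_by_column(hid_t file_id, int my_col,
                                    const double *col_data)
    {
        hsize_t dims[2]  = {NROWS, NCOLS};
        hsize_t chunk[2] = {NROWS, 1};            /* one column per chunk */
        hsize_t start[2] = {0, (hsize_t)my_col};
        hsize_t count[2] = {NROWS, 1};

        /* Chunked layout: each process's column is contiguous on disk */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);

        hid_t fspace = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate(file_id, "tall_thin", H5T_NATIVE_DOUBLE,
                               fspace, H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* Each process selects its own column in the file */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(2, count, NULL);

        /* Collective transfer: many small writes become a few big ones */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, col_data);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Pclose(dcpl); H5Dclose(dset);
    }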

Collective vs. independent write (chart: seconds to write vs. data size in MBs, 0.25 to 2.75 MB, for independent and collective writes)

My PHDF5 application I/O is slow
• Use larger I/O data sizes
• Independent vs. collective I/O
• Specific I/O system hints
• Increase parallel file system capacity

Effects of I/O hints: IBM_largeblock_io
• GPFS on the LLNL ASCI Blue machine (chart: MB/sec)
• 4 nodes, 16 tasks
• Total data size 1024 MB
• I/O buffer size 1 MB
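
Hints like this reach MPI-IO through the MPI_Info object handed to the MPI-IO driver. A sketch follows; the hint key is the one named on the slide, while the value string "true" is the commonly documented GPFS usage and should be checked against your MPI's documentation.

    #include <mpi.h>
    #include <hdf5.h>

    hid_t fapl_with_gpfs_hint(void)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        /* GPFS hint named on the slide; value string assumed */
        MPI_Info_set(info, "IBM_largeblock_io", "true");

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

        MPI_Info_free(&info);   /* HDF5 keeps its own copy */
        return fapl;
    }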

My PHDF5 application I/O is slow
• If my application I/O performance is slow, what can I do?
  - Use larger I/O data sizes
  - Independent vs. collective I/O
  - Specific I/O system hints
  - Increase parallel file system capacity
    - Add more disks, interconnect, or I/O servers to your hardware setup
    - Expensive!

PARALLEL TOOLS

Parallel Tools
• h5perf
  - Performance measuring tool showing I/O performance for different I/O APIs

h5perf
• An I/O performance measurement tool
• Tests 3 file I/O APIs:
  - POSIX I/O (open/write/read/close…)
  - MPI-IO (MPI_File_{open,write,read,close})
  - PHDF5
    - H5Pset_fapl_mpio (using MPI-IO)
    - H5Pset_fapl_mpiposix (using POSIX I/O)
• An indication of I/O speed upper limits

Useful Parallel HDF Links
• Parallel HDF information site: http://www.hdfgroup.org/HDF5/PHDF5/
• Parallel HDF5 tutorial available at: http://www.hdfgroup.org/HDF5/Tutor/
• HDF Help email address: help@hdfgroup.org

HDF5 PROGRAMMING MODEL

How to compile PHDF5 applications
• h5pcc – HDF5 C compiler command
  - Similar to mpicc
• h5pfc – HDF5 F90 compiler command
  - Similar to mpif90
• To compile:
  - % h5pcc h5prog.c
  - % h5pfc h5prog.f90

Programming restrictions
• PHDF5 opens a parallel file with a communicator
  - Returns a file handle
  - Future access to the file via the file handle
• All processes must participate in collective PHDF5 APIs
• Different files can be opened via different communicators
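
To illustrate the last two points, the hedged sketch below splits MPI_COMM_WORLD and has each sub-communicator collectively open its own file; the split criterion and file names are invented.

    #include <mpi.h>
    #include <hdf5.h>

    void one_file_per_group(void)
    {
        int rank;
        MPI_Comm group_comm;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Example split: even ranks in one group, odd ranks in the other */
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &group_comm);

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, group_comm, MPI_INFO_NULL);

        /* Collective over group_comm only: every process in this group
           must make the call; processes in the other group do not */
        const char *name = (rank % 2 == 0) ? "even.h5" : "odd.h5";
        hid_t file_id = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... dataset creation and I/O scoped to this communicator ... */

        H5Fclose(file_id);
        H5Pclose(fapl);
        MPI_Comm_free(&group_comm);
    }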

Collective HDF5 calls
• All HDF5 APIs that modify structural metadata are collective
  - File operations: H5Fcreate, H5Fopen, H5Fclose
  - Object creation: H5Dcreate, H5Dclose
  - Object structure modification (e.g., dataset extent modification): H5Dextend
• http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html

Other HDF5 calls
• Array data transfer can be collective or independent
  - Dataset operations: H5Dwrite, H5Dread
• Collectiveness is indicated by function parameters, not by function names as in the MPI API
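
Concretely, the mode travels in the dataset transfer property list passed to H5Dwrite/H5Dread rather than in the function name; a minimal sketch:

    #include <hdf5.h>

    /* Same H5Dwrite call; the transfer property list decides the mode. */
    void write_with_mode(hid_t dset, hid_t mspace, hid_t fspace,
                         const int *buf, int collective)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                          : H5FD_MPIO_INDEPENDENT);
        H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, buf);
        H5Pclose(dxpl);
    }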

What does PHDF5 support?
• After a file is opened by the processes of a communicator:
  - All parts of the file are accessible by all processes
  - All objects in the file are accessible by all processes
  - Multiple processes may write to the same data array
  - Each process may write to an individual data array

PHDF5 API languages
• C and Fortran 90/2003 language interfaces
• Platforms supported:
  - Most platforms with MPI-IO support, e.g.,
    - IBM AIX
    - Linux clusters
    - Cray XT

Programming model
• HDF5 uses an access template object (property list) to control the file access mechanism
• General model to access an HDF5 file in parallel:
  - Set up the MPI-IO access template (file access property list)
  - Open the file
  - Access data
  - Close the file

Moving your sequential application to the HDF5 parallel world
MY FIRST PARALLEL HDF5 PROGRAM

Example of a PHDF5 C program
A parallel HDF5 program has these extra calls:

    MPI_Init(&argc, &argv);
    1. fapl_id  = H5Pcreate(H5P_FILE_ACCESS);
    2. H5Pset_fapl_mpio(fapl_id, comm, info);
    3. file_id  = H5Fcreate(FNAME, …, fapl_id);
    4. space_id = H5Screate_simple(…);
    5. dset_id  = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, …);
    6. xf_id    = H5Pcreate(H5P_DATASET_XFER);
    7. H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
    8. status   = H5Dwrite(dset_id, H5T_NATIVE_INT, …, xf_id, …);
    MPI_Finalize();
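
For reference, here is a complete, compilable variant of that skeleton, under stated assumptions: the file and dataset names are invented, every process collectively writes the same small 1-D dataset, and error checking is omitted. It is a sketch, not the workshop's original source.

    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* 1-2: file access property list with the MPI-IO driver */
        hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* 3: collective file creation */
        hid_t file_id = H5Fcreate("my_first_phdf5.h5", H5F_ACC_TRUNC,
                                  H5P_DEFAULT, fapl_id);

        /* 4-5: dataspace and dataset creation (collective) */
        hsize_t dims[1] = {8};
        hid_t space_id = H5Screate_simple(1, dims, NULL);
        hid_t dset_id = H5Dcreate(file_id, "dset", H5T_NATIVE_INT, space_id,
                                  H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* 6-7: transfer property list requesting collective I/O */
        hid_t xf_id = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);

        /* 8: collective write (every rank supplies the same buffer here) */
        int data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, xf_id, data);

        H5Pclose(xf_id);
        H5Dclose(dset_id);
        H5Sclose(space_id);
        H5Fclose(file_id);
        H5Pclose(fapl_id);

        MPI_Finalize();
        return 0;
    }

Compile it with h5pcc and launch it with mpirun, exactly as shown on the earlier compilation slide.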

Writing patterns
EXAMPLE

Parallel HDF5 tutorial examples
• For simple examples of how to write different data patterns see http://www.hdfgroup.org/HDF5/Tutor/parallel.html

Programming model
• Each process defines the memory and file hyperslabs using H5Sselect_hyperslab
• Each process executes a write/read call using the hyperslabs defined; the call is either collective or independent
• The hyperslab start, count, stride, and block parameters define the portion of the dataset to write to (a by-rows sketch follows this list):
  - Contiguous hyperslab
  - Regularly spaced data (column or row)
  - Pattern
  - Chunks
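
Here is a hedged sketch of the by-rows case that produces output like the next slide: with nprocs processes and an 8 x 5 dataset, each rank selects a contiguous block of rows (assumed to divide evenly) and fills it with rank + 10. Names are illustrative and error checking is omitted.

    #include <mpi.h>
    #include <hdf5.h>

    #define NROWS 8
    #define NCOLS 5

    /* Each rank writes its own block of rows, filled with (rank + 10). */
    void write_by_rows(hid_t file_id, int rank, int nprocs)
    {
        hsize_t dims[2]  = {NROWS, NCOLS};
        hsize_t count[2] = {NROWS / nprocs, NCOLS};     /* my block of rows */
        hsize_t start[2] = {(hsize_t)rank * count[0], 0};

        hid_t fspace = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate(file_id, "IntArray", H5T_NATIVE_INT, fspace,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* File selection: this rank's rows; memory space matches its shape */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(2, count, NULL);

        int data[NROWS * NCOLS];                        /* large enough */
        for (hsize_t i = 0; i < count[0] * count[1]; i++)
            data[i] = rank + 10;

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, data);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
    }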

Four processes writing by rows

    HDF5 "SDS_row.h5" {
    GROUP "/" {
       DATASET "IntArray" {
          DATATYPE  H5T_STD_I32BE
          DATASPACE  SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
          DATA {
             10, 10, 10, 10, 10,
             10, 10, 10, 10, 10,
             11, 11, 11, 11, 11,
             11, 11, 11, 11, 11,
             12, 12, 12, 12, 12,
             12, 12, 12, 12, 12,
             13, 13, 13, 13, 13,
             13, 13, 13, 13, 13
          }
       }
    }
    }

Two processes writing by columns

    HDF5 "SDS_col.h5" {
    GROUP "/" {
       DATASET "IntArray" {
          DATATYPE  H5T_STD_I32BE
          DATASPACE  SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
          DATA {
             1, 2, 10, 20, 100, 200,
             1, 2, 10, 20, 100, 200,
             1, 2, 10, 20, 100, 200,
             1, 2, 10, 20, 100, 200,
             1, 2, 10, 20, 100, 200,
             1, 2, 10, 20, 100, 200,
             1, 2, 10, 20, 100, 200,
             1, 2, 10, 20, 100, 200
          }
       }
    }
    }

Four processes writing by pattern

    HDF5 "SDS_pat.h5" {
    GROUP "/" {
       DATASET "IntArray" {
          DATATYPE  H5T_STD_I32BE
          DATASPACE  SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
          DATA {
             1, 3, 1, 3,
             2, 4, 2, 4,
             1, 3, 1, 3,
             2, 4, 2, 4,
             1, 3, 1, 3,
             2, 4, 2, 4,
             1, 3, 1, 3,
             2, 4, 2, 4
          }
       }
    }
    }

Four processes writing by chunks

    HDF5 "SDS_chnk.h5" {
    GROUP "/" {
       DATASET "IntArray" {
          DATATYPE  H5T_STD_I32BE
          DATASPACE  SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
          DATA {
             1, 1, 2, 2,
             1, 1, 2, 2,
             1, 1, 2, 2,
             1, 1, 2, 2,
             3, 3, 4, 4,
             3, 3, 4, 4,
             3, 3, 4, 4,
             3, 3, 4, 4
          }
       }
    }
    }

The HDF Group
Thank You!
Questions?