HDF5 Life cycle of data
12/22/2021, HDF and HDF-EOS Workshop X, Landover, MD

Outline
• “Life cycle” of HDF5 data
• I/O operations for datasets with different storage layouts
  • Compact dataset
  • Contiguous dataset
    • Datatype conversion
    • Partial I/O for contiguous dataset
  • Chunked dataset
    • I/O for chunked dataset
• Variable length datasets and I/O

Life cycle of HDF5 data
• Life cycle: what happens to data when it is transferred from the application buffer to an HDF5 file?
[Diagram: Application (data buffer) → Object API (H5Dwrite) → Library internals (“magic box”) → Virtual file I/O → Unbuffered I/O → File or other “storage” (data in a file)]

“Life cycle” of HDF5 data: inside the magic box
• Operations on data inside the magic box
  • Datatype conversion
  • Scattering - gathering
  • Data transformation (filters, compression)
  • Copying to/from internal buffers
• Concepts involved
  • HDF5 metadata, metadata cache
  • Chunking, chunk cache
• Data structures used
  • B-trees (groups, dataset chunks)
  • Hash tables
  • Local and global heaps (variable length data: link names, strings, etc.)

“Life cycle” of HDF5 data: inside the magic box
• Understanding what happens to data inside the magic box helps in writing efficient applications
• The HDF5 library has mechanisms to control behavior inside the magic box
• Goals of this and the next talk are to
  • Introduce the basic concepts and internal data structures and explain how they affect performance and storage sizes
  • Give some “recipes” for how to improve performance

Operations on data inside the magic box
• Datatype conversion
  • Examples:
    • float ↔ integer
    • LE ↔ BE
    • 64-bit integer to 16-bit integer (overflow may occur!)
• Scattering - gathering
  • Data is scattered/gathered between the user’s buffers and internal buffers for datatype conversion and partial I/O
• Data transformation (filters, compression)
  • Checksum on raw data and metadata (in 1.8.0)
  • Algebraic transform
  • GZIP and SZIP compression
  • User-defined filters
• Copying to/from internal buffers
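
The two conversion examples above (LE ↔ BE and 64-bit to 16-bit) can be sketched in plain C. This is an illustrative model, not the HDF5 library's conversion code: the function names are hypothetical, and clamping out-of-range values is an assumed overflow policy chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value (little-endian <-> big-endian). */
uint32_t swap_bytes_u32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Narrow a 64-bit integer to 16 bits.
 * Assumed policy: clamp to the 16-bit range and report the overflow
 * through *overflowed (1 if the value did not fit, 0 otherwise). */
int16_t narrow_i64_to_i16(int64_t v, int *overflowed)
{
    assert(overflowed != NULL);
    *overflowed = 0;
    if (v > INT16_MAX) { *overflowed = 1; return INT16_MAX; }
    if (v < INT16_MIN) { *overflowed = 1; return INT16_MIN; }
    return (int16_t)v;
}
```

Either conversion touches every element, which is why a write that needs no conversion is cheaper than one that does.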

“Life cycle” of HDF5 data: inside the magic box
• HDF5 metadata
  • Information about HDF5 objects used by the library
  • Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
  • Usually small compared to raw data sizes (KB vs. MB-GB)
• Metadata cache
  • Space allocated to handle pieces of the HDF5 metadata
  • Allocated by the HDF5 library in the application’s memory space
  • Cache behavior affects overall performance
  • Will cover in the next talk

“Life cycle” of HDF5 data: inside the magic box
• Chunking mechanism
  • Chunking: a storage layout where a dataset is partitioned into fixed-size multi-dimensional tiles, or chunks
  • Used for extendible datasets and datasets with filters applied (checksum, compression)
  • The HDF5 library treats each chunk as an atomic object
  • Greatly affects performance and file sizes
• Chunk cache
  • Created for each chunked dataset
  • Default size 1 MB

Writing a dataset
Metadata:
• Dataspace: Rank 3; Dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: IEEE 32-bit float
• Storage info: Chunked, Compressed
• Attributes: Time = 32.4, Pressure = 987, Temp = 56

I/O operations for HDF5 datasets with different storage layouts
• Storage layouts
  • Compact
  • Contiguous
  • Chunked
• I/O performance depends on
  • Dataset storage properties
  • Chunking strategy
  • Metadata cache performance
  • Etc.

Writing a compact dataset
[Diagram: the dataset header (datatype, dataspace, attributes, and the data itself) passes from application memory through the metadata cache to the file]
• Raw data is stored within the dataset header

Writing a contiguous dataset with no datatype conversion
[Diagram: the dataset header (datatype, dataspace, attributes) passes through the metadata cache; the user buffer (5 x 4 x 7 matrix) is written from application memory to the file]

Writing a contiguous dataset with conversion
[Diagram: the dataset header (datatype, dataspace, attributes) passes through the metadata cache; the dataset raw data passes through a 1 MB conversion buffer in application memory before it is written to the file]

Sub-setting of contiguous dataset: series of adjacent rows
• Application data in memory: M adjacent rows of N elements
• One I/O operation for the M rows
• Data is contiguous in the file

Sub-setting of contiguous dataset: adjacent, partial rows
• Application data in memory: M partial rows of N elements each
• Several small I/O operations
• Data is scattered in the file in M contiguous blocks

Sub-setting of contiguous dataset: extreme case, writing a column
• Application data in memory: M rows of 1 element each
• Several small I/O operations
• Data is scattered in the file in M different locations
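
The three sub-setting cases (adjacent full rows, adjacent partial rows, a single column) differ only in how many contiguous runs the selection occupies in the file. A minimal sketch of that count, assuming a row-major 2-D dataset; the type and function names are hypothetical and this is not how the library represents selections internally:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t nblocks;      /* number of contiguous runs in the file */
    size_t block_elems;  /* elements in each run */
} io_plan;

/* Selection of sel_rows adjacent rows and sel_cols adjacent columns
 * in a row-major dataset that is ncols elements wide. */
io_plan plan_contiguous_io(size_t ncols, size_t sel_rows, size_t sel_cols)
{
    assert(ncols > 0 && sel_rows > 0);
    assert(sel_cols > 0 && sel_cols <= ncols);

    io_plan p;
    if (sel_cols == ncols) {
        /* Full rows: adjacent rows merge into a single run. */
        p.nblocks = 1;
        p.block_elems = sel_rows * ncols;
    } else {
        /* Partial rows: one run per row (a column is the 1-element case). */
        p.nblocks = sel_rows;
        p.block_elems = sel_cols;
    }
    return p;
}
```

For a dataset 100 elements wide, plan_contiguous_io(100, 10, 100) gives one run of 1000 elements, while plan_contiguous_io(100, 10, 1) gives 10 runs of one element each.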

Sub-setting of contiguous dataset: data sieve buffer
• Application data in memory: M rows of 1 element each
• Data is gathered in a 64 KB sieve buffer in memory (memcopy)
• Data is scattered in the file in M contiguous blocks
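
A rough way to see what the sieve buffer buys is to count how many times the buffer must be filled from the file instead of issuing one I/O operation per element. The sketch below is a deliberately simplified model under stated assumptions (elements at a fixed byte stride, the buffer refilled starting at the first element it does not cover); the function name is hypothetical and the library's actual sieve policy may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Count buffer fills needed to touch m elements of elem_size bytes,
 * spaced stride bytes apart in the file, using a buffer of buf_size bytes. */
size_t sieve_fills(size_t m, size_t stride, size_t elem_size, size_t buf_size)
{
    assert(elem_size > 0 && elem_size <= buf_size);
    assert(stride >= elem_size);

    size_t fills = 0;
    size_t buf_end = 0;  /* file offset just past the buffered region */

    for (size_t i = 0; i < m; i++) {
        size_t off = i * stride;
        if (fills == 0 || off + elem_size > buf_end) {
            /* Element is not fully covered: refill starting at it. */
            buf_end = off + buf_size;
            fills++;
        }
    }
    return fills;
}
```

With 1000 four-byte elements spaced 400 bytes apart and a 64 KB buffer, the model needs 7 fills instead of 1000 separate operations. When the stride exceeds the buffer size, the buffer gives no benefit and every element costs its own fill.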

Performance tuning for contiguous dataset
• Datatype conversion
  • Avoid for better performance
  • Use the H5Pset_buffer function to customize the conversion buffer size
• Partial I/O
  • Write/read in big contiguous blocks (at least the size of a file system block)
  • Use H5Pset_sieve_buf_size to improve performance for complex subsetting

Possible tuning work
• Datatype conversion
  • Use of multiple threads for datatype conversion
• Partial I/O
  • OS vector I/O
  • Asynchronous I/O

Writing chunked dataset
• Dimension sizes: X x Y x Z
• Dataset is partitioned into fixed-size multi-dimensional chunks of size X/4 x Y/2 x Z
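
The partitioning can be made concrete with a little index arithmetic: how many chunks cover the dataset, and which chunk holds a given element. This is a sketch for a rank-3 dataset with hypothetical function names, not the library's chunk indexing code.

```c
#include <assert.h>
#include <stddef.h>

enum { RANK = 3 };

/* Number of chunks needed along one dimension (last chunk may be partial). */
size_t chunks_along(size_t dim, size_t chunk)
{
    assert(chunk > 0);
    return (dim + chunk - 1) / chunk;
}

/* Total number of chunks covering the whole dataset. */
size_t total_chunks(const size_t dims[RANK], const size_t chunk[RANK])
{
    size_t n = 1;
    for (int i = 0; i < RANK; i++)
        n *= chunks_along(dims[i], chunk[i]);
    return n;
}

/* Chunk coordinates of the chunk that contains element coord. */
void chunk_of_element(const size_t coord[RANK], const size_t chunk[RANK],
                      size_t out[RANK])
{
    for (int i = 0; i < RANK; i++) {
        assert(chunk[i] > 0);
        out[i] = coord[i] / chunk[i];
    }
}
```

For an 8 x 6 x 5 dataset with chunks of X/4 x Y/2 x Z = 2 x 3 x 5, total_chunks returns 8, and element (5, 4, 2) falls in chunk (2, 1, 0).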

Extending chunked dataset in any dimension
• Data can be added in any dimension
• Compression is applied to each chunk
• Datatype conversion is applied to each chunk

Writing chunked dataset
[Diagram: chunks A, B, and C of a chunked dataset pass through the chunk cache and the filter pipeline and land at separate locations in the file]
• Each chunk is written as a contiguous blob
• Chunks may be scattered all over the file
• Compression is performed when a chunk is evicted from the chunk cache
• Other filters (e.g. encryption) are applied when data goes through the filter pipeline

Writing chunked dataset
[Diagram: application memory holds the metadata cache (Dataset_1 header ... Dataset_N header, chunking B-tree nodes) and the chunk cache]
• Chunk cache: default size is 1 MB
• Size of the chunk cache is set per file
• Each chunked dataset has its own chunk cache
• A chunk may be too big to fit into the cache
• Memory may grow if the application keeps opening datasets
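
Whether a chunk fits in the cache is simple arithmetic on the chunk dimensions and element size. A minimal sketch with hypothetical helper names, using the 1 MB default quoted above:

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one uncompressed chunk. */
size_t chunk_bytes(const size_t *chunk_dims, size_t rank, size_t elem_size)
{
    assert(chunk_dims != NULL && rank > 0 && elem_size > 0);
    size_t bytes = elem_size;
    for (size_t i = 0; i < rank; i++) {
        assert(chunk_dims[i] > 0);
        bytes *= chunk_dims[i];
    }
    return bytes;
}

/* How many whole chunks fit in a cache of cache_bytes; 0 means the chunk
 * is too big to be cached at all. */
size_t chunks_that_fit(size_t cache_bytes, size_t one_chunk_bytes)
{
    assert(one_chunk_bytes > 0);
    return cache_bytes / one_chunk_bytes;
}
```

A 100 x 100 chunk of 4-byte floats is 40000 bytes, so 26 such chunks fit in a 1 MB cache. A 1000 x 1000 chunk of the same type is about 4 MB and does not fit at all.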

Partial I/O for chunked dataset
• Build a list of chunks and loop through the list
• For each chunk:
  • Bring the chunk into memory
  • Map the selection in memory to the selection in the file
  • Gather elements into the conversion buffer and perform conversion
  • Scatter elements back to the chunk
  • Apply filters (compression) when the chunk is flushed from the chunk cache
• For each element, 3 memcopies are performed
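
The first step, building the list of chunks, amounts to finding every chunk that a selection overlaps. A sketch for a rectangular selection in a 2-D dataset; the type and function names are hypothetical and the library's own selection handling is more general:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t row;  /* chunk index along the first dimension */
    size_t col;  /* chunk index along the second dimension */
} chunk_coord;

/* Selection starts at element (r0, c0) and spans nr x nc elements;
 * chunks are cr x cc elements. Writes up to max_out chunk coordinates
 * into out (may be NULL) and returns the total number of chunks touched. */
size_t chunks_in_selection(size_t r0, size_t c0, size_t nr, size_t nc,
                           size_t cr, size_t cc,
                           chunk_coord *out, size_t max_out)
{
    assert(nr > 0 && nc > 0 && cr > 0 && cc > 0);

    size_t first_r = r0 / cr, last_r = (r0 + nr - 1) / cr;
    size_t first_c = c0 / cc, last_c = (c0 + nc - 1) / cc;
    size_t n = 0;

    for (size_t r = first_r; r <= last_r; r++) {
        for (size_t c = first_c; c <= last_c; c++) {
            if (out != NULL && n < max_out) {
                out[n].row = r;
                out[n].col = c;
            }
            n++;
        }
    }
    return n;
}
```

A 4 x 4 selection starting at element (3, 3) in a dataset with 5 x 5 chunks straddles four chunks, so the per-chunk loop above runs four times even though only 16 elements are involved.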

Partial I/O for chunked dataset
[Diagram: application buffer and chunk in application memory, connected by a memcopy]
• Elements participating in I/O are gathered into the corresponding chunk

Partial I/O for chunked dataset
[Diagram: in application memory, data is gathered from the chunk cache into the conversion buffer and scattered back; the chunk is then written to the file]
• On eviction from the cache, the chunk is compressed and written to the file

Variable length datasets and I/O
• Examples of variable-length data
  • String
      A[0]   “the first string we want to write”
      ...
      A[N-1] “the N-th string we want to write”
  • Each element is a record of variable length
      A[0] (1, 1, 0, 0, 0, 5, 6, 7, 8, 9)   length of the first record is 10
      A[1] (0, 0, 110, 2005)
      ...
      A[N] (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., M)   length of the (N+1)-th record is M

Variable length datasets and I/O
• Variable length description in an HDF5 application
    typedef struct {
        size_t len;
        void  *p;
    } hvl_t;
• Base type can be any HDF5 type: H5Tvlen_create(base_type)
• ~20 bytes overhead for each element
• Raw data cannot be compressed
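
To show how an application buffer of variable-length records is laid out in memory, the sketch below fills an array of length-plus-pointer records. It deliberately defines its own struct, vl_record, with the same two-field layout as the hvl_t shown on the slide, so that it compiles without the HDF5 headers (where the length field of hvl_t is named len); the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t len;  /* number of base-type elements in this record */
    void  *p;    /* pointer to the record's elements */
} vl_record;

/* Copy n ints into a freshly allocated record. Returns 0 on success,
 * -1 if allocation fails. */
int vl_record_set(vl_record *rec, const int *values, size_t n)
{
    assert(rec != NULL);
    rec->len = 0;
    rec->p = NULL;
    if (n == 0)
        return 0;
    assert(values != NULL);
    rec->p = malloc(n * sizeof *values);
    if (rec->p == NULL)
        return -1;
    memcpy(rec->p, values, n * sizeof *values);
    rec->len = n;
    return 0;
}

/* Release a record's storage and reset it to empty. */
void vl_record_free(vl_record *rec)
{
    assert(rec != NULL);
    free(rec->p);
    rec->p = NULL;
    rec->len = 0;
}

/* Total number of elements across all records. */
size_t vl_total_elements(const vl_record *recs, size_t nrecs)
{
    size_t total = 0;
    for (size_t i = 0; i < nrecs; i++)
        total += recs[i].len;
    return total;
}
```

Each record owns a separate allocation, so the application buffer itself holds only lengths and pointers; the elements live elsewhere in memory.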

Variable length datasets and I/O
[Diagram: application buffer, raw data, and global heap]
• Elements in the application buffer point to global heaps where the actual data is stored

Writing chunked VL datasets
[Diagram: application memory holds the metadata cache (dataset header, B-tree nodes, global heap) and the chunk cache; raw data of a VL chunked dataset with a selected region passes through the chunk cache buffer, conversion, and the filter pipeline to the file]

VL chunked dataset in a file
[Diagram: the file contains the dataset header, the chunking B-tree, the dataset chunks, and the raw data]

Variable length datasets and I/O
• Hints
  • Avoid closing/opening a file while writing VL datasets
    • Global heap information is lost
    • Global heaps may have unused space
  • Avoid writing VL datasets interchangeably
    • Data from different datasets is written to the same heap
  • If the maximum length of the record is known, use fixed-length records and compression
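
The last hint can be checked with a back-of-the-envelope estimate. The sketch below compares uncompressed raw sizes only, takes the per-element overhead as a parameter (the slides quote roughly 20 bytes), and ignores heap and metadata layout; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Raw bytes for nrecs fixed-length records padded to max_len elements. */
size_t fixed_storage_bytes(size_t nrecs, size_t max_len, size_t elem_size)
{
    assert(elem_size > 0);
    return nrecs * max_len * elem_size;
}

/* Raw bytes for nrecs variable-length records holding total_elems elements
 * in all, plus an assumed per-record overhead in bytes. */
size_t vl_storage_bytes(size_t total_elems, size_t nrecs,
                        size_t elem_size, size_t per_record_overhead)
{
    assert(elem_size > 0);
    return total_elems * elem_size + nrecs * per_record_overhead;
}
```

For 1000 records of at most 10 four-byte elements, fixed-length storage takes 40000 bytes. If the records average 8 elements, the variable-length estimate with 20 bytes of overhead per record is 52000 bytes, and unlike the fixed-length layout it cannot be reduced further by compression.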

Thank you! Questions?

Acknowledgement
This report is based upon work supported in part by a Cooperative Agreement with NASA under NASA NNG05GC60A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration.