The HDF Group: HDF5 Datasets and I/O
- Slides: 56
The HDF Group
HDF5 Datasets and I/O: Dataset storage and its effect on performance
May 30-31, 2012, HDF5 Workshop at PSI, www.hdfgroup.org

Outline
• Dataset metadata and array data storage layouts
• Types of dataset storage layouts
• Factors affecting I/O performance
• I/O with compact datasets
• I/O with contiguous datasets
• I/O with chunked datasets
• Variable length data and I/O

HDF5 Layers
[Figure: the path of data from the application buffer to the HDF5 file. HDF5 Object Layer (API): H5Dwrite is called. HDF5 Internals: data is prepared for I/O. VFD Layer: the SEC2 driver performs I/O on the HDF5 file.]

Goal of this talk
• Present what happens to data inside the HDF5 library
• Show how an application can control the HDF5 library's behavior
• Specifically:
  - Describe some basic operations and data structures and explain how they affect performance and storage sizes
  - Give some “recipes” for improving performance

HDF5 DATASET METADATA

HDF5 Dataset
• Data array
  - Also called raw data
• Metadata
  - Dataspace: rank and dimensions of the dataset array
  - Datatype: information on how to interpret the data
  - Storage properties: how the array is organized on disk
  - Attributes: user-defined metadata (optional)

HDF5 dataset components
[Figure: a dataset consists of a dataset header (metadata) and a dataset data array (raw data). The example header contains:]
• Dataspace: rank 3; Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: IEEE 32-bit float
• Storage info: chunked, compressed
• Attributes: Time = 32.4, Pressure = 987, Temp = 56

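As a small worked example, the dataspace and datatype in the header above are enough to compute the uncompressed size of the raw data. The sketch below also computes a byte offset under the assumption of a plain row-major (C-order) array with no chunking or compression; the helper name `byte_offset` is illustrative and not part of HDF5.

```python
from math import prod

dims = (4, 5, 7)      # Dim_1, Dim_2, Dim_3 from the example header
element_size = 4      # an IEEE 32-bit float occupies 4 bytes

n_elements = prod(dims)
raw_bytes = n_elements * element_size


def byte_offset(index, dims, element_size):
    """Byte offset of one element in an unchunked row-major array."""
    linear = 0
    for i, d in zip(index, dims):
        if not 0 <= i < d:
            raise IndexError(f"index {index} is outside dimensions {dims}")
        linear = linear * d + i
    return linear * element_size


print(n_elements, raw_bytes)                        # 140 560
print(byte_offset((1, 2, 3), dims, element_size))   # 208
```

The raw data here is only 560 bytes, which illustrates why the header and other metadata can matter for very small datasets.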
HDF5 metadata
• Information about HDF5 objects used by the HDF5 library
• Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
• Usually small compared to raw data sizes (KB vs. MB-GB)

HDF5 metadata cache
[Figure: in application memory, the dataset header resides in the metadata cache (MDC), which is handled by the HDF5 library, next to the dataset array data. In the HDF5 file, metadata is mixed with raw data.]

HDF5 metadata cache
• Metadata cache
  - Space allocated to handle pieces of the HDF5 metadata
  - Allocated by the HDF5 library in the application's memory space
  - Allocated per file; released when the file is closed
• Metadata cache behavior affects overall performance
• The metadata cache implementation prior to HDF5 1.6.5 could cause performance degradation for some applications

HDF5 DATASET STORAGE LAYOUTS

HDF5 dataset storage layouts
• Contiguous
• External
• Chunked
• Compact

Contiguous storage layout
• Contiguous is the default storage layout for an HDF5 dataset
• Dataset raw data is stored in one contiguous block in the HDF5 file

Contiguous storage layout
[Figure: the dataset header is held in the metadata cache (MDC) in application memory. In the HDF5 file, the header is stored separately and the raw data is stored in one contiguous block.]

External storage layout
• Dataset raw data is stored in one or more external files, which should be kept together with the HDF5 file
• The layout in the external file is specified by the application
• An easy way to make legacy data available to the HDF5 library

External storage layout
[Figure: the dataset header is held in the metadata cache (MDC) in application memory. Metadata is stored in the HDF5 file; raw data is stored in a separate Unix/Windows file as specified by the application.]

Chunked storage layout
• Chunking: a storage layout where a dataset is partitioned into fixed-size multi-dimensional tiles, or chunks
• Each chunk is stored as a contiguous block
• The HDF5 library treats each chunk as an atomic object for I/O
• Greatly affects performance and file sizes
• Use for extendible datasets and datasets with filters applied (checksum, compression)
• Use for sub-setting of big datasets

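The partitioning described above comes down to integer arithmetic on the dataset and chunk dimensions. The following is a minimal sketch with hypothetical dimensions; `chunk_grid` and `chunk_of` are illustrative helpers, not HDF5 functions.

```python
def chunk_grid(dataset_dims, chunk_dims):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(-(-d // c) for d, c in zip(dataset_dims, chunk_dims))


def chunk_of(coords, chunk_dims):
    """Grid position of the chunk that holds the element at `coords`."""
    return tuple(x // c for x, c in zip(coords, chunk_dims))


dataset_dims = (100, 250)   # hypothetical 2-D dataset
chunk_dims = (32, 64)       # hypothetical chunk shape

grid = chunk_grid(dataset_dims, chunk_dims)
n_chunks = grid[0] * grid[1]

print(grid, n_chunks)                     # (4, 4) 16
print(chunk_of((99, 249), chunk_dims))    # (3, 3)
```

Because the dataset dimensions are not multiples of the chunk dimensions here, the chunks on the far edges are only partly covered by the dataset, which is one way the choice of chunk shape can affect file size.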
Chunked storage layout
[Figure: the dataset array is split into chunks A, B, C and D. The dataset header and the chunk index are held in the metadata cache (MDC) in application memory. In the HDF5 file, the header, the chunk index, and the chunks are stored separately, and the chunks need not be in order (shown as C, D, B, A). Raw data is stored in separate chunks in the HDF5 file.]

Compact storage layout
• Raw data is stored in the dataset object header
• Raw data is read and written together with the header
• Use for small datasets (a few KB) to minimize small I/O operations

Compact storage layout
[Figure: the dataset header, including the dataset array data, is held in the metadata cache (MDC) in application memory and stored as one unit in the HDF5 file. Raw data is stored in the dataset object header.]

FACTORS AFFECTING I/O PERFORMANCE

HDF5 data structures
• Data structures used by the HDF5 library
  - B-trees (groups, dataset chunks)
  - Hash tables
  - Local and global heaps (variable length data: link names, strings, etc.)
• Other concepts
  - HDF5 metadata cache
  - HDF5 chunk cache
  - Free space management data structure
  - Etc.

Operations on data inside HDF5 library
• Copying to/from internal buffers
• Datatype conversion, e.g.,
  - Float to integer
  - Little-endian to big-endian
  - 64-bit integer to 16-bit integer
  - Variable-length data conversion from memory to file
• Scattering and gathering
  - Data is scattered/gathered between application buffers and internal buffers for datatype conversion and partial I/O

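To make the conversion cases above concrete, the sketch below uses Python's `struct` module as a stand-in for the library's conversion step: it takes a buffer of little-endian 64-bit integers and produces big-endian 16-bit integers. The `convert` helper is hypothetical and handles only in-range values; it says nothing about how HDF5 itself treats overflow.

```python
import struct


def convert(buffer, count, src, dst):
    """Convert `count` elements between two struct formats such as "<q" and ">h".

    The first character of each format is the byte order and the second is
    the element type. struct.error is raised if a value does not fit `dst`.
    """
    values = struct.unpack(f"{src[0]}{count}{src[1]}", buffer)
    return struct.pack(f"{dst[0]}{count}{dst[1]}", *values)


values = (1, -2, 300, 32767)
in_memory = struct.pack("<4q", *values)        # little-endian 64-bit integers
in_file = convert(in_memory, 4, "<q", ">h")    # big-endian 16-bit integers

print(len(in_memory), len(in_file))            # 32 8
print(in_file.hex())                           # 0001fffe012c7fff
```

Every element is touched and copied during such a conversion, which is why matching the memory and file datatypes avoids work.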
Operations on data inside HDF5 library
• Data transformation (filters, compression)
  - Checksum on raw data and metadata
  - Algebraic transform
  - GZIP and SZIP compression
  - HDF5 and user-defined data transformations

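A rough illustration of what a compression filter and a checksum do to one block of raw data, using Python's `zlib` module. This is a sketch under stated assumptions: the GZIP filter is based on the same deflate algorithm, but `zlib.crc32` is used here only as a stand-in checksum and is not the checksum algorithm HDF5 uses. The block contents are made up.

```python
import zlib

# A made-up 4096-byte block of raw data with a repeating pattern.
block = bytes(range(256)) * 16

compressed = zlib.compress(block, 6)    # deflate, compression level 6
checksum = zlib.crc32(block)            # stand-in checksum over the raw bytes

# Reading reverses the transformation and verifies the checksum.
restored = zlib.decompress(compressed)
checksum_ok = zlib.crc32(restored) == checksum

print(len(block), len(compressed) < len(block), checksum_ok)
```

How much a block shrinks depends entirely on its contents; the repeating pattern used here compresses well, while random data would not.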
I/O performance
• I/O performance depends on many factors
  - Storage layouts
  - Dataset storage properties
  - Chunking strategy
  - Metadata cache performance
  - Datatype conversion performance
  - Other filters, such as compression
  - Access patterns

I/O WITH DIFFERENT STORAGE LAYOUTS

WRITING COMPACT DATASET

Writing compact dataset
[Figure: the dataset array data is copied from application memory into the dataset header in the metadata cache (MDC), and the header is then written to the HDF5 file. Raw data is written when the object header is written.]

WRITING CONTIGUOUS DATASET

Writing contiguous dataset
[Figure: the dataset array data goes from application memory to the HDF5 file, while the dataset header stays in the metadata cache (MDC). Raw data is written first. The header is written when it is flushed to the file (H5Dclose, H5Fflush, or an MDC flush done by the HDF5 library).]

Writing contiguous dataset with conversion
[Figure: the dataset array data passes through a 1 MB conversion buffer in application memory on its way to the HDF5 file. Raw data goes through the conversion buffer. The header is written when it is flushed to the file (H5Dclose, H5Fflush, or an MDC flush done by the HDF5 library).]

PARTIAL I/O FOR CONTIGUOUS DATASET

Sub-setting of contiguous dataset: series of adjacent rows
[Figure: M full rows of N elements are selected in application memory. The subset is contiguous in the file, so it is transferred in one I/O operation.]

Sub-setting of contiguous dataset: adjacent, partial rows
[Figure: M rows are selected, each covering only part of a row of N elements. The subset lies in M contiguous blocks in the file, so several I/O operations are needed.]

Sub-setting of contiguous dataset: extreme case, writing a column
[Figure: a column of M rows, one element per row. The subset data is scattered over M different locations in the file, so several small I/O operations are needed.]

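The three cases above can be summarised by counting how many contiguous blocks a rectangular selection occupies in a row-major 2-D dataset. This is a minimal sketch that ignores the sieve buffer and any other optimisation inside the library; `contiguous_blocks` is an illustrative name, not an HDF5 function.

```python
def contiguous_blocks(n_cols, n_sel_rows, col_start, n_sel_cols):
    """Return (number_of_blocks, elements_per_block) for a rectangular
    selection of adjacent rows in a row-major dataset with n_cols columns."""
    if n_sel_rows < 1 or n_sel_cols < 1:
        raise ValueError("selection must contain at least one element")
    if col_start < 0 or col_start + n_sel_cols > n_cols:
        raise ValueError("selection exceeds the row length")
    if n_sel_cols == n_cols:
        # Full adjacent rows follow each other in the file: one block.
        return 1, n_sel_rows * n_cols
    # Partial rows are separated by the unselected part of each row.
    return n_sel_rows, n_sel_cols


N, M = 100, 10   # hypothetical row length and number of selected rows

print(contiguous_blocks(N, M, 0, N))     # (1, 1000)  adjacent full rows
print(contiguous_blocks(N, M, 20, 40))   # (10, 40)   adjacent partial rows
print(contiguous_blocks(N, M, 5, 1))     # (10, 1)    a single column
```

The same 10 rows cost one large transfer or ten small ones depending only on the shape of the selection.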
Sub-setting of contiguous dataset: data sieve buffer
[Figure: the M single elements are copied (memcopy) from application memory into a sieve buffer in memory (64 K), which is then written to the HDF5 file in one write operation.]

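The effect of the sieve buffer on the column case can be estimated with a toy model: small accesses that fall inside the window currently held in memory cost no extra file access. This is a simplified sketch, not the library's actual policy; `sieve_windows`, the row length, and the element size are assumptions made for illustration.

```python
def sieve_windows(offsets, sizes, sieve_size=64 * 1024):
    """Count how many times a new window must be positioned in the file
    when accesses (byte offset, byte size) are served through one buffer
    of `sieve_size` bytes."""
    windows = 0
    window_start = None
    for off, size in sorted(zip(offsets, sizes)):
        inside = (
            window_start is not None
            and off >= window_start
            and off + size <= window_start + sieve_size
        )
        if not inside:
            windows += 1
            window_start = off
    return windows


# Hypothetical column write: 1000 rows, 256 four-byte elements per row,
# so consecutive column elements are 1024 bytes apart in the file.
offsets = [row * 256 * 4 for row in range(1000)]
sizes = [4] * 1000

print(len(offsets))                     # 1000 separate small accesses
print(sieve_windows(offsets, sizes))    # 16 windows of 64 KB
```

In this model each 64 KB window covers 64 column elements, so 1000 small accesses collapse into 16 larger ones.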
Performance tuning for contiguous dataset
• Datatype conversion
  - Avoid for better performance
  - Use the H5Pset_buffer function to customize the conversion buffer size
• Partial I/O
  - Write/read in big contiguous blocks
  - Use H5Pset_sieve_buf_size to improve performance for complex sub-setting
• Caution:
  - The sieve buffer is allocated when the first write occurs and is released when the dataset is closed.
  - Memory use will grow if there are many open datasets.

I/O FOR CHUNKED DATASET

Recall: Chunked storage layout
[Figure: the dataset array is split into chunks A, B, C and D. The dataset header and the chunk index are held in the metadata cache (MDC) in application memory. Raw data is stored in separate chunks in the HDF5 file.]

HDF5 chunking
• The HDF5 library treats each chunk as an atomic object
  - Compression is applied to each chunk
  - Datatype conversion and other filters are applied per chunk
• Chunk size greatly affects performance
  - Chunk overhead adds to file size
  - Chunk processing involves many steps

HDF5 chunk cache
• Chunk cache (general points, details later)
  - Caches chunks for better performance; remains allocated across multiple calls
  - Created for each chunked dataset
  - Size of the chunk cache is set for the file (default size 1 MB)
  - Each chunked dataset has its own chunk cache
  - A chunk may be too big to fit into the cache
  - Memory use may grow if the application keeps opening datasets

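Whether a chunk fits into the cache, and how much memory many open datasets could claim, is simple arithmetic. The sketch below assumes the 1 MB default means 1024 × 1024 bytes and uses hypothetical chunk shapes; the helper names are illustrative.

```python
DEFAULT_CACHE_BYTES = 1024 * 1024   # assumed meaning of the 1 MB default


def chunk_bytes(chunk_dims, element_size):
    """Uncompressed size of one chunk in bytes."""
    n = 1
    for d in chunk_dims:
        n *= d
    return n * element_size


def chunks_that_fit(chunk_dims, element_size, cache_bytes=DEFAULT_CACHE_BYTES):
    """How many whole chunks fit into one chunk cache (0 means none)."""
    return cache_bytes // chunk_bytes(chunk_dims, element_size)


print(chunks_that_fit((100, 100), 8))      # 13 chunks of 80,000 bytes
print(chunks_that_fit((1024, 1024), 4))    # 0: a 4 MB chunk does not fit

# Upper bound on chunk cache memory with 50 chunked datasets open at once.
print(50 * DEFAULT_CACHE_BYTES)            # 52428800 bytes (50 MB)
```

A result of 0 corresponds to the case noted above where a chunk is too big for the cache.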
HDF5 chunk cache
[Figure: application memory holds the metadata cache (MDC), containing dataset headers and chunking B-tree nodes, alongside the chunk caches (one per dataset). The figure is labelled "Default size is 1 MB".]

Writing chunked dataset
[Figure: in the application memory space, data for chunks A, B and C of a chunked dataset passes through the conversion buffer into the chunk cache, and from there through the filter pipeline into the HDF5 file.]
• Datatype conversion is performed before a chunk is placed in the cache
• A chunk is written when it is evicted from the cache
• Compression and other filters are applied on eviction

PARTIAL I/O FOR CHUNKED DATASET

Partial I/O for chunked dataset
[Figure: a dataset divided into six chunks; the selected (green) subset overlaps the chunks numbered 1 to 4.]
• Example: write the green subset of the dataset, converting the data
• The dataset is stored as six chunks in the file.
• The subset spans four chunks, numbered 1-4 in the figure.
• Hence four chunks must be written to the file.
• But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten.

Partial I/O for chunked dataset
• For each of the four chunks:
  - Read the chunk from the file into the chunk cache, unless it is already there.
  - Determine which part of the chunk will be replaced by the selection.
  - Move those elements to the conversion buffer and perform the conversion
  - Move the data elements to be written from the application buffer to the conversion buffer
  - Move those elements back from the conversion buffer to the chunk cache.
  - Apply filters (compression) when the chunk is flushed from the chunk cache
• Three memcopy operations are performed for each element

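The read-modify-write cycle described above can be mimicked for a single compressed chunk. This is a toy model under stated assumptions: the chunk is a flat byte string, `zlib` stands in for the compression filter, and the conversion buffer is left out; `update_chunk` is a hypothetical helper.

```python
import zlib


def update_chunk(stored_chunk, offset, new_bytes):
    """Replace part of a compressed chunk and return the new compressed chunk.

    The whole chunk is decompressed, patched in memory, and recompressed,
    even when only a few bytes change.
    """
    chunk = bytearray(zlib.decompress(stored_chunk))   # read and un-filter
    if offset < 0 or offset + len(new_bytes) > len(chunk):
        raise ValueError("selection does not fit inside the chunk")
    chunk[offset:offset + len(new_bytes)] = new_bytes  # overwrite selection
    return zlib.compress(bytes(chunk))                 # re-filter for writing


original = bytes(1000)                     # a chunk of 1000 zero bytes
stored = zlib.compress(original)

stored = update_chunk(stored, 100, b"\x01" * 8)
restored = zlib.decompress(stored)

print(len(restored), restored[100:108] == b"\x01" * 8)   # 1000 True
```

Changing 8 bytes still required handling all 1000 bytes of the chunk, which is why small writes into large compressed chunks are expensive.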
Partial I/O for chunked dataset
[Figure: elements are copied (memcopy) between application memory, the conversion buffer, and the chunk held in the chunk cache, three memcopy operations in all. The chunk is then compressed and written to the HDF5 file.]

I/O FOR VARIABLE-LENGTH DATASET

Examples of variable length data
• String
    A[0]   “the first string we want to write”
    ...
    A[N-1] “the N-th string we want to write”
• Each element is a record of variable length
    A[0] (1, 1, 0, 0, 0, 5, 6, 7, 8, 9) [length = 10]
    A[1] (0, 0, 110, 2005) [length = 4]
    ...
    A[N] (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., M) [length = M]

Variable length data in HDF5
• Variable length description in an HDF5 application:
    typedef struct {
        size_t len;   /* number of base-type elements */
        void  *p;     /* pointer to the element data */
    } hvl_t;
• The base type can be any HDF5 type: H5Tvlen_create(base_type)
• ~20 bytes of overhead for each element
• Data cannot be compressed

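The per-element overhead quoted above matters most when the records are short. The following back-of-the-envelope sketch takes the approximate 20-byte figure from the slide at face value; the record lengths and the helper `vl_storage_bytes` are made up for illustration.

```python
VL_OVERHEAD_BYTES = 20   # approximate per-element figure quoted above


def vl_storage_bytes(lengths, base_size):
    """Return (payload_bytes, overhead_bytes) for variable-length elements
    whose lengths are given in units of a base type of `base_size` bytes."""
    payload = sum(lengths) * base_size
    overhead = len(lengths) * VL_OVERHEAD_BYTES
    return payload, overhead


# Short records of 4-byte integers: overhead equals the payload.
print(vl_storage_bytes([10, 4, 1], 4))      # (60, 60)

# Long records: the same overhead is negligible.
print(vl_storage_bytes([1000] * 3, 4))      # (12000, 60)
```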
How variable length data is stored in HDF5
[Figure: in the HDF5 file, the dataset with variable length elements stores a pointer into a global heap for each element; the actual variable length data is stored in the global heap. The dataset header is stored separately.]

Variable length datasets and I/O
• Elements from the application buffer are “transferred” to/from heaps in the metadata cache during I/O
[Figure: raw data in the application buffer is moved into a global heap in the metadata cache, and pointers refer to it there.]

There may be more than one global heap
[Figure: raw data from the application buffer is distributed over two global heaps, each referenced by pointers.]

VL dataset and I/O
[Figure: in memory, data moves between the application buffer, the conversion buffers, and the global heap; the global heap is then written to the HDF5 file.]

Hints for variable length data I/O
• Avoid closing/opening a file while writing VL datasets
  - Global heap information is lost
  - Global heaps may have unused space
• Avoid alternately writing different VL datasets
  - Data from different datasets will go into the same heap
• If the maximum length of the record is known, consider using fixed-length records and compression

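The last hint can be illustrated by padding records to a known maximum length and compressing the result. This is a sketch with made-up records, using `zlib` as a stand-in for a compression filter and 32-bit integers as the element type; `pad_and_compress` is a hypothetical helper.

```python
import struct
import zlib


def pad_and_compress(records, max_len, fill=0):
    """Pad each record of 32-bit integers to `max_len` elements and
    compress the resulting fixed-length table. Returns (raw, compressed)."""
    rows = []
    for rec in records:
        if len(rec) > max_len:
            raise ValueError("record is longer than the fixed length")
        padded = list(rec) + [fill] * (max_len - len(rec))
        rows.append(struct.pack(f"<{max_len}i", *padded))
    raw = b"".join(rows)
    return raw, zlib.compress(raw)


# 100 made-up records with 1 to 10 elements each, padded to 64 elements.
records = [tuple(range(1 + i % 10)) for i in range(100)]
raw, packed = pad_and_compress(records, max_len=64)

print(len(raw))                  # 25600 bytes before compression
print(len(packed) < len(raw))    # True: the padding compresses well
```

Padding alone cannot distinguish fill values from real trailing values, so an application using this approach would also need to store each record's true length or reserve a fill value that never occurs in the data.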
The HDF Group
Thank You! Questions?