net CDF An Effective Way to Store and

  • Slides: 19
Download presentation
net. CDF An Effective Way to Store and Retrieve Scientific Datasets Jianwei Li 02/11/2002

net. CDF An Effective Way to Store and Retrieve Scientific Datasets Jianwei Li 02/11/2002

Motivation • Related scientific datasets in a single file. – Define a file format

Motivation • Related scientific datasets in a single file. – Define a file format – Present the multi-dimensional arrays in a selfdescribing way • Common data access through a simple interface – Typical ways of access to multi-dimensional arrays – Platform-independent / Network-transparent Here comes net. CDF!

Outline Today • • • Overview of net. CDF File Format net. CDF library

Outline Today • • • Overview of net. CDF File Format net. CDF library and Programming Model Advantages & Disadvantages Future Plans for Net. CDF

What Is net. CDF? Net. CDF (network Common Data Form) is an interface for

What Is net. CDF? Net. CDF (network Common Data Form) is an interface for array-oriented data access. It defines a machine-independent file format for representing multi-dimensional arrays with ancillary data, and provide support for creation, access, and sharing of array-oriented data. The net. CDF software is implemented as I/O libraries for C, FORTRAN, C++, and Perl and other language for which a net. CDF library is available. • • • Self-describing. Portable. Direct-access. Appendable. Sharable.

Why use net. CDF? • Facilitate the use of common datasets by distinct applications.

Why use net. CDF? • Facilitate the use of common datasets by distinct applications. • Permit datasets to be transported between or shared by dissimilar computers transparently, i. e. , without translation. • Reduce the programming effort usually spent interpreting formats. • Reduce errors arising from misinterpreting data and ancillary data. • Facilitate using output from one application as input to another. • Establish an interface standard which simplifies the inclusion of new software into existing system.

Platforms that net. CDF runs on? • • • AIX HPUX-9. 05 IRIX, IRIX

Platforms that net. CDF runs on? • • • AIX HPUX-9. 05 IRIX, IRIX 64 MSDOS (gcc, f 2 c, GNU make) Linux OSF • • • Open. VMS OS/2 SUNOS ULTRIX UNICOS Windows NT Platforms that range from CRAYs to personal computers and include most UNIX-based workstations.

net. CDF dataset Components • Dimensions ü name, length – Fixed dimension – UNLIMITED

net. CDF dataset Components • Dimensions ü name, length – Fixed dimension – UNLIMITED dimension • Variables: named arrays ü name, type, shape, attributes – Fixed sized variables: array of fixed dimensions – Record variables: array with its most-significant dimension UNLIMITED – Coordinate variables: 1 -D array with the same name as its dimension • Attributes ü name, type, values, length – Variable attributes – Global attributes

An Example net. CDF File

An Example net. CDF File

net. CDF File Format • Header – – Number of record variables Dimension list

net. CDF File Format • Header – – Number of record variables Dimension list Global attribute list Variable list • Data (row-major) – Fixed-sized data for each variable is stored contiguously – Record data a variable number of fixed-size records, each of which contains one record for each of the record variables in order. • Both use extended XDR

Data access in net. CDF dataset Direct Access! Efficient for small subsets out of

Data access in net. CDF dataset Direct Access! Efficient for small subsets out of large datasets Forms of access: • access to all elements (whole array); • access to individual elements, specified with an index vector; • access to array sections, specified with an index vector, and count vector; • access to subsampled array sections, specified with an index vector, count vector, and stride vector; • access to mapped array sections, specified with an index vector, count vector, stride vector, and an index mapping vector (map the indices of an element in an external subarray to the offset in internal buffer).

Dataset Access Example Start[] = {2, 1}, Count[] = {3, 2}, Stride[] = {3,

Dataset Access Example Start[] = {2, 1}, Count[] = {3, 2}, Stride[] = {3, 2}, Imap[] = {1, 3}

net. CDF Library & Programming Model Modes: definition mode, data mode IDs: dataset ID,

net. CDF Library & Programming Model Modes: definition mode, data mode IDs: dataset ID, dimension ID, variable ID, attribute number Create & Write Read by name Read sequencially Add dim, var, att

Example library functions

Example library functions

net. CDF data types and type conversion net. CDF Data Type C API Mnemonic

net. CDF data types and type conversion net. CDF Data Type C API Mnemonic Discription byte NC_BYTE 8 -bit characters for text data. char NC_CHAR 8 -bit signed or unsigned integers short NC_SHORT 16 -bit signed integers int NC_INT 32 -bit signed integers float NC_FLOAT 32 -bit IEEE floating-point double NC_DOUBLE 64 -bit IEEE floating-point • Different library functions for each data type • Automatic conversion to or from external type • Conversion from one numeric type to another

Other Data Types? • Data Structure – No direct support – Can build some

Other Data Types? • Data Structure – No direct support – Can build some data structure by grouping arrays or by using data in one array as pointers into another array. Inefficient, and may not be self-describing! • Character String – not a primitive net. CDF external data type – Charater-string attributes: well represented by NC_CHAR values and its length. – Charater-string variables: use a character-position dimension as the most quickly varying dimension for the variable (dim-length = the max string length)

Share Data Between Read and Write • Support for multiple read and single write

Share Data Between Read and Write • Support for multiple read and single write • Use NC_SHARE mode to create or open dataset to use unbuffered I/O • Or call nc_sync() to force writing changes back to disk • Header changes in define mode are synchronized to disk only when nc_enddef() is called, reader must call nc_sync() to bring header up-to-date in its memory

Advantages • Simple and useful libraries for arrayoriented data access • The dataset is

Advantages • Simple and useful libraries for arrayoriented data access • The dataset is self-describing • Portable and easy to share • Flexible and efficient data access • Mechanism to support extendable arrays

Disadvantages • No support for multiple concurrent write • Limited number of external numeric

Disadvantages • No support for multiple concurrent write • Limited number of external numeric data types: 8 -, 16 -, 32 -bit integers, or 32 - or 64 -bit floating-point numbers. • No support for nested data structures such as trees, nested arrays, or other recursive structures • Only one unlimited dimension. Multiple variables sharing an unlimited dimension, must all grow together. • Redefine(structure changes) that requires more file space is accomplished by copying the dataset.

Future plans for net. CDF • Parallel write support • Support for bit data

Future plans for net. CDF • Parallel write support • Support for bit data type and transparent data packing for bit variables • Support for efficient structure changes • Support for nested arrays and other data structure