Three Flavors of Data

Three Flavors of Data
Science Data
  • Simulations and sensor readings
Catalog Data
  • Metadata: descriptors of datasets, data products, and other processing artifacts.
Active Data
  • Data associated with logging, monitoring, and scheduling compute tasks.

Three Flavors of Data (1)
Science Data
  • Simulation data: solutions to partial differential equations governing the physics of the Columbia River Estuary
  • Sensor data: measurements of the physical characteristics used to guide and validate simulations
Wanted:
  • Simple means for specifying new data products from these raw data and computing them efficiently
Approach:
  • Data manipulation language based on a GridField data model.

Three Flavors of Data (2)
Catalog Data
  • Explicit metadata to describe system artifacts
Wanted:
  • Tools to locate artifacts given descriptors (query)
  • A metadata collection facility that tolerates change
      - The metadata we wish to collect may change (e.g., new product ‘lines’ are developed)
      - The source of the metadata may change (e.g., file naming conventions or directory structures evolve)
Approach:
  • Generic database; custom collection scripts

Three Flavors of Data (3)
Active Data
  • Data describing past, current, and future compute tasks.
Wanted:
  • Tools for scheduling, monitoring, and managing...
      - individual tasks (e.g., a single data product derivation)
      - groups of interdependent tasks (e.g., a daily forecast run)
      - campaigns (e.g., a series of calibration runs followed by a re-computation of the runs of 2002 with a different implicitness)
Approach:
  • undecided

Simulation Data: GridFields
The data product suite exhibits recurring processing idioms
  • larger grids reduced to smaller grids
      Ex: ‘estuary’ data products vs. ‘far’ data products
  • grids mapped to other grids
      Ex: 3D grid mapped to a 2D slice
  • grids combined
      Ex: 1D depth grid ‘crossed’ with a 2D horizontal grid.
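The three idioms can be illustrated with a minimal Python sketch. It is not the GridFields implementation: it treats a grid as a plain list of node coordinates and ignores cells and topology, and the function names (cross, restrict, slice_at_depth) are hypothetical.

```python
from itertools import product

def cross(horizontal_nodes, depths):
    """Combine grids: pair every 2D (x, y) node with every depth level."""
    return [(x, y, z) for (x, y), z in product(horizontal_nodes, depths)]

def restrict(nodes, predicate):
    """Reduce a larger grid to a smaller one by keeping matching nodes."""
    return [node for node in nodes if predicate(node)]

def slice_at_depth(nodes3d, depth):
    """Map a 3D grid to a 2D slice by fixing the depth coordinate."""
    return [(x, y) for (x, y, z) in nodes3d if z == depth]

# Illustrative data only: three horizontal nodes and two depth levels.
horizontal = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
depths = [0.0, 5.0]

volume = cross(horizontal, depths)             # 3 x 2 = 6 nodes
near = restrict(volume, lambda n: n[0] < 1.0)  # drops the two nodes with x == 1.0
surface = slice_at_depth(volume, 0.0)          # recovers the horizontal nodes
```

A grid model that also tracks cells would need each operator to carry the cell connectivity along, which this sketch leaves out.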

Simulation Data: GridFields (2)
We’re expressing these idioms as operators over a grid-based data model.
Advantages:
  • Simpler recipes
      - 5 ops for all the data products (plus helper functions)
  • Flexible model; fewer maintenance troubles
      - N dimensions
          uniform handling of space and time (maybe more...)
      - Any cell type
          segments, triangles, quadrangles, arbitrary polytopes
  • Optimization opportunities
      - operators prescribe semantics, but not implementation
      - topological equivalences exposed and exploited

Simulation Data: GridFields (3)
Status:
  • Core operators functional
  • Simple examples hooked to XMVIS for viewing
Todo:
  • Examples hooked to VTK
  • Write/test examples from the current product suite
  • Support GridFields too large for memory
  • Expose a nice syntax for writing recipes

Catalog Data: Collection
Where is the Metadata?
  • File path and file name: /forecasts/2003-184/run/images/isosal_estuary7/anim-sal_estuary_7.gif
  • File content: 1_salt.63 (Version: 1.04, Variable: salt, ...)
  • Other files?

Collection scripts
For each file type the meta-data collection mechanism is different.
  • gifs
  • binary output
  • Param.in
Use a script for each file type that will emit meta-data for that type of file.
Only these simple scripts need to change as the system evolves.

Example: gif animation
/forecasts/2003-184/.../isosal_estuary7/anim-sal_estuary_7.gif
  • product line = “isoline”
  • Variable = “Salinity”
  • Depth = “7”
  • CorieDate = “2003-184”
  • Type = “Animation”
  • Region = “Estuary”
  • Lat = xxxx
  • Long = xxxx
Here, a script can just parse the path and file name.
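Parsing the path and file name might look like the following Python sketch. The abbreviation tables, the regular expressions, and the function name collect_gif_metadata are assumptions made for illustration; the full path is the one shown on the earlier "Where is the Metadata?" slide. Lat and Long are left out because they cannot be read from the path.

```python
import re
from pathlib import PurePosixPath

# Hypothetical abbreviation tables; the real naming conventions may differ.
PRODUCT_LINES = {"iso": "isoline"}
VARIABLES = {"sal": "Salinity"}
TYPES = {"anim": "Animation"}

# Assumed file name layout: <type>-<variable>_<region>_<depth>.gif
FILE_RE = re.compile(
    r"^(?P<type>[a-z]+)-(?P<var>[a-z]+)_(?P<region>[a-z]+)_(?P<depth>\d+)\.gif$"
)
# Assumed date directory layout: four-digit year, hyphen, three-digit day of year.
DATE_RE = re.compile(r"^\d{4}-\d{3}$")

def collect_gif_metadata(path):
    """Derive descriptors for a gif animation from its path and file name."""
    p = PurePosixPath(path)
    m = FILE_RE.match(p.name)
    if m is None:
        raise ValueError(f"unrecognized gif file name: {p.name}")
    date = next((part for part in p.parts if DATE_RE.match(part)), None)
    directory = p.parent.name
    line = next(
        (full for abbr, full in PRODUCT_LINES.items() if directory.startswith(abbr)),
        None,
    )
    return {
        "product line": line,
        "Variable": VARIABLES.get(m["var"], m["var"]),
        "Depth": m["depth"],
        "CorieDate": date,
        "Type": TYPES.get(m["type"], m["type"]),
        "Region": m["region"].capitalize(),
    }

meta = collect_gif_metadata(
    "/forecasts/2003-184/run/images/isosal_estuary7/anim-sal_estuary_7.gif"
)
```

If the naming convention changes, only the tables and the two patterns in this one script would need to change.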

Example: Binary output
/forecasts/2003-184/run/1_salt.gif
  • Variable = “Salinity”
1_salt.63
  • nodes: 55817
  • msl: 4285
  • ...
What about number of nodes? Mean Sea Level? We need to access the file’s content.
This needs a different mechanism than for gif animations; it might be convenient to implement it in a different script.

Architecture
[Diagram: Reflector invokes Collection Script; meta-data goes to XML and DB]
  • Reflector creates an XML file containing meta-data for each file and also stores the meta-data in the database
  • Reflector determines file type (based on regular expressions) and calls the appropriate collection script
  • Collection script uses an “AddItem” Perl function to return the meta-data to the reflector
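The control flow can be sketched in Python as follows. This is an illustrative analog, not the actual reflector, which uses Perl collection scripts. The file-type patterns, the XML layout, the table schema, and all function names are assumptions. For brevity the sketch returns the XML as a string instead of writing a file.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

def collect_gif(path, add_item):
    # Hypothetical collector: a real one would parse the path and file name.
    add_item("type", "animation", notes="from file extension")

def collect_binary(path, add_item):
    # Hypothetical collector: a real one would read the file's content.
    add_item("type", "binary output", notes="from file extension")

# File type is decided by the first regular expression that matches (assumed patterns).
COLLECTORS = [
    (re.compile(r"\.gif$"), collect_gif),
    (re.compile(r"\.\d+$"), collect_binary),
]

def reflect(path, db):
    """Pick a collector by file type, gather its items, emit XML, store rows."""
    items = []

    def add_item(name, value, notes="", type_="string"):
        # Python stand-in for the AddItem(Name, Value, Notes, Type) callback.
        items.append((name, value, notes, type_))

    for pattern, collector in COLLECTORS:
        if pattern.search(path):
            collector(path, add_item)
            break
    else:
        return None  # unknown file type: nothing collected

    root = ET.Element("artifact", path=path)
    for name, value, notes, type_ in items:
        ET.SubElement(root, "item", name=name, type=type_, notes=notes).text = value
    db.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?, ?, ?)",
        [(path, *item) for item in items],
    )
    return ET.tostring(root, encoding="unicode")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (path, name, value, notes, type)")
xml_text = reflect("/forecasts/2003-184/run/1_salt.63", db)
```

Supporting a new file type in this sketch means adding one pattern and one collector function; the reflector itself stays unchanged.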

Metadata in XML and DB?
These XML files give you filesystem-based access to the metadata for an artifact.
Use “info” to present the XML in a readable form:
    /../run> info 1_salt.63
    variable: salt
    version: 1.04
    msl: 4285
    nodes: 55817
Also useful if DB is inaccessible.
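A tool like “info” could be as small as the Python sketch below. The XML layout is not given in the slides, so the sketch assumes a hypothetical one with one item element per metadata entry; the sample values are the ones shown in the listing above.

```python
import xml.etree.ElementTree as ET

# Assumed layout: one <item name="..."> element per metadata entry.
SAMPLE = """<artifact path="1_salt.63">
  <item name="variable">salt</item>
  <item name="version">1.04</item>
  <item name="msl">4285</item>
  <item name="nodes">55817</item>
</artifact>"""

def info(xml_text):
    """Render each metadata item as a 'name: value' line."""
    root = ET.fromstring(xml_text)
    return "\n".join(
        f"{item.get('name')}: {item.text}" for item in root.iter("item")
    )

print(info(SAMPLE))
```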

Minor Technical Change
Previously we had suggested that the collection scripts should emit metadata on standard output.
We have now provided a Perl function for this:
    AddItem(Name, Value, Notes, Type)

How does this help?
  • Find artifacts via descriptors (query)
      - ‘find animations showing the estuary where we used a constant bottom friction coefficient’
      - where region = “estuary” and type = “animation” and ntau = “0”
  • Write robust metadata-driven programs
      - Chris’ low-bandwidth zoom web app
      - Stay-Fresh PowerPoint slides
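A descriptor query of this kind can be sketched against a generic name/value table. The schema, the sample rows, the file paths, and the function name find_artifacts are all hypothetical; only the descriptor names and values (region, type, ntau) come from the slide.

```python
import sqlite3

def find_artifacts(db, **descriptors):
    """Return paths of artifacts whose metadata matches every given descriptor."""
    if not descriptors:
        return []
    clause = " OR ".join("(name = ? AND value = ?)" for _ in descriptors)
    params = [x for pair in descriptors.items() for x in pair]
    sql = (
        f"SELECT path FROM metadata WHERE {clause} "
        "GROUP BY path HAVING COUNT(DISTINCT name) = ? ORDER BY path"
    )
    return [row[0] for row in db.execute(sql, params + [len(descriptors)])]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (path, name, value)")
db.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        # Illustrative rows only.
        ("run_a/anim.gif", "region", "estuary"),
        ("run_a/anim.gif", "type", "animation"),
        ("run_a/anim.gif", "ntau", "0"),
        ("run_b/anim.gif", "region", "estuary"),
        ("run_b/anim.gif", "type", "animation"),
        ("run_b/anim.gif", "ntau", "1"),
    ],
)

matches = find_artifacts(db, region="estuary", type="animation", ntau="0")
```

Storing metadata as name/value rows keeps the schema fixed when new descriptors are added, at the cost of a grouped query like this one for multi-descriptor searches.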