- Slides: 15
Three Flavors of Data
- Science Data
  - Simulations and Sensor Readings
- Catalog Data
  - Metadata; descriptors of datasets, data products and other processing artifacts.
- Active Data
  - Data associated with logging, monitoring and scheduling compute tasks.

Three Flavors of Data (1)
- Science Data
  - Simulation Data: solutions to partial differential equations governing the physics of the Columbia River Estuary
  - Sensor Data: measurements of the physical characteristics used to guide and validate simulations
Wanted:
- Simple means for specifying new data products from these raw data and computing them efficiently
Approach:
- Data manipulation language based on a GridField data model.

Three Flavors of Data (2)
- Catalog Data
  - Explicit metadata to describe system artifacts
Wanted:
- Tools to locate artifacts given descriptors (query)
- A metadata collection facility that tolerates change
  - The metadata we wish to collect may change (e.g., new product ‘lines’ are developed)
  - The source of the metadata may change (e.g., file naming conventions or directory structures evolve)
Approach:
- Generic database; custom collection scripts

Three Flavors of Data (3)
- Active Data
  - Data describing past, current, and future compute tasks.
Wanted:
- Tools for scheduling, monitoring, and managing...
  - individual tasks (e.g., a single data product derivation)
  - groups of interdependent tasks (e.g., a daily forecast run)
  - campaigns (e.g., a series of calibration runs followed by a re-computation of the runs of 2002 with a different implicitness)
Approach:
- undecided

Simulation Data: GridFields
The data product suite exhibits recurring processing idioms:
- larger grids reduced to smaller grids
  Ex: ‘estuary’ data products vs. ‘far’ data products
- grids mapped to other grids
  Ex: 3D grid mapped to a 2D slice
- grids combined
  Ex: 1D depth grid ‘crossed’ with a 2D horizontal grid

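The three idioms can be illustrated with a toy sketch. It is not the GridField implementation: the grids here are plain lists of node coordinates with no cells or topology, and the function names `cross`, `restrict` and `slice_at_depth` are hypothetical, chosen only for this illustration.

```python
from itertools import product

# Hypothetical toy grids: each grid is just a list of node coordinates.
horizontal = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # 2D (x, y) nodes
depths = [0.0, 5.0, 10.0]                                       # 1D depth levels


def cross(grid_2d, grid_1d):
    """Combine grids: pair every horizontal node with every depth level."""
    return [(x, y, z) for (x, y), z in product(grid_2d, grid_1d)]


def restrict(grid, predicate):
    """Reduce a larger grid to a smaller one by keeping the matching nodes."""
    return [node for node in grid if predicate(node)]


def slice_at_depth(grid_3d, depth):
    """Map a 3D grid to a 2D slice by fixing the depth coordinate."""
    return [(x, y) for (x, y, z) in grid_3d if z == depth]


grid_3d = cross(horizontal, depths)                       # grids combined
west_part = restrict(grid_3d, lambda node: node[0] < 1.0)  # larger to smaller
surface = slice_at_depth(grid_3d, 0.0)                     # 3D to 2D slice
```

A real grid model would also carry cells and the data bound to them; the sketch only shows how each idiom becomes one small, composable operation.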
Simulation Data: GridFields (2)
We’re expressing these idioms as operators over a grid-based data model.
Advantages:
- Simpler recipes
  - 5 ops for all the data products (plus helper functions)
- Flexible model; fewer maintenance troubles
  - N dimensions
    - uniform handling of space and time (maybe more...)
  - Any cell type
    - segments, triangles, quadrangles, arbitrary polytopes
- Optimization opportunities
  - operators prescribe semantics, but not implementation
  - topological equivalences exposed and exploited

Simulation Data: GridFields (3)
Status:
- Core operators functional
- Simple examples hooked to XMVIS for viewing
- Todo:
  - Examples hooked to VTK
  - Write/Test examples from the current product suite
  - Support GridFields too large for memory
  - Expose a nice syntax for writing recipes

Catalog Data: Collection
Where is the Metadata?
- File Path and File Name: /forecasts/2003-184/run/images/isosal_estuary7/anim-sal_estuary_7.gif
- File Content: 1_salt.63 (Version: 1.04, Variable: salt, ...)
- Other Files?

Collection scripts
For each file type the meta-data collection mechanism is different:
- gifs
- binary output
- Param.in
Use a script for each file type that will emit meta-data for that type of file.
Only these simple scripts need to change as the system evolves.

Example: gif animation
/forecasts/2003-184/.../isosal_estuary7/anim-sal_estuary_7.gif
- product line = “isoline”
- Variable = “Salinity”
- Depth = “7”
- CorieDate = “2003-184”
- Type = “Animation”
- Region = “Estuary”
- Lat = xxxx
- Long = xxxx
Here, a script can just parse the path and file name.

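A path-parsing collection script along these lines might look like the following sketch. The regular expression, the lookup tables and the function name `parse_gif_path` are assumptions made for illustration, not the project's actual script; the sample path is the full one shown on the "Where is the Metadata?" slide. Lat and Long are left out because the slides do not say where they come from.

```python
import re

# Hypothetical pattern inferred from the single example path in the slides.
PATH_PATTERN = re.compile(
    r"^/forecasts/(?P<date>\d{4}-\d{3})/.*/"
    r"(?P<product>[a-z]+)_(?P<region>[a-z]+)(?P<depth>\d+)/"
    r"(?P<kind>[a-z]+)-(?P<variable>[a-z]+)_[a-z]+_\d+\.gif$"
)

# Hypothetical lookup tables from path abbreviations to descriptor values.
VARIABLE_NAMES = {"sal": "Salinity"}
KIND_NAMES = {"anim": "Animation"}


def parse_gif_path(path):
    """Return descriptors parsed from a gif path, or None if it does not match."""
    match = PATH_PATTERN.match(path)
    if match is None:
        return None
    return {
        "CorieDate": match["date"],
        "Type": KIND_NAMES.get(match["kind"], match["kind"]),
        "Region": match["region"].capitalize(),
        "Depth": match["depth"],
        "Variable": VARIABLE_NAMES.get(match["variable"], match["variable"]),
    }


metadata = parse_gif_path(
    "/forecasts/2003-184/run/images/isosal_estuary7/anim-sal_estuary_7.gif"
)
```

If the naming convention changes, only the pattern and the lookup tables need to be updated, which is the property the collection-script approach is after.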
Example: Binary output
/forecasts/2003-184/run/1_salt.gif
- Variable = “Salinity”
1_salt.63
- nodes: 55817
- msl: 4285
- ...
What about number of nodes? Mean Sea Level? We need to access the file’s content.
Need a different mechanism than for gif animations; it might be convenient to implement it in a different script.

Architecture
[Diagram: the Reflector invokes a Collection Script; meta-data goes to XML and the DB]
- Reflector creates an XML file containing meta-data for each file and also stores the meta-data into the database
- Reflector determines file type (based on regular expressions) and calls the appropriate collection script
- Collection script uses an “AddItem” Perl function to return the meta-data back to the reflector

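The flow can be sketched in Python, with the caveat that the real system uses Perl and none of its code appears in these slides. Everything named below (`reflect`, `collect_gif`, `collect_binary`, the `metadata` table, the XML element names, the file-type patterns) is hypothetical, and the sketch returns the XML as a string instead of writing a file.

```python
import os
import re
import sqlite3
import xml.etree.ElementTree as ET


def collect_gif(path, add_item):
    """Hypothetical collection script for gif files."""
    add_item("Type", "Animation", "inferred from the file type", "string")


def collect_binary(path, add_item):
    """Hypothetical collection script: variable name taken from the file name."""
    stem = os.path.basename(path).rsplit(".", 1)[0]   # "1_salt.63" -> "1_salt"
    add_item("Variable", stem.split("_", 1)[1], "parsed from file name", "string")


# File type is decided by regular expressions (patterns are assumptions).
COLLECTORS = [
    (re.compile(r"\.gif$"), collect_gif),
    (re.compile(r"\.\d+$"), collect_binary),
]


def reflect(path, db):
    """Dispatch to a collection script, then emit XML and store rows in the DB."""
    for pattern, collector in COLLECTORS:
        if pattern.search(path):
            break
    else:
        return None  # unknown file type: nothing collected

    items = []

    def add_item(name, value, notes="", type_="string"):
        # Python stand-in for the AddItem(Name, Value, Notes, Type) function.
        items.append((name, value, notes, type_))

    collector(path, add_item)

    root = ET.Element("artifact", path=path)
    for name, value, notes, type_ in items:
        element = ET.SubElement(root, "item", name=name, type=type_, notes=notes)
        element.text = value
    db.executemany(
        "INSERT INTO metadata (path, name, value) VALUES (?, ?, ?)",
        [(path, name, value) for name, value, _, _ in items],
    )
    return ET.tostring(root, encoding="unicode")


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (path TEXT, name TEXT, value TEXT)")
xml_text = reflect("/forecasts/2003-184/run/1_salt.63", db)
```

Passing `add_item` into the collector keeps each collection script unaware of where the metadata ends up, so the XML and database handling stay in one place.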
Metadata in XML and DB?
- These XML files give you filesystem-based access to the metadata for an artifact
- Use “info” to present the XML in a readable form:
  /../run> info 1_salt.63
  variable: salt
  version: 1.04
  msl: 4285
  nodes: 55817
- Also useful if DB is inaccessible.

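What an “info”-style tool does can be approximated in a few lines. The XML layout below (an `artifact` element holding `item` elements with a `name` attribute) is an assumption, since the slides do not show the actual file format; only the four name/value pairs come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the values are the ones shown on the slide.
SAMPLE_XML = """<artifact>
  <item name="variable">salt</item>
  <item name="version">1.04</item>
  <item name="msl">4285</item>
  <item name="nodes">55817</item>
</artifact>"""


def info(xml_text):
    """Render each metadata item as a 'name: value' line."""
    root = ET.fromstring(xml_text)
    return "\n".join(
        f"{item.get('name')}: {item.text}" for item in root.iter("item")
    )


readable = info(SAMPLE_XML)
```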
Minor Technical Change
- Previously we had suggested that the collection scripts should emit metadata on standard output
- We have provided a Perl function: AddItem(Name, Value, Notes, Type)

How does this help?
- Find artifacts via descriptors (query)
  - ‘find animations showing the estuary where we used a constant bottom friction coefficient’
  - where region = “estuary” and type = “animation” and ntau = “0”
- Write robust metadata-driven programs
  - Chris’ low bandwidth zoom web app
  - Stay-Fresh Powerpoint Slides
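A descriptor query of this kind can be sketched against a generic name/value table. The schema, the sample rows, the file names and the helper `find_artifacts` are all hypothetical; SQLite is used only because it ships with Python, and the slides do not say which database the project uses.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical generic schema: one row per (artifact, descriptor name, value).
db.execute("CREATE TABLE metadata (path TEXT, name TEXT, value TEXT)")
db.executemany(
    "INSERT INTO metadata (path, name, value) VALUES (?, ?, ?)",
    [
        # Made-up artifacts for illustration only.
        ("a.gif", "region", "estuary"), ("a.gif", "type", "animation"), ("a.gif", "ntau", "0"),
        ("b.gif", "region", "estuary"), ("b.gif", "type", "animation"), ("b.gif", "ntau", "1"),
        ("c.gif", "region", "far"), ("c.gif", "type", "animation"), ("c.gif", "ntau", "0"),
    ],
)


def find_artifacts(db, **descriptors):
    """Return paths whose metadata matches every given descriptor."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    clauses = " OR ".join("(name = ? AND value = ?)" for _ in descriptors)
    params = [part for pair in descriptors.items() for part in pair]
    sql = (
        f"SELECT path FROM metadata WHERE {clauses} "
        "GROUP BY path HAVING COUNT(DISTINCT name) = ? ORDER BY path"
    )
    # An artifact qualifies only if all requested descriptors matched.
    return [row[0] for row in db.execute(sql, params + [len(descriptors)])]


matches = find_artifacts(db, region="estuary", type="animation", ntau="0")
```

Because descriptors are rows and not columns, a new descriptor needs no schema change, which fits the goal of tolerating change in what metadata is collected.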