GCE Data Toolbox: metadata-based tools for automated data processing and analysis
Wade Sheldon, University of Georgia, GCE-LTER
Rationale
- Data processing, quality control, data analysis and metadata generation are traditionally carried out as separate activities, often in different time frames using different technologies
- Problems:
  - Metadata may not reflect all processing steps
  - Much routine data analysis is done without Q/C or metadata
  - No economy of scale, which leads to “one-off” solutions
- Metadata generation should ideally occur throughout the data cycle and “inform” data analysis
Design Goals
- Develop Integrated Storage Standard
  - Tabular Data
  - QA/QC Information
  - Metadata (overall data set & columns/attributes)
- Develop Software to Support Standard
  - Code Library/API
  - User Interfaces
- Apply Technology to Acquire, Manage, Distribute GCE-LTER Data
- Explore Use as Prototype Technology for Metadata-based Data Processing, Synthesis
Storage Standard
- Developed Using MATLAB®
  - Local expertise, large scientific user base
  - Cross-platform (Win32, Solaris, *nix, Mac OS X)
  - Rapid development environment
  - Supports multiple interfaces (interactive command line, batch-mode scripts, GUI, WWW)
  - Good interoperability with other technologies (Java, PERL, SQL)
- Defined “GCE Data Structure” Spec. (based on MATLAB/C structures)
  - Structure with 17 named fields
  - Specific content rules for each field (software validation)
  - Combines data, metadata, QA/QC, processing history
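The idea of one structure that carries data, metadata, QA/QC flags and processing history, checked by software against content rules, can be sketched as follows. This is a minimal illustration in Python rather than MATLAB. The class `DataSet`, its field names and the `validate` rules are all hypothetical and do not reproduce the 17 fields or the rules of the actual GCE Data Structure specification.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """Hypothetical container; fields are illustrative, not the GCE spec."""
    title: str
    column_names: list
    units: list           # one unit string per column
    values: list          # one list of values per column
    flags: list           # one list of QA/QC flag strings per column
    metadata: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


def validate(ds):
    """Return a list of rule violations; an empty list means the structure passed."""
    errors = []
    ncols = len(ds.column_names)
    if not ds.title:
        errors.append("title is empty")
    # Every per-column field must have exactly one entry per column.
    for name, seq in (("units", ds.units), ("values", ds.values), ("flags", ds.flags)):
        if len(seq) != ncols:
            errors.append(f"{name} has {len(seq)} entries, expected {ncols}")
    # Each value needs a matching flag slot.
    for col, (vals, flgs) in enumerate(zip(ds.values, ds.flags)):
        if len(vals) != len(flgs):
            errors.append(f"column {col}: {len(vals)} values but {len(flgs)} flags")
    # Tabular data: all columns must have the same number of rows.
    if len({len(v) for v in ds.values}) > 1:
        errors.append("columns have unequal lengths")
    return errors
```

Keeping validation in one function means every tool that creates or modifies a structure can run the same checks before passing it on.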
Storage Standard
GCE Data Structure Specification (v1.1)
Software – GCE Data Toolbox
- Core Function Library
  - Create, Validate Structures
  - Import Data, Metadata (ASCII, MATLAB, SQL)
  - Manipulate Data, Metadata (unit conversions, add/delete/update)
  - Export Data, Metadata (various formats)
  - Dynamic, Rule-based QA/QC Flagging
- Self-documenting
  - Processing Operation Logging (Processing History)
  - Transparent Metadata Creation/Updating
  - Dynamic (JIT) Metadata Generation for Columns
- Support for Metadata “Templating”
  - Application of Boilerplate Metadata based on Parameter Matching
  - Supports Rapid Documentation of Routine Data Sources
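Rule-based flagging combined with automatic processing-history logging might look roughly like the sketch below. It is written in Python for illustration only. The rule format, the flag codes `I` and `Q`, the thresholds and the dictionary layout are assumptions, not the toolbox's actual rule syntax or API.

```python
from datetime import datetime, timezone

# Hypothetical rules: (flag code, description, predicate applied to each value)
RULES = [
    ("I", "invalid: below 0", lambda x: x < 0),
    ("Q", "questionable: above 40", lambda x: x > 40),
]


def flag_column(values, rules):
    """Return one flag string per value, concatenating the codes of all matching rules."""
    return ["".join(code for code, _, test in rules if test(v)) for v in values]


def apply_flags(dataset, column, rules):
    """Flag one column in place and append an entry to the processing history."""
    flags = flag_column(dataset["values"][column], rules)
    dataset["flags"][column] = flags
    n_flagged = sum(1 for f in flags if f)
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    dataset["history"].append(
        (timestamp, f"flagged {n_flagged} of {len(flags)} values in '{column}' "
                    f"using {len(rules)} rules")
    )
    return dataset
```

Because the history entry is written by the same function that assigns the flags, the record of what was done cannot drift out of step with the data.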
Software – GCE Data Toolbox
- Support for Analysis
  - Descriptive Statistics, Reports
  - Visualization, Mapping
- Support for Synthesis
  - Composite Data Set Creation
    - Multiple Data Set Merge/Concatenation
    - Relational Join
    - Metadata Content Meshing
  - Data Set Summarization
    - Statistical Data Reduction/Re-sampling
  - Data Set Standardization
    - Unit Conversions (automatic, interactive)
    - Template-based Semantic Mapping
    - Automatic Semantic Mediation (prototype stage)
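A metadata-driven unit conversion, where the column's own units attribute selects the conversion and the change is recorded alongside the data, could be sketched as below. The Python code is illustrative only. The `CONVERSIONS` table, the unit strings and the column dictionary layout are assumptions rather than the toolbox's conversion database.

```python
# Hypothetical conversion table keyed by (from_units, to_units)
CONVERSIONS = {
    ("ft", "m"): lambda x: x * 0.3048,
    ("degF", "degC"): lambda x: (x - 32.0) * 5.0 / 9.0,
    ("ft^3/s", "m^3/s"): lambda x: x * 0.3048 ** 3,
}


def convert_column(column, new_units):
    """Return a converted copy of a column, using its 'units' metadata to pick the conversion."""
    old_units = column["units"]
    if old_units == new_units:
        return column
    try:
        func = CONVERSIONS[(old_units, new_units)]
    except KeyError:
        raise ValueError(f"no conversion from {old_units} to {new_units}") from None
    return {
        "name": column["name"],
        "units": new_units,
        # Missing values (None) are passed through unchanged.
        "values": [None if v is None else func(v) for v in column["values"]],
        "history": column.get("history", []) + [f"converted {old_units} to {new_units}"],
    }
```

Updating the units attribute and the history in the same step keeps the metadata consistent with the converted values.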
Software – User Interfaces
- Unattended Batch Mode Processing
- Interactive Command Line Processing (conventional MATLAB UI)
  - Full help text for each function
  - Well-defined input/output arguments
- GUI Applications
  - Standard Forms, Dialogs, Controls
  - No MATLAB Experience Required
- WWW – MATLAB Web Server
  - HTML Forms, Querystring Input
  - HTML Pages and/or Static File Output
Command-Line Interface
GUI Applications
WWW Interface
Current Applications
- Automated Data Processing
  - Direct data import from data logger files, WWW data sources (USGS), SQL queries
  - Automatic metadata creation (templates, data mining)
  - Rule-based QA/QC flagging
- Data Set Packaging
  - Batch processing to create/update data, metadata products
  - On-demand generation of data, metadata, stat reports in custom formats (end-user scripts, GUI applications, WWW forms)
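Template-based metadata creation, in which boilerplate attribute metadata is applied to imported columns by matching on the parameter name, might be sketched like this. The Python code is for illustration only. The template contents, the case-insensitive matching rule and the `matched` marker are assumptions, not the toolbox's template format.

```python
# Hypothetical template: boilerplate metadata keyed by lower-cased parameter name
TEMPLATE = {
    "salinity": {"units": "PSU", "description": "Practical salinity", "precision": 2},
    "temperature": {"units": "degC", "description": "Water temperature", "precision": 2},
}


def apply_template(column_names, template):
    """Look up each column name in the template and return per-column metadata."""
    result = {}
    for name in column_names:
        match = template.get(name.strip().lower())
        if match is None:
            # Unmatched columns get placeholder metadata so they can be reviewed later.
            result[name] = {"units": "unspecified", "description": "", "matched": False}
        else:
            result[name] = dict(match, matched=True)
    return result
```

For a routine data source such as a logger with a fixed set of parameters, one template can then document every new file on import.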
Current Applications
- Data Exploration/Analysis by PIs
  - Descriptive Statistics based on attribute metadata
  - Visualization with Interactive Filtering (Frequency Histograms, 2D Plots, Map Plots)
  - Data Reduction/Re-sampling to Provide Customized Data at Various “Scales”
    - Aggregated Statistics
    - Binned Statistics
    - Query/Filtering (sub-selection)
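Binned statistics for re-sampling data to a coarser “scale” can be sketched as below: observations are grouped into fixed-width bins of an independent variable (for example, time in minutes) and summarized per bin. The Python function `binned_stats` and its output layout are hypothetical and are not part of the toolbox.

```python
import math
from statistics import mean


def binned_stats(x, y, bin_width):
    """Group y values into fixed-width bins of x and summarize each non-empty bin."""
    bins = {}
    for xi, yi in zip(x, y):
        if yi is None:
            continue  # skip missing observations
        start = math.floor(xi / bin_width) * bin_width
        bins.setdefault(start, []).append(yi)
    return [
        {"bin_start": start, "n": len(vals), "mean": mean(vals),
         "min": min(vals), "max": max(vals)}
        for start, vals in sorted(bins.items())
    ]
```

For example, calling `binned_stats(minutes, stage, 60)` would reduce sub-hourly readings to one summary row per hour that contains data.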
Current Applications
- Data Harvesting (GCE)
  - USGS Data (WWW real-time, daily, finalized data)
  - Campbell Scientific Data Arrays (post-processing triggered after LoggerNet Retrieval)
  - Sea-Bird Hydrographic Data
- USGS Data Harvesting Service for HydroDB
  - Weekly harvest for 31 stations/7 LTER Sites
  - Automatic Resampling, Unit Conversions, Q/C
Availability
- Description, Screen-shots, Fully-functional Toolbox Available on WWW: http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm
  - Requires MATLAB 5.3, 6.0, 6.5 (any platform)
  - “Public” Version Compiled
  - Source Code Requests Considered on Case-by-Case Basis
Future Development Plans
- EML 2.0 Support
- Metadata-mediated Data Set Integration
  - Unit conversions
  - Re-sampling
- More WWW Interface Development