LaTiS
https://github.com/dlindhol/LaTiS
Doug Lindholm
Laboratory for Atmospheric and Space Physics
University of Colorado Boulder
ESIP – July 8, 2014

Motivation - Get Data Into Analysis Code/Tools
[Diagram labels: Disparate Data, Unified Interface]

LaTiS Server Architecture
[Diagram, left to right]
• Native Data: ASCII, Binary, JDBC, FITS, Web Service, Custom
• Descriptors: TSML (one per data source)
• Adapters
• LaTiS Data Model
• Filters: Subset, Constrain (sst > 20), Convert Units, Missing Values, Derived Products, Custom
• Writers: CSV, JSON, DAP2, Image, code snippet, Custom
• Client Applications: Web Browser, Excel, Analysis Tools, Programs, Web Service

LaTiS Client Options
• Any OPeNDAP client. Available for most programming languages (Python, IDL, Matlab, ...).
• Analysis/visualization tools with built-in OPeNDAP support.
• Web browser: directly enter an HTTP URL query.
• wget, curl: command line tools for making an HTTP request.
• Custom Web Applications (Open Source coming soon) that make AJAX requests to LaTiS to get JSON output and make interactive plots.
• Custom programming APIs that wrap a LaTiS call.
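Every client option above comes down to an HTTP GET on a DAP2-style URL: a dataset name, an output suffix, and a constraint expression made of projected variables followed by selection clauses. The following is a minimal sketch of a thin wrapper that builds such a URL. The base URL, the function name, and the `json` suffix as written here are assumptions for illustration; the slides do not spell out the exact URL layout of a LaTiS server.

```python
from urllib.parse import quote


def build_query_url(base_url, dataset, suffix, variables=(), selections=()):
    """Build a DAP2-style URL: <base>/<dataset>.<suffix>?<projection>&<selection>..."""
    url = f"{base_url.rstrip('/')}/{dataset}.{suffix}"
    # Projection: comma-separated variable names.
    projection = ",".join(variables)
    # Selections: keep comparison operators readable, percent-encode the rest.
    clauses = [quote(s, safe="<>=!~,.-_:") for s in selections]
    if not projection and not clauses:
        return url
    return f"{url}?{'&'.join([projection] + clauses)}"


# Hypothetical server location, for illustration only.
url = build_query_url(
    "http://localhost:8080/latis", "Sunspot_Number", "json",
    variables=["time", "ssn"], selections=["ssn>100"],
)
print(url)
```

The resulting string can be pasted into a browser or handed to wget or curl, which is what makes a wrapper API like this cheap to write in any language.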

Related Technology Comparisons
• OPeNDAP
  – Both implement DAP2 protocol (standard service API)
  – OPeNDAP servers tend to be file centric
  – LaTiS presents “virtual” dataset via aggregation
  – LaTiS aims to be easier to install, configure, and extend
• NetCDF Common Data Model (CDM)
  – Multidimensional array centric
  – Coupled to NetCDF file format
  – Climate and forecast model (simulation) emphasis
• THREDDS Data Server
  – Built around NetCDF CDM
  – Provides OPeNDAP and other service interfaces
• TSDS
  – First generation of LaTiS built on NetCDF CDM
• VisAD
  – Essentially the same logical data model as LaTiS with a clunkier implementation based on old Java capabilities
  – LaTiS is implemented around modern paradigms like Functional Programming

What do I mean by Data Model
• NOT a simulation or forecast (climate model)
• NOT a metadata model (ISO 19115)
• NOT a file format (NetCDF)
• NOT how the data are stored (RDBMS)
• NOT the representation in computer memory (data structure)
• Logical model
• What the data represent, conceptually
• How the data are used

Data Abstractions
• bits: 101101010000010011110011001111111 0
• bytes: 00105e0 e6b0 343b 9c74 0804 e7bc 0804 e7d5 0804
• int, long, float, double, scientific notation (Number): 1, -506376193, 13.52, 0.177483826523, 1.02e-14
• array: 1.2 3.6 2.4 1.7 -3.2

Scientific Data Abstractions
Multi-dimensional Arrays
Key Features:
- Single data type
- Access by index

Relational Database
Table = Relation
Row = Tuple of Attributes, e.g. (0, 3.5, B)

time  flux  class
0     3.5   B
1     4.6   A
2     4.7   A
3     4.1   A
4     3.2   B

Key Features:
- Supports different data types
- Well suited for access by value, e.g. time>2, class=A
But the relation is limited to a sequence of tuples:
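The contrast between the two abstractions, access by index for arrays and access by value for relations, can be shown with the table from this slide. This is only an illustrative sketch using plain Python lists and tuples; it is not how LaTiS or a database stores the data.

```python
# Rows copied from the slide's example table: (time, flux, class).
rows = [
    (0, 3.5, "B"),
    (1, 4.6, "A"),
    (2, 4.7, "A"),
    (3, 4.1, "A"),
    (4, 3.2, "B"),
]

# Array-style access: one homogeneous column, addressed by position.
flux = [row[1] for row in rows]
third_flux = flux[2]

# Relation-style access: mixed types per row, selected by value
# (the slide's example constraints: time>2, class=A).
selected = [row for row in rows if row[0] > 2 and row[2] == "A"]

print(third_flux)
print(selected)
```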

LaTiS Unified Data Model
• Extends the Relational Model to add Functional relationships.
• Represents multi-dimensional domain of data grids.
• Access by value or index.
Independent Variable (domain) -> Dependent Variables (range)
Example: time series of gridded surface winds
Time -> ((Lon, Lat) -> (U, V))

LaTiS Data Model
Only Three Variable Types:
• Scalar: single Variable
• Tuple: group of Variables
• Function: mapping from one Variable to another
Extend to capture higher level, domain specific abstractions
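To make the three variable types concrete, here is a minimal sketch that models them as immutable Python classes and composes the gridded surface wind example, Time -> ((Lon, Lat) -> (U, V)). The class and field names are chosen for illustration and are not the LaTiS API, which is implemented in Scala.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    """A single Variable."""
    name: str


@dataclass(frozen=True)
class Tuple:
    """A group of Variables."""
    variables: tuple


@dataclass(frozen=True)
class Function:
    """A mapping from one Variable (domain) to another (range)."""
    domain: object
    range: object


def describe(v):
    """Render a Variable in the arrow notation used on the slides."""
    if isinstance(v, Scalar):
        return v.name
    if isinstance(v, Tuple):
        return "(" + ", ".join(describe(x) for x in v.variables) + ")"
    rng = describe(v.range)
    if isinstance(v.range, Function):
        rng = f"({rng})"  # parenthesize a nested Function
    return f"{describe(v.domain)} -> {rng}"


# Time series of gridded surface winds, built from the three types only.
grid = Function(
    Tuple((Scalar("Lon"), Scalar("Lat"))),
    Tuple((Scalar("U"), Scalar("V"))),
)
winds = Function(Scalar("Time"), grid)
print(describe(winds))
```

Because a Function's range can itself be a Function, nesting is enough to express a grid at every time step without adding a fourth type.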

Discipline Agnostic Data Access with LaTiS
Philosophy:
• Leave data in their native form
• Expose via a common interface
Software:
• Reusable adapters (software modules) to read common formats, extension points for custom formats
• XML dataset descriptors, map native data model to the LaTiS data model
• Open Source, community
Web services:
• Standard service interfaces, currently OPeNDAP
• Server side processing and output format options

Implementing the Data Model
• The LaTiS Data Model is an abstract representation
• Can be represented several ways
  – UML
  – VisAD grammar
  – Java Interface (no implementation)
• Need an implementation in code
• Scientific data Domain Specific Language (DSL)
  – Expose an API that fits the application domain
• Scala programming language
  – http://www.scala-lang.org/

Why Scala
• Evolution of Java
  – Use with existing Java code
  – Runs on the Java Virtual Machine (JVM)
  – Command line (REPL), script, or compiled
  – Statically typed (safer than dynamic languages)
  – Industrial strength (Twitter, LinkedIn, …)
• Object-Oriented
  – Encapsulation, polymorphism, …
  – Traits: interfaces with implementation, multiple inheritance, mix-ins
• Functional Programming
  – Immutable data structures
  – Functions with no side effects
  – Provable, parallelizable
• Syntactic sugar for Domain Specific Languages
• Operator “overloading”, natural math language for Variables
• Parallel collections

Scala Implementation
• Dataset as a Scala collection
• Functional Programming Paradigms:
  – Function composition over object manipulation
  – Functions as first class citizens
    • a LaTiS Function can be used like a programming function
  – Immutable data structures
  – No side-effects: parallelizable, provable
  – Lazy evaluation: scalable
• Math and resampling mixed in
  – e.g. dataset3 = (dataset1 + dataset2) / 2
• Metadata encapsulated
  – enforce data consistency: unit conversions...
  – track provenance
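The expression dataset3 = (dataset1 + dataset2) / 2 relies on operator overloading over immutable datasets, and "a LaTiS Function can be used like a programming function" suggests a dataset can be evaluated at a domain value. The sketch below illustrates both ideas in Python rather than Scala. The `Dataset` class here is hypothetical and far simpler than the real implementation; in particular it refuses to combine datasets with different domains instead of resampling.

```python
class Dataset:
    """Immutable series of (domain, value) samples with arithmetic mixed in."""

    def __init__(self, samples):
        self._samples = tuple(samples)

    @property
    def samples(self):
        return self._samples

    def _combine(self, other, op):
        # A plain number applies to every sample; a Dataset must share the domain.
        if isinstance(other, Dataset):
            mine = [t for t, _ in self._samples]
            theirs = [t for t, _ in other._samples]
            if mine != theirs:
                raise ValueError("domains differ; resampling would be needed")
            pairs = zip(self._samples, other._samples)
            return Dataset((t, op(a, b)) for (t, a), (_, b) in pairs)
        return Dataset((t, op(a, other)) for t, a in self._samples)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __truediv__(self, other):
        return self._combine(other, lambda a, b: a / b)

    def __call__(self, t):
        # Evaluate the dataset like a function of its domain variable.
        for domain, value in self._samples:
            if domain == t:
                return value
        raise KeyError(t)


dataset1 = Dataset([(0, 1.0), (1, 3.0)])
dataset2 = Dataset([(0, 3.0), (1, 5.0)])
dataset3 = (dataset1 + dataset2) / 2
print(dataset3.samples)
print(dataset3(1))
```

Each operation returns a new `Dataset` and leaves its inputs untouched, which is the property that makes such expressions safe to evaluate in parallel.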

LaTiS Server Implementation
• RESTful web service API (OPeNDAP +)
• Java Servlet, build and deploy war file
• XML dataset descriptor (TSML) for each dataset
  – Specify Adapter to use
  – Map native data source to LaTiS data model
  – Define transformations as Processing Instructions
• Catalog to map dataset names to TSML
• Plugins: implement the Adapter, Filter or Writer interfaces or extend existing ones
• Properties file to map filter and writer names to implementing classes
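The last bullet describes a common plugin pattern: a properties file names the class that implements each filter or writer, and the server loads that class on request. Below is a rough sketch of the pattern in Python. The property key layout (`writer.<name>.class`) is an assumption, and the values point at Python standard library classes purely as stand-ins so the example runs; they are not LaTiS classes.

```python
import importlib

# Hypothetical property keys; the class paths are stand-ins from the standard library.
PROPERTIES = """
# name-to-class mappings
writer.json.class = json.JSONEncoder
writer.csv.class = csv.DictWriter
"""


def parse_properties(text):
    """Parse 'key = value' lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve(props, kind, name):
    """Look up '<kind>.<name>.class' and import the class it names."""
    path = props[f"{kind}.{name}.class"]
    module_name, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


props = parse_properties(PROPERTIES)
writer_class = resolve(props, "writer", "json")
print(writer_class.__name__)
```

With this indirection, adding an output format means adding a class and one line of configuration, without touching the server core.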

Example – Serving an ASCII File

Sunspot data for October 2003
Date column (one record per day): 2003 10 01 through 2003 10 31
Sunspot number column, in extracted order: 75 72 59 60 53 51 50 56 58 50 44 22 12 4 17 24 37 43 43 64 66 72 68 81 89 102 141 167 171 156

TSML Dataset descriptor
<?xml version="1.0" encoding="UTF-8"?>
<tsml>
  <dataset name="Sunspot_Number" history="Read by LaTiS">
    <adapter class="latis.reader.tsml.AsciiAdapter" url="file:/data/latis/ssn.txt"/>
    <time units="yyyy MM dd"/>
    <integer name="ssn"/>
  </dataset>
</tsml>
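The descriptor tells the adapter how to interpret each record: three whitespace-separated fields form a time in "yyyy MM dd" format, and the next field is an integer named ssn. The following sketch shows the kind of parsing this implies. It is not the AsciiAdapter code; the date pattern is translated by hand to Python's `strptime` codes, and the sample records are made up to match the layout rather than taken from the slide.

```python
from datetime import datetime

# Hand translation of the descriptor's "yyyy MM dd" pattern to strptime codes.
TIME_FORMAT = "%Y %m %d"


def read_samples(lines):
    """Yield (time, ssn) pairs from whitespace-delimited 'yyyy MM dd ssn' records."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed records
        time = datetime.strptime(" ".join(fields[:3]), TIME_FORMAT)
        yield time, int(fields[3])


# Made-up records in the same layout, for illustration only.
sample_lines = ["2000 01 01 10", "2000 01 02 20", "", "2000 01 03 30"]
samples = list(read_samples(sample_lines))
for time, ssn in samples:
    print(time.date().isoformat(), ssn)
```

Because the function is a generator, records are parsed only as a consumer asks for them, which fits the lazy evaluation mentioned for the Scala implementation.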

Current Applications
• LASP Interactive Solar Irradiance Data Center (LISIRD)
  – Uses LaTiS to read, subset, reformat data, metadata
  – http://lasp.colorado.edu/lisird/
• Time Series Data Server (TSDS)
  – Common RESTful interface to NASA Heliophysics data
  – http://tsds.net/
Other LASP projects: MMS, MAVEN, database statistics, log files
External users?

Capabilities – Data Reader Modules
• Operational:
  – ASCII (file, web service, system call), binary, NetCDF, Relational database, data “generators”
  – Time Series of scalars, vectors, and spectra
  – Arbitrarily long time series
• Prototyped:
  – HDF, CDF, FITS, GRIB, OPeNDAP (e.g. other LaTiS servers), NoSQL (MongoDB)
  – Nested 2D (gridded) data structures
• Planned:
  – Arbitrarily complex data structures

Capabilities – Data Writer Modules
• Operational:
  – OPeNDAP, ASCII (e.g. csv), binary, JSON, Image (PNG), IDL code, HTML dataset landing page
• Prototyped:
  – NetCDF, HDF, IDL save file, interactive plot
• Planned:
  – GeoTIFF, …

Capabilities – Data Filter Modules
• Operational:
  – Subset, aggregate, stride, thin, replace, integrate, bin average
• Prototyped:
  – FFT, min, max, unique, resampling, unit conversion
• Planned:
  – Coordinate system transformations
  – Make it easier to plug in custom computations
  – Track provenance
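Filters such as subset, stride, and bin average compose naturally when each one takes a stream of samples and returns another. Here is a small sketch of three of them as lazy Python generators. The precise semantics are assumptions made for illustration (for instance, each bin is labeled with the time of its first sample); the slides do not define how the LaTiS filters behave in detail.

```python
from itertools import islice


def subset(samples, predicate):
    """Keep only the samples for which predicate(sample) is true."""
    return (s for s in samples if predicate(s))


def stride(samples, n):
    """Keep every n-th sample, starting with the first."""
    return islice(samples, 0, None, n)


def bin_average(samples, width):
    """Average the values in consecutive groups of `width` samples."""
    it = iter(samples)
    while True:
        chunk = list(islice(it, width))
        if not chunk:
            return
        # Assumption: label each bin with the time of its first sample.
        yield chunk[0][0], sum(v for _, v in chunk) / len(chunk)


# Ten (time, value) samples; nothing is computed until the result is consumed.
source = ((t, float(t)) for t in range(10))
pipeline = bin_average(stride(subset(source, lambda s: s[0] >= 2), 2), 2)
result = list(pipeline)
print(result)
```

Only `list(pipeline)` pulls data through the chain, so the same composition works on an arbitrarily long time series without holding it all in memory.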

Capabilities – Service Interface
• Operational:
  – OPeNDAP
  – Java Servlet, simply deploy war file (Tomcat, Glassfish)
• Prototyped:
  – Authentication
  – Single executable (jetty)
  – THREDDS Data Server (TDS) integration
• Planned:
  – Open Geospatial Consortium (OGC) standards
    • Web Map Server
    • Web Coverage Server

Capabilities - Metadata
• Operational:
  – THREDDS catalog, static XML, browse
• Prototyped:
  – Semantic Web triple store (RDF, SPARQL)
  – Text search (Solr)
  – Modeling RDF triples (subject, predicate, object)
  – Track provenance, record Dataset modifications
• Planned:
  – Serve metadata in various schema (e.g. ISO 19115, SPASE)
  – Unique IDs, Digital Object Identifiers (DOI) for publishing

Other Capabilities
• Operational:
  – Time API with formatting
  – Time conversions with leap seconds
• Prototyped:
  – Caching, improve performance
  – Parallel processing, multi-core
• Planned:
  – Big Data, Hadoop, Map Reduce
  – Workflow integration

Source Code Management – Open Source
• Time Series Server (a.k.a. TSS1)
  – Core of Time Series Data Server (TSDS, tsds.net)
  – Built around Unidata Common Data Model
  – SourceForge: https://sourceforge.net/projects/tsds/
• LaTiS (a.k.a. TSS2)
  – New LaTiS data model, Scala implementation
  – GitHub: https://github.com/dlindhol/LaTiS
  – LASP internal development branch
  – Plug-ins as separate projects (e.g. data collections, math, custom readers/writers, …), keep core small

My Background (i.e. bias)
• Astrophysicist by degree, software engineer by profession
• Data user and provider
• Scientific data applications developer:
  – astrophysics, atmospheric science, space science
• Holy Grail: common data model
• Favorite scientific data models:
  – VisAD (http://www.ssec.wisc.edu/~billh/visad.html)
  – Unidata Common Data Model (http://www.unidata.ucar.edu/software/netcdf-java/CDM/)
  – OPeNDAP (http://www.opendap.org/)

Motivation – Stove Pipes

Single Data Access Interface