Pipelines and Scientific Workflows with Ptolemy II

Deana Pennington, University of New Mexico, LTER Network Office
Shawn Bowers, UCSD, San Diego Supercomputer Center

Analytical Pipelines

[Diagram: analytical pipeline AP0 composed of the steps ASx, TS1, ASy, ASz, TS2, ASr]
• ASx: analysis step in an execution environment (SAS, MATLAB, etc.)
• TS1: transformation step
• Library of analysis steps & analytical pipelines
• Parameters w/ semantics
• Semantic Mediation System: logic rules, ECO query processing
• Ontologies & taxonomies (taxon, parameter)
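The idea of chaining analysis steps (AS) and transformation steps (TS) can be sketched in a few lines of Python. This is only an illustration, not how Ptolemy II represents actors: the step functions `as_x`, `ts_1`, and `as_y` and their behavior are hypothetical stand-ins for the ASx, TS1, and ASy boxes on the slide.

```python
from typing import Any, Callable, List

Step = Callable[[Any], Any]


def run_pipeline(steps: List[Step], data: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step(data)
    return data


# Hypothetical steps; names and behavior are assumptions for illustration.
def as_x(values):
    """Analysis step: drop missing observations."""
    return [v for v in values if v is not None]


def ts_1(values):
    """Transformation step: convert text values to floats."""
    return [float(v) for v in values]


def as_y(values):
    """Analysis step: compute the mean."""
    return sum(values) / len(values)


pipeline = [as_x, ts_1, as_y]
result = run_pipeline(pipeline, ["1", None, "3"])
```

Because every step has the same call shape, a pipeline is just a list, and a step can be reused in another pipeline without modification.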

Scientific Workflows

[Diagram: scientific workflow SW0. A search for relevant data (query) feeds the steps ASx, TS1, ASy, ASz, TS2, ASr; further branches (TS2, ASx, TS1, ASz, TS2, ASr) are marked iterative.]

Benefits

• Reusable analysis steps, pipelines, and workflows
• Formal documentation of methods (output in report format)
• Reproducibility of methods
• Visual creation and communication of methods
• Versioning
• Automated data typing and transformation

Ptolemy II demo

Ecological Niche Modeling

[Diagram, modified from B. Michener: ecological niche modeling links geographic space and ecological space]
• Geographic space: biodiversity information (e.g., data from museum specimens); geospatial and remotely sensed data (precipitation, vegetation class)
• Ecological space: model of niche in ecological dimensions (precipitation, vegetation class)
• Projection back onto geography: native range prediction, invaded range prediction
• Model type:
  • Linear regression (GRASP)
  • Genetic algorithms (GARP)
• Results used for integration with other data realms (e.g., human populations, public health, etc.)

Ecological Niche Models

Occurrence records:
• Excel file: Sample 1, lat, long, presence
• Access file: Sample 3, lat, long, absence
• Sample 2, lat, long, presence

Environmental layers: vegetation cover type, elevation (m), mean annual temperature (C)

Integrated data:
  Presence/absence   Vegetation cover type   Elevation (m)   Mean annual temperature (C)
  P                  juniper                 2200            16
  P                  pinyon                  2320            14
  A                  creosote                1535            22

GARP Native-Species Pipeline (informal)

[Diagram: informal GARP native-species pipeline]
• Data sources: EcoGrid Query steps against EcoGrid DataBases return species pres. & abs. points and env. layers
• Steps shown: Physical Transformation, Layer Integration, Scaling, Sample Data (ports +A1, +A2, +A3), Data Calculation, Validation, Map Generation, Generate Metadata, Archive to EcoGrid
• Intermediate products: integrated layers, training sample, test sample, GARP rule set, model quality parameters
• Outputs: native range prediction map, user-selected prediction maps

GARP Native-Species Pipeline (informal)

[Diagram: the same pipeline, with the callout "We will look at this analytic step"; the following slides examine the Sample Data step]

Sample Data: Basic Input/Output

[Diagram: inputs and outputs of the Sample Data step, with ports +A1, +A2, +A3]
Input:
• Parameters
• Species presence points (DependentVariable coordinates)
• Environmental layers (temp., vegetation, etc.) (IndependentVariable coordinates)
Output (presence under environmental conditions):
• Test sample of conditioned data
• Training sample of conditioned data
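A rough sketch of what a step like Sample Data might do: attach the environmental values to each occurrence point, then split the conditioned records into training and test samples. The function name `sample_data`, the `test_fraction` and `seed` parameters, the dictionary-based layer representation, and the grid coordinates are all assumptions made for this illustration; the slides do not specify how the actual step is implemented.

```python
import random


def sample_data(points, layers, test_fraction=0.5, seed=0):
    """Condition points on layers, then split into (training, test).

    points: list of (x, y, presence) tuples
    layers: dict mapping layer name -> dict mapping (x, y) -> value
    """
    conditioned = []
    for x, y, presence in points:
        row = {"pres": presence}
        for name, grid in layers.items():
            row[name] = grid[(x, y)]  # look up the layer value at this cell
        conditioned.append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(conditioned)
    n_test = int(len(conditioned) * test_fraction)
    return conditioned[n_test:], conditioned[:n_test]


# Hypothetical grid cells; layer values echo the integrated-data example.
points = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
layers = {
    "veg": {(0, 0): "juniper", (0, 1): "pinyon",
            (1, 0): "creosote", (1, 1): "juniper"},
    "temp": {(0, 0): 16, (0, 1): 14, (1, 0): 22, (1, 1): 16},
}
training, test = sample_data(points, layers)
```

Each output row carries the presence flag plus one value per layer, which is the "presence under environmental conditions" shape the slide describes.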

Analytic-Step Abstractions

• Physical Level: an analytic step is a particular software implementation that takes and produces physical data (for example, files)
• Logical Level: defines the structure of input and output (like a database schema)
• Semantic Level: uses ontological information to conceptually define the analytic step (for discovery and integration)

Sample Data: Physical Level

Data as comma-delimited, plain text files, read and written by an actual program that implements Sample Data.
Input:
• Parameters
• 33.454606, 106.789098; 33.454606, 106.789097; …
• 33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
Output:
• 1, 56.25, 0, 20, …, 44; 0, 57.34, 0, 55, …, 14; …
• 0, 77.33, 1, 50, …, 44; 1, 56.01, 0, 55, …, 14; …
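At the physical level a consumer has to know the file layout before it can do anything with the data. The sketch below parses text in the style shown on the slide, assuming that semicolons separate records and commas separate numeric fields; that reading of the slide, and the helper name `parse_records`, are assumptions.

```python
def parse_records(text):
    """Split ';'-separated records into tuples of floats."""
    records = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:  # skip the empty piece after a trailing ';'
            continue
        records.append(tuple(float(field) for field in chunk.split(",")))
    return records


coords = parse_records("33.454606, 106.789098; 33.454606, 106.789097;")
layer = parse_records("33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37")
```

Nothing in the file says that the first input holds coordinate pairs and the second holds coordinate-value triples; that knowledge lives only in the program, which is the gap the logical level is meant to close.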

Logical descriptions

Recall that a schema sets the allowable structure for data.

Employee
  name : string
  age : integer
  ssn : string
  title : string
  salary : int

These tables are not allowable instances of the logical description:

Too few columns, wrong datatypes:
  Smith   40   555-…   5
  Jones   36   555-…   4
  Davis   22   555-…   2

Too many columns:
  Clark   50   555-…   Mgr.    75000   Allen
  Lewis   36   555-…   Sales   40000   Young
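The schema check can be made concrete with a small sketch. The Employee schema is taken from the slide; the `conforms` helper and the mapping of "string" and "integer" onto Python's `str` and `int` are assumptions for illustration, and the elided SSN values are kept elided.

```python
EMPLOYEE_SCHEMA = [
    ("name", str),
    ("age", int),
    ("ssn", str),
    ("title", str),
    ("salary", int),
]


def conforms(row, schema):
    """True if the row has the right number of columns and types."""
    if len(row) != len(schema):
        return False
    return all(isinstance(value, col_type)
               for value, (_, col_type) in zip(row, schema))


too_few = ("Smith", 40, "555-…", 5)
too_many = ("Clark", 50, "555-…", "Mgr.", 75000, "Allen")
valid = ("Clark", 50, "555-…", "Mgr.", 75000)

results = [conforms(r, EMPLOYEE_SCHEMA) for r in (too_few, too_many, valid)]
```

The first row fails on column count (and would fail on type even if padded, since 5 is not a title string), the second fails on column count, and only the third is an allowable instance.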

Sample Data: Logical Level

Input:
• parameters
• matrix[x, y]: 2-dimensional matrix
• list(matrix[x, y, z]): list of 3-dimensional matrices, one matrix per environmental layer
Output:
• sample1(pres, temp, veg, …, zn): relation of n+1 attributes for n environmental layers
• sample2(pres, temp, veg, …, zn)

Why have the Logical Level?

• Data independence: separates how information is represented (text or binary files) from what is represented (a table of integers)
• Reduced application development time: makes information more easily reusable, for example, by other applications or services (with programs for handling the physical/logical level)
• Can help enable integration: explicit knowledge of the structure and types of data can help automate conversion, for …

Choosing a logical representation

Input:
• parameters
• matrix[x, y]: 2-dimensional matrix
• list(matrix[x, y, z]): list of 3-dimensional matrices, one matrix per environmental layer
Output:
• sample1(pres, temp, veg, …, zn): relation of n+1 attributes for n environmental layers
• sample2(pres, temp, veg, …, zn)

Can you see any potential problems with this choice of logical output?

Choosing a logical representation

Input: matrix[x, y], list(matrix[x, y, z])
Output: sample1(pres, z1, z2, …, zn), sample2(pres, z1, z2, …, zn)

The output structure is dependent on the input data…

[Diagram: the Sample Data outputs are connected, with a "?", to a Service whose input is avail(pres, temp, veg, elev)]

Choosing a logical representation

Input: matrix[x, y], list(matrix[x, y, z])
Output: sample1(obs, property, value), sample2(obs, property, value)

[Diagram: the Sample Data outputs feed a Service whose input is avail(obs, property, value)]

Reusability is easier when the logical representation is known ahead of time…
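The difference between the two output choices can be sketched as follows. A wide row has one column per environmental layer, so its width changes with the input; the (obs, property, value) form always has three columns. The `to_triples` helper and the use of 1-based row positions as observation identifiers are assumptions for illustration.

```python
def to_triples(rows):
    """Flatten wide rows into fixed-width (obs, property, value) triples."""
    triples = []
    for obs, row in enumerate(rows, start=1):
        for prop, value in row.items():
            triples.append((obs, prop, value))
    return triples


# Two wide rows with different layer sets, so different widths.
wide = [
    {"pres": 1, "temp": 16, "veg": "juniper"},
    {"pres": 0, "temp": 22, "veg": "creosote", "elev": 1535},
]
triples = to_triples(wide)
```

A downstream service written against the three-column relation keeps working when a layer such as `elev` is added, because a new layer only adds rows, not columns.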

Sample Data: Semantic input/output

[Ontology diagram]
• Concepts: Statistical Context, Statistical Model, Regression Model, Logistic Regression, Statistical Variable, Dependent Variable, Independent Variable, Ecological Model, Biodiversity Model, EcoNiche Model, Regression Based ENM
• Properties: hasContext, hasDepVar, hasIndVar, usesRegressionModel
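One use of semantic annotations is discovery by subsumption: a step annotated with a specific concept also matches a search for any broader concept. The sketch below uses the concept names from the slide, but the subclass links in `SUBCLASS_OF` are an assumed reading of the flattened diagram, not something the source states, and `is_a` is a hypothetical helper.

```python
# ASSUMED hierarchy: child concept -> parent concept.
SUBCLASS_OF = {
    "LogisticRegression": "RegressionModel",
    "RegressionModel": "StatisticalModel",
    "RegressionBasedENM": "EcoNicheModel",
    "EcoNicheModel": "BiodiversityModel",
    "BiodiversityModel": "EcologicalModel",
    "DependentVariable": "StatisticalVariable",
    "IndependentVariable": "StatisticalVariable",
}


def is_a(concept, ancestor):
    """Walk up the subclass chain and report whether ancestor is reached."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


matches_model = is_a("LogisticRegression", "StatisticalModel")
matches_eco = is_a("RegressionBasedENM", "EcologicalModel")
unrelated = is_a("RegressionBasedENM", "StatisticalModel")
```

Under these assumed links, a search for any Ecological Model would find a step annotated as a Regression Based ENM, while a search for a Statistical Model would not, unless a property such as usesRegressionModel were also followed.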

Putting it all together

[Diagram: the Sample Data step with its inputs and outputs annotated at all three levels]
• Semantic: Statistical Context, Statistical Dataset, Grid Coordinate, Dependent Variable, Independent Variable; properties hasContext, hasDepVar, hasIndVar
• Logical: parameters, matrix[x, y], and list(matrix[x, y, z]) as input; sample1(obs, property, value) and sample2(obs, property, value) as output
• Physical: 33.454606, 106.789098; 33.454606, 106.789097; … and 33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; … as input; 1, 56.25, 0, 20, …, 44; 0, 57.34, 0, 55, …, 14; … as output
• Physical = data; logical + semantic = metadata

Domain Workflow

[Diagram: the GARP native-species pipeline presented as the domain workflow]
• Data sources: EcoGrid Query steps against EcoGrid DataBases return species pres. & abs. points and env. layers
• Steps shown: Physical Transformation, Layer Integration, Scaling, Sample Data (ports +A1, +A2, +A3), Data Calculation, Validation, Map Generation, Generate Metadata, Archive to EcoGrid
• Intermediate products: integrated layers, training sample, test sample, GARP rule set, model quality parameters
• Outputs: native range prediction map, user-selected prediction maps

Generic Workflow

[Diagram: the domain workflow generalized]
• Data sources: EcoGrid Query steps against EcoGrid DataBases return occurrence data (binary, categorical, or numeric) and environmental layers
• Steps shown: Physical Transformation, Layer Integration, Scaling, Sample Data (ports +A1, +A2, +A3), Data Calculation, Validation, Map Generation, Generate Metadata, Archive to EcoGrid
• Intermediate products: integrated layers, training sample, test sample, GARP (or other) rule set, model quality parameters
• Outputs: prediction map, user-selected prediction maps

Temperature Interpolation Workflow

[Diagram: the generic workflow applied to temperature interpolation]
• Data sources: EcoGrid Query steps against EcoGrid DataBases return weather station temperature data and environmental layers (elevation, aspect, land cover)
• Steps shown: Physical Transformation, Layer Integration, Scaling, Sample Data (ports +A1, +A2, +A3), Data Calculation, Validation, Map Generation, Generate Metadata, Archive to EcoGrid
• Intermediate products: integrated layers, training sample, test sample, GARP rule set, model quality parameters
• Outputs: prediction map (interpolated temperature grid), user-selected prediction maps

Extending Workflows: Climate

[Diagram: two runs of the pipeline ASx, TS1, ASy, ASz, TS2, ASr, linked by the prediction model from the native area]
• Current environmental layers: prediction maps under current conditions
• Changed environmental layers: prediction maps under changed conditions
• Compare to get predicted effect of environmental change on species
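The final comparison step can be sketched as a cell-by-cell difference between two prediction maps. This assumes both maps are binary presence grids (1 = predicted present, 0 = predicted absent) on the same extent and resolution; the `compare_maps` helper and the tiny example grids are hypothetical.

```python
def compare_maps(current, changed):
    """Cell-wise difference: +1 = range gained, -1 = range lost, 0 = no change."""
    delta = []
    for row_cur, row_chg in zip(current, changed):
        delta.append([chg - cur for cur, chg in zip(row_cur, row_chg)])
    return delta


# Made-up 2 x 3 binary prediction grids.
current = [[1, 1, 0],
           [0, 1, 0]]
changed = [[1, 0, 0],
           [0, 1, 1]]

delta = compare_maps(current, changed)
gained = sum(v == 1 for row in delta for v in row)
lost = sum(v == -1 for row in delta for v in row)
```

Counting gained and lost cells gives a simple summary of the predicted effect; a real workflow would also need the Scaling and Layer Integration steps to guarantee that the two grids line up.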

Extending Workflows: Invasion

[Diagram: two runs of the pipeline ASx, TS1, ASy, ASz, TS2, ASr, linked by the prediction model from the native area]
• Native area occurrence and environmental layers: prediction maps in native area
• Invasion area environmental layers: prediction maps in invasion area

Process

1. Create the domain workflow at a conceptual level
2. Define the physical and logical data types for each step
3. Define the ontological data types for each step, for both the domain and a generic ontology
4. Map the domain workflow to a generic workflow
5. Map the generic workflow to other domain workflows

Exercise

• Divide into two groups (roughly half in each):
  1. Climate change
  2. Invasive species
• Download generic workflow from: ftp://ftp.lternet.edu/pub/outgoing/penningd
• Work on conceptual workflows that:
  1. Reuse the generic pipeline
  2. Extend the generic pipeline
  3. Create new pipelines
• Use PowerPoint, Visio, or paper tablets… your choice!