Module 08 Data Analysis Workflows & Tools

Topics
  • Review of typical data analyses
  • Reproducibility & provenance
  • Overview of workflows
  • Computer-based scientific workflows (SWF)
  • Benefits of SWF
  • Examples of SWF and associated tools
Data Analysis

Learning Objectives
After completing this lesson, the participant will be able to:
  ◦ Understand a subset of typical analyses used
  ◦ Define a workflow
  ◦ Define an SWF
  ◦ Discuss the benefits of workflows in general and SWF in particular
  ◦ Locate resources for using SWF

The Data Life Cycle
[Diagram: cycle of stages: Plan; Acquire & Process; Analyze; Preserve; Publish & Share. This module covers the Analyze stage.]

Data Analyses
  • Conducted via personal computer, grid, or cloud computing
  • Include statistics, model runs, parameter estimations, production of graphs/plots, etc.

Types of Analyses
Processing: subsetting, merging, manipulating
  ◦ Reduction: important for high-resolution datasets
  ◦ Transformation: unit conversions, linear and nonlinear algorithms
Example: raw records
  0711070500276000
  0711070600276000
  0711070700277003
  0711070800282017
  0711070900285000
  0711071000293000
  0711071100301000
  0711071200304000
transformed into a table
  Date        Time    Air temp (C)   Precip (mm)
  11-Jul-07    5:00   27.6           000
  11-Jul-07    6:00   27.6           000
  11-Jul-07    7:00   27.7           003
  11-Jul-07    8:00   28.2           017
  11-Jul-07    9:00   28.5           000
  11-Jul-07   10:00   29.3           000
  11-Jul-07   11:00   30.1           000
  11-Jul-07   12:00   30.4           000
Recreated from Michener & Brunt (2000)
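
The transformation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used by Michener & Brunt: the field layout (MMDDYY, HHMM, temperature in tenths of a degree C, then a three-digit precipitation reading) is an assumption inferred from the example, and the function name `parse_record` is hypothetical. The precipitation field is kept as a raw integer because the source does not state its scaling.

```python
from datetime import datetime

def parse_record(record: str) -> dict:
    """Split one 16-digit record into typed fields.

    Assumed layout (inferred from the example, not documented in the source):
    MMDDYY (6) + HHMM (4) + air temperature in tenths of a degree C (3)
    + precipitation reading (3).
    """
    if len(record) != 16 or not record.isdigit():
        raise ValueError(f"expected 16 digits, got {record!r}")
    timestamp = datetime.strptime(record[:10], "%m%d%y%H%M")
    return {
        "date": timestamp.strftime("%d-%b-%y"),
        "time": f"{timestamp.hour}:{timestamp.minute:02d}",
        "air_temp_c": int(record[10:13]) / 10,
        "precip_raw": int(record[13:16]),  # scaling left uninterpreted
    }

raw = ["0711070500276000", "0711070700277003", "0711070800282017"]
rows = [parse_record(r) for r in raw]
for row in rows:
    print(row)
```

Rejecting records that are not exactly 16 digits is a deliberate choice here: a silently mis-sliced record would produce plausible but wrong values.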

Types of Analyses
Graphical analyses
  ◦ Visual exploration of data: search for patterns
  ◦ Quality assurance: outlier detection
[Figures: scatter plot of August temperatures; box-and-whisker plot of temperature by month (Strasser, unpub. data)]
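
Outlier detection of the kind a box-and-whisker plot makes visible can also be done numerically. The sketch below applies the common 1.5 × IQR whisker rule using Python's standard library; the temperature values are hypothetical (they are not the Strasser data), and the choice of the "inclusive" quartile method is an assumption, since plotting packages differ on this detail.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical temperatures (deg C); 45.0 mimics a sensor error.
temps = [27.6, 27.7, 28.2, 28.5, 29.3, 30.1, 30.4, 45.0]
print(iqr_outliers(temps))
```

A flagged value is a candidate for review, not automatically an error; the decision to keep or drop it belongs in the documented quality-control step.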

Types of Analyses
Statistical analyses
Conventional statistics
  • Traditionally applied to experimental data
  • Examples: ANOVA, MANOVA, linear and nonlinear regression
  • Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance
Descriptive statistics
  • Traditionally applied to observational or descriptive data
  • Examples: diversity indices, cluster analysis, quadrat variance, distance methods, principal component analysis, correspondence analysis
[Figure: example of principal component analysis (Oksanen 2011)]
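
As one concrete instance of the descriptive statistics listed above, the following sketch computes the Shannon diversity index, H' = -Σ p_i ln p_i, where p_i is the proportion of individuals belonging to species i. The slide names diversity indices only in general, so the choice of this particular index is an assumption, and the species counts are hypothetical.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over species with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one individual")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical species counts from two sampling plots.
even_plot = [10, 10, 10, 10]
skewed_plot = [37, 1, 1, 1]
print(round(shannon_index(even_plot), 3))
print(round(shannon_index(skewed_plot), 3))
```

Both plots hold four species and forty individuals, yet the evenly distributed plot scores higher, which is the property the index is meant to capture.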

Types of Analyses
Statistical analyses (continued)
  ◦ Temporal analyses: time series
  ◦ Spatial analyses: for spatial autocorrelation
  ◦ Nonparametric approaches: useful when conventional assumptions are violated or the underlying distribution is unknown
  ◦ Other misc. analyses: risk assessment, generalized linear models, mixed models, etc.
Analyses of very large datasets
  ◦ Data mining & discovery
  ◦ Online data processing

After Data Analysis
  • Re-analysis of outputs
  • Final visualizations: charts, graphs, simulations, etc.
  • Science is iterative: the process that results in the final product can be complex

Reproducibility is at the core of the scientific method
  • Complex process = more difficult to reproduce
  • Good documentation is required for reproducibility
    ◦ Metadata: data about data
    ◦ Process metadata: data about the process used to create, manipulate, and analyze data

Process Metadata
  • Process metadata is information about the process used to get to the data outputs
  • Related concept: data provenance
    ◦ Data provenance is information about the origins of data
    ◦ Good provenance = able to follow data throughout the entire life cycle (collection, organization & quality control, analyses, visualization)
    ◦ Allows for:
        - Replication & reproducibility
        - Analysis for potential defects, errors in logic, statistical errors
        - Evaluation of hypotheses
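
One lightweight way to capture process metadata is to write a small record each time a step runs. The sketch below is an illustration under stated assumptions rather than an established standard: the function `provenance_record`, its field names, and the "-999" missing-value code are all hypothetical. Hashing the input and output lets a later reader confirm they are looking at the same data the step actually used.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(step, input_bytes, output_bytes, parameters):
    """Describe one processing step: what went in, what came out, and how."""
    return {
        "step": step,
        "parameters": parameters,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

raw = b"temp_c\n27.6\n-999\n28.2\n"
# Hypothetical cleaning step: drop lines carrying the missing-value code.
cleaned = b"\n".join(line for line in raw.splitlines() if line != b"-999") + b"\n"
record = provenance_record("drop_missing", raw, cleaned, {"missing_code": "-999"})
print(json.dumps(record, indent=2))
```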

Overview of Workflows
  • A workflow is a formalization of process metadata
  • Includes a precise description of the scientific procedure
  • Includes a conceptualized series of data ingestion, transformation, and analytical steps
  • Three components of a workflow:
    1. Inputs: information or material required
    2. Outputs: information or material produced & potentially used as input in other steps
    3. Transformation rules/algorithms (e.g. analyses)
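
The three components map naturally onto a small data structure. The sketch below is one possible representation, not the design of any particular workflow system: the `Step` class, the `run` function, and the example data are hypothetical. Each step names the inputs it needs and the outputs it produces, and its rule is an ordinary function.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    inputs: list[str]            # names of data items the step requires
    outputs: list[str]           # names of data items the step produces
    rule: Callable[..., tuple]   # transformation applied to the inputs

def run(steps, data):
    """Run steps in order; each output becomes available as a later input."""
    data = dict(data)  # leave the caller's dictionary untouched
    for step in steps:
        results = step.rule(*(data[name] for name in step.inputs))
        data.update(zip(step.outputs, results))
    return data

steps = [
    Step("to_kelvin", ["temp_c"], ["temp_k"], lambda c: ([t + 273.15 for t in c],)),
    Step("mean", ["temp_k"], ["mean_k"], lambda k: (sum(k) / len(k),)),
]
result = run(steps, {"temp_c": [10.0, 20.0]})
print(result["mean_k"])
```

Because the output of `to_kelvin` is looked up by name when `mean` runs, the second component of the definition (outputs potentially used as input in other steps) is explicit in the structure rather than implied by code order alone.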

Overview of Workflows
Simplest form of workflow: flow chart
  Data import into Excel → Quality control & data cleaning → Analysis: mean, SD → Graph production

Workflows in General
Simplest form of workflow: flow chart
Inputs & Outputs
  Temperature data (T) + Salinity data (S) → Data import into Excel → Data in Excel format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
  ◦ Input: raw T and S data
  ◦ Output: data in Excel format
  ◦ Input: data in Excel format

Workflows in General
Simplest form of workflow: flow chart
Transformation Rules
  Temperature data (T) + Salinity data (S) → Data import into Excel → Data in Excel format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
  Transformation rules describe what is done to/with the data to obtain the relevant outputs for publication.
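
The flow chart can also be written as a short script, which makes each transformation rule explicit and repeatable. This is a sketch under stated assumptions: it reads CSV text instead of an Excel file, the sample values and the -999 missing-value code are hypothetical, and graph production is left out because it would need a plotting library.

```python
import csv
import io
import statistics

# Hypothetical raw file contents; -999 stands in for a missing-value code.
RAW = "temp_c,salinity\n27.6,35.1\n-999,35.0\n28.2,34.9\n29.3,-999\n28.5,35.2\n"

def import_data(text):
    """Data import: parse CSV text into rows of floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]

def clean(rows, missing=-999.0):
    """Quality control: drop any row that contains the missing-value code."""
    return [row for row in rows if missing not in row.values()]

def summarize(rows):
    """Analysis: mean and sample standard deviation for each variable."""
    summary = {}
    for name in rows[0]:
        column = [row[name] for row in rows]
        summary[name] = (statistics.mean(column), statistics.stdev(column))
    return summary

summary = summarize(clean(import_data(RAW)))
for name, (mean, sd) in summary.items():
    print(f"{name}: mean={mean:.2f}, SD={sd:.2f}")
```

Dropping a whole row when either variable is missing is only one possible rule; whichever rule is chosen, stating it in code is what turns it into process metadata.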

Overview of Workflows
  • Science is becoming more computationally intensive
  • Most transformations are done via computer programs
  • Sharing workflows benefits science
  • Defining a scientific workflow system makes documenting workflows easier

Scientific Workflows (SWF)
  • A scientific workflow is an “analytical pipeline”
  • Each step can be implemented in different software systems
  • Each step and its parameters/requirements are formally recorded
  • This allows reuse of both individual steps and the overall workflow

Benefits of Scientific Workflows (SWF)
  • Single access point for multiple analyses across software packages
  • Keeps track of analysis and provenance: enables reproducibility
    ◦ Each step & its parameters/requirements formally recorded
  • Workflow can be stored
  • Allows sharing and reuse of individual steps or overall workflow
    ◦ Automate repetitive tasks
    ◦ Use across different disciplines and groups
    ◦ Can run analyses more quickly since not starting from scratch

Example of SWF: Kepler
  • Open-source, free, cross-platform
  • Drag-and-drop interface for workflow construction
  • Steps (analyses, manipulations, etc.) in a workflow are each represented by an “actor”
  • Actors connect via inputs and outputs to form a workflow
  • Possible applications
    ◦ Theoretical models or observational analyses
    ◦ Hierarchical modeling
  • Can have nested workflows
  • Can access data from web-based sources (e.g. databases)
  • Downloads and more information at kepler-project.org

Example of SWF: Kepler
[Screenshot: Kepler interface, with callouts “Drag & drop components from this list” and “Actors in workflow”]

Example of SWF: Kepler
This model shows the solution to the classic Lotka-Volterra predator-prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated, showing both population change and a phase diagram of the dynamics.
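
For readers without Kepler at hand, the same two coupled equations, d(prey)/dt = α·prey − β·prey·predator and d(predator)/dt = δ·prey·predator − γ·predator, can be integrated in plain Python. This sketch does not reproduce Kepler's Continuous Time solver: it swaps in a fixed-step classical Runge-Kutta (RK4) method, and the parameter values and starting populations are hypothetical.

```python
def lotka_volterra(prey, predator, alpha=1.0, beta=0.1, delta=0.075, gamma=1.5):
    """Right-hand side of the two coupled equations (rates of change)."""
    d_prey = alpha * prey - beta * prey * predator
    d_predator = delta * prey * predator - gamma * predator
    return d_prey, d_predator

def rk4_step(prey, predator, dt):
    """Advance both populations by one classical Runge-Kutta step."""
    k1 = lotka_volterra(prey, predator)
    k2 = lotka_volterra(prey + dt / 2 * k1[0], predator + dt / 2 * k1[1])
    k3 = lotka_volterra(prey + dt / 2 * k2[0], predator + dt / 2 * k2[1])
    k4 = lotka_volterra(prey + dt * k3[0], predator + dt * k3[1])
    prey += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    predator += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return prey, predator

def simulate(prey=10.0, predator=5.0, dt=0.01, steps=2000):
    """Return the trajectory as a list of (time, prey, predator) tuples."""
    trajectory = [(0.0, prey, predator)]
    for i in range(1, steps + 1):
        prey, predator = rk4_step(prey, predator, dt)
        trajectory.append((i * dt, prey, predator))
    return trajectory

trajectory = simulate()
for t, prey, predator in trajectory[::400]:
    print(f"t={t:5.2f}  prey={prey:8.3f}  predator={predator:8.3f}")
```

Plotting prey and predator against time gives the population-change view, and plotting predator against prey gives the phase diagram mentioned on the slide.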

Example of SWF: Kepler
[Screenshot: resulting output]

Other SWF Tools: VisTrails
  • Open-source
  • Workflow & provenance management support
  • Geared toward exploratory computational tasks
    ◦ Can manage evolving SWF
    ◦ Maintains detailed history about steps & data
  • www.vistrails.org
[Screenshot example]

Other SWF Tools: myExperiment
  • Social networking site to support scientists who use SWF
  • Allows searching for, sharing, and reuse of SWF
  • Can comment on and discuss contributed SWF
  • Gateway to journals and data repositories
  • www.myexperiment.org

Best Practices for Data Analysis
  • Scientists should document workflows used to create results
    ◦ Data provenance
    ◦ Analyses and parameters used
    ◦ Connections between analyses via inputs and outputs
  • Documentation can be informal (for example, a flowchart) or formal (for example, Kepler software)

Summary
  • Modern science is computer-intensive
    ◦ Heterogeneous data, analyses, software
  • Reproducibility is important
  • Workflows = process metadata
    ◦ Necessary for reproducibility, repeatability, validation
  • There are formal systems for documenting process metadata
    ◦ Enable storage, sharing, visualization, reuse

Resources for Data Analysis & SWF
  • Gil, Y, E Deelman, M Ellisman, T Fahringer, G Fox, D Gannon, C Goble, M Livny, L Moreau, and J Myers. Examining the Challenges of Scientific Workflows. Computer 40: 24–32, 2007.
  • Michener, W, J Beach, M Jones, B Ludäscher, D Pennington, R Pereira, A Rajasekar, and M Schildhauer. A knowledge environment for the biodiversity and ecological sciences. Journal of Intelligent Information Systems 29: 111–126, August 2007.
  • Ludäscher, B, I Altintas, S Bowers, J Cummings, T Critchlow, E Deelman, D De Roure, J Freire, C Goble, M Jones, S Klasky, T McPhillips, N Podhorszki, C Silva, I Taylor, and M Vouk. Scientific Process Automation and Workflow Management. Computational Science Series, Ch 13. Chapman & Hall, Boca Raton, 2009.
  • McPhillips, T, S Bowers, D Zinn, and B Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems 25: 541–551, 2009.
  • Ludäscher, B, I Altintas, C Berkley, D Higgins, E Jaeger-Frank, M Jones, E Lee, J Tao, and Y Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice & Experience 18, 2006.
  • Michener, W, and J Brunt, editors. Ecological Data: Design, Management and Processing. Blackwell Science, 180 p, 2000.

What did you learn?

1. A workflow is a formalization of process metadata that includes a precise description of scientific procedures and analytical steps. Which of the following is a component of a workflow?
  ◦ Analyses
  ◦ The input that contains the information
  ◦ The output, which is the information produced
  ◦ All of the above

2. Which of the following is not typically used when analyzing a large amount of data?
  ◦ Data mining and discovery
  ◦ Grid computing
  ◦ Spatial analyses
  ◦ Pattern searching and decision trees

3. After data analysis, outputs can be generated as ________.
  ◦ Scatter plots
  ◦ Box-and-whisker plots
  ◦ Plots that show you potential data errors
  ◦ All graphical formats and analyses

4. Best practices for data analysis involve documenting workflows to record data provenance, analyses, and parameters used. Workflows are necessary for which of the following?
  ◦ Reproducibility
  ◦ Repeatability
  ◦ Validation
  ◦ All of the above

5. SWF stands for _____ and refers to computer-based formal systems for documenting process metadata.
  ◦ Scientific workflow
  ◦ Systematic workflow
  ◦ Scientific workforce
  ◦ Systematic work information

6. Which of the following is a key benefit of SWF?
  ◦ Each step can be implemented in different software systems, with requirements formally recorded
  ◦ Single access point for multiple analyses
  ◦ Workflow can be stored
  ◦ Allows sharing of individual steps
  ◦ All of the above

7. ______ enables one to follow data throughout the entire data life cycle (collection, organization, quality control, analyses, and visualization).
  ◦ Good organization
  ◦ Good data maintenance
  ◦ Good provenance
  ◦ Good metadata
