Management and Analysis of Long-Term Data, Darren James

- Slides: 23
Management and Analysis of Long-Term Data
Darren James
1. The process of data analysis
2. Managing and handling long-term data
3. Some considerations for designing long-term experiments
The process of scientific research
Cycle diagram: research question → experiment design → experiment → analysis (fit a model to the data) → model diagnostics → new research question → new experiment
The Role of Models
Diagram: Nature, Reality (BIG MESS!): x → ? → y. Model (nice and clean): x → Function → y
“Essentially, all models are wrong, but some are useful.” - George E. P. Box, Empirical Model-Building and Response Surfaces (1987)
• Mathematical model: Replace reality of infinite dimension with a much smaller set of system descriptors: parameters
• Response (y) = Function of input/explanatory variables (x)
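As a rough illustration of "Response (y) = Function of input/explanatory variables (x)", the sketch below fits a straight line by ordinary least squares, so that five observations are summarized by two parameters. The data values and the choice of a straight line are assumptions made for this example only; they are not from the slides.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up explanatory variable (x) and response (y); not real data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)

# Residuals are the part of "reality" that the model leaves unexplained.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

The residuals are what model diagnostics would examine to judge whether a straight line is a useful simplification here.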
What data analysis looks like to most people
Diagram: Data
Charles Babbage (1791-1871), English mathematician; originator of the concept of a programmable computer
“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” - Charles Babbage, Passages from the Life of a Philosopher (1864)
GIGO: “Garbage In, Garbage Out” as opposed to “Garbage In, Gospel Out”
Image: Babbage Difference Engine
What data analysis looks like to most people
Diagram: Data → Magic
What really happens: The data analysis road map
• Raw data
• Compile data, format data
• Explore data, plot data
• Clean and check data, identify outliers and missing values
• Data set for analysis
• Does the experiment follow the design? YES: Run analysis. NO: Transform data or change model.
• Is the model appropriate? YES: Interpret results. NO: Transform data or change model.
Jornada LTER climate data example
http://jornada-www.nmsu.edu/studies/lter/datasets/climate/wthrstn/day/wsday.csv
Data structure: how the data are organized in a computer file
• tabular: summary table
• vertical / horizontal
• univariate / multivariate
• separators
• what does a line represent?
• what does a column represent?
• file type
Best practices: Use the simplest possible data structure. For data entry, organize the data by the observational unit. This maximizes the information contained in the data.
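One way to picture the vertical / horizontal distinction: a horizontal (wide) table holds several observations per line, while a vertical (long) table holds one observation per line. The sketch below converts one to the other. The gauge names, month columns, and values are hypothetical placeholders.

```python
# Hypothetical horizontal (wide) table: one line per gauge, one column per month.
wide_rows = [
    {"Rain_gauge": "G1", "Jan": 0.40, "Feb": 0.10},
    {"Rain_gauge": "G2", "Jan": 0.55, "Feb": 0.00},
]


def wide_to_long(rows, id_column, key_name, value_name):
    """Return one record per (id, column) pair: one line per observation."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output line
            long_rows.append(
                {id_column: row[id_column], key_name: column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, "Rain_gauge", "Month", "Precip")
for record in long_rows:
    print(record)
```

In the vertical form, each line answers "what does a line represent?" with a single observational unit (here, one gauge in one month).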
Metadata is data about data. Minimally, metadata should include:
• Who collected the data?
• Who owns the data?
• What does the data file consist of (variable names, units, etc.)?
• When were the data collected?
• How were the data collected (methods)?
• Why were the data collected (research questions)?
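The minimal questions above can be treated as a checklist. The sketch below stores metadata as a plain dictionary and reports which of the six items are still unanswered. The field names and the example values are assumptions for illustration; they do not follow any particular metadata standard.

```python
# Hypothetical field names, one per question in the minimal list.
REQUIRED_FIELDS = [
    "who_collected",
    "owner",
    "contents",
    "when_collected",
    "methods",
    "purpose",
]


def missing_metadata(metadata):
    """Return the required fields that are absent or left empty."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Placeholder record: "owner" is blank and "purpose" was never filled in.
metadata = {
    "who_collected": "field crew (placeholder)",
    "owner": "",
    "contents": {"Precip": "monthly precipitation, inches"},
    "when_collected": "monthly (placeholder)",
    "methods": "standard rain gauge (placeholder)",
}

print("Missing metadata fields:", missing_metadata(metadata))
```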
Metadata is crucial to the data analysis process
• Are the data appropriate for the research question?
• What is the inference space?
• Can the data be combined with others?
– If so, how?
• Are some observations more reliable than others?
Data entry and error checking
• Data types can be mixed (numeric and character)
• Variables can have the same name
• Missing data can mean different things
• Links can break
• Spaces in names and data can create errors
• Many other possible problems
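Several of the problems listed above can be screened for automatically. The sketch below checks a small CSV text for duplicate variable names, stray spaces, missing-value codes, and non-numeric entries in a numeric column. The sample text, the set of missing-value codes, and the rule that columns named `Precip` must be numeric are all assumptions for this example.

```python
import csv
import io
from collections import Counter

# Assumed missing-value codes; a real data set should document its own.
MISSING_CODES = {"", "NA", ".", "-999"}

# Made-up file contents with several deliberate problems.
raw_text = """Rain_gauge,Precip,Precip
G1,0.40,0.40
G2 ,NA,0.10
G3,trace,0.00
"""


def check_table(text):
    """Return a list of human-readable problems found in CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    problems = []
    for name, count in Counter(header).items():
        if count > 1:
            problems.append(f"duplicate column name: {name}")
    # Line 1 is the header, so data lines start at 2.
    for line_number, row in enumerate(reader, start=2):
        for name, value in zip(header, row):
            cleaned = value.strip()
            if value != cleaned:
                problems.append(f"line {line_number}: stray spaces in {name}")
            if cleaned in MISSING_CODES:
                problems.append(f"line {line_number}: missing code {cleaned!r} in {name}")
            elif name.startswith("Precip"):
                try:
                    float(cleaned)
                except ValueError:
                    problems.append(f"line {line_number}: non-numeric {name} value {cleaned!r}")
    return problems


problems = check_table(raw_text)
for problem in problems:
    print(problem)
```

A screen like this only flags candidates; deciding what a missing code or a character value such as `trace` means still requires the metadata.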
Standard rain gauges: 34 locations across the Jornada Experimental Range
• Data: monthly precipitation, in inches
• Now replaced with automatic tipping bucket rain gauges
Combining data sets by Stacking
For this to work correctly, the variable names in the two data sets must match.
Precip_2013: Year | Rain_gauge | Season | Precip (Year = 2013)
Precip_2014: Year | Rain_gauge | Season | Precip (Year = 2014)
Precip_update: Year | Rain_gauge | Season | Precip (Year = 2013, 2014)
Notice that we have to add a Year variable.
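Stacking can be sketched as appending the rows of one table to another after adding the Year variable and confirming that the variable names match. The variable names follow the slide; the gauge names, seasons, and precipitation values are made up.

```python
# Made-up yearly tables that do not yet carry a Year variable.
precip_2013 = [
    {"Rain_gauge": "G1", "Season": "summer", "Precip": 3.2},
    {"Rain_gauge": "G2", "Season": "summer", "Precip": 2.7},
]
precip_2014 = [
    {"Rain_gauge": "G1", "Season": "summer", "Precip": 4.1},
    {"Rain_gauge": "G2", "Season": "summer", "Precip": 3.8},
]


def stack(tables_by_year):
    """Stack yearly tables, adding Year and checking that variable names match."""
    stacked = []
    expected_names = None
    for year, rows in tables_by_year.items():
        for row in rows:
            names = set(row)
            if expected_names is None:
                expected_names = names
            elif names != expected_names:
                raise ValueError(f"variable names differ in {year}: {sorted(names)}")
            stacked.append({"Year": year, **row})
    return stacked


precip_update = stack({2013: precip_2013, 2014: precip_2014})
for row in precip_update:
    print(row)
```

Raising an error on mismatched names is a design choice for this sketch: a silent stack of `Precip` with, say, `precip_in` would produce a table with gaps that are hard to trace later.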
Combining data sets by Merging
precip: Rain_gauge | Season | Precip
elevation: Rain_gauge | Elevation
merge BY Rain_gauge
precip_elevation: Rain_gauge | Season | Precip | Elevation
• The Rain_gauge names must match in both files
• Using consistent names is essential for maintaining long-term data
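Merging BY Rain_gauge amounts to looking up each precipitation row's gauge in the elevation table. The sketch below also reports gauges that fail to match, which is what happens when names are inconsistent. The gauge names and all values are made up, and the lowercase `g3` is a deliberate inconsistency.

```python
# Made-up tables using the variable names from the slide.
precip = [
    {"Rain_gauge": "G1", "Season": "summer", "Precip": 3.2},
    {"Rain_gauge": "G2", "Season": "summer", "Precip": 2.7},
    {"Rain_gauge": "g3", "Season": "summer", "Precip": 1.9},  # inconsistent name
]
elevation = [
    {"Rain_gauge": "G1", "Elevation": 1320},
    {"Rain_gauge": "G2", "Elevation": 1345},
    {"Rain_gauge": "G3", "Elevation": 1310},
]


def merge_by(left, right, key):
    """Merge two tables on `key`; return merged rows and unmatched key values."""
    lookup = {row[key]: row for row in right}
    merged, unmatched = [], []
    for row in left:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row[key])
        else:
            merged.append({**row, **match})
    return merged, unmatched


precip_elevation, unmatched = merge_by(precip, elevation, "Rain_gauge")
for row in precip_elevation:
    print(row)
print("Unmatched gauges:", unmatched)
```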
What about missing observations in long term data records?
What about gaps in data files?
• Some statistical procedures do not work if any observations are missing
• Gap-filling
– Temporal imputation
– Spatial imputation
– Time + Space imputation
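The first two gap-filling approaches can be sketched in a few lines. Here temporal imputation is taken to mean linear interpolation between the nearest observations in time at the same gauge, and spatial imputation to mean the average of the other gauges at the same time step. Both are simple stand-ins chosen for illustration, not methods prescribed by the slides, and the series are made up.

```python
def temporal_fill(series):
    """Fill interior gaps (None) by linear interpolation in time.

    Gaps before the first or after the last observation are left as None.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def spatial_fill(series_by_gauge, gauge):
    """Fill gaps at one gauge with the mean of the other gauges at that time."""
    filled = []
    for t, value in enumerate(series_by_gauge[gauge]):
        if value is None:
            others = [
                series[t]
                for name, series in series_by_gauge.items()
                if name != gauge and series[t] is not None
            ]
            value = sum(others) / len(others) if others else None
        filled.append(value)
    return filled


# Made-up monthly precipitation (inches); None marks a missing observation.
monthly = {
    "G1": [0.2, None, None, 0.8, None],
    "G2": [0.3, 0.5, 0.4, 0.9, 0.1],
    "G3": [0.1, 0.3, None, 0.7, 0.3],
}

g1_temporal = temporal_fill(monthly["G1"])
g1_spatial = spatial_fill(monthly, "G1")
print("temporal:", g1_temporal)
print("spatial: ", g1_spatial)
```

The two approaches give different values for the same gaps, and only the spatial version can fill the trailing gap, which is one reason to document which method was used.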
Managing Long-term data
• Metadata is critical
• Organization and consistency are critical
• Monitor the data as they are coming in
• Keep data in the finest resolution possible
• Document changes
• Be flexible and realistic
When implementing long-term research, maximize the experimental units
Diagram: Site 1, Site 2, Site 3; treatment 1, treatment 2, treatment 3
• There are 2 replications of treatment at each site
• But there is only 1 replication of treatment in the study
If I want to test for differences between sites, then I must have more than 1 replication of each level of site
Diagram: Grassland 1, Grassland 2, Ecotone 1, Shrubland 1, Shrubland 2; treatment 2, treatment 3
• There are 2 replications of treatment at each site
• There are 2 replications of site in the study
Any sampling within an experimental unit does not gain more information about treatment 1 Shrub 2 Shrub 3 Shrub 4
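A small sketch of why subsampling does not add replication: measurements taken on several shrubs within one experimental unit are averaged to a single value per unit, and replication is counted over units, not over shrubs. The sketch assumes the experimental unit is one treatment plot at one site; that assumption, the labels, and all values are hypothetical.

```python
from statistics import mean

# Made-up records: (site, treatment, shrub, measured value).
measurements = [
    ("Site 1", "treatment 1", "Shrub 1", 4.1),
    ("Site 1", "treatment 1", "Shrub 2", 3.9),
    ("Site 1", "treatment 1", "Shrub 3", 4.3),
    ("Site 1", "treatment 1", "Shrub 4", 4.0),
    ("Site 2", "treatment 1", "Shrub 1", 5.2),
    ("Site 2", "treatment 1", "Shrub 2", 5.0),
    ("Site 2", "treatment 1", "Shrub 3", 4.8),
    ("Site 2", "treatment 1", "Shrub 4", 5.1),
]


def unit_means(records):
    """Average the subsamples (shrubs) within each experimental unit."""
    groups = {}
    for site, treatment, _shrub, value in records:
        groups.setdefault((site, treatment), []).append(value)
    return {unit: mean(values) for unit, values in groups.items()}


def replication(units):
    """Count experimental units per treatment."""
    counts = {}
    for _site, treatment in units:
        counts[treatment] = counts.get(treatment, 0) + 1
    return counts


means = unit_means(measurements)
reps = replication(means)
print("measurements taken:", len(measurements))
print("replication per treatment:", reps)
```

Eight measurements were taken, but the treatment comparison rests on two experimental units; measuring more shrubs per unit would improve the estimate for each unit without changing that count.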
When implementing long-term research, maximize the experimental units
Diagram: Site 1, Site 2, Site 3; treatment 2, treatment 3
• There are 2 replications of treatment at each site
General guidelines for implementing long-term research
• Clearly identify the research question
• Identify the statistical analysis during the experimental design phase
• Maximize the experimental units
– Expect some to be lost during the study
• Be flexible and realistic in scope