Amazon PIRE Data Processing Tutorial
A guide to file management, data formatting, visualization, and analysis
S. C. Wofsy, January 2010, version a-1.0

Introduction
Goals and scope of the tutorial
Scientific data are typically created as, or converted to, electronic files providing information in terms of numerical data complemented by metadata in terms of data descriptors (time, location, units, etc.). Analysis of these data usually proceeds in a series of steps:
1. Acquisition of the data in electronic file format
2. Formatting of the data set to enable it to be read using a data analysis application
3. QA/QC of the data, using visualization tools, statistical tools, etc.
4. Assessment of the data
5. Analysis of the data to provide quantitative information and data products

This tutorial is intended to help students prepare for analysis of the data sets that they will obtain during the PIRE summer field course, and in their studies and careers afterward.
A key principle is that all of steps 1-5 above must be traceable and reproducible. Students may often wish to explore and assess data files using graphical user interfaces (GUIs), and the tutorial will help them develop their skills with GUIs. However, the key principle translates into the following requirement: the entire process must be repeatable starting from the rawest version of the data. We must therefore eschew commonly used spreadsheet programs in favor of much more capable object-oriented data analysis applications. These programs may be driven both by powerful GUIs, which record each command that you execute for future use, and by scripts, which are essentially sets of command-line instructions.
Also, analysis of environmental data will lead us from simple statistical tests and figures to sophisticated, rigorous procedures and carefully customized graphics, providing additional impetus to develop expertise with data manipulation and analysis programs.
Finally, colleagues do not all use the same computer systems or hold the same software licenses. The applications we will use, and our other data products, will be independent of platform and operating system, open source, and free of licensing fees.

Telecons/Webcasts
Professors Wofsy and Saleska and/or the Proctor will lead section-type discussions where students can ask questions and receive assistance. Dates and times are to be announced. The Proctor will offer assistance via email throughout.
All students in the 2010 Amazon PIRE field course are expected to complete the Basic tutorials (R or Octave/Matlab). You are strongly encouraged to attend the Telecons/Webcasts, which will be scheduled in the evening hours to facilitate attendance.
Session 1. Preparing your computer.
Session 2a. Basic R-tutorial, part 1.
Session 3a. Basic R-tutorial, part 2.
Session 4a. Intermediate R-tutorial.
Session 2b. Basic Octave/Matlab tutorial, part 1.
Session 3b. Basic Octave/Matlab tutorial, part 2.
Session 4b. Intermediate Octave/Matlab tutorial.

This tutorial provides training for students to analyze data using the following applications, free for downloading (with proprietary equivalents):
• R (Splus)
• GNU-Octave (Matlab)
Students who know how to use IDL, and have licenses for this application, can use IDL for data analysis.
Important note: Experience has shown that Excel and similar spreadsheet programs cannot be successfully applied to analysis of PIRE data sets. Students can readily ingest data into the spreadsheets, but then find it extremely difficult to clean and assess data, and their manipulations are not traceable. Therefore the use of Excel for data analysis will not be permitted in the PIRE summer study.
The tutorial provides specific help for students using the following operating systems on their computers:
• Microsoft Windows (XP)
• Apple/Mac Leopard and Snow Leopard
• Linux (Ubuntu)
Adjustments may need to be made for other versions of these operating systems.

Preparing your computer
• Download and install an easy-to-use syntax-highlighting text editor. This is required so that you can edit data files and scripts without any changes in file format. These applications also provide color-coded "syntax highlighting" that greatly facilitates writing and editing program scripts.
  Win: notepad++
  Mac: Jedit.app (use the Java installation procedure)
  Linux: gedit
  "Office"-type applications (MS-Word/WordPad, Pages, etc.) cannot be used. Vanilla Notepad (Win) and TextEdit (Mac) are inadequate.
• Download and install your data analysis application; we suggest R unless you are already familiar with Matlab/Octave.
  R (http://www.r-project.org/) [Ubuntu users download from the Repository, r-cran]
  Octave (http:// ) or Matlab (installation disk)
• Win only: Download and install the required file management tools from GnuWin32 (http://gnuwin32.sourceforge.net/packages.html): "coreutils", "which", "gzip", "tar", and "grep".
Important: Matlab or IDL installations requiring a license server will not work in Manaus!

Preparing your computer (continued)
Win only: Make adjustments to your path environment variable
• Start => Control Panel => System => Advanced tab; click "Environment Variables".
• Select "path" under System variables and add the following to the end of the path:
  ;c:\program files\gnuwin32\bin;c:\program files\R\<name R version>\bin;c:\program files\notepad++\bin
  (or similarly if you are using Octave or Matlab)
• To find the full path to your R version, use Windows Explorer to navigate to the application file "R.exe". You can copy the path from the address bar (do not include "R.exe" itself in the path).
• Install a shortcut to cmd.exe (located in C:\WINDOWS\system32) on your desktop or quick launch bar.
• In Windows Explorer, click Tools => Folder Options, uncheck "Hide extensions", and check "Show hidden files and folders".
Some participants who are borrowing an institutional computer may lack the permissions to make these changes. Have your system administrator give you the permissions or, if they cannot, have them make the changes for you.

Preparing your computer (continued)
Linux, Mac: Put the Terminal application on your application bar.
Mac only:
• Install Xcode from your Mac installation CD (to install packages).
• Install MacPorts/DarwinPorts from the Internet (http:// ).
• Install "gfortran".
• Open the Terminal application, and in your home folder (/Users/your_username) add the following lines to the file .profile, using the editing program you have installed:
  defaults write com.apple.finder AppleShowAllFiles TRUE
  killall Finder
Ubuntu only: Install c++, gcc, and gfortran from the repository (may be needed to install some R packages).

Learning to use your computer (as a computer)
Hands-on activities:
• find out how to use a command
• create your data file structure
• list files; locate files (GUI OK)
• find out how big a file is, how many lines it has, etc.
• create and edit a simple file: a data file; an R script
• copy, move and delete files
• search for strings within files

Learning to use your computer (continued)
Find information on how to use a command
Win:
• From the Desktop: look up the help information for cmd; all the commands are listed.
• From within the cmd window, type "<command> /?" or "<command> -h". Examples: mkdir /? and pwd -h
Linux, Mac:
• From the Terminal: "man <command>", e.g. man mkdir

Learning to use your computer (continued)
Data file structure
You will need a convenient place to put your data files and the scripts that will analyze them. Since the path to this folder will have to be specified in your scripts, keep the name short and the location easy to find. Do not include any symbols other than letters, numbers, and "_"; no spaces should be used. Good locations might be c:\pire (Win) or $HOME/pire (Linux/Mac). You will need subfolders for data, scripts, etc.
You may use the GUI to do this (Windows Explorer (Win), Finder (Mac), or Nautilus (Ubuntu)), but this is a good place to start using the command window (Win) or Terminal application (Mac/Linux).
Using the command/terminal window:
Win:
  mkdir c:\pire
  mkdir c:\pire\scripts
  etc.
Linux, Mac:
  mkdir $HOME/pire
  mkdir $HOME/pire/scripts
  etc.
Note: $HOME refers to your home directory on Linux/Mac (type "echo $HOME" from the terminal). Typing "mkdir pire" has the same effect as the above if you are working in the home folder ("cd c:\" or "cd $HOME"; "cd" = change directory).
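The mkdir commands above can be tried safely before touching your real home folder. The following is a minimal sketch for the Linux/Mac Terminal. It builds the same pire tree inside a scratch directory created by mktemp, which stands in for $HOME; the variable name base and the scratch location are assumptions made for illustration and are not part of the course setup.

```shell
# Scratch directory standing in for $HOME (an assumption, so nothing real is touched).
base="$(mktemp -d)"

# Create the top-level folder, then the subfolders for data and scripts.
mkdir "$base/pire"
mkdir "$base/pire/data" "$base/pire/scripts"

# Move into the new folder, then confirm where we are and what it contains.
cd "$base/pire"
pwd
ls
```

On your own machine, replacing "$base" with "$HOME" gives the layout described on this slide.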

Learning to use your computer (continued)
Find properties of the files in a folder
Before leaving c:\ or your home directory, try finding out about the files in the folder.
• ls -al (lists files and their properties; ls -1 gives a short list; ls -alt lists in time order, …)
• wc * (gives the number of lines, number of words, and number of bytes in each file; "*" is a wildcard; wc <filename> reports on only the named file)
Some notable anomalies:
• Linux treats upper and lower case in commands, filenames, etc. as different. Windows ignores upper/lower case. Mac sometimes ignores case and sometimes does not. To make your work transportable, assume upper and lower case filenames are different, but do not give different files the same name with different case.
• Folder names in a path are separated by a forward slash "/" in Linux and Mac, and by a backslash "\" in Windows. Windows also recognizes "/", but inconsistently, and all three recognize "\" as an "escape character" that affects the treatment of the following character (Windows inconsistently).
• Spaces ("<space>") are used to separate parts of a command. To reference a file or folder with a <space> in its name, the name should be surrounded by quotes. Avoid putting spaces in file names.
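As a rough illustration of the listing, counting, and quoting rules on this slide, the sketch below (Linux/Mac Terminal syntax) works in a scratch folder. The names work, sample.txt, and "my file.txt" are hypothetical and chosen only for the demonstration.

```shell
# Work in a scratch folder (an assumption, so no real files are affected).
work="$(mktemp -d)"
cd "$work"

# A three-line sample file, plus an empty file whose name contains a space.
printf 'a 1\nb 2\nc 3\n' > sample.txt
touch "my file.txt"

# Long listing of everything in the folder, including hidden entries.
ls -al

# Lines, words, and bytes for one named file.
wc sample.txt

# A name containing a space must be quoted, or it is read as two arguments.
wc -l "my file.txt"

# Capture just the line count; tr strips the padding that some wc versions add.
nlines=$(wc -l < sample.txt | tr -d ' ')
echo "sample.txt has $nlines lines"
```

The quoting step shows why the slide recommends avoiding spaces in file names: every later command has to remember the quotes.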

Learning to use your computer (continued)
Create and edit simple files from the cmd or Terminal window
Change directory to pire\data (Win) or pire/data (Linux/Mac).
Open your editing application:
  Win: notepad++.exe
  Linux: gedit
  Mac: open /Applications/Jedit.app
Create a file with the following content, and save it into the folder pire/data with file name "testfile.txt":
X Y
1 0.53
2 4.75
3 9.37
4 16.38
5 24.67
6 37.34
7 48.44
8 64.41
9 81.93
10 99.83
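One way to cross-check the file you typed in the editor is to write the same content from the Terminal and count its lines. The sketch below is for Linux/Mac only (the Windows cmd window has no here-document syntax) and writes into a scratch folder rather than your real pire/data; the variable names work and nlines are assumptions for illustration.

```shell
# Scratch copy of the pire/data folder (an assumption; your real folder is untouched).
work="$(mktemp -d)"
mkdir -p "$work/pire/data"

# Write the two-column test file with a here-document.
cat > "$work/pire/data/testfile.txt" <<'EOF'
X Y
1 0.53
2 4.75
3 9.37
4 16.38
5 24.67
6 37.34
7 48.44
8 64.41
9 81.93
10 99.83
EOF

# One header line plus ten data lines are expected.
nlines=$(wc -l < "$work/pire/data/testfile.txt" | tr -d ' ')
echo "testfile.txt has $nlines lines"
```

Running wc -l on the file you saved from the editor should give the same count if it ends with a newline.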

Learning to use your computer (continued)
Also, create the following file with name "testfile0.txt":
x.1 x.2
0 0.9053910
1 -0.4245758
2 2.1638530
3 3.2013392
4 1.0568681
5 2.7682038
6 2.7272512
7 2.8435819
8 6.3021333
9 6.0850179
0 5.9820819
11 6.8738404
12 5.6178844
13 5.9797397
14 6.8131798
15 8.2127355
16 8.5939752
18 8.9382829
19 8.6808897
11 8.9008140
20 10.0755642

Exercise 1. "Learning to use your computer."
1. Make copies of testfile.txt called testfile_copy.txt and dummy.txt using the command cp (in Windows, "copy" will also work). Check the result using your installed special editor (not the default editor). Then remove dummy.txt using the command rm ("del" will also work in Windows). Then rename/move the file testfile_copy.txt to testfile_newcopy.txt using the command mv ("move" will also work in Windows). Make a listing of the contents of your folder using ls -al > filelist.txt. Hand in the electronic files testfile.txt and filelist.txt.
2. The command grep " 6" filename > newfile selects every line in "filename" that contains a space followed by the number 6, and puts the output of grep into a file called newfile. Apply this command to find the lines that have a <space>5 in testfile0.txt, and put them into a file called result.txt. Hints: First execute the command without the "> newfile" part to see the output on the screen; then run the full command and inspect "newfile" to see if it contains the expected results. The symbol ">" directs the output of the command grep into the file "newfile". Hand in result.txt.
3. The command awk '{print $n}' filename > newfile extracts the nth column from the file "filename" and places the output into "newfile". Extract the 2nd column of testfile.txt and put it into a file called testfile_col2.txt. Note: In Windows use " rather than ' in this command. Hand in testfile_col2.txt.
To submit answers, create a zip file (zip myname_ex1.zip <list of files>) and email it to the proctor (email: xxxx 0).
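The search, column-extraction, and redirection commands used in this exercise can be tried first on a small throwaway file. The sketch below (Linux/Mac Terminal syntax) is not a solution to the exercise; the names work, demo.txt, hits.txt, and col2.txt are hypothetical and unrelated to the files you hand in.

```shell
# Work in a scratch folder (an assumption, so the exercise files are untouched).
work="$(mktemp -d)"
cd "$work"

# A small two-column file with a header line.
printf 'id value\n1 6.2\n2 3.6\n3 16.0\n4 6\n' > demo.txt

# Without ">", grep prints the matching lines to the screen.
grep " 6" demo.txt

# With ">", the same output goes into a file instead.
# Only "1 6.2" and "4 6" contain a space immediately followed by 6.
grep " 6" demo.txt > hits.txt

# awk splits each line on whitespace; $2 is the second column.
awk '{print $2}' demo.txt > col2.txt

# Inspect both result files.
cat hits.txt
cat col2.txt
```

Note that "2 3.6" and "3 16.0" are not selected: each contains a 6, but not directly after a space.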

R-tutorial (Octave/Matlab users skip to "Octave Tutorial")
The basic R-tutorial covers the first two chapters (11 pages) of the document Rintro.pdf, "An Introduction to R" by W. N. Venables, D. M. Smith and the R Development Core Team, plus items from some of the other sections listed below.
Getting started: Read Chapters 1 and 2 of "An Introduction to R", being sure to type into your computer each R command shown in the chapter. Take careful note of the results. Learn about, and try out, the command setwd("foldername"). When you complete this reading, save the result in the file pire/scripts/tutorial.r using the savehistory("filename") command. After closing R, open this file with your editor and note the syntax highlighting.
Some notable anomalies:
• Because of the conflict involving the Windows "\" symbol, folder separators in filenames referenced within R are written as two backslashes ("\\") or one forward slash ("/"). Don't mix these in one path/file name.
• When a data frame is created by reading a file into R using "read.table()", columns of alphabetic data are by default made into "factors" (in versions of R before 4.0). This should be prevented using the argument "as.is=T" in the invocation of read.table().
Example:
  Win: read.table("c:/pire/data/testfile0.txt", as.is=T) or read.table("c:\\pire\\data\\testfile0.txt", as.is=T)
  Linux, Mac: read.table("~/pire/data/testfile0.txt", as.is=T)
  (R expands "~" to your home directory; it does not expand "$HOME" inside a quoted file name.)

R-tutorial (continued)
*Basic Tutorial components:
• What is "object-oriented programming"? What are "attributes"?
• Matrix and data frame: creating and manipulating
• Plotting data, exploring data
• Fitting data to a straight line and to a curved line; ordinary regressions and RMA regressions
• Simple statistics on data: means, medians, quantiles, t-test, confidence intervals
• Outliers; time series of data
• Saving your work: objects, commands, functions, graphs
*Intermediate tutorial
• Scripting: what, why, how
*Data sets
• Tree diameter data
• Soil flux chamber data
• Temperature data

R-tutorial (continued)
Exercise 2 *Basic Tutorial
1. Create data frames from the files testfile.txt and testfile0.txt that you made earlier in the tutorial. Hint: use the header=T argument to ensure that the columns will have the colnames attribute.
2. Make graphs of Y vs X and x.2 vs x.1 using the names of the columns in the plotting command. Save the figures as "png" or "jpg" graphics files (use dev.copy() followed by dev.off()).
3. Fit the data to polynomials, e.g. Y = a1 + a2*X + a3*X^2 + …, selecting the order of the polynomial by looking at the graphs you have made. Hint: You will create an object with the command <fitted object name> = lm(Y ~ X + I(X^2) + ..., <maybe other arguments>). Powers must be wrapped in I() inside a model formula; a bare X^2 is interpreted as formula syntax and adds no quadratic term.
4. Plot your best fit curve on the graph of Y vs X. Hint: look at what is accomplished by the function predict().
5. Use summary() to examine the parameters of the fit and their uncertainties.
6. Read in the file T-test-file.txt downloaded from the website. Read about the t-test (http:// …). Examine the paired variables A and B from the file to determine whether their respective means differ in a statistically significant way.

Octave/Matlab-tutorial Functionally equivalent to the R-tutorial

Octave/Matlab-tutorial (continued)

1 Introduction and preliminaries
  1.1 The R environment
  1.2 Related software and documentation
  1.3 R and statistics
  1.4 R and the window system
  1.5 Using R interactively
  1.6 An introductory session
  1.7 Getting help with functions and features
  1.8 R commands, case sensitivity, etc.
  1.9 Recall and correction of previous commands
  1.10 Executing commands from or diverting output to a file
  1.11 Data permanency and removing objects
2 Simple manipulations; numbers and vectors
  2.1 Vectors and assignment
  2.2 Vector arithmetic
  2.3 Generating regular sequences
  2.4 Logical vectors
  2.5 Missing values
  2.6 Character vectors
  2.7 Index vectors; selecting and modifying subsets of a data set
  2.8 Other types of objects
3 Objects, their modes and attributes
  3.1 Intrinsic attributes: mode and length
  3.2 Changing the length of an object
  3.3 Getting and setting attributes
5 Arrays and matrices
  5.1 Arrays
  5.2 Array indexing. Subsections of an array
6 Lists and data frames
  6.1 Lists
  6.2 Constructing and modifying lists
    6.2.1 Concatenating lists
  6.3 Data frames
    6.3.1 Making data frames
    6.3.2 attach() and detach()
    6.3.3 Working with data frames
    6.3.4 Attaching arbitrary lists
    6.3.5 Managing the search path
7 Reading data from files
  7.1 The read.table() function
  7.2 The scan() function
  7.3 Accessing builtin datasets
    7.3.1 Loading data from other R packages
  7.4 Editing data
9 Grouping, loops and conditional execution
  9.1 Grouped expressions
  9.2 Control statements
    9.2.1 Conditional execution: if statements
    9.2.2 Repetitive execution: for loops, repeat and while
10 Writing your own functions
  10.1 Simple examples
12 Graphical procedures
  12.1 High-level plotting commands
    12.1.1 The plot() function
    12.1.2 Displaying multivariate data
    12.1.3 Display graphics
    12.1.4 Arguments to high-level plotting
  12.2 Low-level plotting commands
  optional reading
    12.2.1 Mathematical annotation
  12.3 Interacting with graphics
  12.4 Using graphics parameters
    12.4.1 Permanent changes: The par() function
    12.4.2 Arguments to graphics functions
  12.5 Graphics parameters list
    12.5.1 Graphical elements
    12.5.2 Axes and tick marks
    12.5.3 Figure margins
    12.5.4 Multiple figure environment
  12.6 Device drivers
    12.6.1 PostScript diagrams for typeset documents
    12.6.2 Multiple graphics devices
  12.7 Dynamic graphics
13 Packages
  13.1 Standard packages
  13.2 Contributed packages and CRAN
  13.3 Namespaces
Appendix A A sample session
Appendix B Invoking R
  B.1 Invoking R from the command line
  B.2 Invoking R under Windows
  B.3 Invoking R under Mac OS X
  B.4 Scripting with R
Appendix C The command-line editor
  C.1 Preliminaries
  C.2 Editing actions
  C.3 Command-line editor summary

Default installations of R should have the following packages: base, stats4, graphics, grDevices, and a few others (type "library()" to list what you have). Using "install.packages()", try adding the following packages (some may not install; don't be concerned). Type "help(install.packages)" or "help.search("install packages")" to see how to use the function "install.packages()":
akima      Interpolation of irregularly spaced data
datasets   The R Datasets Package
fields     Tools for spatial data
foreign    Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...
gstat      Geostatistical package
lattice    Lattice Graphics
mapdata    Extra Map Databases
mapproj    Map Projections
maps       Draw Geographical Maps
matlab     MATLAB emulation package
Matrix     Sparse and Dense Matrix Classes and Methods
sp         Classes and methods for spatial data
spatial    Functions for Kriging and Point Pattern Analysis
splines    Regression Spline Functions and Classes
splus2R    S-PLUS functionality missing from R
tseries    Time series analysis and computational finance
utils      The R Utils Package