MLS 3302 Introduction to R and RStudio: Basic Statistical Analyses Using R
R / RStudio
• R is a free software environment for statistical computing and graphics
• R is open source, meaning that anyone can write and publish functions for use in the R environment
• RStudio is an integrated development environment (IDE) for R that can be used across many platforms
  – It allows for the efficient use of R without having to be skilled in the R programming language
Basic R coding
• R requires a small bit of knowledge of computing languages
• The basic syntax is consistent with many statistical and computing packages, but there can be a steep learning curve
• RStudio will allow us to perform many functions without needing to know too much coding
Installing R and RStudio
• First download R by going to: http://www.r-project.org/
• Find the download link on the left of the page: Download, Packages – CRAN
• You must select a Comprehensive R Archive Network (CRAN) mirror, usually something geographically close (e.g. USA – Oregon State University)
• Then simply download and install the application on Mac OS X or Windows
• Installing RStudio is simpler; it can be downloaded from: http://rstudio.org/
Using RStudio
Panes: Workspace/History, Console, Plots/Packages/Files/Help
First click the “File” tab, then mouse over “More” to find “Set working directory”. Using a file menu similar to Windows or OS X, find the folder you wish to work in and set it as the working directory.
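The same step can be done from the console. This is a minimal sketch: `project_dir` is a hypothetical name, and `tempdir()` only stands in for your own project folder so the example runs anywhere.

```r
# Hypothetical folder: replace tempdir() with the path to your own project folder.
project_dir <- tempdir()

# setwd() changes the working directory and returns the previous one invisibly.
old_dir <- setwd(project_dir)

# getwd() reports the directory R is currently working in.
current_dir <- getwd()
print(current_dir)

# Restore the original working directory when finished.
setwd(old_dir)
```

Files are read from and written to the working directory unless a full path is given.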
1. Under the “Workspace” tab find “Import Dataset”.
2. Find the file you wish to import (“testdat.csv”) and import it using this menu.
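The menu import corresponds roughly to a `read.csv()` call. In this sketch the file contents are made-up placeholder values (not the course data), written to a temporary file only so the example is self-contained; the column names group1 and group2 follow the variable names used later in these slides.

```r
# Made-up placeholder values, NOT the real course data.
example_data <- data.frame(group1 = c(2, 4, 3, 5, 1, 6),
                           group2 = c(1, 20, 3, 15, 2, 10))

# Write a stand-in testdat.csv to a temporary folder (hypothetical location).
csv_path <- file.path(tempdir(), "testdat.csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Import the file into a data frame called testdat.
testdat <- read.csv(csv_path)

# Inspect the structure and the first rows of the imported data.
str(testdat)
head(testdat)
```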
When the data are successfully loaded you will see the data object (testdat) in your workspace and a “data window” open above the console. Note: even when using the drop-down menus, the R code will be presented on the “command line” in the “console” window.
Under the “Packages” tab you can install, update, load and unload selected package add-ons. For our use we will need to INSTALL the “ggplot2” and “psych” packages. After installing, simply check the box next to each package to load it.
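The check boxes map onto `install.packages()` and `library()`. The sketch below only reports which packages are missing and leaves the actual download commented out, since installing needs an internet connection and a CRAN mirror; `needed` and `missing_pkgs` are illustrative names.

```r
# Packages used in these slides.
needed <- c("ggplot2", "psych")

# requireNamespace() returns TRUE if a package is installed and can be loaded.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to download from CRAN
}

# Load (attach) every package that is already installed,
# which is what ticking the check box does.
for (pkg in setdiff(needed, missing_pkgs)) {
  library(pkg, character.only = TRUE)
}
```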
Basic R Usage
• The general R command syntax uses the assignment operator '<-' (or '=') to assign the data generated by the command on its right to the object on its left. '=' is the more recently introduced assignment operator. At the top level both work the same way, but only the arrow form can also point to the right ('->'). For consistency one should use only one of them.
object <- function(argument)
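A short illustration of the assignment forms; the object names and values here are arbitrary examples.

```r
# Leftward arrow: the usual way to assign in R.
x <- c(1, 2, 3)

# '=' also assigns when used at the top level.
y = c(4, 5, 6)

# Rightward arrow: the value on the left goes into the object on the right.
c(7, 8, 9) -> z

# The general pattern: object <- function(argument)
x_mean <- mean(x)
print(x_mean)
```

Inside a function call, `=` names an argument instead of creating an object, which is one reason many R users prefer `<-` for assignment.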
You can create new variables in your data frame by assignment. The “$” denotes a new or existing variable in a data frame. In this example we first created a variable (var3) in our test data; var3 is the difference between the group1 and group2 data. We then created a second variable, var4, which is the absolute difference of group1 and group2, using the function abs(group1 - group2). The data frame was then detached because we had attached it before creating our variables. If we hadn’t attached the data frame first, we would have to tell R in which data frame the variables are located, e.g.
testdat_1$var3 <- (testdat_1$group1 - testdat_1$group2)
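The same two variables can be created without attaching the data frame, by naming it explicitly. The values below are made-up placeholders, not the course data.

```r
# Made-up placeholder values standing in for the imported test data.
testdat <- data.frame(group1 = c(2, 4, 3, 5, 1, 6),
                      group2 = c(1, 20, 3, 15, 2, 10))

# var3: difference between the two groups, stored as a new column.
testdat$var3 <- testdat$group1 - testdat$group2

# var4: absolute difference, using abs().
testdat$var4 <- abs(testdat$group1 - testdat$group2)

# The data frame now has four columns.
print(testdat)
```

Naming the data frame each time is more typing than attach()/detach(), but it avoids confusion about which object a variable name refers to.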
The first step in any statistical analysis should be to plot our data. Here we generated a simple scatter plot shown in the lower right window, then created a ggplot object “plot” in our workspace for more advanced graphing.
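A simple scatter plot like the one described can be drawn with base R's `plot()`. The data values, axis labels, and title are illustrative assumptions.

```r
# Made-up placeholder values standing in for the imported test data.
testdat <- data.frame(group1 = c(2, 4, 3, 5, 1, 6),
                      group2 = c(1, 20, 3, 15, 2, 10))

# Base R scatter plot of group2 against group1.
plot(testdat$group1, testdat$group2,
     xlab = "group1", ylab = "group2",
     main = "group1 vs. group2")
```

In RStudio the figure appears in the Plots pane in the lower right window.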
Using our ggplot2 package we created another scatter plot of our data using “geom_point()”
The GREAT thing about ggplot2 is that you can add as many layers as you desire in a simple and linear manner to make your plots look any way you want. Here we created the “plot” ggplot object, then added:
• a line connecting our data (geom_line)
• a reference line with an intercept of 0 and a slope of 1 (geom_abline)
• our data points (geom_jitter)
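The layering described on the last two slides might look like the sketch below. It assumes the ggplot2 package is installed; the data are made-up placeholders, and the object is named `p` here (the slides call it “plot”) to avoid confusion with the base `plot()` function.

```r
library(ggplot2)

# Made-up placeholder values standing in for the imported test data.
testdat <- data.frame(group1 = c(2, 4, 3, 5, 1, 6),
                      group2 = c(1, 20, 3, 15, 2, 10))

# The base ggplot object: data plus the mapping of variables to axes.
p <- ggplot(testdat, aes(x = group1, y = group2))

# A plain scatter plot: one layer of points.
p_points <- p + geom_point()

# Layers are added with '+', one after another.
p_layers <- p +
  geom_line() +                           # line connecting the data
  geom_abline(intercept = 0, slope = 1) + # reference line y = x
  geom_jitter()                           # data points, slightly jittered

# Printing a ggplot object draws it.
print(p_layers)
```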
Calculating descriptive statistics for our groups (data vectors) is simple in R: the describe() function from the psych package does all the work for you! Note: skewness (skew) and kurtosis are reported; these statistics describe the shape of the distribution of our data/variables. In a normal distribution, skewness and kurtosis (reported as excess kurtosis) are both ZERO. A normal distribution, or NORMALITY, is an important assumption of many statistical tests.
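To show what skewness and excess kurtosis measure, here is a base R sketch using simple moment-based formulas. The helper names `moment_skew` and `moment_excess_kurtosis` are hypothetical, the data are made up, and describe() may apply a small-sample adjustment, so its numbers can differ slightly from these.

```r
# Skewness: the average cubed standardized deviation (0 for symmetric data).
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Excess kurtosis: the average fourth-power standardized deviation minus 3,
# so that a normal distribution scores 0.
moment_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Made-up placeholder values.
group2 <- c(1, 20, 3, 15, 2, 10)

mean(group2)
sd(group2)
moment_skew(group2)
moment_excess_kurtosis(group2)

# With the psych package installed, the one-line equivalent is:
# psych::describe(group2)
```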
Performing a t-test is also easy in R using the t.test function. We first created an object to hold the results of our t-test, then displayed those results by typing the name of the object on the command line and pressing return. NOTE: This is a Welch Two Sample t-test; it doesn’t require the assumption of equal variances made by the standard Student’s t-test. Look at the degrees of freedom (df): it is not a whole number.
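A sketch of the call, using made-up placeholder values rather than the course data; `welch_result` is an illustrative object name.

```r
# Made-up placeholder values.
group1 <- c(2, 4, 3, 5, 1, 6)
group2 <- c(1, 20, 3, 15, 2, 10)

# By default t.test() runs Welch's test (var.equal = FALSE).
welch_result <- t.test(group1, group2)

# Typing the object name prints the full result.
welch_result

# Individual pieces can be pulled out of the result object.
welch_result$parameter  # degrees of freedom (generally not a whole number)
welch_result$p.value    # p-value
```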
We can perform a Student’s t-test by adding the var.equal = TRUE argument. Note that the degrees of freedom is now a whole number. In our example the calculated p-value for our t-test is greater than our significance level of 0.05. Thus we have failed to reject the null hypothesis… but why? The means of the two groups seem so different: the mean of group 2 (y) is nearly three times that of group 1 (x)…?
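The Student's version differs only in one argument. Note that R spells the logical value `TRUE` in capitals. The data are again made-up placeholders.

```r
# Made-up placeholder values.
group1 <- c(2, 4, 3, 5, 1, 6)
group2 <- c(1, 20, 3, 15, 2, 10)

# var.equal = TRUE requests the pooled-variance Student's t-test.
student_result <- t.test(group1, group2, var.equal = TRUE)
student_result

# Degrees of freedom are n1 + n2 - 2, a whole number.
student_result$parameter

# Compare the p-value against the 0.05 significance level.
student_result$p.value > 0.05
```

Comparing the group means with their standard deviations is a useful first step when a large difference in means does not reach significance.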
More advanced analysis:
fit <- aov(group1 ~ group2)
Here we performed an analysis of variance (ANOVA) on our two groups. By creating the object “fit”, we can view a number of different statistical results and outputs from one analysis.
• typing “fit” in the command line gives a simple overview of our ANOVA model
• summary(fit) returns a standard ANOVA table
• plot(fit) returns diagnostic ANOVA plots
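A sketch of the call on made-up placeholder data. One caveat that goes beyond the slide: with a numeric variable on the right-hand side, aov() fits a regression-style model of group1 on group2. To compare the two group means, the data are usually stacked into a long format with a grouping factor; that variant is shown second, and the names `long` and `fit_long` are hypothetical.

```r
# Made-up placeholder values.
testdat <- data.frame(group1 = c(2, 4, 3, 5, 1, 6),
                      group2 = c(1, 20, 3, 15, 2, 10))

# The call from the slide, with the data frame named explicitly.
fit <- aov(group1 ~ group2, data = testdat)
fit            # brief overview of the model
summary(fit)   # standard ANOVA table
# plot(fit)    # diagnostic plots (draws several figures)

# Alternative: stack the values and compare means across a grouping factor.
long <- data.frame(
  value = c(testdat$group1, testdat$group2),
  group = factor(rep(c("group1", "group2"), each = nrow(testdat)))
)
fit_long <- aov(value ~ group, data = long)
summary(fit_long)
```

With only two groups, the F statistic from the long-format model equals the square of the Student's t statistic for the same data.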
References for using R:
http://www.statmethods.net/ A great introduction to R
http://wiki.stdout.org/rcookbook/ Another good introductory R site
http://ryouready.wordpress.com/ More advanced R help
http://had.co.nz/ggplot2/ Reference site for the ggplot2 graphical package
http://www.r-project.org/ R project home page
http://rstudio.org/ RStudio home page
http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#R_intro