Introduction to R and R Studio
• Consider the tasks you will need to perform
  ◦ Data collection and editing?
  ◦ Data transformation/calculations?
  ◦ Basic vs. advanced statistics?
  ◦ Making figures?
• A single platform may not be enough, but use as few as possible

Software   Mac   PC   Data entry/Forms   Data editing   Transform data   Basic stats   Adv. stats   Figures
REDCap web +++ - - - + X ++ ++ ++ X X ++ ++ R/RStudio X X - + +++ +++ Epi Info Open Office
Agenda
• About R and R Studio
• Why and when do I use R?
• Key features
• How does it fit in with other tools?
• Basic tasks
• Example: import and analyze the Framingham dataset
About R and R Studio
• Open-source statistical analysis software
• Large development community
  ◦ Many translational analysis packages are available only in R
  ◦ Lots of online help and examples
• https://www.r-project.org (basic R)
  ◦ EVERYTHING requires typed-in commands
• https://www.rstudio.com (R + GUI wrapper)
  ◦ Easier to load and inspect files
  ◦ Assorted convenient extensions
  ◦ Still a lot of typed commands
• Suggest starting with R Studio
Why and when should I use R?
• For any data manipulation for analysis beyond what Epi Info and Calc/Excel can do
  ◦ Complex regression and survival analysis
  ◦ Clustering analysis (all kinds)
  ◦ Time series analysis
  ◦ Translational and 'omic' analysis
• Control over the appearance of analysis output
• When you need to make complex or high-quality figures
  ◦ Heat maps
  ◦ Multiple panes, axes, colors
  ◦ Forest-type and other specialized plot types
Key features of R
• 'Command-line' software
• Can store and analyze very large datasets (100s of millions of rows)
• Extremely flexible in almost all commands
  ◦ Includes plot functions
• Has analysis packages not available for anything else (vs. SAS, Stata, SPSS)
  ◦ Bioconductor!!
• Many public data repositories include R commands to access and analyze their data (e.g. GEO)
• Can automate analysis (e.g. generate many results from a single set of commands)
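As a rough illustration of the automation point above, the sketch below produces a summary for every column from a single set of commands. The data frame `demo` and its values are invented for illustration and are not the slides' dataset.

```r
# Hypothetical example data (not the Framingham dataset).
demo <- data.frame(
  age   = c(39, 46, 48, 61, 46, 43),
  bmi   = c(26.9, 28.7, 25.3, 28.6, 23.1, 30.3),
  sysbp = c(106, 121, 127.5, 150, 130, 180)
)

# One set of commands produces a result for every column:
# sapply() applies the same function to each column in turn.
col_means <- sapply(demo, mean)
col_sds   <- sapply(demo, sd)

# Combine the results into a single summary table and print it.
summary_table <- data.frame(mean = col_means, sd = col_sds)
print(round(summary_table, 2))
```

Adding a column to `demo` adds a row to the summary without changing any of the commands.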
R Studio - features
Excel vs. R Studio
Heat maps
Forest plots
How does R fit in with other tools?
• Can connect to SQL servers and run commands
• Can do complex power calculations
• Usually the FINAL step in the analysis pipeline
  ◦ After data are collected and organized
• Can be used for QC (esp. large datasets)
• Most robust statistical results
• Generates final tables, results, and figures for interpretation and publication from data created elsewhere
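To give a flavor of the power calculations mentioned above, here is a minimal sketch using `power.t.test()` from base R's stats package. The effect size, standard deviation, and target power are illustrative assumptions, not values from the slides.

```r
# Sample size per group for a two-sample t-test.
# delta (difference to detect), sd, and power are illustrative values only.
result <- power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)

# result$n is the required number of subjects in EACH group; round up.
n_per_group <- ceiling(result$n)

print(result)
print(n_per_group)
```

Leaving out `power` and supplying `n` instead makes the same function report the power achieved by a given sample size.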
Why not use R all the time?
• Cannot be used for data entry
• Data manipulation is less intuitive
  ◦ Need some understanding of matrix manipulation
• Harder to share results with others (e.g. compared with emailing an Excel file)
• Takes some time to get comfortable with the commands
  ◦ Slower than Excel/Calc for some simple tasks like viewing results or simple sorting
Summary
• R is a powerful analysis package that can perform statistics, make figures, and manipulate data
• R is controlled using statements with a specific format, which is less intuitive than other tools
• R CANNOT be used as a database or to collect data
Resources for R
• Content Reading
  § http://www.cyclismo.org/tutorial/R
  § Bioinformatics Tzubioinformatics: http://compbio.ucdenver.edu/Hunter_lab/Phang/.../Bioinformatics/bioinformatics.html
• Practice by Doing
  § http://tryr.codeschool.com/
  § http://www.statmethods.net/
  § http://swirlstats.com/
Practical R overview
R statistical package
R Studio (screenshot panels: Datasets, Data points, Plots, Commands)
R packages
• R has a lot of built-in functions
• HOWEVER, the real power is in 'packages'
• devtools/tmap package
Spatial analysis
• Population maps
• Disease incidence
• Environmental variables
Loading packages into R
1. Download the package using the window browser
Loading packages into R
• Using the command line:
  ◦ install.packages("xlsx")
• Packages often use other packages
  ◦ You may need to "install dependencies", i.e. other packages that your package needs
Load library for use
• Once you have installed a package, you must tell R to use it:
• Window version (R Studio):
• Command line: library("xlsx")
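The install and load steps can be combined into one small helper. This is a sketch only: `ensure_package` is a hypothetical name introduced here, and the example calls it on `stats`, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # dependencies = TRUE also installs the packages that pkg needs.
    install.packages(pkg, dependencies = TRUE)
  }
  # character.only = TRUE lets library() accept a name stored in a variable.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# 'stats' ships with R, so this call never triggers a download.
ensure_package("stats")

# For the slides' example you would call: ensure_package("xlsx")
```

Installing is needed once per computer; loading with `library()` is needed once per R session.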
Loading data into R Studio
1. Command line: use a 'read' command
  ◦ mydata <- read.csv('~/Desktop/mydata.csv')
  ◦ myxlsx <- read.xlsx('~/Desktop/mydata.xlsx', sheetIndex = 1)
    • Must load 'library(xlsx)' first
2. R Studio GUI
  ◦ 'Environment' -> 'Import dataset' -> navigate to file -> 'Open'
3. Creates a data 'frame' containing the file data
  ◦ Can have many components of different types
    • E.g. number, string, factor
  ◦ Simplest is a table where columns are called by name
    • fram$id = a list of all ids in the fram table
    • Column MF4 of fram where randid is 2448: fram$MF4[fram$randid==2448]
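The command-line route can be tried without any file on your Desktop. The sketch below writes a tiny CSV to a temporary file and reads it back; the values are invented, and `fram_demo` is a stand-in name rather than the real Framingham data.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The values are made up; only the column names echo the slides.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("randid,sex,bmi",
             "2448,1,26.97",
             "6238,2,28.73",
             "9428,1,25.34"), csv_path)

# read.csv() returns a data frame.
fram_demo <- read.csv(csv_path)

# Columns are accessed by name with $.
all_ids <- fram_demo$randid

# A logical condition inside [ ] picks out the matching rows.
sex_of_2448 <- fram_demo$sex[fram_demo$randid == 2448]

print(class(fram_demo))
print(all_ids)
print(sex_of_2448)
```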
Inspecting your data
• To see column names, use names()
  ◦ E.g. names(fram)
• To see all data in a "frame", type its name
• To see a specific column, use $
  ◦ fram$sex
• To see a few rows, treat it as a matrix
  ◦ 1st 10 rows: fram[1:10, ]
• To see just a few columns, use numbers or names:
  ◦ fram[, 1:5]
  ◦ fram[, c('randid', 'sex', 'diabetes', 'prevmi', 'bmi')]
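The inspection commands above can be tried on a small stand-in data frame. `fram_demo` and its values are invented for illustration; `str()` and `head()` are additional base R helpers not mentioned on the slide.

```r
# Hypothetical stand-in for the 'fram' data frame (values are invented).
fram_demo <- data.frame(
  randid   = c(2448, 6238, 9428, 10552, 11252, 11263),
  sex      = c(1, 2, 1, 2, 2, 2),
  diabetes = c(0, 0, 0, 0, 0, 1),
  bmi      = c(26.97, 28.73, 25.34, 28.58, 23.10, 30.30)
)

names(fram_demo)      # column names
str(fram_demo)        # compact overview: type and first values of each column
head(fram_demo, 3)    # first 3 rows

first_rows <- fram_demo[1:3, ]                  # same rows via matrix-style indexing
some_cols  <- fram_demo[, c("randid", "bmi")]   # columns selected by name
```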
Making a simple summary table
• Simple counts:
  ◦ Single variable: table(fram$sex); table(fram$educ)
  ◦ Two variables: table(fram$educ, fram$sex)
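A small runnable version of these counts, using invented data in place of `fram`. `addmargins()` is an extra base R function, included here as an optional way to add totals.

```r
# Invented data; 'sex' and 'educ' mimic the coding style used in the slides.
demo <- data.frame(
  sex  = c(1, 2, 2, 1, 2, 1, 2, 2),
  educ = c(1, 1, 2, 3, 2, 1, 4, 2)
)

one_way <- table(demo$sex)              # counts for a single variable
two_way <- table(demo$educ, demo$sex)   # rows = educ, columns = sex

print(one_way)
print(two_way)

# addmargins() appends row and column totals to a table.
print(addmargins(two_way))
```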
Make a more complex table
• Use library gmodels
  ◦ library(gmodels)
• CrossTable function
  ◦ CrossTable(fram$MF79, fram$MF71)
• Remove proportion of total subjects
  ◦ CrossTable(fram$MF79, fram$MF71, prop.t=FALSE)
• Remove chi-squared contribution and proportion of column
  ◦ CrossTable(fram$MF79, fram$MF71, prop.t=FALSE, prop.chisq=FALSE, prop.c=FALSE)
• Add chi-square test
  ◦ CrossTable(fram$MF79, fram$MF71, prop.t=FALSE, prop.chisq=FALSE, prop.c=FALSE, chisq=TRUE)
Make a more complex table
• Use library gmodels
  ◦ library(gmodels)
• CrossTable function
  ◦ CrossTable(fram$educ, fram$sex)
• Remove proportion of total subjects
  ◦ CrossTable(fram$educ, fram$sex, prop.t=FALSE)
• Remove chi-squared contribution and proportion of column
  ◦ CrossTable(fram$educ, fram$sex, prop.t=FALSE, prop.chisq=FALSE, prop.c=FALSE)
• Add chi-square test
  ◦ CrossTable(fram$educ, fram$sex, prop.t=FALSE, prop.chisq=FALSE, prop.c=FALSE, chisq=TRUE)
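If gmodels is not installed, a similar result can be pieced together from base R. The sketch below is not CrossTable itself; it uses `table()`, `prop.table()`, and `chisq.test()` on simulated data to get counts, row proportions, and a chi-squared test.

```r
# Invented data standing in for fram$educ and fram$sex.
set.seed(1)
demo <- data.frame(
  educ = sample(1:4, 200, replace = TRUE),
  sex  = sample(1:2, 200, replace = TRUE)
)

counts <- table(educ = demo$educ, sex = demo$sex)

# Row proportions (margin = 1), similar to the row proportions CrossTable prints.
row_props <- prop.table(counts, margin = 1)

# Pearson chi-squared test of independence on the same table.
test <- chisq.test(counts)

print(counts)
print(round(row_props, 3))
print(test)
```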
Make a simple plot
• Basic plot function
  ◦ plot(fram$bmi, fram$sysbp)
• Add chart labels
  ◦ plot(fram$bmi, fram$sysbp, xlab='BMI, kg/m2', ylab='SBP, mm Hg', main='BMI vs. SBP')
• Change the appearance of the points
  ◦ plot(fram$bmi, fram$sysbp, xlab='BMI, kg/m2', ylab='SBP, mm Hg', main='BMI vs. SBP', pch=17, col='blue')
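The plot commands can be run on simulated data to see the effect of each argument. The values below are randomly generated, not patient data, and the trend line is an optional extra that does not appear on the slide.

```r
# Simulated values; not real patient data.
set.seed(42)
demo <- data.frame(bmi = rnorm(100, mean = 26, sd = 4))
demo$sysbp <- 100 + 1.2 * demo$bmi + rnorm(100, sd = 12)

# Scatter plot with labels and blue filled triangles (pch = 17).
plot(demo$bmi, demo$sysbp,
     xlab = "BMI, kg/m2", ylab = "SBP, mm Hg", main = "BMI vs. SBP",
     pch = 17, col = "blue")

# Optional: overlay a least-squares line to show the trend.
fit <- lm(sysbp ~ bmi, data = demo)
abline(fit, col = "red")
```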
Regression example
• R uses 'formula' statements
  ◦ Dependent_var ~ varA + varB + varC…
• Linear regression
  ◦ Dependent variable is continuous (e.g. weight, BP)
  ◦ bmi~sex+educ
  ◦ summary(glm(bmi~as.factor(sex)+educ, data=fram, family=gaussian))
• Logistic regression
  ◦ Dependent variable is dichotomous (Yes/No)
  ◦ anychd~sex+educ+bmi
  ◦ glm(anychd~as.factor(sex)+as.factor(educ)+bmi, data=fram, family=binomial(link="logit"))
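Both models can be tried on simulated data. In the sketch below the variable names follow the slides, but every value is generated at random, so the fitted coefficients carry no clinical meaning.

```r
# Simulated data; variable names follow the slides, values are invented.
set.seed(123)
n <- 300
demo <- data.frame(
  sex  = sample(1:2, n, replace = TRUE),
  educ = sample(1:4, n, replace = TRUE)
)
demo$bmi    <- 25 + 0.8 * (demo$sex == 2) - 0.3 * demo$educ + rnorm(n, sd = 3)
demo$anychd <- rbinom(n, 1, plogis(-3 + 0.08 * demo$bmi))

# Linear regression: continuous outcome, gaussian family.
lin_fit <- glm(bmi ~ as.factor(sex) + educ, data = demo, family = gaussian)
print(summary(lin_fit))

# Logistic regression: 0/1 outcome, binomial family with logit link.
log_fit <- glm(anychd ~ as.factor(sex) + as.factor(educ) + bmi,
               data = demo, family = binomial(link = "logit"))

# Exponentiated coefficients can be read as odds ratios.
print(round(exp(coef(log_fit)), 3))
```

Wrapping a variable in `as.factor()` makes R treat it as categories, so `educ` contributes one coefficient in the linear model and three in the logistic model.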
Practice
• Make a table of educ vs. diabetes
• Make a plot of age vs. sysbp
• Perform a linear regression of systolic blood pressure (sysbp) vs. cursmoke, sex, and age
• Perform a logistic regression of death vs. sex, diabetes, and cursmoke