R and Bioconductor open source software for analysing

R and Bioconductor: open source software for analysing genomics data Belinda Phipson 1 February 2016 1

Who am I? Who I’m not • Trained as a statistician • A geneticist or biologist • Moved into Bioinformatics in February 2007 • A software engineer • User of R for the last 10 years • Someone who is awesome with “computer stuff” • Contributor and maintainer of R Bioconductor packages 2

What is R? “R is a language and environment for statistical computing and graphics” 3

What is R? • R is the open source alternative to the commercially available S software – Free Software Foundation’s GNU General Public License • R is more flexible compared to other statistical software – Users can easily define their own functions and create packages – C, C++ and Fortran code can be linked • Runs on a wide variety of platforms • RStudio: https: //www. rstudio. com/ (a talk on it’s own) 4

R has become very popular IEEE spectrum language popularity rankings 2015 2014 5 http: //r 4 stats. com/articles/popularity/

Most popular tool for “analytics” The competing statistical software tools are very expensive Analytics tools used by respondents to the 2015 Rexer Analytics Survey. In this view, each respondent was free to check multiple tools. 6 http: //r 4 stats. com/articles/popularity/

R is a powerful tool • Command line driven • Empowers people to do reproducible research • Open source (also) means that you can learn from others • Ability to produce publication quality graphics • Access to cutting edge techniques • Access to lots of packages 7

Quick graphics examples 8

Example of a figure made using base R for publication in Genome Biology par(mar=c(6, 6, 5, 3)+0. 1) layout(matrix(c(1, 2, 3, 3, 4, 4), ncol=2, byrow = TRUE)) group<-rep(c(0, 1), c(160, 283)) design<-model. matrix(~group) stripchart(eg. DMDV[2, ]~design[, 2], method="jitter", pch=16, cex=0. 7, col=c(4, 2), group. names=c("Normal", "Cancer"), ylab="M values", vertical=T, cex. axis=1. 5, cex. lab=2) title("(A) Top DM Cp. G", cex. main=2) stripchart(eg. DMDV[1, ]~design[, 2], method="jitter", pch=16, cex=0. 7, col=c(4, 2), group. names=c("Normal", "Cancer"), ylab="M values", vertical=T, cex. axis=1. 5, cex. lab=2) title("(B) Top DV Cp. G", cex. main=2) par(mar=c(6, 6, 5, 1)+0. 1) z<-get. Levene. Residuals(eg. DMDV, design, coef=2) barplot(z$data[2, ], names="", col=c(rep(4, 160), rep(2, 283)), xlab="Samples", ylab="Absolute deviation", cex. lab=2, cex. axis=1. 5, border=NA) text(80, 3, labels="Normal tissue", col=4, cex=2) text(400, 3, labels="Cancer tissue", col=2, cex=2) title("(C) Top DM Cp. G", cex. main=2) barplot(z$data[1, ], names="", col=c(rep(4, 160), rep(2, 283)), xlab="Samples", ylab="Absolute deviation", cex. lab=2, cex. axis=1. 5, border=NA) text(80, 3, labels="Normal tissue", col=4, cex=2) text(380, 3, labels="Cancer tissue", col=2, cex=2) title("(D) Top DV Cp. G", cex. main=2) 9

Example of awesome plot using the “Gviz” library Combines 3 different data types 236 lines of R code Made by Jovana Maksimovic 10

The R community Packages Melbourne Users of R Network Yearly conference Where to get help Slide from Joseph B Rickert 11

R is pretty awesome… (Except not so much for high dimensional data) 12

Example: what’s going on in a tumour? RNA sample Tumour (Lots of steps…) Count data Can this inform us on which drugs to give the patient? 20 000 rows 13

Does the expression level of a gene change between cancer and normal samples? Normal samples DNER 11. 15 10. 01 10. 97 11. 03 Cancer samples 11. 49 5. 53 4. 31 3. 39 3. 51 3. 98 Is this difference statistically significant? 14

The first command I typed: > ? t. test starting httpd help server. . . done 15

16

Everything in R is vectorised* (even scalars) > ? t. test starting httpd help server. . . done > dner TCGA-B 0 -5709 -11 TCGA-CW-5591 -11 TCGA-CW-6087 -11 TCGA-CW-5585 -11 TCGA-CW-5589 -11 11. 150319 10. 006868 10. 970987 11. 029617 11. 492993 TCGA-B 0 -5695 -01 TCGA-B 0 -4710 -01 TCGA-B 4 -5836 -01 TCGA-B 2 -4099 -01 TCGA-B 0 -5083 -01 5. 528162 4. 314067 3. 393850 3. 508280 3. 981251 > group [1] Normal Normal Cancer Cancer Levels: Cancer Normal * Another talk on it’s own 17

> t. test(dner~group) Welch Two Sample t-test data: dner by group t = -14. 864, df = 6. 8485, p-value = 1. 828 e-06 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -7. 869296 -5. 700774 sample estimates: mean in group Cancer mean in group Normal 4. 145122 10. 930157 > t. test(dner[group=="Cancer"], dner[group=="Normal"]) 18

But I want to test 20 000 genes Cancer samples Normal samples Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8 Gene 9 Gene 10 8. 04 11. 32 7. 88 9. 67 10. 93 9. 68 11. 07 6. 57 6. 42 9. 61 7. 75 10. 71 6. 57 10. 66 12. 24 10. 53 10. 23 6. 80 6. 11 9. 41 7. 06 11. 51 7. 46 9. 78 11. 15 9. 21 10. 34 7. 09 5. 96 8. 43 7. 69 11. 50 6. 70 10. 02 9. 81 10. 30 11. 34 6. 74 6. 19 9. 37 7. 49 11. 43 7. 35 10. 43 11. 20 9. 34 11. 01 6. 33 5. 85 9. 70 7. 96 10. 92 8. 46 10. 64 10. 61 10. 87 10. 98 6. 03 5. 76 9. 24 7. 23 10. 09 8. 82 11. 31 9. 61 11. 25 7. 13 6. 28 9. 70 7. 54 10. 88 6. 69 9. 91 11. 32 9. 45 10. 16 7. 01 6. 07 10. 20 8. 07 11. 29 8. 63 10. 46 9. 78 9. 83 10. 92 7. 67 6. 97 8. 47 7. 98 11. 16 6. 77 5. 64 9. 83 9. 41 10. 30 3. 98 7. 19 4. 70 20 000 more rows 19

Many functions in base R are not designed for matrices > t. test(log. Counts[1: 10, ]~group) Welch Two Sample t-test data: n. C[1: 10, ] by group t = -2. 4845, df = 96. 502, p-value = 0. 0147 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1. 6840884 -0. 1882764 sample estimates: mean in group Cancer mean in group Normal 8. 456263 9. 392445 20

You need to do some form of looping (there a number of ways to do this) 21

> tstat <- rep(NA, 10) > tstat [1] NA NA NA Set up empty vectors to store important statistics > Pval <- rep(NA, 10) > Pval [1] NA NA NA > for(i in 1: 10){ + out <- t. test(log. Counts[i, ]~group) + tstat[i] <- out$statistic + Pval[i] <- out$p. value + } > cbind(tstat, Pval) tstat Pval [1, ] 0. 65647357 0. 5299499 [2, ] -1. 65581800 0. 1402197 [3, ] 1. 28976536 0. 2444178 [4, ] -0. 50571919 0. 6380099 [5, ] -0. 96423394 0. 3636724 [6, ] 0. 05651803 0. 9563169 [7, ] -0. 25195820 0. 8074337 [8, ] -0. 51537649 0. 6316880 22

Statistical calculations are based on matrix algebra … and matrix calculations in R are a lot faster than running for loops 23

Matrix operations in R: * vs %*% > mat [, 1] [, 2] [, 3] [, 4] [, 5] [1, ] 1 2 3 4 5 [2, ] 1 2 3 4 5 [3, ] 1 2 3 4 5 [4, ] 1 2 3 4 5 [5, ] 1 2 3 4 5 > mat * mat > mat %*% mat [, 1] [, 2] [, 3] [, 4] [, 5] [1, ] 1 4 9 16 25 [1, ] 15 30 45 60 75 [2, ] 1 4 9 16 25 [2, ] 15 30 45 60 75 [3, ] 1 4 9 16 25 [3, ] 15 30 45 60 75 [4, ] 1 4 9 [4, ] 15 30 45 1660 25 75 [5, ] 1 4 9 [5, ] 15 30 45 1660 25 75 > mat [1, ] [2, ] [3, ] [4, ] [5, ] [, 1] [, 2] [, 3] [, 4] [, 5] 1 2 3 4 5 1 2 3 4 5 Multiplies each row by element together each column and adds elements together 24

Why don’t R functions do this automatically? • This problem has not commonly been encountered in classical statistical applications • The R Core team is reluctant to make “drastic” changes to base R • The Bioconductor project started in 2001 to address the unique issues facing researchers in bioinformatics 25

Bioinformatics and R: Bioconductor “Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. It is based primarily on the R programming language” 26

First Bioconductor paper published in 2004 27

Goals of the Bioconductor project: • fostering collaborative development and widespread use of innovative software • reducing barriers to entry into interdisciplinary scientific research • promoting the achievement of remote reproducibility of research results 28

Bioconductor packages (as of a few days ago) 1. Software (n=1104) - Statistical methods 2. Annotation. Data (n=895) - where does this gene come from in the genome? 3. Experiment. Data (n=257) - packages that contain data used for illustrative purposes - e. g. datasets for books 29

Download stats for Bioconductor packages 30

Testing 20 000 genes is easy (and fast) with the right package Normal samples Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8 Gene 9 Gene 10 8. 04 11. 32 7. 88 9. 67 10. 93 9. 68 11. 07 6. 57 6. 42 9. 61 7. 75 10. 71 6. 57 10. 66 12. 24 10. 53 10. 23 6. 80 6. 11 9. 41 7. 06 11. 51 7. 46 9. 78 11. 15 9. 21 10. 34 7. 09 5. 96 8. 43 7. 69 11. 50 6. 70 10. 02 9. 81 10. 30 11. 34 6. 74 6. 19 9. 37 Cancer samples 7. 49 11. 43 7. 35 10. 43 11. 20 9. 34 11. 01 6. 33 5. 85 9. 70 7. 96 10. 92 8. 46 10. 64 10. 61 10. 87 10. 98 6. 03 5. 76 9. 24 7. 23 10. 09 8. 82 11. 31 9. 61 11. 25 7. 13 6. 28 9. 70 7. 54 10. 88 6. 69 9. 91 11. 32 9. 45 10. 16 7. 01 6. 07 10. 20 8. 07 11. 29 8. 63 10. 46 9. 78 9. 83 10. 92 7. 67 6. 97 8. 47 7. 98 11. 16 6. 77 5. 64 9. 83 9. 41 10. 30 3. 98 7. 19 4. 70 20 000 more rows 31

> > > library(limma) design <- model. matrix(~group) fit <- lm. Fit(log. Counts, design) fit <- e. Bayes(fit, trend=TRUE) top. Table(fit, coef=2) > design Int Normal. Vs. Cancer 1 1 1 2 1 1 3 1 1 4 1 1 5 1 1 6 1 0 7 1 0 8 1 0 9 1 0 10 1 0 > top. Table(fit, coef=2) log. FC Ave. Expr t P. Value adj. P. Val B OVCH 2|341277 9. 169723 -1. 567496 10. 074333 7. 003360 e-07 0. 008273631 6. 168942 SERPINA 5|5104 8. 864641 3. 466592 9. 381868 1. 419199 e-06 0. 008273631 5. 571076 BMP 8 A|353500 -3. 050923 2. 523338 -8. 755693 2. 788463 e-06 0. 008273631 4. 985163 DDB 2|1643 -2. 404746 4. 612741 -8. 652805 3. 126878 e-06 0. 008273631 4. 884508 SGK 2|10110 2. 774716 4. 149838 8. 596655 3. 330052 e-06 0. 008273631 4. 829035 TFAP 2 A|7020 4. 618654 2. 034020 8. 571366 3. 426171 e-06 0. 008273631 4. 803926 AP 1 M 2|10053 3. 060636 4. 677211 8. 278972 4. 783436 e-06 0. 009096147 4. 507849 ST 6 GALNAC 2|10610 2. 998590 2. 368024 8. 231874 5. 051767 e-06 0. 009096147 4. 459151 DNER|92737 6. 813317 1. 705048 8. 109853 5. 825411 e-06 0. 009096147 4. 331658 LOC 91316|91316 -2. 373354 3. 688923 -7. 989076 6. 718301 e-06 0. 009096147 4. 203558 32

Getting a package into Bioconductor 33

Submitting your package • Every package goes through a curation process • Every package must meet certain standards to be accepted – http: //www. bioconductor. org/developers/packageguidelines/ • Every package must have proper documentation – Help pages and user manual is compulsory 34

Every package must compile and build successfully on multiple platforms: • Linux • Windows • Mac. OS (I was getting so frustrated debugging my package I went out and bought a Mac. ) 35

Benefits • Easy to install Bioconductor packages from within R: > source("https: //bioconductor. org/bioc. Lite. R") > bioc. Lite("miss. Methyl") • Software packages are built daily, someone will tell you if your package has broken! • Bioconductor developers mailing list • Convenient to distribute your package • Publishing methods – a Bioconductor package gives you kudos 36

“How I decide when to trust an R package” - Simply Statistics, Jeff Leek Picture from Jeff Leek’s blog post 37

Pros which can also be cons • Anyone can submit a package – Coding style is mostly curated, BAD CODE IS NOT • Dependencies • It can take a while to get through the curation process and have your package accepted Top tip: write a package with a “buddy” 38

Bioconductor community • Yearly Bioconductor conference which highlights current developments – Workshops – Developer Day • Bioconductor courses – all materials available online: https: //www. bioconductor. org/help/course-materials/ • Bioconductor is committed to open source – All licenses are either Artistic 2. 0, GPL 2 or BSD 39

Where to get help • Excellent support site 40

Summary • R and Bioconductor has played a pivotal role in shaping how Bioinformaticians analyse their data • Open source is the key to their success • There is a strong worldwide community of R and Bioconductor users and developers • You never stop learning about cool stuff in R! 41

Acknowledgements MCRI Bioinformatics - Alicia Oshlack - Simon Sadedin - Jovana Maksimovic - Harriet Dashnow - Anthony Hawkins MCRI Statistical Genetics - Ashley Farlow WEHI - Alan Rubin The internet 42