String Manipulations, Web Scraping, Multilevel Modeling, Population Genetics in R
Session 19, 11/5/2018
Nick Brazeau
AKA Nick’s potpourri
Outline
• String manipulations
• Web-scraping
• Multilevel Modeling
• Stop Here
• Population Genetics/Genomics in R
String Manipulations in R stringr
String Manipulations
• Remember that in R all strings are of type “character”
• We can use the stringr package to manipulate, parse, detect, etc. these strings
• Looking forward:
» Much of population-genetic haplotype analysis is just string manipulation
My favorite stringr commands
• str_split (and str_split_fixed)
• str_extract
• str_detect
• Many, many more!
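A minimal sketch of the commands listed above, assuming the stringr package is installed. The sample IDs and their "county_year_haplotype" layout are hypothetical and made up for illustration.

```r
library(stringr)

# Hypothetical sample IDs of the form "<county>_<year>_<haplotype>"
ids <- c("Orange_2018_ATCG", "Durham_2017_ATGG", "Wake_2018_TTCG")

# str_split returns a list with one character vector per input string
parts_list <- str_split(ids, "_")

# str_split_fixed returns a character matrix with a fixed number of columns
parts_mat <- str_split_fixed(ids, "_", n = 3)
counties <- parts_mat[, 1]

# str_extract pulls out the first match of a regular expression (a 4-digit year)
years <- as.integer(str_extract(ids, "[0-9]{4}"))

# str_detect returns TRUE/FALSE for whether the pattern occurs in each string
is_2018 <- str_detect(ids, "2018")
```

str_split_fixed is often the more convenient of the two when every string has the same number of fields, because the matrix columns can be assigned straight into a data frame.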
Plan
• Go to Rmarkdown for some examples
• Vignette on your own (5-10 minutes)
stringr cheat sheet
Web Scraping in R
Web Scraping
• The process of extracting data from the web (i.e. a website, etc.)
• Take HTML code and convert it into a data format we can use
• With great power comes great responsibility (be careful with server load)
• Some sites do not “allow” scraping. Laws behind this are evolving…
• ^^ More of an issue for company espionage, not for querying servers that are clearly meant to be public resources (i.e. NCBI, PlasmoDB…)
Web Scraping in R
Static Websites: rvest
• Hadley Wickham
• Straightforward and “tidy”
• We will do together
Dynamic Websites: RSelenium
• R bindings for Selenium (widely used from Python), now from rOpenSci
• Wildly powerful
• Manipulations of websites, navigation, etc.
• “Selenium is a project focused on automating web browsers”
• I will show an example from my work
Plan 1
• RSelenium example from my work
• rvest tutorial from Hadley Wickham
» Advised to download the “SelectorGadget” tool
» Can also open up the HTML code in Google Chrome using Inspect
» Regardless, read the vignette("selectorgadget") from the rvest package to get a sense of how to select components of a website based on their CSS selectors
• 5-10 minute breakout to do this vignette and tutorial (Lego Movie)
Plan 2: Today’s Main Exercise
1. “Scrape” the average income of each county from Wikipedia.
2. Merge this information into our births dataset.
3. Simulate an income covariate for each individual in our dataset based on the county average with some “noise” added.
» This means we are creating a variable that violates the assumption of independence in regression modeling = PROBLEM (NEXT)
4. Perform multilevel modeling (a solution).
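The first three steps of the exercise can be sketched roughly as follows, assuming rvest (version 1.0 or later, for html_element) is installed. The inline HTML stands in for the Wikipedia page so the example runs offline; the county names, incomes, the `births` data frame, and the noise standard deviation of 5000 are all made up for illustration and are not the class dataset.

```r
library(rvest)

# Toy HTML standing in for the Wikipedia page (names and incomes are made up)
page <- read_html('
  <table class="wikitable">
    <tr><th>County</th><th>Income</th></tr>
    <tr><td>Alpha</td><td>$52,000</td></tr>
    <tr><td>Beta</td><td>$41,500</td></tr>
  </table>')

# Step 1: select the table by CSS selector and convert it to a data frame
income_tab <- as.data.frame(html_table(html_element(page, "table.wikitable")))
# Strip "$" and "," so the income column becomes numeric
income_tab$Income <- as.numeric(gsub("[$,]", "", income_tab$Income))

# Step 2: merge onto a hypothetical births dataset keyed by county
births <- data.frame(id = 1:6, County = rep(c("Alpha", "Beta"), each = 3))
births <- merge(births, income_tab, by = "County")

# Step 3: simulate individual income as the county mean plus normal noise
set.seed(19)
births$income_sim <- rnorm(nrow(births), mean = births$Income, sd = 5000)
```

Because every individual in a county shares the same mean, the simulated incomes are correlated within counties, which is exactly the non-independence that motivates step 4.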
Multilevel Modeling in R
Acknowledgements
• OurCodingClub
• OJ Watson
• ME3
Multilevel Modeling Use Case
• Non-independence among observations
• Repeated measurements (of the same patient/person/individual)
• Sampling from a classroom, village, county, etc.
• Clusters differ systematically
• This means we violate the independence assumption in regression
Funny/Helpful Article: Keep Calm and Learn Multilevel Logistic Modeling: A Simplified Three-Step Procedure Using Stata, R, Mplus, and SPSS
Brief Notes
• Essentially, we can fit a different slope and/or intercept to each cluster (the intracluster, or “random”, effect) while accounting for an overall “fixed” effect
• This means we are now trying to determine the [risk, odds, etc.] for someone with treatment exposure i who is part of cluster j
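A minimal sketch of a random-intercept model, assuming the lme4 package is installed. The data are simulated here, and the variable names (`county`, `x`, `y`) and effect sizes are hypothetical rather than taken from the class exercise.

```r
library(lme4)

set.seed(19)
n_county <- 10
n_per <- 30
county <- factor(rep(seq_len(n_county), each = n_per))

# County-specific intercept shifts: the source of non-independence within clusters
county_shift <- rnorm(n_county, mean = 0, sd = 2)
x <- rnorm(n_county * n_per)
y <- 1 + 0.5 * x + county_shift[as.integer(county)] + rnorm(n_county * n_per)
dat <- data.frame(y, x, county)

# Random-intercept model: one overall ("fixed") slope for x,
# plus a separate intercept deviation for each county
fit <- lmer(y ~ x + (1 | county), data = dat)

fixef(fit)          # overall intercept and slope
ranef(fit)$county   # estimated county-specific intercept deviations
```

Replacing `(1 | county)` with `(1 + x | county)` would also let the slope vary by cluster, and for a binary outcome (risk or odds) lme4's `glmer` with `family = binomial` follows the same formula syntax.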
Brief Notes
Screenshot from OurCodingClub
Brief Notes
• For more in-depth tutorials, check out:
» OurCodingClub
» OJ Watson’s tutorial
» https://www.r-bloggers.com/multilevel-modeling-of-educational-data-using-r-part-1/
» https://www.jaredknowles.com/journal/2013/11/25/getting-started-with-mixed-effect-models-in-r
» Many, many more
Plan • Rmarkdown example
Intro to Pop Gen (for those who are interested)
Population Genetics & Genomics in R
IMO Population Genetics for Molecular Epidemiologists
1. Molecular Surveillance
A. Putative Drug Resistance
2. Identifying Positive Selection
A. Selective Sweeps
3. Population Structure
4. Many more…
Brief Primer on Population Genetics
• Genetics & Genomes
» ATCG
• Neutral Theory
» Most mutations are neutral (noise) and are not biologically significant (Kimura, 1986) but can be tracked.
*Disclaimer: This is not exhaustive. This is just a fraction of the breadth and depth of genetics and genetic theory.
Population Genetics for Molecular Epidemiologists – TYPES OF DATA
1. Haplotypes
A. A consensus sequence/series of bases that are considered known and are inherited together (i.e. a series of SNPs/Ref bases)
2. Single Nucleotide Polymorphisms (SNPs)
A. A base-pair variant at a specific location (locus)
B. An insertion and/or deletion (INDEL) is usually not considered a “SNP”
Population Genetics for Molecular Epidemiologists – DATA GENERATION
[Figure from EMBL-EBI NGS Training; outputs labeled FASTAs (haplotypes) and SNPs]
Population Genetics for Molecular Epidemiologists – TYPES OF DATA
1. Haplotypes
A. Strings (FASTAs)
2. SNPs
A. Stored in a variant call file (VCF)
B. Convert this to a basic tibble (vcfR)
Haplotypes
• Strings
» ATCG
• The APE package is popular for phylogenetic analysis and manipulations
» Very efficient storage (DNA as bytes)
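To make the "haplotypes are strings" point concrete, here is a minimal base-R sketch that finds variable positions in a toy alignment. The haplotype sequences are made up, and base R is used only to keep the example dependency-free; packages such as APE or Biostrings store and process real sequence data far more efficiently.

```r
# Hypothetical aligned haplotypes of equal length (sequences are made up)
haps <- c(ref = "ATCGATCG", h1 = "ATCGTTCG", h2 = "ATGGATCG")

# Split each string into single bases and stack into a sample-by-position matrix
base_mat <- do.call(rbind, strsplit(haps, ""))

# A position is variable (a SNP in this toy alignment) if it has
# more than one distinct base across the haplotypes
snp_pos <- which(apply(base_mat, 2, function(col) length(unique(col)) > 1))

# Number of positions at which two haplotypes differ (Hamming distance)
n_diff <- sum(base_mat["ref", ] != base_mat["h1", ])
```

Here `snp_pos` holds the column indices of the variable sites and `n_diff` counts the mismatches between `ref` and `h1`.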
Haplotype (i.e. strings) Manipulations in R
Biostrings
• Very fast
• S4 objects with numerous layers
• Easy manipulations
IRanges
• Very, very fast
• Allows for identification of specific regions in the genome
Biostrings and IRanges are compatible
http://bioconductor.org/help/course-materials/2010/SeattleIntro/DNAStringsAndRanges.pdf
https://web.stanford.edu/class/bios221/labs/biostrings/lab_1_biostrings.html
Variant Call Files
• Show on board as well
vcfR by Brian Knaus & Niklaus Grunwald
VCF Structure
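As a rough illustration of the VCF layout, the sketch below parses a tiny, made-up VCF with base R only. It is not how vcfR works internally; for real files, vcfR's `read.vcfR()` and `vcfR2tidy()` are the usual route. The sample names and variant records here are hypothetical.

```r
# A tiny, made-up VCF written inline (fields are tab-separated)
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##source=hypothetical_example",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
  "chr1\t101\t.\tA\tT\t50\tPASS\tDP=20\tGT\t0/1\t1/1",
  "chr1\t250\t.\tG\tC\t99\tPASS\tDP=35\tGT\t0/0\t0/1"
)

# Meta region: lines starting with "##"
meta <- vcf_lines[startsWith(vcf_lines, "##")]

# Header plus body: strip the leading "#" from the header line and read as a table
body <- sub("^#", "", vcf_lines[!startsWith(vcf_lines, "##")])
vcf_df <- read.delim(text = body, stringsAsFactors = FALSE)

# Fixed columns describe each variant; columns after FORMAT hold per-sample genotypes
fixed <- vcf_df[, c("CHROM", "POS", "REF", "ALT", "QUAL", "FILTER")]
gt <- vcf_df[, c("sampleA", "sampleB")]
```

The three pieces (`meta`, `fixed`, `gt`) mirror the meta, fixed, and genotype regions that VCF tooling typically exposes.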
poppr by Niklaus Grunwald
1. Open R (RStudio)
2. install.packages("poppr")
3. Online textbook & tutorials by Grunwald Lab
poppr by Niklaus Grunwald
1. Do all parts of this Population genetics and genomics in R primer/textbook (because it is great information/work)!
A. But especially high-yield is Part III
An Example R Package Flow
VCF File (or variant file) → vcfR, then on to:
• PopGenome (GENOME.class): Tajima’s D, sliding windows, site.specific SNPs, neutrality tests, etc.
• APE (DNAbin): phylogenetic trees
• Adegenet (genind/genpop/ade4): PCA & DAPC
• hierfstat (genind/genpop/ade4): Fst & He
• poppr (genind/genpop/ade4)
Others:
1. rehh
2. seqinR
3. Biostrings
4. Etc.
Note, this is not exhaustive. Several programs can perform overlapping functions…
Plan • Rmarkdown example
R Pop Gen Resources
1. poppr Textbook/Primer
2. NESCENT Hackathon
A. https://github.com/NESCent/r-popgen-hackathon
B. popgen.nescent.org/
3. Package-specific vignettes, tutorials, & manuals
4. Molecular Ecology Resources: Review of PopGen in R (http://onlinelibrary.wiley.com/doi/10.1111/men.2017.issue-1/issuetoc)
5. Computational Genome Analysis (textbook) by Richard Deonier, Simon Tavare, Michael Waterman