Open Science, Open Data, Open Source Projects for Undergraduate Research Experiences
Kam D. Dahlquist, Ph.D.
Department of Biology, Loyola Marymount University
BioQUEST/HHMI/CaseNet Summer Workshop
June 13, 2015

Outline
• An open science ecosystem enhances student learning
• Quick example: XMLPipeDB project in a Biological Databases course
• Longer example: GRNmap project in Biomathematical Modeling course
• Potential research projects for BioQUEST participants
• Challenges are also opportunities
  – Computer literacy
  – Data literacy
  – Information literacy

Open Science Ecosystem Open Access (creative commons) Open Pedagogy Open Source Code Open Data Open Science (open process) Reproducible Research Integrity Citizen Science With thanks to John Jungck

Open Science Pedagogy Adds Open Source Values and Tools to Problem Spaces
• Students solve an authentic research problem.
• They investigate large, publicly available datasets.
• They return the products of their research to the scholarly community.
Image: http://www.bioquest.org/bedrock/problem_spaces/

Official Open Source Definition (http://opensource.org)
• Free redistribution
• Source code
• Derived works
• Integrity of the author’s source code
• No discrimination against persons or groups
• No discrimination against fields of endeavor
• Distribution of license
• License must not be specific to a product
• License must not restrict other software
• License must be technology-neutral

Open Source Values Mirror STEM Curricular Reform
Open Source Values | Active Learning Pedagogy | Open Source Practices & Tools
Source code is available, modifiable, and long-lived | Authentic problem to solve with realistic complexity | Central code repository; version control; provenance of code
Accountability to a developer and user community | Participatory and collaborative work; peer review | Task and bug trackers; continuous integration; test-driven workflows
Responsibilities accompany rights | Responsibility and ownership of the learning process | Documentation: inline, user manual, web site, wiki

Pedagogy Implemented on Course Wikis
• Team-taught and cross-listed
  – BIOL/CMSI 367: Biological Databases
    https://xmlpipedb.cs.lmu.edu/biodb/fall2013/index.php/Main_Page
  – BIOL/MATH 388: Biomathematical Modeling
    http://www.openwetware.org/wiki/BIOL398-04/S15
• Single instructor
  – BIOL 368: Bioinformatics Laboratory
    http://www.openwetware.org/wiki/BIOL368/F14
  – BIOL 478: Molecular Biology of the Genome (wet lab, mostly offline)
    data analysis: http://www.openwetware.org/wiki/BIOL478/S15:Microarray_Data_Analysis
• Weekly assignments leading up to final research project
• All projects involve exploration of DNA microarray data

Biological Databases Team Final Project: create a gene database for a bacterial species
http://xmlpipedb.cs.lmu.edu/
[Diagram labels: PostgreSQL Intermediate Database; GenMAPP-compatible Gene Database; Microarray data; Visualize data]

Each Student on the Team is Assigned a Specific Role
• Coder
• Project Manager
• Quality Control
• Data Analysis

Student Products Are Shared with the Scientific Community
http://sourceforge.net/projects/xmlpipedb/

Systems Biology Workflow
DNA microarray data (wet lab-generated or published) → Statistical analysis, clustering, Gene Ontology, term enrichment → Generate gene regulatory network → Modeling dynamics of the network → Visualizing the results → New experimental questions

Central Dogma of Molecular Biology (simplified)
DNA → Transcription → mRNA → Translation → Protein
Freeman (2003)

And Now in the “omics” Era…
Genome → Transcription → Transcriptome → Translation → Proteome
Freeman (2002)

Budding Yeast, Saccharomyces cerevisiae, is an Ideal Model Organism for Systems Biology
• Small genome of ~6000 genes
• Extensive genome-wide datasets readily accessible
• Molecular genetic tools available
Alberts et al. (2004)

Environmental Changes and Stresses
• All organisms must respond to changes in the environment
  – pH
  – oxygen availability
  – pressure
  – osmotic stress
  – temperature (heat and cold)
• Some changes in the environment cause cellular damage and trigger a “stress response”
  – damage from reactive oxygen species
  – damage from UV radiation
  – sudden and/or large change in temperature (increase or decrease)

Cold Shock Is an Environmental Stress that Is Not Well-Studied
• Increases in temperature (heat shock)
  – response very well-characterized
  – proteins denature due to heat
  – induction of heat shock proteins (chaperonins) that assist in protein folding
  – conserved in all organisms (prokaryotes, eukaryotes)
• Decreases in temperature (cold shock)
  – response less well-characterized
  – decrease fluidity of membranes
  – stabilize DNA and RNA secondary structures
  – impair ribosome function and protein synthesis
  – decrease enzymatic activities
  – no equivalent set of cold shock proteins that are conserved in all organisms

Yeast Respond to Cold Shock by Changing Gene Expression
• Cold shock temperature range for yeast is 10-18°C
• Previous studies indicate that the cold shock response can be divided into:
• Late response genes
  – 12 to 60 hours
  – General environmental stress response genes (ESR) are induced
  – Regulated by the Msn2/Msn4 transcription factors
• Early response genes
  – 15 minutes to 2 hours
  – Genes unique to cold shock are induced, such as genes involved in ribosome biogenesis and membrane fluidity
  – Which transcription factors regulate this response is unknown

Transcription Factors Control Gene Expression by Binding to Regulatory DNA Sequences
• Activators increase gene expression
• Repressors decrease gene expression
• Transcription factors are themselves proteins that are encoded by genes

Experimental Design and Methods

Yeast Cells Were Harvested for Microarrays Before, During, and After a Cold Shock and During Recovery

Mixture of labeled cDNA from two samples
• 4 replicates of each experiment with dye swaps
• wt and transcription factor deletion strains

DNA Microarray
• One spot = one gene
• Green = decreased relative to control
• Red = increased
• Yellow = no change in gene expression
Freeman (2002)

Gene Expression Changes Due to Cold Shock Return to Pre-shock Levels During Recovery
[Figure panels: t30/t0 (cold shock); t60/t0 (cold shock); t90/t0 (recovery); t120/t0 (recovery)]
• Four sets of biological replicates were performed
• Dye orientation was swapped for two sets of replicates

Steps Used to Analyze DNA Microarray Data
1. Quantitate the fluorescence signal in each spot
2. Calculate the ratio of red/green fluorescence
3. Log2 transform the ratios
4. Normalize the ratios on each microarray slide
5. Normalize the ratios for a set of slides in an experiment
6. Perform statistical analysis on the ratios
7. Compare individual genes with known data
8. Pattern finding algorithms/clustering
9. Modeling the dynamics of the gene regulatory network
10. Visualizing the results
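Steps 2 through 4 of the analysis can be sketched in a few lines of Python. This is a minimal illustration, not the workflow used in the courses (the slides list Excel and stem as tools). The intensity values are hypothetical, and the normalization shown (centering on the slide mean and scaling by the slide standard deviation) is an assumed method, since the slides do not say which one was used.

```python
import math
from statistics import mean, stdev


def log2_ratios(red, green):
    """Steps 2-3: red/green ratio for each spot, then log2 transform."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def normalize_slide(log_ratios):
    """Step 4 (assumed method): center on the slide mean, scale by the slide SD."""
    m = mean(log_ratios)
    s = stdev(log_ratios)
    return [(x - m) / s for x in log_ratios]


# Hypothetical fluorescence intensities for four spots on one slide.
red = [400.0, 100.0, 250.0, 800.0]
green = [100.0, 400.0, 250.0, 200.0]

ratios = log2_ratios(red, green)      # positive = red (increased), negative = green (decreased)
normalized = normalize_slide(ratios)  # mean 0 and standard deviation 1 within the slide
print(ratios)
print(normalized)
```

After the log2 transform, a fourfold increase and a fourfold decrease sit at the same distance from zero (+2 and -2), which is why the transform comes before normalization and statistics.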

Systems Biology Workflow
DNA microarray data (wet lab-generated or published) → Statistical analysis, clustering, Gene Ontology, term enrichment (tools: Excel, stem) → Generate gene regulatory network → Modeling dynamics of the network → Visualizing the results → New experimental questions

And so on…

Within-strain ANOVA Reveals How Many Genes Had Significant Changes in Expression at Any Timepoint
ANOVA                  wt                    Δgln3
p < 0.05               2378/6189 (38.42%)    1864/6189 (30.11%)
p < 0.01               1527/6189 (24.67%)    1008/6189 (16.29%)
p < 0.001              860/6189 (13.90%)     404/6189 (6.53%)
p < 0.0001             460/6189 (7.43%)      126/6189 (2.04%)
B-H p < 0.05           1656/6189 (26.76%)    913/6189 (14.75%)
Bonferroni p < 0.05    228/6189 (3.68%)      26/6189 (0.42%)
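The B-H (Benjamini-Hochberg) and Bonferroni rows are multiple-testing corrections applied to the per-gene ANOVA p values. The sketch below shows the standard form of both adjustments on a handful of made-up p values; it is an illustration of the corrections in general, not a reproduction of how the table was computed.

```python
def bonferroni(pvalues):
    """Multiply each p value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Standard B-H adjustment: p * n / rank, made monotone from the largest p down."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(pvalues, cutoff=0.05):
    return sum(1 for p in pvalues if p < cutoff)


# Hypothetical p values for five genes.
pvals = [0.001, 0.012, 0.02, 0.045, 0.6]

n_raw = count_significant(pvals)
n_bh = count_significant(benjamini_hochberg(pvals))
n_bonf = count_significant(bonferroni(pvals))
print(n_raw, n_bh, n_bonf)
```

The ordering of the three counts matches the pattern in the table: the unadjusted cutoff passes the most genes, B-H passes fewer, and Bonferroni is the most conservative.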

A Modified T Test Was Used to Determine Significant Changes in Gene Expression at Each Timepoint
wild type: Number of Genes whose Expression Changes
                        Cold Shock                              Recovery
                        t15          t30          t60           t90          t120
Increased p < 0.05      439 (7%)     668 (11%)    609 (10%)     398 (6%)     191 (3%)
Decreased p < 0.05      331 (5%)     517 (8%)     411 (7%)      249 (4%)     59 (1%)
Total p < 0.05          770 (12%)    1185 (19%)   1020 (17%)    647 (10%)    250 (4%)

Short Time Series Expression Miner (stem) Software Clusters Genes with Similar Profiles
[Figure: cluster profiles; y-axis: Expression (log2 fold change); x-axis: Time (minutes)]
Gene Ontology categories assigned to clusters:
• Ribosome biogenesis
• Zinc ion homeostasis
• Hexose transport
• Endomembrane system
• Protein and vesicle transport
• Negative regulation of nitrogen compound process

The Transcription Factor Gln3 Regulates Genes Involved in Nitrogen Metabolism
• Yeast differentiate between preferred and non-preferred nitrogen sources.
• When the nitrogen source is poor, Gln3 localizes to the nucleus and activates genes required to utilize the poor nitrogen source.
• The Δgln3 strain is impaired for growth at cold temperatures:
  – Doubling time at 13°C of 15 hours vs. 8.3 hours for wild type.
• A microarray experiment was performed on the Δgln3 strain.

Gln3 Target Genes Were Extracted from the YEASTRACT Database
37 out of 164 (23%) have significantly different expression profiles in the wild type versus the Δgln3 strain

Systems Biology Workflow
DNA microarray data (wet lab-generated or published) → Statistical analysis, clustering, Gene Ontology, term enrichment → Generate gene regulatory network (tools: YEASTRACT, Excel) → Modeling dynamics of the network → Visualizing the results → New experimental questions

Genome-wide Location Analysis has Determined the Relationships between Transcription Factors and their Target Genes in Yeast
• Does not show whether activation or repression occurs
• Shows topology, but not the behavior of the network over time
• Data found in YEASTRACT database
Lee et al. (2002)

A Transcriptional Network Controlling the Cold Shock Response
Assumptions made in our model:
• Each node represents one gene encoding a transcription factor.
• When a gene is transcribed it is immediately translated into protein; a node represents both the gene and the protein it encodes.
• An edge drawn between two nodes represents a regulation relationship, either activation or repression, depending on the sign of the weight.
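One way to picture these assumptions is a table of signed edge weights keyed by (regulator, target). The sketch below is only an illustration: the node names and weight values are hypothetical, and the data structure is an assumption rather than the format used by GRNmap.

```python
# Hypothetical network: each key is (regulator, target), each value is a signed weight.
# Node names and numbers are made up for illustration.
weights = {
    ("TF_A", "TF_B"): 1.5,
    ("TF_B", "TF_C"): -0.8,
    ("TF_C", "TF_C"): 0.3,   # a node may regulate itself
}


def regulation_type(weight):
    """The sign of the weight decides the kind of regulation."""
    if weight > 0:
        return "activation"
    if weight < 0:
        return "repression"
    return "no regulation"


def describe(edge_weights):
    """One readable line per edge: direction from the sign, strength from the magnitude."""
    return [
        f"{src} -> {dst}: {regulation_type(w)} (magnitude {abs(w)})"
        for (src, dst), w in sorted(edge_weights.items())
    ]


for line in describe(weights):
    print(line)
```

Because a node stands for both the gene and its protein, a single weight per edge is enough to say how one transcription factor influences the expression of another.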

Systems Biology Workflow
DNA microarray data (wet lab-generated or published) → Statistical analysis, clustering, Gene Ontology, term enrichment → Generate gene regulatory network → Modeling dynamics of the network (tool: GRNmap, Windows-only) → Visualizing the results → New experimental questions

GRNmap: Gene Regulatory Network Modeling and Parameter Estimation
• Parameters are estimated from DNA microarray data from wild type and transcription factor deletion strains subjected to cold shock conditions.
• Weight parameter, w, gives the direction (activation or repression) and magnitude of regulatory relationship.

The “Worst” Rate Equation is:

Least Squares Residual Optimization of the 92 Parameters Requires the Use of a Regularization (Penalty) Term
• Plotting the least squares error function showed that not all the graphs had clear minima.
• We added a penalty term so that MATLAB’s optimization algorithm would be able to minimize the function.
• θ is the combined production rate, weight, and threshold parameters.
• α is determined empirically from the “elbow” of the L-curve.
[Figure: L-curve; axis label: Parameter Penalty Magnitude]
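The effect of a penalty term can be seen on a toy problem. The sketch below is not the GRNmap error function, which does not appear in this transcript. It assumes a squared-magnitude penalty weighted by a factor called alpha here, and a made-up two-parameter model in which the second parameter has no effect on the fit, so the unpenalized error has no clear minimum in that direction.

```python
def predict(theta, xs):
    """Toy model (assumption): only theta[0] affects the prediction."""
    return [theta[0] * x for x in xs]


def penalized_error(theta, xs, observed, alpha):
    """Mean squared residual plus alpha times the squared magnitude of theta."""
    predicted = predict(theta, xs)
    lse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    penalty = alpha * sum(t ** 2 for t in theta)
    return lse + penalty


# Hypothetical data that the toy model fits exactly when theta[0] == 2.
xs = [1.0, 2.0, 3.0]
observed = [2.0, 4.0, 6.0]

theta_small = [2.0, 0.0]
theta_large = [2.0, 5.0]

# Without a penalty, both parameter sets fit equally well: the error is flat in theta[1].
flat_small = penalized_error(theta_small, xs, observed, alpha=0.0)
flat_large = penalized_error(theta_large, xs, observed, alpha=0.0)

# With a penalty, the smaller parameter set is preferred, so a minimum exists.
pen_small = penalized_error(theta_small, xs, observed, alpha=0.01)
pen_large = penalized_error(theta_large, xs, observed, alpha=0.01)
print(flat_small, flat_large, pen_small, pen_large)
```

A larger alpha pulls the parameters harder toward zero at the cost of a worse fit, which is the trade-off that an L-curve plots.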

Forward Simulation of the Model Fits the Microarray Data

Systems Biology Workflow
DNA microarray data (wet lab-generated or published) → Statistical analysis, clustering, Gene Ontology, term enrichment → Generate gene regulatory network → Modeling dynamics of the network → Visualizing the results (tool: GRNsight) → New experimental questions

GRNsight Rapidly Generates GRN graphs Using Our Customizations to the Open Source D3 Library
• Adobe Illustrator: several hours to create
• GRNsight: 10 milliseconds to generate, 5 minutes to arrange
• GRNsight: colored edges for weights reveal patterns in data

The First Round of Modeling Has Suggested Future Experiments

Systems Biology Workflow
DNA microarray data (wet lab-generated or published) → Statistical analysis, clustering, Gene Ontology, term enrichment → Generate gene regulatory network → Modeling dynamics of the network → Visualizing the results → New experimental questions
http://www.openwetware.org/wiki/Dahlquist:BioQUEST_Summer_Workshop_2015

95% of Bioinformatics is Getting Your Data into the Correct File Format
• Exposes deficiencies in computer literacy skills in so-called “digital natives”
• When you leave your comfort zone, it is, by definition, uncomfortable
• Emphasis on research process
  – Teamwork
  – Electronic lab notebook
  – Keeping track of files and code
  – Trouble-shooting problems that arise in the research process: bugs, data issues, etc.

Summary
• An open science ecosystem enhances student learning
• Quick example: XMLPipeDB project in a Biological Databases course
• Longer example: GRNmap project in Biomathematical Modeling course
• Potential research projects for BioQUEST participants
• Challenges are also opportunities
  – Computer literacy
  – Data literacy
  – Information literacy

Acknowledgments
Ben G. Fitzpatrick, LMU Math
John David N. Dionisio, LMU Computer Science
Special thanks to John Jungck & Sam Donovan
Juan Carrillo, Natalie Williams, K. Grace Johnson, Kevin Wyllie, Kevin McGee
Monica Hong, Nicole Anguiano, Anindita Varshneya, Trixie Roque, (Tessa Morris)