Bisulfite-Sequencing Theory and Quality Control felix.krueger@babraham.ac.uk v2021-04
a.m.
- Bisulfite-Seq theory and Quality Control
- coffee
- Mapping and QC practical
- Visualising and Exploring talk
Lunch
p.m.
- Visualising and Exploring practical
- coffee
- Differential methylation talk & practical
Epigenetics
Studies changes in gene expression which are not encoded by the underlying DNA sequence
Chromatin
• histone modification
• non-coding RNAs
• higher order structure (accessibility/compaction)
DNA cytosine methylation
From The Cell Biology of Stem Cells (2010)
Types of DNA methylation
CG context: present in mammals; present in plants
non-CG context (e.g. CHH, where H = T/A/C): (mostly) absent in mammals; present in plants
DNA methylation is stably maintained from W. Reik & J. Walter, Nat. Rev. Genet. 2001
Regulation by DNA methylation
- Silencing of gene expression
- Tissue differentiation and embryonic development
- Repeat activity
- Genomic stability
Faults in correct DNA methylation may result in
- early development failure
- epigenetic syndromes
- cancer
Imprinted Genes: mono-allelic expression. Differential allelic DNA methylation at a CGI (CpG island); legend: methylated CpG, unmethylated CpG. Imprinted genes show mono-allelic expression with parent-of-origin specificity and have key roles in energy metabolism and placenta functions.
DNA Methylation: Cytosine -> 5-methylcytosine by DNA methyl-transferases; the reverse by DNA-demethylase(s)? TETs? Passive demethylation?
Other cytosine modifications. Miguel R. Branco, Gabriella Ficz & Wolf Reik, Nature Reviews Genetics 13, 7-13 (January 2012)
Measuring DNA methylation by Bisulfite-sequencing Image by Illumina
Bisulfite Informatics
Original (two methylated cytosines, marked "me" on the slide): CCAGTCGCTATAGCGCGATATCGTA
Convert: TTAGTTGCTATAGTGCGATATTGTA
Map:
   TTAGTTGCTATAGTGCGATATTGTA
   |||||||||||||
...CCAGTCGCTATAGCGCGATATCGTA...
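The Convert step can be mimicked in silico. The following is a minimal sketch, not part of any tool: `bisulfite_convert` is a hypothetical helper, and the methylated positions (0-based 7 and 15) are inferred by comparing the original and converted sequences on the slide.

```python
def bisulfite_convert(seq, methylated=frozenset()):
    """Convert every unmethylated C to T; methylated Cs are protected.

    `methylated` holds 0-based positions of methylated cytosines
    (an illustrative input format, assumed for this sketch).
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


genomic = "CCAGTCGCTATAGCGCGATATCGTA"
converted = bisulfite_convert(genomic, methylated={7, 15})
print(converted)
```

Only the top strand is modelled here; the bottom strand would be converted independently in the same way.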
BS-Seq Analysis Workflow Explore and understand your data Sequencing Processing pipeline Methylation Analysis
Bisulfite conversion of a genomic locus
Genomic locus (mC and hmC positions are marked on the slide):
Top strand     >>CCGGCATGTTTAAACGCT>>
Bottom strand  <<GGCCGTACAAATTTGCGA<<
Bisulfite conversion:
               >>UCGGUATGTTTAAACGUT>>
               <<GGUCGTACAAATTTGCGA<<
PCR amplification:
OT    >>TCGGTATGTTTAAACGTT>>
CTOT  <<AGCCATACAAATTTGCAA<<
CTOB  >>CCAGCATGTTTAAACGCT>>
OB    <<GGTCGTACAAATTTGCGA<<
- 2 different PCR products and 4 possible different sequence strands from one genomic locus
- each of these 4 sequence strands can theoretically exist in any possible conversion state
3-letter alignment of Bisulfite-Seq reads (Bismark)
sequence of interest: TTGGCATGTTTAAACGTT
(1) bisulfite convert read (treat sequence as both forward and reverse strand)
    C -> T: 5’…TTGGTATGTTTAAATGTT…3’
    G -> A: 5’…TTAACATATTTAAACATT…3’
(2) align to bisulfite converted genomes (alignments 1-4)
    forward strand C -> T converted genome (alignments 1 and 2):
    …TTGGTATGTTTAAATGTT…
    …AACCATACAAATTTACAA…
    forward strand G -> A converted genome, equals reverse strand C -> T conversion (alignments 3 and 4):
    …CCAACATATTTAAACACT…
    …GGTTGTATAAATTTGTGA…
(3) read all 4 alignment outputs and extract the unmodified genomic sequence if the sequence could be mapped uniquely
    5’…CCGGCATGTTTAAACGCT…3’
(4) methylation call
    read sequence     TTGGCATGTTTAAACGTTA
    genomic sequence  CCGGCATGTTTAAACGCTA
    methylation call  xz..H.........Z.h..
    h unmethylated C in CHH context    H methylated C in CHH context
    x unmethylated C in CHG context    X methylated C in CHG context
    z unmethylated C in CpG context    Z methylated C in CpG context
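The methylation call in the last step can be sketched as a base-by-base comparison of the read with the unconverted genomic sequence. This is a simplified illustration for a read aligned to the top strand only, not Bismark's implementation; `methylation_call` is a hypothetical name, and the `u`/`U` symbol for positions whose context cannot be determined is an assumption of this sketch.

```python
def methylation_call(read, genome):
    """Per-base methylation call for a read aligned to the top strand.

    `genome` is the unconverted genomic sequence at the mapped position;
    it may extend beyond the read to give context for the last bases.
    """
    calls = []
    for i, (r, g) in enumerate(zip(read, genome)):
        if g != "C" or r not in "CT":
            calls.append(".")            # not a cytosine position (or a mismatch)
            continue
        downstream = genome[i + 1:i + 3]  # up to two downstream genomic bases
        if downstream[:1] == "G":
            symbol = "z"                 # CpG context
        elif len(downstream) == 2 and downstream[1] == "G":
            symbol = "x"                 # CHG context
        elif len(downstream) == 2:
            symbol = "h"                 # CHH context
        else:
            symbol = "u"                 # context unknown (assumed symbol)
        # C retained in the read = methylated (upper case); T = unmethylated
        calls.append(symbol.upper() if r == "C" else symbol)
    return "".join(calls)


print(methylation_call("TTGGCATGTTTAAACGTTA", "CCGGCATGTTTAAACGCTA"))
```

Reads from the bottom strand would need the same logic applied to G/A differences, with the context read from the opposite strand.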
Common sequencing protocols
Genomic locus (mC and hmC positions are marked on the slide):
Top strand     >>CCGGCATGTTTAAACGCT>>
Bottom strand  <<GGCCGTACAAATTTGCGA<<
After bisulfite conversion:
               >>UCGGUATGTTTAAACGUT>>
               <<GGUCGTACAAATTTGCGA<<
1) Directional libraries (vast majority of kits, also EpiGnome/Truseq)
OT    >>TCGGTATGTTTAAACGTT>>
OB    <<GGTCGTACAAATTTGCGA<<
2) PBAT libraries
CTOT  <<AGCCATACAAATTTGCAA<<
CTOB  >>CCAGCATGTTTAAACGCT>>
3) Non-directional libraries (e.g. single-cell BS-Seq, Zymo Pico Methyl-Seq)
OT    >>TCGGTATGTTTAAACGTT>>
CTOT  <<AGCCATACAAATTTGCAA<<
CTOB  >>CCAGCATGTTTAAACGCT>>
OB    <<GGTCGTACAAATTTGCGA<<
Validation
BS-Seq Analysis Workflow QC Trimming Mapping Analysis Mapped QC Methylation extraction
Raw Sequence Data (FastQ file) ... up to 1,000,000 lines per lane
Part I: Initial QC
What does QC tell you about your library?
• # of sequences
• Basecall qualities
• Base composition
• Potential contaminants
• Expected duplication rate
QC Raw data: Sequence Quality. Error rate: 0.1%, 1%, 10%
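The three error rates correspond to quality scores of 30, 20 and 10, assuming the standard Phred scale (Q = -10 log10 P). A small sketch of the conversion, with hypothetical function names:

```python
import math


def phred_to_error(q):
    """Probability that a basecall with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (30, 20, 10):
    print(f"Q{q}: {phred_to_error(q):.1%} error rate")
```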
QC: Base Composition WGSBS RRBS
QC: Duplication rate
QC: Overrepresented sequences
Common problems in BS-Seq. Not observed in ‘normal’ libraries, e.g. ChIP or RNA-Seq
Removing poor quality basecalls
Removing adapter contamination
Adapter trimming (Illumina adapter: AGATCGGAAGAGC); B = before, A = after trimming
no match
B: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGGAT
A: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGGAT
full match
B: AGATCTTTTATTCGGTAGGATAGATCGGAAGAGCXXXXXXXX
A: AGATCTTTTATTCGGTAGGAT
partial match
B: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGATC
A: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGG
B: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGAGA
A: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAG
B: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGGAG
A: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGG
B: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGGAA
A: AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGGA
Summary Adapter/Quality Trimming
Important to trim because failure to do so might result in:
- Low mapping efficiency
- Mis-alignments
- Errors in methylation calls since adapters are methylated
- Basecall errors tend toward 50% (C:mC)
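The full and partial matches shown on the adapter trimming slide can be reproduced with a short sketch. It assumes exact matching and a minimum overlap of one base; real trimmers such as Cutadapt also tolerate mismatches, which is not modelled here. `trim_adapter` is a hypothetical helper.

```python
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read, adapter=ADAPTER, min_overlap=1):
    """Remove a 3' adapter: a full copy anywhere, or a partial copy at the read end."""
    # Full match: cut at the first complete adapter occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Partial match: longest adapter prefix that equals the read's suffix.
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


for before in (
    "AGATCTTTTATTCGGTAGGATAGATCGGAAGAGCXXXXXXXX",
    "AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGATC",
    "AGATCTTTTATTCGGTAGGATTAGCGGTAGTTATTTTGGAGGAA",
):
    print(trim_adapter(before))
```

With an overlap of one base, any read ending in A loses that base, which is why a single-base match is trimmed in the last slide example.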
Part II: Sequence alignment – Bismark primary alignment output (BAM file)
Annotated fields: chromosome, position, sequence, quality, methylation call (XM:Z tag)
Read 1
HISEQ2000-06:366:C3G4NACXX:3:1101:1316:2067_1:N:0: 99 16 71322125 255 100M = 71322232 207
NTTATTTAGTTTTTTAGGGTTTGTGTGTAGGAGTGTGGGAATTATGTTTTTTATGGTTGATATTTAAAAGTGAGTATAAATTATATTTTTTT
#1=DDDDDAAFFHIIIA:<FGHCCEFGHD?CFFBBBGEHHGHIII<FEHIIIII==DE??EHHFHEEEEEEEC>;>66;@CDEEEDCEEEEEEEDDDCBB
NM:i:14 XX:Z:G8C2C7C21C13C6CC1C17CC3C4CC4 XM:Z:..h....x.....h...x...hh.h....hh.. XR:Z:CT XG:Z:CT XA:Z:1
Read 2
HISEQ2000-06:366:C3G4NACXX:3:1101:1316:2067_1:N:0: 147 16 71322232 255 100M = 71322125 -207
GGTTATTTAGGGTTATTGTTTTAGAGTTTTATTGTTGTGAACAGATATATGATTAAGGTAATTTTTATAAGGATAATATTTAATTGGAGTTGGTT
CCCEEECADCFFFFHHGHGHIIGIHFIJJIJIHFGHGGGEHIJIIJGIGFJJJJJIGJJJJIIIJJIJIJJJJJJIJHHHHHFFFFFCCC
NM:i:21 XX:Z:2G2CC1C11C11C2C10C1C4C2C1C3C5C2C12C3C1 XM:Z:...hh.h.h.x...h...x....X...h.h...x...h. XR:Z:GA XG:Z:CT XB:Z:1
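The Bismark-specific information sits in the optional tags at the end of each alignment line (XM, XR and XG appear in the record above). A minimal sketch for pulling them out of a tab-separated SAM line follows; `parse_bismark_record` is a hypothetical helper, and the example record is a shortened, made-up line rather than real data. For BAM files a library such as pysam would normally do the parsing.

```python
def parse_bismark_record(line):
    """Split one SAM line and collect the optional tags into a dict."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for tag in fields[11:]:                  # optional fields start at column 12
        name, _type, value = tag.split(":", 2)
        tags[name] = value
    return {
        "read_id": fields[0],
        "flag": int(fields[1]),
        "chrom": fields[2],
        "pos": int(fields[3]),
        "seq": fields[9],
        "methylation_call": tags.get("XM"),  # per-base call string
        "read_conversion": tags.get("XR"),   # CT or GA
        "genome_conversion": tags.get("XG"), # CT or GA
    }


# Shortened illustrative record (not real data)
example = "\t".join([
    "read_1", "99", "16", "71322125", "255", "10M", "=", "71322232", "207",
    "NTTATTTAGT", "#1=DDDDDAA", "NM:i:2", "XM:Z:..h....x..", "XR:Z:CT", "XG:Z:CT",
])
record = parse_bismark_record(example)
print(record["chrom"], record["pos"], record["methylation_call"])
```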
Sequence duplication
Complex/diverse library, percent methylation: 55, 17, 100, 100, 71, 100
Duplicated library, percent methylation: 33, 50, 100, 100, 50, 100
-> deduplication
Deduplication - considerations
Advisable for large genomes and moderate coverage
- unlikely to sequence several genuine copies of the same fragment amongst >5 bn possible fragments with different start sites
- maximum coverage with duplication may still be (read length)-fold (even more with paired-end reads)
NOT advisable for RRBS or other target enrichment methods where higher coverage is either desired or expected (RRBS fragments start at CCGG sites)
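The idea of deduplication by start site can be sketched as follows. This is a simplified illustration, not Bismark's deduplication code: it assumes a duplicate is any alignment with the same chromosome, strand, start and (for paired-end data) end position, and that alignments are given as plain dictionaries.

```python
def deduplicate(alignments):
    """Keep the first alignment seen for each (chrom, strand, start, end) key."""
    seen = set()
    unique = []
    for aln in alignments:
        # `end` is optional: single-end data is keyed on the start position only
        key = (aln["chrom"], aln["strand"], aln["start"], aln.get("end"))
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


alignments = [
    {"chrom": "16", "strand": "+", "start": 100, "end": 307},
    {"chrom": "16", "strand": "+", "start": 100, "end": 307},  # duplicate
    {"chrom": "16", "strand": "+", "start": 100, "end": 350},  # different fragment
    {"chrom": "16", "strand": "-", "start": 100, "end": 307},  # other strand
]
print(len(deduplicate(alignments)))
```

In RRBS every fragment starts at a restriction site, so this key would collapse genuinely independent fragments, which is why the slide advises against it there.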
Methylation extraction
[Figure: methylation call strings of Read 1 and Read 2 of a read pair; calls in the overlapping part of the two reads are redundant methylation calls]
CpG methylation output columns: read ID, meth state, chr, pos, context
Methylation extraction I
CpG methylation output -> bismark2bedGraph -> bedGraph/coverage output
coverage output columns: chr, pos, methylation percentage, meth, unmeth
Methylation extraction II
coverage output -> coverage2cytosine -> genome-wide CpG report (optional: merge into CpG dinucleotide entities)
genome-wide CpG report columns: chr, pos, strand, meth, unmeth, di-nuc, tri-nuc
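The step from per-read calls to the coverage output is a count per position. The sketch below assumes each call is a `(chr, pos, state)` tuple with `+` for methylated and `-` for unmethylated; this input format and the function name are illustrative, and the output follows the column order named on the slide.

```python
from collections import defaultdict


def coverage_from_calls(calls):
    """Aggregate per-read calls into (chr, pos, percent, meth, unmeth) rows."""
    counts = defaultdict(lambda: [0, 0])          # (chr, pos) -> [meth, unmeth]
    for chrom, pos, state in calls:
        counts[(chrom, pos)][0 if state == "+" else 1] += 1
    rows = []
    for (chrom, pos), (meth, unmeth) in sorted(counts.items()):
        percent = 100.0 * meth / (meth + unmeth)
        rows.append((chrom, pos, percent, meth, unmeth))
    return rows


calls = [("1", 100, "+"), ("1", 100, "-"), ("1", 100, "+"), ("1", 200, "-")]
for row in coverage_from_calls(calls):
    print(*row, sep="\t")
```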
Part III: Mapped QC. Methylation bias: good opportunity to look at conversion efficiency
Artificial methylation calls in paired-end libraries (end repair + A-tailing)
5’-GGGNNNNNNNNNNNNNNCCCA-3’
3’-ACCCNNNNNNNNNNNNNNGGG-5’
Specialist applications: WGBS, (e)RRBS, NOMe-seq, single-cell, target enrichment, NMT-seq, PBAT, amplicon, non-directional + different library kit protocols
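A methylation bias (M-bias) profile is the methylation percentage at each read position, pooled over all reads. A minimal sketch computed from methylation call strings follows; it assumes the strings are already in sequencing (5' to 3' read) orientation, and `m_bias` is a hypothetical helper rather than the Bismark implementation.

```python
def m_bias(call_strings, meth="Z", unmeth="z"):
    """Percent methylation per read position for one context (default: CpG).

    Returns one value per position, or None where no call was observed.
    """
    length = max((len(s) for s in call_strings), default=0)
    meth_counts = [0] * length
    unmeth_counts = [0] * length
    for calls in call_strings:
        for i, symbol in enumerate(calls):
            if symbol == meth:
                meth_counts[i] += 1
            elif symbol == unmeth:
                unmeth_counts[i] += 1
    profile = []
    for m, u in zip(meth_counts, unmeth_counts):
        profile.append(100.0 * m / (m + u) if m + u else None)
    return profile


print(m_bias(["Z.z", "z.Z", "Z.."]))
```

Calling it with `meth="H", unmeth="h"` gives the CHH profile, whose level can serve as a rough proxy for incomplete conversion in organisms with little non-CG methylation.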
Reduced representation BS-Seq (RRBS) Sequence composition bias High duplication rate
Artificial methylation calls in RRBS libraries [Figure: genomic cytosines vs. unmethylated cytosines] Use appropriate trimming (trim_galore --rrbs)
Accel Swift kit [Figure: Read 1, Read 2] Use appropriate trimming (trim_galore --clip_r1 10 --clip_r2 15)
Post-bisulfite adapter tagging (PBAT) [Figure: WGBS vs. PBAT] PBAT is suitable for low input material
PBAT-Seq trim off first few basepairs before alignment
Bismark User Guide https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html
https://github.com/FelixKrueger/Bismark/tree/master/Docs#viii-notes-about-different-library-types-and-commercial-kits
https://sequencing.qcfail.com/
Bismark workflow
Pre Alignment
- FastQC: Initial quality control
- Trim Galore: Adapter/quality trimming using Cutadapt; handles RRBS and paired-end reads; Trim Galore and RRBS User guide
Alignment
- Bismark: Output BAM
Post Alignment
- Deduplication (optional)
- Methylation extractor: Output individual cytosine methylation calls; optionally bedGraph or genome-wide cytosine report; M-bias analysis
- bismark2report: Graphical HTML report generation. Example: http://www.bioinformatics.babraham.ac.uk/projects/bismark/PE_report.html
protocol: Quality Control, trimming and alignment of Bisulfite-Seq data
Bismark workflow using a workflow manager (nf_bisulfite_WGBS): FastQC -> Trim Galore -> FastQ Screen --bisulfite -> FastQC -> Bismark -> dedup -> Bismark methXtract -> Bismark reports -> Bismark summary -> MultiQC
Useful links
• FastQC www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Trim Galore https://github.com/FelixKrueger/TrimGalore
• Cutadapt https://code.google.com/p/cutadapt/
• Bismark https://github.com/FelixKrueger/Bismark
• Bowtie 2 http://bowtie-bio.sourceforge.net/bowtie2/
• SeqMonk www.bioinformatics.babraham.ac.uk/projects/seqmonk/
• Cluster Flow www.bioinformatics.babraham.ac.uk/projects/clusterflow/
• https://sequencing.qcfail.com/
Babraham Bioinformatics tools (https://www.bioinformatics.babraham.ac.uk)
• Sierra: A web-based LIMS system for small sequencing facilities
• SeqMonk: Genome browser, quantitation and data analysis
• reStrainingOrder
• Trim Galore!: Quality and adapter trimming for (RRBS) sequencing libraries
• FastQ Screen: organism and contamination detection
• Bismark: Bisulfite-sequencing alignments and methylation calls
• Hi-C mapping
• ASAP: Allele-specific alignments
• FastQC: quality control for high throughput sequencing
DNA methylation is reset during reprogramming
Validation
Fragment size distribution in RRBS: identical (redundant) methylation calls