Read Processing and Mapping From Raw to Analysisready

From Raw to Analysis-ready Reads Raw reads Read assessment and prep Mapping Duplicate Marking

Sample to Raw Reads Sample Preparation 3 Library Construction QC and Quantification Sequencing Raw

Instrument Output Illumina Mi. Seq Illumina Hi. Seq Ion PGM Ion Proton Pacific Biosciences

Solid Phase Amplification V 3 Hi. Sequencing Steps • Clusters are linearized • Sequencing

FASTQ Format (Illumina Example) Read Record Header Separator (with optional repeated header) Lane Flow

Base Call Quality: Phred Quality Scores Phred* quality score Q with base-calling error probability

Initial Read Assessment and Processing Raw reads Read assessment and prep Mapping Duplicate Marking

Quick Read Assessment: Fast. QC FREE DOWNLOAD § Download: http: //www. bioinformatics. babraham. ac.

Read Assessment Example (Cont’d) Trim for base quality or adapters (run or library issue)

Read Assessment Example (Cont’d) Tru. Seq Adapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

Comprehensive Read Assessment: Prinseq http: //prinseq. sourceforge. net/ 13

Selected Tools to Process Reads Fastx toolkit* http: //hannonlab. cshl. edu/fastx_toolkit/ (partial list) FASTQ

Many Analysis Pipelines Start with Read Mapping http: //www. broadinstitute. org/gatk/guide/topic? name=best-practices 15 http:

Read Mapping Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base

Sequence References and Annotations § § § http: //www. ncbi. nlm. nih. gov/projects/genome/assembly/grc/data. shtml

Fasta Sequence Format • • • One or more sequences per file “>” denotes

Read Mapping Novoalign (3. 0) SOAP 3 (version 91) BWA (0. 7. 4) Bowtie

Read Mapping: BWA Features • Uses Burrows Wheeler Transform — fast — modest memory

Read Mapping: Bowtie 2 BOWTIE 2 § Uses dynamic programming (edit distance scoring) o

SAM (BAM) Format SEQUENCE ALIGNMENT/MAP FORMAT § Universal standard § Human-readable (SAM) and compact

SAM/BAM Format: Header [benpass align_genotype]$ samtools view -H all. Y. recalibrated. merge. bam samtools

SAM/BAM Format: Alignment Records [benpass align_genotype]$ samtools view all. Y. recalibrated. merge. bam 2

Preparing for Next Steps Raw reads Read assessment and prep Mapping Local realignment PICARD

Duplicate Marking Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base

SAM/BAM Format: Alignment Records [benpass align_genotype]$ samtools view all. Y. recalibrated. merge. bam HW-ST

Local Realignment Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base

Local Realignment Raw BWA alignment De. Pristo MA, et al. A framework for variation

Base Quality Recalibration Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment

Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration

Is there an easier way to get started? ! http: //galaxyproject. org/ Click on

Slides: 35

Download presentation

Read Processing and Mapping: From Raw to Analysis-ready Reads BEN PASSARELLI STEM CELL INSTITUTE GENOME CENTER NGS WORKSHOP 31 MAY 2013

From Raw to Analysis-ready Reads Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration Analysis-ready reads 2 Session Topics • • • Overview of high-throughput sequencing platforms Understand read data formats and quality scores Identify and fix some common read data problems Find a genomic reference for mapping Mapping reads to a reference genome Understand alignment output Sort, merge, index alignment for further analysis Mark/eliminate duplicate reads Locally realign at indels Recalibrate base quality scores How to get started

Sample to Raw Reads Sample Preparation 3 Library Construction QC and Quantification Sequencing Raw Reads

Instrument Output Illumina Mi. Seq Illumina Hi. Seq Ion PGM Ion Proton Pacific Biosciences RS Images (. tiff) Movie Standard flowgram file (. sff) Cluster intensity file (. cif) Trace (. trc. h 5) Base call file (. bcl) Pulse (. pls. h 5) Base (. bas. h 5) Sequence Data (FASTQ Format) 4

Sequencing Platforms at a Glance

Solid Phase Amplification V 3 Hi. Sequencing Steps • Clusters are linearized • Sequencing primer annealed • All four d. NTPs added at each cycle Each with different **Fluorescent Tag** • Intensity of different tags base call • Error Profile: substitutions Library DNA binds to Oligos Immobilized on Glass Flowcell Surface 6

FASTQ Format (Illumina Example) Read Record Header Separator (with optional repeated header) Lane Flow Cell ID Tile Coordinates Barcode @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12432: 5554 1: N: 0: AGTCAA CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT + Read Bases BCCFFFDFHHHHHIJJIJJJJJJJJJJIJJJJ @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12454: 5610 1: N: 0: AG AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG Read Quality + Scores @@@DD? DDHFDFHEHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH 2 @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12438: 5704 1: N: 0: AG CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC + CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12340: 5711 1: N: 0: AG NOTE: for paired-end runs, there is a second file GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG with + one-to-one corresponding headers and reads CCCFFFFFHHHHHGGIJJJJJJIJJIJJJJJGIJJJHIIJJJ

Base Call Quality: Phred Quality Scores Phred* quality score Q with base-calling error probability P Q = -10 log 10 P * Name of first program to assign accurate base quality scores. From the Human Genome Project. Q score Probability of base error Base confidence Sangerencoded (Q Score + 33) ASCII character 10 0. 1 90% “+” 20 0. 01 99% “ 5” 30 0. 001 99. 9% “? ” 40 0. 0001 99. 99% “I” SSSSSSSSSSSSSSSSSSSSS. . . . . . IIIIIIIIIIIIIIIIIIIII. . . . . LLLLLLLLLLLLLLLLLLLLL. . . . !"#$%&'()*+, -. /0123456789: ; <=>? @ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | 33 59 64 73 104 126 S - Sanger Phred+33 I - Illumina 1. 3+ Phred+64 L - Illumina 1. 8+ Phred+33 range: 0 to 40 range: 0 to 41

Initial Read Assessment and Processing Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration COMMON PROBLEMS THAT CAN AFFECT ANALYSIS LOW CONFIDENCE BASE CALLS § typically toward ends of reads § criteria vary by application PRESENCE OF ADAPTER SEQUENCE IN READS § poor fragment size selection § protocol execution or artifacts OVER-ABUNDANT SEQUENCE DUPLICATES LIBRARY CONTAMINATION Analysis-ready reads

Quick Read Assessment: Fast. QC FREE DOWNLOAD § Download: http: //www. bioinformatics. babraham. ac. uk/projects/fastqc/ § Tutorial : http: //www. youtube. com/watch? v=bz 93 Re. Ov 87 Y SAMPLES READS (200 K DEFAULT): FAST, LOW RESOURCE USE

Read Assessment Example (Cont’d) Trim for base quality or adapters (run or library issue) Trim leading bases (library artifact)

Read Assessment Example (Cont’d) Tru. Seq Adapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

Comprehensive Read Assessment: Prinseq http: //prinseq. sourceforge. net/ 13

Selected Tools to Process Reads Fastx toolkit* http: //hannonlab. cshl. edu/fastx_toolkit/ (partial list) FASTQ Information: Chart Quality Statistics and Nucleotide Distribution FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise). FASTQ Clipper: Removing sequencing adapters FASTQ Quality Filter: Filters sequences based on quality FASTQ Quality Trimmer: Trims (cuts) sequences based on quality FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality *defaults to old Illumina fastq (ASCII offset 64). Use –Q 33 option. Sep. Prep https: //github. com/jstjohn/Seq. Prep Adapter trimming Merge overlapping paired-end read Biopython http: //biopython. org, http: //biopython. org/DIST/docs/tutorial/Tutorial. html (for python programmers) Especially useful for implementing custom/complex sequence analysis/manipulation Galaxy http: //galaxy. psu. edu Great for beginners: upload data, point and click Just about everything you’ll see in today’s presentations Solexa. QA 2 http: //solexaqa. sourceforge. net Dynamic trimming Length sorting (resembles read grouping of Prinseq)

Many Analysis Pipelines Start with Read Mapping http: //www. broadinstitute. org/gatk/guide/topic? name=best-practices 15 http: //www. nature. com/nprot/journal/v 7/n 3/full/nprot. 2012. 016. html

Read Mapping Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration Analysis-ready reads http: //www. broadinstitute. org/igv/

Sequence References and Annotations § § § http: //www. ncbi. nlm. nih. gov/projects/genome/assembly/grc/data. shtml http: //www. ncbi. nlm. nih. gov/guide/howto/dwn-genome Comprehensive reference information § § http: //hgdownload. cse. ucsc. edu/downloads. html Comprehensive reference, annotation, and translation information § § § ftp: //gsapubftp-anonymous@ftp. broadinstitute. org/bundle References and SNP information data by GATK Human only § § § http: //cufflinks. cbcb. umd. edu/igenomes. html Pre-indexed references and gene annotations for Tuxedo suite Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C. elegans, Yeast § http: //www. repeatmasker. org

Fasta Sequence Format • • • One or more sequences per file “>” denotes beginning of sequence or contig Subsequent lines up to the next “>” define sequence Lowercase base denotes repeat masked base Contig ID may have comments delimited by “|” >chr 1 … TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa taaccatatttgggatatactggattttaaaaatatatcactaatttcat … >chr 2 … >chr 3 …

Read Mapping Novoalign (3. 0) SOAP 3 (version 91) BWA (0. 7. 4) Bowtie 2 (2. 1. 0) Tophat 2 (2. 0. 8 b) STAR (2. 3. 0 e) License Commercial GPL v 3 Artistic GPL v 3 Mismatch allowed up to 8 up to 3 user specified. max is function of read length and error rate user specified uses Bowtie 2 user specified Alignments reported per read random/all/none user selected uses Bowtie 2 user selected Gapped alignment up to 7 bp 1 -3 bp gap yes yes splice junctions introns Pair-end reads yes yes yes Best alignment highest alignment score minimal number of mismatches highest alignment score uses Bowtie 2 highest alignment score Trim bases 3’ end 3’ and 5’ end uses Bowtie 2 3’ and 5’ end Comments At one time, best performance and alignment quality Element of Broad’s “best practices” genotyping workflow Smith-Waterman quality alignments, currently fastest Currently most popular RNA-seq aligner Very fast; uses memory to achieve performance

Read Mapping: BWA Features • Uses Burrows Wheeler Transform — fast — modest memory footprint (<4 GB) • Accurate • Tolerates base mismatches — increased sensitivity — reduces allele bias • Gapped alignment for both single- and paired-ended reads • Automatically adjusts parameters based on read lengths and error rates • Native BAM/SAM output (the de facto standard) • Large installed base, well-supported • Open-source (no charge)

Read Mapping: Bowtie 2 BOWTIE 2 § Uses dynamic programming (edit distance scoring) o Eliminates need for realignment around indels o Can be tuned for different sequencing technologies § Multi-seed search - adjustable sensitivity § Input read length limited only by available memory § Fasta or Fastq input § Caveats o Longer input reads require much more memory o Trade-off parallelism with memory requirement http: //bowtie-bio. sourceforge. net/bowtie 2 Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2, Nature Methods. 2012, 9: 357 -359 21 Dynamic Programming Illustration

SAM (BAM) Format SEQUENCE ALIGNMENT/MAP FORMAT § Universal standard § Human-readable (SAM) and compact (BAM) forms STRUCTURE § Header › version, sort order, reference sequences, read groups, program/processing history § Alignment records

SAM/BAM Format: Header [benpass align_genotype]$ samtools view -H all. Y. recalibrated. merge. bam samtools to view bam @HD VN: 1. 0 GO: none SO: coordinate header sort order @SQ SN: chr. M LN: 16571 @SQ SN: chr 1 LN: 249250621 @SQ SN: chr 2 LN: 243199373 reference sequence names @SQ SN: chr 3 LN: 198022430 with lengths … @SQ SN: chr 19 LN: 59128983 @SQ SN: chr 20 LN: 63025520 @SQ SN: chr 21 LN: 48129895 @SQ SN: chr 22 LN: 51304566 read groups with platform, @SQ SN: chr. X LN: 155270560 library and sample information @SQ SN: chr. Y LN: 59373566 … @RG ID: 86 -191 PL: ILLUMINA LB: IL 500 SM: 86 -191 -1 @RG ID: Bs. K 010 PL: ILLUMINA LB: IL 501 SM: Bs. K 010 -1 @RG ID: Bsk 136 PL: ILLUMINA LB: IL 502 SM: Bsk 136 -1 @RG ID: MAK 001 PL: ILLUMINA LB: IL 503 SM: MAK 001 -1 @RG ID: NG 87 PL: ILLUMINA LB: IL 504 SM: NG 87 -1 … program (analysis) history @RG ID: SDH 023 PL: ILLUMINA LB: IL 508 SM: SDH 023 @PG ID: GATK Indel. Realigner VN: 2. 0 -39 -gd 091 f 72 CL: known. Alleles=[] target. Intervals=tmp. intervals. li @PG ID: bwa PN: bwa VN: 0. 6. 2 -r 126

SAM/BAM Format: Alignment Records [benpass align_genotype]$ samtools view all. Y. recalibrated. merge. bam 2 3 4 5 6 8 9 HW-ST 605: 127: B 0568 ABXX: 2: 1201: 10933: 3739 147 chr 1 27675 60 101 M = 27588 -188 10 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC 11 =7; : ; <=? ? <=BCCEFFEJFCEGGEFFDF? BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB 8@ABCCCDCBDA@>: / RG: Z: 86 -191 1 HW-ST 605: 127: B 0568 ABXX: 3: 1104: 21059: 173553 83 chr 1 27682 60 101 M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8; 8. 7: : <? =BDHFHGFFDCGDAACCABHCCBDFBE</BA 4//BB@BCAA@CB@ABA>A? ? @B@BBACA>? ; A@8? ? CABBBA@AAAA? ? @BB 0 RG: Z: SDH 023 * Many fields after column 12 deleted (e. g. , recalibrated base scores) have been deleted for improved readability http: //samtools. sourceforge. net/SAM 1. pdf

Preparing for Next Steps Raw reads Read assessment and prep Mapping Local realignment PICARD TOOLS: FAST, PORTABLE, FREE § http: //picard. sourceforge. net/command-line-overview. shtml § Sort: Sort. Sam. jar § Merge: Merge. Sam. Files. jar § Index: Build. Bam. Index. jar Base quality recalibration ORDER: Duplicate Marking Analysis-ready reads 25 SUBSEQUENT STEPS REQUIRE SORTED AND INDEXED BAMS § Sort orders: karyotypic, lexicographical § Indexing improves analysis performance SORT, MERGE (OPTIONAL), INDEX

Duplicate Marking Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration Analysis-ready reads $java -Xmx 4 g -jar <path to picard>/Mark. Duplicates. jar INPUT=aligned. sorted. bam OUTPUT=aligned. sorted. dedup. bam VALIDATION_STRINGENCY=LENIENT METRICS_FILE=aligned. dedup. metrics. txt REMOVE_DUPLICATES=false ASSUME_SORTED=true http: //picard. sourceforge. net/command-line-overview. shtml#Mark. Duplicates

SAM/BAM Format: Alignment Records [benpass align_genotype]$ samtools view all. Y. recalibrated. merge. bam HW-ST 605: 127: B 0568 ABXX: 2: 1201: 10933: 3739 147 chr 1 27675 60 101 M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7; : ; <=? ? <=BCCEFFEJFCEGGEFFDF? BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB 8@ABCCCDCBDA@>: / RG: Z: 86 -191 http: //picard. sourceforge. net/explain-flags. html http: //samtools. sourceforge. net/SAM 1. pdf

Local Realignment Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration Analysis-ready reads BWT-BASED ALIGNMENT IS FAST FOR MATCHING READS TO REFERENCE INDIVIDUAL BASE ALIGNMENTS OFTEN SUB-OPTIMAL AT INDELS APPROACH § Fast read mapping with BWT-based aligner § Realign reads at indel sites using gold standard (but much slower) Smith-Waterman algorithm BENEFITS § Refines location of indels § Reduces erroneous SNP calls § Very high alignment accuracy in significantly less time, with fewer resources 1 Smith, Temple F. ; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195– 197. doi: 10. 1016/0022 -2836(81)90087 -5. PMID 7265238

Local Realignment Raw BWA alignment De. Pristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5): 491 -8. PMID: 21478889 Post re-alignment at indels

Base Quality Recalibration Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration Analysis-ready reads STEP 1: Find covariates at non-db. SNP sites using: Reported quality score The position within the read The preceding and current nucleotide (sequencer properties) java -Xmx 4 g -jar Genome. Analysis. TK. jar -T Base. Recalibrator -I alignment. bam -R hg 19/ucsc. hg 19. fasta -known. Sites hg 19/dbsnp_135. hg 19. vcf -o alignment. recal_data. grp STEP 2: Generate BAM with recalibrated base scores: java -Xmx 4 g -jar Genome. Analysis. TK. jar -T Print. Reads -R hg 19/ucsc. hg 19. fasta -I alignment. bam -BQSR alignment. recal_data. grp -o alignment. recalibrated. bam

Base Quality Recalibration (Cont’d)

Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration Analysis-ready reads 32

Is there an easier way to get started? ! http: //galaxyproject. org/ Click on “Use Galaxy”

Getting Started

Raw reads Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration Analysis-ready reads 35