Read Processing and Mapping From Raw to Analysisready

  • Slides: 36
Download presentation
Read Processing and Mapping: From Raw to Analysis-ready Reads Ben Passarelli Stem Cell Institute

Read Processing and Mapping: From Raw to Analysis-ready Reads Ben Passarelli Stem Cell Institute Genome Center NGS Workshop 12 September 2012

Samples to Information Click to edit Master title style Variant calling Gene expression Chromatin

Samples to Information Click to edit Master title style Variant calling Gene expression Chromatin structure Methylome Immunorepertoires De novo assembly …

Many Analysis Pipelines Start with Read Mapping Click to edit Master title style Genotyping

Many Analysis Pipelines Start with Read Mapping Click to edit Master title style Genotyping (GATK) http: //www. broadinstitute. org/gsa/wiki/images/7/7 a/Overall_flow. jpg http: //www. broadinstitute. org/gatk/guide/topic? name=intro RNA-seq (Tuxedo) http: //www. nature. com/nprot/journal/v 7/n 3/full/nprot. 2012. 016. html

From Raw to Analysis-ready Reads Click to edit Master title style Raw reads Read

From Raw to Analysis-ready Reads Click to edit Master title style Raw reads Read assessment and prep Mapping Local realignment Duplicate marking Base quality recalibration Session Topics • • • Understand read data formats and quality scores Identify and fix some common read data problems Find and prepare a genomic reference for mapping Map reads to a genome reference Understand alignment output Sort, merge, index alignment for further analysis Locally realign at indels to reduce alignment artifacts Mark/eliminate duplicate reads Recalibrate base quality scores • An easy way to get started Analysis-ready reads

Instrument Output Click to edit Master title style Illumina Mi. Seq Illumina Hi. Seq

Instrument Output Click to edit Master title style Illumina Mi. Seq Illumina Hi. Seq Images (. tiff) Cluster intensity file (. cif) Base call file (. bcl) Ion. Torrent PGM Roche 454 Standard flowgram file (. sff) Sequence Data (FASTQ Format) Pacific Biosciences RS Movie Trace (. trc. h 5) Pulse (. pls. h 5) Base (. bas. h 5)

FASTQ Format (Illumina Example) Click to edit Master title style Raw reads Read Record

FASTQ Format (Illumina Example) Click to edit Master title style Raw reads Read Record Header Read assessment and prep Mapping Separator (with optional repeated header) Local realignment Duplicate marking Base quality recalibration Analysis-ready reads Lane Flow Cell ID Tile Coordinates Barcode @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12432: 5554 1: N: 0: AGTCAA CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT + Read Bases BCCFFFDFHHHHHIJJIJJJJJJJJJJIJJJJ @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12454: 5610 1: N: 0: AG AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG Read Quality + Scores @@@DD? DDHFDFHEHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH 2 @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12438: 5704 1: N: 0: AG CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC + CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ @DJG 84 KN 1: 272: D 17 DBACXX: 2: 1101: 12340: 5711 1: N: 0: AG GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG + CCCFFFFFHHHHHGGIJJJJJJIJJIJJJJJGIJJJHIIJJJ NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads

Base Call Quality: Phred Quality Scores Click to edit Master title style Phred* quality

Base Call Quality: Phred Quality Scores Click to edit Master title style Phred* quality score Q with base-calling error probability P Q = -10 log 10 P * Name of first program to assign accurate base quality scores. From the Human Genome Project. Q score Probability of base error Base confidence Sanger-encoded (Q Score + 33) ASCII character 10 0. 1 90% “+” 20 0. 01 99% “ 5” 30 0. 001 99. 9% “? ” 40 0. 0001 99. 99% “I” SSSSSSSSSSSSSSSSSSSSS. . . . . . IIIIIIIIIIIIIIIIIIIII. . . . . LLLLLLLLLLLLLLLLLLLLL. . . . !"#$%&'()*+, -. /0123456789: ; <=>? @ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | 33 59 64 73 104 126 S - Sanger Phred+33 I - Illumina 1. 3+ Phred+64 L - Illumina 1. 8+ Phred+33 range: 0 to 40 range: 0 to 41

File Organization Click to edit Master title style Raw reads Read assessment and prep

File Organization Click to edit Master title style Raw reads Read assessment and prep Mapping [benpass@solexalign]$ ls Sample_FS 53_EPCAM+_CD 10 -_IL 2270 -18 Sample_FS 53_EPCAM+_CD 10+_IL 2269 -19 Sample_COH 77_CD 49 F-_IL 2275 -13 Sample_COH 77_CD 49 F+_CD 66 -_IL 2274 -14 Sample_COH 77_CD 49 F+_CD 66+_IL 2273 -15 Sample_COH 74_EPCAM+_CD 10 -_IL 2272 -16 Sample_COH 74_EPCAM+_CD 10+_IL 2271 -17 Sample_COH 69_EPCAM+_CD 10 -_IL 2268 -20 Sample_COH 69_EPCAM+_CD 10+_IL 2267 -21 Local realignment Duplicate marking Base quality recalibration Analysis-ready reads [benpass@solexalign]$ ls Sample_COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 1_001. fastq. gz COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 1_002. fastq. gz COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 1_003. fastq. gz COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 1_004. fastq. gz COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 2_001. fastq. gz COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 2_002. fastq. gz COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 2_003. fastq. gz COH 77_CD 49 F-_IL 2275 -13_AGTCAA_L 002_R 2_004. fastq. gz Barcode Format Read gzip compressed

Initial Read Assessment Click to edit Master title style Raw reads Read assessment and

Initial Read Assessment Click to edit Master title style Raw reads Read assessment and prep Mapping Common problems that can affect analysis • Low confidence base calls – typically toward ends of reads – criteria vary by application • Presence of adapter sequence in reads Local realignment Duplicate marking Base quality recalibration Analysis-ready reads – poor fragment size selection – protocol execution or artifacts • Over-abundant sequence duplicates • Library contamination

Initial Read Assessment: Fast. QC Click to edit Master title style Raw reads Read

Initial Read Assessment: Fast. QC Click to edit Master title style Raw reads Read assessment and prep Mapping Local realignment Duplicate marking Base quality recalibration • Free Download Analysis-ready reads Download: http: //www. bioinformatics. babraham. ac. uk/projects/fastqc/ Tutorial : http: //www. youtube. com/watch? v=bz 93 Re. Ov 87 Y • Samples reads (200 K default): fast, low resource use

Read Examples Click to Assessment edit Master title style Sanger Quality Score by Cycle

Read Examples Click to Assessment edit Master title style Sanger Quality Score by Cycle Median, Inner Quartile Range, 10 -90 percentile range, Mean Read Duplication ~8% of sampled sequences occur twice ~71. 48% of sequences are duplicates ~6% of sequences occur more than 10 x http: //proteo. me. uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc Note: Duplication based on read identity, not alignment at this point

Read Example Click to. Assessment edit Master title style (Cont’d) Per base sequence content

Read Example Click to. Assessment edit Master title style (Cont’d) Per base sequence content should resemble this…

Read Example Click to. Assessment edit Master title style (Cont’d)

Read Example Click to. Assessment edit Master title style (Cont’d)

Read Example Click to. Assessment edit Master title style (Cont’d) Tru. Seq Adapter, Index

Read Example Click to. Assessment edit Master title style (Cont’d) Tru. Seq Adapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

Read Assessment Example (Cont’d) Click to edit Master title style Trim for base quality

Read Assessment Example (Cont’d) Click to edit Master title style Trim for base quality or adapters (run or library issue) Trim leading bases (library artifact)

Selected Tools to Process Reads Click to edit Master title style Fastx toolkit* http:

Selected Tools to Process Reads Click to edit Master title style Fastx toolkit* http: //hannonlab. cshl. edu/fastx_toolkit/ (partial list) FASTQ Information: Chart Quality Statistics and Nucleotide Distribution FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise). FASTQ Clipper: Removing sequencing adapters FASTQ Quality Filter: Filters sequences based on quality FASTQ Quality Trimmer: Trims (cuts) sequences based on quality FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality *defaults to old Illumina fastq (ASCII offset 64). Use –Q 33 option. Sep. Prep https: //github. com/jstjohn/Seq. Prep Adapter trimming Merge overlapping paired-end read Biopython http: //biopython. org, http: //biopython. org/DIST/docs/tutorial/Tutorial. html (for python programmers) Especially useful for implementing custom/complex sequence analysis/manipulation Galaxy http: //galaxy. psu. edu Great for beginners: upload data, point and click Just about everything you’ll see in today’s presentations

Read Mapping Click to edit Master title style Raw reads Read assessment and prep

Read Mapping Click to edit Master title style Raw reads Read assessment and prep Mapping Local realignment Duplicate marking Base quality recalibration Analysis-ready reads http: //www. broadinstitute. org/igv/

Read Mapping: Aligning to a Reference Click to edit Master title style Raw reads

Read Mapping: Aligning to a Reference Click to edit Master title style Raw reads SOAP 2 (2. 20) Read assessment and prep Duplicate marking Base quality recalibration Analysis-ready reads Novoalign (2. 07. 00) License GPL v 3 LGPL v 3 Mismatch allowed exactly 0, 1, 2 0 -3 max in read user specified. up to 8 or more max is function of read length and error rate Alignments reported per read random/all/none user selected random/all/none Gapped alignment 1 -3 bp gap no yes up to 7 bp Pair-end reads yes yes Best alignment minimal number of mismatches highest alignment score Trim bases 3’ end 3’ and 5’ end 3’ end Mapping Local realignment BWA (0. 6. 2) Bowtie (0. 12. 8) Commercial

Read Mapping: BWA Click to edit Master title style Raw reads BWA Features Read

Read Mapping: BWA Click to edit Master title style Raw reads BWA Features Read assessment and prep Mapping Local realignment Duplicate marking Base quality recalibration Analysis-ready reads • Uses Burrows Wheeler Transform — fast — modest memory footprint (<4 GB) • Accurate • Tolerates base mismatches — increased sensitivity — reduces allele bias • Gapped alignment for both single- and paired-ended reads • Automatically adjusts parameters based on read lengths and error rates • Native BAM/SAM output (the de facto standard) • Large installed base, well-supported • Open-source (no charge)

Sequence References and Annotations Click to edit Master title style http: //www. ncbi. nlm.

Sequence References and Annotations Click to edit Master title style http: //www. ncbi. nlm. nih. gov/projects/genome/assembly/grc/data. shtml http: //www. ncbi. nlm. nih. gov/guide/howto/dwn-genome Comprehensive reference information http: //hgdownload. cse. ucsc. edu/downloads. html Comprehensive reference, annotation, and translation information ftp: //gsapubftp-anonymous@ftp. broadinstitute. org/bundle References and SNP information data by GATK Human only http: //cufflinks. cbcb. umd. edu/igenomes. html Pre-indexed references and gene annotations for Tuxedo suite Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C. elegans, Yeast http: //www. repeatmasker. org/

Fasta Sequence Format Click to edit Master title style • • • One or

Fasta Sequence Format Click to edit Master title style • • • One or more sequences per file “>” denotes beginning of sequence or contig Subsequent lines up to the next “>” define sequence Lowercase base denotes repeat masked base Contig ID may have comments delimited by “|” >chr 1 … TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa taaccatatttgggatatactggattttaaaaatatatcactaatttcat … >chr 2 … >chr 3 …

Running BWA Click to edit Master title style Input files: reference. fasta, read 1.

Running BWA Click to edit Master title style Input files: reference. fasta, read 1. fastq. gz, read 2. fastq. gz Step 1: Index the genome (~3 CPU hours for a human genome reference): bwa index -a bwtsw reference. fasta Step 2: Generate alignments in Burrows-Wheeler transform suffix array coordinates: bwa aln reference. fasta read 1. fastq. gz > read 1. sai bwa aln reference. fasta read 2. fastq. gz > read 2. sai Apply option –q<quality threshold> to trim poor quality bases at 3'-ends of reads Step 3: Generate alignments in the SAM format (paired-end): bwa sampe reference. fasta read 1. sai read 2. sai read 1. fastq. gz read 2. fastq. gz > alignment_ouput. sam http: //bio-bwa. sourceforge. net/bwa. shtml

Running BWA (Cont’d) Click to edit Master title style Simple Form: bwa sampe reference.

Running BWA (Cont’d) Click to edit Master title style Simple Form: bwa sampe reference. fasta read 1. sai read 2. sai read 1. fastq. gz read 2. fastq. gz > alignment. sam Output to BAM: bwa sampe reference. fasta read 1. sai read 2. sai read 1. fastq. gz read 2. fastq. gz | samtools view -Sbh - > alignment. bam With Read Group Information: bwa sampe -r "@RGt. ID: readgroup. IDt. LB: librarynamet. SM: samplenamet. PL: ILLUMINA“ reference. fasta read 1. sai read 2. sai read 1. fastq. gz read 2. fastq. gz | samtools view -Sbh - > alignment. bam

SAM (BAM) Format Click to edit Master title style Sequence Alignment/Map format – Universal

SAM (BAM) Format Click to edit Master title style Sequence Alignment/Map format – Universal standard – Human-readable (SAM) and compact (BAM) forms Structure – Header version, sort order, reference sequences, read groups, program/processing history – Alignment records

SAM/BAM Format: Header Click to edit Master title style samtools to view bam [benpass

SAM/BAM Format: Header Click to edit Master title style samtools to view bam [benpass align_genotype]$ samtools view -H all. Y. recalibrated. merge. bam @HD VN: 1. 0 GO: none SO: coordinate header sort order @SQ SN: chr. M LN: 16571 @SQ SN: chr 1 LN: 249250621 @SQ SN: chr 2 LN: 243199373 @SQ SN: chr 3 LN: 198022430 reference sequence names … with lengths @SQ SN: chr 19 LN: 59128983 @SQ SN: chr 20 LN: 63025520 @SQ SN: chr 21 LN: 48129895 @SQ SN: chr 22 LN: 51304566 read groups with platform, @SQ SN: chr. X LN: 155270560 @SQ SN: chr. Y LN: 59373566 library and sample information … @RG ID: 86 -191 PL: ILLUMINA LB: IL 500 SM: 86 -191 -1 @RG ID: Bs. K 010 PL: ILLUMINA LB: IL 501 SM: Bs. K 010 -1 @RG ID: Bsk 136 PL: ILLUMINA LB: IL 502 SM: Bsk 136 -1 @RG ID: MAK 001 PL: ILLUMINA LB: IL 503 SM: MAK 001 -1 @RG ID: NG 87 PL: ILLUMINA LB: IL 504 SM: NG 87 -1 program (analysis) history … @RG ID: SDH 023 PL: ILLUMINA LB: IL 508 SM: SDH 023 @PG ID: GATK Indel. Realigner VN: 2. 0 -39 -gd 091 f 72 CL: known. Alleles=[] target. Intervals=tmp. intervals. list LODThreshold. For. Cleaning=5. 0 consensus. Determination. Model=USE_READS entropy. Threshold=0. 15 max. Reads. In. Memory=150000 max. Isize. For. Movement=3000 max. Positional. Move. Allowed=200 max. Consensuses=30 max. Reads. For. Consensuses=120 max. Reads. For. Realignment=20000 no. Original. Alignment. Tags=false n. Way. Out=null generate_n. Way. Out_md 5 s=false check_early=false no. PGTag=false keep. PGTags=false indels. File. For. Debugging=null statistics. File. For. Debugging=null SNPs. File. For. Debugging=null @PG ID: bwa PN: bwa VN: 0. 6. 2 -r 126

SAM/BAM Format: Alignment Records Click to edit Master title style [benpass align_genotype]$ samtools view

SAM/BAM Format: Alignment Records Click to edit Master title style [benpass align_genotype]$ samtools view all. Y. recalibrated. merge. bam 3 1 4 5 6 8 9 HW-ST 605: 127: B 0568 ABXX: 2: 1201: 10933: 3739 147 chr 1 27675 60 101 M = 27588 -188 10 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC 11 =7; : ; <=? ? <=BCCEFFEJFCEGGEFFDF? BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB 8@ABCCCDCBDA@>: / RG: Z: 86 -191 HW-ST 605: 127: B 0568 ABXX: 3: 1104: 21059: 173553 83 chr 1 27682 60 101 M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8; 8. 7: : <? =BDHFHGFFDCGDAACCABHCCBDFBE</BA 4//BB@BCAA@CB@ABA>A? ? @B@BBACA>? ; A@8? ? CABBBA@AAAA? ? @BB 0 RG: Z: SDH 023 * Many fields after column 12 deleted (e. g. , recalibrated base scores) have been deleted for improved readability http: //samtools. sourceforge. net/SAM 1. pdf

Preparing for Next Steps Click to edit Master title style Raw reads • Subsequent

Preparing for Next Steps Click to edit Master title style Raw reads • Subsequent steps require sorted and indexed bams – Sort orders: karyotypic, lexicographical – Indexing improves analysis performance Read assessment and prep Mapping • Picard tools: fast, portable, free http: //picard. sourceforge. net/command-line-overview. shtml Local realignment Sort: Merge: Index: Duplicate marking Base quality recalibration Analysis-ready reads • Sort. Sam. jar Merge. Sam. Files. jar Build. Bam. Index. jar Order: sort, merge (optional), index

Local Realignment Click to edit Master title style Raw reads Read assessment and prep

Local Realignment Click to edit Master title style Raw reads Read assessment and prep Mapping Local realignment Duplicate marking Base quality recalibration Analysis-ready reads • BWT-based alignment is fast for matching reads to reference • Individual base alignments often sub-optimal at indels • Approach – Fast read mapping with BWT-based aligner – Realign reads at indel sites using gold standard (but much slower) Smith-Waterman 1 algorithm • Benefits – Refines location of indels – Reduces erroneous SNP calls – Very high alignment accuracy in significantly less time, with fewer resources 1 Smith, Temple F. ; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195– 197. doi: 10. 1016/0022 -2836(81)90087 -5. PMID 7265238

Local Realignment Click to edit Master title style Raw BWA alignment De. Pristo MA,

Local Realignment Click to edit Master title style Raw BWA alignment De. Pristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5): 491 -8. PMID: 21478889 Post re-alignment at indels

Duplicate Marking Click to edit Master title style Raw reads Read assessment and prep

Duplicate Marking Click to edit Master title style Raw reads Read assessment and prep Mapping Local realignment Duplicate marking Base quality recalibration Analysis-ready reads • Covered in genotyping presentation • Note that this is done after alignment

Base Quality Recalibration Click to edit Master title style Raw reads Read assessment and

Base Quality Recalibration Click to edit Master title style Raw reads Read assessment and prep Mapping Local realignment Duplicate marking Base quality recalibration Analysis-ready reads STEP 1: Find covariates at non-db. SNP sites using: Reported quality score The position within the read The preceding and current nucleotide (sequencer properties) java -Xmx 4 g -jar Genome. Analysis. TK. jar -T Base. Recalibrator -I alignment. bam -R hg 19/ucsc. hg 19. fasta -known. Sites hg 19/dbsnp_135. hg 19. vcf -o alignment. recal_data. grp STEP 2: Generate BAM with recalibrated base scores: java -Xmx 4 g -jar Genome. Analysis. TK. jar -T Print. Reads -R hg 19/ucsc. hg 19. fasta -I alignment. bam -BQSR alignment. recal_data. grp -o alignment. recalibrated. bam

Base Quality Recalibration (Cont’d) Click to edit Master title style

Base Quality Recalibration (Cont’d) Click to edit Master title style

Getting Started Click to edit Master title style Is there an easier way to

Getting Started Click to edit Master title style Is there an easier way to get started? !!

Getting Started Click to edit Master title style http: //galaxy. psu. edu/ Click “Use

Getting Started Click to edit Master title style http: //galaxy. psu. edu/ Click “Use Galaxy”

Getting Started Click to edit Master title style http: //galaxy. psu. edu/ Click “Use

Getting Started Click to edit Master title style http: //galaxy. psu. edu/ Click “Use Galaxy”

Q&A Click to edit Master title style

Q&A Click to edit Master title style