Next Generation Sequence Alignment Variant Discovery on the

Overview � Practical guide to processing next generation sequencing data � No details on

Main Tools/Resources � � � � � BRC Cluster Software : http: //compbio. brc.

Analysis Pipeline �Convert Illumina Fastq to sanger Fastq �QC raw data �Mapping (BWA, QC-BWA,

Work flow Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data

What does the raw data look like? � Fastq Format : *_sequence. txt; ◦

FASTQ Format � Raw Data : - @315 ARAAXX 090414: 8: 1: 567: 552#0

Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA,

Quality Control & Pre-Alignment Processing. 1 � Convert Illumina Fastq to sanger Fastq $:

Quality Control & Pre-Alignment Processing. 2 � Fast. QC: Provides a simple way to

Reference genomes � Available genomes ◦ Homo_sapiens_assembly 18. fasta ◦ human_b 36_both. fasta ◦

BWA: Burrows-Wheeler Aligner ## Align with BWA $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $:

BWA : with pre-alignment QC. 1 � Fastx: http: //hannonlab. cshl. edu/fastx_toolkit ◦ QC

BWA : with pre-alignment QC. 2 � Compare QCd fastq files ◦ ◦ One

BWA : with pre-alignment QC. 3 � Align $: $: $: with BWA REF=/scratch/data/reference_genomes/human_g

BWA : with pre-alignment QC. 4 � Generate alignments in the SAM/BAM format $:

Novoalign : quality aware gapped aligner. 1 � Has options for adaptor stripping and

Novoalign : quality aware gapped aligner. 2 � Novoalign produces a name sorted SAM

Post Aligment processing of BAM files � Local realignment around Indels � Remove duplicate

Local realignment around Indels. 1 � Sequence aligners are unable to perfectly map reads

Local realignment around Indels. 2 Original BAM file Realigner. Target. Creator (GATK) for. Realigner.

Local realignment around Indels. 3 $: $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37.

Local realignment around Indels. 3 ## 3. Sort realigned BAM file using PICARD TOOLS

Remove duplicate reads � Examine aligned records in the supplied SAM or BAM file

Base Quality Score Recalibration. 1 � Correct for variation in quality with machine cycle

Base Quality Score Recalibration. 2 Co-ordinate sorted Realigned fixed BAM file Analyze. Covariates Pre-recalibration

Base Quality Score Recalibration. 3 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar

Base Quality Score Recalibration. 4 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar

Base Quality Score Recalibration. 5 � Generate the final analysis ready BAM file for

SNP & Indel calling with GATK Final Recalibrated BAM file Indel. Genotyper. V 2

Short Indels (GATK Indel. Genotyper. V 2) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis.

Creating an indel mask file � The output of the Indel. Genotyper is used

SNP Calling (GATK Unified. Genotyper) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar

SNP Filtering & annotation (GATK Variant. Filtration) � � Variant. Filtration is used annotate

SNP Filtering: VCFTOOLS � � VCFTOOLS: methods for working with VCF files: filtering, validating,

VCFTOOLS for SNP QC and statistics � VCFTOOLS can be used to generate useful

Picard Tools : Post Alignment Summary reports � See : http: //www. broadinstitute. org/files/shared/mpg/nextgen

Questions? � Email : stephen. newhouse@kcl. ac. uk � More useful links: ◦ http:

Slides: 44

Download presentation

Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster Steve Newhouse 28 Jan 2011

Overview � Practical guide to processing next generation sequencing data � No details on the inner workings of the software/code & theory � Based on the 1 KG pipeline from the Broad Institute using their Genome Analysis Tool Kit (GATK). � Focus on Illumina paired-end sequence data � Alignment with BWA or Novoalign � SNP & Indel calling with GATK � NB: This is one way processing the data that works well

Main Tools/Resources � � � � � BRC Cluster Software : http: //compbio. brc. iop. kcl. ac. uk/cluster/software. php Maq: http: //maq. sourceforge. net/ Fastqc : http: //www. bioinformatics. bbsrc. ac. uk/projects/fastqc/ Fastx: http: //hannonlab. cshl. edu/fastx_toolkit/ cmpfastq. pl : http: //compbio. brc. iop. kcl. ac. uk/software/cmpfastq. php BWA: http: //bio-bwa. sourceforge. net/bwa. shtml Novoalign: http: //www. novocraft. com Genome Analysis Toolkit: http: //www. broadinstitute. org/gsa/wiki/index. php/The_Genome_Analysis_Toolkit PICARD TOOLS: http: //picard. sourceforge. net/ SAMTOOLS: http: //samtools. sourceforge. net/ VCFTOOLS: http: //vcftools. sourceforge. net/ FASTQ Files : http: //en. wikipedia. org/wiki/FASTQ_format, SAM/BAM Format : http: //samtools. sourceforge. net/SAM 1. pdf PHRED Scores: http: //en. wikipedia. org/wiki/Phred_quality_score Next Generation Sequencing Library: http: //ngslib. genome. tugraz. at http: //seqanswers. com http: //www. broadinstitute. org/gsa/wiki/index. php/File: Ngs_tutorial_depristo_1210. pdf

Analysis Pipeline �Convert Illumina Fastq to sanger Fastq �QC raw data �Mapping (BWA, QC-BWA, Novoalign) �Convert Sequence Alignment/Map (SAM) to BAM �Local realignment around Indels �Remove duplicates �Base Quality Score Recalibration �Variant Discovery

Work flow Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

What does the raw data look like? � Fastq Format : *_sequence. txt; ◦ s_1_1_sequence. txt = lane 1, read 1 ◦ s_1_2_sequence. txt = lane 1, read 2 � Text file storing both nucleotide sequence and quality scores. � Both the sequence letter and quality score are encoded with a single ASCII character for brevity. � Standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer � http: //en. wikipedia. org/wiki/FASTQ_format

FASTQ Format � Raw Data : - @315 ARAAXX 090414: 8: 1: 567: 552#0 TGTTTCTTTAAAAAGGTAAGAATGTTGTTGCTGGGCTTAGAAATATGAATAACCATATGCCAGATAGATGGA + ; <<=<======: : ==>====<<<; ; ; : : : 99999988887766655554443333222211111000// � � � � @315 ARAAXX 090414: the unique instrument name 8: flowcell lane 1: tile number within the flowcell lane 567: 'x'-coordinate of the cluster within the tile 552: 'y'-coordinate of the cluster within the tile # : index number for a multiplexed sample (0 for no indexing) 0 : the member of a pair, /1 or /2 (paired-end or mate-pair reads only) http: //en. wikipedia. org/wiki/FASTQ_format

Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

Quality Control & Pre-Alignment Processing. 1 � Convert Illumina Fastq to sanger Fastq $: maq ill 2 sanger s_1_1_sequence. txt foo. 1. sanger. fastq $: maq ill 2 sanger s_1_2_sequence. txt foo. 2. sanger. fastq

Quality Control & Pre-Alignment Processing. 2 � Fast. QC: Provides a simple way to do some quality control checks on raw sequence data. ◦ ◦ ◦ Quick impression of whether the data has problems. Import of data from BAM, SAM or Fast. Q Summary graphs and tables to quickly assess your data Export of results to an HTML based permanent report Offline operation to allow automated generation of reports without running the interactive application $: fastqc foo. 1. sanger. fastq; $: fastqc foo. 2. sanger. fastq; http: //www. bioinformatics. bbsrc. ac. uk/projects/fastqc/

Fast. QC

Reference genomes � Available genomes ◦ Homo_sapiens_assembly 18. fasta ◦ human_b 36_both. fasta ◦ human_g 1 k_v 37. fasta (1000 genomes) � Indexed for use with BWA or Novoalign � Location: /scratch/data/reference_genomes/human � Human reference sequences and db. SNP reference metadata are available in a tarball: ◦ ftp: //ftp. broadinstitute. org/pub/gsa/gatk_resources. tgz

BWA: Burrows-Wheeler Aligner ## Align with BWA $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $: bwa aln -t 8 $REF foo. 1. sanger. fastq > foo. 1. sai; $: bwa aln -t 8 $REF foo. 2. sanger. fastq > foo. 2. sai; ## Generate alignment in the SAM format $: bwa sampe $REF foo. 1. sai foo. 2. sai foo. 1. sanger. fastq foo. 2. sanger. fastq > foo. bwa. raw. sam; ## Sort bwa SAM file using PICARD TOOLS Sort. Sam. jar - this will also produce the BAM file $: java -jar Sort. Sam. jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT INPUT= foo. bwa. raw. sam OUTPUT= foo. bwa. raw. bam; ## samtools index $: samtools index foo. novo. raw. bam; � � Use option -q 15 if the quality is poor at the 3' end of reads http: //bio-bwa. sourceforge. net/bwa. shtml

BWA : with pre-alignment QC. 1 � Fastx: http: //hannonlab. cshl. edu/fastx_toolkit ◦ QC filter raw sequence data ◦ always use -Q 33 for sanger phred scaled data (-Q 64) $: cat foo. 1. sanger. fastq | fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | fastx_artifacts_filter -Q 33 -v | fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo. 1. sanger. qc. fastq; $: cat foo. 2. sanger. fastq | fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | fastx_clipper -Q 33 -l 20 –v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | fastx_artifacts_filter -Q 33 -v | fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo. 2. sanger. qc. fastq; #

BWA : with pre-alignment QC. 2 � Compare QCd fastq files ◦ ◦ One end of each read could be filtered out in QC BWA cant deal with mixed PE & SE data Need to id reads that are still paired after QC Need to id reads that are no longer paired after QC $: perl cmpfastq. pl foo. 1. sanger. qc. fastq foo. 2. sanger. qc. fastq � Reads matched on presence/absence of id's in each file : ◦ ◦ � foo. 1. sanger. qc. fastq : @315 ARAAXX 090414: 8: 1: 567: 552#1 foo. 2. sanger. qc. fastq : @315 ARAAXX 090414: 8: 1: 567: 552#2 Output: 2 files for each *QC. fastq file ◦ ◦ foo. 1. sanger. qc. fastq-common-out (reads in foo. 1. * == reads in foo. 2. *) foo. 1. sanger. qc. fastq-unique-out (reads in foo. 1. * not in foo. 2. *) foo. 2. sanger. qc. fastq-commont-out foo. 2. sanger. qc. fastq-unique-out http: //compbio. brc. iop. kcl. ac. uk/software/cmpfastq. php

BWA : with pre-alignment QC. 3 � Align $: $: $: with BWA REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta bwa aln -t 8 $REF foo. 1. sanger. qc. fastq-common-out > foo. 1. common. sai; bwa aln -t 8 $REF foo. 2. sanger. qc. fastq-common-out > foo. 2. common. sai; bwa aln -t 8 $REF foo. 1. sanger. qc. fastq-unique-out > foo. 1. unique. sai; bwa aln -t 8 $REF foo. 1. sanger. qc. fastq-unique-ou > foo. 2. unique. sai; � Multi threading option : -t N � http: //bio-bwa. sourceforge. net/bwa. shtml

BWA : with pre-alignment QC. 4 � Generate alignments in the SAM/BAM format $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta ## bwa sampe for *common* $: bwa sampe $REF foo. 1. common. sai foo. 2. common. sai foo. 1. sanger. qc. fastq-common-out foo. 2. sanger. qc. fastq-common-out > foo. common. sam; ## bwa samse for *unique* $: bwa samse $REF foo. 1. unique. sai foo. 1. sanger. qc. fastq-unique-out > foo. 1. unique. sam; $: bwa samse $REF foo. 2. unique. sai foo. 2. sanger. qc. fastq-unique-out > foo. 2. unique. sam; ## merge SAM files using PICARD TOOLS Merge. Sam. Files. jar - this will also sort the BAM file $: java -jar Merge. Sam. Files. jar INPUT=foo. common. sam INPUT=foo. 1. unique. sam INPUT=foo. 2. unique. sam ASSUME_SORTED=false VALIDATION_STRINGENCY=SILENT OUTPUT=foo. bwa. raw. bam; ## samtools index foo. bwa. raw. bam; � Details SAM/BAM Format : http: //samtools. sourceforge. net/SAM 1. pdf

Novoalign : quality aware gapped aligner. 1 � Has options for adaptor stripping and quality filters – and much more � More accurate than BWA but slower unless running MPI version � $1, 990/year for full set of tools – worth it! $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37 $: novoalign -d $REF -F STDFQ -f foo. 1. sanger. fastq foo. 2. sanger. fastq -a GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ACACTCTTTCCCTACACGACGCTCTTCCGATCT -r Random -i PE 200, 50 -c 8 --3 Prime -p 7, 10 0. 3, 10 -k -K foo. novo. test -o SAM $'@RGt. ID: foot. PL: Illuminat. PU: Illuminat. LB: tumourt. SM: foo' > foo. novo. stats > foo. novo. raw. sam; � http: //www. novocraft. com

Novoalign : quality aware gapped aligner. 2 � Novoalign produces a name sorted SAM file which needs to be co-ordinate sorted for downstream processing ## Sort novo SAM file using PICARD TOOLS Sort. Sam. jar - this will also produce the BAM file $: java -jar Sort. Sam. jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT INPUT= foo. novo. raw. sam OUTPUT= foo. novo. raw. bam; ## samtools index $: samtools index foo. novo. raw. bam;

Post Aligment processing of BAM files � Local realignment around Indels � Remove duplicate reads � Base Quality Score Recalibration � GATK: http: //www. broadinstitute. org/gsa/wiki/index. php/T he_Genome_Analysis_Toolkit � PICARD TOOLS: http: //picard. sourceforge. net � SAMTOOLS: http: //samtools. sourceforge. net � Many other quality stats/options for processing files using these tools : see web documentation

Local realignment around Indels. 1 � Sequence aligners are unable to perfectly map reads containing insertions or deletions ◦ Alignment artefacts ◦ False positives SNPs � Steps to the realignment process: ◦ Step 1: Determining (small) suspicious intervals which are likely in need of realignment ◦ Step 2: Running the realigner over those intervals ◦ Step 3: Fixing the mate pairs of realigned reads http: //www. broadinstitute. org/gsa/wiki/index. php/Local_realignment_around_indels

Local realignment around Indels. 2 Original BAM file Realigner. Target. Creator (GATK) for. Realigner. intervals (interval file) Indel. Realigner (GATK) Realigned BAM file Sort. Sam (PICARD) Co-ordinate sorted Realigned BAM file Fix. Mate. Information (PICARD) Co-ordinate sorted Realigned fixed BAM file

Local realignment around Indels. 3 $: $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37. final. rod TMPDIR=~/scratch/tmp GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar ## 1. Creating Intervals : Realigner. Target. Creator $: java –Xmx 5 g -jar $GATK -T Realigner. Target. Creator -R $REF -D $ROD -I foo. novo. bam -o foo. novo. bam. for. Realigner. intervals; ## 2. Realigning : Indel. Realigner $: java -Djava. io. tmpdir=$TMPDIR –Xmx 5 g -jar $GATK -T Indel. Realigner -R $REF -D $ROD -I foo. novo. bam -o foo. novo. realn. bam -target. Intervals foo. novo. bam. for. Realigner. intervals; ## samtools index $: samtools index foo. novo. realn. bam;

Local realignment around Indels. 3 ## 3. Sort realigned BAM file using PICARD TOOLS Sort. Sam. jar ## GATK Indel. Realigner produces a name sorted BAM $: java –Xmx 5 g -jar Sort. Sam. jar INPUT= foo. novo. realn. bam OUTPUT=foo. novo. realn. sorted. bam SORT_ORDER=coordinate TMP_DIR=$TMPDIR VALIDATION_STRINGENCY=SILENT; ## samtools index $: samtools index foo. novo. realn. soretd. bam; ## 4. Fixing the mate pairs of realigned reads using Picard tools Fix. Mate. Information. jar $: java -Djava. io. tmpdir=$TMPDIR -jar -Xmx 6 g Fix. Mate. Information. jar INPUT= foo. novo. realn. sorted. bam OUTPUT= foo. novo. realn. sorted. fixed. bam SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=$TMPDIR; ## samtools index foo. novo. realn. sorted. fixed. bam ;

Remove duplicate reads � Examine aligned records in the supplied SAM or BAM file to locate duplicate molecules and remove them $: TMPDIR=~/scratch/tmp ## Remove duplicate reads with Picard tools Mark. Duplicates. jar $: java -Xmx 6 g –jar Mark. Duplicates. jar INPUT= foo. novo. realn. sorted. fixed. bam OUTPUT= foo. novo. realn. duperemoved. bam METRICS_FILE=foo. novo. realn. Duplicate. metrics. file REMOVE_DUPLICATES=true ASSUME_SORTED=false TMP_DIR=$TMPDIR VALIDATION_STRINGENCY=SILENT; ## samtools index foo. novo. realn. duperemoved. bam;

Base Quality Score Recalibration. 1 � Correct for variation in quality with machine cycle and sequence context Recalibrated quality scores are more accurate Closer to the actual probability of mismatching the reference genome � Done by analysing the covariation among several features of a base � � ◦ Reported quality score ◦ The position within the read ◦ The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine ◦ Probability of mismatching the reference genome ◦ Known SNPs taken into account (db. SNP 131) � Covariates are then used to recalibrate the quality scores of all reads in a BAM file http: //www. broadinstitute. org/gsa/wiki/index. php/Base_quality_score_recalibration

Base Quality Score Recalibration. 2 Co-ordinate sorted Realigned fixed BAM file Analyze. Covariates Pre-recalibration analysis plots Count. Covariates table (. csv) Table. Recalibraion Final Recalibrated BAM file Count. Covariates Analyze. Covariates Post-recalibration analysis plots Recalibrated covariates table (. csv)

Base Quality Score Recalibration. 3 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37. final. rod ## 1. GATK Count. Covariates java -Xmx 8 g -jar $GATK -T Count. Covariates -R $REF --DBSNP $ROD -I foo. novo. realn. duperemoved. bam -recal. File foo. novo. realn. duperemoved. bam. recal_data. csv --default_platform Illumina -cov Read. Group. Covariate -cov Quality. Score. Covariate -cov Cycle. Covariate -cov Dinuc. Covariate -cov Tile. Covariate -cov Homopolymer. Covariate -nback 5; http: //www. broadinstitute. org/gsa/wiki/index. php/Base_quality_score_recalibration

Base Quality Score Recalibration. 4 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar $: GATKacov=/share/apps/gatk_20100930/Sting/dist/Analyze. Covariates. jar $: GATKR=/share/apps/gatk_20100930/Sting/R $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37. final. rod $: Rbin=/share/apps/R_current/bin/Rscript ## 2. GATK Analyze. Covariates java -Xmx 5 g –jar $GATKacov -recal. File foo. novo. realn. duperemoved. bam. recal_data. csv -output. Dir foo. novo. realn. duperemoved. bam. recal. plots -resources $GATKR -Rscript $Rbin; http: //www. broadinstitute. org/gsa/wiki/index. php/Base_quality_score_recalibration

Base Quality Score Recalibration. 5 � Generate the final analysis ready BAM file for Variant Discovery and Genotyping ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37. final. rod ## 3. GATK Table. Recalibration $: java –Xmx 6 g -jar $GATK -T Table. Recalibration -R $REF -I foo. novo. realn. duperemoved. bam --out foo. novo. final. bam -recal. File foo. novo. realn. duperemoved. bam. recal_data. csv --default_platform Illumina; ##samtools index $: samtools index foo. novo. final. bam; http: //www. broadinstitute. org/gsa/wiki/index. php/Base_quality_score_recalibration

SNP & Indel calling with GATK Final Recalibrated BAM file Indel. Genotyper. V 2 gatk. raw. indels. verbose. output. bed filter. Single. Sample. Calls. pl Unified. Genotyper gatk. raw. indels. detailed. output. vcf gatk. raw. indels. bed gatk. indels. filtered. bed gatk. raw. snps. vcf gatk. filtered. indels. simple. bed. . . chr 1 556817 +G: 3/7 chr 1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC: 9/21 chr 1 3778838 +A: 15/48. . . make. Indel. Mask. py indels. mask. bed Variant. Filtration gatk. filtered. snps. vcf

Short Indels (GATK Indel. Genotyper. V 2) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar $: GATKPERL=/share/apps/gatk_20100930/Sting/perl $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37. final. rod ## Generate raw indel calls $: java -Xmx 6 g -jar $GATK -T Indel. Genotyper. V 2 -R $REF --DBSNP $ROD -I foo. novo. final. bam -bed foo. gatk. raw. indels. bed -o foo. gatk. raw. indels. detailed. output. vcf --metrics_file foo. gatk. raw. indels. metrics. file -verbose foo. gatk. raw. indels. verbose. output. bed -min. Coverage 8 -S SILENT –mnr 1000000; ## Filter raw indels $: perl $GATKPERL/filter. Single. Sample. Calls. pl --calls foo. gatk. raw. indels. verbose. output. bed --max_cons_av_mm 3. 0 --max_cons_nqs_av_mm 0. 5 --mode ANNOTATE > foo. gatk. filtered. indels. bed http: //www. broadinstitute. org/gsa/wiki/index. php/Indel_Genotyper_V 2. 0

Creating an indel mask file � The output of the Indel. Genotyper is used to mask out SNPs near indels. � “make. Indel. Mask. py” creates a bed file representing the masking intervals based on the output of Indel. Genotyper. $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python ## Create indel mask file $: python $GATKPYTHON/make. Indel. Mask. py foo. gatk. raw. indels. bed 5 indels. mask. bed � The number in this command stands for the number of bases that will be included on either side of the indel. http: //www. broadinstitute. org/gsa/wiki/index. php/Indel_Genotyper_V 2. 0

SNP Calling (GATK Unified. Genotyper) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37. final. rod ## SNP Calling $: java -Xmx 5 g -jar $GATK -T Unified. Genotyper -R $REF -D $ROD -baq CALCULATE_AS_NECESSARY -baq. GOP 30 -nt 8 -A Depth. Of. Coverage -A Allele. Balance -A Haplotype. Score -A Homopolymer. Run -A Mapping. Quality. Zero -A Qual. By. Depth -A RMSMapping. Quality -A Spanning. Deletions -I foo. novo. final. bam -o foo. gatk. raw. snps. vcf -verbose foo. gatk. raw. snps. vcf. verbose -metrics foo. gatk. raw. snps. vcf. metrics; � This results in a VCF (variant call format) file containing raw SNPs. ◦ VCF is a text file format. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome (SNP/INDEL calls). http: //www. 1000 genomes. org/wiki/Analysis/Variant%20 Call%20 Format/vcf-variant-call-format-version-40 http: //www. broadinstitute. org/gsa/wiki/index. php/Unified_genotyper

SNP Filtering & annotation (GATK Variant. Filtration) � � Variant. Filtration is used annotate suspicious calls from VCF files based on their failing given filters. It annotates the FILTER field of VCF files for records that fail any one of several filters: ◦ ◦ ◦ Variants that lie in clusters, using the specified values to define a cluster (i. e. the number of variants and the window size). Any variant which overlaps entries from a masking rod. Any variant whose INFO field annotations match a specified expression (i. e. the expression is used to describe records which should be filtered out). ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/Genome. Analysis. TK. jar $: REF=/scratch/data/reference_genomes/human_g 1 k_v 37. fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b 37. final. rod ## Variant. Filtration & annotation $: java –Xmx 5 g -jar $GATK -T Variant. Filtration -R $REF -D $ROD -o foo. gatk. Variant. Filtration. snps. vcf -B variant, VCF, foo. gatk. raw. snps. vcf -B mask, Bed, indels. mask. bed --mask. Name In. Del --cluster. Size 3 --cluster. Window. Size 10 --filter. Expression "DP <= 8" --filter. Name "DP 8" --filter. Expression "SB > -0. 10" --filter. Name "Strand. Bias" --filter. Expression "HRun > 8" --filter. Name "HRun 8" --filter. Expression "QD < 5. 0" --filter. Name "QD 5" --filter. Expression "MQ 0 >= 4 && ((MQ 0 / (1. 0 * DP)) > 0. 1)" --filter. Name "hard 2 validate"; � More information on the parameters used can be found in: http: //www. broadinstitute. org/gsa/wiki/index. php/Variant. Filtration. Walker http: //www. broadinstitute. org/gsa/wiki/index. php/Using_JEXL_expressions

SNP Filtering: VCFTOOLS � � VCFTOOLS: methods for working with VCF files: filtering, validating, merging, comparing and calculate some basic population genetic statistics. Example of some basic hard filtering: ## filter poor quality & suspicious SNP calls vcftools --vcf foo. gatk. Variant. Filtration. snps. vcf --remove-filtered DP 8 --remove-filtered Strand. Bias --remove-filtered Low. Qual --remove-filtered hard 2 validate --remove-filtered Snp. Cluster --keep-INFO AC --keep-INFO AF --keep-INFO AN --keep-INFO DB --keep-INFO DP --keep-INFO DS --keep-INFO Dels --keep-INFO HRun --keep-INFO Haplotype. Score -keep-INFO MQ --keep-INFO MQ 0 --keep-INFO QD --keep-INFO SB --out foo. gatk. good. snps ; � this produces the file " foo. gatk. good. snps. recode. vcf"

VCFTOOLS for SNP QC and statistics � VCFTOOLS can be used to generate useful QC measures, PLINK ped/map, Impute input and more. . ## QC & info $: for MYQC in missing freq 2 counts 2 depth site-mean-depth geno-depth het hardy singletons; do vcftools --vcf foo. gatk. good. snps. recode. vcf --$MYQC --out foo. gatk. good. snps. QC; done ## write genotypes, genotype qualities and genotype depth to separate files $: for MYFORMAT in GT GQ DP; do vcftools --vcf foo. gatk. good. snps. recode. vcf --extract-FORMAT-info $MYFORMAT --out foo. gatk. good. snps; done ## make PLINK ped and map files vcftools --vcf foo. gatk. good. snps. recode. vcf --plink --out foo. gatk. good. snps http: //vcftools. sourceforge. net/

Picard Tools : Post Alignment Summary reports � See : http: //www. broadinstitute. org/files/shared/mpg/nextgen 2010/nextgen_fennell. pdf

Questions? � Email : stephen. newhouse@kcl. ac. uk � More useful links: ◦ http: //www. broadinstitute. org/gsa/wiki/index. php/Prer equisites ◦ http: //www. broadinstitute. org/gsa/wiki/index. php/Build ing_the_GATK ◦ http: //www. broadinstitute. org/gsa/wiki/index. php/Dow nloading_the_GATK ◦ http: //www. broadinstitute. org/gsa/wiki/index. php/Inpu t_files_for_the_GATK ◦ http: //www. broadinstitute. org/gsa/wiki/index. php/Prep aring_the_essential_GATK_input_files: _the_reference_gen ome ◦ http: //www. broadinstitute. org/gsa/wiki/index. php/The_ DBSNP_rod