ChIP-Seq Data Processing and QC, Simon Andrews

ChIP-Seq Data Processing and QC
Simon Andrews
simon.andrews@babraham.ac.uk
@simon_andrews
v2020-05

Data Creation and Processing
[Diagram: Starting DNA → Fragmented DNA → ChIPped DNA → Sequence Library → FastQ Sequence File → Mapped BAM File]

A typical ChIP Library
[Diagram: Barcode, Adapter, ChIP Fragment, Adapter, Barcode]
• Potential technical problems
  – Adapter contamination
  – PCR Duplication
• Potential biological problems
  – Lack of enrichment
  – Other selection biases

QC of raw sequence Base Call Quality

QC of raw sequence Sequence Composition

QC of raw sequence Adapter Contamination
[Diagram: Barcode, Adapter, ChIP Fragment, Adapter, Barcode, with the Read running from the fragment into the adapter]
Trim Galore! Quality and Adapter Trimming
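A rough way to gauge adapter contamination before trimming is to count the reads that contain the start of the adapter. The sketch below assumes uncompressed four-line FASTQ records on standard input and defaults to AGATCGGAAGAGC, the Illumina adapter prefix that Trim Galore searches for. The function name adapter_read_count is made up for this illustration and is not part of any tool.

```shell
# Count reads containing an adapter prefix.
# Reads FASTQ (4 lines per record) from standard input.
# $1 (optional): adapter sequence; defaults to the Illumina prefix.
adapter_read_count() {
    awk -v adapter="${1:-AGATCGGAAGAGC}" '
        NR % 4 == 2 {                      # sequence line of each record
            total++
            if (index($0, adapter)) hits++
        }
        END { if (total) printf "%d/%d\n", hits, total }
    '
}
```

For a gzipped file this could be run as gzip -dc data.fastq.gz | adapter_read_count, which prints the number of adapter-containing reads over the total. It only finds exact matches, so it will undercount compared with a real trimmer.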

Mapping ChIP Data
• All regions should be linear genomic stretches
• Standard genomic aligners are fine
  – Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2/
  – BWA: http://bio-bwa.sourceforge.net/

Example Bowtie 2 Mapping
• Create Genome Index (once - slow!)
    bowtie2-build yeast_genome.fa yeast_index
• Map a single FastQ file
    bowtie2 -x yeast_index -U data.fastq.gz | samtools view -bS -o data.bam -

Post Alignment QC Mapping Statistics
sample 1:  2264052 (14.66%) aligned exactly 1 time
sample 2:  2698005 (18.79%) aligned exactly 1 time
sample 3: 13434392 (67.08%) aligned exactly 1 time
sample 4:  1108477 (6.70%) aligned exactly 1 time
sample 5:  2143911 (17.58%) aligned exactly 1 time
sample 6:  2980154 (13.98%) aligned exactly 1 time
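Bowtie 2 prints its alignment summary, including the "aligned exactly 1 time" line shown above, to standard error. A minimal sketch for pulling that line out of saved logs follows. It assumes the summary was captured with something like 2> sample1.log, and the function name unique_alignments is hypothetical.

```shell
# Print "<read count><TAB><percentage>" for the uniquely aligned reads
# found in one or more saved Bowtie 2 summaries (or on standard input).
unique_alignments() {
    awk '/aligned exactly 1 time/ {
        gsub(/[()%]/, "", $2)      # "(14.66%)" -> "14.66"
        print $1 "\t" $2
    }' "$@"
}
```

Running unique_alignments sample*.log would then give one line per log file, which makes it easy to spot a sample whose unique mapping rate is out of line with the rest.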

Post Alignment Processing MAPQ Filtering
• ChIP-Seq relates sequences to positions in a reference genome
• You need to be confident that the reported position is correct
• Filtering on MAPQ value (likelihood of reported position being incorrect) is an easy way to do this
• MAPQ filtering should be performed in most cases
    samtools view -q 20 -b -o filtered.bam data.bam
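To see what a threshold such as -q 20 means, recall that the SAM specification defines MAPQ as a Phred-scaled value: MAPQ = -10 * log10(probability that the mapping position is wrong). The sketch below simply inverts that formula. The function name mapq_error_prob is made up here, and aligners only estimate this probability (Bowtie 2, for instance, reports values on a capped scale), so the result is a guide rather than an exact error rate.

```shell
# Convert a MAPQ score into the error probability it nominally encodes:
#   P(wrong position) = 10 ^ (-MAPQ / 10)
mapq_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}
```

Under this definition mapq_error_prob 20 prints 0.0100, so samtools view -q 20 keeps reads whose reported position has a nominal chance of being wrong of 1% or less.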

Post Alignment Processing Deduplication
    java -jar picard.jar SortSam INPUT=filtered.bam OUTPUT=sorted.bam SORT_ORDER=coordinate
    java -jar picard.jar MarkDuplicates INPUT=sorted.bam OUTPUT=dedup.bam METRICS_FILE=metrics.txt
DO NOT DEDUPLICATE AS A MATTER OF COURSE! THINK FIRST!

To Deduplicate or Not?
• Deduplication can make enrichment visually clearer and help to spot truly enriched regions
• Why not just deduplicate everything?
  – Quantitation compression

Good Deduplication

Quantitation Compression from Deduplication
[Plot axes: Quantitation Difference (Dedup – Normal); Number of reads per peak]
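One way to see where the compression comes from: deduplication keeps at most one read per start position, so a peak can never score higher than the number of distinct positions it covers. The sketch below uses a deliberately simple model that is not taken from the slides. It assumes n genuinely independent reads land uniformly at random on k possible positions, in which case the expected number of occupied positions is k * (1 - (1 - 1/k)^n). The function name expected_after_dedup is hypothetical.

```shell
# Expected read count left after deduplication when n independent reads
# fall uniformly on k possible positions (simplified occupancy model).
#   $1 = n (reads in the peak before deduplication)
#   $2 = k (distinct positions available in the peak)
expected_after_dedup() {
    awk -v n="$1" -v k="$2" 'BEGIN {
        printf "%.1f\n", k * (1 - (1 - 1 / k) ^ n)
    }'
}
```

With k fixed, the result tracks n closely while n is small and flattens out towards k as n grows, so strongly enriched peaks lose proportionally more reads than weak ones even when none of the duplicates are PCR artefacts.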

Assessing Duplication
[Plot axes: Observed Duplication (%); Read Density]
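For an overall duplication figure to set alongside a plot like this, the metrics file written by Picard MarkDuplicates contains a tab-separated table with a PERCENT_DUPLICATION column, stored as a fraction between 0 and 1. The sketch below looks the column up by name rather than by position. The function name percent_duplication is made up, and it reports only the first library row.

```shell
# Print PERCENT_DUPLICATION for the first library listed in a
# Picard MarkDuplicates metrics file.
#   $1 = path to the metrics file (e.g. metrics.txt)
percent_duplication() {
    awk -F '\t' '
        $1 == "LIBRARY" {                  # header row of the metrics table
            for (i = 1; i <= NF; i++)
                if ($i == "PERCENT_DUPLICATION") col = i
            next
        }
        col && NF >= col { print $col; exit }   # first data row after it
    ' "$1"
}
```

This is a whole-sample number. It says nothing about whether the duplication is concentrated in enriched regions, which is the question that decides whether deduplication is appropriate.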

Standard Processing Workflow
[Diagram: FastQ File → Trim Galore → Trimmed FQ File → Bowtie / BWA → BAM File → SAMTools → Filtered BAM → Visualisation and Assessment; side outputs: FastQC Reports, Mapping Stats, MultiQC Report]
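The workflow can be strung together in a short script. The sketch below is one possible arrangement for a single-end sample, not the author's own pipeline. The function names run and process_chip_sample, the NAME.fastq.gz naming scheme, and the intermediate file names are assumptions made for illustration. Setting DRY_RUN=1 prints the commands instead of executing them, which is useful for checking the plan before the tools are installed.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: process_chip_sample NAME INDEX
# Assumes the raw reads are in NAME.fastq.gz (hypothetical naming scheme).
# The body runs in a subshell so its variables and set -e stay local.
process_chip_sample() (
    set -e                       # stop at the first failing step
    name="$1"
    index="$2"
    run fastqc "$name.fastq.gz"                 # QC of the raw reads
    run trim_galore "$name.fastq.gz"            # writes NAME_trimmed.fq.gz
    run fastqc "${name}_trimmed.fq.gz"          # QC after trimming
    run bowtie2 -x "$index" -U "${name}_trimmed.fq.gz" -S "$name.sam"
    run samtools view -b -q 20 -o "$name.filtered.bam" "$name.sam"
    run multiqc .                               # collect reports found here
)
```

For example, DRY_RUN=1 followed by process_chip_sample data yeast_index lists the six commands in order. Deduplication is left out on purpose, in line with the advice not to apply it as a matter of course.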