Garbage In, Garbage Out: Quality control on sequence data
Key concepts of session • The quality of the data limits what you can confidently say about the data and how you can subsequently use it. • An important component of quality control is visualization: you must actually LOOK at your data.
So you have reads off a sequencer … where do you start? The FASTQ format: More on the file format and quality encoding: https://en.wikipedia.org/wiki/FASTQ_format
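A minimal sketch of what a FASTQ record holds, assuming the common layout of four lines per record and Phred+33 quality encoding (older Illumina data may use an offset of 64). The names `FastqRecord`, `read_fastq`, `phred_scores`, and `error_probability` are illustrative helpers, not part of any tool used in this session.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # header line without the leading '@'
    sequence: str   # base calls
    quality: str    # one encoded quality character per base


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly 4 lines per record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between or after records
        try:
            sequence = next(it).rstrip("\r\n")
            next(it)  # the '+' separator line
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(phred: float) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)
```

Because `read_fastq` accepts any iterable of lines, it can be given an open file handle directly. A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.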
Expectation
But the reality may be very different
So what? Why does QC matter? • You are going to spend a LOT of time (and $) on this dataset. • Downstream analysis software assumes reasonably well-behaved data!
How to assess a bag of reads • Pre-mapping: FastQC – GC content – read quality (Phred score) • Post-mapping: – read coverage (which regions, how much) – complexity (# unique samples)
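The per-read metrics above can be sketched in a few lines. This is only an illustration of what the numbers mean, not how FastQC computes them. It assumes Phred+33 encoding, and it approximates complexity as the fraction of distinct read sequences, which is an assumption about the metric rather than something the slide specifies. All function names are hypothetical.

```python
from typing import Iterable


def gc_fraction(sequence: str) -> float:
    """Fraction of bases in a read that are G or C (0.0 for an empty read)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read, assuming Phred+33 encoding by default."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def unique_fraction(sequences: Iterable[str]) -> float:
    """Rough complexity proxy: distinct read sequences / total reads."""
    reads = list(sequences)
    if not reads:
        return 0.0
    return len(set(reads)) / len(reads)
```

A value of `unique_fraction` well below 1.0 suggests many duplicated reads, which is one symptom of over-amplification or low library complexity.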
Protocol matters – how the experiment influences your QC • Mistakes in protocol can result in abnormal distributions • Poor read quality = poor mapping = poor coverage
WHY doesn’t it look like I wanted? • Cell clustering – over-amplification • Low library complexity • Problems with amplification or size selection • Problem with adapters See also: https://sequencing.qcfail.com/
But one person’s garbage is another’s treasure.
You can still obtain information • Even low coverage samples can give you information: – Which genes are being actively transcribed – Differentially expressed genes (depending on depth and coverage)
Running FastQC – Pre-Trim • Determine which adapters are present if you are unsure of the protocol • Assess whether the sequencing/protocol is providing the expected results • Refine trimming options
In this script, we will: • Flip reads (reverse complement) – protocol dependent • Run FastQC To run (after adjusting parameters in green box): $ bash fastqc_pretrim.sh
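What "flipping" a read involves can be sketched as follows. This is not the contents of fastqc_pretrim.sh, which are not shown here. The helper names are hypothetical. The key point is that the quality string has to be reversed along with the sequence so each score stays attached to its base.

```python
from typing import Tuple

# Translation table for complementing bases; N stays N, case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the read."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flip_read(sequence: str, quality: str) -> Tuple[str, str]:
    """Reverse-complement the bases and reverse the per-base qualities."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return reverse_complement(sequence), quality[::-1]
```

For example, `flip_read("GATTACA", "IIII#5I")` returns `("TGTAATC", "I5#IIII")`.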
Open up our fastqc.html report
Trimming • Many different trimming programs available • We will use “bbduk” – quick runtime, lots of trim options $ vi trim.sh
In this script, we will: • Trim for adapters (followed by length) • Trim for quality To run (after adjusting rootname/project): $ bash trim.sh
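To make the trimming steps concrete, here is a deliberately simplified sketch of adapter, quality, and length trimming on a single read. It is not bbduk's algorithm (bbduk matches k-mers and tolerates mismatches) and it is not the contents of trim.sh. The exact-match adapter search, the order of the steps, the thresholds, and all function names are assumptions chosen for illustration.

```python
from typing import Optional, Tuple


def trim_adapter(sequence: str, quality: str, adapter: str,
                 min_overlap: int = 3) -> Tuple[str, str]:
    """Cut at the first exact adapter match, or at a partial adapter
    (a prefix of the adapter) sitting at the 3' end of the read."""
    cut = sequence.find(adapter)
    if cut == -1:
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality  # no adapter found
    return sequence[:cut], quality[:cut]


def trim_quality(sequence: str, quality: str, min_q: int = 10,
                 offset: int = 33) -> Tuple[str, str]:
    """Drop 3' bases while their Phred score is below min_q."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def trim_read(sequence: str, quality: str, adapter: str,
              min_q: int = 10, min_len: int = 25
              ) -> Optional[Tuple[str, str]]:
    """Adapter trim, then quality trim, then discard reads that are too short."""
    seq, qual = trim_adapter(sequence, quality, adapter)
    seq, qual = trim_quality(seq, qual, min_q)
    if len(seq) < min_len:
        return None  # read is too short to map reliably
    return seq, qual
```

A read that returns `None` is dropped entirely. Comparing how many reads survive under different `min_q` and `min_len` values is one way to think about the "refine trimming options" step.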
View trim stats $ cd /home/user/hackcon/trimmed $ ls $ vim sample.stats What can we learn from this report?
Running FastQC – Post-Trim • Determine which adapters are present if you are unsure of the protocol • Assess whether the sequencing/protocol is providing the expected results • Refine trimming options
In this script, we will: • Assess our trimming parameters • Determine if we need to re-trim or move forward with mapping To run (after adjusting rootname/project): $ bash fastqc_postrim.sh
Open up our fastqc.html report