Information Theory of DNA Sequencing
David Tse, Dept. of EECS, U.C. Berkeley
Guy Bresler, Abolfazl Motahari
LIDS Student Conference, MIT, Feb. 2, 2012
Research supported by NSF Center for Science of Information.

DNA sequencing
DNA: the blueprint of life
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTG CGACTGACTGGGT CTAGACTACGTTTTA TATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
courtesy: Batzoglou

Impetus: Human Genome Project
3 billion basepairs
1990: Start
2001: Draft
2003: Finished
courtesy: Batzoglou

Sequencing Gets Cheaper and Faster
Cost of one human genome
• HGP: $3 billion
• 2004: $30,000
• 2008: $100,000
• 2010: $10,000
• 2011: $4,000
• 2012-13: $1,000
• ???: $300
courtesy: Batzoglou

But many genomes to sequence
100 million species (e.g. phylogeny)
7 billion individuals (SNP, personal genomics)
10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)
courtesy: Batzoglou

Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.

Sequencing Technologies
• HGP era: single technology (Sanger)
• Current: multiple “next generation” technologies (e.g. Illumina, SOLiD, PacBio, Ion Torrent, etc.)
• All provide massively parallel sequencing.
• Each technology has different read lengths, noise profiles, etc.

Assembly Algorithms
• Many proposed algorithms.
• Different algorithms tailored to different technologies.
• Each algorithm deals with the full complexity of the problem while trying to scale well with the massive amount of data.
• Lots of heuristics used in the design.

A Basic Question
• What is the minimum number of reads needed to reconstruct the DNA sequence with a given reliability?
• A benchmark for comparing different algorithms.
• An algorithm-independent basis for comparing different technologies and designing new ones.

Coverage Analysis
• Pioneered by Lander-Waterman
• What is the minimum number of reads to ensure there is no gap between the reads with a desired probability?
• Only provides a lower bound on the number of reads needed for reconstruction.
• Can one get a tight lower bound?
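The coverage question can be made concrete with a rough calculation. The sketch below assumes the usual Lander-Waterman style approximation: N reads of length L start uniformly at random on a genome of length G, so the expected number of gaps is about N·exp(−NL/G). The function names and the tolerance `eps` are illustrative choices, not taken from the talk.

```python
import math

def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Approximate expected number of uncovered gaps: N * exp(-N * L / G)."""
    return n_reads * math.exp(-n_reads * read_len / genome_len)

def min_reads_for_coverage(read_len: int, genome_len: int, eps: float) -> int:
    """Smallest N >= G/L whose approximate expected gap count is at most eps."""
    # N * exp(-N * L / G) is decreasing in N once N >= G / L.
    lo = math.ceil(genome_len / read_len)
    if expected_gaps(lo, read_len, genome_len) <= eps:
        return lo
    hi = 2 * lo
    while expected_gaps(hi, read_len, genome_len) > eps:
        hi *= 2
    # Invariant: gaps(lo) > eps >= gaps(hi); bisect over integer N.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_gaps(mid, read_len, genome_len) <= eps:
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    # Toy genome: G = 10,000 bases, reads of length 100, at most 0.01 expected gaps.
    print(min_reads_for_coverage(100, 10_000, 0.01))
```

Having enough reads to close all gaps is necessary for reconstruction but does not by itself guarantee it, which is why this count is only a lower bound.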

Communication and Sequencing: An Analogy
[Figure: two diagrams, labeled "Communication" and "Sequencing"]

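Communication: Fundamental Limits
Given statistical models for source and channel (Shannon '48):
Asymptotically reliable communication at rate R source symbols per channel output symbol is possible if and only if R < C_channel / H_source, where C_channel is the channel capacity and H_source is the entropy rate of the source.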

DNA Sequencing: Fundamental Limits?
• Define: sequencing rate R = G/N basepairs per read, where G is the DNA length and N is the number of reads.
• Question: can one define a sequencing capacity C such that asymptotically reliable reconstruction is possible if and only if R < C?

A Simple Model
• DNA sequence: i.i.d. with distribution p.
• Starting positions of reads are i.i.d. uniform on the DNA sequence.
• Read process is noiseless.
Will extend to a more complex source model and a noisy read process later.

The read channel
AGCTTATAGGTCCGCATTACC → [read channel] → AGGTCC
• Capacity depends on
  – read length: L
  – DNA length: G
• Normalized read length:
• E.g. L = 100, G = 3 × 10^9:

Result: Sequencing Capacity
H_2(p): Rényi entropy of order 2
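The capacity formula itself is not in the text of this slide. The sketch below is a hedged reconstruction based on how the result is usually stated in the published version of this work (Motahari, Bresler and Tse): it assumes the normalized read length is L̄ = L / ln G, that H_2(p) = −ln Σ p_i² is measured in nats, and that the capacity is 0 when L̄ < 2/H_2(p) and L̄ otherwise. These three assumptions, and all function names, are not confirmed by the slide text.

```python
import math

def renyi_entropy_2(p):
    """Renyi entropy of order 2 in nats: H_2(p) = -ln(sum_i p_i^2)."""
    return -math.log(sum(q * q for q in p))

def normalized_read_length(read_len, genome_len):
    """Assumed normalization (not stated in the slide text): L_bar = L / ln(G)."""
    return read_len / math.log(genome_len)

def sequencing_capacity(read_len, genome_len, p):
    """Assumed capacity in basepairs per read.

    Below the no-duplication threshold L_bar = 2 / H_2(p) the capacity is 0;
    above it, the coverage constraint gives capacity L_bar.
    """
    l_bar = normalized_read_length(read_len, genome_len)
    return l_bar if l_bar > 2.0 / renyi_entropy_2(p) else 0.0

if __name__ == "__main__":
    uniform = [0.25] * 4
    for read_len in (20, 100):
        print(read_len, sequencing_capacity(read_len, 3e9, uniform))
```

Under these assumptions, short reads give zero capacity no matter how many reads are taken, while for longer reads the number of reads needed is set by coverage alone.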

Coverage Constraint
[Figure: N reads of length L laid along a DNA sequence of length G; labels G, L, T]
Starting positions of reads ~ Poisson(1/R)

No-Duplication Constraint
[Figure: two candidate sequences, with segments of length L marked]
The two possibilities have the same set of length-L subsequences.
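A small example may help show how two different sequences can share the same set of length-L substrings. The construction below is an illustrative assumption, not the figure from the slide: two repeats `a` and `b`, each of length L − 1, appear interleaved, and swapping the segments between them changes the sequence without changing its set of length-L substrings.

```python
def spectrum(seq: str, k: int) -> set:
    """Set of all length-k contiguous substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

L = 4
a, b = "ACG", "TGA"          # two repeats, each of length L - 1
x, y, z = "CC", "GG", "TT"   # distinct segments between the repeats

s1 = a + x + b + y + a + z + b   # a x b y a z b
s2 = a + z + b + y + a + x + b   # a z b y a x b (x and z swapped)

if __name__ == "__main__":
    print(s1 != s2, spectrum(s1, L) == spectrum(s2, L))
```

No window of length L can see past a repeat of length L − 1 on both sides, so even a complete set of noiseless length-L reads cannot tell `s1` from `s2`.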

Achievability
[Figure: rate region bounded by the no-duplication constraint and the coverage constraint. Is the region between them achievable?]

Greedy Algorithm
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads.
2. Find two contigs with largest overlap and merge them into a new contig.
3. Repeat step 2 until only one contig remains or no more merging can be done.
Algorithm progresses in stages: at stage ℓ, merge reads at overlap ℓ.
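The three steps above can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration for noiseless reads, not an implementation from the talk; the helper names `overlap` and `greedy_assemble` and the toy reads are assumptions made for the example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the two contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # step 1: contigs = distinct reads
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):          # step 2: find the best pair
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:                          # step 3: nothing left to merge
            break
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs

if __name__ == "__main__":
    reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
    print(greedy_assemble(reads))
```

Because the largest overlaps are merged first, this loop effectively runs through the stages from large overlap down to small overlap.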

Greedy algorithm: the beginning
Most reads have large overlap with neighbors.
Expected # of errors in stage L-1: determined by the probability that two disjoint reads are equal.
Very small since the no-duplication constraint is satisfied.
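The probability that two disjoint reads are equal has a simple closed form under the i.i.d. model: (Σ p_i²)^L, which equals exp(−L·H_2(p)). The sketch below computes it, along with a pair-counting estimate of the expected number of coinciding read pairs. That estimate treats all pairs as disjoint and is an assumption about what the slide's missing formula counts; the function names are illustrative.

```python
import math

def prob_reads_equal(p, read_len: int) -> float:
    """P(two independent i.i.d.(p) strings of length L coincide) = (sum_i p_i^2)^L."""
    return sum(q * q for q in p) ** read_len

def expected_equal_pairs(n_reads: int, p, read_len: int) -> float:
    """Rough estimate: (number of read pairs) * P(a given disjoint pair is equal)."""
    return math.comb(n_reads, 2) * prob_reads_equal(p, read_len)

if __name__ == "__main__":
    uniform = [0.25] * 4
    print(prob_reads_equal(uniform, 100))
    print(expected_equal_pairs(10**8, uniform, 100))
```

The estimate decays exponentially in L but grows only quadratically in N, which is why a read length on the order of log G is enough to make it small.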

Greedy algorithm: stage ℓ
Expected # of errors at stage ℓ: determined by the probability that two disjoint reads appear to overlap.
This may get larger, but it remains bounded, and the bound is very small since the coverage constraint is satisfied.

Summary: Two Regimes
• duplication-limited regime
• coverage-limited regime

Relation to Earlier Works
• Coverage constraint: Lander-Waterman '88
• No-duplication constraint: Arratia et al. '96
• Arratia et al. focused on a model where all length-L subsequences are given (sequencing by hybridization).
• Our result: the two constraints together are necessary and sufficient for shotgun sequencing.

Rest of Talk
• Impact of read noise.
• Impact of repeats in the DNA sequence.

Read Noise
[Figure: DNA sequence ACGTCCTATGCGTAATGCCACATATTGCTATGCGTAATGCG, with a segment TATA read as the noisy read CTTA]
Model: discrete memoryless channel defined by transition probabilities

Modified Greedy algorithm
[Figure: two reads, with overlapping portions labeled X and Y]
Do we merge the two reads at overlap ℓ?
We observe two strings: X and Y.
Are they noisy versions of the same DNA subsequence? (merge)
Or from two different locations? (do not merge)
This is a hypothesis testing problem!

Impact on Sequencing Rate
H0: noisy versions of the same DNA subsequence (merge)
H1: from disjoint DNA subsequences (do not merge)
• Hypothesis test, MAP rule: declare H0 if the likelihood ratio of the observed pair (X, Y) under H0 versus H1 exceeds a threshold.
Two types of error:
• false positive (same as before)
• missed detection (new type of error)
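The merge decision can be sketched as a log-likelihood ratio test. The code below assumes i.i.d. DNA with distribution `p`, both reads passing independently through the same discrete memoryless channel `q`, and already-aligned overlaps of equal length. The symmetric channel, the threshold value and all function names are illustrative assumptions; the threshold used in the talk is not in the slide text.

```python
import math

def llr_same_origin(x: str, y: str, p: dict, q: dict) -> float:
    """log P(x, y | H0) - log P(x, y | H1) for aligned strings x and y.

    p[s]    : probability of true base s (i.i.d. DNA model)
    q[s][o] : channel probability of observing o when the true base is s
    Requires every observed pair to have nonzero probability under H0.
    """
    def marginal(o):
        return sum(p[s] * q[s][o] for s in p)

    total = 0.0
    for xo, yo in zip(x, y):
        joint_h0 = sum(p[s] * q[s][xo] * q[s][yo] for s in p)  # same true base
        joint_h1 = marginal(xo) * marginal(yo)                 # independent bases
        total += math.log(joint_h0 / joint_h1)
    return total

def should_merge(x, y, p, q, threshold: float = 0.0) -> bool:
    """Declare H0 (merge) when the log-likelihood ratio reaches the threshold."""
    return llr_same_origin(x, y, p, q) >= threshold

def symmetric_channel(eps: float, alphabet: str = "ACGT") -> dict:
    """Hypothetical noise model: correct with prob. 1 - eps, else uniform error."""
    return {s: {o: (1 - eps) if o == s else eps / (len(alphabet) - 1)
                for o in alphabet} for s in alphabet}

if __name__ == "__main__":
    p = {base: 0.25 for base in "ACGT"}
    q = symmetric_channel(0.01)
    print(should_merge("ACGTACGT", "ACGTACGA", p, q))  # one mismatch in eight
    print(should_merge("ACGTAC", "CATGCA", p, q))      # every position differs
```

Raising the threshold trades false positives for missed detections, which is the trade-off the MAP threshold controls.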

Impact on Sequencing Rate
[Figure: rate region with the no-duplication constraint and the coverage constraint, obtained by optimizing the MAP threshold]

More Complex DNA Statistics
• i.i.d. is not a very good model for the DNA sequence.
• More generally, we may want to model it as a correlated random process.
• For short-scale correlation, H_2(p) can be replaced by the Rényi entropy rate of the process.
• But for higher mammals, DNA contains long repeats, with repeat lengths comparable to or longer than the read length.
• This is handled by paired-end reads in practice.

A Simple Model for Repeats
Model: M repeats of length K placed uniformly into the DNA sequence.
If repeat length K >> read length L, how to reconstruct the sequence?
Use paired-end reads: reads come in pairs with known separation J.
These reads can bridge the repeats.

Impact on Sequencing Rate
K = repeat length
J = paired-end separation
If J > 2d + K, then capacity is the same as without repeats.
[Figure labels: no-duplication constraint; coverage of repeats constraint; constant independent of K]

Conclusion
• DNA sequencing is an important problem.
• Many new technologies and new applications.
• An analogy between sequencing and communication is drawn.
• A notion of sequencing capacity is formulated.
• A principled design framework?