Whole-Genome Optical Mapping
Michael Waterman, University of Southern California

Human Genome Variation
- Types of variation: substitutions, insertions/deletions, duplications, rearrangements
- SNPs: single nucleotide polymorphisms

Optical Mapping
- A single-molecule restriction mapping technology
- Developed by D. Schwartz (University of Wisconsin-Madison)

Optical Mapping: Overview
- A DNA extract is applied to a silicon bed with embedded grooves
- Molecules are attached to the surface and straightened within the grooves
- Restriction enzymes are added in solution
- DNA is digested, and cuts are formed by the shrinking ends
- DNA is fluorescently dyed and the chip is photographed

DNA Imaging
[Figure: imaged DNA molecules. Labels: Lambda DNA, individual fragments, cuts, estimated sizes of fragments.]

Optical Mapping: Data
- Each optical map is represented by an array of DNA fragment sizes in the order they appear on the imaged DNA molecule.
- Individual maps correspond to different DNA molecules of length 0.3-1.5 Mb.
- Each number in the map corresponds to the size of a restriction fragment (in Kb) on the molecule.
- Order information of restriction fragments is preserved within each map.

Map #1: 10.23 54.32 32.43 12.43 9.54 0.45 3.98 2.76 3.45 19.23 27.81 92.12 0.65 4.22
Map #2: 23.12 68.42 28.12 15.43 12.92 32.90 0.34 0.78 5.43 54.22 29.69 27.12 14.23 13.08 0.54 12.35 22.19 1.34 ...
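As a minimal sketch of this representation, the snippet below reads one map line into a name and an ordered list of fragment sizes. The function name `parse_optical_map` and the `Map #k: ...` line layout are assumptions taken from the example above, not a file format defined by the talk.

```python
def parse_optical_map(line: str) -> tuple[str, list[float]]:
    """Split 'Map #1: 10.23 54.32 ...' into a name and ordered fragment sizes (Kb)."""
    name, _, sizes = line.partition(":")
    # Order matters: fragment i is physically adjacent to fragment i + 1.
    return name.strip(), [float(token) for token in sizes.split()]


name, fragments = parse_optical_map(
    "Map #1: 10.23 54.32 32.43 12.43 9.54 0.45 3.98 2.76 3.45 19.23 27.81 92.12 0.65 4.22"
)
total_kb = sum(fragments)  # length of the imaged molecule in Kb
```

Keeping the sizes in a plain ordered list is enough for everything that follows, because overlap detection and alignment only use fragment sizes and their order.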

Optical Mapping: Errors
- Sizing errors (sizes of individual restriction fragments are measured with errors)
- Missing cuts (due to underdigestion)
- False cuts (random DNA breaks)
- Missing fragments (unable to attach to the surface)
- Chimeras (due to concatenation of maps during imaging)

Optical Mapping: Pros and Cons
- Pros:
  - No cloning, no amplification, hence no PCR-related errors
  - Deep coverage (~100x and more)
  - Reads span very large portions of chromosomes (up to ~4 Mb)
- Cons:
  - Resolution at the restriction site level
  - Maps contain many errors

Optical Mapping: Goals
- Assembly of restriction maps for target organisms (before sequencing)
- Variation studies (cancer analysis)
- Mapping of methylation patterns
- Mapping of transcription factor binding sites

Map Making
- We are confronted with many relatively short, somewhat inaccurate maps and want to piece together a genome map
- Mishra et al. approached the problem with a sophisticated statistical sampling model
- We try another, quite simple approach

Overview: Assembly for Sequences (Overlap-Layout-Consensus Paradigm)
- Genomic region: cloning and sequencing produce piles of sequence reads (~600 bp each)
- Overlap: physical overlaps between reads are captured by means of filtration (reads in the diagram: GTTGA, ATGATCC)
- Layout: overlapping sequence reads are put together to produce the scaffold of the reference genomic region
- Consensus: the consensus is inferred by means of multiple sequence alignment, Euler assembler, etc.

Assembly for Optical Maps: Overlap-Layout-Consensus
- Overlap:
  - Mutual overlaps are detected by finding similar size patterns
  - Filtration significantly speeds up the computation of overlaps
  - Overlaps are computed according to our new probabilistic score
- Layout is produced similarly to sequence layout
- Consensus is inferred by refinement of the layout (HMM)

Assembly for Optical Maps: Overlap Detection
- Huge number of false positive overlaps
- False negatives (missing overlaps) are not a problem for layout construction
- There are many optical maps, hence all pairwise overlaps are expensive to calculate (n(n-1) overlaps for n optical maps)
- Filtration is needed to speed up the search for overlaps

Assembly for Optical Maps: Filtration
- Filtration is used to find:
  - Potential overlaps of optical maps
  - Possible fit locations against the reference
- Filtration is based on finding matching tuples of fragments for optical maps:
  - Matching tuples are calculated to form matching diagonal stretches in the alignment matrix
  - Matching diagonal stretches in the alignment matrix are chained to find alignments and calculate the score (FASTA idea)
  - Full dynamic programming is applied to candidate overlaps to calculate the overlap
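The first filtration step can be sketched as follows. This is an illustrative simplification, not the filter used in the talk: the names `matching_tuples` and `seeds_by_diagonal`, the tuple length `k`, and the 15% relative tolerance are all assumptions, and the sketch handles sizing error only (a missing or false cut inside a tuple breaks the match).

```python
from collections import defaultdict


def matching_tuples(a, b, k=3, rel_tol=0.15):
    """Return (i, j) pairs where a[i:i+k] and b[j:j+k] agree fragment by fragment."""
    seeds = []
    for i in range(len(a) - k + 1):
        for j in range(len(b) - k + 1):
            # Two fragments "match" if their sizes differ by at most rel_tol
            # of the larger one (a crude stand-in for the sizing error model).
            if all(abs(x - y) <= rel_tol * max(x, y)
                   for x, y in zip(a[i:i + k], b[j:j + k])):
                seeds.append((i, j))
    return seeds


def seeds_by_diagonal(seeds):
    """Group seeds by diagonal i - j of the alignment matrix."""
    diagonals = defaultdict(list)
    for i, j in seeds:
        diagonals[i - j].append((i, j))
    return dict(diagonals)


map_a = [10.0, 54.0, 32.0, 12.0, 9.5, 20.0]
map_b = [31.0, 12.5, 9.0, 21.0, 40.0]
seeds = matching_tuples(map_a, map_b)
diagonals = seeds_by_diagonal(seeds)
```

Here the two seeds found fall on the same diagonal, which is the kind of diagonal stretch that would be chained and then passed to full dynamic programming. The brute-force double loop is only for clarity; a real filter would index tuples rather than compare every pair.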

Filtration continued
- In sequences, overlapping reads are expected to have several matching 20-tuples
- In optical mapping, filtration is challenging because of the sizing error and the presence of missing/false cuts

Assembly for Optical Maps: Why Things are Hard
- Consider a human-size genome (3,000,000 Kbp)
- The average restriction fragment (rf) size is 30 Kb (8-cutter), hence 100 K restriction fragments in one genome
- With maps of 33 rf (1 Mbp) there are:
  - 1x coverage: 3 K maps
  - 100x coverage: 300 K maps
  - [count lost in extraction] pair-wise overlaps
- To calculate all pair-wise overlaps at the rate of 5 overlaps per second ([value lost in extraction] overlaps per hour):
  - [value lost in extraction] computer hours, or
  - 4.5 years on a 128-node cluster like hto-g
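The arithmetic behind this estimate can be rechecked with a few lines. This is a back-of-the-envelope sketch, assuming that pairs are counted as n(n-1), as on the overlap-detection slide, and that the work divides evenly over the cluster nodes; the variable names are illustrative.

```python
genome_kb = 3_000_000        # human-size genome, in Kb
avg_fragment_kb = 30         # average restriction fragment size (8-cutter)
n_maps = 300_000             # maps at 100x coverage, rounded as on the slide
rate_per_second = 5          # overlaps computed per second on one node
cluster_nodes = 128

fragments_in_genome = genome_kb // avg_fragment_kb
pairwise_overlaps = n_maps * (n_maps - 1)          # n(n-1) count
overlaps_per_hour = rate_per_second * 3600
cpu_hours = pairwise_overlaps / overlaps_per_hour
cluster_years = cpu_hours / cluster_nodes / (24 * 365)

print(f"{fragments_in_genome:,} fragments, {pairwise_overlaps:.2e} overlaps")
print(f"{cpu_hours:.2e} CPU hours, {cluster_years:.2f} years on {cluster_nodes} nodes")
```

Under these assumptions the result lands close to the 4.5 years quoted above, which suggests the slide used the same n(n-1) count; counting unordered pairs instead would halve the figure.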

Alignment Score: Problem Description
- Account for features specific to optical mapping:
  - Sizing error distribution
  - False cut distribution
  - Missing cut distribution
- Design a score as -log(LR) for testing true matching vs. random matching:
  - True match assumes direct dependence between maps
  - Random match assumes independence between maps
- The optimal alignment has the lowest LR-test value (maximum score)

Previous work on the subject
- Heuristic alignment score and DP for restriction map alignments (Waterman et al., 1984)
- Alignment score and DP for restriction maps with local rearrangements (Huang et al., 1992)
- Extensive Bayesian models for map assemblies (Anantharaman et al., 1997)

Optical Mapping: Calculation of Alignments
- Alignments are computed using the standard DP algorithm for map comparison (due to Waterman et al., 1984)
- Time complexity: [formula lost in extraction], expressed in terms of the size of the reference map and the size of the optical map; it can be approximated by a restricted version
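To make the idea of a restricted DP concrete, here is a small sketch of a global map alignment in which each matched region may merge at most `delta` consecutive fragments on either side. It is not the scoring scheme of the talk: the function name `align_maps`, the cost (a squared size difference scaled by `sigma2` times the reference size, plus a flat `cut_penalty` per unmatched cut) and all default values are assumptions chosen for illustration, and fragment sizes are assumed to be positive.

```python
import math


def align_maps(optical, reference, delta=3, sigma2=0.3, cut_penalty=2.0):
    """Minimum cost of a global alignment; lower means more similar maps."""
    m, n = len(optical), len(reference)
    # Prefix sums give the total size of any run of fragments in O(1).
    pa = [0.0]
    for size in optical:
        pa.append(pa[-1] + size)
    pb = [0.0]
    for size in reference:
        pb.append(pb[-1] + size)

    dp = [[math.inf] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Last region merges p optical and q reference fragments.
            for p in range(1, min(delta, i) + 1):
                for q in range(1, min(delta, j) + 1):
                    prev = dp[i - p][j - q]
                    if prev == math.inf:
                        continue
                    sa = pa[i] - pa[i - p]
                    sb = pb[j] - pb[j - q]
                    sizing = (sa - sb) ** 2 / (sigma2 * sb)
                    cuts = cut_penalty * (p + q - 2)  # unmatched cuts inside the region
                    dp[i][j] = min(dp[i][j], prev + sizing + cuts)
    return dp[m][n]


identical = align_maps([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
one_missing_cut = align_maps([10.0, 50.0], [10.0, 20.0, 30.0])
```

This sketch runs in time proportional to m * n * delta^2 for maps of m and n fragments. Bounding the number of merged fragments is what keeps the inner loops small; without the bound, every cell would have to consider all earlier cells.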

Optical Mapping: Data Models
- Sizing errors (about 10-15% of fragment size)
  - Modeled as a normal r.v. for fragments longer than 4 Kb (CLT idea)
  - [Model lost in extraction] for fragments shorter than 4 Kb
- About 20% of cuts are missing (80% digestion)
  - Modeled as a Bernoulli r.v.
- False cuts occur at the rate of 5 per Mb
  - Modeled as a Poisson process with rate 0.005 (per Kb)
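The three error models can be combined into a small simulator that turns a reference restriction map into a noisy optical map. This is a sketch under stated assumptions: the function name `simulate_optical_map` is hypothetical, the sizing variance is taken as `sigma2 * size` with an illustrative `sigma2 = 0.4`, and the same normal model is applied to all fragments because the model for fragments shorter than 4 Kb is not recoverable from the slide.

```python
import math
import random


def simulate_optical_map(reference, rng, digestion=0.8,
                         false_cut_rate=0.005, sigma2=0.4):
    """Apply missing cuts, false cuts and sizing error to reference fragments (Kb)."""
    total = sum(reference)
    # Positions of the true interior cut sites along the molecule.
    true_cuts = []
    pos = 0.0
    for size in reference[:-1]:
        pos += size
        true_cuts.append(pos)

    # Missing cuts: each true site is cut independently (Bernoulli).
    cuts = [c for c in true_cuts if rng.random() < digestion]

    # False cuts: Poisson process, generated via exponential gaps.
    if false_cut_rate > 0:
        pos = rng.expovariate(false_cut_rate)
        while pos < total:
            cuts.append(pos)
            pos += rng.expovariate(false_cut_rate)

    # Sizing error: normal noise with variance proportional to size.
    cuts.sort()
    bounds = [0.0] + cuts + [total]
    observed = []
    for left, right in zip(bounds, bounds[1:]):
        size = right - left
        noisy = rng.gauss(size, math.sqrt(sigma2 * size))
        observed.append(max(noisy, 0.01))  # keep sizes positive (assumption)
    return observed


noisy_map = simulate_optical_map([30.0] * 20, random.Random(0))
clean_map = simulate_optical_map([10.0, 20.0, 30.0], random.Random(0),
                                 digestion=1.0, false_cut_rate=0.0, sigma2=0.0)
```

With all three error sources switched off, the simulator returns the reference fragments unchanged, which is a quick sanity check. Simulated maps of this kind are useful for testing an alignment score against a known true location.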

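Why normal error model?
[Diagram labels: fluorescent dye, DNA]
- Consider the number of photons captured from the i-th base
- The total registered fluorescence from the DNA fragment is the sum of these counts over its n DNA bases
- After applying the CLT, an unbiased measurement of the fragment length L is approximately normal with mean L and variance proportional to L, since L is proportional to n
- Hence the measurement error is approximately normal with mean zero and variance proportional to L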
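The CLT argument can be written out as follows. The slide's own formulas did not survive extraction, so the symbols here ($X_i$, $\mu$, $\tau^2$, $c$, $\sigma^2$) are assumptions introduced for illustration, and the photon counts are assumed independent and identically distributed.

```latex
% X_i: photons captured from the i-th base, i.i.d. with mean \mu and variance \tau^2
F = \sum_{i=1}^{n} X_i, \qquad
\mathbb{E}[F] = n\mu, \qquad
\operatorname{Var}(F) = n\tau^2
\quad\Longrightarrow\quad
F \;\approx\; \mathcal{N}\!\left(n\mu,\; n\tau^2\right) \quad \text{(CLT)}

% L is proportional to n: n = cL. An unbiased length estimate rescales F:
\hat{L} = \frac{F}{c\,\mu}, \qquad
\mathbb{E}[\hat{L}] = L, \qquad
\operatorname{Var}(\hat{L}) = \frac{n\tau^2}{c^2\mu^2} = \sigma^2 L,
\quad \sigma^2 = \frac{\tau^2}{c\,\mu^2}

% Hence the measurement error is approximately normal:
\hat{L} - L \;\approx\; \mathcal{N}\!\left(0,\; \sigma^2 L\right)
```

The point of the derivation is that the standard deviation grows like the square root of the fragment length, so long fragments are measured with smaller relative error than short ones.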

Testing the Error Model: Scatter Plot and Histogram
[Figures: scatter plot and histogram; plotted quantities lost in extraction.] Data collected from 10-mers.

Error Model: qqnorm
[Figure: normal Q-Q plot.] Data collected from 10-mers.

Alignment Score: Key Idea
- Define two competing hypotheses:
  - under the first, the two maps are independent (have no similarity)
  - under the second, the two maps are related (e.g. the optical map comes from the genomic region)
- Write the likelihood ratio (LR) of the data under the two hypotheses

Alignment Score: Key Idea
- Define the alignment score as -log(LR) to make it additive

Two Alignment Types: Fit and Overlap
- Fit alignment: to find genomic regions of origin for optical maps
  [Diagram labels: reference restriction map, aligned pairs of sites, sizing errors, missed cut, false cut]
- Overlap alignment: to detect overlaps between optical maps
  [Diagram labels: optical maps, aligned pairs of sites]

Optical Mapping: Alignment Scores
- Matching regions: R_1, R_2, ..., R_d
- Score = score(R_1) + score(R_2) + ... + score(R_d)
- The score of a matching region is composed of two parts: a score for the sizing error and a score for extra/missing cut sites
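The additive structure can be illustrated with the sizing part alone. This is a hedged sketch rather than the score from the talk: `sizing_score` and `alignment_score` are hypothetical names, the related-maps likelihood is taken as normal with variance `sigma2 * y`, the independent-maps likelihood is an assumed exponential fragment-size distribution with mean `mean_fragment`, and the cut-site part of the score is left out.

```python
import math


def sizing_score(x, y, sigma2=0.4, mean_fragment=30.0):
    """Log-likelihood ratio (related vs. independent) for observed size x, reference size y."""
    # Related maps: x is a noisy measurement of y.
    log_related = (-0.5 * math.log(2 * math.pi * sigma2 * y)
                   - (x - y) ** 2 / (2 * sigma2 * y))
    # Independent maps: x is just a random fragment size (assumed exponential).
    log_independent = -math.log(mean_fragment) - x / mean_fragment
    return log_related - log_independent


def alignment_score(regions):
    """Sum of per-region scores; regions is a list of (observed, reference) sizes in Kb."""
    return sum(sizing_score(x, y) for x, y in regions)


good = sizing_score(30.0, 30.0)   # sizes agree
bad = sizing_score(45.0, 30.0)    # sizes disagree strongly
total = alignment_score([(30.0, 30.0), (45.0, 30.0)])
```

Because the score is a sum of logarithms, a dynamic program can build the best alignment region by region. A region whose sizes agree contributes a positive term, and a badly mismatched region contributes a negative one.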

Some Mathematical Facts

Fit Alignment Score

Overlap Alignment Score

Example of Fit Alignment

Comparison of two alignment scores
- M1: our alignment score
- M2: alignment score due to Waterman et al., 1984
- P-values are consistently smaller for our new score

Comparison of two alignment scores
- Generate a map from a 40 Mb region of HS 13
- Verify that the optimal score places the map at the correct genomic location
- Examine the 19 next-best-scoring alignments
- Study how sparsely populated the neighborhoods of the optimal alignments are under M1 and M2 (using the std of the optimal score)
- For our new score (M1): neighborhoods of optimal scores are very sparsely occupied
- For the old score (M2): neighborhoods of optimal scores are densely occupied

Tumor study: analysis of variations
- Variations to find:
  - indels (5 Kb or more)
  - extra or missing restriction cut sites (EC or MC)
- Variations are relative to the published human DNA sequence (build 35)
- Data:
  - human hydatidiform mole (haploid, at 12x)
  - lymphoblastoid control (diploid white blood cells, at 8x)

Selected variations
- Selected by p-value (< 0.05)
- Discovered:
  - Mole (haploid tumor), 12x, 93% coverage: 728 indels (> 5 Kb), 394 EC, 489 MC
  - Lymphoblastoid (normal white blood cell), 8x, 63% coverage: 131 indels (> 5 Kb), 491 EC, 609 MC

Mole: indels
- 501 out of 728 indels are 5-10 Kb deletions

Control: Lymphoblastoid indels:

Why such a difference?
- Hypothesis: L1 LINE elements (6-8 Kb retrotransposons):
  - pop out in the mole
  - stay in place in normal cells
- Hypothesis: EC and MC are due to SNPs at the restriction sites

Our Research Group:
- Michael Waterman (USC)
- Lei Li (USC)
- Yi Yang (USC)
- Yu-Chi Liu (USC)
- Yu Zhang (Harvard)
- Anton Valouev (USC)
Many, many thanks to David Schwartz and his lab (U. Wisconsin)