Genome Sequencing and Annotation (Part 1)
Objective of most genome projects • Sequencing (DNA, mRNA) • Identify genes • Characterize gene features This chapter • How blocks of DNA seq. are obtained • How these blocks are assembled into contigs and then genomes • Bioinformatics: how to do seq. alignment of, for example, cDNA/EST and genome seqs. • Annotation of ORFs • Other features of genes: repetitive elements, variable distribution of GC content, evolutionarily conserved elements • Gene annotation by cross-species comparison
Figure 2.1 (Part 2) The principle of dideoxy (Sanger) sequencing Automated DNA sequencing • In 1977, F. Sanger developed the chain-termination method (Sanger sequencing) • Sanger won his second Nobel Prize for inventing this process
Automated DNA sequencing • Most current sequencing projects use the chain termination method – Also known as Sanger sequencing, after its inventor • Based on action of DNA polymerase – Adds nucleotides to complementary strand • Requires template DNA and primer
Chain-termination sequencing • Dideoxynucleotides (ddA, ddT, ddC or ddG) stop synthesis – Chain terminators (once one is incorporated, DNA polymerase cannot add another nucleotide) • Included in limiting amounts, so that across the population of molecules synthesis terminates at every position where the base appears in the template • Use four reactions – One for each base: A, C, G, and T • Template: 3’ ATCGGTGCATAGCTTGT 5’ • Sequence reaction products from the ddA reaction (* marks the terminating ddA): 5’ TAGCCACGTATCGAACA* 3’, 5’ TAGCCACGTATCGA* 3’, 5’ TAGCCACGTA* 3’, 5’ TAGCCA* 3’, 5’ TA* 3’
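The template and products on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, the primer is ignored, and every occurrence of the chosen base is assumed to terminate some fraction of the chains. Under that assumption the ddA reaction gives six products; the slide lists five of them, and the A at position 15 gives the sixth (TAGCCACGTATCGAA*).

```python
# Toy model of one chain-termination (Sanger) reaction.
# Assumption: the template is written 3'->5', so the new strand can be
# written 5'->3' by complementing base by base, left to right.

PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(template_3to5):
    """Return the newly synthesized strand (5'->3') for a 3'->5' template."""
    return "".join(PAIRS[base] for base in template_3to5)


def terminated_products(template_3to5, dd_base):
    """Return every product that ends at an occurrence of dd_base."""
    new_strand = complement_strand(template_3to5)
    return [
        new_strand[: i + 1]
        for i, base in enumerate(new_strand)
        if base == dd_base
    ]


template = "ATCGGTGCATAGCTTGT"  # 3'->5', as on the slide
for product in terminated_products(template, "A"):
    print("5' " + product + "* 3'")
```

Running the same function with "C", "G" and "T" gives the other three reactions; sorting all products by length reads off the new strand one base at a time, which is what the gel does physically.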
Sequence detection • To detect the products of the sequencing reaction, include labeled nucleotides • Formerly, radioactive labels (33P or 35S) were used • Now fluorescent labels are used • Use a different fluorescent tag for each nucleotide • Can run all four reactions in a single gel lane or capillary tube • Example products, each ending in a differently tagged base: TAGCCACGTATCGAA*, TAGCCACGTATC*, TAGCCACGT*
Sequence separation • Terminated chains need to be separated • Requires one-base-pair resolution – See difference between chains of X and X+1 base pairs • Gel electrophoresis – Very thin gel – High voltage applied – Works with radioactive or fluorescent labels – Negative pole at the top, positive pole at the bottom (lanes in the figure: C, A, G, T)
Sequence reading of radioactively labeled reactions • The final step of sequencing is to read the sequence • Radioactively labeled reactions – Gel dried – Placed on X-ray film – Film developed; the position of each band becomes visible • Sequence read from the bottom up (starting at the positive pole) • Each of the four lanes gives the positions of a different base: A, T, C or G
Sequence reading of fluorescently labeled reactions • Fluorescently labeled fragments are scanned by a laser as they pass a fixed point • The color is picked up by a detector • Output is sent directly to a computer • The readout is given both as bases and as the intensity of each color, so that ambiguous readings are easily identified
Summary of chain termination sequencing • A primer is extended by DNA polymerase according to the sequence of the template strand. • The chain is terminated by incorporation of a ddNTP that is complementary to the template base. • The four reactions are separated on a gel that can resolve one-base differences. • The seq. is then read from the bottom of the gel to the top.
High-Throughput Sequencing The new techniques and equipment include: (1) Four-color fluorescent dyes have replaced the radioactive label (2) Rather than stopping the electrophoresis at a particular time, the products are scanned for laser-induced fluorescence just before they run off the end of the electrophoresis medium (3) Improvements in the chemistry of template purification and the sequencing reaction (4) Slab gel electrophoresis gave way to capillary electrophoresis with the introduction in 1999 of the Applied Biosystems ABI Prism 3700 automated sequencers, which in turn were superseded by ABI Prism 3730 DNA analyzers in 2003 (these deliver extremely high quality, long reads and save time and money) Figure: ABI Prism 3730 DNA analyzers
Reading sequence traces • Base-calling: the reading of raw sequence traces • Now routinely performed by automated software that reads bases, aligns similar seqs., and supports editing • Program: phred, http://www.phrap.org • The program assigns a probability score to the accuracy of each base call as the trace is read
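The probability scores are conventionally reported as Phred quality values, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below only shows that conversion; the helper names and the example quality values are hypothetical and are not part of the phred program itself.

```python
import math


def phred_quality(p_error):
    """Convert an error probability into a Phred quality value Q."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Convert a Phred quality value Q back into an error probability."""
    return 10 ** (-q / 10)


def count_confident_calls(qualities, threshold=20):
    """Count base calls whose quality is at least `threshold`."""
    return sum(1 for q in qualities if q >= threshold)


# Hypothetical per-base qualities for a short read: noisy start, good middle.
example_qualities = [8, 12, 25, 40, 41, 38, 19, 30]
print(error_probability(20))  # 1 error in 100 calls
print(count_confident_calls(example_qualities))
```

A threshold of Q20 (1% error) is a commonly used cutoff, but the right value depends on the project.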
Figure 2.3 Automated sequence chromatograms (A) This seq. shows the ‘noisiness’ of the first 30 bp of a run. (B) The middle two rows show a segment of two seqs. that are polymorphic for both SNPs and an indel. (C) A decline in seq. quality typically occurs after about 800 bp.
Exercise 2.1 Reading a sequence trace • A base labeled N is due to poor seq. quality • Where two peaks of the same height are observed at the same location, the site is heterozygous for a C/T SNP.
Contig Assembly Figure 2.5 An aligned-reads window in consed
Assembling DNA seq. fragments • NCBI dbEST database http://www.ncbi.nlm.nih.gov/Database/ • View the EST statistics • FTP EST files
Assembling DNA seq. fragments • IFOM assembler http://bio.ifom-firc.it/ASSEMBLY/assemble.html • Assembles multiple EST seqs. into a contig • The max. number of seqs. you can enter is 10000 • Use gi (15744427, 19124086, 8147732, 8147734, 20393914, 13728017), of length (850, 1062, 634, 596, 869, 768) bp • Result: a single contig consensus seq., which can be used for a similarity search against a database
Assembling DNA seq. fragments – 6 GI fragments
>gi|15744427|gb|BI752849.1|BI752849 603022060F1 NIH_MGC_114 Homo sapiens cDNA clone IMAGE:5192510 5', mRNA sequence.
CGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGCGGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGACGTGGAACTCAGCAGCGGAG
GCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGG
GGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAGGA
AGAAGACAGCAACATGAAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCC
TTGATTCAGGTGTTCGTAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCT
GCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCTACTGCAGAAGTGCAACGTGCCCTTGTCCAACATGATGCTGCCAACTGTA
CCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTG
CCATGGGAGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGATGTGGACTCAAAGCCCT
>gi|19124086|gb|BM807263.1|BM807263 AGENCOURT_6574903 NIH_MGC_124 Homo sapiens cDNA clone IMAGE:5732238 5', mRNA sequence.
GTCCGGAATTCCCGGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTCGT
GCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGT
CCTCGCAGTCCCCGGCAGATCTTGAAGAAGGAAGAAGACAGCAACATGAAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTG
GTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGAC
GAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCCTACTGCAGAAGTGC
AACGTGCCCTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTAT
ACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTG
GAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGAGTTATCAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGCGGGGAACAGGGTCCTG
AAACCTGACCATTTTGCCCCAGACCTTGACCAATCCACCTCATGGCGATTCTCCAGGAATTCCCCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAA
CTTGGAATTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT
>gi|8147732|gb|AW958049.1|AW958049 EST370119 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence.
GAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTGGAGCTGGCCTACTGCAGAAGTGCAACGTGCCCTTGTCCAACATGATGCTGCCCAAC
TGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACT
GTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAG
CCCAGGGAGTTATCAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCT
CATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAAGCGAGAATCTTGTGAAGC
TGAAGAACAGTCTGGAAGGCAAGATGAGCTTTTTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA
>gi|8147734|gb|AW958051.1|AW958051 EST370121 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence.
GGAGCTGGCCTACTGCAGAAGTGCAACGTGCCCTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTG
ATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTC
CCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAGTGCTTCTGTGAGAAC
TGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAGATCTG
CCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCATATGAGCTTTTTGCTGTGAT
TGCGCACCTGGGAATGCAAAACTCCGTCATTACTG
>gi|20393914|gb|BQ213074.1|BQ213074 AGENCOURT_7559959 NIH_MGC_72 Homo sapiens cDNA clone IMAGE:6055692 5', mRNA sequence.
AGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAG
CTTTTTGCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTACATCCGGAATGCTGTGGAAAATGGTTCTGCTTCAATGACTCCAATATTTGCTTG
GTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCT
TCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTA
TAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCT
CGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGGAGGCATCTGGGGGGCCAAGGGCAGTGG
CAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGAAGAAACACCATTAATTTCCTAATGAATC
CAAGTGGTTTGTAACTTGCCTATTCCTTTTATTCCAGCAAAAAATTGATCATCCCCCAAAAAATAGGGG
>gi|13728017|gb|BG206330.1|BG206330 RST25778 Athersys RAGE Library Homo sapiens cDNA, mRNA sequence.
TCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAA
CCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTG
CTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGGAGGCATCTGGGGGCCAAAGGTCAGT
GGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAACCAACATTAATTCCATATGAATCA
AGTGTTTTGGAACTGCTATTCATTTATTCAGCAAATATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAACACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTT
TTTCACAAAATTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAATTTAGAACCCGTTCCTGACGCGGGGGN
Assembling DNA seq. fragments List of assembled fragments
Assembling DNA seq. fragments Overlap details
Assembling DNA seq. fragments End of overlap details; assembled mRNA sequence
Box 2.1 Pairwise Sequence Alignment • The most important class of bioinformatics tools: pairwise alignment of DNA and protein seqs. • Example with Seq. 1 = ACGCTGA and Seq. 2 = ACTGT:
alignment 1
Seq. 1 ACGCTGA
Seq. 2 A--CTGT
alignment 2
Seq. 1 ACGCTGA
Seq. 2 ACTGT--
• Seeks alignments with high seq. identity and few mismatches and gaps • Assumption: the observed identity in the seqs. to be aligned is the result of either chance or a shared evolutionary origin • Identity ≠ similarity • Treating sequence identity as homology is a risky assumption: sequence identity ≠ homology
Box 2.1 Pairwise Sequence Alignment • The same true alignment can arise through different evolutionary events • Scoring scheme: substitution -1, indel -5, match 3 • Scores of the example alignments in the figure: 9, 5, 4, 4 Figure A Common evolutionary events and their effects on alignment
Box 2.1 Pairwise Sequence Alignment • Find the optimal score: it gives the best guess for the true alignment • To find the optimal pairwise alignment of two seqs., insert gaps into one or both of them so as to maximize the total alignment score • Dynamic programming (DP): Needleman and Wunsch (1970), Smith and Waterman (1981); the algorithm is guaranteed to find the optimal alignments of two seqs. of lengths m and n • BLAST builds on DP, using heuristics to improve speed • Prof. Waterman http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html
Box 2.1 Pairwise Sequence Alignment The score for alignment of i residues of sequence 1 against j residues of sequence 2 is given by
S(i, j) = max { S(i-1, j-1) + c(i, j), S(i-1, j) + c(i, -), S(i, j-1) + c(-, j) }
with S(0, 0) = 0, S(i, 0) = -5i and S(0, j) = -5j, where c(i, j) = the score for alignment of residues i and j, which takes the value 3 for a match or -1 for a mismatch, and c(i, -) = c(-, j) = the penalty for aligning a residue with a gap, which takes the value -5
Box 2.1 Pairwise Sequence Alignment • The entry for S(1, 1) is the maximum of the following three events: • S(0, 0) + c(A, A) = 0 + 3 = 3 [c(A, A) = c(1, 1)] • S(0, 1) + c(A, -) = -5 + (-5) = -10 [c(A, -) = c(1, -)] • S(1, 0) + c(-, A) = -5 + (-5) = -10 [c(-, A) = c(-, 1)] • Similarly, one finds S(2, 1) as the maximum of three values: (-5) - 1 = -6; 3 - 5 = -2; and (-10) - 5 = -15. The best entry is the addition of the C indel to the A-A match, for a score of -2 (see next page).
Box 2.1 Pairwise Sequence Alignment The alignment matrix of sequences 1 and 2
S(2, 1) = max { S(1, 0) + c(2, 1), S(1, 1) + c(2, -), S(2, 0) + c(-, 1) }
= max { S(1, 0) + c(C, A), S(1, 1) + c(C, -), S(2, 0) + c(-, A) }
= max { -5 - 1, 3 - 5, -10 - 5 } = -2
Box 2.1 Pairwise Sequence Alignment • Traceback determines the actual alignment • Start from the final cell of the matrix, the (7, 5) cell (top right-hand corner in the figure) • For example, the 1 in the (7, 5) cell could only be reached by the addition of the mismatch A-T • Resulting alignments (4 matches, 1 mismatch, 2 indels):
ACGCTGA
A--CTGT
or
ACGCTGA
AC--TGT
• Ambiguity: it has to do with which C in seq. 1 aligns with the C in seq. 2
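The matrix fill and traceback described in this box can be sketched in Python as follows. This is a minimal global-alignment (Needleman-Wunsch style) example using the lecture's scores (match 3, mismatch -1, gap -5); the function name and the tie-breaking order in the traceback (diagonal first, then a gap in sequence 2) are choices made for this example, so it returns only one of the equally good alignments.

```python
def needleman_wunsch(seq1, seq2, match=3, mismatch=-1, gap=-5):
    """Return (score, aligned_seq1, aligned_seq2) for one optimal alignment."""
    n, m = len(seq1), len(seq2)

    # S[i][j] = best score for the first i residues of seq1
    # aligned against the first j residues of seq2.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + c,   # align residue i with residue j
                S[i - 1][j] + gap,     # residue i against a gap
                S[i][j - 1] + gap,     # residue j against a gap
            )

    # Traceback from the final cell (n, m) back to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + c:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1

    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a1, a2 = needleman_wunsch("ACGCTGA", "ACTGT")
print(score)
print(a1)
print(a2)
```

For the two sequences in this box the final cell holds a score of 1, in agreement with the (7, 5) entry discussed above. Reporting all co-optimal alignments would require following every tied branch during traceback.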
Box 2.1 Pairwise Sequence Alignment Parameter settings: gap penalties • Default settings are the easiest to use, but they do not necessarily yield the correct alignment • Constant penalty: independent of the length of the gap, A • Proportional penalty: proportional to the length L of the gap, BL (this is what we used in this lecture) • Affine gap penalty: gap-opening penalty + gap-extension penalty = A + BL • There is no rule for predicting the penalty that best suits the alignment • Optimal penalties vary from seq. to seq.; it is a matter of trial and error • Usually A > B, because opening a gap is rarer than extending one (usually A/B ~ 10) • Hints: (1) when comparing distantly related seqs., a high A and a very low B often give the best results, so that gaps are penalized more for their existence than for their length; (2) when comparing closely related seqs., penalize both the existence and the extension of gaps
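The three gap-penalty schemes listed above can be written as small functions of the gap length L. The values A = 10 and B = 1 are illustrative assumptions chosen only to match the rule of thumb A/B ~ 10; they are not recommended settings. Penalties are returned as positive costs to be subtracted from the alignment score.

```python
def constant_penalty(length, A=10):
    """Same cost for any gap, regardless of its length."""
    return A if length > 0 else 0


def proportional_penalty(length, B=1):
    """Cost grows linearly with gap length: B * L."""
    return B * length


def affine_penalty(length, A=10, B=1):
    """Gap-opening cost A plus extension cost B * L, as on the slide."""
    return A + B * length if length > 0 else 0


for L in (1, 3, 10):
    print(L, constant_penalty(L), proportional_penalty(L), affine_penalty(L))
```

With an affine penalty one long gap (L = 10, cost 20) is much cheaper than ten separate one-base gaps (cost 110), which is why it tends to favor fewer, longer indels.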
Exercise 2.2 Computing an optimal sequence alignment • Two scoring schemes: (1) gap penalty = -5, mismatch = -1, match = 3; (2) gap penalty = -1, mismatch = -1, match = 3 • Under (1): first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-5) = 8 • Under (2): first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-1) = 16 • A more serious problem: a poorly chosen gap penalty can identify the wrong alignment as optimal
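The arithmetic in this exercise can be checked with a one-line scoring function. The function name is hypothetical; the counts (5 matches and 2 mismatches for the first alignment, 6 matches and 2 gaps for the second/third) are taken from the slide.

```python
def alignment_score(matches, mismatches, gaps, match=3, mismatch=-1, gap=-5):
    """Score an alignment from its counts of matches, mismatches and gaps."""
    return matches * match + mismatches * mismatch + gaps * gap


for gap in (-5, -1):
    first = alignment_score(5, 2, 0, gap=gap)
    second = alignment_score(6, 0, 2, gap=gap)
    best = "first" if first > second else "second/third"
    print(f"gap={gap}: first={first}, second/third={second}, best={best}")
```

Changing only the gap penalty flips which alignment is preferred, which is the point of the exercise.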
Exercise 2.2 Computing an optimal sequence alignment (figure: alignment matrices for gap penalty = -5 and gap penalty = -1)
Emerging Sequencing Methods Costs of genome sequencing • Mid-2000s: $30-50 million to sequence a mammalian genome • Target: $1000 per human genome by the year 2010 • J. Craig Venter Science Foundation: $500,000 award for the first to achieve this goal New technologies 1. Sequencing by hybridization (SBH): detects whether or not an exact match is present in a sample of DNA 2. Mass spectrometric techniques: ionized fragments, time of flight 3. Nanopore sequencing strategies: ultrafast and relatively inexpensive sequencing of long DNA fragments 4. Single-molecule approaches: Solexa, Visigen and Genovoxx 5. Single-molecule polony sequencing
Emerging Sequencing Methods • A dilute solution of DNA is plated onto a glass microscope slide. • In situ PCR produces thousands of tiny colonies of DNA (polonies = PCR colonies, i.e. clusters), into which single dye-labeled dNTPs are then incorporated. • The slide is read after each cycle of incorporation of a new base, allowing short seqs. to be determined. • Each numbered polony produces a short 20-25 nucleotide seq., as shown. • These can then be assembled computationally into a contiguous seq. Figure 2.6 Single-molecule polony sequencing
Genome Sequencing • Whole genome seqs. are assembled from ~10^5 fragments, each typically between 500 and 1000 bp in length. • Two general approaches for fragmentation and assembly: (1) hierarchical seq. (2) shotgun seq. • For a historical overview, see http://www.sciencemag.org/feature/plus/sfg/human/timeline1.shtml • (1) Hierarchical seq.: first develop a low-resolution physical map, so that the seq. is obtained in large ordered pieces. • (2) Shotgun seq.: break the genome into small fragments and use computer algorithms to assemble them, see Figure 2.7 • Most new genome projects adopt the shotgun approach. Figure 2.7 (Part 1) Hierarchical versus shotgun sequencing
Genome Sequencing – hierarchical sequencing • Top-down, map-based or clone-by-clone strategy, in use since the late 1980s • The genome is broken into small fragments • The relative locations of the fragments are known BEFORE sequencing • Advantages: (1) It fostered (helped develop) the assembly of high-resolution physical and genetic maps (2) It allowed groups around the globe to share the work • Technology for cloning large fragments of genomes progressed rapidly throughout the 1990s, with genomes such as E. coli, S. cerevisiae, C. elegans and A. thaliana • Top-down seq.: clone seqs. as manageable units of fragments (50 – 200 kb in length) • Cloning vectors: BAC (~300 kb), PAC (~100 kb), phage-derived cosmids
Genome Sequencing – Shotgun sequencing In the shotgun approach, no attempt is made to order the clones in advance. Instead, the whole genome is assembled using computer algorithms that order contigs based on their overlapping sequences. Figure 2.7 (Part 2) Shotgun sequencing
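To make "ordering by overlapping sequences" concrete, here is a toy greedy assembler in Python. It is a teaching sketch, not how production assemblers work: it assumes error-free reads from one strand, no repeats, and no read contained inside another, and the function names and the minimum overlap of 3 are arbitrary choices for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]
    return contigs


# Three overlapping "reads" given in arbitrary order.
reads = ["CACGTATC", "TATCGAACA", "TAGCCACG"]
print(greedy_assemble(reads))
```

Real genomes break the no-repeat assumption, which is why repeat screening and mate-pair information are needed in practice.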
Figure 2.8 Cloning vectors used in genome sequencing
Genome Sequencing – hierarchical sequencing DNA libraries • Genomic DNA is fragmented by restriction enzyme (RE) digestion or sonication (ultrasonic shearing) • Fragments are ligated into a multiple cloning site (MCS) in the vector • Aim for 5- to 10-fold redundancy: the total cloned DNA in the library should be 5 to 10 times the genome size • Each clone will have different ends, so it is possible to select a scaffold of clones that forms a contiguous seq. coverage, a tiling path (like laying tiles), by aligning the regions of overlap (Fig. 2.9) • The tiling path can be assembled using a combination of 3 methods: (1) hybridization, (2) fingerprinting, and (3) end-sequencing
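The redundancy target translates directly into a library size: number of clones = redundancy x genome size / insert size. The sketch below also estimates the fraction of the genome expected to be missed, under the added assumption (not stated on the slide) that clones start at uniformly random positions, so that coverage at a site is roughly Poisson and the uncovered fraction is about e^(-redundancy). The genome and insert sizes used are hypothetical.

```python
import math


def clones_needed(genome_bp, insert_bp, fold):
    """Number of clones whose inserts add up to `fold` x the genome size."""
    return math.ceil(fold * genome_bp / insert_bp)


def expected_uncovered_fraction(fold):
    """Poisson estimate of the fraction of the genome in no clone."""
    return math.exp(-fold)


genome_bp = 100_000_000   # hypothetical 100 Mb genome
insert_bp = 100_000       # hypothetical 100 kb BAC inserts
for fold in (5, 10):
    n = clones_needed(genome_bp, insert_bp, fold)
    missed = expected_uncovered_fraction(fold)
    print(f"{fold}x: {n} clones, ~{missed:.4%} of the genome uncovered")
```

Under this model, going from 5-fold to 10-fold redundancy reduces the expected uncovered fraction from under 1% to well under 0.01%, at the price of twice as many clones.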
Genome Sequencing – hierarchical sequencing • A minimal tiling path through a library of aligned BAC clones that ensures complete coverage of the chromosome is chosen. • After sequencing independent shotgun libraries for each BAC, small gaps in the sequenced clone contigs remain. • These are closed as far as possible by merging the two BAC sequences, as well as by the addition of mate-pair information (yellow) and cDNA structural information (red), which establishes the orientation and distance between cloned segments. Figure 2.9 Hierarchical assembly of a sequence-contig scaffold (supercontig)
Genome Sequencing – hierarchical sequencing Hybridization • All of the clones in a library that carry a particular seq. can be identified rapidly by hybridizing a small radioactively or chemically labeled probe containing the seq. to a filter on which an array of ~10000 clones is printed (Fig. 2.10 A) Fingerprinting • Study the restriction enzyme (RE) digestion patterns • Contigs of large-insert clones are assembled by comparing and aligning the clones according to their RE patterns • A RE with a 6-bp recognition site cuts on average once every 4^6 = 2^12 ~ 4000 bp • For a BAC, 100 kb / 4 kb ~ 20 – 30 fragments • These fragments can be separated by electrophoresis • Workflow: fingerprint profile from the gel, then BAC alignment by software, then alignment of overlapping clones, then assembly of contigs of ~Mb length
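The fingerprinting arithmetic and the idea of comparing band patterns can be sketched as follows. The cut-spacing estimate assumes a random sequence with equal base frequencies; the band sizes and the 20 bp matching tolerance are invented for illustration, and real fingerprinting software uses more careful statistics.

```python
def expected_cut_spacing(site_length, alphabet=4):
    """Average distance between sites for a recognition site of this length."""
    return alphabet ** site_length


def expected_fragments(clone_bp, site_length):
    """Approximate number of fragments from digesting one clone."""
    return clone_bp / expected_cut_spacing(site_length)


def shared_bands(bands_a, bands_b, tolerance=20):
    """Count bands in clone A that match some band in clone B within tolerance."""
    return sum(
        1 for x in bands_a if any(abs(x - y) <= tolerance for y in bands_b)
    )


print(expected_cut_spacing(6))                 # 4096 bp for a 6-bp cutter
print(round(expected_fragments(100_000, 6)))   # fragments per 100 kb BAC

# Hypothetical band sizes (bp) read from a gel for two clones.
clone1 = [1200, 3400, 5100, 800]
clone2 = [3410, 5090, 2200]
print(shared_bands(clone1, clone2))
```

Clones that share many bands are likely to overlap; clones that share none are placed in separate contigs.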
Genome Sequencing – hierarchical sequencing (A) A macroarray of BAC clones is probed with a short, radioactive fragment to identify all BACs that carry a specific fragment. (B) These clones are digested with a RE, end-labeled, and separated by gel electrophoresis. (C) Software converts the bands to a virtual profile, shown hypothetically for a small portion of four bands (highlighted box in part B). Shared bands (red or blue) imply that the two clones share the same seq. Green indicates the vector band common to all clones. (D) The fingerprint profile is then converted into a BAC alignment. In this example, clone 2 does not share any bands with the others and so is placed into a separate BAC contig, while the other three clones form a tiling path. Figure 2.10 Aligning BAC clones by hybridization and fingerprinting
Genome Sequencing – hierarchical sequencing End-sequencing • Fills in the gaps after fingerprinting • How? By sequencing both ends of the collection of BAC clones • Once a critical threshold of seqs. has been reached, overlaps can be found • For example, along a 10 Mbp genome, the end seqs. of 10,000 BAC clones provide a seq. tag every 5 kb (for 5-fold coverage): 10 Mbp / 10,000 BACs = 1 kbp per BAC; at five-fold redundancy, 10 Mb / 2,000 BACs ~ 5 kb (the seq. tag distance) • Given this tag density, it is possible to close gaps < 50 kb • Once the tiling path is chosen, shotgun the BAC clones into small fragments • Subcloning: use M13 phagemids (~1 kb inserts; exist as dsDNA and ssDNA) or clone 2 ~ 3 kb fragments into a plasmid vector
Genome Sequencing – Shotgun sequencing • Use computer algorithms to assemble the seqs. (~100,000) • About 5- to 10-fold redundancy for each fragment • Library: from a single whole genome • After repetitive seqs. are screened out, overlapping reads of the same seq. are aligned (MSA) to generate unitigs and scaffolds; >90% of the seqs. are assembled • Finishing phase: closing gaps and cleaning up ambiguities take as much time as the shotgun phase • Users are asked to trust the assemblies Celera Genomics used the following software to assemble the seqs. • Screener: masks (does not remove) seqs. that contain repetitive DNA (such as microsatellites, LINEs, Alu repeats, retrotransposons and ribosomal DNA) • Overlapper: compares every unscreened read against every other unscreened read, searching for overlaps of a predetermined length and identity. Parallel processing on 40 supercomputers, each with 4 GB RAM, allowed the 27 M screened human seq. reads to be overlapped in < 5 days • Unitigger: resolves repeat-induced overlaps of a seq. (see Figure 2.11) • Scaffolder: uses mate-pair information to link U-unitigs into scaffold contigs
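The idea of masking rather than removing repeats can be illustrated with a toy example. This is not the Celera Screener: it is a simple regular-expression sketch that soft-masks (lowercases) only microsatellite-like tandem repeats, defined here, as an assumption for the example, as a unit of 1 to 6 bases repeated at least five times in a row.

```python
import re

# A unit of 1-6 bases followed by at least four more copies of itself.
SIMPLE_REPEAT = re.compile(r"([ACGT]{1,6}?)\1{4,}")


def mask_simple_repeats(seq):
    """Lowercase simple tandem repeats; sequence length and order are kept."""
    return SIMPLE_REPEAT.sub(lambda m: m.group(0).lower(), seq)


read = "GGTT" + "CA" * 6 + "TTGA"   # hypothetical read with a (CA)6 repeat
masked = mask_simple_repeats(read)
print(masked)
```

Because the masked bases stay in place, an overlap step can ignore lowercase regions when seeding overlaps while the full read is still available for the final consensus.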
Genome Sequencing – Shotgun sequencing Figure 2.11 U-unitigs and repeat resolution (A) Seq. alignment between two or more shotgun clones can arise between unique seqs. (left) or repetitive seqs. (right). (B) The Overlapper aligns unitigs, which are identified as unique seq. alignments (U-unitigs) or overcollapsed repeats (blue). Two contigs can be aligned and oriented by using mate-pair seq. information from the ends of longer (10- or 50-kb) clones, as shown at the bottom, while mate-pairs from 2-kb fragments allow assembly of scaffolds despite the presence of simple repeats such as microsatellites (blue) that are masked before performing alignments.
Genome Sequencing – Shotgun sequencing Figure 2.12 shows the estimated coverage of the fly and human whole genomes after initial assembly: in both cases, 84% or more of the genome was covered by scaffolds at least 100 kb in length, while most scaffolds were in the Mb range. Raising seq. coverage from 5x to 10x increases the proportion of scaffolds of lengths up to 1 Mb by about 10%. The plot shows the percentage of scaffolds that have a length greater than that indicated for the fly 10x, human 8x (CSA) and human 5x (whole genome assembly, WGA) seqs. generated by Celera. The fly and CSA assemblies include shredded (cut into pieces) seqs. generated from BAC clones by public genome sequencing efforts. Figure 2.12 Proportion of fly and human genomes in large scaffolds
NCTS http://math.cts.nthu.edu.tw/Mathematics/conference-PT2005.html UCSD http://research.calit2.net/recombworkshop05/