Genome sequence assembly concepts and methods ShihJon Wang

Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008

OUTLINE • • • Assembly Process Overview Assembly algorithms Repeats Scaffolding Phred/Phrap/Consed Assembly pipelines

Assembly process overview

A Genome Sequencing Project

Building a Library • Break DNA into random fragments (810 x)

SHOTGUNs • • • Whole Genome Shotgun Bac-Bac Shotgun Size of inserts: --Bac insert: ~150 KB --Fosmid insert: ~30 KB --Normal insert: ~3 KB

Clone and scaffold (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7): 47 -54

Building a Library • Break DNA into random fragments (~10 x) -- Amplify the fragments in a vector -- Sequence 800 -1000 bases at each end

Assembling the fragments

Assembling the fragments • Break DNA into random fragments • Sequence the ends of the fragments • Assemble the sequenced ends

Forward-reverse constraints • The sequenced ends are facing towards each other • The distance between the two fragments is known

Building Scaffolds

Assembly Gaps --sequencing gap: know the order & orientation of the contigs and have at least one clone spanning the gap --physical gap: no information about adjacent contigs, nor about the DNA spanning the gap

Finishing the Project

Unifying View of Assembly

Assembly Algorithms

Assembly Methods • Overlap-layout-consensus – greedy (Phrap, CAP 3, TIGR. . . ) – graph-based (Euler)

Phrap/CAP 3 Greedy • Build a rough map of fragment overlaps • Pick the largest scoring overlap • Merge the two fragments • Repeat until no more merges can be done !!! IDEAL CASE !!!

Real World Problems • • • Sequencing errors Chimera Repeats Contaminants Polymorphism Orientation

Error Correction

Overlap b/w two sequences

All pairs alignment • Try all pairs – must consider ~ n^2 pairs • Smarter solution: only n x coverage (e. g. 8) pairs are possible – Build a table of k-mers contained in sequences (single pass through the genome) – Generate the pairs from k-mer table (single pass through k-mer table)

Repeats

Repeat sequence The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7): 47 -54

重覆序列 ■重覆頻率分 Interspersed repeats Short interspersed element (SINE), eg. Alu <300 bp Long interspersed element (LINE), ca. 5 kb Tandem repeats Satellite DNA Minisat. & Variable number of tandem repeats Microsat. : mono-, di-, tri-, tetra-nucleotide ■重覆方向分同向重覆序列反向重覆序列

Repeat detection Pre-assembly: find fragments that belong to repeats • statistically (Reps) • repeat database (Repeat. Masker)

Statistical repeat detection • Significant deviations from average coverage flagged as repeats. - frequent k-mers are ignored - “arrival” rate of reads in contigs compared with theoretical value (e. g. , 800 bp reads & 8 x coverage - reads "arrive" every 100 bp) • Problem 1: assumption of uniform distribution of fragments - leads to false positives non-random libraries poor clonability regions • Problem 2: repeats with low copy number are missed

Scaffolding

Sequencing hierarchy • Random sequencing – unrelated reads ~700 pairs • Assembly – un-related contigs 5 K-10 K pairs • Scaffolding – unrelated scaffolds 30 K~ 50 K pairs • Finishing/gap closure – completed genomes millions-billions of base-pairs

Definition

Scaffolder output • order and orientation of contigs • size of gaps between contigs • linking evidence: mate-pairs spanning gaps

Clone-mates

Linking information

Hierarchical scaffolding

Ambiguous scaffold

Phred/Phrap/Consed Analysis

What is Phred/Phrap/Consed ? Phred/Phrap/Consed is a worldwide distributed package for: a. Trace file (chromatograms) reading; b. Quality (confidence) assignment to each individual base; c. Vector & repeat sequences identification and masking; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.

How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

Phred Genome Research 8: 175 -194

Phred is a program that performs several tasks: a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (Mega. BACE) and LI-COR. b. Calls bases – attributes a base for each identified peak with a lower errorrate than the staard base calling programs.

Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

File Directories • chromat_dir/ • edit_dir/ • phd_dir/

Trace File High quality region – no ambiguities (Ns)

Trace File Medium quality region – some ambiguities (Ns)

Trace File Poor quality region – low confidence

Phred value formula q = - 10 x log 10 (p) where q - quality value p - estimated probability error for a base call Examples: q = 20 means p = 10 -2 (1 error in 100 bases) q = 40 means p = 10 -4 (1 error in 10, 000 bases)

Base Calling • phred -id. -p -pd. . /phd_dir • phred -view pf 84 c 05. s 1

The structure of a phd file BEGIN_SEQUENCE 01 EBV 10201 A 02. g BEGIN_COMMENT CHROMAT_FILE: EBV 10201 A 02. g ABI_THUMBPRINT: PHRED_VERSION: 0. 990722. g CALL_METHOD: phred QUALITY_LEVELS: 99 TIME: Thu May 24 00: 18: 58 2001 TRACE_ARRAY_MIN_INDEX: 0 TRACE_ARRAY_MAX_INDEX: 12153 TRIM: CHEM: term DYE: big END_COMMENT BEGIN_DNA t 8 5 c 13 17 a 19 26 c 19 32 t a a a g c c t g g t g c c t a a t g 24 24 22 27 25 19 12 15 19 23 33 36 44 44 39 39 34 35 34 2221 2232 2245 2261 2272 2286 2302 2314 2324 2331 2346 2363 2378 2390 2404 2419 2433 2446 2460 2470 2482 t g t c g n c t t c c c t c g g a g g 16 8191 t 6 11908 19 8200 a 6 11921 13 8211 g 6 11927 13 8229 t 6 11947 4 8241 c 6 11953 4 8253 a 6 11964 4 8263 g 6 11981 10 8276 c 4 11994 9 8286 n 4 12015 c 4 12037 12 8301 n 4 12044 16 8313 n 4 12058 12 8329 n 4 12071 12 8336 n 4 12085 15 8343 n 4 12098 19 8356 n 4 12111 9 8371 n 4 12124 13 8386 c 4 12144 14 8397 n 4 12151 7 8417 END_DNA 9 8427 4 8445 END_SEQUENCE

phd 2 fasta • phd 2 fasta program – –converts. phdfiles to sequence in multifasta format – –writes. qualfile (quality scores) for each trace file – –phd 2 fasta -id. . /phd_dir -os CLONE. fasta oq CLONE. fasta. qual • Output: – –fasta. seqcontains fastasequences – –fasta. seq. qualcontains quality scores

Vector Sequence Cleaning (1) • DNA sequence cleaning: quality trimming and vector removal---Lucy: • Lucy Steps: – Read input seq#, seq info, and quality info – Chop off splice sites – Remove vector insert – Produce output seq for fragment assembly.

Vector Sequence Cleaning (2) • Restriction on file name: can’t contain any symbol eg. “–” “. “ “_” • Lucy major parameters to set up: -vector_complete. Seq splice_site_file (splice_site_file: 2 splice-site seq before and after the insertion point on the vector) • Lucy Output: – identified locations of good/clean region – trim seq without vector, linker, Ns (<3% Ns)

splice_site_file • ~ 150 bases, 50 bases overlap around splice • >PUCsplice. for. begin gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacg acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga • >PUCsplice. for. end acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga aattgttatccgctcacaattccacacaacatacgagccggaagcataaa • >PUCsplice. rev. begin tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatt tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt • >PUCsplice. rev. end tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt cgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

Cross_match

Cross_match • cross_match -minmatch 12 -penalty -2 minscore 20 -screen CLONE. fasta /net/share/sequence_pipeline/vector. fasta

Phrap -- Phragment Assembly Program (or Phil’s Revised Assembly Program) • Phrap is a program for assembling shotgun DNA sequence data • Key Features: • a. Uses the entire read content – no need for trimming. • b. User supplied (i. e. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats. • c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it’s not a consensus!

Phrap --Phrap is a program for assembling shotgun DNA sequence data • d. Provides extensive information about assembly – contained in phrap. out, *. ace and *. screen. contigs. qual files • e. Handles very large datasets – hundreds of thousands of reads are easily manipulated. • f. Generate output files – contain some important data and enable visualization by other programs

Banded Search

K-mers • >GL 1234. b 1 gattaagttgggtaacgccagggttttcccagtcac… gattaagttgggtaa ttaagttgggtaacg. . .

Phrap output files • *. contigs – fasta file containing the contigs – Contigs with more than one read – Singletons (single reads with a match to some other contig but that couldn’t be merged consistently to it) • *. singlets – fasta file of the singlet reads – Reads with no match to other read • *. ace – allows for viewing the assembly using Consed • *. view – required for viewing the assembly using Phrapview

Phrap parameters • phrap -new_ace CLONE. fasta. screen >outfile • OPTIONS DEFAULT FUNCTION • -penalty -2 ↑=>↑Stringency. -gap_init penalty -2. -gap_ext penalty -1 • -minmatch* 14 ↑=>↓time↓Matches • -bandwidth 14 ↓=>↓time↓String. • -minscore 30 ↑=>↑String. • *highly sensitive! bigger genomes bigger value

Phrap parameters • • OPTIONS DEFAULT FUNCTION -forcelevel 0~10 ↓=>↑String. -repeat_stringency 0. 95 0<x<1 ↑=>↑String. -force_high* ↑=>↑String. -revise_greedy** ↓Misassembly -shatter_greedy** ↓Contig. Length * Ignore edited high-quality discrepancies **break assembly at weak joins

Phrap parameters • • • OPTIONS DEFAULT -max_subclone_size 5000 -default_qual 15 -preassemble* -group_delim* _ *used together FUNCTION F. -R. check

Consed Genome Research 8: 195 -202, 1998

Consed A program for viewing and editing assemblies produced by Phrap Key Features: a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence. b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads.

Consed A program for viewing and assemblies produced by Phrap editing Key Features: c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc. d. Autofinish – automatic set of functions for: gap closure, improvement of sequence quality, determination of relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.

Phred/Phrap/Consed Pipeline Directories: Chromat_dir Phd_dir Edit_dir

Comparison of shotgun sequence data from Wolbachia genome Project Computer 35 (7): 47 -54

CAP 3 3 XA 6189 57 443 PHRAP 3 XA 6396 54 529 CAP 3 3 XB 12, 368 44 71 PHRAP 3 XB 13, 116 47 228 CAP 3 3 XC 10, 709 49 227 PHRAP 3 XC 11, 406 45 332 CAP 3 3 XD 11, 408 43 115 PHRAP 3 XD 11, 350 49 240 CAP 3 5 XA 10, 582 42 249 PHRAP 5 XA 18, 268 31 252 CAP 3 5 XB 26, 034 17 100 PHRAP 5 XB 33, 693 18 115 CAP 3 5 XC 20, 939 29 172 PHRAP 5 XC 20, 912 27 261 CAP 3 5 XD 14, 219 35 46 PHRAP 5 XD 14, 696 33 129

CAP 3 8 XA 71, 025 12 83 PHRAP 8 XA 71, 395 8 80 CAP 3 8 XB 53, 127 8 59 PHRAP 8 XB 53, 078 7 36 CAP 3 8 XC 52, 134 8 4 PHRAP 8 XC 76, 922 6 6 CAP 3 8 XD 72, 690 7 35 PHRAP 8 XD 102, 523 6 60 CAP 3 10 XA 91, 380 4 28 PHRAP 10 XA 91, 329 3 11 CAP 3 10 XB 167, 655 1 5 PHRAP 10 XB 138, 551 2 7 CAP 3 10 XC 106, 631 5 44 PHRAP 10 XC 77, 747 4 12 CAP 3 10 XD 79, 900 4 2 PHRAP 10 XD 79, 978 3 2

Softwares • CAP 3 (for EST): • • • http: //genome. cs. mtu. edu/cap 3. html Phrap (for large genome): http: //www. phrap. org --Similar algorithm --Insufficient documentation and support -- Always have to write scripts to parse outputs --NO PERFECT PROGRAM!!!

Questions? ? wangsj@yahoo. com