Genome sequence assembly concepts and methods ShihJon Wang

Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009

OUTLINE • • • Assembly Process Overview Assembly algorithms Repeats Scaffolding Phred/Phrap/Consed Assembly pipelines

Assembly process overview

A Genome Sequencing Project

Building a Library • Break DNA into random fragments (810 x)

SHOTGUNs • • • Whole Genome Shotgun Bac-Bac Shotgun Size of inserts: --Bac insert: ~150 KB --Fosmid insert: ~30 KB --Normal insert: ~3 KB

Clone and scaffold (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7): 47 -54

Building a Library • Break DNA into random fragments (~10 x) -- Amplify the fragments in a vector -- Sequence 800 -1000 bases at each end

Assembling the fragments

Assembling the fragments • Break DNA into random fragments • Sequence the ends of the fragments • Assemble the sequenced ends

Forward-reverse constraints • The sequenced ends are facing towards each other • The distance between the two fragments is known

Building Scaffolds

Assembly Gaps --sequencing gap: know the order & orientation of the contigs and have at least one clone spanning the gap --physical gap: no information about adjacent contigs, nor about the DNA spanning the gap

Finishing the Project

Unifying View of Assembly

Assembly Algorithms

Assembly Methods • Overlap-layout-consensus – greedy (Phrap, CAP 3, TIGR. . . ) – graph-based (Euler)

Phrap/CAP 3 Greedy • Build a rough map of fragment overlaps • Pick the largest scoring overlap • Merge the two fragments • Repeat until no more merges can be done !!! IDEAL CASE !!!

Real World Problems • • • Sequencing errors Chimera Repeats Contaminants Polymorphism Orientation

Error Correction

Overlap b/w two sequences

All pairs alignment • Try all pairs – must consider ~ n^2 pairs • Smarter solution: only n x coverage (e. g. 8) pairs are possible – Build a table of k-mers contained in sequences (single pass through the genome) – Generate the pairs from k-mer table (single pass through k-mer table)

Repeats

Repeat sequence The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7): 47 -54

重覆序列重覆頻率分 Interspersed repeats Short interspersed element (SINE), eg. Alu <300 bp Long interspersed element (LINE), ca. 5 kb Tandem repeats Satellite DNA Minisat. & Variable number of tandem repeats Microsat. : mono-, di-, tri-, tetra-nucleotide 重覆方向分同向重覆序列反向重覆序列

Repeat detection Pre-assembly: find fragments that belong to repeats • statistically (Reps) • repeat database (Repeat. Masker)

Statistical repeat detection • Significant deviations from average coverage flagged as repeats. - frequent k-mers are ignored - “arrival” rate of reads in contigs compared with theoretical value (e. g. , 800 bp reads & 8 x coverage - reads "arrive" every 100 bp) • Problem 1: assumption of uniform distribution of fragments - leads to false positives non-random libraries poor clonability regions • Problem 2: repeats with low copy number are missed

Scaffolding

Sequencing hierarchy • Random sequencing – unrelated reads ~700 pairs • Assembly – un-related contigs 5 K-10 K pairs • Scaffolding – unrelated scaffolds 30 K~ 50 K pairs • Finishing/gap closure – completed genomes millions-billions of base-pairs

Definition

Scaffolder output • order and orientation of contigs • size of gaps between contigs • linking evidence: mate-pairs spanning gaps

Clone-mates

Linking information

Hierarchical scaffolding

Ambiguous scaffold

Phred/Phrap/Consed Analysis

What is Phred/Phrap/Consed ? Phred/Phrap/Consed is a worldwide distributed package for: a. Trace file (chromatograms) reading; b. Quality (confidence) assignment to each individual base; c. Vector & repeat sequences identification and masking; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.

How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

Phred Genome Research 8: 175 -194

Phred is a program that performs several tasks: a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (Mega. BACE) and LI-COR. b. Calls bases – attributes a base for each identified peak with a lower errorrate than the staard base calling programs.

Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

File Directories • chromat_dir/ • edit_dir/ • phd_dir/

Trace File High quality region – no ambiguities (Ns)

Trace File Medium quality region – some ambiguities (Ns)

Trace File Poor quality region – low confidence

Phred value formula q = - 10 x log 10 (p) where q - quality value p - estimated probability error for a call base Examples: q = 20 means q = 40 means p = 10 -2 (1 error in 100 bases) p = 10 -4 (1 error in 10, 000 bases)

Base Calling • phred -id. -p -pd. . /phd_dir • phred -view pf 84 c 05. s 1

The structure of a phd file BEGIN_SEQUENCE 01 EBV 10201 A 02. g BEGIN_COMMENT CHROMAT_FILE: EBV 10201 A 02. g ABI_THUMBPRINT: PHRED_VERSION: 0. 990722. g CALL_METHOD: phred QUALITY_LEVELS: 99 TIME: Thu May 24 00: 18: 58 2001 TRACE_ARRAY_MIN_INDEX: 0 TRACE_ARRAY_MAX_INDEX: 12153 TRIM: CHEM: term DYE: big END_COMMENT BEGIN_DNA t 8 5 c 13 17 a 19 26 t 24 2221 a 24 2232 a 22 2245 a 27 2261 g 25 2272 c 19 2286 c 12 2302 t 19 2314 g 12 2324 g 15 2331 g 19 2346 g 23 2363 t 33 2378 g 36 2390 c 44 2404 c 44 2419 t 39 2433 a 39 2446 a 34 2460 t 35 2470 g 34 2482 t 16 8191 t 6 11908 g 19 8200 a 6 11921 t 13 8211 g 6 11927 c 13 8229 t 6 11947 g 4 8241 c 6 11953 n 4 8253 a 6 11964 c 4 8263 g 6 11981 t 10 8276 c 4 11994 t 9 8286 n 4 12015 c 4 12037 c 12 8301 n 4 12044 t 16 8313 n 4 12058 c 12 8329 n 4 12071 c 12 8336 n 4 12085 c 15 8343 n 4 12098 t 19 8356 n 4 12111 c 9 8371 n 4 12124 g 13 8386 c 4 12144 g 14 8397 n 4 12151 a 7 8417 END_DNA g 9 8427 g 4 8445 END_SEQUENCE

phd 2 fasta • phd 2 fasta program – –converts. phdfiles to sequence in multifasta format – –writes. qualfile (quality scores) for each trace file – –phd 2 fasta -id. . /phd_dir -os CLONE. fasta oq CLONE. fasta. qual • Output: – –fasta. seqcontains fastasequences – –fasta. seq. qualcontains quality scores

Vector Sequence Cleaning (1) • DNA sequence cleaning: quality trimming and vector removal---Lucy: • Lucy Steps: – Read input seq#, seq info, and quality info – Chop off splice sites – Remove vector insert – Produce output seq for fragment assembly.

Vector Sequence Cleaning (2) • Restriction on file name: can’t contain any symbol eg. “–” “. “ “_” • Lucy major parameters to set up: -vector_complete. Seq splice_site_file (splice_site_file: 2 splice-site seq before and after the insertion point on the vector) • Lucy Output: – identified locations of good/clean region – trim seq without vector, linker, Ns (<3% Ns)

splice_site_file • ~ 150 bases, 50 bases overlap around splice • >PUCsplice. for. begin gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacg acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga • >PUCsplice. for. end acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga aattgttatccgctcacaattccacacaacatacgagccggaagcataaa • >PUCsplice. rev. begin tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatt tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt • >PUCsplice. rev. end tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt cgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

Cross_match

Cross_match • cross_match -minmatch 12 -penalty -2 minscore 20 -screen CLONE. fasta /net/share/sequence_pipeline/vector. fasta

Phrap -- Phragment Assembly Program (or Phil’s Revised Assembly Program) • Phrap is a program for assembling shotgun DNA sequence data • Key Features: • a. Uses the entire read content – no need for trimming. • b. User supplied (i. e. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats. • c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it’s not a consensus!

Phrap --Phrap is a program for assembling shotgun DNA sequence data • d. Provides extensive information about assembly – contained in phrap. out, *. ace and *. screen. contigs. qual files • e. Handles very large datasets – hundreds of thousands of reads are easily manipulated. • f. Generate output files – contain some important data and enable visualization by other programs

Banded Search

K-mers • >GL 1234. b 1 gattaagttgggtaacgccagggttttcccagtcac… gattaagttgggtaa ttaagttgggtaacg. . .

Phrap output files • *. contigs – fasta file containing the contigs – Contigs with more than one read – Singletons (single reads with a match to some other contig but that couldn’t be merged consistently to it) • *. singlets – fasta file of the singlet reads – Reads with no match to other read • *. ace – allows for viewing the assembly using Consed • *. view – required for viewing the assembly using Phrapview

Phrap parameters • phrap -new_ace CLONE. fasta. screen >outfile • OPTIONS DEFAULT FUNCTION • -penalty -2 ↑=>↑Stringency. -gap_init penalty -2. -gap_ext penalty -1 • -minmatch* 14 ↑=>↓time↓Matches • -bandwidth 14 ↓=>↓time↓String. • -minscore 30 ↑=>↑String. • *highly sensitive! bigger genomes bigger value

Phrap parameters • • OPTIONS DEFAULT FUNCTION -forcelevel 0~10 ↓=>↑String. -repeat_stringency 0. 95 0<x<1 ↑=>↑String. -force_high* ↑=>↑String. -revise_greedy** ↓Misassembly -shatter_greedy** ↓Contig. Length * Ignore edited high-quality discrepancies **break assembly at weak joins

Phrap parameters • • • OPTIONS DEFAULT FUNCTION -max_subclone_size 5000 F. -R. check -default_qual 15 -preassemble* -group_delim* _ *used together

Consed Genome Research 8: 195 -202, 1998

Consed A program for viewing and editing assemblies produced by Phrap Key Features: a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence. b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads.

Consed A program for viewing and assemblies produced by Phrap editing Key Features: c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc. d. Autofinish – automatic set of functions for: gap closure, improvement of sequence quality, determination of relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.

Phred/Phrap/Consed Pipeline Directories: Chromat_dir Phd_dir Edit_dir

Comparison of shotgun sequence data from Wolbachia genome Project Computer 35 (7): 47 -54

Softwares • CAP 3 (for EST): http: //genome. cs. mtu. edu/cap 3. html • Phrap (for large genome): http: //www. phrap. org • --Similar algorithm • --Insufficient documentation and support • -- Always have to write scripts to parse outputs • --NO PERFECT PROGRAM!!!

Questions? ? wangsj@yahoo. com