Microbial genome analysis and comparisons
Dave Baumler
Genome Center of Wisconsin, UW-Madison
dbaumler@wisc.edu

Today’s session overview:
Introduction
Module #1) Microbial genomes at NCBI (http://www.ncbi.nlm.nih.gov/Class/minicourses/)
-become familiar with the tools and options using the NCBI tutorial “Microbial genomes Quickstart”, and learn how to download genome (.gbk) files
Module #2) Conduct genome alignments of phage genomes
-use Mauve to conduct whole-genome alignments and become familiar with Mauve
Module #3) Compare genomes from 3 outbreaks of E. coli O157:H7
-identify genomic islands using Mauve and assess conservation of virulence factors
Module #4) Compare genomes from 5 strains of Yersinia pestis
-identify genomic islands and conservation of virulence factors, analyze mutations with phenotypic consequences due to insertion and/or deletion events and single nucleotide polymorphisms (SNPs), and explore paleomicrobiology
Conclusion

Choose one of the two problems:
#1 Escherichia coli O157:H7 strain Sakai
#2 Rickettsia prowazekii strain Madrid E

Lists of all complete and in-progress microbial genomes
Download full genome sequence (.gbk) files

Downloading Microbial Genome Files
#1) Look for the largest .gbk file, which is the main genome; smaller .gbk files are plasmids
#2) Double click on the file
#3) From the File pull-down choose “Save page as” and give the file a name ending in .gbk

Links to other E. coli databases and/or resources

Brief information about the organism

Overview with links to assorted tools

Search on page for words using the Edit>>Find in this page pulldown

Entrez protein view

Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. COG link

GenePlot
Entrez Genome offers a new pairwise comparison tool called GenePlot to visualize similarities among bacterial genomes. Support for fungal genomic comparisons is also planned. To construct a GenePlot, genes are numbered sequentially along the genomic sequences of two organisms and the two corresponding sets of predicted proteins are compared using BLAST. For every case in which a pair of proteins, one from each genome, are mutual best matches, a point is plotted using the indices of the equivalent genes in the two genomes as the X and Y coordinates. Use the GenePlot link from an organism’s genome record to see a GenePlot against the organism with which it shares the highest number of reciprocal best hits. Comparisons between other organisms can be made using the pull-down menus.
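The plotting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not NCBI's implementation: the score dictionaries stand in for parsed BLAST output, and the function names and gene identifiers are hypothetical.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject.

    `scores` is {(query, subject): bit_score}, a hypothetical stand-in
    for parsed BLAST results.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def geneplot_points(a_vs_b, b_vs_a, index_a, index_b):
    """Return (x, y) points for protein pairs that are mutual best matches.

    `index_a` and `index_b` give each gene's sequential number along
    its genome, as in the description above.
    """
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    points = []
    for gene_a, gene_b in best_ab.items():
        # Keep the pair only if the best hit is reciprocal.
        if best_ba.get(gene_b) == gene_a:
            points.append((index_a[gene_a], index_b[gene_b]))
    return sorted(points)
```

Conserved gene order shows up as points along a diagonal; inversions appear as anti-diagonal runs.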

TaxMap

Comparisons of COG groups between various organisms

The ERIC database houses all of the available genomes of members of the family Enterobacteriaceae. Boxes represent organisms with at least one genome sequenced.
Categories shown in the figure: Human Pathogens; Insect Pathogens/Endosymbionts; Environmental/Animals/Industrial; Phytopathogens/Plant-associated
Genera shown in the figure: Calymmatobacterium, Moellerella, Cedecea, Morganella, Citrobacter, Plesiomonas, Edwardsiella, Proteus, Enterobacter, Providencia, Brenneria, Arsenophonus, Escherichia, Rahnella, Dickeya, Buchnera, Ewingella, Salmonella, Alterococcus, Erwinia, Sodalis, Hafnia, Serratia, Budvicia, Pantoea, Klebsiella, Shigella, Buttiauxella, Pectobacterium, Kluyvera, Tatumella, Obesumbacterium, Phlomobacter, Leclercia, Yersinia, Pragia, Saccharobacter, Leminorella, Yokenella, Trabulsiella, Samsonia, Wigglesworthia, Xenorhabdus

Orthologs
If at least two of these criteria are met for the pair of genes in question, they are typically assigned as orthologs:
• Percentage identity and alignment percentage are in the typical range
• Local genome context: the conserved gene is part of an operon with other genes that are already considered orthologs
• Larger-scale conservation of genomic context: the conserved gene is in the same general genomic context as other orthologs
• Functional conservation: the conserved gene is predicted or known to perform the same function as the potential ortholog in another genome
Reciprocal best BLAST hits: BlastP of X against Y >60%, and BlastP of Y against X >60%
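The "at least two criteria" rule can be written down as a small check. This is only a sketch of the decision rule on this slide; the class and field names are hypothetical, and deciding whether each criterion is met is left to the curator.

```python
from dataclasses import dataclass


@dataclass
class OrthologEvidence:
    """One flag per criterion listed on the slide (names are illustrative)."""
    identity_in_typical_range: bool
    shared_operon_context: bool
    shared_genomic_context: bool
    same_function: bool


def is_likely_ortholog(evidence, minimum_criteria=2):
    """Return True when at least `minimum_criteria` criteria are met."""
    met = sum([
        evidence.identity_in_typical_range,
        evidence.shared_operon_context,
        evidence.shared_genomic_context,
        evidence.same_function,
    ])
    return met >= minimum_criteria
```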

Enterobacteria cont. Generated from 180 orthologs

ERIC-Enteropathogen Resource Integration Center Genomes Tools & Annotations Genome Views and Comparisons

Part of a genome sequence TCAGCGAAGATGAGATAGTTTTTAAAGGTGGGATTTCCCCACCTTTAAAAAGCGAGAAGTCCCGGTTTTAA AGAGGAGTAAAATCCTCTTTTTCTAGCCCACTCAGGTGGTTTTTTTGGTTTTCGCTCCTTGCCGCATCTTC TGTGCCTTTGATGGCGGCTGGTTGGGGTGAAAGGCTGCATATTCCAGAATTTCAGACAGTAGATTGTTTTT GAAATCTTCCGTTTTATCGTTGACGAACTTAACCATCCTGTTGAAATCATCTTCCTTTGATACACCTTCAG GAAATGCCTTAGGAACTGATGTTTGGCTATCCAAGGCATCTTGCAATATCTGCACGATCTCCGAATTCATT GATCGCCCATTGGCCTTTGCTCTGGCGGCAACTGCGTCACGCATACCGTCAGGCATCCTAACTGTAAATCT CTCAATGAAAGCTGGATCTTCTTTTTCAGTCATCATCTTAAACCATAAAAATTTATACAAAACACACTAGC ATCATATTGACATTACCCACAATGACATCATAATGGTGTCAGGCATCAAAATGATGTCATCATGACAAGGG GAAAGTAAATGCAAGATGTTCTCTATACAGGTCGTAAGAACGACAGCTTTCAGCTTCGTCTGCCTGAGCGA ATGAAAGAAGAGATCCGTCGCATGGCAGAGATGGACGGCATTTCGATTAATTCTGCAATCGTGCAGCGCCT TGCTAAAAGCTTGCGTGAGGAAAGAGTTAATGGGCAGTAAAAACAGCGAAGCCCGGAAGTGTGGGGACACT AACCGGGCTTCTAATGTCAGTTACCTAGCGGGAAACCAACAATGACCAGTATAGCAATCTTTGAAGCAGTA AACACTATCTCTCTTCCACGGACAGAAGATCATAACTGCGATGGTGGCGGGTGTGGCGTATGTGGC AATGAAGCCCATCGTGGAAAACATCGGTTTAGACTGGAAGAGCCAGTATGCCAAGCTCGTTAGTCAGCGTG AAAAGTTCGGGTGTGGTGATATCACCATACCAAAGGTGGTGTTCAGCAGATGCTTTGCATCCCTTTG AAGAAACTGAATGGCTCTTCAGCATTAACCCAGCAAAAGTACGTGATGCAGTTCGTGAAGGTTTAAT TCGCTATCAAGAAGAGTGTTTTACAGCTTTGCACGATTACTGGAGCAAAGGTGTTGCAACGAATCCCCGGA CACCGAAGAAACAGGAAGACAAAAAGTCACGCTATCACGTTCGCGTTATTGTCTATGACAACCTGTTTGGT GGATGCGTTGAATTTCAGGGGCGTGCGGATACGTTTCGGGGGATTGCATCGGGTGTAGCAACCGATATGGG ATTTAAGCCAACAGGATTTATCGAGCAGCCTTACGCTGTTGAAAAAATGAGGAAGGTCTACTGATTGGCGT ATTGGAAGGCGCAAAAAGCCAGCAGATGGGCTGCTGGCATTGGGTATATGAACTTTCGGAGA ACATATGAAGTCAATTATCAAGCATTTTGAGTTTAAGTCAAGTGAAGGGCATGTAGTGAGCCTTGAGGCTG CAAGCTTTAAAGGCAAGCCAGTTTTTTTAGCAATTGATTTGGCTAAGGCTCTCGGGTACTCAAATCCGTCA

What exactly are gene annotations?
Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
1. identifying elements on the genome, a process called “structural annotation” or “gene finding”
2. attaching information to these elements, such as their molecular and biological functions

Annotation step #1: Structural Annotation
Example of a gene: the start codon is green and the stop codon is red
Structural annotation consists of the identification of genomic elements (e.g. genes).
• Open Reading Frames (ORFs), also called coding sequences (CDSs), must have a start codon and a stop codon
• Location of regulatory motifs (such as promoters and ribosome binding sites)
• This step is typically automated using gene prediction software (automation only finds ~50-90% of the genes)
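To make the start-codon/stop-codon rule concrete, here is a minimal ORF scan in Python. It is a teaching sketch, not a gene predictor: it assumes ATG as the only start codon, scans only the forward strand, and ignores regulatory motifs. The function name and the `min_codons` parameter are illustrative.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    `start` is the 0-based index of the start codon and `end` is the
    index just past the stop codon, so sequence[start:end] is the ORF.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == START:
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

Real gene finders such as the statistical models mentioned in this section also weigh codon usage and upstream signals, which is why a simple scan like this over-predicts.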

Annotation step #1: Structural Annotation (cont.) using GeneMark.hmm, a statistical model

Annotation step #2: Functional annotation consists of attaching biological information to genomic elements:
• biochemical function
• regulation and interactions involved
• expression
• cellular location
Three examples of annotations for one gene:
• Name/synonym: a short “word” used to refer to the gene (Ex. ureC)
• Product: a descriptive protein name (Ex. Urease gamma subunit)
• Function: describes what the protein does (Ex. Catalyzes the hydrolysis of urea to form ammonia and carbon dioxide)

Module #2 Conduct genome alignments of phage genomes
-this module was developed to teach how to use Mauve, using enterobacteria phage
-phage genomes can be aligned using Mauve in a matter of minutes
-applicable as a teaching tool to decipher the mosaicism of phage genomes
-comparative studies of 30 mycobacteriophage genomes revealed new insights into their diverse architecture and into gene exchange (Hatfull et al. PLoS Genetics 2006)
-using Mauve, you could align EVERY mycobacteriophage genome available
-How diverse are enterobacteriophage? (the following series of slides are Mauve alignments of phage isolated from E. coli, Salmonella spp., Yersinia spp., and Shigella spp.); all alignments are also provided for further inquiry
-we will run alignments with 3 phage genomes from E. coli O157:H7

Mauve: Multiple Genome Aligner
• Able to identify and align collinear regions of multiple genomes even in the presence of rearrangements
• Find and extend seed matches
• Group into locally collinear blocks
• Align intervening regions
(Darling et al. Genome Res. 2004 Jul; 14(7): 1394-403.)
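The "find seed matches" step can be illustrated with a simplified sketch. This is not Mauve's algorithm: it uses exact k-mers on the forward strand only, whereas Mauve's seed matching and extension are more sophisticated. It only shows the idea of anchoring an alignment on subsequences that occur exactly once in every genome. Function names are hypothetical.

```python
from collections import Counter


def unique_kmers(sequence, k):
    """Return {kmer: position} for k-mers occurring exactly once."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: sequence.find(kmer) for kmer, n in counts.items() if n == 1}


def shared_unique_seeds(genomes, k):
    """Return k-mers unique within each genome and present in all of them.

    Each result is (kmer, positions), with one 0-based position per genome.
    """
    per_genome = [unique_kmers(genome, k) for genome in genomes]
    shared = set(per_genome[0]).intersection(*per_genome[1:])
    return sorted((kmer, tuple(p[kmer] for p in per_genome)) for kmer in shared)
```

Seeds whose positions keep the same relative order in every genome would then be grouped into a locally collinear block.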

Module #2 Understanding phage, the viruses that infect microorganisms, via genome alignments
We recently aligned 56 enterobacterial phage; phage genomes are an ideal training tool for teaching how to set up Mauve alignments

Why Phage? Genomics timeline
[Figure: genomics timeline with milestones at 1977, 1982, 1995, 1996, 1997, 1998, 2000, 2001, and 2008; organism labels not recoverable]

Step #1 Copy the folder called “3 phage genomes for alignment exercise” and paste it on the hard drive of your computer (C: drive)
Step #2 From the Start menu, under Programs, select Mauve 2.1.1
Step #3 Under the File pull-down select “Align with progressiveMauve”; a new window will appear
Step #4 Click to choose where to send the output file, find the folder (from Step #1), and double click on the folder
Step #5 Type in a file name and click on Save

Next, add the sequences to align
Click on Add sequence
Select the first phage genome and click on Open, then continue with the 2nd and 3rd phage genomes
Then click on Align to start the genome alignment
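The same alignment can also be started without the graphical interface, since Mauve ships a command-line aligner called progressiveMauve. The sketch below only builds the command; it assumes the executable is on your PATH, and the output and genome file names in the usage line are hypothetical placeholders, not the files distributed with this module.

```python
import shlex


def progressive_mauve_command(output_path, genome_paths,
                              executable="progressiveMauve"):
    """Build the argument list for a progressiveMauve run.

    The aligner takes --output=<file> followed by the input genome files.
    """
    if len(genome_paths) < 2:
        raise ValueError("an alignment needs at least two genomes")
    return [executable, f"--output={output_path}", *genome_paths]


# Hypothetical file names, for illustration only.
command = progressive_mauve_command(
    "three_phage.xmfa", ["phage1.gbk", "phage2.gbk", "phage3.gbk"]
)
print(shlex.join(command))
```

To actually run it, pass the list to `subprocess.run(command, check=True)`; the resulting alignment file can then be opened in the Mauve viewer as in the later modules.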

When viewing the LCBs, Mauve displays regions that are highly conserved/identical in full color. Areas that are unique/variable to one genome appear in white and represent unique islands

Your tool bar is at the top on the left; the tools you will use are in the View pull-down and also the buttons:
-Home: returns the viewer back to the home view
-Move left or right: you will find this useful to center a region of interest in the middle of the screen prior to zooming in
-Zoom in/out: you can also hold down the Ctrl button and use the arrows on the keyboard
-Search for features

Other useful commands in Mauve
Function                               Key
Zoom in                                Ctrl+Up
Zoom out                               Ctrl+Down
Scroll left                            Ctrl+Left
Scroll right                           Ctrl+Right
Export the current view as an image    Ctrl+E

Module #3) Dissecting virulence of E. coli O157:H7 using genome alignments

The first E. coli genome sequenced was the nonpathogenic E. coli K-12 genome MG1655
-determination of the complete E. coli sequence required almost 6 years
-E. coli is the preferred model in biochemical genetics, molecular biology, and biotechnology, and its genomic characterization will undoubtedly further research toward a more complete understanding of this important experimental, medical, and industrial organism (Blattner et al. Science 1997)

The first pathogenic E. coli genome sequence was enterohaemorrhagic (EHEC) Escherichia coli O157:H7 strain EDL933
-In 1982 Escherichia coli O157:H7 was recognized as a pathogen causing human disease
-EDL933 comes from the Michigan outbreak in 1982 linked to ground beef
-Shiga toxin-producing E. coli (STEC)
(Perna et al. Nature 2001)

The completion of the 2nd E. coli O157:H7 (EHEC) sequence, strain Sakai
-In July 1996, an outbreak of Escherichia coli O157:H7 infection occurred among schoolchildren in Sakai City, Osaka, Japan
-8,938 schoolchildren sickened, 3 deaths
-We are starting to ask: What genomic differences determine differences in virulence, epidemiology, and fatality?
(Hayashi et al. DNA Res 2001)

In 2006, an E. coli O157:H7 outbreak from bagged spinach (from CDC)
-multistate outbreak: 205 people sickened, 3 deaths

Currently there are 13 E. coli O157:H7 genomes sequenced; we will have you focus on three that are all in the Enteropathogen Resource Integration Center (ERIC) database (www.ericbrc.org)
The three strains you will focus on are:
Escherichia coli EDL933 (EHEC)
Escherichia coli Sakai (EHEC), also called RIMD
Escherichia coli EC4042 (EHEC)

In your Start menu under Programs, go to Mauve 2.1.1 and start up Mauve. Notice there is a user’s guide in PDF form in this folder; it contains useful information and commands for navigation.
Note: your computer may need to update Java, since Mauve uses a Java platform for the alignment.
You should see a window for Mauve appear

Next, double click on the uncompressed 3 O157 H7 folder; it should contain the following 19 files. Take the first one (3 O157 alignment) and drag and drop it into the Mauve window.
It should start to say “reading sequences” here, and in a few seconds the alignment will appear. Note: computers with less than 512 MB RAM may not be able to open the file

Your alignment should look like this
Organism name: notice the first is EDL933, the second is RIMD (Sakai), and the third is EC4042 (spinach)
Using the up or down arrows, you can switch the position of the genomes

Top strand
Bottom strand
The colored blocks are called locally collinear blocks (LCBs) and represent regions of the genome that Mauve has identified as conserved. The lines connect the LCBs; notice that some are in different positions in the other genomes, and some are inverted and appear on the bottom strand of the double-stranded genome

When you move your mouse over a region of one genome it will show a black box and also show the corresponding region (boxes) in the other two genomes, try scrolling left to right on one genome

Notice that when you scroll (slowly) over a white region (island), the black boxes pause in the other genomes, then come back once you have passed over the island into conserved regions again

If you would like to look at the same LCB in all three genomes, even though one is in a different position, scroll over the LCB and click the mouse button

Let’s use the zoom function. Press the home button to restore the alignment to the original view.
Now click on the white island in the top genome and, using the right button, bring it to the center of the screen; then zoom in multiple times.
You will start to see the genes. Scroll over one and pause, and a window will pop up with the product annotation; here you can view which genes are present in this EDL933 island and not in the other two

Now place your mouse over one of the genes; in my example I have iha (IrgA homolog adhesin)
Click your mouse once on the gene and a window will pop up; scroll down and select “View CDS iha in ERICdb”
This will open the page in the ERIC database for that gene, containing all of the annotations; you can look to see if it is involved in virulence

Let’s use the search feature
#1) Click on the search feature
#2) Choose a genome (EDL933)
#3) Type in a gene name (stx2A)
#4) Click on search

Notice that it has found the stx2A gene (highlighted in blue), and also the corresponding gene in the RIMD strain. Just because it isn’t aligned in the EC4042 strain does not mean it isn’t there; if you look to the right in the EC4042 genome, you will find stx2A

One last feature you can use in Mauve
To find an island that is in 2 out of 3 strains, you will use the backbone view
Press the home button first
Then go to the View pull-down, select Color scheme, then Backbone color

Your alignment should look like this in backbone color. Regions present in all three genomes appear in light purple; regions in other colors correspond to 2 out of 3 genomes (you may have to zoom in a bit to see these regions)
Regions in only EDL933 and RIMD appear olive green
Regions in only EDL933 and EC4042 appear maroon
Regions in only RIMD and EC4042 appear tan/brown
This is how you identify islands shared by only 2 of the 3 strains

Using genomics to track the dissemination of Yersinia pestis strains
Courtesy of www.cdc.gov
Deng et al. 2002 J. Bacteriol. 184(16): 4601-4611

Transmission cycle of Plague

The 3 historic pandemics of plague
-pandemic: an epidemic that spreads throughout the human population across a large region such as a continent, or worldwide
-1st pandemic, ~550 A.D., confined mainly to Africa and some parts of the Middle East
-2nd pandemic originated in Central Asia and spread via trading routes into Europe (killed ~30% of Europe’s population)
-3rd pandemic started in the 1850s in China’s Yunnan province, confined mainly to Asia
Courtesy of edsitement.neh.gov

The first two genomes of Yersinia pestis: CO92 & KIM
Parkhill et al. 2001 Nature 413: 523-527
Deng et al. 2002 J. Bacteriol. 184(16): 4601-4611
Comparison of the 2 genomes was not interactive initially

As of 04/2008 there are 7 complete and 14 draft Y. pestis genomes
Traditionally the strains are classified as serovars (Antiqua, Mediaevalis, Orientalis, and other) based on the following phenotypic characteristics:
-Antiqua = East Africa (glycerol positive, arabinose positive, and nitrate positive)
-Mediaevalis = Central Asia (glycerol positive, arabinose positive, and nitrate negative)
-Orientalis = Central Asia (glycerol negative, arabinose positive, and nitrate positive)
-other (i.e. Microtus, Pestoides): not consistent for these phenotypes
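The phenotype table above is effectively a lookup, which can be sketched as follows. The function name and boolean encoding are illustrative; the three profiles are exactly those listed on this slide, and anything else falls into "other".

```python
def classify_strain(glycerol, arabinose, nitrate):
    """Classify a Y. pestis strain from three phenotypes.

    Each argument is True for a positive test and False for a negative one.
    """
    profiles = {
        (True, True, True): "Antiqua",
        (True, True, False): "Mediaevalis",
        (False, True, True): "Orientalis",
    }
    return profiles.get((glycerol, arabinose, nitrate), "other")
```

In the later exercises these phenotypes are traced back to the genes glpD (glycerol), araC (arabinose), and napA (nitrate).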

Paleomicrobiology
Partial view of the grave in Dreux investigated in this work, which illustrates anthropologic features of a mass grave suitable for paleomicrobiology research. (courtesy of www.cdc.gov)
-the prefix paleo comes from the Greek word palaios, meaning “ancient”
-bacterial colonization of dental pulp can occur during bacteremia
-bacteremia (also known as plague septicaemia with Y. pestis) is the presence of bacteria in the blood
Courtesy of www.nidcr.nih.gov

Extraction of bacterial DNA from dental pulp
-Some historians believed that a flu-like virus, and not Y. pestis, was responsible for the 1st and 2nd pandemics
-DNA detected in dental pulp confirms that Y. pestis was the cause
-Which serovar(s) are most similar to the Y. pestis strain(s) from the dental pulp of the corpses?
Figure 1: The original protocol developed in our study allows recovering the dental pulp and minimizes the risk of laboratory-acquired contamination of the specimen. The tooth was encased in sterile resin (1a); the apex was sterilely sectioned (1b) to give access to the canal system (1c); solutions were injected (1d); after incubation, the tooth was put upside down into a sterile tube (1e) and centrifuged (1f). Tran-Hung et al. PLoS ONE v. 2(10); 2007

Use of genomic tools to study Y. pestis
Concepts in this module that you will address:
#1) mutations (insertions, deletions, single nucleotide polymorphisms [SNPs]) that prevent production of a fully functional gene product and so have phenotypic consequences, studied in the genes glpD, napA, and araC
#2) paleomicrobiology investigation: determine which serovar(s) have the most similar matching genes compared to the sequence amplified from the dental pulp of 3 corpses
#3) use of genome alignments: find an island that is present in the 4 genomes that infect humans and absent in Y. pestis strain 91001
#4) determine the conservation of a virulence factor in the 5 strains in the genome alignment, and determine whether it is a fully functional product in strain 91001

Next, double click on the uncompressed Yersinia pestis alignment 5 genome folder; it should contain the following 29 files. Take the one named yersinia_pestis_alignment_5 genomes and drag and drop it into the Mauve window.
It should start to say “reading sequences” here, and in a few seconds the alignment will appear. Note: computers with less than 512 MB RAM may not be able to open the file

Your alignment should look like this
Organism name: notice the first is CO92, the second is KIM, the third is 91001, the fourth is Antiqua, and the fifth is Nepal516
Using the up or down arrows, you can switch the position of the genomes

You may find it easier to view the 5-genome alignment without the connecting lines: on your keyboard press Shift+L (pressing this again makes them reappear)

Now place your mouse over one of the genes. Click your mouse once on a gene and a window will pop up; scroll down and select “View CDS in ERICdb”
This will open the page in the ERIC database for that gene, containing all of the annotations; you can look to see what is known about it and/or if it is involved in virulence (note: you may be taken to a log-in screen; click on the button that says “Enter ASAP”)

Let’s use the search feature to find the genes glpD, napA, and araC
#1) Click on the search feature
#2) Choose a genome or search all of the genomes
#3) Type in a gene name (glpD)
#4) Click on search

Notice that it has found the glpD gene (highlighted in blue), and also a corresponding gene in each genome. You need to determine which of the five CDSs produce the full-length functional protein
Method #1: click on each gene and go to “View CDS in ERICdb”; look at the length and whether any are labeled as pseudogenes. If so, look for a note that describes why it is thought to be a pseudogene

Identifying mutations in glpD, napA, and araC cont.
Method #2: from the feature page in ERIC, scroll down to the feature context part of the page
This is a list of all features that neighbor your gene in the genome; notice some are upstream, downstream, or contained within
Notice that contained within your glpD gene there are polymorphic sites (otherwise known as SNPs)
For SNP analysis, you will use a new tool called “Snippy”

In a new tab or web browser window go to http://asap.ahabs.wisc.edu/~cabot/aep/snippy.php
It should look like this:
Highlight and copy all feature IDs for polymorphic sites from glpD, paste them into the box here, and click “submit feature ID’s”

In your SNP analysis, you want to look for SNPs that cause a change in the encoded amino acid. In some cases the change results in a premature stop codon, which may generate a truncated, non-functional protein
#1) Note that Snippy shows you whether the SNP variation results in an amino acid change, in this case A (alanine) to T (threonine)
#2) In this second SNP, the change resulted in a stop codon
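The two outcomes described above (an amino acid change, or a premature stop codon) can be reproduced with a short codon-translation sketch. This is not how Snippy is implemented, which the slides do not describe; it is an independent illustration using the standard genetic code, with hypothetical function and variable names and a made-up CDS in the usage notes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def snp_effect(cds, position, new_base):
    """Classify a single-base change in a coding sequence.

    `position` is 0-based within the CDS, which must start at codon 1.
    Returns (old_amino_acid, new_amino_acid, kind), where kind is
    "synonymous", "missense", or "nonsense".
    """
    offset = position % 3
    codon_start = position - offset
    old_codon = cds[codon_start:codon_start + 3].upper()
    new_codon = old_codon[:offset] + new_base.upper() + old_codon[offset + 1:]
    old_aa, new_aa = CODON_TABLE[old_codon], CODON_TABLE[new_codon]
    if old_aa == new_aa:
        kind = "synonymous"
    elif new_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return old_aa, new_aa, kind
```

For the made-up CDS `ATGGCTCAATAA`, changing position 3 from G to A turns GCT (alanine) into ACT (threonine), a missense change like example #1; changing position 6 from C to T turns CAA (glutamine) into TAA, a premature stop like example #2.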

In the middle of each region you will see the polymorphic site (in this case capital G’s) and the corresponding base in each genome. Note that you are interested in variations in YPKIM, YPCO92, YP91001, YPNepal, and YpAntiqua.
-in this case there is no difference among these 5 genomes; scroll down and check the remaining polymorphic sites to see whether any of them differ among the 5 genomes. If not, the cause is probably a larger deletion or insertion event

Using the DNA sequence obtained from the dental pulp of three corpses (found in the file called Ypestis corpse and CA88-4125 YPE genes.doc), conduct a BlastN search within the ERIC database with each sequence against the 91001, Nepal, KIM, Antiqua, and CO92 genomes. For each of the three corpses, which serovar is most similar to the strains that caused the 1st and 2nd pandemics?
From the ERIC home page (http://www.ericbrc.org/) you can select to run a Blast search here

Paste the first nucleotide sequence from corpse #1
Select entire genomes
Select the genomes to query: hold down the Ctrl key and select Y. pestis genomes 91001, Antiqua, CO92, KIM, and Nepal
Finally, click on the Submit Query button; repeat with the sequences from the other two corpses
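Once you have the best hit from each genome, "most similar" comes down to comparing percent identity. The sketch below is not BLAST: it assumes you already have each hit aligned to the query at equal length (with "-" for gaps), and simply ranks the genomes. Function names and the toy sequences in the usage notes are hypothetical.

```python
def percent_identity(aligned_query, aligned_hit):
    """Percent of alignment columns where both sequences carry the same base.

    Both strings must be the same length; gap columns count as mismatches.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        return 0.0
    matches = sum(q == h and q != "-"
                  for q, h in zip(aligned_query, aligned_hit))
    return 100.0 * matches / len(aligned_query)


def rank_genomes(aligned_query, aligned_hits):
    """Return (genome, percent identity) pairs, most similar first."""
    scores = {name: percent_identity(aligned_query, hit)
              for name, hit in aligned_hits.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With a toy query `ACGTACGT` and hits `ACGTACGT`, `ACGTTCGT`, and `ACG-ACGA`, the ranking is 100%, 87.5%, and 75% identity. In the exercise itself, read the identity values directly from the BlastN results page.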

Next, repeat the BlastN process using the gene sequences from a known North American ancestor (Y. pestis CA88-4125/YPE) for glpD, napA, and araC. Of the 5 genomes (91001, Antiqua, CO92, KIM, and Nepal) representing the three serovars, which is most similar to the known North American ancestor? Based on your analysis, did Y. pestis arrive in North America via shipping routes over the Atlantic or the Pacific?
Atlantic? (Serovar Antiqua, of African origin)
Pacific? (Serovar Orientalis or Mediaevalis, of Asian origin)
Courtesy of education.usgs.gov

Your alignment should look like this in backbone color. Regions present in all five genomes appear in light purple; regions in other colors correspond to 2, 3, or 4 out of 5 genomes (you may have to zoom in a bit to see these regions)
Look for a region in the lightest blue color that is present in CO92, KIM, Antiqua, and Nepal, but absent in the 91001 strain. Analyze the contents and determine if any of the genes may contribute to human infection by Y. pestis.
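The same search can be done on the backbone table that Mauve writes alongside an alignment. The sketch below assumes that file is whitespace-delimited with one header line, a left-end and right-end column per genome, and `0 0` where a segment is absent from a genome (negative coordinates mark the reverse strand). Treat that layout as an assumption and check it against your own file; the function names are hypothetical.

```python
def parse_backbone(text):
    """Parse backbone text into rows of (left, right) pairs, one per genome."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the header line
        values = [int(value) for value in line.split()]
        rows.append([(values[i], values[i + 1])
                     for i in range(0, len(values), 2)])
    return rows


def segments_missing_only_in(rows, absent_index):
    """Return segments present in every genome except `absent_index`."""
    found = []
    for segment in rows:
        present = [left != 0 or right != 0 for left, right in segment]
        others = [p for i, p in enumerate(present) if i != absent_index]
        if not present[absent_index] and all(others):
            found.append(segment)
    return found
```

With the genome order used in this alignment (CO92, KIM, 91001, Antiqua, Nepal516), strain 91001 is index 2, so `segments_missing_only_in(rows, 2)` lists candidate islands for the question above.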

Conclusion
If you are interested in using some or all of these modules in your class, please sign up and provide your email, institution, and course(s)
-In the last two weeks of August 2008 I will be leading multiple WebEx training sessions to refresh and field Q&A; you will need a telephone and an internet-ready computer

Thanks for your time
Collaborators: Dr. Kai F. (Billy) Hung (UW-Madison/Assistant Prof. at Eastern Illinois University, Fall 2008), Dr. Amy C. Wong (UW-Madison), Dr. Lois Banta (Williams College)
Mentors: Dr. Nicole Perna (UW-Madison), Dr. Charles Kaspar (UW-Madison), Dr. Jeffrey Byrd (St. Mary’s College), Dr. Bob Kadner and the ASM Summer Institute
Thank you: everyone on the ERIC database team (especially Guy Plunkett III for setting up module #1 & Eric Cabot for making Snippy) and all of the members of the Perna Genome Evolution Laboratory
Funding: This project has been funded with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract No. HHSN26620040C