Whole Genome Sequencing for Epidemiologists: A Brief Introduction
Joel R Sevinsky, Ph.D.
Objectives
• Microbial genomes
• Common isolate identification techniques using molecular biology
• Whole genome sequencing (WGS)
• Example of WGS for outbreak investigation
• Questions
Microbial Genomes
• Genome size varies from 4.56 to 5.70 Mb
• How big is 5 Mb???
Harry Potter Story
• Long story, some big books!
• 1,084,440 words in all seven books
• Average word length ~5 letters
• ~5,422,200 letters total in box set
• E. coli genomes range from 4.56 to 5.70 Mb
• A single E. coli genome ≈ 1 box set
• Single human genome ≈ 1,000 box sets!!!
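The box-set comparison can be checked with a few lines of arithmetic. This is a minimal sketch: the word count and average word length come from the slide, while the human genome size (about 3.1 billion bases for one haploid copy) is an assumption added here for illustration.

```python
# Figures taken from the slide.
WORDS_IN_BOX_SET = 1_084_440
AVG_WORD_LENGTH = 5  # letters per word, approximate

letters_per_box_set = WORDS_IN_BOX_SET * AVG_WORD_LENGTH


def box_sets(genome_size_bp):
    """Express a genome size (in bases) as a number of box sets."""
    return genome_size_bp / letters_per_box_set


ecoli_small = box_sets(4.56e6)  # smallest E. coli genome on the slide
ecoli_large = box_sets(5.70e6)  # largest E. coli genome on the slide

# Assumption (not from the slide): one haploid human genome is ~3.1e9 bases.
HUMAN_GENOME_BP = 3.1e9
human = box_sets(HUMAN_GENOME_BP)

print(f"Letters per box set: {letters_per_box_set:,}")
print(f"E. coli: {ecoli_small:.2f} to {ecoli_large:.2f} box sets")
print(f"Human (haploid, assumed size): {human:.0f} box sets")
```

Under these assumptions an E. coli genome is close to one box set, and a human genome is several hundred box sets per haploid copy (roughly twice that for both copies), which is the same order of magnitude as the rounded figure on the slide.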
PFGE (Pulsed Field Gel Electrophoresis)
What do these bands really mean???
PFGE (Pulsed Field Gel Electrophoresis)
[Figure: PFGE schematic. Labels: restriction enzyme site; 1 site, 2 sites, 3 sites, 2 sites, 4 sites; genome size in Mb: 5 Mb, 5 Mb, 5 Mb, 1 Mb, 0.5 Mb.]
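The link between restriction sites and bands can be sketched in code. The example below is an illustration only: it assumes a circular chromosome, and the function name `digest_circular` and the cut positions are hypothetical, not taken from the slide. Each fragment produced by the digest would correspond to one band on the gel.

```python
def digest_circular(genome_size_bp, cut_positions):
    """Return fragment sizes (bp), largest first, for a circular genome.

    cut_positions are 0-based coordinates where the enzyme cuts.
    With no cuts the circle stays intact; with n >= 1 cuts there are n fragments.
    """
    cuts = sorted(set(cut_positions))
    if any(p < 0 or p >= genome_size_bp for p in cuts):
        raise ValueError("cut position outside the genome")
    if not cuts:
        return [genome_size_bp]  # uncut circle
    # Fragments between consecutive cuts...
    fragments = [b - a for a, b in zip(cuts, cuts[1:])]
    # ...plus the fragment that wraps around the origin of the circle.
    fragments.append(genome_size_bp - cuts[-1] + cuts[0])
    return sorted(fragments, reverse=True)


GENOME_SIZE = 5_000_000  # a 5 Mb genome, as in the slides

# Hypothetical cut positions chosen for illustration.
one_site = digest_circular(GENOME_SIZE, [123_456])
four_sites = digest_circular(GENOME_SIZE, [0, 1_000_000, 1_500_000, 3_000_000])

print(one_site)    # a single 5 Mb fragment (the circle is linearised)
print(four_sites)  # four fragments whose sizes sum to 5 Mb
```

Two isolates with different genome sequences can still produce the same list of fragment sizes, which is one reason a banding pattern carries far less information than the sequence itself.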
Harry Potter Story
Specific word = restriction enzyme site
• Word frequency determines banding pattern.
• Different words represent different enzymes.
• What does PFGE really tell you then?

Table 1. Frequency of "Voldemort" by book
Book                  Voldemort (n)
Sorcerer's Stone      31
Chamber of Secrets    20
Prisoner of Azkaban   37

Table 2. Frequency of other words by book
Book                  Broomstick (n)   Spell (n)   Wand (n)   Wizard (n)
Sorcerer's Stone      27               14          62         41
Chamber of Secrets    12               6           107        44
Prisoner of Azkaban   20               6           114        39
Isolate Identification Techniques

Protein
• Serotyping

DNA (listed in order of increasing information)
Abbreviation       Technique                                                   What is examined
PFGE               Pulsed Field Gel Electrophoresis                            Total gDNA fragments
16S sequencing     Ribosomal RNA (rRNA) sequencing                             1 gene
MLST               Multi Locus Sequence Typing                                 7 genes
wgMLST             Whole Genome Multi Locus Sequence Typing (WGS)              Thousands of reference genes plus pan genome
wgSNP or hqSNP     Whole Genome Single Nucleotide Polymorphism Typing (WGS)    Total gDNA
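To make the "7 genes" row concrete, an MLST result can be thought of as one allele number per locus, and two isolates share a sequence type only when all seven numbers match. The sketch below assumes the seven housekeeping loci of a commonly used E. coli scheme (the slide does not name the scheme), and every allele number is made up for illustration.

```python
# Assumption: loci of a commonly used seven-gene E. coli MLST scheme.
LOCI = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")


def differing_loci(profile_a, profile_b):
    """Return the loci at which two allele profiles carry different alleles."""
    return [locus for locus in LOCI if profile_a[locus] != profile_b[locus]]


# Hypothetical allele profiles (allele numbers are invented).
isolate_x = dict(zip(LOCI, (1, 2, 3, 4, 5, 6, 7)))
isolate_y = dict(zip(LOCI, (1, 2, 3, 4, 5, 6, 7)))
isolate_z = dict(zip(LOCI, (1, 9, 3, 4, 5, 6, 8)))

same_type_xy = not differing_loci(isolate_x, isolate_y)
diff_xz = differing_loci(isolate_x, isolate_z)

print("x and y share a sequence type:", same_type_xy)
print("x and z differ at:", diff_xz)
```

Isolates x and y are indistinguishable by this method even though they may differ elsewhere in the genome, which is the gap that wgMLST and SNP typing are meant to close.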
Whole Genome Sequencing
40 box sets
Whole Genome Sequencing
[Figure: many short sequence fragments of varying length (for example AT, GC, AGTTG, TGGTA) scattered across the slide, together with repeated copies of the sequence ATGCGTGATCTAGTAGTCTAGGAGCTGACCGATTA.]
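If the "40 box sets" label is read as roughly 40-fold sequencing depth (an interpretation, not something stated on the slide), the number of reads involved follows from a simple relationship: depth = number of reads × read length / genome size. The sketch below assumes a 5 Mb genome and a hypothetical read length of 150 bases.

```python
import math


def mean_depth(n_reads, read_length_bp, genome_size_bp):
    """Average number of times each base is covered by a read."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth, read_length_bp, genome_size_bp):
    """Smallest whole number of reads that reaches the target mean depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


GENOME_SIZE = 5_000_000  # 5 Mb, as in the earlier slides
READ_LENGTH = 150        # assumption: hypothetical read length in bases
TARGET_DEPTH = 40        # "40 box sets", read here as 40-fold depth

n_reads = reads_needed(TARGET_DEPTH, READ_LENGTH, GENOME_SIZE)
depth = mean_depth(n_reads, READ_LENGTH, GENOME_SIZE)

print(f"Reads needed: {n_reads:,}")
print(f"Resulting mean depth: {depth:.2f}x")
```

In other words, sequencing shreds many copies of the "book" into short fragments, and software pieces the text back together from the overlaps.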
WGS for Outbreak Investigations
Salmonella enterica serovar Enteritidis: JEGX01.004, JEGX01.002
A = suspect isolate, same time/PFGE
B = same patient over 5 weeks
C = suspect isolate for outbreak 5
D = environmental isolate, egg farm swab
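Groupings like these usually rest on counting single nucleotide differences between isolates. The sketch below is a toy illustration of that idea: the isolate names and the 20-base aligned sequences are hypothetical and much shorter than real data, and no particular pipeline or threshold from the talk is implied.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases.

    Positions where either sequence has a gap or ambiguous base are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != b and a in valid and b in valid
    )


# Hypothetical aligned sequences for four isolates.
alignment = {
    "isolate_1": "ATGCGTGATCTAGTAGTCTA",
    "isolate_2": "ATGCGTGATCTAGTAGTCTA",
    "isolate_3": "ATGCGTGATCTAGTAGTCTG",
    "isolate_4": "ATGAGTGTTCTAGCAGTCTG",
}

distances = {
    (a, b): snp_distance(alignment[a], alignment[b])
    for a, b in combinations(alignment, 2)
}

for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d} SNPs")
```

Isolates separated by very few differences are candidates for a common source, while a larger distance (isolate_4 here) suggests an unrelated strain even when a coarser method cannot tell them apart.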
WGS Beyond Outbreak Investigations
• “…comparison of these 61 genomes sequences revealed that neither the 16S gene, nor the gene fragments usually used for MLST, provides biologically meaningful information on the relatedness of the sequenced isolates. The best way to analyze this is by taking into account all the genomic content, rather than looking at one or a few individual genes.”
WGS Beyond Outbreak Investigations
• Genome size varies from 4.56 to 5.70 Mb
• This size variation demonstrates a genomic difference of up to 1 Mb between isolates.
• 1 Mb = ~1,000 genes
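The gene estimate follows directly from the two numbers on the slide. A minimal sketch of the arithmetic, using the slide's own rule of thumb of about 1,000 genes per Mb (sizes are held in kb to keep the arithmetic exact):

```python
# Genome sizes from the slide, expressed in kilobases.
SMALLEST_GENOME_KB = 4_560  # 4.56 Mb
LARGEST_GENOME_KB = 5_700   # 5.70 Mb
GENES_PER_MB = 1_000        # rule of thumb from the slide

size_difference_kb = LARGEST_GENOME_KB - SMALLEST_GENOME_KB
estimated_gene_difference = size_difference_kb * GENES_PER_MB // 1_000

print(f"Size difference: {size_difference_kb / 1_000:.2f} Mb")
print(f"Estimated difference in gene content: ~{estimated_gene_difference:,} genes")
```

So two isolates of the same species may differ by on the order of a thousand genes, none of which a single-gene or seven-gene method would examine unless they happen to be among its targets.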
Reference Characterization by WGS
“One Shot” Characterization of STEC
Tools: ANI, SerotypeFinder, VirulenceFinder, 7-gene MLST, ResFinder, Phylogenetic ID

GENUS/SPECIES: Escherichia coli
SEROTYPE: O104:H4
PATHOTYPE: Shiga toxin producing and Enteroaggregative E. coli (STEC & EAEC)
VIRULENCE PROFILE: stx2a, aggR, aggA, sigA, sepA, pic, aatA, aaiC, aap
SEQUENCE TYPE: ST678
ANTIMICROBIAL RESISTANCE GENES: blaTEM-1, blaCTX-M-15, strAB, sul2, tet(A)A, dfrA7
wgMLST CODE: 102:45.26.35.3
Summary of Potential WGS Applications
• Outbreak investigation
  ◦ Sporadic vs outbreak
  ◦ Not just cluster but phylogenetic relationships
• Microbial Source Tracking (MST)
• Microbial Surveillance
  ◦ Food
  ◦ Environment
    ▪ Animals, soil, food prep areas, hospitals, etc.
• Antibiotic resistance monitoring
• Virulence gene monitoring
  ◦ Genotype predicts phenotype
  ◦ Mobile vs integrated
• What else???
Questions?