Benchmarking assembly correction tools for the assembly-based comparison of complex Shiga Toxin-Producing Escherichia coli O157:H7 genomes

David R Greig* (1,2,3), Claire Jenkins (1,2), David L Gally (1,3), Saheer E Gharbia (1) & Timothy J Dallman (1,2,3)

1) National Infection Service, Public Health England, London, UK.
2) NIHR HPRU in Gastrointestinal Infections.
3) Division of Infection and Immunity, The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK.

@gingerdavid92

INTRODUCTION

- The development of single molecule sequencing technologies (SMRT) has made it possible for public health microbiologists to sequence and generate complete assemblies of bacterial genomes both rapidly and inexpensively.
- This has transformed our understanding of the accessory genome of bacteria, particularly in prophage-rich organisms like Shiga toxin-producing Escherichia coli (STEC) O157:H7 [1].
- The ability to scrutinise the accessory genomes of pathogens provides insight into the dynamic nature of the accessory genome, the acquisition and loss of virulence genes and antibiotic resistance determinants, the genomic context of mobile genetic elements, and large chromosomal rearrangements that may have public health implications.
- However, generating accurate, complete and corrected genomes using these technologies is still in its infancy, and benchmarking base calling, assembly and polishing accuracy is often confounded by the lack of a truth set.
- In this study, we sequenced a well-characterised STEC O157:H7 reference genome (strain Sakai) using both Oxford Nanopore (MinION) and Illumina (HiSeq) technologies, with the aim of correcting a long-read assembly to high enough accuracy to determine relatedness directly from the corrected chromosome.

METHODS

- DNA extraction was performed using a Qiagen QIAsymphony, followed by library preparation using the Nextera XP kit and sequencing on the Illumina HiSeq 2500.
- DNA extraction was also performed using the Fire Monkey kit (RevoluGen), followed by library preparation using the SQK-LSK109 ligation kit and sequencing on the Oxford Nanopore Technologies (ONT) MinION on a FLO-MIN106D flow cell.
- Nanopore basecalling, read trimming and read filtering were performed using Guppy v3.2.6 (HAC model), Porechop v0.2.4 [2] and Filtlong v2.0 [3] respectively, before assembly using Flye v2.7 [4].
- We benchmarked several assembly correction tools, including Nanopolish v0.11.3 [5], Medaka v1.0.3 [6], Racon v1.4.13 [7] and Pilon v1.23 [8], in varying workflows and with varying parameters, using both Nanopore and Illumina reads, to understand which method produces the most accurate corrected assembly with respect to the reference.
- An STEC O157:H7 trained Medaka model was created using 200x coverage of the most accurate reads from strain Sakai.
- All comparisons between corrected assemblies and strain Sakai were performed using Minimap2 v2.17 [9] with the -cx asm5 parameters.

RESULTS

Figure 1. Workflow detailing the number of single base substitutions with respect to the Sakai reference genome upon completion of each correction component.

Correction step             Substitutions  1 bp del.  1 bp ins.  2 bp del.  2 bp ins.  >2 bp del.  >2 bp ins.
Racon x4 optimised (ONT)             5193       5332        726        347         40          20          20
Medaka trained (ONT)                  431       4828        378        244          4          17           0
Pilon optimised (Illumina)            200         84         53         10          2           2           0
Racon optimised (Illumina)            174         84         24         10          1           2           0

Table 1. Number of single base substitutions, single-base indels, dual-base indels and larger indels after each step of the optimised correction workflow.

- An increase in correction accuracy was achieved when Nanopore reads with multiple alignment hits were removed.
- Two correction workflows produced similar results (Figure 1, highlighted in green):
  1. Nanopolish (Nanopore reads), Pilon (Illumina reads) and Racon (Illumina reads).
  2. Racon correction (Nanopore reads), Medaka (with a trained STEC model and Nanopore reads), Pilon (Illumina reads) and finally Racon (Illumina reads).
- Workflow 2 produced fewer single base substitutions and ran much faster than workflow 1 (Figure 1, Table 1).
- Of the 174 single base substitutions remaining after workflow 2, 155 (89%) were located within prophage and prophage-like regions of the chromosome.

DISCUSSION & CONCLUSIONS

- Regardless of the order in which the assembly correction tools were used, the most accurate polishing protocol required the use of Illumina reads.
- The most accurate workflow was four rounds of Racon correction (Nanopore reads), Medaka (with a trained STEC model and Nanopore reads), Pilon (Illumina reads) and finally Racon (Illumina reads).
- This optimised workflow led to an overall accuracy of 99.995% when compared to the publicly available reference genome.
- Of the remaining errors, approximately 89% of single base substitutions were concentrated within the prophage regions (Figure 2), highlighting the complexities in polishing homologous/paralogous regions of the STEC genome using short reads.

Figure 2. Circos [10] plot showing the location of remaining substitutions and indels (black ring) relative to the reference genome. Also shown are the loci of bacteriophages within the reference genome: red, stx-encoding prophage; grey, non-stx-encoding prophages; blue, prophage-like regions; green, Locus of Enterocyte Effacement.

ACKNOWLEDGEMENTS

The research was funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Gastrointestinal Infections at University of Liverpool in partnership with Public Health England (PHE), in collaboration with University of Warwick. The views expressed are those of the author(s) and not necessarily those of the NIHR, the Department of Health and Social Care or Public Health England.

REFERENCES

1) Croxen MA, Law RJ, Scholz R, Keeney KM, Wlodarska M, Finlay BB. 2013. Recent advances in understanding enteric pathogenic Escherichia coli. Clin Microbiol Rev. 26: 822-80. doi:10.1128/CMR.00022-13.
2) Wick R. Unpublished. https://github.com/rrwick/Porechop.
3) Wick R. Unpublished. https://github.com/rrwick/Filtlong.
4) Kolmogorov M, Yuan J, Lin Y, Pevzner PA. 2019. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 37(5): 540-546. doi:10.1038/s41587-019-0072-8.
5) Loman NJ, Quick J, Simpson JT. 2015. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods. 12(8): 733-5. doi:10.1038/nmeth.3444.
6) https://github.com/nanoporetech/medaka.
7) Vaser R, Sović I, Nagarajan N, Šikić M. 2017. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27(5): 737-46. doi:10.1101/gr.214270.116.
8) Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. 2014. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLOS One. 9(11): e112963. doi:10.1371/journal.pone.0112963.
9) Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 34(18): 3094-3100. doi:10.1093/bioinformatics/bty191.
10) Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D et al. 2009. Circos: an information aesthetic for comparative genomics. Genome Res. 19(9): 1639-45. doi:10.1101/gr.092759.109.

© Crown copyright 2021 26 December
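WORKED EXAMPLE

The overall accuracy quoted in the conclusions can be sanity-checked against the Table 1 counts. The sketch below is not the authors' calculation: it assumes accuracy is one minus the number of error events divided by the chromosome length, counts each indel as a single event regardless of its size, and uses an approximate Sakai chromosome length of 5.5 Mb, a figure that does not appear on the poster. All function and variable names are illustrative.

```python
# Error counts per correction step, copied from Table 1.
# Column order: substitutions, 1 bp del, 1 bp ins, 2 bp del, 2 bp ins, >2 bp del, >2 bp ins
TABLE_1 = {
    "Racon x4 optimised (ONT)": (5193, 5332, 726, 347, 40, 20, 20),
    "Medaka trained (ONT)": (431, 4828, 378, 244, 4, 17, 0),
    "Pilon optimised (Illumina)": (200, 84, 53, 10, 2, 2, 0),
    "Racon optimised (Illumina)": (174, 84, 24, 10, 1, 2, 0),
}

# ASSUMPTION: approximate length of the Sakai chromosome (about 5.5 Mb).
# This value is not stated on the poster.
ASSUMED_CHROMOSOME_LENGTH_BP = 5_500_000

# Substitution counts reported in the Results section for the final step.
FINAL_SUBSTITUTIONS = 174
PROPHAGE_SUBSTITUTIONS = 155


def error_events(counts):
    """Total number of error events (each indel counted once, whatever its size)."""
    return sum(counts)


def percent_accuracy(counts, genome_length_bp=ASSUMED_CHROMOSOME_LENGTH_BP):
    """Event-based accuracy as a percentage of the assumed chromosome length."""
    return 100.0 * (1.0 - error_events(counts) / genome_length_bp)


def prophage_share_percent(in_prophage=PROPHAGE_SUBSTITUTIONS, total=FINAL_SUBSTITUTIONS):
    """Share of remaining substitutions that fall in prophage or prophage-like regions."""
    return 100.0 * in_prophage / total


if __name__ == "__main__":
    for step, counts in TABLE_1.items():
        print(f"{step}: {error_events(counts)} events, "
              f"{percent_accuracy(counts):.3f}% accuracy")
    print(f"Substitutions in prophage regions: {prophage_share_percent():.0f}%")
```

Under these assumptions the final step leaves 295 error events, about 99.995% accuracy, and 155 of 174 substitutions is about 89%, both consistent with the figures reported above. A per-base definition of accuracy, or a different reference length, would shift the result slightly.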