Abstract
In less than a decade, population genomics of microbes has progressed from the effort of sequencing dozens of strains to thousands, or even tens of thousands, of strains in a single study. There are now hundreds of thousands of genomes available even for a single bacterial species, and the number of genomes is expected to continue to increase at an accelerated pace given the advances in sequencing technology and widespread genomic surveillance initiatives. This explosion of data calls for innovative methods to enable rapid exploration of the structure of a population based on different data modalities, such as multiple sequence alignments, assemblies and estimates of gene content across different genomes. Here, we present Mandrake, an efficient implementation of a dimensional reduction method tailored for the needs of large-scale population genomics. Mandrake is capable of visualizing population structure from millions of whole genomes, and we illustrate its usefulness with several datasets representing major pathogens. Our method is freely available both as an analysis pipeline (https://github.com/johnlees/mandrake) and as a browser-based interactive application (https://gtonkinhill.github.io/mandrake-web/).

This article is part of a discussion meeting issue ‘Genomic population structures of microbial pathogens’.

1. Introduction

Advances in DNA sequencing technology have recently made whole-genome sequencing both affordable and scalable enough for routine use in pathogen surveillance by research organizations and public health agencies around the world [1,2]. A striking example of this is genomic surveillance of the SARS-CoV-2 virus, for which over one million genome sequences became available in just 15 months after its initial discovery [3]. Shedding light on population genomic data at this scale calls for new tools that can be used for rapid exploration of the structure among the samples, with particular emphasis on detecting clusters of similar sequences [4,5].
In this paper, we explore and extend a class of methods that aims to reduce the dimensionality of such data to only two dimensions, in a manner that supports ready visualization and identification of clusters.

© 2022 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.