High-resolution sweep metagenomics using fast probabilistic inference

Determining the composition of bacterial communities beyond the level of a genus or species is challenging because of the considerable overlap between genomes representing close relatives. Here, we present the mSWEEP pipeline for identifying and estimating the relative sequence abundances of bacterial lineages from plate sweeps of enrichment cultures. mSWEEP leverages biologically grouped sequence assembly databases, applying probabilistic modelling, and provides controls for false positive results. Using sequencing data from major pathogens, we demonstrate significant improvements in lineage quantification and detection accuracy. Our pipeline facilitates investigating cultures comprising mixtures of bacteria, and opens up a new field of plate sweep metagenomics.


Introduction
High-throughput sequencing technologies have enabled researchers to study bacterial populations in unprecedented detail using whole-genome sequencing of pure individual bacterial colonies. Sequencing of individual isolates has provided insights into antimicrobial resistance and the complex ecology of the spread of antimicrobial resistant variants globally. The application of community profiling metagenomics, in which the 16S rRNA gene is sequenced from complex multi-species samples, can provide information about the composition and dynamics of highly diverse bacterial populations. However, the resolution of this approach is limited due to insufficient nucleotide variation 1 and profiling beyond the level of genus/species is generally not possible. Whole-genome shotgun metagenomics delivers a much higher resolution than 16s rRNA sequencing 2 but widespread application is hindered by the cost associated with sequencing a sample to a sufficient depth to capture the diverse set of organisms and strain-level variation that may be present in the sample 3 .
Current methods for taxonomic profiling of bacteria from sequencing data 4 typically perform well only up to the specieslevel 5 or focus on analysing predetermined single nucleotide variants (SNVs) and/or marker genes to capture the variation contained in a mixed colony of closely related strains 6-8 . Sequencing isolated colonies offers means to ignore this variation but only focusing on pure colonies is insufficient for many potential applications 9,10 . Furthermore, whilst the SNV-based approach has been successful in studies of the history of the human population, focusing solely on SNVs inadequately captures the greater variability and different modalities of variation in bacterial genomes. Conversely, solely gene-based approaches can capture some of this while potentially losing finer detail. Therefore, we aimed to strike a balance between these two approaches by making use of a complete genome reference database.
Here, we have developed the mSWEEP pipeline, which is designed to make efficient use of large collections of reference genomes that are available for numerous important human pathogens and other culturable bacterial species. mSWEEP combines clustering of the reference genomes into biologically relevant groups, fast pseudoalignment of reads to the references, fast and accurate probabilistic inference of the cluster abundances and a method for controlling false positive detections. Similar methods taking advantage of pseudoalignment either with 11,12 or without 13 the application of probabilistic modelling have been developed but we show that our combination of clustering with large reference collections vastly increases the accuracy of obtained estimates.
Although applicable to any scenario where reference genomes for the sequenced bacteria are available, mSWEEP specifically enables a new kind of high-resolution analysis in plate sweep metagenomics, where a mixture of colonies is harvested from an enrichment culture by sweeping the whole plate in contrast to isolating a single colony. Plate sweep experiments fall between whole-genome sequencing of single colonies and culture-independent metagenomics by analysing the entire complexity of a community from a specific growth medium. Since the potential species are restricted in advance by the growth medium, plate sweeps offer a cost-effective way to obtain high-depth sequencing data from only the target organisms of interest and reduce potential sources for bias when comparing enrichment cultures from different timepoints. As illustrated in our experiments, this setting is ideal for analysing samples representing populations of pathogenic bacteria, where the infecting species of primary interest have generally been previously encountered and sequenced frequently. By leveraging on existing high-resolution genomic pictures of pathogen populations, mSWEEP provides means to address a range of novel biological questions related to within-host variation, transmission and the effect of ecological factors on the microbial diversity present in samples.

Lineage identification
Abundance estimation with mSWEEP is performed in two phases: reference preparation, performed once for a given reference collection, and analysis of samples ( Figure 1). Reference preparation consists of defining a reference sequence database and grouping the sequences according to biological criteria such as sequence type (ST), clonal complexes (CC), or by using a clustering algorithm for bacterial genomes. Grouping related reference sequences is essential in enabling identification of the taxonomic origin of each read 11 and enables abundance estimation when the sequencing reads originate from a sequence having no exact match in the reference database but which is represented by sequences from closely related organisms within the same group (typically bacterial lineage). Consequently, accuracy of the abundance estimates provided by mSWEEP is reliant on an extensive reference database and a biologically meaningful grouping.
We constructed detection thresholds for the groups during the reference preparation from the reads used to assemble the reference sequences ( Figure 1). We performed repeated in silico experiments in each reference group, where we randomly chose one reference sequence from the group, removed it from the reference set, resampled from the sequencing reads used to assemble the removed sequence, and estimated abundances from the resampled reads with mSWEEP. This process was repeated within the group for a predetermined number of iterations, and then repeated in all other groups within the reference

Amendments from Version 1
We have revised our manuscript based on the feedback provided by the two reviewers. Notably, we have added a new synthetic experiment assessing the performance of mSWEEP in a setting with many Escherichia coli lineages simultaneously present at wildly varying coverages ranging from 50× to 0.10×. The results from this assessment are included as Figure 5 in the revised manuscript. Additionally, we have expanded the Discussion section to cover a limitation of our method related to the presence of novel or uncharacterized lineages in the samples, and also made several changes and additions throughout the manuscript related to the minimum sequencing depth required to use mSWEEP. Finally, we would like to thank both reviewers for their time and apt comments that have enabled us to improve the quality of our manuscript.
set. The detection threshold for a given group was determined by first examining abundance estimates for the given group from the repeated experiments where a different group was the true source (meaning estimates for the given group should ideally be zero), and determining from those estimates a source-specific cutoff point where only a preset number of estimates exceed the cutoff. After determining the source-specific cutoffs in all other groups, the detection threshold of the given group is obtained by taking the maximum of the source-specific cutoffs. The detection thresholds are used to filter the relative abundance estimates by setting the estimates below the cutoff point to zero. Our approach also provides a statistical confidence score for estimates exceeding the detection thresholds with the confidence determined by the number of estimates from the resampled reads allowed to exceed the source-specific cutoffs.
The first phase of analysis is pseudoaligning 14 sequencing reads to the reference sequences. Pseudoalignment produces binary compatibility vectors indicating which reference sequences a read pseudoaligns to. Based on the pseudoalignment count to each reference group, we defined the likelihood of a read originating from each of the groups. We assumed that 1) if multiple groups have the same total number of reference sequences, the group with a higher fraction of pseudoalignments is more likely the source for the read, and 2) the likelihood of the read to originate from a group is not dependant on the number of reference sequences in the group. Basing the likelihood on the pseudoalignment counts defines an extension of a probabilistic model that has previously been applied in RNA-sequencing 15,16 and to bacterial data 11 . The extended model utilizes multiple reference sequences from each group as opposed to the previous attempts that rely on selecting a single, best-representative sequence from each of the groups 11 . Our model obtained the relative abundances of the reference groups by considering the generating process for a sample as a pooling of sequencing reads originating from the reference groups according to some unknown proportions, corresponding to a statistical mixture model. We fit the model and inferred the mixing proportions using variational inference 16 .
Assigning single-colony isolates to lineage We compared the performance of the mSWEEP pipeline (using kallisto 14 version 0.45 for pseudoalignment and mSWEEP software version 1.1.0 for abundance estimation) against two existing methods capable of identification beyond the species-level based on leveraging reference sequence collections: metakallisto 13 (version 0.45) and the BIB pipeline 11 (commit hash 2999540). We additionally attempted to compare mSWEEP with ditasic 12 (commit hash 90fee24b), but the comparison proved infeasible Figure 1. Flowchart of the mSWEEP pipeline describing a typical workflow for relative abundance estimation. The input part refers to the input data, reference preparation to the operations that need to be performed once per set of reference sequences, and analysis contains the steps run for every sample.
due to ditasic's quadratic scaling in the number of reference sequences -based on running the indexing step in ditasic for one day, indexing the reference data would have taken roughly 90 days and over 30 terabytes of disk space. The main differences between the chosen methods are that metakallisto attempts to identify individual strains based on all available sequences, BIB uses grouped reference sequences with a single representative sequence from each group to assign abundances to the groups, and mSWEEP identifies the presence of lineages by using grouped reference sequences with all the available sequence representatives.
As the reference data, we used bacterial sequence assemblies from four studies 17-20 augmented by single representative sequences from 27 species; a total of 3815 reference sequences. We grouped the sequences in either clonal complexes based on multilocus sequence typing 21 , lineages identified with the BAPS clustering algorithm 22 , or on the species-level. We removed 504 sequences from all groups represented by more than one sequence to create a dataset where the true group is known but the true sequence is not available to the method pipelines being compared (Table 1). In addition to the test data described in Table 1, we referred to a study sequencing 77 K. pneumoniae, K. variicola, and K. quasipneumonae isolates from Thailand 23 to assess the accuracy of all methods when the reference sequences and the test samples were not obtained from the same source. mSWEEP significantly outperformed BIB and metakallisto in cases measuring accuracy of abundance estimates in the true group ( Figure 2; p < 10 -9 , in all comparisons, Wilcoxon signed-rank test; median error in all estimates for mSWEEP was 0.00003, for BIB 0.23, and for metakallisto 0.54). When measured by highest estimates in the incorrect groups, mSWEEP outperformed the other two methods in all cases except the S. epidermidis 11-group clustering and the K. pneumoniae out-of-reference samples ( Figure 2; p < 0.0012, Wilcoxon signed-rank test; median error in all estimates for mSWEEP was 0.000002, for BIB 0.05, and for metakallisto 0.01). In these two latter cases, mSWEEP and metakallisto performed similarly ( Figure 2; p > 0.10 when testing for the difference in accuracy in either direction, Wilcoxon signed-rank test). Since metakallisto attempts to identify strains rather than lineages, the observed behaviour is likely a result of the majority of the abundance estimates being spread across strains belonging to the true lineage. We also examined the performance of a modified version of metakallisto, where the estimates for individual sequences within the lineages are pooled together by summing them up. However, this modification did not increase the performance of metakallisto enough to realistically compete with the results from mSWEEP (Extended Data Figure S1 24 ).
We additionally compared mSWEEP and BIB by measuring accuracy in classification based on assigning the samples to the lineage with the highest abundance estimate. With this criterion, both methods correctly identified the true clonal complex in all 100 C. jejuni and C. coli isolates, and in all 81 S. epidermidis isolates when using the 3-cluster grouping. In the 11-cluster S. epidermidis grouping, mSWEEP correctly identified the true lineage in 78 and BIB in 80 samples. In the 188 E. coli and 129 K. pneumoniae isolates, mSWEEP identified the lineage correctly in 187 and 126 samples, while BIB correctly identified 184 and 117. The K. pneumoniae and E. coli isolates that were misidentified by mSWEEP likely contain a sequence type that is missing from the reference, or are mixtures of K. pneumoniae and E. coli lineages (Extended Data Figures S1a and S1b 24 ). Out of the last 61 out-of-reference K. pneumoniae samples, mSWEEP identified the true origin in all 61 isolates and BIB in 53.
The least accurate estimates for all methods (measured by the true positives and highest true negatives) were obtained for the 81 S. epidermidis isolates when using the second level of the hierarchical BAPS clustering with 11 groups (Figure 2), where none of the three methods reached the level of accuracy observed  in the other cases. These inaccuracies are explained by the comparably small reference for the S. epidermidis population (Table 1), which does not exhibit a clear cluster structure (Extended Data Figure S3a 24 ) beyond the coarsest BAPS clustering into three groups. The lack of structure causes the abundance estimates to spread across the new groups defined within each of the three top-level clusters (Extended Data Figure S2c 24 ).
We examined the grouping of the reference sequences by producing t-SNE 25 plots of 31-mer distances estimated with mash 26 (version 2.0) between the reference sequences including the test isolates ( Figure 3; Extended Data Figures S2a-c 24 ). The C. jejuni and C. coli reference conforms to the clonal complex grouping while the S. epidermidis population only conforms to the coarsest 3-group BAPS clustering. The t-SNE plots correctly place the assemblies into the true groups but the method does not preserve the distances between the points or the clusters 25 and is on its own unsuited to analysing mixed isolate data.
Processing the 504 single-colony isolates with mSWEEP took an average of 23 minutes and 50 seconds per sample, metakallisto an average of 24 minutes and 42 seconds, and BIB an average of 143 minutes and 46 seconds per sample using the same reference data. mSWEEP used a maximum of 79.5 GB RAM (maximum of 24.6 GB counting only the abundance estimation step), metakallisto 108.1 GB, and BIB 31.5 GB. Resource usage was obtained by running each sample separately with a total of eight processor cores available. Reads were pseudoaligned with kallisto 14 (version 0.45) against the test reference of 3311 sequences (obtained from the 3815 reference sequences in Table 1 by removing the 504 single-colony isolates) in 259 groups (mSWEEP and metakallisto), or aligned with bowtie2 27 (version 2.3.5.1) against a reference consisting of randomly selected representative sequences from each of the 259 groups (BIB).

Quantifying synthetic mixtures of single-colony reads
We investigated the performance of mSWEEP in quantifying samples containing multiple lineages of bacteria from the same species by synthetically mixing reads from the single-colony samples. Each mixture sample was set to contain a total of one million reads from three single-colony samples from three lineages, with randomly assigned proportions from the set (0.20, 0.30, 0.50). We used a balanced incomplete block design to ensure that all lineages appear in at least 13 mixture samples, and each single-colony isolate appears at least once, producing 161 C. jejuni and C. coli, 477 E. coli, 584 K. pneumoniae, and 100 S. epidermidis synthetic mixture samples in total.
Compared to abundance estimates from the single-colony samples, estimates obtained from the synthetic mixture samples show that the presence of sequencing reads from multiple lineages in a synthetic mixture results in an error distribution resembling the one observed in the single-colony samples ( Figure 4; Extended Data Figure S4 24 ). Estimates from the synthetic S. epidermidis mixture samples using the 11-group split produce an error distribution that differs from the single-colony error distribution more than that observed with the other groupings.  Comparing the empirical distributions of the errors from the synthetic mixtures and the single-colony isolates (Extended Data Figures S3 and S4 24 ) shows that for estimates exceeding a threshold of 0.016, the accuracy of estimates from the mixture samples stochastically dominates the accuracy observed in the single-colony samples, except in the S. epidermidis 11-cluster case where stochastic dominance is observed only above a threshold of 0.17. Stochastic dominance establishes a partial ordering between two random variables and, in this case, implies that estimates from the mixture samples are more accurate (in a probabilistic sense) than estimates from the single-colony samples when the estimates are large enough. In the S. epidermidis 11-cluster case we do not establish the mixture estimates as more accurate since the distribution (Extended Data Figure S4 24 ) and the observed threshold differ considerably from the other cases.
The results indicate that above this relatively low background noise level of 0.016, quantifying mixture samples is not expected to produce more false positive results than would be obtained from single-colony samples. In the synthetic mixtures, the observed background noise level corresponds to sequencing depths of around 0.30x (E. coli and K. pneumoniae) and 0.65x (S. epidermidis), which provides the bare minimum sequencing depth required to distinguish between the lineages of each species in samples with similar read lengths and sequencing depth. This justifies simplifying the problem of determining the detection thresholds accompanying mSWEEP, which provide a threshold for reliable detection of the reference groups in mixture samples, to determining the thresholds based on the single-colony isolates. Due to the requirement that the abundance estimates must be large enough for this assumption to hold, we incorporate the threshold observed in comparing the estimates into the detection thresholds by using it as the minimum threshold regardless of the results from the resampling procedure.
We also evaluated the performance of mSWEEP with synthetic mixtures containing more complex strain compositions. Namely, we produced 87 synthetic mixture samples each consisting of 10 E. coli strains from 10 different lineages mixed together at varying relative abundances (0.50, 0.25, 0.125, 0.0625, 0.0312, 0.0156, 0.0078, 0.0039, 0.0020, 0.001). The total number of reads in each sample was set to correspond to sequencing a single typical E. coli genome at 100× sequencing depth (around 5 million 100bp reads), resulting in sequencing depths ranging from 50× to 0.10× for the 10 different strains.
Overall the design is intended to mimic performing plate sweeps with a similar amount of sequencing resources that would be available for the same number of colony picks.
The boxplot displays the relative error in the relative abundance estimates from mSWEEP compared against the true values. Error of >0% (horizontal axis) denotes estimates from mSWEEP exceeding the true value, while error of <0% denotes estimates lower than the true value. The dashed gray line corresponds to 0% error. The rows (vertical axis) separate the estimates by their approximate sequencing depth, with each sample contributing one value (estimate for one lineage) to each row.
The results from the complex synthetic mixtures ( Figure 5) show that mSWEEP accurately recovers the relative abundances for the dominant lineages (50× to 12.50× sequencing depths) and  identifies the mid-range lineages (6.25× to 1.56×) correctly but underestimates their relative abundance. As a result, the relative abundances of the two most dominant lineages are slightly overestimated because, taken together, the relative abundance estimates must sum up to 1. As for the lineages with < 1× sequencing depth, their relative abundances are not recovered at all, likely because the differences between the E. coli lineages are not pronounced enough to separate the strains at this sequencing depth. Coincidentally, the observed accuracy cut-off point roughly corresponds to the previously established background noise level of 0.016, or approximately 1.6× sequencing depth.  Figure S6 24 ), which we defined as multi-drug resistance 30 . One sample was found to contain the plasmid associated resistance gene MCR-1, which confers resistance to colistin, a last line antimicrobial drug 31 . We found no significant difference in the antimicrobial resistance gene profile between the healthy and diarrhoeal samples (Two tailed, Fisher's exact test p=0.5). Extended Data Table S2 32 details how many samples harboured antimicrobial resistance genes in each antimicrobial drug class; the full antimicrobial resistance gene data can be found in Extended Data Table S3 33 .
We additionally examined differences between the community composition in the healthy and diarrhoeal samples based on both the distribution of the relative abundance estimates (alpha diversity), and changes in the identified strains (beta diversity). The alpha diversity, measured by Shannon entropy (Extended Data Figure S7 24 ), showed no significant differences between the two paired samples (p > 0.90, Wilcoxon signed-rank test; median Shannon entropy in the diarrhoeal samples was 0.60, and in the healthy samples 0.59). However, we found significant shifts in lineage composition (see Figure 6) when comparing the beta diversity, measured by Bray-Curtis dissimilarity, between the two samples (p < 0.005, multivariate-ANOVA). Tests were performed on relative abundance estimates filtered by both 0.99 and 0.90 confidence thresholds.

Co-occurrence patterns in Campylobacter lineages
The network diagram (Figure 7a) shows ST-clonal complex (CC) (nodes) of the isolate genomes with the thickness of edges representing the number of times that isolates from these CCs  There is some preliminary evidence that common clinical lineages CC45 and CC21 are rarely found together in a single sample (plate sweep) while other lineages, such as the chicken associated CC353, are frequently isolated from samples containing multiple strains. From an evolutionary perspective, it is unlikely that closely related strains can stably occupy identical niches because competition would be expected to lead one to prevail. The results demonstrate co-occurrence of strains within individual host animals and multi-strain infections in humans and provide information about the complex ecology of co-occurring interacting species that leads to the observed community structure in a given sample.

Multi-drug resistant Klebsiella pneumoniae coexist with other lineages
The coexistence network (Figure 8) and the sample-lineage heatmap (Extended Data Figure S8 24 ) for the 179 human clinical samples of K. pneumoniae demonstrates common co-occurrence of K. pneumoniae with a wide variety of E. coli lineages, as well as occasional co-occurrence with Acinetobacter baumanii and other species. Since both E. coli and A. baumanii grow on the media used for culture of K. pneumoniae, frequent co-occurrence in samples selected for their diversity is expected. Clonal complexes of K. pneumoniae centered on sequence types associated with high levels of multi-drug resistance (e.g. ST258, ST147 and ST101) were frequently observed co-existing with a variety of other K. pneumoniae lineages as well as with each other, and with other important Gram-negative pathogens. Developing a deeper understanding of these community structures and interactions will be critical for monitoring horizontal transfer of antimicrobial resistance genes between taxa.

Discussion
Metagenomics using high-throughput sequencing has become a common approach when investigating the bacterial composition of different environments or changes introduced by intervention, such as in the human gut microbiome. In most epidemiological applications, the relevant target organisms are culturable using established media which offers a clear advantage to obtaining high sequencing depths in a cost-effective manner. We developed the mSWEEP pipeline to enable high-resolution inference of the lineages present in plate sweeps of enrichment cultures. mSWEEP can be used to infer the detailed population structure of a single species, or the diverse populations of bacteria typically encountered in clinical and public health settings where standard culturing media is routinely used to isolate epidemiologically relevant organisms. This pipeline also estimates the relative abundance of lineages and provides means to construct reliability cut-offs with accompanying confidence scores. Since the cut-offs are constructed before analysing any sequencing data by synthetically mimicking the properties of the in vitro data, the cut-offs and confidence scores can be used to assess the necessary sequencing depth to identify rare and low abundance lineages with confidence as well as the total number of reads required. mSWEEP was designed to have a minimal execution time using the latest advances in RNAseq analysis and its maximum memory footprint is determined by the pseudoalignment algorithm. We demonstrated significant improvements in accuracy over the previous state-of-the-art method in our experiments.
mSWEEP provides considerable power for improving our understanding of infection by recovering a true representation of bacteria in a complex sample. For example, genotyping studies have shown that C. jejuni and C. coli colonizing the primary host (birds and mammals) form clusters of related isolates that are host-associated 34 , which can be used to identify the reservoir for human infection 35 . However, multiple organisms can be isolated from the same sources 36,37 . The co-occurrence of different organisms could be a snapshot in time of a wider process of lineage succession 38 in which the resident microbiota might resist new colonizations or be displaced by recently acquired bacteria 39,40 . Further, we suggest we may be able to infer complex interactions between organisms that occupy different microniches 41 and are not in direct competition 42,43 by analysing their co-occurrence. Therefore, this approach provides a means to investigate the nature of polymicrobial infections to improve our understanding of the spread of a specific organism between hosts and transmission to humans in addition to enabling characterization of physical and temporal variation in the distribution of lineages among multi-strain samples.
Because of limitations in the initial culture and DNA isolation processes, we can only infer relative (not absolute) abundances and spike-in methods 44 must be used if an estimate of the absolute abundances is desired. However, only inferring relative abundances is not a significant limitation as the absolute abundances of target organisms are also subject to large biological and technical variation 45 . Memory requirements for large reference collections or simultaneous analysis of multiple samples necessitate a dedicated computer cluster to run the analysis pipeline, but even for very large reference collections the resource usage is still at the level available at most bioinformatics centres. Alternatively, the reference sequences can be modified to include only the directly relevant species, which makes mSWEEP widely applicable to biologists; or traditional alignment algorithms employed to trade decreased memory usage for increased runtime.
As with any method intended to identify sequence variation, the target species need to be relatively well known to allow building of sufficiently informative reference databases. Similarly, to allow for sensible and easily interpretable inferences, the biological clustering of the reference database should be based on well-established biological entities, such as multilocus sequence types (STs) or clonal complexes (CCs) which are frequently employed as labels of lineages. These limitations suggest that mSWEEP is most relevant for extensively studied bacteria, such as pathogens, and has only limited applications when working with samples containing high numbers of novel or uncharacterized species. Some of these limitations may be overcome by using gene flow network based approaches to perform the biological clustering, better capturing the population structure in species strongly shaped by horizontal gene transfer or ecological barriers 46,47 , but further study in this area is needed. However, following the introduction of mSWEEP, a separate computational tool has been developed to assess the suitability of the reference sequences to specific samples and to identify cases where mSWEEP struggles due to the presence of reads originating from novel genome sequences (https://github.com/harry-thorpe/demix_check). Consequently mSWEEP can also be used to estimate the quality of the reference collection and to discard contaminated or mis-identified genomes.
Strain identification from metagenomic data has been recently suggested by the StrainPhlAn method 48 . mSWEEP, and similar methods, are complementary to StrainPhlAn as these methods analyse similar data but from different directions. mSWEEP assigns strains present in the sample to biologically established genetically separated lineages and estimates the relative abundance of these, whereas StrainPhlAn infers SNPs and phylogenetic relations within the whole sample. Given the flexibility and generality of the mSWEEP approach, we anticipate this pipeline will pave the way for numerous novel applications of plate sweep metagenomics in many fields of microbiology.

Conclusions
mSWEEP represents a novel means to quantify the composition of bacterial communities beyond the resolution offered by bacterial identification methods based on 16S ribosomal RNA gene sequencing or whole-genome shotgun metagenomics. We have demonstrated significant improvements in accuracy over similar methods, and novel co-existence analyses using plate sweeps of enrichment cultures of the human pathogens Campylobacter jejuni, Campylobacter coli, Escherichia coli, Klebsiella pneumoniae and Staphylococcus epidermidis. We expect that mSWEEP will find use in similar studies of bacterial pathogens, where high-resolution inference is required, ample reference collections for the species of interest are available, and the plate sweep metagenomic approach can be applied in-depth at a fraction of the current cost of single-colony sequencing.

Reference construction
The reference sequences (Table 1, Extended Data Table S4 49 ) are the genomic assemblies of a number of strains or species that represent the organisms of interest in a sample. We used a collection of assemblies from four studies 17-20 augmented with the genomes of a representative strain from 27 species that were identified in the real mixture data by MetaPhlAn 50 .
Grouping the reference sequences We used multilocus sequence typing (MLST) of the C. coli, C. jejuni, E. coli and K. pneumoniae reference sequences to group them into clonal complexes defined by the allelic profile of a central sequence type, and all other sequence types that vary in at most a single MLST locus (C. coli and C. jejuni) or in at most two loci (E. coli and K. pneumoniae). The K. pneumoniae reference contained sequences belonging to K. variicola, K. quasipneumoniae, and K. quasivariicola which we assigned to three groups defined by the three species.
We similarly treated the 181 S. aureus contained in the S. epidermidis study as a single group, and split the 143 S. epidermidis sequences using the first and second levels of the hierarchical clustering produced by the hierBAPS 22 software (version 6.0). The complete grouping is provided in Extended Data Table S4 49 .

Pseudoalignment
We used kallisto 14 (version 0.45) with default settings to perform pseudoalignment. Pseudoalignment produces binary compatibility vectors which indicate whether the read pseudoaligns to a reference sequence or not. In our model, we sum the pseudoalignment counts within each reference group and thus consider the observations of N reads r n = (r n,1 , ... , r n,k , ... , r n,K ), n = 1, ... , N, k = 1, …, K as containing only the information about the number of pseudoalignments r n,k within each of the K groups.

Abundance estimation model
We assume that the reads r n are conditionally independent given the mixing proportions of the groups θ = (θ 1 , ... The formulation in Equation (1) corresponds to a standard mixture model with observations r n , categorically distributed latent variables I n , event probability parameters θ, and the likelihood p(r n | I n = k) of observing the full pseudoalignment count vector r n given that the group k is the true source.

Likelihood
The likelihood p(r n | I n = k) needs to be defined carefully in order to satisfy the goals of invariance to group identity and size, and monotonicity with increasing pseudoalignment counts within a group. Given the vector r n , we define the likelihood p(r n | I n = k) of observing the whole vector r n assuming that k is the true group in three parts depending on the number of reference sequences M k in the group k and the pseudoalignment count r n, k in group k as , , , and , where B(a, b) is the beta function and a k ,β k > 0 are hyperparameters for the group k.
Equation (2)  The formulation of f(r n,k , k) in Equation (4) and Equation (5) intuitively arises when the probability mass function of a beta-binomial random variable with hyperparameters (α k , β k ) is multiplied by the factor This factor is the inverse of the value of the probability mass function when the observed value is equal to the total number of sequences M k in the group, meaning in our context a read which is compatible with all reference sequences in a group. Formulating the likelihood in this manner (Equation (4) and Equation (5)) causes groups where all sequences in the group are compatible with the read to have equal likelihoods. Compared to a model assuming independence between the groups, this formulation reduces the effect of the likelihoods being flattened in groups with large numbers of assigned reference sequences when compared to small groups, which is necessary to compare groups that differ greatly in size.
Reads with identical pseudoalignment count vectors r n have the same likelihood and can be assigned into equivalence classes defined by the count of compatible sequences in each group. This enables a computational optimization where the likelihoods need only be calculated for the observed equivalence classes and then multiplied by the total number of times each equivalence class was observed.
In this parametrization, the first parameter π k ∈ (0, 1) corresponds to the mean of the beta distribution that has been compounded with a binomial distribution to obtain the beta-binomial-derived component in Equation (5), and the second parameter φ k > 0 represents a measure of variation in the success probability of each observation 51 .
We constrain the mean success rate π k in Equation (6) to π k ∊ (0.5, 1), which produces beta-binomial distributions with an increasing probability mass function 52 in the number of compatible sequences r n,k , which leads to the definition in Equation (5) having the same property. Increasing probability mass functions fulfil our requirement for the likelihood that of two equally sized groups with different number of compatible reference sequences, the one with more compatible sequences is always a better candidate for being the true source. The values of the parameters (π k , φ k ) are set to

Inference
We perform inference over the mixing proportions θ of the different groups using fast collapsed variational inference 53 . The method collapses the mixing proportions θ and uses natural gradients to optimise an approximation to the posterior distribution over the indicator variables I n , assuming the distribution factorises over θ and I n . The same variational Bayesian method was also used in BitSeqVB 16 to obtain transcript expression levels and has been applied to estimate mixing proportions in bacterial sequencing data in BIB 11 using a different likelihood. The prior distribution on the mixing proportions θ is set to Dirichlet (δα, ..., δα) with δα = 1. The same prior was also used by BIB. Since reads originating from the same equivalence class have the same likelihood, variational inference will yield identical posterior inferences for them. This allows us to perform the inference on the smaller number of equivalence classes rather than all reads, leading to faster inference.

Detection thresholds
Detection thresholds define a means to quantify reliable identification of the reference groups through constructing a minimum relative abundance threshold on the groups. Abundance estimates that fall under the threshold are considered unreliable and set to zero. To obtain the detection thresholds (Figure 1), we generate 100 samples from each reference group within a species by resampling one million sequencing reads from a randomly chosen reference sequence for that group, roughly matching the number of reads in our study samples, from the reads used to assemble the chosen sequence. Only reads corresponding to one sequence are used in each new sample. Reads included in the new samples are sampled with replacement with each read having the same probability of being included. The sequences used for resampling were chosen at random such that the number of reference sequences from each group corresponds to the square root of the total size of the group. Each group is represented at least once, except for groups which contain only one reference sequence where we apply the maximum detection threshold observed for other groups of the same species. Similarly, species that are represented in the reference by a single group were not resampled from, and the detection threshold was instead fixed at 0.05. After resampling, the new samples are put through pseudoalignment and mSWEEP abundance estimation without including the sequences used in resampling as pseudoalignment reference sequences.
In defining the detection thresholds, the relative abundance estimates obtained from the resampled sequencing reads are represented by , , , where n = 1, ..., N (in our examples we chose N = 100) indicates the new samples resampled from the reference group i = 1, ..., K. The third index j = 1, ..., K denotes the reference group that the abundance estimate was obtained for. We first define source-specific thresholds q i,j that give a threshold on the reference groups j assuming that the true group i in the sample is known. The source-specific threshold q i,j on group j ≠ i is defined by ordering the relative abundance estimates for the cluster j, , , , where is the constant minimum threshold for a specific grouping of the sequences within a species that is observed when comparing the empirical cumulative distribution functions in Extended Data Figure S6 24 . We recommend that be determined for new species by a synthetic mixing procedure similar to what was used to compare the accuracy of mixture estimates to their single-colony counterparts in Figure 4.
Based on the selected value of m, we may further define a statistical confidence score for the M = N -m + 1 remaining abundance estimates that exceed the detection threshold q i as which corresponds roughly to the fraction of resampled samples that exceed the threshold obtained with the selected value of m. Using values of m closer to the number of samples N in constructing the detection thresholds will result in stricter thresholds and thus higher confidence in the abundance estimates that exceed the threshold. The results reported in our experiments include thresholds constructed with m = 100 and m = 90, corresponding to confidence scores (Equation 7) of approximately 0.99 and 0.90, respectively.

Implementation
The mSWEEP software provides a C++ implementation of the abundance estimation part of the pipeline described in this manuscript. After pseudoaligning the sequencing reads, mSWEEP can be called from the command line to construct the abundance estimation model and infer the relative abundances of the reference lineages as described above in the Abundance estimation model, Likelihood, Model hyperparameters, and Inference sections. The mSWEEP pipeline consists of running both the pseudoalignment and the mSWEEP abundance estimation software.

Operation
Precompiled binaries and the source code for the mSWEEP software are available in GitHub. Compiling mSWEEP from source requires a compiler with full support for the C++11 standard (for example clang version 3.3 or later, or GCC version 4.8.1 or later) and the build process utility CMake (version 2.8.12 or later). The mSWEEP software itself does not have additional external dependencies. Prospective users of the mSWEEP pipeline will also need to install kallisto 14 for pseudoalignment, and possibly construct a set of scripts tailored to their reference data should they wish to take advantage of the detection threshold approach. The GitHub repository includes usage information, a general pipeline for abundance estimation with mSWEEP, and a toy dataset for an example workflow.
A typical workflow with mSWEEP begins by indexing the set of reference sequences (reference.fasta below) for pseudoalignment (here using kallisto version 0.45) kallisto -i kallisto_index reference.fasta The reference sequences need to be indexed only once and the same index can be used multiple times.
Analysing sequencing data (below the paired-end reads in two files: reads_1.fastq.gz and reads_2.fastq.gz) is done by first using kallisto to pseudoalign the reads kallisto pseudo -i kallisto_index -o pseudoalignments reads_1.fastq. gz reads_2.fastq.gz and then running the mSWEEP software (here using version 1.1.0) to produce the relative_abundances.txt file containing the relative abundance estimates. The synthetic E. coli mixture samples, and associated metadata, that were the basis for the results presented in Figure 5  We have thoroughly utilized the excellent feedback provided by both reviewers and made adjustments to our manuscript based on their recommendations and suggestions. Notably, we have added a new synthetic experiment assessing the performance of mSWEEP under conditions more resembling those found in the Vietnamese E. coli sequencing data as suggested by Mahendra Mariadassou. Included below is a point-by-point response to the concerns raised by both reviewers. We would also like to thank both reviewers for their time and outstanding feedback that has enabled us to improve the quality of our manuscript.

Mahendra Mariadassou:
The work is most relevant for well studied bacteria (like pathogens) for which there are good genome reference databases.

Author reply:
We have made it more clear in the revised manuscript that the intended application of our method is for extensively studied bacteria that have readily available reference sequences by including the following sentence in the Discussion section of our manuscript: As with any method intended to identify sequence variation, the target species need to be relatively well known to allow building of sufficiently informative reference databases. This means that mSWEEP is most relevant for extensively studied bacteria, such as pathogens, and has only limited applications when working with samples containing high numbers of novel or uncharacterized species.
Thank you for pointing out the potential for a misunderstanding.

Mahendra Mariadassou:
The comparison with metakallisto is slightly unfair, as pointed out by the authors, since it tries to tackle the more complex job of strain (rather than lineage) identification. A fairer comparison would be to pool the strains into the same group as mSWEEP and assess how it performs for those groups. This is especially important as the groups used in mSWEEP are well separated (see Fig. 3 t-SNE plot) and thus the task of assigning single-colony isolates to those groups is much easier than assigning them to strains.
were able to recover lineages at 0.0156 relative abundance (1.56x sequencing depth) and even a few of the lineages at 0.0078 relative abundance (0.78x sequencing depth). However, in general we do not expect mSWEEP to be able to separate lineages with < 1x sequencing depth unless they are far from each other in terms of genetic distance.
In any case, we thank the reviewer for pointing this issue out and have clarified the implications of the noise level in the manuscript with the following changes to the section where the noise level is mentioned: The results indicate that above this relatively low background noise level of 0.016, quantifying mixture samples is not expected to produce more false positive results than would be obtained from single-colony samples. In the synthetic mixtures, the observed background noise level corresponds to sequencing depths of around 0.30x (E. coli and K. pneumoniae) and 0.65x (S. epidermidis), which provides the bare minimum sequencing depth required to distinguish between the lineages of each species in samples with similar read lengths and sequencing depth.

Mahendra Mariadassou:
How does the method compare to other strain identification tools (like DUDes) that are not specific to plate sweeps? I expect the reduction in complexity induced by plate sweeps to benefit mSWEEP but it would be useful to prove it.

Author reply:
We thank you for pointing out other relevant strain identification tools. However, we would like to point out that our tool is designed to be used with large, bespoke reference databases with the goal of leveraging detailed information about strain-level variation in identification. This approach necessitates replacing the traditional alignment methods, such as bowtie2 used by for example DUDes and many other strain identification tools, with pseudoalignment to achieve practical runtimes.
Additionally, DUDes specifically utilizes a version of the NCBI taxonomy standard that is no longer in use, which restricts the applications of the method. The author points out this issue, as well as the scaling issue in bowtie2 alignment, in their doctoral dissertation ( http://dx.doi.org/10.17169/refubium-1123), published after the DUDes paper. To our knowledge, and based on the DUDes GitHub repository, DUDes has not been updated to address these issues nor, in fact, been maintained after its initial publication. In our opinion a comparison with mSWEEP is thus unwarranted and would likely add little of value to the comparisons already presented in our manuscript.
Moreover, other strain identification tools that make use of curated databases, such as subsets of the NCBI RefSeq database, are unsuited to analysing strain and lineage-level variation in clinical sequencing data due to the fact that for many of the strains the correct -both in-time and geographically -reference sequence is simply not necessarily included in the database. These types of databases are designed to provide a general overview of representative high-quality sequences for the species rather than something representative of the possibly highly localized variation, which renders their application beyond specieslevel identification conditional on the samples containing only distantly related strains. Although some methods do offer means to use customized reference databases, they are typically designed to be run with the precompiled databases and may not extend well when Indeed, bioinformatics tools to tap into the microbial diversity beyond genus and species level using metagenomics are limited. And particularly for those approaches involving an initial culture/plate medium step. The authors have closed this gap with mSWEEP but they have not taken advantage of this to mention in their title that the novel tool addresses strain/lineage diversity). Mathematical modelling, including maximum likelihood and binomial distances were used and formulas were shown. The author studied samples from a group of infected children to validate their system. They used MacConkey plates to obtain bacterial cells for metagenomics. Reference data comprised Campylobacter, Escherichia, Klebsiella, Staphylococcus. False positives were estimated and in general were below 0.3 (Fig 4; y axis %?). mSWEEP output is depicted in the frame of clonal complexes (CC; Figs 3, 7) which is useful in the context of previous studies on population structure of pathogens.
Some minor remarks to be incorporated in a revised version: Seven formulas were provided. However, the elements of each of these formulas were not explained. Please include an explanation in the new version.

○
The Hiseq sequencing coverage (and read numbers) were not informed. How many ilumina reads does one need to use mSWEEP with confidence? ○ Also another limitation is the need to select a reference genome for comparison. The sample may contain a novel genome, a novel pathogen. How does the system handle such possible situation? ○ It is not clear why certain CC, eg. Escherichia coli CC405 and CC95 are connected to Klebsiella pneumoniae CC36, CC45 (Fig 7). This figure seems rather mixed in taxonomic terms. Is this a result of co-occurrence? Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes naturally fail to identify the correct lineage. Based on our observations, the relative abundance estimates resulting from this scenario will be spread between the most closely related genomes that are contained in the reference which can, in some cases, produce results that appear deceptively similar to those from a real mixed sample.
While our method does not address this concern -other than indirectly in some specific cases through the detection threshold analysis described in the manuscript -there have been excellent further developments of our method in solving this exact problem after the manuscript was published. In particular, a supplementary computational method ( https://github.com/harry-thorpe/demix_check) has been developed for assessing whether the reference sequences contain representative sequences for the strains present in a (mixed) sample that has been processed by mSWEEP. We believe this tool neatly solves this particular limitation of our method and provides researchers an additional tool to properly judge whether the results of their analyses are correct or contain errors related to the presence of novel strains. We have rewritten a part of our discussion section to take into account the limitations pointed out by the reviewer and include our suggested solution. The relevant paragraph has been changed to the following: As with any method intended to identify sequence variation, the target species need to be relatively well known to allow building of sufficiently informative reference databases. Similarly, to allow for sensible and easily interpretable inferences, the biological clustering of the reference database should be based on well-established biological entities, such as multi-locus sequence types (STs) or clonal complexes (CCs) which are frequently employed as labels of lineages. These limitations suggest that mSWEEP is most relevant for extensively studied bacteria, such as pathogens, and has only limited applications when working with samples containing high numbers of novel or uncharacterized species. However, following the introduction of mSWEEP, computational tools have been developed to assess the suitability of the reference sequences to specific samples and to identify cases where mSWEEP struggles due to the presence of reads originating from novel genome sequences (https://github.com/harry-thorpe/demix_check). Consequently mSWEEP can also be used to estimate the quality of the reference collection and to discard contaminated or mis-identified genomes.

Fabiano L. Thompson:
It is not clear why certain CC, eg. Escherichia coli CC405 and CC95 are connected to Klebsiella pneumoniae CC36, CC45 (Fig 7). This figure seems rather mixed in taxonomic terms. Is this a result of co-occurrence?
Author reply: Indeed, the connections between the E. coli and K. pneumoniae clonal complexes are a result of co-occurrence in the samples. These samples were specifically selected for our study because they had been identified as contaminated or containing mixed species or strains during the culture step in another study aimed at isolating K. pneumoniae strains. As a result, many of the samples presented in our study are taxonomically diverse and contain several different species, or the strains of, that also grow on the types of plates that were used to culture K. pneumoniae. We thank the reviewer for pointing out the confusing parts in our text and have clarified the reasons for the co-occurrence in the text with the addition of the following sentence: Since both E. coli and A. baumanii grow on the media used for culture of K. pneumoniae, frequent