Accurate plasmid reconstruction from metagenomics data using assembly–alignment graphs and contrastive learning

Overview of PlasMAAG
The inputs to the PlasMAAG pipeline are a set of reads per sample. Reads are assembled with metaSPAdes (version 3.15.5)42, creating an assembly graph and contigs for each sample. The contigs across all samples are concatenated together to create the contig catalog. Reads are mapped to the catalog with minimap2 (version 2.24)43 and SAMtools (version 1.18)44, creating per-sample BAM files. The alignment graph is generated by aligning the contigs across samples with the National Center for Biotechnology Information (NCBI) basic local alignment search tool (BLAST; version 2.15.0)45. The assembly and alignment graphs are merged into the AAG. Then, fastnode2vec (version 0.05)35, an optimized version of node2vec46, is used to embed local AAG context of each contig into an embedding space. The k-mer composition and abundance features of contigs are embedding using a VAE, where an additional loss term is added, which penalizes distance between contigs of the same community. Using the VAE embeddings, communities are expanded, merged and purified. The geNomad tool (version 1.8.0)36 is used to separate plasmid from nonplasmid contigs; communities of plasmid contigs are extracted as separate bins, whereas the remaining contigs are extracted in bins using a clustering algorithm.
Benchmark datasets
We based our benchmark datasets on the existing CAMI2 human microbiome toy dataset, the CAMI2 marine dataset and the CAMI2 skin human microbiome dataset with addition of phage reads from 280 phage genomes from the Metagenomic Gut Virus (MGV) database47 (Supplementary Table 5). All CAMI2 datasets are publicly available short-read metagenomic benchmarks generated under the CAMI initiatives37,48. The human dataset consists of five simulated environments: airways, gastrointestinal, oral, skin and urogenital, designed to mimic human microbiome communities37, whereas the CAMI2 marine dataset represents a sea environment48. For each sample, we simulated 5 Gbp of 2× 150-bp Illumina-like reads from the source genomes using the CAMISIM software49. Compared to the original CAMI2 datasets, we had to make several changes. The original dataset did not provide assembly graphs; thus, we assembled the reads using metaSPAdes (version 3.15.5) and mapped the resulting contigs back to the CAMI2 source genomes or the MGV phage genomes to determine their origin using minimap2, accepting hits with an identity > 97% and a query coverage > 90%. Because this approach initially led to many unmapped or ambiguously mapping contigs because of overeager read correction in the assembler, we resimulated the reads using wgsim (https://github.com/lh3/wgsim) with zero sequencing errors and a default fragment size of 270 (mean)26 and disabled error correction in the assembler. Moreover, CAMI2 considered plasmids to be part of their cellular host genome with the same abundance, which would inhibit our abundance-based binning approach. Instead, we simulated plasmids as distinct genomes with their own abundance proportional to the host abundance, modulated by a Gaussian random variable that incorporates a plasmid copy number model, as performed previously17. For the samples containing phages, phage abundances were sampled from a lognormal distribution with mean of 1 and s.d. of 4. Lastly, CAMI2 did not contain reads simulated from across the edges of the underlying circular sequences, which prevents assembly graph cycles and hobbles graph peeling-based approaches such as that used by SCAPP. We made sure to include such reads and enable circular contigs by simulating reads spanning the junction between the start and end positions of the genomes. Our reassembled CAMI2 human and marine datasets were, like the original CAMI2 datasets, composed of ten samples for all datasets except urogenital, which had nine. Our datasets had the following number of contigs after retaining only contigs larger than 2 kbp: airways, 35,537; gastrointestinal, 80,325; oral, 94,251; skin, 31,181; urogenital, 29,550; marine, 150,565; skin with phages, 34,979.
Assembly graph edge weighting
Assembly graphs were extracted from the assembly_graph_after_simplification.gfa file from metaSPAdes and converted into a NetworkX (version 3.4.2)50 directed graph, with contigs represented as nodes and links between segments in contigs represented as edges. To enrich the assembly graph signal for binning, graph edges were weighted with the normalized linkage metric, which is dependent on the number of links between any segments from each pair of contigs, normalized by the length of the contigs. For a pair of contigs ci, cj, the number of links connecting those contigs n_linksij and the contig lengths lc, normalized linkage is expressed as
$$\mathrm{normalized}\,{\mathrm{linkage}}_{{c}^{i}{c}^{j}}=\frac{{n\mathrm{\_links}}_{{ij}}}{\min ({l}^{{c}^{i}},{l}^{{c}^{j}})}$$
Alignment graph edge weighting
After assembly, contigs shorter than 2,000 bp were discarded, as described previously24. Contigs were aligned all against all using NCBI BLAST using the ‘blastn’ command with ‘-perc_identity 95’, only keeping between-sample hits, alignment identity ≥ 98.0% and alignment ≥ 500 bp. We also removed alignments between sequences that contained large sections that did not align because of sequence diversity, as we wanted the alignments to represent shared sequences across samples. The remaining set of alignments after filtering was defined as ‘restrictive’ alignments. From the alignments, we created an alignment graph with contigs as nodes and alignments as edges. Edges were weighted with the normalized alignment metric to reflect the alignment certainty. For a pair of contigs ci, cj, alignment identity id, alignment length L and contig length lc, normalized alignment is expressed as
$$\mathrm{normalized}\,{\mathrm{alignment}}_{{c}^{i}{c}^{j}}=\frac{\mathrm{id}}{100}\frac{\min (L,{l}^{{c}^{i}},{l}^{{c}^{j}})}{\min ({l}^{{c}^{i}},{l}^{{c}^{j}})}$$
AAG community extraction with fastnode2vec
Assembly and alignment graphs share no edges because their edges connect only within-sample and between-sample contigs, respectively. This allowed us to trivially merge the graphs by adding the edges from one graph into the other, thus creating the AAG. Thus, the AAG is a single graph that contains the nodes and edges from the individual assembly graphs and the alignment graph. We aimed to leverage this integrated graph to enhance the binning signal, while keeping all individual nodes (that is, contigs) intact and unchanged. To extract communities from the AAG, we first ran fastnode2vec on the AAG to obtain contig embeddings. We created a new graph by linking contigs within a cosine distance of 0.1 in embedding space, after which we defined each connected component to be a contig community. We optimized the fastnode2vec hyperparameters and clustering radius to generate pure communities at the genome level, running a small grid search over the resimulated CAMI2 airways dataset. The embedding dimensions, walk length, number of walks, window size, p and q parameters from fastnode2vec were set to 32, 10, 50, 10, 0.1 and 2.0, respectively.
Contrastive-VAMB for community merging and expansion
Contrastive-VAMB is a variation of the original VAMB model, with a modification on the loss function to account for the communities extracted from the fastnode2vec embeddings. Contrastive-VAMB is composed of an encoder, latent representation layer μ and a decoder. Each contig represented by the concatenation of the contig coabundances along samples Cin, the tetranucleotide frequencies Tin and the unnormalized contig abundances Ain is passed to the encoder. While Cin and Tin were already used in the original VAMB model, the inclusion of Ain is motivated by the observation that the normalization in the computation of Cin removes abundance signal by removing a degree of freedom. This signal is reintroduced as Ain. This way, the network explicitly models two distinct notions of abundance: how much abundance there is in total and how it is distributed among samples. The encoder projects the contigs into a latent normal N(μ, I) distribution parametrized by the μ layer, from which the decoder samples. The decoder is optimized to reconstruct Cin, Tin and Ain from the instances sampled from N(μ, I), decrease the latent cosine distance between contigs with closely related fastnode2vec graph embeddings and decrease the deviance between the latent normal distribution N(μ, I) parametrized by the μ layer and the standard normal distribution used as a prior N(0, I).
Loss functions
The contrastive-VAMB loss can be decomposed in three terms: reconstruction loss, contrastive loss and regularization loss. The reconstruction loss (Lrec) penalizes the reconstruction error of Ain, Tin, and Cin. Like in VAMB, the sum of squared error (SSE) loss was used for Tin and cross-entropy (CE) loss was used for Cin. For Ain, PlasMAAG also uses SSE loss. These three terms are weighted with hyperparameters wA, wT and wC.
$${L}_{\mathrm{rec}}={w}_{{\bf{A}}}\mathrm{CE}\left({{\bf{A}}}_{\mathrm{in}},{{\bf{A}}}_{\mathrm{out}}\right)+{w}_{{\bf{T}}}\mathrm{SSE}\left({{\bf{T}}}_{\mathrm{in}},{{\bf{T}}}_{\mathrm{out}}\right)+{w}_{{\bf{C}}}\mathrm{SSE}\left({{\bf{C}}}_{\mathrm{in}},{{\bf{C}}}_{\mathrm{out}}\right)$$
The contrastive loss (Lcontr) penalizes the cosine distance between the VAMB-latent representations contigs highly related in fastnode2vec embedding space when such cosine distance overcomes a predefined margin m, where m is a hyperparameter. For a contig ci and highly related fastnode2vec embedding space contigs Hci = {n0,…, nn}, Lcontr is expressed as
$${L}_{\mathrm{contr}}=\max \left(\frac{{\sum }_{{n}_{i}\in {H}^{ci}}\mathrm{cosine}\,\mathrm{distance}\left({{\rm{m}}}^{ci},{{\rm{m}}}^{{n}_{i}}\right)}{\left|{H}^{ci}\right|}-m,0\right)$$
The regularization loss (Lreg) penalizes the deviance between the latent normal distribution N(μ, I) parametrized by the μ layer and the standard normal distribution used as a prior N(0, I) with the Kullback–Leibler divergence; as the s.d. is set to 1, Lreg can be simplified to
$${L}_{\mathrm{reg}}=\frac{1}{2}+\sum {{\rm{\mu }}}^{2}$$
Lastly, the model total loss (L) was aggregated with weighting hyperparameters wLreg, and wLcontr:
$$L={L}_{\mathrm{rec}}+{{w}_{{L}_{\mathrm{reg}}}L}_{\mathrm{reg}}+{{w}_{{L}_{\mathrm{contr}}}+L}_{\mathrm{contr}}$$
Clustering plasmid and organism candidates with geNomad
Two sequential strategies were implemented to cluster the latent space tailored to extract plasmids and nonplasmids. The clustering strategy is composed of two phases: community clustering and iterative medoid clustering, both using latent space cosine distances. Community clustering works in five steps (Supplementary Fig. 11). First, for each community extracted from the fastnode2vec embeddings, contigs belonging to the same community are linked and links between contigs with a VAE embedding cosine distance > 0.2 are removed. Second, contigs within 0.01 cosine distance are linked. Third, connected components are extracted as bins. Fourth, potentially circular contigs are detected by mapping read pairs, where mates map to opposite contig ends within 50 bp of the contig end. These are extracted to their own bins. Lastly, plasmid score is defined for each cluster by aggregating the geNomad plasmid contig scores with a contig length weighted mean, defining plasmid candidates when cluster scores are larger than the defined threshold. When the geNomad plasmid threshold is larger than 0.5, a fixed geNomad plasmid threshold of 0.5 is applied to the circular contigs, accounting for the circular evidence relatable to plasmids. The nonplasmid clustering strategy consists of two steps. First, the VAMB-latent space is clustered using the iterative medoid clustering algorithm from VAMB. Second, contigs belonging to any plasmid candidate cluster as defined by community-based clustering are removed. In summary, candidate plasmids are first identified using community-based clustering combined with geNomad cluster scoring. Contigs belonging to these plasmid bins are then removed from the bins generated by the density-based clustering strategy.
Binning benchmarking: CAMI2 reassembled
We compared the plasmid and organism binning performance of PlasMAAG version 1.0.0, VAMB version 4.1.3, CONCOCT version 1.1.0, MetaBAT2 version 2.12.1, SemiBin2 version 2.1.0, Comebin version 1.0.4, MetaDecoder version 1.0.19, SCAPP version 0.1.4 and MetaPlasmidSPAdes version 3.15.3 over the resimulated CAMI2 datasets. Binning performance was evaluated in terms of the total number of genomes recovered across all samples with precision ≥95% and recall ≥90% (so-called NC genomes). As PlasMAAG, VAMB, MetaBAT2, SemiBin2, Comebin and MetaDecoder perform the binning after assembling the contigs, precision and recall of the bins were obtained from the contig references using BinBencher (version 0.3.0)51. On the other hand, SCAPP and MetaPlasmidSPAdes assemble their own contigs. Here, we produced a ground truth by aligning the output bins to the origin genomes using NCBI BLAST 2.15.0 accepting hits with an identity >97% and a query coverage > 90%, after which we benchmarked using BinBencher.
Sample benchmarking: CAMI2 reassembled
Plasmid purity rate (PPR), plasmid recovery rate (PRR) and F1 were computed for each set of plasmid candidates, reflecting the plasmid characterization at the sample level, not at the bin level. PPR was calculated as the fraction of candidate plasmids reconstructed above precision and recall thresholds in the total number of candidate plasmids output by the method. PRR was calculated as the fraction of candidate plasmids reconstructed above precision and recall thresholds divided by the total number of plasmids with coverage above the recall threshold. Given a sample (s), a set of plasmid candidates (candidates), binning precision and binning recall thresholds (pre, rec) and the set of true plasmids present in the sample (plasmids) are related by the following expressions:
$$\begin{array}{l}{\mathrm{PPR}}_{\mathrm{candidates},\mathrm{pre},\mathrm{rec}=\displaystyle \frac{\mathrm{no}.\,\mathrm{of}\,\mathrm{binner}\,\mathrm{plasmid}\,\mathrm{candidates} > (\mathrm{pre}\,,\mathrm{rec})}{\mathrm{no}.\,\mathrm{of}\,\mathrm{binner}\,\mathrm{candidates}}}\end{array}$$
$${\mathrm{PRR}}_{\mathrm{candidates},\,\mathrm{plasmids},\,\mathrm{pre},\,\mathrm{rec}}=\frac{\mathrm{no}.\,\mathrm{of}\,\mathrm{binner}\,\mathrm{plasmid}\,\mathrm{candidates} > (\mathrm{pre},\,\mathrm{rec})}{\mathrm{no}.\,\mathrm{of}\,\mathrm{recoverable}\,\mathrm{plasmids}\,\mathrm{in}\,\mathrm{sample}}$$
Lastly, F1 represents the harmonic mean between the PPR and PRR:
$${{F}_{1}}_{\mathrm{PPR},\,\mathrm{PRR}}=\frac{2\times \mathrm{PPR}\times \mathrm{PRR}}{\mathrm{PPR}+\mathrm{PRR}}$$
Hospital sewage sample sequence datasets
Two biological datasets were used in this study to assess the quality of plasmid binning. Urban sewage samples (UWSs) were collected from comparable UWSs from Denmark and Spain located in Odense and Santiago de Compostela, as previously described40. In this study, only hospital sewage samples from each location were used. Sewage samples were collected in the winter and summer of 2018 using ISCO automatic samplers for 24-h flow (50 ml per 5 min) in Denmark and using 24-h time-proportional samples in Spain (mixing hourly samples according to flow information) (Supplementary Table 8). Three replicates per site and season were collected on three consecutive days without rain. All samples were initially cooled with ice on site. Then, 100 ml of each sample was centrifuged at 10,000g for 8 min at 4 °C in the laboratory. After removing supernatant, pellets were resuspended in 20% of glycerol stock to reach a final volume of 10 ml for storage at −80 °C. In total, environmental DNA was extracted from all samples using the NucleoSpin soil kit (Macherey-Nagel) using 500 μl of glycerol stock material for direct shotgun metagenomic using Illumina NovaSeq using 2× 150-bp paired-end mode (all 18 samples) and PacBio Sequel2e (five samples from Denmark). PacBio libraries were built from the same DNA extracts using libraries using SMRTbell express template 2.0 kit and Sequel II binding kit 3.2 (Pacific Bioscience) and barcoded using SMRTbell barcoded adaptor plate 3.0 (Pacific Bioscience). Two libraries per 8M SMRTcell (Pacific Bioscience) were pooled and sequenced on a PacBio Sequel2e instrument at the University of Copenhagen.
For plasmid-enriched samples, we used specific methods to deplete nonplasmid DNA as described previously52,53. Briefly, hospital sewage samples were pretreated by filtration, vortex and sonication and resuspended in TE buffer. Afterward, a prelysis cocktail of cell-wall-degrading enzymes (lysozyme, mutanolysin and lysostaphin) was used to facilitate lysis of Gram-positive bacteria during alkaline lysis. Prelysis was followed by alkaline lysis to remove chromosomal DNA54, followed by Plasmid-Safe ATP-dependent DNase (Lucigen) digestion. Plasmid-Safe DNase will digest any fragments of dsDNA with open 3′ or 5′ termini, thereby removing fragmented chromosomal DNA. The purified plasmid DNA was then quality-checked and libraries were prepared and sequenced on an Illumina NextSeq platform with a v2.5 sequencing kit (Illumina) in paired-end mode.
Binning benchmarking: hospital sewage
We compared the binning performance of PlasMAAG version 1.0.0, VAMB version 4.1.3, MetaBAT2 version 2.12.1, SemiBin2 version 2.1.0, Comebin version 1.0.4, CONCOCT version 1.1.0, MetaDecoder version 1.0.19, metaplasmidSPAdes version 3.15.3 and SCAPP version 0.1.4 on the hospital sewage samples. The dataset consisted of 147,437 short-read contigs assembled from five hospital sewage samples. Each sample was collected at a unique combination of time and location. To evaluate binning performance on these samples, we generated high-quality long-read sequences from the same samples using HiFi sequencing on a Pacific BioSciences Sequel II platform. We assembled the sequences to long-read contigs with hifiasm_meta version 0.3-r073. To determine which short-read contigs corresponded to long-read contigs, we mapped short-read contigs to long-read contigs using minimap2 version 2.24, handling rotated circular contigs by concatenating each long-read contig to itself. We accepted hits with an identity > 97% and a query coverage > 90%. To quantify short-read binning, we counted ‘NC long-read assemblies’ summed across all samples: long-read contigs where a bin of short-read contigs existed such that 90% of the long-read contig was covered with short-read contigs from that bin and 95% of the bin’s base pairs were mapped to no other long-read contig. To evaluate the overall binning performance, the entire set of long-read contigs was used to build the reference, whereas, to evaluate the plasmid binning performance, only the long-read contigs that were circular or with metaplasmidomics reads coverage > 50% were used. Metaplasmidomics short reads were derived from the same samples but underwent a filtering step to enrich for plasmids, thereby representing highly plasmid-enriched sequencing data. NC long-read assemblies were counted with BinBencher version 0.3.0 and NC cellular genomes were estimated with CheckM2 version 0.1.3.
Host–plasmid and intraplasmid diversity exploration
PlasMAAG was used to bin the contig sequences from hospital sewage samples from hospitals in Spain. The dataset consisted of n = 828,260 contigs assembled from n = 24 hospital sewage samples. Each sample was collected at a unique combination of time and location. PlasMAAG bins were aggregated into PlasMAAG clusters and classified as plasmids if the aggregated geNomad plasmid score exceeded 0.75. Only plasmids clusters with more than 150 kb were considered for the host–plasmid association. Organism bin quality was estimated with CheckM2 version 0.1.3 and only high-quality (completeness ≥ 70% and precision ≥ 90%) bins were kept. GTDB-tk (version 2.4.0)55 was used to estimate taxonomy for the high-quality bins, with cluster taxonomy assigned on the basis of majority vote. Abundance correlation analysis was conducted for plasmids and organism clusters with nonzero abundance over at least 18 of 24 overlapping samples. Spearman correlation coefficients and P values were computed using SciPy’s scipy.stats.spearmanr with default settings, which tests the alternative hypothesis of a nonzero correlation. To account for multiple testing, P values were corrected using the Benjamini–Hochberg, which attempts to correct the false discovery rate, implemented in the statsmodels package. Plasmid cluster hosts were inferred from PLSDB (version 2024_05_21_v2)56 when aligning to any PLSDB entry with >80% identity and >80% coverage. Functional annotations of contigs were performed with anvi’o (version 8)57 software, using the ‘anvi-run-workflow -w contigs’ command.
Resource usage
We evaluated computational resource usage of all methods using the CAMI2 airways reassembled dataset and five samples from the hospital sewage dataset. For the airways dataset, PlasMAAG took 46 min, using 8 threads and 16 GB of RAM. By contrast, SCAPP, excluding the BAM file generation step, took 192 min, using 16 threads and 24 GB of RAM (Supplementary Table 6). Among the other binners, PlasMAAG was slower than VAMB, MetaDecoder and MetaBAT2. For example, VAMB completed the task in just 8 min while using 8 threads and 16 GB of RAM. However, we observed a different trend when evaluating performances on the five hospital sewage samples. When accounting for the additional steps of read assembly and read mapping required to compute abundances, PlasMAAG exhibited similar runtimes to most binners, except for SCAPP, which required much more time. Specifically, PlasMAAG took 3,575 min, VAMB took 3,435 min, ComeBin required 4,911 min and metaplasmidSPAdes took 4,430 min (Supplementary Table 7). By contrast, SCAPP required 116,965 min, 32 times longer than PlasMAAG. This difference in runtime remained consistent even when excluding the read assembly steps (Supplementary Table 8).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


