Improving metagenome binning by integrating intrinsic features and taxonomy

Bimodal VAE
The VAE is a generative model performing variational inference over the latent variable z. The model is formally defined as p(x,z) = p(z)p(x|z). The intractable posterior q(z|x) and the conditional distribution p(x|z) are approximated by neural networks using the ELBO-loss function:
$$\begin{array}{l}{\mathcal{L}}={{\mathbb{E}}}_{q\left({z|x}\right)}\left[\log p\left({x|z}\right)\right]-\mathrm{KL}\left(q\left({z|x}\right){||}{\mathscr{N}}\left(0,I\right)\right)\end{array}$$
(1)
The bimodal VAE extends the basic VAE by allowing training and inference on the dataset where (1) the input consists of two modalities and (2) a modality can be missing for one or more samples. Thus, notice that, while VAMB is trained on both TNFs and abundances, we do not define it as bimodal for the purpose of this summary, as both TNFs and abundances are present for all samples and can be converted into one modality by concatenating the corresponding input vectors.
While the VAE approximates the posterior q(z|x) with a neural network encoder that takes x as an input, the bimodal VAE extends this approach by modeling q(z|x1,x2), q1(z|x1) and q2(z|x2), which replace the single q(z|x). There are two decoders approximating distributions p(x1|z) and p(x2|z). Multimodal VAEs differ in (1) the way they approximate q(z|x1,x2), q1(z|x1) and q2(z|x2) by neural networks and/or (2) the structure of the loss function.
TaxVAMB implements the VAEVAE50 model from the bimodal VAE family, which models q(z|x1,x2), q1(z|x1) and q2(z|x2) by corresponding neural networks. The following ELBO-like loss L is minimized:
$$\begin{array}{l}{ {\mathcal L} }_{\text{paired}}{=}{{\mathbb{E}}}_{{p}_{\text{paired}}({x}_{1},{x}_{2})}[{{\mathbb{E}}}_{q(z| {x}_{1},{x}_{2})}[\log {p}_{1}({x}_{1}{|z})+\log {p}_{2}({x}_{2}{|z})]\\ -\mathrm{KL}(q({z|}{x}_{1},{x}_{2})\Vert p({z|}{x}_{1}))-\mathrm{KL}(q({z|}{x}_{1},{x}_{2})\Vert p({z|}{x}_{2}))]\\ +{{\mathbb{E}}}_{{p}_{\text{paired}}({x}_{1})}[{{\mathbb{E}}}_{q\left(z,|,{x}_{1}\right)}[\log {p}_{1}({x}_{1}{|z})]-\mathrm{KL}(q({z|}{x}_{1})\Vert p(z))]\\ +{{\mathbb{E}}}_{{p}_{\text{paired}}({x}_{2})}[{{\mathbb{E}}}_{q(z| {x}_{2})}[\log {p}_{2}({x}_{2}{|z})]-\mathrm{KL}(q({z|}{x}_{2})\Vert p(z))]\\ { {\mathcal L} }_{1}{=}{{\mathbb{E}}}_{{p}_{\text{unpaired}}({x}_{1})}[{{\mathbb{E}}}_{q(z| {x}_{1})}[\log {p}_{1}({x}_{1}{|z})]-\mathrm{KL}(q({z|}{x}_{1})\Vert p(z))]\\ { {\mathcal L} }_{2}{=}{{\mathbb{E}}}_{{p}_{\text{unpaired}}({x}_{2})}[{{\mathbb{E}}}_{q(z| {x}_{2})}[\log {p}_{2}({x}_{2}{|z})]-\mathrm{KL}(q({z|}{x}_{2})\Vert p(z))]\\ {\mathcal L} {=}{ {\mathcal L} }_{\text{paired}}+{{\mathcal{L}}}_{1}+{{\mathcal{L}}}_{2}\end{array}$$
with DKL (p(x)||q(x)) being the Kullback–Leibler divergence between two probability distributions p(x) and q(x), defined as:
$$\begin{array}{l}\mathrm{KL}\left(p\left(x\right)\mathrm{||}q\left(x\right)\right)={\int }_{-\infty }^{\infty }p\left(x\right)\log \left(\frac{p\left(x\right)}{q\left(x\right)}\right){\rm{d}}x\end{array}$$
(2)
The training procedure includes constructing the dataset with paired and unpaired samples. Let C be a list of all contigs. Three copies of C, denoted as Cpaired, C1 and C2, are independently shuffled. The paired samples are ordered tuples (x1,x2) where x1 is a concatenation of TNF vector and abundance vector (the input of VAMB) and x2 is a taxonomy label vector and x1 and x2 correspond to the same contig from the set Cpaired. An unpaired TNFs and abundances vector x1 corresponds to a contig from the list C1. An unpaired taxonomy label x2 corresponds to a contig from the list C2. The forward pass follows the steps from Algorithm 1.
Algorithm 1Loss computation (forward pass)
Require: Paired sample (\({x}_{1},{x}_{2}\)), unpaired sample \({x{\prime} }_{1}\), unpaired sample \({x{\prime} }_{2}\)
1: \(z{\prime} \sim q\left({z|}{x}_{1},{x}_{2}\right)\)
2: \({z}_{{x}_{1}}\sim {q}_{1}\left({z|}{x}_{1}\right)\)
3: \({z}_{{x}_{2}}\sim {q}_{2}\left({z|}{x}_{2}\right)\)
4: \({d}_{1}={D}_{\mathrm{KL}}(q(z{\prime} |{x}_{1},{x}_{2})\parallel {q}_{1}({z}_{{x}_{1}}|{x}_{1}))+{D}_{\mathrm{KL}}({q}_{1}({z}_{{x}_{1}}|{x}_{1})\parallel p(z))\)
5: \({d}_{2}={D}_{\mathrm{KL}}(q(z{\prime} |{x}_{1},{x}_{2})\parallel {q}_{2}({z}_{{x}_{2}}|{x}_{2}))+{D}_{\mathrm{KL}}({q}_{2}({z}_{{x}_{2}}|{x}_{2})\parallel p(z))\)
6: \({L}_{\mathrm{paired}}=\log {p}_{1}({x}_{1}{|z})+\log {p}_{2}({x}_{2}{|z})+\log {p}_{1}({x}_{1}|{z}_{{x}_{1}})\) \(+\log {p}_{2}({x}_{2}|{z}_{{x}_{2}})+{d}_{1}+{d}_{2}\)
7: \({L}_{{x}_{1}}=\log {p}_{1}({x{\prime} }_{1}|{z}_{{x}_{1}})+{D}_{\mathrm{KL}}({q}_{1}({z}_{{x}_{1}}|{x{\prime} }_{1})\parallel p(z))\)
8: \({L}_{{x}_{2}}=\log {p}_{2}({x{\prime} }_{2}|{z}_{{x}_{2}})+{D}_{\mathrm{KL}}({q}_{2}({z}_{{x}_{2}}|{x{\prime} }_{2})\parallel p(z))\)
9: \(L={L}_{\mathrm{paired}}+{L}_{{x}_{1}}+{L}_{{x}_{2}}\)
Data preprocessing
The workflow of preprocessing the data is the same as in Taxometer (version 5.0.4)42 and VAMB (version 5.0.4)11. The synthetic short paired-end reads from each sample were aligned using bwa-mem (version 0.7.15)73. BAM files were sorted using SAMtools (version 1.14)74. Contigs ≤ 2,000 bp were removed for each dataset. The long-read datasets were both sequenced using Pacific Biosciences HiFi technology. We assembled each sample using hifiasm-meta (version 0.3.1)3, mapped reads using minimap2 (version 2.24)75 with the ‘-ax map-hifi’ setting and then continued with the same workflow as with the short reads.
Abundances and TNFs
The workflow of computing abundances and TNFs is the same as in Taxometer (version 5.0.4)42 and VAMB (version 5.0.4)11. Computation of abundances and TNFs was performed using the VAMB metagenome binning tool11. To determine TNFs, tetramer frequencies of nonambiguous bases were calculated for each contig, projected into a 103-dimensional orthonormal space and normalized by z-scaling each tetranucleotide across the contigs. To determine the abundances of each sample, we used pycoverm (version 0.6.0; https://github.com/apcamargo/pycoverm/tree/main). The abundances were first normalized within the sample by the total number of mapped reads and then across samples to sum to 1. To determine absolute abundance, the sum of abundances for a contig was taken before the normalization across samples. The dimensionality of the feature table was then Nc × (103 + Ns + 1), where Nc is the number of contigs and Ns is the number of samples.
Network architecture and hyperparameters
The encoder architectures for the concatenated vector of abundances and TNFs is the same as in Taxometer (version 5.0.4)42 and VAMB (version 5.0.4)11. The input vector of dimensionality Nc × (103 + Ns + 1) was passed through four fully connected layers ((103 + Ns + 1) × 512, 512 × 512, 512 × 512, 512 × 512) with leaky ReLU activation function (negative slope 0.01), each using batch normalization (ϵ 1 × 10−5, momentum 0.1) and dropout (P = 0.2).
The encoder network for the taxonomy labels had the input dimensions of Nl, where Nl is the number of leaves in the taxonomic tree. The input vector was passed through four fully connected layers (Nl × 512, 512 × 512, 512 × 512, 512 × 512) with leaky ReLU activation function (negative slope: 0.01), each using batch normalization (ϵ 1 × 10−5, momentum 0.1) and dropout (P = 0.2).
The encoder network for the concatenation of the two modalities had the input dimensions of (103 + Ns + 1) + Nl, where Ns is the number of samples and Nl is the number of leaves in the taxonomic tree. The input vector was passed through four fully connected layers ((103 + Ns + 1) × 512, 512 × 512, 512 × 512, 512 × 512) with leaky ReLU activation function (negative slope: 0.01), each using batch normalization (ϵ 1 × 10−5, momentum 0.1) and dropout (P = 0.2).
The bimodal VAE has two decoder networks, one for each modality. Both of them follow the same architectures as the corresponding encoders, with the input vector having the dimensionality of the latent space and the output having the dimensionality of the corresponding modality.
For short-read datasets, the network was trained for 300 epochs with batch size 256 and latent space dimensionality 32. For long-read datasets, the network was trained for 1,000 epochs with batch size 1,024 and latent space dimensionality 64. All models were using the Adam optimizer with learning rates set through D-Adaptation76. The model was implemented using PyTorch (version 1.13.1)77 and CUDA (version 11.7.99) was used when running on a V100 GPU.
Hierarchical loss
The hierarchical loss is the same as in Taxometer (version 5.0.4)42. A phylogenetic tree was constructed for each dataset from the taxonomy classifier annotations for the set of contigs. Thus, the resulting taxonomy tree T was a subgraph of a full taxonomy and the space of possible predictions was restricted to the taxonomic identities that appeared in the search results. For the above experiments, we used a flat softmax loss. Let Nl be the number of leaves in the tree T. The likelihoods of leaf nodes of the taxonomy tree were obtained from the softmax over the network output layer with dimensionality 1 × Nl. The likelihood of an internal node was then a sum of likelihoods of its children and computed recursively bottom-up. The model output was a vector of likelihoods for each possible label. For the backpropagation, the negative log-likelihood loss was computed for all the ancestors of the true node and the true node itself. Predictions were made for all taxonomic levels and, for each level, the node descendant with the highest likelihood was selected. If no node descendant had likelihood > 0.5, the predictions from this level and the levels below were not included in the output.
Taxonomic classifiers
We obtained the taxonomic annotations for contigs of all nine short-read and two long-read datasets from MMseqs2 (version 17.b804f)33, Metabuli (version 1.1.0)78, Centrifuge (version 1.0.4.2)35 and Kraken2 (version 2.1.3)32. For MMseqs2, we used the mmseqs taxonomy command. For Metabuli, we used the metabuli classify command with ‘–seq-mode 1’ flag. For Centrifuge, we used the centrifuge command with ‘-k 1’ flag. For Kraken2, we used the kraken command with ‘–minimum-hit-groups 3’ flag. MMseqs2 and Metabuli were configured to use GTDB version 220 as the reference database. Centrifuge and Kraken2 were configured to use NCBI identifiers, release 229. All the taxonomic annotations were first refined with Taxometer42 (version 5.0.4) with the default parameters (epochs 100, batch size 1,024). For datasets in Figs. 4 and 5, the MMseqs2 classifier configured with GTDB was used for all datasets; for the wheat phyllosphere dataset, we used Kraken2 (version 2.1.3) configured with NCBI.
Benchmarked binners
The Metabat (version 2.17-66-ga512006) ‘metabat’ command with the default parameters was used. Metadecoder (version 1.0.19) ‘coverage’, ‘seed’ and ‘cluster’ commands were used as described on GitHub (https://github.com/liu-congcong/MetaDecoder). The Comebin (version 1.0.4) ‘run comebin.sh’ command with default parameters was used. ‘Comebin (single)’ indicates the use of Comebin in a single-sample mode. The SemiBin2 (version 2.2.1) ‘multi_easy_bin’ command was used with the flags ‘–engine gpu –separator C -t 20 –write-pre-reclustering-bins and –self-supervised’. VAMB, AVAMB and TaxVAMB were run as a part of the VAMB codebase (version 5.0.4), with the corresponding commands ‘vamb bin default’, ‘vamb bin avamb’ and ‘vamb bin taxvamb’. The workflows are available on GitHub (https://github.com/RasmussenLab/TaxVamb-Benchmarks/). The log files for the failed runs are also available on GitHub (https://github.com/RasmussenLab/TaxVamb-Benchmarks/tree/main/log_files_for_crashed_runs).
Reclustering using SCGs
Short-read and long-read reclustering algorithms that used SCGs were the same as introduced in SemiBin2 (ref. 21). The code was adapted from the SemiBin2 codebase (https://github.com/BigDataBiology/SemiBin/blob/main/SemiBin/longread_cluster.py and https://github.com/BigDataBiology/SemiBin/blob/main/SemiBin/cluster.py) for the TaxVAMB codebase (https://github.com/RasmussenLab/misc_scripts/tree/c5b483a/reclustering). TaxVAMB used the same 107 single-copy marker genes as used in SemiBin2 to estimate the completeness, contamination and F1 score of every bin. Completeness for each bin was calculated as \(\frac{N}{107}\), contamination was calculated as \(\frac{G-N}{G}\) and F1 score was calculated as \(\frac{2\times \mathrm{completeness}\times (1-\mathrm{contamination})}{\mathrm{completeness}+(1-\mathrm{contamination})}\), where 107 is the number of different SCGs in a bin and G is the total number of sequences matching any SCG.
For the short-read datasets, k-means-based reclustering of TaxVAMB and VAMB clusters was performed. Bins where more than one marker gene of the same kind was present were reclustered with the weighted k-means method using the contigs containing the repeated marker gene as the initial centroids. This resulted in bins with reduced contamination. For the long-read datasets, the DBSCAN algorithm from Python library scipy (version 1.10.0) was used to perform the clustering from scratch (the previous clusters, made by TaxVAMB/VAMB, were not used). As in SemiBin2, DBSCAN was run with ϵ values of 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5 and 0.55. From all resulting bins, the best one (F1 score) was recursively selected and its contigs were removed from all the remaining bins, after which the selection of the best bin was repeated. This was repeated until no more bins fulfilled the criteria for minimal quality (completeness > 90%, contamination < 5%). One change that was made in the TaxVAMB long-read reclustering was that it performed the described procedure per set of contigs assigned to the same genus by the Taxometer refinements of the provided taxonomic annotations.
CAMI2 benchmarks
For short-read benchmarking, we used five CAMI2 datasets: Airways (ten samples), Oral (ten samples), Skin (ten samples), Gastrointestinal (ten samples) and Urogenital (nine samples), the assemblies of which were sample-specific. The CAMI2 datasets contain the following number of genomes with nonzero abundance: Oral, 799; Skin, 610; Urogenital, 254; Gastrointestinal, 268; Airways, 828. The unique number of genomes per sample is listed in Supplementary Table 5. We benchmarked the following binners on the synthetic CAMI2 toy human microbiome dataset: Metabat10, MetaDecoder79, COMEBin22, SemiBin2 (ref. 21), AVAMB28 and VAMB11. We used taxonomic labels from four taxonomic classifiers as an input to TaxVAMB: MMSeqs2 (ref. 33), Metabuli77, Kraken2 (ref. 35) and Centrifuge78. AVAMB, VAMB and TaxVAMB bins were postprocessed with reclustering using SCGs. We used the number of HQ bins and assemblies estimated using BinBencher (version 0.3.0)54 as a metric. For the MAG taxonomic annotation experiment, we used CheckM2 (version 1.0.2)40. We benchmarked using BinBencher (version 0.3.0)54 against a reference computed from the CAMI2 ground truth. The metrics used were the numbers of HQ (defined as recall ≥ 0.9, precision ≥ 0.95) assemblies or genomes. As defined in the BinBencher paper, precision was counted as the number of true positive mapping positions for each genome–bin pair, divided by the total number of positions in a bin. For genomes, the recall was counted relative to the full length of the genome from which the reads were simulated from, whereas, when counting assemblies, the recall was relative to the assembled part of those genomes (that is, the part of the genomes covered by a contig that was used as input to the binner). The number of HQ genomes reflects the MAG quality relative to the underlying biological organism and, thus, depends more on limitations of the dataset, whereas the assembly metric may better reflect the methodological gains from using a different algorithm.
Short-read real data benchmarks
For benchmarking using real short-read datasets, we used the following: sea water with five samples80, bee hives with 18 samples81, forest soil with 12 samples82, rhizosphere with ten samples83, human saliva with 15 samples84 and vaginal microbiome with ten samples85. We assembled each sample using metaSPAdes (version 4.2.6)86 and mapped reads using minimap2 (version 2.24)87 with the ‘-ax sr’ setting. We used taxonomic labels from four taxonomic classifiers as an input to TaxVAMB: MMSeqs2 (ref. 33), Metabuli78, Kraken2 (ref. 32) and Centrifuge35. MMSeqs2 was evaluated with three databases: GTDB, TrEMBL88 (January 2025 release) and Kalmari89 (version 3.7). For evaluating the quality (completeness and contamination) of the resulting MAGs, we used CheckM2 (version 1.0.2)40. For detecting chimeric genomes, we ran GUNC (version 1.0.61)55 using the ‘gunc run’ command. The numbers of sequencing reads for each dataset and sample are listed in Supplementary Table 6.
Long-read benchmarks
For long-read benchmarking, we used a human gut microbiome dataset with four samples and a dataset from anaerobic digester sludge with three samples90, both sequenced using Pacific Biosciences HiFi technology. We assembled each sample using hifiasm-meta (version 0.3.1)3, mapped reads using minimap2 (version 2.24)87 with the ‘-ax map-hifi’ setting and, from there, proceeded as with the short reads. For evaluating the quality (completeness and contamination) of the resulting MAGs, we used CheckM2 (version 1.0.2)40. For detecting chimeric genomes, we ran GUNC (version 1.0.61)55 using the ‘gunc run’ command.
Multisample scaling
For the experiment that evaluated the number of bins given a different number of samples, we used a short-read human gut dataset with 1,000 samples from Almeida et al.58, as well as our own wheat phyllosphere dataset with 211 samples. For each dataset, we split all the samples into three sets of chunks: (1) chunks of 100 samples; (2) chunks of ten samples; and (3) chunks of one sample. Each chunk was treated as an independent dataset. We then summed the resulting number of HQ bins within each set of chunks. Taxonomic annotations were performed with MMseqs2 for the human gut dataset from Almeida et al. and with Kraken2 for the wheat phyllosphere dataset.
Taxonomic annotation validation: k-fold evaluation
All contigs in a dataset that were annotated by a classifier were randomly divided into five folds. Taxometer was then trained five times, each time using one fold as a validation set and the remaining four folds as the training set. Predictions were generated for the five validation sets after training on the remaining folds. The five validation sets were then concatenated to reconstruct the full dataset. This ensured that every contig received a prediction that was made without prior knowledge of its classifier annotation.
The evaluation metric was defined as the fraction of correctly predicted contigs on the domain level (Bacteria, Archea, etc.) over all contigs, while accounting for the fact that some ground-truth annotations may be missing. In other words, the score reflects how many predictions match the available ground-truth annotations, normalized by the total number of contigs in the dataset.\(\mathrm{Accuracy}=\frac{\mathrm{No}.\,\mathrm{of}\,\mathrm{correct}\,\mathrm{predictions}\,\left(\mathrm{where}\,\mathrm{ground}\,\mathrm{truth}\,\mathrm{exists}\right)}{\mathrm{total}\,\mathrm{number}\,\mathrm{of}\,\mathrm{contigs}\,\mathrm{in}\,\mathrm{dataset}}\) where a correct prediction was when the Taxometer output matched the ground-truth classifier annotation. If no ground-truth annotation was available for a contig, the contig was excluded from the numerator (as correctness cannot be determined) but still included in the denominator to reflect the fact that missing values exist in the data. This metric, thus, provides the overall share of correct predictions across the dataset. The command can be accessed in the codebase as vamb taxonomy_benchmark.
Bin annotations for CAMI2 MAGs
For taxonomic classification evaluation in Supplementary Figs. 9 and 10, we used Kraken2 configured with the NCBI database and compared its performance to GTDBtk, which provided GTDB annotations. Rather than directly comparing these two annotation sets, we evaluated both against ground-truth annotations from CAMI2 datasets. The original ground-truth taxonomy annotations were provided as NCBI identifiers as part of the CAMI2 challenge dataset. To establish ground-truth GTDB annotations for the CAMI2 datasets, we ran GTDBtk on the provided CAMI2 ground-truth genomes, obtaining complete annotations down to the species level. We manually verified several genomes to ensure consistency with NCBI annotations. Given that CAMI2 genomes are part of public databases, we have high confidence in the quality of the GTDBtk annotations and, therefore, used them as ground truth. In this experimental design, Kraken2 annotations were compared to the ground truth provided by CAMI2, while GTDBtk bin annotations were compared with GTDBtk ground-truth genome annotations.
Human gut (irritable bowel syndrome) dataset: sample collection and processing
Human fecal samples were collected from four healthy individuals and 20 persons with irritable bowel syndrome. In total, 1–5 samples were collected from each individual, yielding a total of 70 samples. DNA was extracted from fecal samples using a bead-beat micro AX gravity kit (A&A Biotechnology) according to the manufacturer’s instructions and the extracts were further purified using phenol–chloroform extraction and ethanol precipitation according to an established protocol91. Samples were sequenced by NovoGene, who prepared PCR-based libraries and generated 150-nt paired-end sequencing data on the NovaSeq 6000 platform. Sequencing reads were quality-controlled and adaptor-trimmed using trim_galore (version 1.15), which used cutadapt (version 3.6.9)92. The default quality threshold (Phred score: 20) was used but a further 16 and 6 nt were trimmed from the 5′ and 3′ ends of the reads, respectively, as this setting was found to yield better contiguity in assemblies in some benchmarking runs. BMtagger (version 1.1.0)93 was then used to remove potentially human reads from the sequencing data using GRCh38.p13 as a reference database. Assembly was performed using SPAdes (version 3.15.4)94.
Wheat phyllosphere dataset: sample collection and processing
A total of 24 field plots of Triticum aestivum were sampled by collecting composite samples of 30 flag leaves nine times between June 7 and July 14, 2022, at a field trial in Ringsted, Denmark. The experimental design included three wheat cultivars, four replicates and two treatments, which were unsprayed and sprayed with a fungicide. The samples were washed in 100 ml of wash solution (0.9% NaCl + 0.05% Tween-80), vigorously shaken for 2 min, sonicated for 2 min, vigorously shaken again for 2 min, filtered (10 µm) and centrifuged (4,000g, 15 min); the pellet resuspended in 1 ml of 1× PBS and stored at −20 °C until DNA extraction using the FastDNA SPIN kit (MP Biomedicals) for soil according to instructions, eluting in 100 µl of DES. DNA libraries were build using the Illumina Nextera XT kit (Illumina) but samples with <0.1 ng µl−1 DNA were built with a onefold-diluted amplicon tagment mix, 20 PCR cycles and a higher ratio of AMPure XP beads (Beckman Coulter)95. Libraries were sequenced using Illumina paired-end (2 × 150 bp) technology (NovaSeq 6000 S4 version 1.5).
Wheat phyllosphere dataset: data analysis
Raw sequencing reads were filtered using fastp (version 023.2)96 with the option ‘–trim_tail 1 –cut_tail –trim_poly_g –dedup –length_required 80’. Quality control of the filtered reads was assessed using MultiQC (version 1.12)97. To remove reads originating from wheat or potential human contamination, the reads were mapped to the reference sequences GCF_018294505.1, MG958554.1 and GCF_000001405.40 (GRCh38.p14). Mapping was performed using Bowtie (version 2.5.3)98. Paired reads where both mates were unmapped were extracted using SAMtools (version 1.18)73 with the ‘fastq -f 13’ option. Metagenomic assemblies were generated for each sample using SPAdes (version 3.15.4)94 with the ‘–meta -k 21,33,55,77,99’ option. Assembly statistics were computed using QUAST (version 5.2.0)99. MAGs were created with TaxVAMB using Kraken2 taxonomic annotations based on NCBI. The MAGs, for which the majority of contigs were annotated as Eukaryotes, were tested for completeness with BUSCO (version 5.8.2)67. Additionally, to build taxonomic trees, MAGs were assigned the taxonomy using GTDBtk (version 2.4.0) configured with the GTDB database version 220. Taxonomic trees were built using ggtree (version 3.19)100, tidytree (version 0.4.6) and treeio R (version 4.4.1) libraries. A two-sample Mann–Whitney U-test was performed on P. agglomerans abundances by splitting the samples into two groups: 143 samples from the earlier days (June 7, 10, 14, 17 and 21, 2022) and 103 samples from the later days (June 28 and July 4, 7 and 14, 2022) using scipy (version 1.10.0).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



