Structure-centric searching enables global mapping of the public metabolome


FASSTrecords database construction

The entire workflow was set up in a nextflow (version 24.10.5 build 5935) pipeline with four distinct processes running Python scripts via Python 3.9 unless specified otherwise.

(1) In the first process, GNPS reference libraries, including spectra from GNPS³, MoNA (https://massbank.us/) and MassBankEU³⁰, were aggregated. Specifically, we used the GNPS cleaned library (gnps_cleaned.mgf), the MULTIPLEX synthesis libraries in both filtered (MULTIPLEX-SYNTHESIS-LIBRARY-FILTERED-PARTITION-1.mgf to -4.mgf) and full variants (MULTIPLEX-SYNTHESIS-LIBRARY-ALL-PARTITION-1.mgf to -6.mgf), additional GNPS libraries (GNPS-BILE-ACID-MODIFICATIONS.mgf, GNPS-DRUG-ANALOG.mgf and GNPS-IIMN-PROPOGATED.mgf) and the REFRAME negative and positive libraries (REFRAME-NEGATIVE-LIBRARY.mgf and REFRAME-POSITIVE-LIBRARY.mgf). The aggregated spectra were then clustered via falcon³¹ to group highly similar spectra. Falcon was adapted to return spectral library IDs for clustered spectra (https://github.com/YasinEl/falcon/tree/feature/fast-clustering; falcon-ms (version string 0.1.dev264+gdf7adb9fb) running via Python 3.9, numpy 1.26.4 and pylance 0.21.0). Clustering parameters were set to min_peaks = 2, scaling = root, min_mz = 40, max_mz = 2000, min_mz_range = 1, distance_threshold = 1, precursor_tol = 20 ppm and fragment_tol = 0.05.

(2) In the next step, an SQLite (version 3.39.2) database was initiated, and a library table was added containing all molecular metadata from the csv files that accompany the mgf files at https://external.gnps2.org/processed_gnps_data/gnps_cleaned/. Moreover, integer IDs were assigned to each library entry (spectrum_id_int), and the falcon grouping ID (falcon_cluster_id) was added as a separate column. Next, raw data files accessible for MS/MS spectral matching via FASST MASST were retrieved from https://fasst.gnps2.org/library/files?library=metabolomicspanrepo_index_nightly, assigned an integer ID (mri_id_int) and deposited as mri_table. After that, sample metadata were downloaded from https://redu.gnps2.org/dump. The metadata table was subsetted to mri values present in the mri_table, and mri_id_int identifiers were added from the previously created mri_table. The metadata table was then added as redu_table. Finally, an empty table for masst_results was added.
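
The table setup described in this step can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the authors' script: the column set is reduced, and the names spectrum_id, cosine and matching_peaks are assumptions; only the table names and the spectrum_id_int, falcon_cluster_id, mri_id_int, mri, scan_id and NCBITaxonomy columns appear in the text.

```python
import sqlite3

def init_database(path=":memory:"):
    """Create a reduced, illustrative version of the tables with integer keys."""
    con = sqlite3.connect(path)
    con.executescript(
        """
        CREATE TABLE library_table (
            spectrum_id_int   INTEGER PRIMARY KEY,
            spectrum_id       TEXT UNIQUE,   -- assumed name for the text accession
            falcon_cluster_id INTEGER
        );
        CREATE TABLE mri_table (
            mri_id_int INTEGER PRIMARY KEY,  -- assigned automatically on insert
            mri        TEXT UNIQUE
        );
        CREATE TABLE redu_table (
            mri_id_int   INTEGER,
            NCBITaxonomy TEXT
        );
        CREATE TABLE masst_table (           -- left empty at this stage
            spectrum_id_int INTEGER,
            mri_id_int      INTEGER,
            scan_id         INTEGER,
            cosine          REAL,            -- assumed column name
            matching_peaks  INTEGER          -- assumed column name
        );
        """
    )
    return con

def add_redu_rows(con, redu_rows):
    """Keep metadata rows whose mri exists in mri_table and attach mri_id_int."""
    lookup = dict(con.execute("SELECT mri, mri_id_int FROM mri_table"))
    kept = [(lookup[mri], taxon) for mri, taxon in redu_rows if mri in lookup]
    con.executemany("INSERT INTO redu_table VALUES (?, ?)", kept)
    return len(kept)
```

Declaring the integer column as INTEGER PRIMARY KEY lets SQLite assign the IDs itself, which is one simple way to obtain compact surrogate keys for long file paths.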

(3) We then utilized FASST MASST to individually query library spectra against the metabolomicspanrepo_index_nightly database. Spectral matching parameters were set to a minimum cosine of 0.7, a minimum of 3 matching peaks, a precursor mass tolerance of 0.05 Da and a fragment mass tolerance of 0.05 Da. Before depositing the results of each query into the masst_table, all returned text values were replaced with their corresponding integer IDs to optimize storage and retrieval efficiency. Namely, we deposited the obtained cosine scores (rounded to 2 digits), the number of matching peaks, an ID for the query spectrum (spectrum_id_int), an ID for the matching file (mri_id_int) and the matching scan ID (scan_id).
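
The replacement of text values with integer IDs can be illustrated as follows. This is a sketch under stated assumptions: the dictionary keys of the incoming match records (query_spectrum_id, mri, scan, cosine, matching_peaks) are hypothetical and do not describe the actual FASST response format.

```python
def encode_matches(raw_matches, spectrum_ids, mri_ids):
    """Convert text-keyed match records into compact integer tuples.

    spectrum_ids and mri_ids map text identifiers to the integer IDs
    (spectrum_id_int, mri_id_int) assigned when the lookup tables were built.
    """
    rows = []
    for match in raw_matches:
        rows.append((
            spectrum_ids[match["query_spectrum_id"]],  # spectrum_id_int
            mri_ids[match["mri"]],                     # mri_id_int
            int(match["scan"]),                        # scan_id
            round(float(match["cosine"]), 2),          # cosine, 2 digits
            int(match["matching_peaks"]),
        ))
    return rows
```

Each returned tuple holds only numbers, so a table of such rows stays small even when the same file path or library accession occurs in many matches.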

(4) In the final process, we ensured that no duplicates were accidentally included in any of the generated tables. In addition, indices were created on several columns to enable fast data retrieval. Specifically, the indices referenced in Table 1 were created.
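
One possible way to perform such a duplicate check in SQLite is shown below. The text does not say how duplicates were removed, so the rowid-based approach and the function name are assumptions made for illustration.

```python
import sqlite3

def drop_duplicate_rows(con, table, key_columns):
    """Delete all but the first row for each combination of key_columns.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted code and never from user input.
    Returns the number of rows removed.
    """
    keys = ", ".join(key_columns)
    before = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    con.execute(
        f"DELETE FROM {table} WHERE rowid NOT IN "
        f"(SELECT MIN(rowid) FROM {table} GROUP BY {keys})"
    )
    after = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return before - after
```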

For data handling in the above steps, pandas (2.3.2), sqlite3 (3.50.4) and sqlalchemy (2.0.43) packages were used.

The constructed database

In our informatics workflow, we implemented a compact, file-based SQLite (version 3.39.2) database to manage hundreds of millions of MS/MS match events across four public metabolomics repositories. Table 1 summarizes the five core tables in this database, detailing their roles, primary key columns and any indices used to speed up queries.

All spectrum matches are funneled into a single masst_table, which records only integer IDs (for library spectra, raw data files, datasets and scans) alongside similarity metrics (cosine score and matching peak count). Surrounding this central table are four lookup tables:

library_table: contains the full GNPS reference spectra metadata, keyed by spectrum_id_int.

mri_table: maps each raw data file path (MRI) to a small integer (mri_id_int), avoiding repeated storage of long file paths.

dataset_table: associates each public dataset accession (GNPS/MassIVE, MetaboLights or Metabolomics Workbench) with an integer ID (dataset_id_int).

redu_table: stores ReDU-curated metadata for files, joined via mri_id_int to integrate sample descriptors (for example, organism taxonomy, body part and instrument details) where available.

To balance performance with storage efficiency, we selectively built indices on the most critical join columns—specifically on masst_table.spectrum_id_int, the (mri_id_int, scan_id) pair in masst_table, the mri field in mri_table and redu_table.mri_id_int. This targeted indexing ensures that cross-table joins, even over tens of millions of rows, complete in seconds rather than minutes. By relying on integer-only core tables and a minimal set of indices, our system remains lightweight, portable and easily embedded into reproducible analysis pipelines.
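
The four indices named above could be created as in the following sketch. The index names (idx_...) are hypothetical; the indexed tables and columns follow the text.

```python
import sqlite3

# Index names are illustrative; tables and columns are those listed in the text.
INDEX_STATEMENTS = [
    "CREATE INDEX IF NOT EXISTS idx_masst_spectrum ON masst_table (spectrum_id_int)",
    "CREATE INDEX IF NOT EXISTS idx_masst_mri_scan ON masst_table (mri_id_int, scan_id)",
    "CREATE INDEX IF NOT EXISTS idx_mri_mri ON mri_table (mri)",
    "CREATE INDEX IF NOT EXISTS idx_redu_mri ON redu_table (mri_id_int)",
]

def create_indices(con):
    """Create the join indices and return the names of those now present."""
    for statement in INDEX_STATEMENTS:
        con.execute(statement)
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx%'"
    )
    return sorted(name for (name,) in rows)
```

Because each statement uses IF NOT EXISTS, the function can be rerun on an existing database file without raising an error.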

StructureMASST

StructureMASST has been written as a streamlit (version 1.45) app³². Inputs are accepted as molecular names, which are interpreted via the PubChem Auto-Complete Search Service. Canonical SMILES for the obtained matches are then retrieved via the PubChem REST API. Whether SMILES are obtained through PubChem or entered manually, they are harmonized via functionality adapted from³³. The structure can also be edited or drawn using a streamlit component based on Ketcher (streamlit-ketcher, version 0.0.1), and the drawn structure is then converted to SMILES for searching. If the input is provided as a SMARTS pattern, the SMARTSview REST API is used to generate a visual representation of it, making it easier to interpret and debug patterns³⁴. This and all further SMILES, structure and substructure processing is performed with RDKit (version 2024.09.6), with substructure matching through its HasSubstructMatch() function. Tanimoto matching is performed on the basis of Morgan fingerprints (ECFP4; radius 2, 2,048 bits) using the RDKit rdFingerprintGenerator.GetMorganGenerator. Tanimoto similarity coefficients are then calculated with the RDKit DataStructs.TanimotoSimilarity. Sankey diagrams were generated in Python using the Plotly go.Sankey implementation. Further data handling was performed through numpy (1.26.4), pyarrow¹⁵, requests (2.31), requests-cache (1.2), lxml (5.2), pyteomics (4.6) and celery (5.2.2).

Retrieving library spectra based on structures

All structure-based searches in FASSTrecords are performed using RDKit (version 2024.09.6). For substructure and similarity searches, precomputed RDKit fingerprints stored in the database are used: pattern fingerprints for substructure screening and Morgan (ECFP4; radius 2, 2,048 bits) fingerprints for similarity calculations. These fingerprints are stored as binary blobs and decoded during retrieval; they are loaded directly from the database rather than recalculated, ensuring fast and reproducible comparisons across the entire dataset.
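
The idea of decoding a stored fingerprint and using it as a screen can be shown in pure Python. This is a conceptual sketch only: the byte layout assumed here (raw little-endian bit vector) is hypothetical and is not the serialization RDKit uses, and the functions are stand-ins for the RDKit calls used in the workflow.

```python
def blob_to_bits(blob):
    """Decode a fingerprint stored as raw bytes into a Python int bit vector."""
    return int.from_bytes(blob, byteorder="little")

def passes_screen(query_bits, target_bits):
    """Fingerprint screen for substructure search.

    A target can only contain the query if every bit set in the query
    fingerprint is also set in the target fingerprint. Passing the screen
    is necessary but not sufficient, so hits still need confirmation by a
    real substructure match.
    """
    return (query_bits & target_bits) == query_bits
```

The screen is cheap (one bitwise AND per candidate), which is why it is applied before the more expensive atom-by-atom matching.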

In exact search mode, the query SMILES is parsed with RDKit to obtain its monoisotopic mass. Matches are retrieved through the first 14 characters of the InChIKey, which encode the molecule’s connectivity (regiochemistry) layer, and are further constrained by a ±0.02 Da mass window to exclude matches with incorrect numbers of double bonds. Library spectra generated from low-mass-resolution instruments are excluded.
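
The exact-match filter can be sketched as follows, assuming library entries are available as dictionaries with hypothetical inchikey and monoisotopic_mass fields. In the actual workflow the query mass comes from RDKit; here it is passed in as a number.

```python
def exact_matches(query_inchikey, query_mass, library, tol_da=0.02):
    """Return library entries sharing the first InChIKey block and the mass.

    The first 14 characters of the InChIKey are compared, and the
    monoisotopic mass must lie within +/- tol_da of the query mass.
    """
    block = query_inchikey[:14]
    return [
        entry for entry in library
        if entry["inchikey"][:14] == block
        and abs(entry["monoisotopic_mass"] - query_mass) <= tol_da
    ]
```

For example, an entry with the same first block but a mass 2.016 Da higher than the query would be rejected by the mass window.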

In substructure search mode, one representative molecule per unique InChIKey block is screened using its precomputed Pattern fingerprint, and candidate hits are confirmed with RDKit’s HasSubstructMatch() function. Once a representative block is identified as containing the query substructure, all associated spectra belonging to that block are retrieved.

In similarity (Tanimoto) mode, precomputed Morgan (ECFP4) fingerprints are used to compute Tanimoto similarity coefficients (DataStructs.TanimotoSimilarity). InChIKey blocks with similarity scores above the user-specified threshold are selected, and all spectra belonging to those blocks are expanded and returned with full metadata.
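
The selection logic of the similarity mode is illustrated below in pure Python, with fingerprints represented as sets of on-bit indices. This stands in for DataStructs.TanimotoSimilarity on RDKit bit vectors; the function names, the inclusive threshold and the dictionary layout are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def similar_blocks(query_fp, block_fps, threshold):
    """Return InChIKey blocks whose fingerprint reaches the threshold."""
    return sorted(
        block for block, fp in block_fps.items()
        if tanimoto(query_fp, fp) >= threshold
    )

def expand_to_spectra(blocks, spectra_by_block):
    """Expand selected blocks to all spectrum IDs that belong to them."""
    return [sid for block in blocks for sid in spectra_by_block.get(block, [])]
```

Scoring one fingerprint per block and expanding afterwards avoids recomputing the same similarity for every spectrum of the same 2D structure.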

After performing substructure or similarity searches, multiple 2D structures may match a given query. In these cases, we assume that the user is interested in the distribution of the substructure or of structurally related molecules, rather than each individual compound. Therefore, we report only the molecule with the best MS/MS match for each sample. Users interested in the biodistribution of multiple distinct molecules can submit them individually or use the batch mode.

When performing analog searches, the best-matching analog per unique Δmass and sample is reported. Consequently, this is the only search mode where the same sample can appear multiple times in the results (once per detected analog).

Across all modes, low-resolution analyzers (quadrupole or ion-trap instruments) are excluded, textual fields are harmonized (missing values reported as ‘unknown’), and large queries are processed in batches to ensure scalable and reproducible results.

Raw data search modes

After retrieving structure-level matches from FASSTrecords, raw data searches are performed to locate experimental spectra corresponding to these structures across public MS datasets. Representative spectra for each molecule are defined during FASSTrecords creation through FALCON clustering, which groups highly similar MS/MS spectra within the library. These representative spectra are then used as queries against FASSTrecords or directly via FASST using the selected search parameters.

Search results are subsequently intersected with the PanReDU metadata resource, which provides curated sample-level information (for example, organism, tissue and environment annotations). Only samples for which metadata are available are retained, ensuring that all downstream analyses are contextually interpretable. To avoid overcounting, only the top-ranking MS/MS match is reported for each sample based on spectral similarity.

After performing substructure or Tanimoto similarity searches, multiple related 2D structures can correspond to the same molecular pattern or substructure. In such cases, the search is interpreted as aiming to describe the overall distribution of that substructure (or of molecules similar to the query). Consequently, only the molecule with the best MS/MS match per sample is reported, regardless of how many molecules matched that sample. Users interested in the distributions of individual molecules can instead submit them separately or use the batch search mode.

In analog search mode, which identifies molecules differing by specific mass offsets (Δmass) relative to the query structure, the best-matching analog per unique Δmass and sample is reported. As a result, analog searches are the only mode where the same sample can appear multiple times in the results—each instance corresponding to a distinct analog observation.
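
The reporting rules above amount to keeping the highest-scoring match per group. A minimal sketch, assuming each match is a dictionary, that a sample is identified by mri_id_int, and that Δmass values have already been rounded so equal offsets compare equal (the field names delta_mass and cosine are hypothetical):

```python
def best_per_group(matches, key_fields):
    """Keep only the highest-cosine match for each combination of key_fields."""
    best = {}
    for match in matches:
        key = tuple(match[field] for field in key_fields)
        if key not in best or match["cosine"] > best[key]["cosine"]:
            best[key] = match
    return list(best.values())

# Exact, substructure and similarity searches: one row per sample.
#   best_per_group(matches, ["mri_id_int"])
# Analog searches: one row per sample and distinct mass offset.
#   best_per_group(matches, ["mri_id_int", "delta_mass"])
```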

Across all search modes, this postprocessing ensures that reported hits represent unique, biologically interpretable findings at the sample level, while maintaining consistency between structure-level matching, raw data retrieval and metadata integration.

Downstream and support tooling

Multiple tools have been linked and integrated into StructureMASST to simplify analysis. For library spectra and spectral matches, the GNPS Spectral Resolver (https://metabolomics-usi.gnps2.org) can be used to visualize individual spectra and spectral matches by clicking the respective links in the library and results tables. For raw data results, linkouts to the GNPS Dashboard (https://dashboard.gnps2.org) are provided to inspect extracted ion chromatograms directly in the raw data files. After analog searches, ion chromatograms of both unmodified and modified species are extracted by default, allowing assessment of whether relative elution orders are as expected and whether co-elution indicates analytical artifacts, such as in-source fragments (ISFs) or other ion species from the same molecule, which could be mistaken for analogs. After modification/analog searches, ModiFinder (https://modifinder.gnps2.org/) can be accessed for all supported adduct types from a linkout provided in the table to assess likely modification sites.

StructureMASST is meant as a tool for the comprehensive and interactive retrieval of raw data matching query molecules. As such, it provides multiple ways to visualize matches across raw data and allows subsetting to matches of interest. However, different applications, such as environmental or evolutionary studies, require different types of integration for these data. StructureMASST is meant as a starting point from which multiple tools can branch off for more specific visualizations. Some preliminary tools, which are still under development, are provided in the ‘Downstream and support tooling’ section.

Reported FASSTrecords numbers

Numbers reported on FASSTrecords were retrieved on 24 September 2025 from the FASSTrecords sqlite database using the following queries:

Number of files with metadata:

SELECT COUNT(*) FROM redu_table;

Number of spectra in the library:

SELECT COUNT(*) FROM library_table;

Number of unique 2D structures in the library:

SELECT COUNT(DISTINCT InChIKey_smiles_firstBlock) FROM library_table;

Number of annotated scans:

SELECT COUNT(*) AS unique_mri_scan_pairs
FROM (
    SELECT 1
    FROM masst_table
    GROUP BY mri_id_int, scan_id
);

Number of annotated scans in human data:

SELECT COUNT(*) AS unique_mri_scan_pairs
FROM (
    SELECT 1
    FROM masst_table AS m
    WHERE EXISTS (
        SELECT 1
        FROM redu_table AS r
        WHERE r.mri_id_int = m.mri_id_int
        AND r.NCBITaxonomy = '9606|Homo sapiens'
    )
    GROUP BY m.mri_id_int, m.scan_id
);

Number of annotations with sample metadata:

SELECT COUNT(*)
FROM masst_table mt
WHERE EXISTS (
    SELECT 1 FROM redu_table rt
    WHERE rt.mri_id_int = mt.mri_id_int
);

Number of 2D structures with raw data matches:

SELECT COUNT(*)
FROM (
    SELECT DISTINCT l.InChIKey_smiles_firstBlock
    FROM library_table l
    WHERE l.InChIKey_smiles_firstBlock IS NOT NULL
    AND EXISTS (
        SELECT 1
        FROM masst_table m
        WHERE m.spectrum_id_int = l.spectrum_id_int
        AND m.annotation_rank = 1
        LIMIT 1
    )
);

Number of 2D structures with raw data matches in human samples:

WITH human_mri AS (
    SELECT DISTINCT mri_id_int
    FROM redu_table
    WHERE NCBITaxonomy = '9606|Homo sapiens'
)
SELECT COUNT(*)
FROM (
    SELECT DISTINCT l.InChIKey_smiles_firstBlock
    FROM library_table l
    WHERE l.InChIKey_smiles_firstBlock IS NOT NULL
    AND EXISTS (
        SELECT 1
        FROM masst_table m
        WHERE m.spectrum_id_int = l.spectrum_id_int
        AND m.mri_id_int IN (SELECT mri_id_int FROM human_mri)
        LIMIT 1
    )
);
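
Counting queries such as those above can be run from Python with the built-in sqlite3 module. The sketch below wraps the annotated-scan query; the function name is hypothetical, and the connection would normally point at the FASSTrecords database file rather than a toy table.

```python
import sqlite3

# Same logic as the "Number of annotated scans" query: one row per
# distinct (mri_id_int, scan_id) pair, then count the rows.
UNIQUE_PAIRS_SQL = """
SELECT COUNT(*) AS unique_mri_scan_pairs
FROM (
    SELECT 1
    FROM masst_table
    GROUP BY mri_id_int, scan_id
)
"""

def count_annotated_scans(con):
    """Return the number of distinct (file, scan) pairs with any match."""
    return con.execute(UNIQUE_PAIRS_SQL).fetchone()[0]
```

A scan matched by several library spectra contributes several rows to masst_table but is counted once here, which is why this number differs from a plain COUNT(*) on the table.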

Biological examples – Matching and filtering criteria

Caffeine example

MS/MS spectra were retrieved via exact structure matching of the SMILES CN1C=NC2=C1C(=O)N(C(=O)N2C)C. We then searched FASSTrecords using a minimum cosine of 0.9 and minimum matching peaks set to 5.

Salicylic acid–thiazoline example

MS/MS spectra were retrieved via substructure search of the SMILES OC1=CC=CC=C1C2=NCCS2. We then searched FASSTrecords using a minimum cosine of 0.7 and minimum matching peaks set to 5. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Surfactin C example

MS/MS spectra were retrieved via exact structure matching of the SMILES CC(C)CCCCCCCCCC1CC(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)O1)CC(C)C)CC(C)C)CC(=O)O)C(C)C)CC(C)C)CC(C)C)CCC(=O)O. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Amiodarone example

MS/MS spectra were retrieved via exact structure matching of the SMILES CCCCC1=C(C2=CC=CC=C2O1)C(=O)C3=CC(=C(C(=C3)I)OCCN(CC)CC)I. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Sertraline example

MS/MS spectra were retrieved via exact structure matching of the SMILES CNC1CCC(C2=CC=CC=C12)C3=CC(=C(C=C3)Cl)Cl. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Desferrioxamine H example

MS/MS spectra were retrieved via exact structure matching on the SMILES CC(=O)N(O)CCCCCNC(=O)CCC(=O)N(O)CCCCCNC(=O)CCC(=O)O. We then searched FASSTrecords using a minimum cosine of 0.7 and minimum matching peaks set to 5. Matches to the library spectrum ‘CCMSLIB00000845585’ were removed by selecting the query_spectrum_id column in the Column-dropdown below the results table and then selecting ‘CCMSLIB00000845585’ in the Value-dropdown. The filter was applied by clicking the ‘Remove selected rows’ button.

Surfactin C Tanimoto similarity example

MS/MS spectra were retrieved via Tanimoto similarity search of the SMILES CC(C)CCCCCCCCCC1CC(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)O1)CC(C)C)CC(C)C)CC(=O)O)C(C)C)CC(C)C)CC(C)C)CCC(=O)O with a threshold of 0.8. We then utilized FASSTrecords using a minimum cosine of 0.7 and 5 minimum matching peaks.

Mass-defect analysis

Mass-defect values were calculated as the difference between the exact m/z and the nearest nominal mass (mass defect = exact mass – nominal mass). Data processing and visualization were performed using R (version 4.5.1) in the RStudio environment. For the mass-defect plot of amiodarone (Supplementary Fig. 4b), the m/z values of amiodarone and its potential metabolites identified through the StructureMASST search (Supplementary Fig. 4a) were used, while CHNO-backbone compounds with varying degrees of iodination (Supplementary Table 2) were referenced to confirm their iodination levels. Similarly, for the mass-defect plot of sertraline (Fig. 2c), the m/z values of sertraline and its potential metabolites identified from the StructureMASST search (Fig. 2b) were used, while CHN-backbone compounds with varying degrees of chlorination (Supplementary Table 4) were referenced to confirm their chlorination levels.
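
The mass-defect definition given above translates directly into code. The analysis itself was done in R; the Python sketch below only illustrates the arithmetic, and the example m/z values in the usage note are hypothetical.

```python
import math

def mass_defect(mz):
    """Mass defect = exact m/z minus the nearest nominal (integer) mass."""
    nominal = math.floor(mz + 0.5)  # round half up to the nearest integer
    return mz - nominal
```

For instance, an ion at m/z 646.031 has a small positive mass defect (about 0.031), whereas an ion at m/z 199.95 has a negative one (about −0.05). Heavily halogenated compounds tend towards lower or negative values, which is what makes the plot useful for checking iodination or chlorination levels.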

ModiFinder analysis

ModiFinder analysis of several potential metabolites of amiodarone and sertraline identified from the StructureMASST search was performed through the ‘View Modification Site’ link provided in the results table for each USI generated in FASST mode. Clicking this link directs users to the ModiFinder web interface on GNPS2 (https://modifinder.gnps2.org/), where the results are shown. The inputs, parameters and results can be accessed through the links provided below.

Amiodarone

Δm/z −26.02:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00013027336&USI2=mzspec%3AMSV000085760%3Araw%2FmzXML%2F5580.mzXML%3Ascan%3A2872&SMILES1=CCCCc1oc2ccccc2c1C%28%3DO%29c1cc%28I%29c%28OCCN%28CC%29CC%29c%28I%29c1&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Δm/z −125.90:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00012316157&USI2=mzspec%3AMTBLS1866%3AFILES%2FLipidomic_ICU+COVID-19_ESI+Positive%2FDA17_p.mzML%3Ascan%3A686&SMILES1=CCCCc1oc2ccccc2c1C%28%3DO%29c1cc%28I%29c%28OCCN%28CC%29CC%29c%28I%29c1&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Sertraline

Δm/z +43.99:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00000084936&USI2=mzspec%3AMSV000080673%3Accms_peak%2F2017.AmericanGut3K.mzXMLfiles%2FSamples%2F000006382_RB8_01_6463.mzML%3Ascan%3A1896&SMILES1=CNC1CCC%28c2ccc%28Cl%29c%28Cl%29c2%29c2ccccc21&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Δm/z +148.04:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00003140022&USI2=mzspec%3AMSV000086415%3Accms_peak%2FPlate+01+Samples+RAW%2F16265624.mzML%3Ascan%3A1311&SMILES1=CNC1CCC%28c2ccc%28Cl%29c%28Cl%29c2%29c2ccccc21&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01
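
The ModiFinder links above share one query-string layout, which can be assembled with Python's standard library. The parameter names and default values are read off the listed URLs; the function itself is a hypothetical helper and not part of ModiFinder or StructureMASST.

```python
from urllib.parse import urlencode

def modifinder_url(usi1, usi2, smiles1, adduct="[M+H]1+",
                   ppm_tolerance=25, filter_peaks_variable=0.01):
    """Build a ModiFinder link from a library USI, a raw data USI and a SMILES."""
    params = {
        "USI1": usi1,            # library spectrum of the known compound
        "USI2": usi2,            # raw data scan of the putative analog
        "SMILES1": smiles1,      # structure of the known compound
        "SMILES2": "",           # left empty, as in the listed links
        "Helpers": "",
        "adduct": adduct,
        "ppm_tolerance": ppm_tolerance,
        "filter_peaks_variable": filter_peaks_variable,
    }
    return "https://modifinder.gnps2.org/?" + urlencode(params)
```

Note that urlencode writes the empty SMILES2 parameter as `SMILES2=`, whereas the listed links contain a bare `SMILES2`; the remaining parameters are encoded identically.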

Statistical analysis of human enrichment of drug matches among Metazoa hits

To quantify whether StructureMASST raw data matches were disproportionately associated with human samples, we tested for enrichment of Homo sapiens within the set of positive matches for each queried drug molecule relative to its prevalence in other Metazoa samples. The background population was defined as all entries in the redu_table of FASSTrecords with MS/MS present (MS2spectra_count > 0) and NCBIKingdom == ‘Metazoa’. Human samples were defined as NCBITaxonomy == ‘9606|Homo sapiens’, and all remaining Metazoa entries were treated as nonhuman. Positive matches for each molecule were defined as raw data hits passing the specified spectral matching criteria (default: cosine >0.7 and matching peaks >5; additional cosine thresholds were evaluated as shown).

For each molecule, we constructed a 2 × 2 contingency table comparing the number of human versus nonhuman Metazoa samples among the molecule’s positive matches to the corresponding counts in the background (hits versus nonhits). We then applied Fisher’s exact test (two-sided) to each table to estimate an odds ratio (OR) and associated P value. The OR was interpreted as the relative odds that a positive match originated from Homo sapiens rather than nonhuman Metazoa compared with the same odds in the Metazoa background (OR >1, enrichment; OR <1, depletion). Multiple testing across molecules was controlled using the Benjamini–Hochberg procedure; adjusted P values (q values) are reported, with q < 0.05 considered significant.
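
The test procedure can be illustrated with a pure-Python sketch. The text does not name the software used, so this is a stand-in written for illustration: the table layout (rows human and nonhuman, columns hits and nonhits), the sample odds ratio and the function names are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Returns (odds_ratio, p_value). The P value sums the hypergeometric
    probabilities of all tables with the same margins that are no more
    likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)  # tolerance for floating-point ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (q values), input order kept."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

With one table per drug molecule, the P values from fisher_exact_two_sided would be collected into a list and passed to benjamini_hochberg, and molecules with q < 0.05 and OR > 1 would be read as enriched in human samples.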

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
