Scalable homology detection with ERAST
