Scalable homology detection with ERAST

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Pearson, W. R. Using the FASTA program to search protein and DNA sequence databases. Methods Mol. Biol. 24, 307–331 (1994).
Yang, J.-M. & Tung, C.-H. Protein structure database search and evolutionary classification. Nucleic Acids Res. 34, 3646–3659 (2006).
Wang, S. & Zheng, W.-M. CLePAPS: fast pair alignment of protein structures based on conformational letters. J. Bioinform. Comput. Biol. 6, 347–366 (2008).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Shindyalov, I. N. & Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–747 (1998).
Holm, L. Dali server: structural unification of protein families. Nucleic Acids Res. 50, W210–W215 (2022).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).
Jing, Z., Su, Y. & Han, Y. When large language models meet vector databases: a survey. Preprint at arXiv (2024).
Winnicki, M. J., Brown, C. A., Porter, H. L., Giles, C. B. & Wren, J. D. BioVDB: biological vector database for high-throughput gene expression meta-analysis. Front. Artif. Intell. Appl. 7, 1366273 (2024).
Hamamsy, T. et al. Protein remote homology detection and structural alignment using deep learning. Nat. Biotechnol. 42, 975–985 (2023).
Liu, W. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 15, 2775 (2024).
Hong, L. et al. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat. Biotechnol. 43, 983–995 (2024).
Verkuil, R. et al. Language models generalize beyond natural proteins. Preprint at bioRxiv https://doi.org/10.1101/2022.12.21.521521 (2022).
Gu, A. & Dao, T. Mamba: linear-time sequence modeling with selective state spaces. Preprint at https://arxiv.org/abs/2312.00752 (2023).
Schiff, Y. et al. Caduceus: bi-directional equivariant long-range DNA sequence modeling. Proc. Mach. Learn. Res. 235, 43632 (2024).
Jégou, H., Douze, M. & Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 117–128 (2011).
Malkov, Y. A. & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2020).
Ahmad, T., Ahmed, N., Peltenburg, J. & Al-Ars, Z. ArrowSAM: in-memory genomics data processing using Apache Arrow. In 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS) 1–6 (IEEE, 2020).
Sanderson, T., Bileschi, M. L., Belanger, D. & Colwell, L. J. ProteInfer, deep neural networks for protein functional inference. eLife 12, e80942 (2023).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).
Huson, D. H., Auch, A. F., Qi, J. & Schuster, S. C. MEGAN analysis of metagenomic data. Genome Res. 17, 377–386 (2007).
Durairaj, J. et al. Uncovering new families and folds in the natural protein universe. Nature 622, 646–653 (2023).
Muhammed, M. T. & Aki-Yalcin, E. Homology modeling in drug discovery: overview, current applications, and future perspectives. Chem. Biol. Drug Des. 93, 12–20 (2019).
UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 36, D190–D195 (2008).
Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Chandonia, J.-M., Fox, N. K. & Brenner, S. E. SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res. 47, D475–D481 (2019).
Mock, F., Kretschmer, F., Kriese, A., Böcker, S. & Marz, M. Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proc. Natl Acad. Sci. USA 119, e2122636119 (2022).
Elnaggar, A. et al. Ankh ☥: optimized protein language model unlocks general-purpose modelling. Preprint at bioRxiv https://doi.org/10.1101/2023.01.16.524265 (2023).
Maćkiewicz, A. & Ratajczak, W. Principal component analysis (PCA). Comput. Geosci. 19, 303–342 (1993).
McInnes, L., Healy, J. & Astels, S. HDBSCAN: hierarchical density based clustering. J. Open Source Softw. 2, 205 (2017).



