Accurate somatic small variant discovery for multiple sequencing technologies with DeepSomatic

Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719-724 (2009).
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94-101 (2020).
Alexandrov, L. B. & Stratton, M. R. Mutational signatures: patterns of somatic mutations hidden in cancer genomes. Curr. Opin. Genet. Dev. 24, 52-60 (2014).
Perera-Bel, J. et al. From somatic variants to precision oncology: evidence-based reporting on treatment options in molecular tumor boards. Genome Med. 10, 18 (2018).
Garcia-Prieto, C. A., Martínez-Jiménez, F., Valencia, A. & Porta-Pardo, E. Detection of oncogenic and clinically actionable mutations in cancer genomes critically depends on variant calling tools. Bioinformatics 38, 3181-3191 (2022).
Farswan, A. et al. Models of branched clonal evolution predominate in the mutational landscape of multiple myeloma. Am. J. Cancer Res. 11, 5659-5679 (2021).
Li, W. & Freudenberg, J. Mappability and read length. Front. Genet. 5, 381 (2014).
Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311-317 (2012).
Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568-576 (2012).
Wilm, A. et al. LoFreq: an ultra-sensitive, sequence quality-aware variant caller for uncovering cell population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 40, 11189-11201 (2012).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213-219 (2013).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591-594 (2018).
Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate detection of somatic mutations. Nat. Commun. 10, 1041 (2019).
Krishnamachari, K. et al. Accurate somatic variant detection using weakly supervised deep learning. Nat. Commun. 13, 4248 (2022).
Musunuri, R. L. et al. Lancet2: improved and accelerated somatic variant calling with joint multi-sample local assembly graphs. Preprint at bioRxiv https://doi.org/10.1101/2025.02.18.638852 (2025).
Fang, L. T. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat. Biotechnol. 39, 1151-1160 (2021).
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597-614 (2020).
Damaraju, N., Miller, A. L. & Miller, D. E. Long-read DNA and RNA sequencing to streamline clinical genetic testing and reduce barriers to comprehensive genetic testing. J. Appl. Lab. Med. 9, 138-150 (2024).
Kolesnikov, A. et al. Local read haplotagging enables accurate long-read small variant calling. Nat. Commun. 15, 5907 (2024).
Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797-803 (2022).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983-987 (2018).
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long reads. Nat. Methods 18, 1322-1332 (2021).
Kolmogorov, M. et al. Scalable nanopore sequencing of the human genome provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483-1492 (2023).
Zheng, Z. et al. ClairS: a deep-learning method for long-read somatic small variant calling. Preprint at bioRxiv https://doi.org/10.1101/2023.08.17.553778 (2023).
Kolmogorov, M. & Gokce, A. CASTLE-Panel/castle. Datasets. GitHub https://github.com/CASTLE-Panel/castle (2025).
Keskus, A. G. et al. Severus detects somatic structural variations and complex rearrangements in cancer genomes using long-read sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-025-02618-8 (2025).
Díaz-Gay, M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics 39, btad756 (2023).
Vasimuddin, M., Misra, S., Li, H. & Aluru, S. Efficient architecture-aware BWA-MEM acceleration for multi-core systems. In Proc. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 314-324 (IEEE, 2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094-3100 (2018).
Bergstrom, E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 20, 685 (2019).
Lansdon, L. A. et al. Successful classification of genetic subtypes of clinical pediatric leukemia via detection of structural variants using HiFi long-read sequencing. Preprint at medRxiv https://doi.org/10.1101/2024.11.05.24316078 (2024).
Kim, R. rkimoakbioinformatics/oakvar. Source code. GitHub https://github.com/rkimoakbioinformatics/oakvar/ (2025).
Steiert, T. A. et al. A critical spotlight on FFPE-DNA sequencing paradigms. Nucleic Acids Res. 51, 7143-7162 (2023).
Xiao, W. et al. Toward best practice in cancer mutation detection with whole-genome and whole-exome sequencing. Nat. Biotechnol. 39, 1141-1150 (2021).
Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12, 91 (2020).
Cohen, A. S. A. et al. Genomic answers for children: dynamic analyses of more than 1,000 pediatric rare disease genomes. Genet. Med. 24, 1336-1348 (2022).
Monlong, J., Lorig-Roach, R., Meredith, M. & Negi, S. nanoporegenomics/wambam. Source code. GitHub https://github.com/nanoporegenomics/wambam (2025).
Bushnell, B. BioInfoTools/BBMap. Source code. GitHub https://github.com/BioInfoTools/BBMap/blob/master/sh/reformat.sh (2025).
Baid, G. et al. A large sequence dataset of reference samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65 (2012).
Lake, J. A. & Consortium of Long Read Sequencing (CoLoRS). Consortium of Long Read Sequencing Database (CoLoRSdb). Zenodo https://doi.org/10.5281/zenodo.11511513 (2024).
Chen, N.-C. et al. Improving variant calling using population data and deep learning. BMC Bioinf. 24, 197 (2023).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308-311 (2001).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434-443 (2020).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68-74 (2015).
Szegedy, C. et al. Rethinking the Inception architecture for computer vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2818-2826 (2016); https://doi.org/10.1109/CVPR.2016.308
Poplin, R. et al. google/deepvariant. Google (2025). Source code. GitHub https://github.com/google/deepvariant (2025).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2017).
Ahmad, T. KolmogorovLab/Wakhan. Source code. GitHub https://github.com/KolmogorovLab/Wakhan (2025).
Bergstrom, E. N. et al. AlexandrovLab/SigProfilerAssignment. Source code. GitHub https://github.com/AlexandrovLab/SigProfilerAssignment (2025).
Diaz-Gay, M. et al. AlexandrovLab/SigProfilerMatrixGenerator. Source code. GitHub https://github.com/AlexandrovLab/SigProfilerMatrixGenerator (2025).
CASTLE Panel: Long-Read Assessment of Cancer Standards. Datasets. Sequence Read Archive https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA1086849 (2025).
Childhood Cancer Data Initiative (CCDI): Pediatric Cancer Comprehensive Genomic Sequencing (CMRI/KUCC) datasets. dbGaP https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002529.v2.p1 (2025).
DeepSomatic: Accurate discovery of small somatic variants for multiple sequencing technologies. Datasets. dbGaP https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs004188.v1.p1 (2025).
Park, J. Supporting data for: Accurate discovery of small somatic variants for multiple sequencing technologies with DeepSomatic. Zenodo https://doi.org/10.5281/zenodo.16595168 (2025).
Park, J. et al. google/deepsomatic. Google (2025). Source code. GitHub https://github.com/google/deepsomatic (2025).



