Manufacturing-aware generative models enable petascale synthesis of designed DNA

Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440-445 (2020).
Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099-1106 (2023).
Ingraham, J. B. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070-1078 (2023).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089-1100 (2023).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128-135 (2017).
Weinstein, E. N., Amin, A. N., Medical, H., Frazer, J. & Marks, D. S. Non-identifiability and the blessings of misspecification in models of molecular fitness. In Proc. 36th International Conference on Neural Information Processing Systems (ed. Koyejo, S. et al.) (ACM, 2022).
Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499-507 (2014).
Weinstein, E. N. et al. Optimal design of stochastic DNA synthesis protocols based on generative sequence models. In Proc. 25th International Conference on Artificial Intelligence and Statistics (ed. Camps-Valls, G. et al.) (PMLR, 2022).
Li, J. Q. & Barron, A. R. Mixture density estimation. In Proc. 12th International Conference on Neural Information Processing Systems (ed. Kearns, M. J. et al.) (ACM, 1999).
Richardson, E. & Weiss, Y. On GANs and GMMs. In Proc. 32nd International Conference on Neural Information Processing Systems (ed. Bengio, S. et al.) (ACM, 2022).
Olsen, T. H., Boyles, F. & Deane, C. M. Observed Antibody Space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. 31, 141-146 (2022).
Olsen, T. H., Moal, I. H. & Deane, C. M. Addressing the antibody germline bias and its effect on language models for improved antibody design. Bioinformatics 40, btae618 (2024).
Amin, A. N., Weinstein, E. N. & Marks, D. S. A generative nonparametric Bayesian model for whole genomes. In Proc. 35th International Conference on Neural Information Processing Systems (ed. Ranzato, M. et al.) (ACM, 2021).
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 13, 723-773 (2012).
Amin, A. N., Marks, D. S. & Weinstein, E. N. Biological sequence kernels with guaranteed flexibility. J. Mach. Learn. Res. 26, 1-63 (2025).
Shuai, R. W., Ruffolo, J. A. & Gray, J. J. IgLM: infilling language modeling for antibody sequence design. Cell Syst. 14, 979-989.e4 (2023).
Amin, A. N., Weinstein, E. N. & Marks, D. S. A kernelized Stein discrepancy for biological sequences. In Proc. 40th International Conference on Machine Learning (ed. Krause, A. et al.) (PMLR, 2023).
Lloyd, J. R. & Ghahramani, Z. Statistical model criticism using kernel two sample tests. In Proc. 29th International Conference on Neural Information Processing Systems (ed. Cortes, C. et al.) (ACM, 2015).
Wermke, M. et al. Autologous T cell therapy for PRAME+ advanced solid tumors in HLA-A*02+ patients: a phase 1 trial. Nat. Med. 31, 2365-2374 (2025).
Reynisson, B., Alvarez, B., Paul, S., Peters, B. & Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 48, W449-W454 (2020).
Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: exploring the boundaries of protein language models. Cell Syst. 14, 968-978 (2023).
Gibson, D. G. et al. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343-345 (2009).
Shumailov, I. et al. AI models collapse when trained on recursively generated data. Nature 631, 755-759 (2024).
Framework for Nucleic Acid Synthesis Screening (National Science and Technology Council, 2024); https://aspr.hhs.gov/S3/Documents/OSTP-Nucleic-Acid-Synthesis-Screening-Framework-Sep2024.pdf
Baker, D. & Church, G. Protein design meets biosecurity. Science 383, 349 (2024).
Baum, C. et al. A system capable of verifiably and privately screening global DNA synthesis. Preprint at https://arxiv.org/abs/2403.14023 (2025).
Abdali, S., Anarfi, R., Barberan, C. J., He, J. & Shayegani, E. Securing large language models: threats, vulnerabilities and responsible practices. Preprint at https://arxiv.org/abs/2403.12503 (2024).
Weinstein, E. N., Slabodkin, A., Gollub, M. G. & Wood, E. B. Accelerated learning on large-scale displays using generative library models. Preprint at https://arxiv.org/abs/2510.16612 (2025).
Weinstein, E. N. et al. Acquisition of lifting biomolecular data. Preprint at https://arxiv.org/abs/2512.15984 (2025).
Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30, 614-620 (2014).
Daily, J. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics 17, 81 (2016).
Jaravine, V., Mösch, A., Raffegerst, S., Schendel, D. J. & Frishman, D. Expitope 2.0: a tool to assess immunotherapeutic antigens for their potential cross-reactivity against naturally expressed proteins in human tissues. BMC Cancer 17, 892 (2017).
Vita, R. et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 47, D339-D343 (2019).
Huszár, F. & Duvenaud, D. Optimally-weighted herding is Bayesian quadrature. In Proc. 28th Annual Conference on Uncertainty in Artificial Intelligence (ed. de Freitas, N. & Murphy, K.) (ACM, 2012).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2020).
Salimans, T. et al. Improved techniques for training GANs. In Proc. 30th International Conference on Neural Information Processing Systems (ed. Lee, D. D. et al.) (ACM, 2016).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. 31st International Conference on Neural Information Processing Systems (ed. von Luxburg, U. et al.) (ACM, 2017).
Lefranc, M.-P. et al. IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev. Comp. Immunol. 27, 55-77 (2003).
Shen, S. et al. Probabilistic analysis of the frequencies of amino acid pairs within characterized protein sequences. Physica A 370, 651-662 (2006).
Rao, X., Fontaine Costa, A. I. C., van Baarle, D. & Kesmir, C. A comparative study of HLA binding affinity and ligand diversity: implications for generating immunodominant CD8+ T cell responses. J. Immunol. 182, 1526-1532 (2009).
Trolle, T. et al. The length distribution of class I-restricted T cell epitopes is determined by both peptide supply and MHC allele-specific binding preference. J. Immunol. 196, 1480-1487 (2016).


