Target sequence-conditioned design of peptide binders using masked language modeling

abdulmanannet77@gmail.com, August 13, 2025


Data curation

In the data curation phase, protein–peptide complexes were pooled from the PepNN and Propedia databases²⁴,²⁵. Redundancy between the two datasets was first eliminated, and the remaining protein sequences were then clustered with MMseqs2 at a threshold of 0.8 (ref. ²⁶). When protein sequences in the same cluster had identical binder sequences, a single sequence was retained. This was followed by a manual filtering step in which protein sequences were sorted and those exhibiting high similarity (threshold of 80%) were removed to further mitigate homology issues. The resulting dataset comprised 10,203 entries, of which 10,000 were randomly allocated for training and 203 for testing. The maximum lengths for the binder and protein sequences were set to 50 and 500, respectively.
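As a rough illustration of the deduplication and split described above, the following Python sketch assumes that cluster assignments have already been computed externally (for example, by MMseqs2) and are supplied as a `cluster_id` per entry. The function name, the tuple layout, the fixed seed and the choice to drop (rather than truncate) over-length sequences are all assumptions made for illustration, not details taken from the study.

```python
import random

MAX_BINDER_LEN = 50    # maximum binder length stated in the text
MAX_PROTEIN_LEN = 500  # maximum protein length stated in the text


def deduplicate_and_split(entries, n_test=203, seed=0):
    """Keep one entry per (cluster, binder) pair, then split randomly.

    entries: list of (cluster_id, protein_seq, binder_seq) tuples, where
    cluster_id is assumed to come from a prior clustering step.
    Returns (train, test).
    """
    seen = set()
    kept = []
    for cluster_id, protein, binder in entries:
        # Assumption: over-length sequences are dropped, not truncated.
        if len(binder) > MAX_BINDER_LEN or len(protein) > MAX_PROTEIN_LEN:
            continue
        key = (cluster_id, binder)
        if key in seen:
            continue  # same cluster and identical binder: keep only the first
        seen.add(key)
        kept.append((cluster_id, protein, binder))

    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept[n_test:], kept[:n_test]
```

With 10,203 retained entries and `n_test=203`, this would return 10,000 training and 203 test entries; the manual similarity filtering step is not represented here.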

Conditional peptide modeling

Peptide binders are modeled conditionally on the full target protein sequence. Let p = (p₁, p₂, p₃,…, p_n) represent the target protein sequence of length n and b = (b₁, b₂, b₃,…, b_m) denote the binder of length m. The protein and peptide sequences are concatenated, with special start, end and padding tokens added. Masked language modeling (MLM) turns this into a conditional modeling problem in which the objective is to reconstruct b given p and the fully masked binder region, b_mask. The entire model is updated with the MLM loss, which can be represented as:

$$\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{m}\sum_{i=1}^{m}\log P\left(b_i \mid p, b_{\mathrm{mask}}\right)$$

Minimizing this loss reduces the discrepancy between the designed binders and the ground truth, allowing the model to approximate the conditional probability $\prod_{i=1}^{m} P\left(b_i \mid p\right)$.
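The input construction and loss can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code: the token strings `<cls>`, `<eos>` and `<mask>`, the exact placement of special tokens, and the representation of model outputs as per-position probability dictionaries are assumptions made for readability.

```python
import math

MASK = "<mask>"  # illustrative token string


def build_masked_input(protein, binder_len):
    """Concatenate protein tokens with a fully masked binder region."""
    return ["<cls>"] + list(protein) + [MASK] * binder_len + ["<eos>"]


def mlm_loss(binder, predicted_probs):
    """Mean negative log-likelihood over the m binder positions.

    predicted_probs[i] maps each residue to P(b_i | p, b_mask), i.e. the
    model's predicted distribution at masked binder position i.
    """
    m = len(binder)
    total = sum(math.log(predicted_probs[i][aa]) for i, aa in enumerate(binder))
    return -total / m
```

For example, if a model assigned probability 0.5 to the correct residue at each of two binder positions, `mlm_loss` would return log 2, the same value at every binder length as long as the per-position probability stays at 0.5.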

PepMLM training

The pretrained protein language model ESM-2 was adapted by full-parameter fine-tuning. ESM-2, a transformer-based model, is adept at discerning co-evolutionary patterns across protein sequences. The concatenated protein and peptide sequences were tokenized at the amino acid level and input into the model. In contrast to the original ESM training strategy, only the binder sequence was masked, and it was masked in its entirety, compelling the model to learn the relationship between the peptide binder and the protein. Both the ESM-2-650M and ESM-2-3B models were trained for PepMLM on an NVIDIA DGX A100 system (8× A100 GPUs, 640 GB) with PyTorch 2.0.1 and Python 3.10.10. Specific parameters are shown in Supplementary Table 1.

PepMLM generation

During generation, the target protein sequence, followed by a designated number of mask tokens appended at its end, was input into the model. The model then greedily decodes the logits at each masked position to produce a peptide binder. To introduce greater diversity, top-k sampling was implemented, in which the model randomly samples from the k highest-probability tokens at each masked position.
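The two decoding modes can be illustrated with a small, self-contained sketch. It assumes the model's output is available as one dictionary of residue logits per masked position. Sampling among the top k candidates in proportion to their renormalized softmax weights is an assumption here; the text does not state whether the draw is weighted or uniform.

```python
import math
import random


def greedy_decode(logits_per_position):
    """Pick the highest-logit residue at every masked position."""
    return "".join(max(logits, key=logits.get) for logits in logits_per_position)


def top_k_decode(logits_per_position, k=3, rng=None):
    """Sample one residue per position from the k highest-logit candidates."""
    rng = rng or random.Random()
    residues = []
    for logits in logits_per_position:
        top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
        # Assumption: sample proportionally to a softmax over the kept logits.
        max_logit = top[0][1]
        weights = [math.exp(value - max_logit) for _, value in top]
        candidates = [aa for aa, _ in top]
        residues.append(rng.choices(candidates, weights=weights, k=1)[0])
    return "".join(residues)
```

With `k=1`, `top_k_decode` reduces to greedy decoding; larger values of k trade likelihood for diversity.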

PPL of PepMLM

The pseudo-perplexity (PPL) of ESM-2 was adapted to focus specifically on evaluating peptide binder generation. The calculation is confined to the binder region, that is, the masked positions. Mathematically, the PPL is defined as:

$$\text{Pseudo-perplexity}(b) = \exp\left\{-\frac{1}{m}\sum_{i=1}^{m}\log P\left(b_i \mid b_{j\ne i}, p\right)\right\}$$

In this equation, b represents the binder sequence, and m is the length of the binder sequence. This modification ensures a more focused evaluation of the designed peptide binders, aligning with the conditional modeling approach adopted in this study.
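As a small worked illustration of the definition, the sketch below computes the pseudo-perplexity from a list of per-position log-probabilities for the binder region. How those log-probabilities are obtained from the model is outside the scope of the sketch, and the function name is hypothetical.

```python
import math


def pseudo_perplexity(binder_log_probs):
    """exp of the mean negative log-probability over the m binder positions.

    binder_log_probs[i] is log P(b_i | b_{j != i}, p) for binder position i.
    """
    m = len(binder_log_probs)
    return math.exp(-sum(binder_log_probs) / m)
```

If every binder residue were predicted with probability 0.25, the pseudo-perplexity would be about 4; a model that assigns probability 1 to every correct residue reaches the minimum value of 1.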

Peptide benchmarking

To assess the efficacy of the designed peptide binders, a benchmarking study was conducted on the test set. Top-k sampling (k = 3) was used to generate a single peptide binder for each target protein. Additionally, the original ESM-2 model was used to generate peptides, and random peptides of equivalent length were created. For ESM-2 generation, mask tokens of the same length were appended to the target protein sequences, and prediction and decoding proceeded as for PepMLM. The PepMLM perplexity was compared across four groups. PepMLM-designed binders and test binders were folded together with the protein sequences using AlphaFold2 (ColabFold version 1.5.2). Folding metrics, including pLDDT and ipTM, were collected and correlated with the perplexity findings. For each test target protein, the ipTM scores of the test and designed binders were compared to determine the overall hit rate. Note that, because top-k sampling is stochastic, the hit rate may vary or increase across runs or with different values of k.
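The hit-rate comparison can be sketched as follows. The criterion used here, that a designed binder counts as a hit when its ipTM is at least that of the test binder for the same target, is an assumption for illustration; the text states only that the two scores were compared.

```python
def hit_rate(designed_iptm, test_iptm):
    """Fraction of targets whose designed binder matches or beats the test binder.

    Both lists hold one ipTM score per target, in the same target order.
    Assumption: a hit is designed ipTM >= test ipTM.
    """
    if len(designed_iptm) != len(test_iptm):
        raise ValueError("score lists must cover the same targets")
    hits = sum(d >= t for d, t in zip(designed_iptm, test_iptm))
    return hits / len(test_iptm)
```

Because each run of top-k sampling yields different binders, repeating the generation and folding steps would generally produce a different value from this function.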

RFdiffusion generation

In parallel to the PepMLM approach, RFdiffusion was employed to design peptide binders for both cases. For the given test set, RFdiffusion was tasked with generating one peptide binder per target protein, matching the length of the ground truth binder. The generated backbones were then converted into sequences using ProteinMPNN with the initial guess option and the number of cycles set to 3. The top sequence was selected via root mean square deviation. RFdiffusion inference code on ColabFold can be found at https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/rf/examples/diffusion.ipynb.

Co-folding complex visualization

Structural visualization was performed using ChimeraX 1.7.1. The structures were superimposed using the MatchMaker tool, and interatomic contacts were identified using a van der Waals overlap threshold of ≥ −0.4 Å. The target proteins are shown in gray, whereas the PepMLM-designed and test binders are colored in red and blue, respectively. Their corresponding contact residues are highlighted in matching colors. Amino acid labels are displayed in the focused view.

For the remaining visualizations of AlphaFold-Multimer co-folding results for PepMLM-designed binder–protein complexes, each complex was first aligned to the corresponding test complex using Biopython version 1.8.3, which enabled comparative visualization of selected complexes with both the designed and test binders. In these visualizations, the target protein is shown in yellow, and the test and designed binders are colored blue and red, respectively. The visualizations were rendered using py3Dmol version 2.0.4.

Alignment and identity

Target protein sequence similarity was assessed through two complementary approaches. Sequence identity between test and training sets was computed using the biotite Alignment method with an identity matrix. For each test target protein, the maximum identity score against all training set sequences was recorded. Additionally, a broader sequence similarity analysis was conducted using MMseqs2 (easy-search) to query both train and test target protein sequences against UniRef50, the training dataset of ESM-2.

Expression and purification of SUMO–peptide constructs

Peptides of interest were cloned into a pET-24a⁺ (Novagen) expression vector containing an N-terminal 6×-histidine–SUMO tag to facilitate downstream purification. Oligonucleotide primer pairs, each encoding for one half of the peptide sequences, were designed using NEBaseChanger V2 (https://nebasechanger.neb.com/) and then incorporated into the plasmid using Q5 site-directed mutagenesis, as per the manufacturer’s instructions. Site-directed mutagenesis reactions were carried out according to the same protocol. Plasmid assembly was verified using Sanger sequencing (GENEWIZ) and then transformed into chemically competent Escherichia coli BL21(DE3) cells. Starter cultures (3 ml of LB media, 50 µg ml⁻¹ kanamycin) were inoculated from freshly streaked agar plates or glycerol stocks and grown at 37 °C with shaking at 225 r.p.m. overnight. Starter cultures were then diluted 1:500 in bulk cultures and grown to an optical density at 600 nm (OD₆₀₀) of 0.6–0.8 and then induced at a concentration of 1 mM isopropyl β-d-thiogalactopyranoside (IPTG) overnight at 37 °C with shaking. Thirty minutes after induction, rifampicin was added to a final concentration of 150 µg ml⁻¹. Cells were then collected by centrifugation (4,500g) at 4 °C and washed twice with ice-cold 1× PBS. The resulting cell pellets were frozen at −20 °C overnight, thawed to room temperature and then lysed using BugBuster protein extraction reagent (Millipore Sigma, 70584-3) supplemented with recombinant lysozyme (Millipore Sigma, 71110-3) and benzonase endonuclease (Millipore Sigma, E1014-25KU) for 20 minutes at room temperature with gentle rocking. The corresponding lysate was diluted with lysis buffer (1× PBS, 20 mM imidazole, 1× Halt protease inhibitor cocktail (Thermo Fisher Scientific, 78430)) and then centrifuged at 14,000g for 30 minutes. 
The cleared supernatant was mixed end over end at 4 °C for 30 minutes with HisPur Ni-NTA resin (Thermo Fisher Scientific, 88221) equilibrated with 20 mM imidazole in 1× PBS. Resin was centrifuged at 700g for 2 minutes and then washed three times with 50 mM imidazole in 1× PBS. Protein was eluted with three consecutive washes with 500 mM imidazole, concentrated (Millipore Sigma, 3K MWCO, UFC900308) and desalted using Zeba spin desalting columns (Thermo Fisher Scientific, 89892). Expression and purity of purified proteins in both the soluble and insoluble fraction, as well as purified fractions, were assessed using SDS-PAGE (Supplementary Fig. 8). Protein concentrations were quantified using a Qubit Protein Assay (Thermo Fisher Scientific, Q33211).

Sandwich ELISA

Purified SUMO-tagged peptide constructs were coated onto 96-well plates (Corning, CLS9018) at a concentration of 2 µg ml⁻¹ in coating buffer (10 mM phosphate, pH 7.4) at a volume of 50–100 µl per well at 4 °C overnight with gentle rocking. Plates were washed once with Tris-buffered saline (50 mM Tris-HCl, 150 mM NaCl) supplemented with 0.05% Tween 20 (v/v) (TBS-T) and then blocked with 300 µl of SuperBlock in PBS (Thermo Fisher Scientific, 37516) per the manufacturer’s instructions. BSA, recombinant AMHR2-Fc (Sino Biological, 10673-H02H) and recombinant NCAM1-Fc (Sino Biological, 15785-H02H2) were serially diluted in triplicate or more in SuperBlock with 0.05% Tween 20, after which 100 µl of each solution was added to each well and incubated at room temperature with gentle rocking for 1.5 hours. Plates were then washed five times using 300 µl of TBS-T per well and then incubated with 100 µl of anti-human IgG (HRP) detection antibody (Thermo Fisher Scientific, A18805, diluted 1:10,000 in SuperBlock with 0.05% Tween 20) for 1 hour at room temperature. Plates were again washed five times with 300 µl of TBS-T and then incubated with 100 µl per well of 3,3′-5,5′-tetramethylbenzidine substrate (1-Step Ultra TMB-ELISA; Thermo Fisher Scientific, 34029) for 30 minutes at room temperature with gentle rocking. Finally, the reaction was quenched with 100 µl of 2 N H₂SO₄, and absorbance at 450 nm was immediately quantified using a Promega GloMax Discover plate reader.

Generation of mammalian plasmids

All uAb plasmids were generated from the standard pcDNA3 vector, harboring a cytomegalovirus promoter and a C-terminal P2A–GFP cassette as a transfection control. An Esp3I restriction site was introduced immediately upstream of the CHIPΔTPR coding sequence and flexible GSGSG linker via KLD Enzyme Mix (NEB) following polymerase chain reaction (PCR) amplification with mutagenic primers (GENEWIZ). For uAb assembly, PepMLM-derived peptide sequences (Supplementary Table 4) were human codon optimized for complementary oligo generation (GENEWIZ). Oligos were annealed and ligated via T4 DNA Ligase into the Esp3I-digested uAb backbone. Assembled constructs were transformed into 50 µl of NEB Turbo Competent E. coli and plated onto LB agar supplemented with the appropriate antibiotic for subsequent sequence verification of colonies and plasmid purification (GENEWIZ).

Sequences for human codon-optimized phosphoprotein genes for NiV (GenBank, AY029767), HeV (GenBank, MN062017) and HMPV (GenBank, AAS22075) were designed with HA tags on their N termini and flanked with restriction enzyme recognition sites for KpnI and XhoI on their 3′ and 5′ ends, respectively, for ligation into a mammalian pCAGGS vector.

Cell culture for target degradation

HEK293T and Vero AT cells were maintained in DMEM supplemented with 100 U ml⁻¹ penicillin, 100 mg ml⁻¹ streptomycin and 10% FBS. uAb-encoding plasmids (500 ng) were transfected into cells (4 × 10⁵ per well in a 12-well plate) with Lipofectamine 2000 (Invitrogen) in Opti-MEM (Gibco). TruHD-Q43Q17M cells were maintained in Eagle’s Minimum Essential Medium with Earle’s Salts (EMEM) supplemented with 15% FBS, 1% NEAA (Gibco) and 1% GlutaMAX (Gibco). For HTT degradation studies, PepMLM peptides were transfected into fibroblasts using the SG cell line 4D-Nucleofector X Kit (Lonza). For viral protein degradation, transfections were done with HEK293T cells at approximately 90% confluency in six-well plates using a 4:1 µl:µg ratio of PEI MAX to DNA, following the transfection reagent manufacturer’s protocol. Target phosphoprotein plasmids were transfected at a 1:1 ratio with uAb plasmids for a total of 2 µg of DNA per well in Opti-MEM. Transfections were supplemented with Opti-MEM at approximately 5 hours after transfection. HMPV strain TN93-32 (BEI) was propagated in Vero AT cells for 5 days in DMEM supplemented with 100 U ml⁻¹ penicillin, 100 mg ml⁻¹ streptomycin and 2% FBS. RPE1 cells used for MSH3 studies were maintained in DMEM/Nutrient Mixture F-12 (Gibco) supplemented with 10% FBS and 10 µg ml⁻¹ hygromycin B (Gibco) and transfected with PepMLM peptides using the TransIT-X2 Dynamic Delivery System (Mirus Bio).

MSH3 quantitative immunofluorescence

MSH3 antibody (Thermo Fisher Scientific, PA5-29829) was directly labeled using an Alexa Fluor Antibody Labeling Kit (Invitrogen). Transfected RPE1 cells were fixed 2 days after transfection using 4% paraformaldehyde (Thermo Fisher Scientific) for 20 minutes at room temperature and permeabilized using 0.2% Triton X-100 (BioShop) for 10 minutes at 4 °C. The cells were blocked using a blocking buffer (10% FBS in PBS) overnight and incubated with the labeled primary antibody diluted in blocking buffer (1:50) for 1 hour. The cells were then incubated with 2 µg ml⁻¹ Hoechst 33258 (Thermo Fisher Scientific) for 5 minutes at room temperature. Imaging was performed using an EVOS M7000 Imaging System (Thermo Fisher Scientific) at ×20. Cell segmentation and signal quantification were performed using CellProfiler. In downstream analysis, the TRITC signal from the PepMLM plasmid transfection was used to select for transfected cells. Data were analyzed to assess the statistical significance of differences in normalized intensities between the control group (polyG) and treatment groups (pMLM1–pMLM6). Outliers were removed from all samples using the interquartile range (IQR) method, in which values greater than the third quartile plus 1.5 times the IQR were excluded to ensure robust comparisons. Statistical comparisons between the control and each treatment group were performed using a one-sided Mann–Whitney U-test. Significance thresholds were defined as follows: *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001 and *****P < 0.00001. Non-significant comparisons were denoted as ‘NS’. All analyses were conducted using Python, and plots were generated using Matplotlib.
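The outlier filter and the test statistic described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the analysis script: the quartile convention (`method="inclusive"`) and the function names are choices made here, and only the U statistic is computed. A one-sided P value would normally be obtained from a statistics package, for example `scipy.stats.mannwhitneyu` with its `alternative` argument.

```python
import statistics


def remove_upper_outliers(values):
    """Drop values above Q3 + 1.5 * IQR (upper fence only, as in the text)."""
    # Assumption: quartiles use the 'inclusive' convention.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    return [v for v in values if v <= upper_fence]


def mann_whitney_u(xs, ys):
    """U statistic for xs: count of (x, y) pairs with x > y, ties count 0.5."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )
```

A small U for the treatment group relative to the control indicates that treated cells tend to have lower normalized intensities, which is the direction a one-sided test for degradation would examine.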

HTT western blotting

PepMLM peptide expression was induced using 1 µg ml⁻¹ doxycycline (Sigma-Aldrich) 3 days before harvest and replenished every 2 days. On the day of harvest, TruHD-Q43Q17M cells were washed with 1× PBS and then lysed and scraped off using RIPA buffer (50 mM Tris-HCl pH 8.0, 150 mM NaCl, 1% NP-40, 0.25% sodium deoxycholate, 1 mM EDTA) with protease and phosphatase inhibitors (Thermo Fisher Scientific) on ice. The mixture was incubated on ice for 5 minutes followed by centrifugation at 13,000 r.p.m. for 5 minutes at 4 °C. The supernatant was collected and quantified using a BCA Protein Assay Kit (Sigma-Aldrich). Then, 4× loading buffer (250 mM Tris pH 6.8, 40% glycerol, 8% SDS, 0.02% bromophenol blue) was added to the supernatant and incubated at 95 °C for 5 minutes. Immunoblotting was performed using precast 4–20% gradient gels (Bio-Rad), and proteins were then transferred onto an Immobilon-P PVDF membrane (Millipore). The membranes were blocked in 5% skim milk powder in 1× TBS-T (50 mM Tris-HCl, pH 7.5, 150 mM NaCl, 0.1% Tween 20) at 4 °C overnight and then probed with rabbit anti-huntingtin antibody (Abcam, EPR5526, 1:5,000) or rabbit anti-vinculin antibody (Abcam, EPR8185, 1:5,000) in the same buffer for 1 hour at room temperature. The membranes were washed three times with 1× TBS-T and then three times with 2.5% skim milk powder in 1× TBS-T for 5 minutes each. The membranes were then probed with HRP-conjugated secondary antibodies (Abcam, 1:50,000) for 30 minutes at room temperature before being washed again and incubated with Immobilon Western Chemiluminescent HRP Substrate (Millipore) and imaged with a MicroChemi chemiluminescence detector (DNR Bio-Imaging Systems). Densitometry analysis was conducted using ImageJ. PolyG control bands were first normalized to the vinculin loading control, and this normalized polyG value was then used to normalize the uAb bands when quantifying degradation.

Viral phosphoprotein western blotting

HEK293T cells were harvested 48 hours after transfection and lysed using 1× RIPA buffer (Millipore) containing complete protease inhibitor (Sigma-Aldrich). The cells were incubated at 4 °C, rocking, for 40 minutes before being vortexed at 5-minute intervals for 20 minutes. Cell lysate supernatants were collected after centrifugation at 21,000g for 30 minutes at 4 °C. To denature samples for SDS-PAGE, cell lysates were mixed and incubated with 1.8% SDS containing 5% β-mercaptoethanol for 10 minutes at 95 °C before loading onto 10% acrylamide-Tris HCl gels. Proteins were separated at 100 V for 2 hours and then transferred onto 0.2-µm PVDF membranes at 0.5 A for 2 hours. Membranes were blocked in PBS with 0.2% Tween 20 (PBS-T) containing 4% BSA before staining in 1:1,000 dilutions of mouse anti-Flag (Millipore, F1804), mouse anti-β-actin (Santa Cruz Biotechnology, 47778) and rabbit anti-HA (BioLegend, 923502) primary antibodies. Secondary antibody staining was performed using 1:1,000 dilutions of goat anti-mouse Alexa Fluor 647 and goat anti-rabbit Alexa Fluor 488 secondary antibodies (Invitrogen, A21236 and A11008, respectively). Blocking, primary and secondary antibody membrane incubations were performed rocking at room temperature for 30 minutes, 1 hour and 30 minutes, respectively. Membranes were rinsed with PBS-T three times for 5 minutes after each antibody staining. All membranes were imaged using a Bio-Rad imager in respective Alexa Fluor channels. Densitometric quantification was performed using ImageLab for phosphoprotein and β-actin bands. Background densities from samples mock transfected with pCAGGS vector only were subtracted. Then, sample densities were normalized to their respective β-actin signals before normalization to their respective phosphoprotein controls. Data represent n ≥ 3 experimental replicates. 
Generation of bar graphs was performed using GraphPad Prism version 10, and the schematic diagram was made using BioRender (https://www.biorender.com/).

Immunofluorescent staining of viral phosphoprotein

Vero AT cells were seeded in 24-well plates to 90% confluency. After transfection and infection with HMPV, cells were washed twice with DPBS, fixed with 4% paraformaldehyde at room temperature and then permeabilized with 0.1% Triton X-100 in PBS. Custom polyclonal rabbit serum made against HMPV M was used for viral detection. After 1 hour, bound antibodies were detected with goat anti-rabbit secondary antibody conjugated with Alexa Fluor 488 (Invitrogen). Finally, the cellular nuclei were labeled with Hoechst (Thermo Fisher Scientific) in PBS for 10 minutes, and the images were examined using an ECHO Revolve microscope (BICO).

Cell culture for MESH1 degradation and ferroptosis protection

HEK293T cells were obtained from the Duke Cell Culture Facility and originated from the American Type Culture Collection (ATCC). The cells were cultured in DMEM 4.5 g l⁻¹ glucose and 4 mM glutamine (Thermo Fisher Scientific, 11995-DMEM) and 10% heat-inactivated FBS (HyClone, SH30070.03HI) in a humidified incubator at 37 °C with 5% CO₂. For MESH1 immunoblotting, HeLa cells originating from the ATCC were maintained in DMEM supplemented with 100 U ml⁻¹ penicillin, 100 mg ml⁻¹ streptomycin (Gibco) and 10% FBS. For uAb screening in reporter cell lines, 800 ng of pcDNA-uAb plasmids was transfected into cells in triplicate (3 × 10⁵ per well in a 12-well plate) with Lipofectamine 2000 (Invitrogen) in Opti-MEM (Gibco). Cells were harvested 72 hours after transfection for subsequent immunoblotting.

MESH1 western blotting

On the day of harvest, cells were detached by adding 0.05% trypsin-EDTA and washing cell pellets twice with ice-cold 1× PBS. Cells were then lysed using a 1:100 dilution of protease inhibitor cocktail (Millipore Sigma) in Pierce RIPA buffer (Thermo Fisher Scientific). Specifically, the protease inhibitor cocktail–RIPA buffer solution was added to the cell pellet, and the mixture was placed at 4 °C for 30 minutes followed by centrifugation at 15,000 r.p.m. for 10 minutes at 4 °C. The supernatant was collected immediately to pre-chilled PCR tubes and quantified using a Pierce BCA Protein Assay Kit (Thermo Fisher Scientific). Then, 20 μg of lysed protein was mixed with 4× Bolt LDS Sample Buffer (Thermo Fisher Scientific) with 5% β-mercaptoethanol in a 3:1 ratio and subsequently incubated at 95 °C for 10 minutes prior to immunoblotting, which was performed according to standard protocols. In brief, samples were loaded at equal volumes into Bolt Bis-Tris Plus Mini Protein Gels (Thermo Fisher Scientific) and separated by electrophoresis. iBlot 2 Transfer Stacks (Invitrogen) were used for membrane blot transfer, and, after a 1-hour room-temperature incubation in 5% milk–TBS-T, proteins were probed with rabbit anti-HDDC3 antibody (Sigma-Aldrich, HPA040895, diluted 1:1,000) or rabbit anti-vinculin (Invitrogen, 700062, diluted 1:2,000) for overnight incubation at 4 °C. The blots were washed three times with 1× TBS-T for 10 minutes each and then probed with a secondary antibody, donkey anti-rabbit IgG (H + L) (HRP) (Abcam, ab7083, diluted 1:5,000), for 1 hour at room temperature. After three washes with 1× TBS-T for 10 minutes each, blots were detected by chemiluminescence using an Invitrogen iBright CL1500 Imaging System. Densitometry analysis of protein bands in immunoblots was performed using FIJI software as described at https://imagej.nih.gov/ij/docs/examples/dot-blot/. 
In brief, bands in each lane were grouped as a row or a horizontal ‘lane’ and quantified using FIJI’s gel analysis function. Intensity data for the uAb bands were first normalized to the vinculin band intensity in each lane and then to the average band intensity for the polyG–uAb vector control cases across replicates.
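The two-step normalization can be written out as a short sketch. The function and argument names are hypothetical, and the inputs are assumed to be raw band intensities exported from the gel analysis, with one value per lane.

```python
def normalize_bands(target, vinculin, control_target, control_vinculin):
    """Normalize target bands to vinculin, then to the mean control ratio.

    target, vinculin: band intensities for the uAb lanes.
    control_target, control_vinculin: band intensities for the polyG
    control lanes across replicates.
    """
    control_ratios = [t / v for t, v in zip(control_target, control_vinculin)]
    control_mean = sum(control_ratios) / len(control_ratios)
    return [(t / v) / control_mean for t, v in zip(target, vinculin)]
```

A returned value of 1.0 corresponds to the same loading-corrected target level as the average polyG control; values below 1.0 indicate reduced target protein.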

Ferroptosis protection assay

HEK293T cells were reverse transfected with 1 μg of uAb plasmid and 3 μl of Mirus TransIT-LT-1 (Mirus Bio) transfection reagent for 48 hours, using the manufacturer’s standard protocol for a 12-well plate. The transfected HEK293T cells were then transferred to a 96-well plate at 2,500 cells per well, and 10 μM erastin (Cayman Chemical) was added to the media to induce ferroptosis. Cell viability was measured 24 hours later using the CellTiter-Glo (Promega) assay following the manufacturer’s protocol.

Statistical analysis and reproducibility

Unless otherwise noted, all data are reported as average values with error bars representing s.d. For samples performed in independent biological triplicates (n = 3) or more, statistical significance was determined by unpaired t-test, one-way ANOVA followed by a Dunnett’s multiple comparison test, paired one-sided Student’s t-test or Mann–Whitney U-test as indicated (*P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001; *****P < 0.00001). Quantitative immunofluorescence sample sizes were 1,954, 1,311, 1,218, 3,662, 3,079, 2,887 and 3,845 for the polyG control and MSH3_pMLM_1–6, respectively. All graphs were generated using GraphPad Prism 10 version 14.4.1 or in Matplotlib. No data were excluded from the analyses unless specifically noted otherwise. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.