A resource of RNA-binding protein motifs across eukaryotes reveals evolutionary dynamics and gene-regulatory function

https://www.profitableratecpm.com/f4ffsdxe?key=39b1ebce72f3758345b2155c98e6709c

Identifying eukaryotic RBPs

To identify RBPs across 690 eukaryotic organisms, we scanned protein sequences for well-characterized RBDs using HMMER57 with recommended parameter settings (‘full sequence E value’ (sequence_eval) ≤ 0.01 and ‘domain conditional E value’ (c-Evalue) ≤ 0.01). We used the RBD profile HMMs (pHMMs) CSD, KH_1, La, NHL, PUF, S1, SAM_1, YTH, zf-CCCH, zf-CCHC, zf-CCHH and zf-RanBP from the Pfam database58. For RRMs, we defined a new, longer pHMM as the Pfam RRM_1 pHMM does not include all residues that contact RNA. This extended RRM pHMM is based on PDB RRM structures, and its construction is described in Supplementary Note 1.

We grouped all identified RBPs into ‘RBP families’ by their domain architecture, that is, the type, number and order of the RBDs in their protein sequence (for example, RRM, RRM-RRM, KH-RRM and RRM-RRM-KH).

Calculating amino acid sequence identities

We used two separate methodologies for aligning the RBRs of RBPs and calculating their pairwise AA SIDs. The first method, ‘RBP family-wise AA SID calculations’, was used for calculating AA SIDs within RBP families, where all RBPs have the same RBD architecture. We generated an RBR sequence for each protein by concatenating its individual RBD sequences in order and applied clustalOmega59 with default settings to generate multiple sequence alignment for all RBR sequences within an RBP family. We defined the AA SID between each pair of RBRs in an RBP family as the proportion of exactly matched, aligned residues in the multiple sequence alignment.

We used a second method for individual pairs of RBPs, ‘RBP pairwise AA SID calculations’, to allow for comparisons between RBPs in different RBP families. Here, we defined the RBRs as the subsequence of the RBP that contains all of its RBDs plus up to 15 flanking amino acids before and after each RBD in the RBR. These flanks were added to include linker regions and C and N termini. To align two RBRs, we aligned each RBD (with flanks) in one RBR to each RBD in the other RBR using BLOSUM62 scoring (gap opening –11, gap extension –1; Needleman–Wunsch). For each RBD-to-RBD alignment, we computed the AA SID (number of exact aligned matches / total length). If the two RBRs had the same number of RBDs (for example, RRM1–RRM2 and RRM1–RRM2), we calculated the pairwise RBR AA SID as the mean AA SID of each pair of corresponding RRMs (for example, mean (RRM1 versus RRM1 AA SID, RRM2 versus RRM2 AA SID)). If the two RBRs had differing numbers of RBDs (for example, RRM1–RRM2 and RRM1–RRM2–RRM3), we computed the mean AA SID of all possible alignments of the shorter RBR to the longer RBR, only allowing alignments where adjacent RBDs are aligned (for example, RRM1–RRM2 can only be aligned to RRM1–RRM2 or RRM2–RRM3 of a three-RRM RBR). The maximum AA SID of the shorter RBR alignments is used as the pairwise RBR AA SID. This procedure is designed to account for duplication and deletion events during evolution of protein sequences.

To infer the RNA-binding specificity of an uncharacterized protein based on AA SID (for example, the ‘70% rule’), we identified the RNAcompete-measured RBP with the highest overall AA SID and used the RNAcompete RNA-binding profile of this protein as the RNA-binding profile for the uncharacterized protein. The AA SID between the two proteins was used as the confidence score for this prediction.

Selecting RBPs for RNAcompete

To select eukaryotic RBPs for characterization by RNAcompete, we considered RRM-, KH- and CCCH-containing RBPs from 45 well-annotated eukaryotes, representing model organisms and diverse species across the eukaryotic tree31, identified as described in ‘Identifying eukaryotic RBPs’. For this task, we used AA SIDs calculated using the RBP family-wise method. We used four different strategies (described in Supplementary Note 2) to select RBPs to maximize the utility of the dataset. Collectively, the procedure produced 277 diverse RBPs for experimental characterization using RNAcompete.

Measuring RNA sequence specificities with RNAcompete

RBP inserts (refer to Supplementary Table 1) were commercially synthesized (Bio Basic) for all 277 selected RBPs and cloned into a custom expression vector, pTH6838, using AscI and SbfI restriction enzymes sites32. Glutathione S-transferase (GST)-tagged RBPs were purified from Escherichia coli, and RNAcompete experiments were performed to determine their RNA-binding profiles. RNAcompete, a microarray-based in vitro RNA-binding assay, has been extensively detailed elsewhere17,32 and in Supplementary Note 3. Briefly, recombinant GST-tagged RBPs are incubated with ~241,000 designed (not randomized) RNA probes, each about 40 nucleotides long. RNA probes are designed to possess low probabilities for base pairing (that is, they are single stranded) and to represent each 7-mer at least 310 times. RBP–RNA complexes are affinity purified, and bound RNAs are extracted and labeled with Cy3 or Cy5. Abundance is measured on a custom Agilent 244K microarray.

Measured probe intensities were centered, and their variance was normalized for each protein (columns) and RNA probe (rows) to control for variation in RNA and protein concentrations. Next, a score was robustly computed for each 7-mer as the mean probe intensities of the inner 95% of probes that contain the 7-mer (the probes with the highest and lowest intensities were excluded). These 7-mer scores were transformed to z scores, using the mean and standard deviation of all 7-mer scores. The vector of z scores for 16,382 7-mers is referred to as the RNA-binding profile; two 7-mers (GCTCTTC and CGAGAAG) were removed because they correspond to the SapI/BspQI restriction site. For visualization purposes, position weight matrices were generated by aligning the ten most enriched 7-mers without allowing for gaps. A 7-mer sequence is considered to be ‘specifically bound’ by an RBP if its Benjamini–Hochberg-adjusted z-score q value is <0.01.

We evaluated RNAcompete experiments for the 277 selected RBPs according to previously described success criteria32 and identified a subset of 174 RBPs with successful RNAcompete experiments. Combining these experiments with those from a previous study17, we generated a set of 420 experiments covering 379 unique RBPs (across 381 unique constructs). For RBPs with more than one RNAcompete experiment, we calculated the mean of the z scores for the top ten 7-mers and used the experiment with the highest mean in subsequent analyses (Supplementary Table 1).

We additionally assessed variability in the 7-mer z scores for all 420 experiments using a bootstrap analysis. For each RNAcompete experiment, we sampled probes with replacement 100 times and recalculated the z scores. The mean and standard deviation for each 7-mer z score from the bootstrap samples are available at https://hugheslab.ccbr.utoronto.ca/supplementary-data/RBPZoo/.

JPLE

In JPLE, each RBP is assigned a fixed-length peptide profile vector p′, which contains the count of all 5-mer peptide templates (with four specified amino acids and one wildcard character, which cannot be in the first or fifth position) within any of its RBDs, including 15 flanking nucleotides on either side of each RBD (Fig. 2a). This ‘bag-of-peptides’ representation has been used previously for a number of protein function prediction tasks, including homology modeling28,60. Each RBP is also associated with a fixed-length RNA-binding profile vector r′ composed of its RNAcompete z scores for all RNA 7-mers (Fig. 2b), apart from GCUCUUC and CGAGAAG, which are target sequences of the restriction enzymes used in the assay. The row vectors p′ and r′ for the n = 355 training set RBPs were stacked to form matrices P n × p and R n × r, where p′ is the number of RBP 5-mers, and r′ is the number of RNA 7-mers. Each element of the matrix was divided by the Euclidean norm of its row, and each column was centered by subtracting the corresponding column mean (collected in vectors μP and μR for P′ and R′, respectively) from each element. We then removed columns with zero variance from both matrices. The final transformed matrices, P n × p and R n × r, have p and r columns, respectively, and consist of 355 stacked row vectors p and r, respectively.

The matrices P and R were concatenated column-wise to form the joint protein representation [PR] n× (p+r) (Extended Data Fig. 2a), to which singular value decomposition (SVD) was applied, giving

$$[{\it{P}}{\it{R}}]={\it{U}}{\varSigma}{{\it{V}}}\,^{\mathrm{T}}$$

where U n× rank([PR]), Σ rank([PR]) × rank([PR]), V (p+r) × rank([PR]) and T means transpose. The diagonal entries Σii of Σ are singular values σi, whereas the columns of U and V (that is, ui, vi) form the orthonormal basis of [PR] (UTU = VTV = Irank([PR])). Typically, when performing dimensionality reduction using SVD, the lowest σi are set to 0, effectively removing basis vectors ui and vi that explain the least variance in [PR]. In JPLE, however, we retain basis vectors contributing the most to the variance of R without regard to how much of the variance of P is explained. To compute the per-basis-vector contribution, we expressed the variance of R in [PR] as

$$\begin{array}{l}{\rm{Var}}({\it{R}})={\rm{tr}}({{\it{RR}}}^{{\rm{T}}})={\rm{tr}}({{\it{R}}}^{{\rm{T}}}{\it{R}})={\rm{tr}}({{\it{V}}}_{{\it{R}}}{\varSigma}\,{{\it{U}}}\,^{\rm{T}}{\it{U}}{\varSigma}{{\it{V}}_{\it{R}}}^{\rm{T}})\\={\rm{tr}}({{\it{V}}}_{{\it{R}}}{\varSigma}\,^{2}{{\it{V}}_{\it{R}}}^{\rm{T}})=\mathop{\sum}\limits_{i}^{{\rm{rank}}([{\it{PR}}])}{{\rm{\sigma }}_{i}}^{2}{{\mathbf{v}}_{\it{R},{\it{i}}}}^{\rm{T}}{{\mathbf{v}}_{\it{R},{\it{i}}}}\end{array}$$

where VR are the columns of V that correspond to those of R in [PR]. The top d basis vectors and singular values, ranked by their variance contribution to R in [PR], were retained. By performing SVD on R alone for the 355 training set proteins, we determined that 122 singular vectors were required to achieve a minimum Pearson correlation coefficient of 0.95 between their reconstructed (r*) and measured (r′) RNA-binding profiles (Extended Data Fig. 2b). These selected singular vectors explain 96% of the variance in both R alone (Extended Data Fig. 2b) and R in [PR] (Extended Data Fig. 2d), demonstrating that d = 122 is sufficient to capture most of the variance of R in [PR]. With this, the original SVD formulation can be written as the following approximation:

$${\mathbf{[}}{\it{PR}}{\mathbf{]}}\approx {{\it{U}}^{\prime}{\varSigma}^{\prime}{\it{V}}^{{\prime}{\mathrm{T}}}}$$

where U n × d, Σ d × d and V (p+r) × d. The rows of the matrix W = UΣ n × d each represent a latent embedding of one of the training set RBPs in the subspace spanned by the columns of V′.

Protein query

In a protein query, JPLE maps a peptide profile p to its reconstructed RNA-binding profile r* (Fig. 2c and Extended Data Fig. 2c). Given an RBP with an uncharacterized RNA specificity, JPLE computes its latent embedding wu by deconvolving its peptide profile pu according to a mixture matrix containing the orthonormal bases in VP′ (columns of V′ that correspond to those of P in [PR]), that is,

$${{\mathbf{p}}}_{{\rm{u}}}={{\it{V}}_{\it{P}}}^{\prime}{{\mathbf{w}}}_{{\rm{u}}}$$

The equation above can be solved to find the best least squares estimate wu* of the embedding that reproduces pu using the pseudoinverse of VP′:

$${{\mathbf{w}}_{\rm{u}}}^{*}={\left({{\it{V}}_{\it{P}}}^{{\prime}{\rm{T}}}{{\it{V}}_{\it{P}}}^{\prime}\right)}^{-{1}}{{\it{V}}_{\it{P}}}^{{\prime}{\rm{T}}}{{\mathbf{p}}}_{{\rm{u}}}$$

This embedding wu* can be associated with a reconstructed RNA-binding profile ru* using one of two different approaches. The first approach, termed global decoding, uses a linear mapping and is not used for protein queries, but an equivalent method is used for RNA queries and is described in the next section. The second approach, termed local decoding, is used for protein queries and reconstructs the RNA-binding profile based on training set RBPs with nearby embeddings, thereby implementing a nonlinear mapping. To do so, we first define the e-dist εui between an uncharacterized and training set RBP to be the cosine distance between their latent embeddings, that is,

$${\varepsilon }_{{\rm{u}}{i}}=1-{{\mathbf{w}}_{\rm{u}}}^{*{\rm{T}} }{{\mathbf{w}}}_{{i}}/{\rm{||}}{{\mathbf{w}}_{\rm{u}}}^{*}{\rm{||}}\;{\rm{||}}{{\mathbf{w}}}_{{i}}{\rm{||}}$$

where wi (ith row of W as a column vector) is the latent embedding of the training set RBPi. The e-dist around each uncharacterized RBPu was used with a radial basis function kernel (0 mean and γ = 25) to obtain the e-sim sui:

$${s}_{{\rm{u}}{i}}=\exp (-\gamma {{\varepsilon }_{{\rm{u}}{i}}}^{2})$$

For each uncharacterized RBPu, we constructed a set of N neighborhood training set proteins as follows, N = {i | sui ≥ 0.01}. The reconstructed RNA-binding profile ru* of the uncharacterized RBPu was computed as the average RNA-binding profile ri of all neighborhood training set proteins, weighted by their e-sim sui:

$${{\mathbf{r}}_{\rm{u}}}^{{*}}=(\sum_i{s}_{{\rm{u}}{i}}{{\mathbf{r}}}_{{\it{i}}})/(\sum_i{s}_{{\rm{u}}{i}})\forall i\in N$$

If | N | = 0, ru* is set to the RNA-binding profile of the training set protein with the highest e-sim. Overall, local decoding increases the impact of training set proteins with similar embeddings on the reconstruction while negating the impact of training set RBPs with more dissimilar embeddings.

RNA query

In an RNA query, JPLE maps an RNA-binding profile r to its reconstructed peptide profile p* (Fig. 2c and Extended Data Fig. 2c) using the global decoding approach. Although the preserved basis vectors in V′ account for 96% of the variance in the RNA specificities R, they only explain 44% of the variance in the protein peptide profiles P (Extended Data Fig. 2d). Thus, they are correlated with at least one of the RNA specificities, thereby characterizing functionally important peptides for RNA recognition.

To improve the representation of peptide sequences in the training set RBPs, we used a ‘data augmentation’ approach to train a version of JPLE for RNA queries. Specifically, we augmented the training set with RBPs that have high homology to those in the training set. These homologs lack RNAcompete measurements but likely have similar binding specificities to those measured by RNAcompete. This data augmentation strategy improved RNA queries but did not have a clear impact on protein queries, so it was not used for the version of JPLE trained for protein queries.

The added homologous RBPs were identified by using HMMER57 (E value of ≤1 × 10−15) to align RBP sequences containing RRM or KH domains to those with measured RNA specificities. Those uncharacterized RBPs with AA SIDs ranging from 50% to 99% to any measured RBPs were considered hits (that is, homologs). For a given training set RBP, its 5-mer counts were computed across the alignments of all hits. The resulting peptide counts, normalized by the number of distinct peptides at a given position, were used as a protein representation vector p+ for the given training set RBP. These augmented representations P+ were then concatenated with the measured RNA sequence specificities R+ to train JPLE, resulting in their latent embeddings W+ and orthonormal bases V+′.

Like a protein query, query RNA profiles were embedded into the latent space W+. However, the latent embedding w+u of the query profile was obtained by deconvolving its RNA profile ru using the mixture matrix V+R′:

$${{\mathbf{r}}}_{{\rm{u}}}={{\it{V}}_{+{\it{R}}}}^{{\prime}}{{\mathbf{w}}}_{+{\rm{u}}}$$

One can solve the equation above to compute an estimate, w+u*, of the latent embedding by multiplying both sides by the pseudoinverse of V+R′, giving

$${{\mathbf{w}}_{+{\rm{u}}}}^{{*}}={({{\it{V}}_{+{\it{R}}}}^{{\prime}\rm{T}}{{\it{V}}_{+{\it{R}}}}^{{\prime}})}^{-{{1}}}{{\it{V}}_{+{\it{R}}}}^{{\prime}\rm{T}}{{\mathbf{r}}}_{{\rm{u}}}$$

After computing this estimate, we used global decoding to map the estimated latent embedding w+u* to its reconstructed peptide profile pu*, representing the relative contribution of each peptide 5-mer to the input RNA specificities, as follows:

$${{\mathbf{p}}_{\rm{u}}^{\;{*}}}={{\it{V}}_{+{\it{P}}}}^{{\prime}}{{\mathbf{w}}_{+{\rm{u}}}}^{{*}}+{\mathbf{\upmu}}_{{\it{P}}}$$

Thus,

$${{\mathbf{p}}_{\rm{u}}}^{{*}}={{\it{V}}_{+{\it{P}}}}^{{\prime}}{({{\it{V}}_{+{\it{R}}}}^{{\prime}\rm{T}}{{\it{V}}_{+{\it{R}}}}^{{\prime}})}^{-{1}}{{\it{V}}_{+{\it{R}}}}^{{\prime}\rm{T}}{{\mathbf{r}}}_{{\rm{u}}}+{{\mathbf{\upmu }}}_{{\it{P}}}$$

Notably pu* is a linear function of ru, but the above model corresponds neither to ordinary linear regression (where, among other differences, V+R′ and V+P′ would be replaced with R+ and P+, respectively) nor to principal components regression, which would essentially correspond to using an embedding of ru to predict pu directly rather than w+u*, the embedding of pu. This is an important difference because w+u* only retains information about pu relevant to its RNA sequence specificity.

Alternative JPLE implementations and RNA-binding specificity prediction methods

Protein sequence representation models

We retrieved pretrained ‘Unirep’ and ‘Transformer’ (also referred to as ‘bert’ in the code repository) models reimplemented and trained as part of the tasks assessing protein embeddings (TAPE) framework33 and embedded the RBR sequences for the 355 training set RBPs using TAPE’s tape-embed command. We then used the 1,900- and 768-dimensional embeddings generated by the Unirep and Transformer models, respectively, to train two linear regression models with ridge regression (λ = 0.0001) to predict the RNA-binding profiles.

Affinity regression

Affinity regression is a conceptually similar machine learning approach designed for predicting RNA specificities of RBPs28. Instead of modeling the direct mapping between P and R, however, affinity regression learns the interaction, A, between RBP amino acid 4-mer counts P and RNA 5-mer counts D to reconstruct R during training,

$${{\it{DAP}}}^{{\rm{T}}}\approx {{\it{R}}}^{{\rm{T}}}$$

where D r × r, A r′ × p, P n × p and R n × r. Affinity regression was applied to the same set of 355 training set RBPs for direct comparison to JPLE. Supplementary Note 4 contains further information about affinity regression and our implementation of affinity regression.

Nearest neighbor model

For an RBP with uncharacterized RNA specificities, the nearest neighbor model computes the cosine similarity between its peptide profile and all 355 training set peptide profiles. The uncharacterized RBP adopts the RNA-binding profile of the closest training set RBP.

Protein–nucleic acid structure models

We evaluated the ability of RF2NA34 and AF3 (ref. 35) to differentiate between ‘binding’ and ‘nonbinding’ RBP–RNA interactions. For each of the 355 RBPs containing an RRM or KH domain, we generated both a ‘binding’ and ‘nonbinding’ set of RNA 7-mers. To do this, we ranked the 7-mers by their z scores and excluded the top and bottom 2.5% to account for potential artifacts. The top 50 7-mers were assigned to the ‘binding’ set, whereas the 50 7-mers closest to the median were designated as the ‘nonbinding’ set. This approach produced a total of 35,500 RBP–RNA pairs.

Subsequently, we used RF2NA to predict the three-dimensional (3D) structures of these RBP–RNA pairs. Following the RF2NA evaluation method34, we used the mean interface predicted aligned error (PAE) as a proxy for RF2NA’s predicted binding specificity, with lower values indicating higher binding specificities. For each RBP–RNA residue pair, we calculated its average PAE by taking the mean of the PAE from the RBP to the RNA residue and the PAE from the RNA to the RBP residue. We then determined the mean interface PAE by averaging these values across all interface residue pairs, defined as RBP–RNA residue pairs within 4.5 Å of each other. Structures lacking any interface residue pairs were excluded from the analysis, resulting in 35,479 mean interface PAE values.

We similarly used AF3 to predict the 3D structures and used the same methodology to compute the mean interface PAE for each RBP–RNA pair. In total, 35,213 mean interface PAE values remained after excluding structures without interface residue pairs.

PDB cocomplex structures

We retrieved entries of cocomplex structures that include both RRM-containing RBPs and RNAs from the PDB36. These 89 PDB entries contain a total of 156 RBP chains whose RRM domains were identified and extracted with HMMER57 using the standard Pfam RRM_1 profile HMM58. We identified 156 RRM domains, contained within 119 protein chains, with hmmsearch using default settings57. The RRM domain sequences were extended by 35 amino acids in both directions and merged if the sequences overlapped in the same protein chain, leading to 118 RRM-containing RBRs.

For each extracted PDB structure, we identified its interface protein residues as those with any carbon atoms within 5 Å of any RNA atom. The RBP-binding motifs, in IUPAC format, were defined as the connected RNA bases within 5 Å of an RBR carbon atom.

To assign RBPs with RNAcompete measurements to those within PDB structures, we computed pairwise AA SIDs and motif overlap. The measured RBP with the highest motif overlap (a minimum of 3.5 nucleotides) and at least 50% AA SID was selected as the best match. Given that most PDB structures have no more than two RRM domains in contact with RNA, some RNAcompete experiments were matched to multiple PDB structures. Subsequently, redundant PDB structures that share more than 70% AA SID were deduplicated by retaining the one with the longest protein sequence.

In total, 27 qualifying PDB structures were assigned a measured RBP; however, we reduced the set to 26 as one PDB structure contained only the third RRM of ELAVL1, which functions primarily as a dimerization domain61.

Assignment and assessment of RISs

We conducted RNA queries on all RNAcompete experiments using JPLE with leave-one-out cross-validation and obtained their reconstructed peptide profiles pu*, which were then stacked row-wise to form the matrix P* n × p. The matrix was standardized by its column mean and standard deviation.

For interface prediction, RISs were calculated for each PDB entry from the reconstructed peptide profile pi* of the matched RNAcompete-measured RBP. To assign an importance score to each residue in the PDB RBR, we divided each element in pi* by its number of occurrences in the RBR, and then, for each residue, we summed the values of the overlapping peptides in pi*.

As a baseline for interface characterization, we compared JPLE’s RISs to sequence conservation, along with a random forest model trained directly on selected PDB structures. Details of these methods are in Supplementary Note 5.

Reconstructing RNA sequence specificities for 690 eukaryotes

Using our JPLE model trained on all 355 RNAcompete-measured RBPs, we predicted RNA sequence specificities for RBPs across 690 eukaryotes. RBRs were identified across eukaryotic species as described in ‘Identifying eukaryotic RBPs’.

We used JPLE to perform protein queries on each of the ~88,000 detected RBRs with RRM and KH domains. Reconstructions with an e-dist of less than 0.127 were considered confident predictions, corresponding to an average PCC between reconstructed and measured RNA-binding profiles of 0.75 and a rolling average PCC of 0.70 at all levels of recall (Extended Data Fig. 4b). Moreover, RNA queries were conducted using the version of JPLE augmented by homologous sequence, as per above, and RISs were computed for all RBRs with confident RNA-binding profile predictions. Secondary structure profiles were computed with SCRATCH62 for all measured RBR sequences. RBR sequences of unmeasured RBPs were aligned to the measured RBPs, and the predicted secondary structure profile of the RBR with the highest AA SID was used for the visualization on CisBP-RNA.

Using JPLE e-dist to determine groups of RBRs with common motifs

To study the evolution of RNA motifs in eukaryotes, we sought to use JPLE to identify groups of RBPs with a conserved RNA motif. First, we investigated the relationship between e-dist and RNA motif similarity for RBPs with low sequence homology to the closest RNAcompete-measured RBP. To do so, we used a neighbor-joining algorithm to cluster these proteins into 50 sets of RBPs with high intragroup AA SID and low intergroup AA SID. After clustering, 80% of RBPs had a maximum intergroup AA SID of less than 30%. We trained 50 different JPLEs, each with one of the 50 groups held out, and evaluated how well e-dist within the held-out group correlated with the PCC between the pairs of held-out, low-sequence-homology RBPs. We found that pairs of held-out RBPs with an e-dist of <0.2 had an average PCC of 0.62 (Extended Data Fig. 2e), the lower bound of the PCCs of technical replicates (Fig. 1c). As such, we used an e-dist threshold of 0.2, along with sequence homology, to identify clusters of uncharacterized RBPs with a shared motif and derived from the same ancestor. Note that the 0.2 e-dist cutoff is the same as that used in the main text to estimate recall of JPLE in leave-one-out cross-validation, where it is associated with a PCC of 0.748. Here, the average correlation is lower because the reconstruction of the RNA profiles of the held-out, low-homology RBPs is more challenging.

Clustering RBPs into CRMGs and characterizing their last common ancestors

To define clusters of evolutionarily related RBRs with conserved RNA sequence specificity, we used a multistep clustering algorithm that incorporated JPLE latent distances, pairwise AA SID, a selected species tree and agglomerative clustering. We used TimeTree63 and the ETE Toolkit64 to select and extract evolutionary distances between 53 eukaryotic species that cover the evolutionary space between and within eukaryotic clades. Collectively, these species contain 8,957 RBR sequences that are exclusively composed of RRM and KH domains. We used a global alignment (BLOSUM62 and gap penalty 11, 1 (open, extension), Needleman–Wunsch) to compute pairwise AA SIDs between all pairs of RBR sequences (see ‘Calculating amino acid sequence identities’). Between RBPs with the same or one domain difference, however, we computed sequence identities using the shorter sequence length as a reference to account for homologs that lost a single domain. For all others, we used the longer sequence as a reference, assuming that orthologs generally do not gain or lose more than one domain.

We constructed a ‘highest homology network’ between pairs of RBRs in different species. The ‘highest homology network’ contains one node for each of the 8,957 RBR sequences, and we connected each RBR to the RBR with the highest AA SID in each of the other 52 species. This highest homology network is used as a scaffold for our clustering, thereby ensuring that we only combine RBR sequences likely to be descended from a common ancestral protein into CRMGs. We then remove any links in the highest homology network between RBRs with an e-dist of greater than 0.2. This created a set of 2,463 connected components of the filtered network; these components are used as the initial set of CRMGs. Each pair of RBRs in a cluster are thus connected by at least one path where each link is both a highest homology link and has an e-dist of <0.2. Examining this initial draft of the CRMGs, we identified potential false positives and further refined our set of CRMGs using the process described in Supplementary Note 6. After refinement, we had 2,568 CRMGs, each of which was assigned to a clade represented by the ancestral node in the species tree by inferring the most recent common ancestor of all extant species with RBPs in the CRMG.

Identifying putative A. thaliana stability regulators and deadenylation assay

For the 101 A. thaliana RBPs with an RNAcompete-measured (5) or JPLE-reconstructed (96) RNA-binding profile, we calculated a score for all possible RNA 6-mers to compare to the half-life scores from Narsai et al.47. To calculate the score for a given 6-mer, we took the mean of the z scores for all 7-mers containing the 6-mer (Supplementary Table 7).

RNA sequences

The WT RNA sequence was designed to contain the highest-scoring 6-mer (AAUAAG) bound by the CID homologs, as well as the second- and third-highest-scoring 6-mers, GAAUAA and AAUAAA (Extended Data Fig. 7a). The C-rich MUT RNA sequence was designed not to contain high-scoring CID homolog 6-mers. Both sequences were checked to ensure that they would not form RNA secondary structures or G-quadruplexes using RNAfold65 and the QGRS Mapper66, respectively.

DNA constructs

NOT6 and NOT7 constructs were described previously52,53. The pETM11 plasmid with the gene encoding full-length (residues M1-V636) PABPC1 (UniProt ID P11940) was obtained from Addgene (146642). The gene encoding full-length CID8 (residues M1-N314; UniProt ID Q9C8M0) was cloned into an in-house vector (pnEK-NSupH) in a frame with the N-terminal His6-SUMO tag using the Gibson assembly method.

Protein production and purification

The production and purification of the NOT6–NOT7 deadenylase heterodimer was performed as previously described, with minor adjustments52,53. The protocol is described in detail in Supplementary Note 7.

PABPC1 was produced in BL21(DE3) Star E. coli cells using 4 l of LB medium at 30 °C for 5 h. To induce production, 1 mM IPTG was added after reaching an optical density at 600 nm of 0.3. The cells were collected and resuspended in a lysis buffer containing 50 mM HEPES/NaOH (pH 7.0), 1,000 mM NaCl, 5% (wt/vol) glycerol and 25 mM imidazole and lysed by sonication. The lysate was clarified by centrifugation at 40,000g for 45 min and loaded on a 5-ml nickel-charged IMAC column (Cytiva). The bound protein was washed with 20 column volumes of lysis buffer and eluted from the column in 1-ml fractions with the same buffer supplemented with 250 mM imidazole. Peak fractions from nickel affinity chromatography were then pooled together and incubated with TEV protease at 4 °C overnight to cleave off the His6 tag. The next day, PABPC1 was loaded and eluted on a Superdex 200 26/600 column (Cytiva) equilibrated in a buffer containing 50 mM HEPES/NaOH (pH 7.0), 200 mM NaCl, 5% (wt/vol) glycerol and 2 mM DTT. The peak fractions were then pooled together, concentrated to ~10 mg ml−1, flash-frozen in liquid nitrogen and stored at –80 °C until use.

CID8 was produced in BL21(DE3) Star E. coli cells using 2 l of autoinduction medium at 20 °C overnight. The cells were collected and resuspended in a lysis buffer containing 50 mM potassium phosphate (pH 7.5), 1,000 mM NaCl and 25 mM imidazole and lysed by sonication. The lysate was clarified by centrifugation at 40,000g for 45 min and loaded on a 5-ml nickel-charged IMAC column (Cytiva). The bound protein was washed with 20 column volumes of lysis buffer and then eluted from the column in 1-ml fractions with the same buffer supplemented with 250 mM imidazole. Peak fractions from nickel affinity chromatography were then loaded and eluted on a Superdex 200 26/600 column (Cytiva) equilibrated in a buffer containing 20 mM HEPES/NaOH (pH 7.5), 300 mM NaCl and 2 mM DTT. The peak fractions were then pooled together, concentrated to ~2 mg ml−1, flash-frozen in liquid nitrogen and stored at –80 °C until use.

Deadenylation assays and quantification

Deadenylation reactions were performed in a buffer containing 20 mM PIPES (pH 7.0), 40 mM NaCl, 10 mM KCl and 2 mM magnesium acetate at 37 °C. Assays were performed as described previously53 with the following modifications. To the 5′-fluorescein-labeled WT or MUT RNA substrate (50 nM; biomers.net; for sequences, see Extended Data Fig. 7a) were added PABPC1 (50 nM) and His6-SUMO-CID8 at 1:1, 10:1 and 40:1 ratios relative to PABPC1 and incubated on ice for 15 min. To start the reaction, 100 nM NOT6–NOT7 deadenylase heterodimer was added. To stop the reaction at the corresponding time point, 3-fold reaction volumes of RNA loading dye were added (95% (vol/vol) deionized formamide, 17.5 mM EDTA (pH 8) and 0.01% (wt/vol) bromophenol blue). The products were resolved on a denaturing TBE–urea polyacrylamide gel, which was subsequently imaged using a Sapphire FL Biomolecular Imager (Azure Biosystems).

The deadenylation products were analyzed and quantified using Azure Spot Pro (Azure Biosystems). The final metric represents the normalized stability fold change of PABPC1 footprints in the presence of CID8 on WT and MUT RNAs.

Briefly, the signal intensities of the PABPC1 footprint on WT and MUT RNAs were quantified at the 32-min time point, with and without a 40-fold excess of CID8 (Extended Data Fig. 7b,c, lanes 9 and 23). Quantification was localized within a defined region corresponding to the footprint. For each sample, the signal of the PABPC1 footprint was normalized as a fraction of the total signal intensity within the corresponding lane. The normalized fraction value for PABPC1 in the presence of CID8 was then divided by the normalized fraction value for PABPC1 alone, yielding a stabilization fold for both WT and MUT RNAs. The above steps were performed in triplicate for both WT and MUT RNAs to ensure reproducibility. Measurements and calculations are shown in Supplementary Table 8. An unpaired two-sided t-test was performed to compare the stabilization fold change between WT and MUT RNAs and to assess whether CID8 exerted a significantly different stabilizing effect on the two RNA species. The graph was generated using GraphPad Prism 10.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button