Multiplexed profiling of transcriptional regulators in plant cells

Optimizing magnetic nucleus sorting as a reporter assay

To measure the activity of multiple protein-coding elements in parallel, we sought to develop an MPRA based on separating nuclei by the level of expression of a reporter transgene (Fig. 1a). By placing this reporter under the control of a protein-coding library of regulators (that is, trans-regulatory elements), we aimed to sort nuclei into two fractions: a pull-down fraction enriched with regulators that activate the reporter, and a flow-through fraction enriched with repressors and/or inactive elements. Next-generation sequencing is used to quantify the relative enrichment of each trans-element in each fraction, which is expected to be proportional to their activity (that is, activators or repressors) and strength (Fig. 1a). As the expression platform, we used Agrobacterium-mediated transient expression in N. benthamiana, also known as agroinfiltration (Fig. 1a). Crucially, it was previously shown that transgene expression in this system follows Poisson-like statistics26,27. This makes it possible to minimize the number of cells expressing more than one trans-element from the library by infiltrating Agrobacterium cultures at a very low density. Single transgene expression can mitigate unwanted interactions that may occur when different trans-elements are expressed in the same cell.

Fig. 1: Design and optimization of an assay to measure the activity of trans-acting elements in parallel in plant cells.

a, Overview of the ENTRAP-seq assay. A library of trans-acting elements, such as TFs of varying strengths, is cloned in binary vectors for Agrobacterium-mediated transformation. These trans-elements regulate a promoter driving the expression of an outer nuclear envelope tag (ONEtag) containing a biotin acceptor peptide. A transgene for the constitutive expression of the E. coli biotin ligase BirA is also included to biotinylate the ONEtag. The library as well as the ONEtag reporter and the BirA transgene are transformed into Agrobacterium and mixed cultures are infiltrated into leaves of N. benthamiana. The ONEtag is then biotinylated by constitutively expressed BirA using endogenous biotin. Next, nuclei are isolated from infiltrated leaves and incubated with magnetic streptavidin-coated beads, allowing for magnetic sorting. DNA is extracted from the unsorted and sorted nuclei samples and the abundance of each library element is determined by sequencing the T-DNAs. The enrichment of a given trans-element in the sorted versus unsorted population correlates with its activation strength. b, Live cell fluorescence microscopy images of N. benthamiana epidermis cells expressing a nuclear marker (H2B-RFP) and different ONEtag candidate versions fused to GFP. Scale bars 20 μm. Shown on top of each set of images is a schematic of the ONEtag variant tested. At least three cells were imaged for each construct to confirm consistent localization. c, Schematic of the construct used in d and e to compare promoter activity (RFP fluorescence) with ONEtag accumulation (GFP fluorescence). d, Scatter plot showing RFP (x-axis) and GFP (y-axis) fluorescence of leaves expressing the construct in c containing the SlWIP1-C ONEtag combined with a second strain carrying Gal4DBD or Gal4-VP16. e, Same as d for a ONEtag containing SlWIP2-C. f, Schematic of the experiment to test the sensitivity of the SlWIP2-C ONEtag. Gal4DBD fusions to effector domains of Arabidopsis TFs of varying strengths were used to drive the expression of the reporter. g, Scatter plot of the RFP (x-axis) and GFP (y-axis) fluorescence from leaves expressing the constructs in f. Open circles show mean ± s.d.; r, Pearson’s correlation coefficient. In d, e and g, n = 24 biological replicates (3 plants, 2 leaves per plant, 4 leaf punches per plant). Panel a partially created with BioRender.com.

To link reporter expression to nucleus sorting, we used a chimeric tag composed of a fusion of a biotin acceptor peptide to GFP and an outer nuclear envelope targeting domain (hereafter ONEtag) that can be biotinylated by a coexpressed BirA transgene using endogenous biotin. Nuclei extracts incubated with streptavidin magnetic beads can then be sorted using magnets. Similar approaches have been used to purify nuclei from specific cell types in transgenic plants28, but there are no reports of magnetic sorting of transiently transformed N. benthamiana tissues. We hypothesized that transiently transformed nuclei could be sorted into two fractions: a pull-down fraction enriched in transfer DNAs (T-DNAs) encoding activators, and a flow-through fraction depleted of strong activators and enriched in T-DNAs coding for repressors and neutral elements (Fig. 1a). For this approach to yield quantitative differences in enrichment, magnetic nucleus sorting must be sensitive enough to detect continuous differences in ONEtag expression levels among cells, rather than merely distinguishing between nuclei that do and do not express the ONEtag, as has been previously done to isolate cell type-specific nuclei29. To optimize magnetic nucleus sorting as a quantitative readout of transcriptional activity in agroinfiltrated N. benthamiana, we pursued two design goals. First, the transiently expressed ONEtag should localize to the nuclear envelope, as determined by the GFP subcellular localization. Second, as for any sensitive quantitative reporter, the level of ONEtag accumulation must correlate with the transcriptional activity of the promoter driving its expression.

As the first candidate ONEtag, we tested the AtRanGAP WPP domain, previously used to label nuclei in specific tissues in Arabidopsis. To confirm ONEtag subcellular localization, we coexpressed a nuclear marker (pUBQ10::H2B:mScarlet, hereafter H2B-red fluorescent protein (RFP)). When transiently expressed in N. benthamiana, the WPP tag did not localize to the nuclear envelope, as indicated by its failure to encircle the H2B-RFP signal (Fig. 1b). A similar pattern was observed for the WPP domain of the closest N. benthamiana RanGAP homolog. Since the C terminus of WPP proteins has also been used to target the nuclear envelope, we tested the C-terminal half of the N. benthamiana homolog of AtWIP1, which also failed to localize to the nuclear envelope. We next screened seven domains known to localize to the outer nuclear envelope in their native contexts, including the C termini of two tomato WIP proteins. Several candidates showed correct localization in N. benthamiana (Extended Data Fig. 1a), and we selected the C-terminal halves of WIP1 and WIP2 (SIWIP1/2-C) for further characterization (Fig. 1b).

We next focused on the second design criterion, that is, ensuring that the degree of ONEtag protein accumulation correlates with the transcriptional activity of the promoter driving its expression. To measure promoter activity independent of ONEtag protein abundance, we used an RFP-T2A-ONEtag fusion (Fig. 1c). RFP fluorescence intensity served as a proxy for promoter activity while GFP fluorescence intensity provided a metric for ONEtag accumulation. To drive the RFP-T2A-ONEtag reporter, we placed this transgene under the control of a minimal 35S promoter with 6 upstream Gal4 DNA-binding sites (UAS-min35S::RFP:T2A:ONEtag). To activate expression of this reporter, we coinfiltrated it in N. benthamiana with previously generated strains containing the Gal4 DNA-binding domain (Gal4DBD) alone or fused to the strong VP16 activator (Gal4-VP16)10. As expected, the level of RFP fluorescence was higher in leaves coexpressing the SlWIP1-C ONEtag and Gal4DBD-VP16 compared to Gal4DBD (Fig. 1d). However, GFP fluorescence was comparable between Gal4DBD-VP16 and Gal4DBD, showing that the increase in transcriptional activity driven by the VP16 fusion did not result in a commensurate increase in SlWIP1-C ONEtag protein accumulation (Fig. 1d). Cell microscopy confirmed that the ONEtag and RFP proteins are efficiently separated after being produced from the same transcripts (Extended Data Fig. 1b). While removal of biotin ligase recognition peptide did not relieve the buffering effect, removing SlWIP1-C was sufficient for GFP fluorescence to increase in response to VP16, similar to RFP (Extended Data Fig. 1c,d). We therefore replaced SlWIP1-C with SlWIP2-C. For this ONEtag, GFP and RFP accumulated linearly in response to VP16 showing that the accumulation of this tag closely correlates with the activity of the promoter driving it (Fig. 1e). Removing RFP-T2A from the constructs made accumulation of SlWIP1-C ONEtag slightly responsive to Gal4-VP16 but to a far lesser degree than SlWIP2-C (Extended Data Fig. 1e). To confirm that the accumulation of the SlWIP2-C ONEtag tracks with promoter activity across a large dynamic range, we coexpressed it with eight previously generated Gal4DBD fusions spanning a wide range of transcriptional strengths10. Across combinations, RFP and GFP fluorescence correlated well (R2 = 0.91, Fig. 1f,g and Supplementary Fig. 1). Hence, accumulation of the SlWIP2-C ONEtag is highly sensitive and quantitatively reflects the transcriptional activity of the promoter driving its expression.

Small-scale library validation of ENTRAP-seq

Having optimized the purification tag, we next sought to validate that our approach can be used to measure the activity of trans-acting elements. For enrichment in the nucleus pull-down fraction to serve as a metric of activity, the probability of pulling down a given nucleus should be correlated with its level of ONEtag expression. Further, for this pull-down bias to yield a quantitative metric of reporter expression, the correlation between promoter activity and ONEtag abundance that we observed at the tissue level should also apply in individual nuclei. The ultimate goal of ENTRAP-seq is to serve as a proxy measure of the level of expression of a reporter. Hence, we sought to benchmark ENTRAP-seq by comparing its output (that is, TF enrichment in the pull-down based on next-generation sequencing) against that of a well-established fluorescence reporter assay.

To demonstrate that ENTRAP-seq meets these criteria, we characterized a control set of 26 previously generated synthetic TFs consisting of 53 amino-acid (aa) fragments—hereafter referred to as ‘tiles’—fused to Gal4DBD19 (Supplementary Dataset 1) and spanning a wide range of activity, from repressors to strong activators19. We cloned these Gal4DBD fusion tiles under the constitutive pMas promoter in an Agrobacterium T-DNA expression plasmid containing the SlWIP2-C UAS-min35S::RFP:T2A:ONEtag construct and a constitutive BirA transgene (Fig. 2a). All three cassettes were placed in the same T-DNA to ensure that biotinylated ONEtag expression occurs exclusively in cells that express a synthetic TF. We used this backbone (pSA676v2) for all subsequent ENTRAP-seq experiments and fluorescence reporter assays involving Gal4DBD fusions. We previously determined that a low Agrobacterium OD (optical density at 600 nm) of approximately 0.0025 is optimal for expressing libraries of mixed trans-elements in N. benthamiana as it maximizes the number of transformed cells while minimizing the fraction of these cells that express more than one transgene26. At this OD, approximately 10% of leaf cells express a transgene and about 90% of these cells express a single one. For the following nucleus pull-down experiments, we therefore infiltrated the 26 control strains at a total mix OD of 0.0025. To purify and sort nuclei, we developed a protocol based on pipelines optimized for other plant species, with several modifications29,30 (Methods).

Fig. 2: Validation of ENTRAP-seq as a proxy metric of reporter expression levels.
figure 2

a, T-DNA constructs used to validate ENTRAP-seq. A synthetic promoter containing 6× Gal4-UAS sites followed by the minimal 35S promoter is used to drive the ONEtag fused to RFP by a self-cleaving T2A peptide. Downstream of the ONEtag, the plasmid contains a constitutive BirA expression cassette and a constitutively expressed fusion between the Gal4DBD and variable 53-aa tiles of various strengths. b, Infiltration setup for nucleus pull-down and microscopy experiments shown in c and d. Strains carrying the 26 constructs from a are grown separately and mixed in equal OD ratios before leaf infiltration at a final total OD of 0.0025. This mix is combined with a strain carrying a constitutive H2B-BFP nuclear marker infiltrated at OD 0.1. This tissue is then subjected to magnetic nucleus pull-down and the BFP signal is used to identify the nuclei of transformable leaf cells. c, Distribution of ONEtag expression levels in nuclei from different purification fractions. Histograms are normalized to have the same area. Nuclei with low GFP levels are enriched in the flow-through and there is a progressive enrichment in the pull-down with increasing GFP expression. Brackets show the P values of one-sided Mann–Whitney U-tests between histograms. d, Scatter plot of RFP versus GFP fluorescence in single nuclei in the pull-down fraction from c. e, Schematic of the experiment to compare ENTRAP-seq enrichment with fluorescence reporter assays. Left: the control set of 26 tiles from a was infiltrated at a low OD for nucleus pull-down and sequencing of the flow-through and pull-down fractions. Right: each of the strains in a is infiltrated individually at an OD of 0.2. The activity of each tile is determined by measuring the RFP fluorescence of infiltrated leaves in a plate reader. f, x-axis, transcriptional activity of each tile based on ENTRAP-seq (mean ± s.d. enrichment of each tile T-DNA in the nuclei pull-down fraction compared to the flow-through sample). y-axis, tile transcriptional activity measured individually using a fluorescence plate reader (mean ± s.d., RFP fluorescence). For nuclei pull-down and sequencing n = 4 biological replicates. For plate reader assays n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf). , Spearman’s correlation coefficient.

To fluorescently label transformable nuclei in the different purification fractions, we infiltrated the tile mix with a strain carrying a pUBQ10::H2B:BFP transgene (BFP, blue fluorescent protein, hereafter H2B-BFP) at OD 0.1, which ensures homogeneous and strong labeling across cells (Fig. 2b). Using the H2B-BFP nuclear marker, we counted nuclei in the pull-down and the flow-through fractions and calculated their level of ONEtag expression by measuring GFP fluorescence (Methods). Using this approach, we counted approximately 300,000 nuclei in the pull-down fraction starting from 10 g of fresh tissue (about 20 fully infiltrated leaves) (Extended Data Fig. 2a). The flow-through fraction contained approximately 10 million nuclei (Extended Data Fig. 2a). These numbers are within a factor of 2 from an estimate of the total number of transformed cells in the sample (estimation in Extended Data Fig. 2b). As shown in Fig. 2c, nuclei in the pull-down had a higher level of ONEtag expression than those in the flow-through, while a second pull-down showed an intermediate distribution (Fig. 2c). These distributions show a large degree of overlap such that the relative abundance in the pull-down with respect to the flow-through increases with ONEtag abundance (Fig. 2c). This feature is critical for the pull-down-to-flow-through ratio to serve as a continuous metric of transcriptional strength. We obtained qualitatively similar results in a biological replicate of this experiment (Extended Data Fig. 2c). Comparing mCherry and GFP fluorescence at the single nucleus level revealed strong correlation between reporter activity and ONEtag accumulation (Fig. 2d, Pearson r = 0.83), further validating the premise of our method.

Next, we sought to demonstrate that ENTRAP-seq recapitulates reporter activity as measured by a low-throughput method. To this end, we asked whether the degree of pull-down enrichment of each of the 26 control tiles correlates with their transcriptional strength measured using an established reporter assay. We infiltrated each construct individually and used a fluorescence plate reader to measure the tissue RFP and GFP fluorescence normalized to a constitutive BFP strain. This ratio was then expressed as a function of the no-tile Gal4DBD control. In contrast to the mixed infiltrations for nucleus pull-down, single tile infiltrations for fluorescence measurements were done at OD 0.2 to ensure a strong fluorescence signal (Fig. 2e and Methods section ‘Fluorescence plate reader reporter assays’). The basal fluorescence of the min35S promoter is close to leaf autofluorescence, hindering the detection of transcriptional repression in fluorescence reporter assays (Supplementary Fig. 2). Hence, in the case of repressor tile strains we coinfiltrated them with a third construct containing a synthetic promoter with embedded Gal4 binding sites, which has a four-times higher basal expression level than min35S31 (Supplementary Fig. 2). To measure the ENTRAP-seq enrichment of these tiles we combined the corresponding 26 Agrobacterium cultures in equal OD ratios and infiltrated the mix at a final OD of 0.0025. Nuclei purified from these leaves were subjected to magnetic sorting and total DNA was extracted from the pull-down and flow-through fractions. Amplicon sequencing of the T-DNA tile region was used to determine the relative abundance of each tile in each fraction to obtain their level of enrichment. Tile enrichment correlated very well with promoter activity as measured by leaf RFP fluorescence (Spearman  = 0.97), validating sequencing enrichment as a proxy for reporter expression, at least at this library size (Fig. 2f). As expected, leaf GFP fluorescence also correlated well with tile enrichment (Extended Data Fig. 2d, Spearman  = 0.97) and with RFP fluorescence (Extended Data Fig. 2e, Pearson r = 0.96).

Overall, these results demonstrate that by linking reporter activity to relative pull-down enrichment, ENTRAP-seq provides a method to characterize trans-element activity in plant cells.

Accuracy and reproducibility in large libraries of viral protein tiles

We next sought to leverage ENTRAP-seq to discover new potential transcriptional regulators. Plant pathogen attack is usually linked to widespread transcriptomic changes in the host through the action of pathogen-encoded transcriptional regulators32. There are well-known examples of such regulators delivered by pathogenic bacteria into plant cells, but how plant viruses manipulate host transcription is much less understood33. In the case of DNA viruses, transcriptional regulation of the viral genome is part of the pathogenic cycle. Plant viruses have very small genomes (typically in the order of 10–15 kilobase pairs, Extended Data Fig. 3a) that encode just a few proteins (normally about 4, Extended Data Fig. 3b), none of which resemble known TFs in terms of sequence homology and presence of known TF domains34. Notably, many of the most widely used genetic parts in plant and mammalian genetic engineering—for example, the VP16 AD, the 35S promoter and the SV40 NLS—originate from plant and animal viruses. More broadly, a better molecular understanding of the viral transcriptional arsenal could help devise new crop protection strategies.

To identify potential viral transcriptional regulators, we searched the plant viral proteome for putative ADs. Throughout this study, we adopted a strictly functional definition of ADs, namely, short peptides that are sufficient for strongly activating transcription of a promoter when recruited via a heterologous DBD. The fact that a peptide of viral origin fits this definition does not necessarily indicate that its protein of origin is involved in gene regulation, nor that this peptide is required for regulating transcription within its protein context. Indeed, many potent ADs are found in nonnuclear proteins that lack known gene regulatory functions, such as metabolic enzymes in mitochondria and chloroplasts19. A peptide may adopt a different structural conformation when isolated and fused to a DBD compared to its native protein context. As a result, tiles derived from TFs can sometimes behave as ADs in reporter assays even though they seem dispensable for the transactivation function of the TF itself. This functional definition is meant to emphasize that AD identification in reporter recruitment assays is but a first step toward fully characterizing the regulatory function of a protein.

As of the time of writing, the National Center for Biotechnology Information (NCBI) database contained 1,495 plant viral genomes and 5,607 proteins. To maximize sequence diversity, we first clustered the protein sequences at 90% sequence identity, yielding 4,537 representative nonredundant proteins. Each sequence was then tiled using 53-aa windows and 10-bp overlaps (Fig. 3a). To estimate the likelihood of each tile being a putative AD, we used PADDLE, a deep convolutional neural network that predicts acidic ADs14. The convolutional layer enables PADDLE to identify sequence-to-function relationships, especially within disordered sequences where traditional structure-based or homology-based comparative genomic algorithms fail. Although the model was trained on data from a yeast-based screen using yeast protein sequences, acidic ADs are evolutionarily conserved and we previously demonstrated that PADDLE can identify ADs from plants19. Only tiles with the highest predicted z-scores were selected for experimental testing (Fig. 3a). Finally, we further optimized protein diversity by selecting sequences that maximized the Hamming distance between binary protein encodings (Fig. 3a).

Fig. 3: Generation of nested libraries of putative viral ADs to test assay scale and reproducibility.
figure 3

a, Schematic of the computational pipeline to generate libraries of putative transcriptional ADs from plant viruses. From the top, protein open reading frame sequences from genomes of all plant viruses in NCBI were deduplicated and clustered using 90% sequence identity and then fragmented into 53-aa tiles with 10-aa overlaps. PADDLE was used to predict the transactivation strength of these tiles. The top 2,000 tiles with the highest predicted activity were selected (PADDLE z-score >4.9). Protein sequences were binary encoded, and a sequence diversity maximization algorithm was applied to iteratively select sequences that maximized the average Hamming distance from those already included in the subset. This approach enriched for sequence-level diversity in the selected proteins to generate nested libraries of 250, 500, 1,000 and 2,000 tiles. b, Tile activity as measured by ENTRAP-seq (x-axis, sequencing-based enrichment) plotted against the mean ± s.d. tile activity measured individually using a fluorescence reporter assay (y-axis). Two biological replicates of the sequencing experiment are shown (orange and black circles) for libraries of 250, 500, 1,000 and 2,000 viral tiles. Blue lines show a sigmoidal fit to the data with the shaded blue area indicating ± 1 s.d. of residuals as an empirical confidence bound. Horizontal dashed lines indicate the middle point of the sigmoid fits. Vertical dashed lines indicate the enrichment value corresponding to the sigmoid midpoint +1 s.d. This value was taken as the lower bound of enrichment to classify tiles as activators. n for plate reader assays was 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf). c, Reproducibility between biological replicates (Pearson’s correlation in enrichment, y-axis) as a function of the absolute number of tiles that pass a threshold number of mapped reads (x-axis). The set of tiles counted on the x-axis was generated by imposing coverage cutoffs such that only tiles with more counts than an arbitrary and variable threshold (color bar) were considered to calculate the Pearson correlation. d, Reproducibility across different libraries. The activity of each tile was first averaged across both biological replicates of each library of a given size. Scatter plots show the activity of tiles that are shared between libraries of different sizes. The Pearson correlation (r) between libraries is shown.

Although our nuclei enrichment method performed well on the control set of 26 tiles (Fig. 2f), the scalability of ENTRAP-seq to larger libraries was not immediately clear. To address this question, we created 4 nested libraries of 250, 500, 1,000 and 2,000 viral tiles (Fig. 3a and Supplementary Dataset 2). All except for the 2,000-tile library also contained the 26 control tiles for which we had previously obtained single infiltration fluorescence data (Fig. 2f). For the 2,000-tile library, a new set of 26 viral tiles was measured individually in a fluorescence plate reader (Supplementary Dataset 1). These internal controls allowed us to validate the output of the assay by comparing the activity of these tiles in terms of reporter fluorescence and tile enrichment (as in Fig. 2f). Across all libraries, we found that the activity of a given control tile as measured individually in a fluorescence plate reader strongly correlates with its enrichment following magnetic pull-down, for two biological replicates of each library (Spearman  = 0.72–0.97) (Fig. 3b). These validation experiments also allowed us to obtain a calibration value to classify tiles as activators based on their sequencing-based enrichment. To this end, we fitted the validation data to a sigmoidal curve and calculated the enrichment value corresponding to the midpoint of this sigmoid plus its estimation error (Fig. 3b). This yielded an enrichment value of approximately 1.0–1.1, which was used as a cutoff to classify viral tiles as activators in subsequent analyses (Fig. 3b).

We found no consistent correlation between tile coverage in the library and tile enrichment ratio (Supplementary Fig. 3). When the enrichment of all tiles was considered, we found that biological replicates correlated well (Pearson r = 0.5–0.55) and this correlation increased substantially to r = 0.7–0.8 by using coverage cutoffs to filter out tiles with low representation in the library (Fig. 3c). Because of the nested nature of the libraries, it was also possible to compare tile enrichment across libraries. This revealed a high degree of reproducibility even across libraries (Pearson r = 0.59–0.8 without applying coverage cutoffs) (Fig. 3d). Thus, our assay is accurate and reproducible in libraries of up to 2,000 tiles.

Although the ENTRAP-seq signal correlates well with fluorescence reporter assays (Figs. 2f and 3b), we wished to understand to what extent ENTRAP-seq can be used as a method to resolve strength differences between TFs based on enrichment alone. We therefore sought to determine the assay resolution by estimating the measurement error in the largest tested viral library of 2,000 tiles. We define resolution as the minimum enrichment difference between two proteins to be able to classify them as activators of different strengths with a statistical significance of P ≤ 0.05. To this end, we leveraged replicate data to obtain a distribution of within-protein noise and estimate the error (Extended Data Fig. 4a, see Supplementary Information for details). Using 2 s.d. of the within-protein noise as a significance cutoff we found the resolution of ENTRAP-seq to be ~0.45 enrichment units for a single biological replicate of a 2,000-tile library (Extended Data Fig. 4b, see Supplementary Information for details). Since the enrichment values of activators span 1–2.5 enrichment units, this resolution allows us to distinguish at most 3 quantitatively distinct classes of activator strengths (see Supplementary Information for calculation details). As expected, this uncertainty is reduced by a factor of ~1/√n for n = 2 replicates, with a resolution of ~0.3 enrichment units when the 2 biological replicates of the 2,000-tile library are combined (Extended Data Fig. 4c,d; see Supplementary Information for calculation details). Given the dynamic range of enrichment values, a resolution of 0.3 units allows distinguishing multiple statistically different classes of activator TF based on their ENTRAP-seq enrichment signal. This enrichment uncertainty was largely independent of enrichment values (Extended Data Fig. 4e,f). As shown in Fig. 3c, the within-protein experimental error can be drastically reduced by filtering out tiles that are underrepresented in the library. Such reductions in measurement noise via replication or library coverage cutoffs can improve the assay resolution to ~0.15 enrichment units (Extended Data Fig. 4). Hence, the enrichment ratio behaves like a coarse-grained quantitative metric that can resolve a few different transcriptional strengths. It should be noted that, to a first approximation, enrichment and true TF strength (as measured by a fluorescence reporter assay) follow a logarithmic relationship (Figs. 2f and 3b), meaning that a resolution of 0.3 enrichment units encompasses a wide range of activator TF strengths. As a point of comparison, traditional low-throughput fluorescence reporter assays in N. benthamiana can detect much smaller differences in activation strength of 10–20% between constructs35. Thus, the 2,000 times higher throughput of ENTRAP-seq with respect to established methods comes at the expense of fine quantitative resolution.

Identification of putative ADs in plant viral proteins

Our pipeline for discovering potential viral ADs combining machine learning predictions with high-throughput measurements yielded 1,105 tiles that passed the activator enrichment cutoff, confirming that PADDLE can predict acidic ADs in plants19 (Fig. 4a, Extended Data Fig. 5 and Supplementary Figs. 46). These tiles are found throughout the phylogenetic tree of plant viruses (Fig. 4a). Two large families of plant single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA) viruses (Geminiviridae and Caulimoviridae, respectively) as well as multiple positive single-stranded RNA (+ssRNA) families (especially Potyviridae and Betaflexiviridae) were particularly rich in acidic ADs (Fig. 4a). We did not find any ADs in only 4 relatively small families: Amalgaviridae, Aspiviridae, Mitoviridae and Nanoviridae (−ssRNA, −ssRNA, +ssRNA and ssDNA, respectively). This contrasts with recent global surveys of human viruses where ADs were found almost exclusively in a few families of dsDNA viruses22,36. Although our screen did not encompass all viral genomes, it identified approximately twice as many tiles as the more comprehensive studies of human viral ADs, which reported ~600 ADs. This highlights the utility of using a predictive model such as PADDLE to bias the AD search when it is not technically feasible to explore the whole sequence space. In general, our search algorithm based on PADDLE performed remarkably well in terms of predicting ADs; around 50% of the tested tiles qualify as activators based on our calibration thresholds, although there was substantial variation in predictive accuracy across taxa (Fig. 4a, Extended Data Fig. 5 and Supplementary Figs. 46).

Fig. 4: Conserved and potent transcriptional ADs are found in diverse proteins across plant virus taxa.
figure 4

a, Phylogenetic tree of plant viruses showing the number of tiles selected from each viral genome (gray), the maximum PADDLE predicted z-score (light green), whether a tile from this viral genome was selected for characterization (dark green), the number of tiles that were experimentally measured (purple) and the proportion of measured tiles that were activators (orange). dsRNA, double-stranded RNA. b, Gene ontology (GO) enrichment analysis of tiles measured as activators across all libraries. c, Detailed view of tiles from the proteins most enriched in activator tiles in b. Top: AD prediction (PADDLE z-score) of each tile along the length of all proteins belonging to each given class. The amino-acid position of each tile was normalized by dividing by the length of the protein it originates from. The horizontal dashed line shows the z-score cutoff used to select tiles for testing. Blue dots show tiles that were experimentally tested but did not pass our activator enrichment cutoff. Orange dots show tiles that were tested and had an enrichment higher than the cutoff. Bottom: average ± 1 s.d. protein disorder as predicted by Metapredict. d, Selected strong activator viral tiles tested individually in N. benthamiana using a Gal4DBD recruitment fluorescence reporter assay. Shown is the RFP fluorescence normalized to the Gal4DBD alone (Methods section ‘Fluorescence plate reader reporter assays’). e, A set of tiles tested in d were tested by recruiting them to a synthetic reporter driving GFP using dCas9 fusions. Shown is the normalized leaf GFP fluorescence. sgRNA, single-guide RNA. In d and e, n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf), brackets indicate two-sided Welch’s t-test comparisons against VP16 with a Bonferroni multiple comparison correction, P < 0.05, **P < 0.01, ***P < 0.001, NS, not significant. Box plots show the median ± IQR, whiskers correspond to minimum and maximum values, excluding outliers.

To better understand the origin of these potential regulators, we performed GO term enrichment analysis. We found activator tiles in proteins from a wide range of functional annotations, although we note that due to the biased nature of our search not all functional classes are represented equally in our libraries (Fig. 4b and Extended Data Fig. 6). The presence of ADs in structural and catalytic proteins may reflect the multifunctional nature of viral proteins and mirrors the findings of a recent study that tiled human viruses, which found that 90% of putative viral ADs are found in nonregulatory proteins22. Four kinds of plant viral proteins showed a notable enrichment of activator tiles: geminiviral AL2-like proteins, RNA-dependent RNA polymerases (RdRp), helicases and helical capsid proteins (Fig. 4b). The AL2 protein, also known as TrAP (transcriptional activator protein), is involved in transactivating viral promoters in Geminiviridae and has been shown to regulate genomic promoters by physically interacting with host TFs37,38. AL2 lacks a DBD but contains a C-terminal acidic region that can transactivate a synthetic reporter when fused to Gal4DBD39. Thus, the recovery of strong ADs in AL2-like proteins using ENTRAP-seq validates our framework for discovering new viral-encoded transcriptional regulators. Our high-resolution tiling of these proteins allowed us to locate the position of their putative ADs (Fig. 4c). Consistent with the C terminus of AL2 being important for transactivation, we found that ADs are consistently located in three positions within a highly disordered region in the C terminus of diverse AL2 proteins (Fig. 4c and Supplementary Fig. 7). The positions of RdRp activator tiles are also relatively conserved but unlike AL2 these putative ADs are clustered in multiple distinct regions throughout the protein, most of which are predicted to fold into stable structures (Fig. 4c and Supplementary Fig. 7). Similar to RdRp, putative activator tiles in viral helicases and helical viral capsid proteins consistently fall in a few structured regions (Fig. 4c and Supplementary Figs. 7 and 8). In terms of prediction accuracy, most predicted ADs in AL2 and helical capsid proteins turned out to be true activators while the opposite was true for RdRp and helicases (Fig. 4c and Supplementary Fig. 7). These data indicate that a dedicated regulatory protein such as AL2 might activate transcription in a similar manner to canonical eukaryotic TFs with disordered ADs, while multifunctional viral proteins may repurpose stably folded and conserved catalytic or structural domains to accomplish this task. Alternatively, the peptides from these nonregulatory proteins may not actually function as ADs in their native context but simply have a sequence composition that led us to serendipitously identify them as activators in our screening setup. This large dataset should serve as a comprehensive starting point for future testing of specific hypotheses regarding the role of these putative ADs in their native protein context.

The discovery of these putative viral regulators prompted us to consider their potential as new plant synthetic biology tools. To find potent and compact activators that could serve as alternatives to existing field standards, we individually cloned 20 tiles that were among the most enriched in at least 1 library. We then tested them individually using the same Gal4-UAS fluorescence reporter assay described above, side by side with two potent popular ADs used in the field, VP16 and VPR40. All tested tiles activated the reporter at least 20-fold over the DBD alone (Extended Data Fig. 7 and Supplementary Table 3). We repeated this experiment for the strongest seven candidates and verified that four performed as well or better than VP16 and VPR (Fig. 4d), confirming that tile enrichment correlates well with AD strength. To determine whether these ADs perform well in different genetic contexts, we fused them to dCas9 and tested their activity when recruited to a synthetic promoter driving GFP as a reporter41 (Fig. 4e). dCas9-VPR outperformed all other fusions, but one tile was stronger than VP16 when fused to dCas9 (Fig. 4e). Since VPR is considerably larger than our viral tiles (553 aa versus 53 aa), this compact viral AD may be advantageous for vectors with a limited size capacity such as viral vectors42,43.

To characterize potential new repressor domains of viral origin, we followed a similar strategy, selecting 20 tiles that were highly depleted from the pull-down across libraries (Extended Data Fig. 7). As field standard controls, we included four of the most potent previously characterized synthetic ethylene-responsive element binding factor-associated amphiphilic repression (EAR) repression motifs44. Only 3 out of 20 of these putative viral repressors had significantly lower reporter expression than Gal4DBD but none surpassed synthetic EAR motifs in potency. The remaining 17 out of 20 tiles were statistically indistinguishable from the Gal4DBD control (Extended Data Fig. 7). Thus, the dynamic range of ENTRAP-seq spans from transcriptionally neutral tiles to the most potent known activators but it is not large enough to resolve transcriptional repressors. To specifically detect transcriptional repression, it may be necessary to use a reporter with a higher basal expression level in combination with changes to the pull-down protocol itself20.

In addition to minimizing construct size, these viral ADs offer a diverse set of genetic parts to reduce unintended consequences in synthetic programs, including cross-interactions with modules sharing identical parts and gene silencing arising from context-dependent effects.

Partial conservation of AD function between distantly related eukaryotes

Recently, plant protein sequences have been used in MPRAs in yeast to identify peptides that are sufficient for transcriptional activation in this model system15,19. Deriving insight about plant ADs from these experiments in a distantly related host depends on the degree to which transactivation mechanisms are conserved between these two systems. It is well documented that there is conservation of transcriptional activation mechanisms among eukaryotes, including between yeast and plants10,45. Decades of insights about plant TFs derived from yeast-based assays and our own results showing that a model trained in yeast successfully predicts plant viral ADs support this idea19. On the other hand, plants have highly derived and expanded TF families46,47,48 and plant transcriptional regulation displays many regulatory features not shared by other eukaryotes19,24,49. Not surprisingly, yeast-based screens can miss known ADs in well-studied plant TFs15. A survey of tens of ADs found that transactivation potency in yeast does not predict strength in N. benthamiana19,24 while an earlier study showed that the quantitative activity of mutant variants of a maize TF was largely consistent between yeast and maize50. Thus, it is not clear the degree to which yeast and plants have their own taxa-specific activation mechanisms and whether the quantitative activation strength of ADs is consistent between these two models19,24. To systematically study these questions, we took advantage of an existing dataset where thousands of protein tiles from Arabidopsis TFs were tested in yeast15. We leveraged ENTRAP-seq to test a subset of these tiles in N. benthamiana to compare their activity across model systems. We randomly selected 250 tiles while keeping the same distribution of yeast transcriptional strengths (referred to as PADI scores15) from the original library. We then cloned this library in the same vector used for our viral libraries and performed an ENTRAP-seq experiment.

Comparing the rank ordering of activity between yeast and N. benthamiana revealed a very limited correlation (Spearman  = 0.18, P = 0.0075) (Fig. 5a). Indeed, numerous strong yeast activators were depleted from the ENTRAP-seq pull-down and multiple enriched tiles in N. benthamiana were not found to be activators in yeast (Fig. 5a). Because yeast-based MPRAs are heavily biased toward identifying acidic ADs, we hypothesized that tiles not found to be activators in yeast might belong to other classes of ADs. To classify tiles as likely plant activators we used an enrichment threshold of 1.2: the minimum enrichment value for validated activators from the viral libraries. Among the 40 tiles that had an ENTRAP-seq enrichment greater than 1.2 and a PADI score lower than 1 (labeled ‘quadrant 4’ in Fig. 5a) we found two glutamine (Q)-rich sequences that were not classified as activators in yeast. The transactivation role of Q-rich stretches is conserved in eukaryotes51. The first of these polyQ ADs is in the middle region (MR) of the TF gene AUXIN RESPONSE FACTOR 19 (ARF19, AT1G19220, Fig. 5b). The middle region is known to be important for transactivation in activator auxin response factors (ARFs)52 but the yeast assay mapped the ADs of ARF19 and other activator ARFs to regions flanking the middle region15. The second polyQ stretch belongs in the C terminus of the TF gene AGAMOUS-LIKE 65 (AGL65, AT1G18750), a region that is thought to contain the AD in this family of TFs53 (Fig. 5c).

Fig. 5: Systematic comparison of tiles from Arabidopsis TFs expressed in yeast and N. benthamiana.
figure 5

a, Comparison of tile transcriptional activity between yeast and N. benthamiana. A set of peptide tiles from Arabidopsis TFs previously measured in yeast was tested in N. benthamiana. Scatter plot shows the activity of each tile in yeast (PADI score, y-axis) and N. benthamiana (tile enrichment, x-axis). The horizontal dashed line indicates the PADI threshold value used to classify tiles as activators in yeast. The vertical dashed line shows the enrichment value below which we never detected validated activator tiles in Fig. 3b. b, Q enrichment along the ARF19 TF protein shown as percentage of Q residues in a moving window of 20 aa. The green bar indicates the position of the ARF AD known as middle region (MR). Orange bars show the position of ADs according to the yeast assay. The blue bar shows the position of the AD in plants according to ENTRAP-seq. c, Same as b for the AGL65 TF. No ADs were identified for this TF in yeast. d, The degree of structural disorder of each tile (y-axis) is plotted against its ENTRAP-seq activity in N. benthamiana (x-axis). The histogram on the right includes tiles likely to be activators in both yeast and N. benthamiana as well as N. benthamiana only (quadrants 2 and 4 in a). e, Predicted disorder in the NAC001 TF. The green bars indicate the NAC dimerization domain and the transactivation domain. The blue bar shows the AD identified by ENTRAP-seq and the orange bar shows the position of the ADs active in yeast. f, Validation of the tiles highlighted in blue in b, c and e using a fluorescence reporter assay. g, A random selection of tiles that were activators in yeast but de-enriched in tobacco (‘quadrant 1’ in a) tested in a fluorescence reporter assay against the Gal4DBD alone as a control. In f and g, n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf), *P < 0.05, **P < 0.01, ***P < 0.001, NS, not significant in a two-sided Welch’s t-test against Gal4DBD with a Bonferroni multiple comparisons correction. Box plots show the median ± IQR, whiskers correspond to minimum and maximum values excluding outliers.

An alternative hypothesis for why ADs can be active in plants but not in yeast is that these tiles are not ‘bona fide’ ADs, namely, peptides in intrinsically disordered regions that are sufficient to directly recruit the core transcriptional machinery to the promoter. Instead, these tiles may code for folded protein domains that bind plant-specific TFs present in N. benthamiana that in turn contain canonical ADs that activate transcription in yeast. To explore this possibility, we calculated the predicted disorder of tiles in the library and compared this prediction with their ENTRAP-seq enrichment. We found these two metrics to be completely uncorrelated (Spearman  = 0.13, P = 0.054) and tiles that are active in plants (enrichment >1.2) are equally likely to be structured as to be disordered (Fig. 5d). Thus, activator tiles with a plant-specific bias do not necessarily form folded domains. However, among highly structured tiles we found cases that do fit our initial hypothesis. For example, a tile in the TF gene NAC DOMAIN CONTAINING-PROTEIN 1 (NAC001, AT1G01010) containing part of a structured domain involved in homo-dimerization and hetero-dimerization within NAC family TFs was found to be highly active in plants but not in yeast54 (Fig. 5e). The NAC TF family is plant-specific and contains acidic ADs in their C-terminal half that are activators in yeast55 (Fig. 5e). Thus, in N. benthamiana, this NAC001 tile might simply recruit endogenous NAC activators that, in turn, contain ADs that are active in both N. benthamiana and yeast. In yeast, however, this NAC001 tile may be incapable of activating the reporter because this organism lacks interacting NAC partners.

We confirmed that the polyQ ADs of ARF19 and AGL65 as well as the NAC001 AD activated the reporter in our fluorescence plate reader assay (Fig. 5f and Supplementary Table 3). In this comparative ENTRAP-seq experiment, we found 14 tiles that had a PADI score >1 and an enrichment value <1.2, corresponding to likely yeast-only activators (labeled ‘quadrant 1’ in Fig. 5a). To test whether these plant-derived sequences correspond to ADs that activate transcription in yeast but not in plants, we individually measured 12 out of 14 of these tiles. Most (9 out of 12) of these tiles did not activate transcription in N. benthamiana in a fluorescence reporter assay, in fact, 5 out of 12 were repressors, consistent with their depletion from the ENTRAP-seq pull-down (Fig. 5g and Supplementary Table 3). These results are consistent with the notion that plant-specific as well as yeast-specific AD tiles are not uncommon. The stark differences in the rank order of activity between the two systems for most tiles suggest that there are fundamental differences in transcriptional activation mechanisms between these two models, although other explanations are conceivable as well. For example, the reporter in the yeast assay is chromosomally integrated while the ENTRAP-seq reporter is transiently expressed. To better understand the degree of conservation of ADs between these two model eukaryotes, future studies will be necessary to leverage ENTRAP-seq to measure all tiles across all Arabidopsis TFs in an unbiased and systematic manner. Finally, these findings show that it is wise to exercise caution when interpreting tile MPRA data of plant proteins tested in yeast as both false positives and false negatives can arise when using one model system to represent the biology of another.

Rationally tuning the activity of a plant TF

One of the implicit assumptions of AD recruitment assays is that TFs are largely modular56. If this is the case, it may be possible to study and engineer plant TFs in high throughput by isolating, mutating and screening their ADs for altered functionality. Because ENTRAP-seq can reliably detect coarse-grained quantitative differences in enrichment (Extended Data Fig. 4 and Supplementary Information), and enrichment is highly correlated with activation strength (Figs. 2f and 3b), we hypothesized that the method could be used to screen for mutant AD variants displaying a continuous spectrum of activity. To test this idea, we leveraged ENTRAP-seq to characterize a mutant library of the Arabidopsis TF gene CONSTANS (CO, AT5G15840) with the goal of creating an allelic series containing weaker and stronger activators than wild type (WT) as well as alleles that switch CO from activation to repression of target genes. Natural variation in CO and its homologs is associated with flowering time, a key agronomic trait57,58. Although Arabidopsis CO appears to act exclusively as a transcriptional activator, related CO-like genes in this and other species can act as repressors, suggesting that it is feasible to switch the CO regulatory role via mutations59,60,61. In agreement with the role of CO in its native context and previous reports, transient expression in N. benthamiana of a fusion between CO and a BFP (BFP-CO) strongly activates expression of the FLOWERING LOCUS T (FT, AT1G65480) promoter (1.9 kb upstream of the ATG, hereafter FTpro) as measured by a GFP reporter assay62 (Fig. 6a and Supplementary Table 3).

Fig. 6: Large-scale screening of mutant ADs to create an allelic series of the CO TF.
figure 6

a, Transactivation assay for CO function in N. benthamiana. A strain carrying GFP driven by the Arabidopsis FTpro was infiltrated with a strain carrying pUBQ::BFP:CO or an empty vector (EV). pUBQ::mCherry was used as a normalization. Shown is the P value of a two-sided Welch’s t-test. b, Location of the CO AD. Solid lines show the probability of residues being part of an acidic AD according to the PADDLE and TADA models. The shaded area shows the probability of being a disordered region according to Metapredict. Red bar on top shows the empirically determined AD. The dashed lines indicate the 69-aa region used to build a tile library. c, Screening of the CO AD tile library using nuclei enrichment. Mutant tiles are fused to the Gal4DBD and used to drive the expression of the magnetic sorting reporter. d, Histogram showing the distribution of tile enrichment values. The dashed line and the marker show the mean enrichment of six WT tiles. Error bar corresponds to s.d. e, Heatmap showing the enrichment values of tiles carrying single amino-acid substitutions. f, The single residue substitutions shown in e were classified in three categories, weaker than WT (blue), similar to WT (yellow) and stronger than WT (red). The stacked bars show the fraction of mutations in e belonging to each activity category. g, Transactivation assay to test the effect of mutations in the CO AD in the context of the full-length protein. Mutations identified in cf as being weaker (loss-of-function) or stronger (gain-of-function) than the WT AD were introduced into the CO coding sequence and used to drive expression of the FTpro GFP reporter. h, FT promoter activity in the presence of various CO mutant alleles. Predicted activity is based on the enrichment of the corresponding AD tile with respect to the WT tile. In a and h, n = 32 biological replicates (4 plants, 2 leaves per plant, 4 leaf punches per leaf), box plots show the median ± IQR, bars correspond to minimum and maximum values excluding outliers. ***P < 0.001, **P < 0.01, *P < 0.05, NS, not significantly in a two-sided Welch’s t-test comparison against WT with a Bonferroni multiple comparison correction.

CO contains a poorly conserved, highly disordered middle region that is sufficient for transactivation in a Gal4DBD recruitment assay62 (Fig. 6b). Although multiple CO alleles have been identified in random mutagenesis screens, all lie outside this middle region and are loss of function63 (Fig. 6b). To better define the position of the CO AD within this middle region, we leveraged PADDLE as well as TADA, another model trained on datasets from yeast-based screens that can predict acidic ADs13,15. Both models have a high degree of agreement and point to a region around aa position 175 as the putative CO AD (Fig. 6b). We therefore selected a 69-aa window in this region as the putative AD (Fig. 6b). To explore the sequence space of the CO AD, we generated a library of 350 tiles carrying 1–4 aa substitutions and including the WT AD as a control. The substitutions were rationally designed to either increase or decrease AD strength based on features important for acidic AD function in yeast and animal cells, namely (1) depletion of positive residues (R and K), (2) enrichment of negative residues (D and E), (3) enrichment of hydrophobic residues (particularly aromatics and leucine), (4) depletion of continuous stretches of hydrophobic residues and (5) lack of stable secondary structure6,13,64,65. Following the same rationale, we also mutated every residue to K, P, W and D. As internal controls we included five WT tiles. Finally, we introduced ten random synonymous mutations in each tile in the library to better distinguish between highly similar tiles during sequencing. Following the same strategy used for our previous experiments, we cloned these tiles as Gal4DBD fusions in our magnetic sorting reporter and performed an ENTRAP-seq assay (Fig. 6c).

The enrichment of the CO AD tiles spanned a range of values comparable to other libraries and the enrichment of WT tiles was highly reproducible (1.12 ± 0.02). Perhaps not surprisingly, most tiles were less enriched than WT, corresponding to potential loss-of-function mutants (Fig. 6d). We found a moderate agreement between tile activity as measured by ENTRAP-seq and the predictions from PADDLE and TADA (Spearman  = 0.38 and 0.4, respectively) although the two models largely agreed with one another (Pearson r = 0.89) (Supplementary Fig. 9). This suggests that models trained in plants could be exploited in the future to guide the design of mutant plant AD screens. Comparing the activity of each tile to its biochemical properties showed that, as expected for an acidic AD, positive charge and the number of F or W residues have a significant correlation with activity (Spearman  = −0.38 and 0.24, respectively, Extended Data Fig. 8). Throughout the AD, mutation of residues to K tended to decrease the activity and most of the strong loss-of-function AD alleles are K substitutions (Fig. 6e,f). Most gain-of-function single mutants correspond to W or D substitutions (Fig. 6e,f). Mutating residues to P, which severely disrupts protein secondary structure, rarely had strong effects (Fig. 6e,f). These results are consistent with the CO AD behaving like a classic intrinsically disordered acidic AD and highlights the power of combining predictive models with semirational mutagenesis guided by known AD compositional biases.

To test the effect of these mutations in the context of the full-length CO protein, we chose ten putative loss-of-function and gain-of-function alleles among the tiles at either end of the enrichment value distribution. We then introduced the coding mutations from each of these tiles into the full-length CO expression construct and tested their effect individually using the FTpro fluorescence reporter previously established (Fig. 6g and Supplementary Table 3). As controls we also cloned three previously isolated semidominant loss-of-function alleles corresponding to point mutations outside the AD identified in random mutagenesis screens. Out of ten predicted loss-of-function CO mutants, seven were significantly weaker activators than WT and two of these were as weak as classic alleles. The M186P mutant behaved as a gain-of-function allele, in contrast to its activity as a Gal4DBD fusion tile (Fig. 6h). The ethyl methanesulfonate (EMS) mutant controls as well as some of our isolated alleles switched CO activity from an activator into a repressor, consistent with the EMS mutants being semidominant in Arabidopsis63. Out of the ten predicted gain-of-function mutant alleles, five were stronger activators than WT (Fig. 6h). Among the rest, four retained activator function but were not statistically stronger than WT and one putative gain-of-function allele (Y194W) was a slightly weaker activator than WT. We conducted an independent whole-experiment replicate of this figure and obtained similar results (Pearson r = 0.92, Extended Data Fig. 9). Collectively, these data demonstrate that tile activity in the context of an ENTRAP-seq assay can be predictive of tile function within the original TF protein environment. This allowed us to rapidly engineer a new allelic series of a plant TF, including alleles that switch its function from activator to repressor.

Leave a Comment