Selecting the knockdown experiments
We selected ENCODE knockdown RNA-seq experiments for proteins annotated as SFs24 that had either a good affinity model derived from RBNS58 using RBPamp26 or a PWM derived from RNAcompete27 obtained from the CISBP-RNA database28. We use the rMATS differential splicing18 results for the hg19 human reference genome in our analyses. The mouse Rbfox1/2/3 triple-knockout data51 were processed with rMATS 4.1.1 using default settings and the GRCm39 assembly.
The structure of the KATMAP regression model
The goal of KATMAP is to infer an SF’s regulatory activity and targets from knockdown data by explaining the resulting splicing changes in terms of changes in SF binding. Accomplishing this requires several inputs. Firstly, we need the results of a differential splicing analysis, which assigns labels Y to exons indicating whether their inclusion increased, decreased or was unaffected by knockdown. These are the observations KATMAP seeks to predict. The second input is a model of the SF’s sequence specificity. This assigns exons in the differential splicing analysis a set of scores, S, which evaluate all potential binding sites within predefined windows around their SSs. These scores, S, are what KATMAP will use to explain the splicing changes, Y, in terms of SF binding and ultimately to infer regulatory activity. Lastly, additional information about each exon can be included using a set of optional linear predictors, X. In our analyses, we use these linear features to control for expression and pre-knockdown exon inclusion levels.
Defining the splicing changes
Any differential splicing analysis that labels each exon as unchanged, upregulated or downregulated may be used to define the observations, Y. We use one-hot encoding to define Yi as (1, 0, 0), (0, 1, 0) or (0, 0, 1) if exon i is labeled downregulated, unchanged or upregulated, respectively.
In our analyses, we defined splicing changes by further processing the rMATS outputs with ashr64 to better account for the uncertainty in rMATS’s estimates. This step had two motivations. First, ashr accomplishes something like a multiple-test correction on parameter estimates, while accounting for the possibility that more exons are upregulated than downregulated (or vice versa). Second, while KATMAP does not consider the magnitude of splicing changes, some of our downstream analyses do. The shrinkage estimates provided by ashr control for measurement error because of low read counts and yield a cleaner view of splicing differences than the raw rMATS Δψ summaries.
Briefly, we used the read counts and rMATS P values to reconstruct the effect size on a log odds scale, which we denote Δϕ, and its associated uncertainty (mathematical details in Supplementary Information). We then passed these values to ashr to obtain shrinkage estimates of Δϕ and the associated s.d. We constructed 94% CIs on the basis of these, calling exons differentially spliced if the 94% CI does not overlap zero and nonsignificant otherwise.
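As a minimal sketch of this significance call (variable names are illustrative; the reconstruction of Δϕ and its standard error from the rMATS outputs is detailed in the Supplementary Information):

```python
import numpy as np
from scipy.stats import norm

def call_significance(dphi_shrunk, dphi_sd, level=0.94):
    """Label exons using CIs built from ashr-style shrinkage estimates.

    dphi_shrunk : posterior mean of the effect size on the log odds scale
    dphi_sd     : posterior s.d. of the effect size
    Returns -1 (downregulated), 0 (unchanged) or +1 (upregulated) per exon.
    """
    z = norm.ppf(0.5 + level / 2)          # ~1.88 for a 94% interval
    lo, hi = dphi_shrunk - z * dphi_sd, dphi_shrunk + z * dphi_sd
    labels = np.zeros_like(dphi_shrunk, dtype=int)
    labels[lo > 0] = 1                      # CI entirely above zero
    labels[hi < 0] = -1                     # CI entirely below zero
    return labels
```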
Defining the training set
We primarily use the rMATS skipped exon table in our analyses. If a set of skipped exons partially overlapped or shared flanking exons, we retained only the most significant to avoid double counting. We include all remaining significant exons in our training sets. For computational efficiency, we downsample the nonsignificant exons, retaining 2,000 randomly chosen exons.
Scoring motifs with a binding model
To explain splicing in terms of binding, KATMAP requires some model of the perturbed SF’s binding motif. The binding model may be a traditional PWM obtained from RNAcompete, one or more PSAMs (for example, from RBPamp) or more general affinity models. The key requirement is that the binding model orders potential binding sites from weakest to strongest. In our analyses, we used PSAMs derived from RBNS25 using RBPamp26 when available and PWMs derived from RNAcompete27 obtained from the CISBP-RNA database28. We excluded RBNS models where the PSAMs showed strong complementarity to the sequencing adaptors, which likely reflects a technical artifact rather than true sequence specificity.
To score a k-length motif x with a single 4 × k PWM or PSAM, W, we sum the scores (in log scale) for each nucleotide in the motif
$$S(x,W\;)=\mathop{\sum }\limits_{i=1}^{k}\sum _{{\rm{nt}}\in \{A,C,G,U\}}{W}_{{\rm{nt}},i}[{x}_{i}={\rm{nt}}],$$
where the Iverson bracket \([{x}_{i}={\rm{nt}}]\) equals 1 if true and 0 if false.
If the motif model comprises multiple matrices W1, …, Wn, each with associated weights w1, …, wn, then we compute the combined score as
$$S(x)=\log \left(\mathop{\sum }\limits_{i=1}^{n}{w}_{i}\exp \left[S(x,{W}_{i})\right]\right).$$
In practice, we always define the scores such that the optimal motif is assigned a score of 0 and all other motifs receive negative scores. This is accomplished by subtracting the score of the optimal motif from all raw scores.
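The following sketch illustrates this scoring scheme; the function names are illustrative, not part of any released interface:

```python
import numpy as np
from scipy.special import logsumexp

def score_motif(x, W, alphabet="ACGU"):
    """Score a k-length motif x with a single 4 x k PWM/PSAM W (log scale)."""
    return sum(W[alphabet.index(nt), pos] for pos, nt in enumerate(x))

def score_mixture(x, matrices, weights):
    """Combine several PSAMs: log of the weighted sum of exponentiated scores."""
    scores = np.array([score_motif(x, W) for W in matrices])
    return logsumexp(scores, b=np.asarray(weights))

# Normalize so the optimal motif scores 0, for example, for a single matrix:
# S_normalized = score_motif(x, W) - W.max(axis=0).sum()
```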
Even for affinity models, we do not assume that these initial sequence-based scores are fully accurate representations of affinity. Rather, we treat them as related to the log affinity but potentially differing by some scaling factor. To account for this, we include an affinity scale parameter, λ, which rescales the binding scores.
These rescaled scores are used as log affinities from which occupancy and changes in occupancy are computed. If the scores are truly linearly related to log affinity, this can be thought of as calibrating the scores against the observed splicing changes. If the scores are related to log affinity in a monotonic but not linear manner, this rescaling can be thought of as widening or narrowing the interval of scores to which appreciable changes in occupancy are assigned.
Scoring the sequences around exons
KATMAP scores sequences in windows around the SSs of each exon. In our analyses, we scored 200 potential binding sites on the intronic and 60 on the exonic side of the SSs of each cassette exon. If desired, larger or smaller windows can be used or the regions adjacent to the flanking exons’ SSs can be included.
Our model calculates a log affinity for each of the potential binding sites analyzed per exon. For computational tractability, rather than model binding at nucleotide resolution at each exon, we aggregate neighboring binding sites into nonoverlapping regions of ten potential binding sites each. We index these regions, 1, …, J, where, in most of our analyses, \(J=\frac{1}{10}(200+60)\times 2=52\). More generally, in an analysis with a different region size, different intronic or exonic window size or number of SSs (for example, including the flanking exons),
$$\begin{array}{l}J=\displaystyle\frac{1}{{\rm{region}}\;{\rm{size}}}({\rm{intronic}}\;{\rm{window}}\;{\rm{size}}+{\rm{exonic}}\;{\rm{window}}\;{\rm{size}})\\\quad\;\times {\rm{number}}\;{\rm{of}}\;{\rm{splice}}\;{\rm{sites}}.\end{array}$$
We then assign each subregion j of exon i a log affinity. We denote the motif scores of the ten binding sites ending in subregion j as Sij and compute the log affinity of the region by first scaling the log affinity of each binding site by λ and taking the log of the sum of their exponents:
$${a}_{ij}={\rm{log}}\;{\rm{sum}}\;{\rm{exp}}(\lambda {{{S}}}_{ij}).$$
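A minimal sketch of this aggregation step, assuming the per-site scores for one exon's windows are stored in positional order:

```python
import numpy as np
from scipy.special import logsumexp

def region_log_affinities(site_scores, lam, region_size=10):
    """Aggregate per-site motif scores into per-region log affinities.

    site_scores : (n_sites,) motif scores for one exon, ordered by position
    lam         : affinity scale parameter lambda
    Returns one log affinity per nonoverlapping region of `region_size` sites.
    """
    S = np.asarray(site_scores).reshape(-1, region_size)   # (J, 10) regions
    return logsumexp(lam * S, axis=1)                      # a_ij = logsumexp(lambda * S_ij)
```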
Defining the binding function
For a given free protein concentration F, the occupancy of a site with given log affinity, a, is described by the Langmuir equation:
$$\Theta (a;F)=\frac{1}{1+\frac{1}{F{e}^{a}}}.$$
The change in occupancy is, thus, determined by the free protein concentration before and after knockdown or overexpression (Extended Data Fig. 8a). If the free protein concentration changes by some factor k, the change in occupancy can be computed (Extended Data Fig. 8b) as:
$$\delta (a;F,k)=\Theta (a;kF)-\Theta (a;F).$$
However, these parameters, F and k, while biophysically meaningful, are not immediately interpretable in terms of which affinities experience large changes in binding. We, therefore, reparameterize the change in binding function in terms of its shape. First, we include the most relevant affinity, m, indicating which affinity experiences the greatest change in binding. Second, we define the magnitude of the change in binding at sites with affinity m as d (Extended Data Fig. 8b,c). That is, we reparameterize the function such that the maximum change in occupancy, d, occurs at log affinity m.
The corresponding biophysical parameters can be computed by solving
$$\begin{array}{rr}{\delta }^{{\prime} }(m;F,k)=0&{\rm{fix}}\,{\rm{the}}\,{\rm{optimum}}\,{\rm{at}}\,m\,{\rm{by}}\,{\rm{setting}}\,{\rm{the}}\,{\rm{derivative}}\,{\rm{to}}\,{\rm{zero}}\\ \delta (m;F,k)=d&{\rm{assert}}\,{\rm{that}}\,{\rm{the}}\,{\rm{change}}\,{\rm{in}}\,{\rm{occupancy}}\,{\rm{at}}\,m\,{\rm{is}}\,d\\ \end{array}$$
which yields
$$\begin{array}{rcl}k(d)&=&\frac{{(d+1)}^{2}}{{(d-1)}^{2}}\\ F(m,d)&=&{\left({e}^{m}\sqrt{k}\right)}^{-1}\end{array}.$$
Thus,
$$\tilde{\delta }(a;m,d)=\delta \left(a;F(m,d),k(d)\right),$$
where \(\tilde{\delta }\) denotes a reparameterization of δ.
Our model’s predictions are constructed by multiplying these binding changes by other parameters we term activity coefficients, α. Because d changes the height of the binding function but not its relative shape (unless d is near 1 or −1, \(\frac{\widetilde{\delta}(a;m,{d}_{1})}{\widetilde{\delta}(a;m,{d}_{2})}\approx \frac{{d}_{1}}{{d}_{2}}\); Extended Data Fig. 8c), we cannot usually identify both the magnitude of binding changes and the magnitude of regulatory activity; most effects on the model’s predictions of changing d from value d1 to d2 can be entirely reversed by scaling the magnitudes of the activity coefficients by \(\frac{{d}_{1}}{{d}_{2}}\). To resolve this, we normalize the changes in binding by ∣d∣ to compute the relative change in binding (Extended Data Fig. 8d).
$${\tilde{\delta }}_{\rm{rel}}(a;m,d)=\frac{1}{| d| }\tilde{\delta }(a;m,d)$$
Furthermore, because all values of ∣d∣ < 0.95 yield essentially the same relative changes in binding, we fix \(d=-\frac{1}{2}\) for knockdown experiments and \(d=\frac{1}{2}\) for overexpression. If ∣d∣ > 0.95, as might be the case for knockout data, the relative change in binding function flattens out, but this can be approximated with small values of the affinity scale parameter, λ ≪ 1 (Extended Data Fig. 8e).
To recap, we score the sequences around each exon’s SSs using a model of the RBP’s binding motif. We then divide the sequence around each exon into nonoverlapping 10-nt regions, each containing ten potential binding sites. We denote the set of all scores as S, where Sij refers to the scores of the ten potential binding sites in region j of exon i. We then scale these scores with an affinity scale parameter, λ, and treat the resulting values as log affinities. We compute the log affinity of the regions on the basis of the affinities of the ten binding sites contained in each. Finally, we assign each region a relative change in binding, determined by the value of the most relevant affinity, m. We describe this entire procedure of transforming the motif scores to relative changes in binding with the binding function
$$B({{{S}}}_{ij};\lambda ,m,d)={\tilde{\delta }}_{rel}\left({\rm{log}}\;{\rm{sum}}\;{\rm{exp}}(\lambda {{{S}}}_{ij});m,d\right),$$
fixing \(d=-\frac{1}{2}\) for knockdown experiments.
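This pipeline can be sketched as follows; `occupancy`, `rel_binding_change` and `binding_function` are illustrative names for the quantities Θ, \(\tilde{\delta}_{\rm{rel}}\) and B defined above:

```python
import numpy as np
from scipy.special import logsumexp

def occupancy(a, F):
    """Langmuir occupancy of a site with log affinity a at free concentration F."""
    return 1.0 / (1.0 + 1.0 / (F * np.exp(a)))

def rel_binding_change(a, m, d):
    """Relative change in binding, parameterized by the most relevant affinity m
    and the maximum change in occupancy d (d = -1/2 for knockdowns)."""
    k = (d + 1) ** 2 / (d - 1) ** 2          # fold-change in free protein
    F = 1.0 / (np.exp(m) * np.sqrt(k))       # pre-perturbation free concentration
    delta = occupancy(a, k * F) - occupancy(a, F)
    return delta / abs(d)                    # normalize so the peak change is +/-1

def binding_function(S_ij, lam, m, d=-0.5):
    """B(S_ij; lambda, m, d): one region's motif scores -> relative binding change."""
    a = logsumexp(lam * np.asarray(S_ij))    # region log affinity
    return rel_binding_change(a, m, d)
```

At a = m with d = −1/2, `rel_binding_change` returns −1, consistent with the normalization described above.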
Predicting changes in regulation from changes in binding
We describe the position-specific effect of RBP binding with a set of activity coefficients α1, …, αJ. A positive value of αj indicates that binding at region j enhances exon inclusion. A negative value instead describes suppressing activity. Critically, we assume that activity varies smoothly with position, described later when discussing our model’s priors.
To compute the predicted effect of loss of binding near exon i on its splicing outcome, we multiply the change in binding at each region j, B(Sij; λ, m, d), by the corresponding activity coefficient αj. We make the simplifying assumption that activity is additive across all binding sites; hence, the total change in activity because of the changes in binding across regions 1,…,J is
$$\Delta {{{A}}}_{i}={\mathop{\sum }\limits_{j=1}^{J}}B({{{S}}}_{ij};\lambda ,m,d){\alpha }_{j}.$$
Connecting changes in regulation to observed splicing changes
To predict the observed splicing changes, the model must assign each exon i three probabilities, the probability of being downregulated, unchanged or upregulated: \({\hat{p}}_{i}=({p}_{i}^{(-)},{p}_{i}^{(0)},{p}_{i}^{(+)})\). We describe the log odds (\(q=\log \frac{p}{1-p}\)) of these probabilities with an additive model:
$$\begin{array}{rcl}{q}_{i}^{(+)}=\Delta {{{A}}}_{i}+\ldots &=&\mathop{\sum }\limits_{j=1}^{J}B({S}_{ij};\lambda ,m,d){\alpha }_{j}+\ldots \\ {q}_{i}^{(-)}=-\Delta {{{A}}}_{i}+\ldots &=&-\mathop{\sum }\limits_{j=1}^{J}B({S}_{ij};\lambda ,m,d){\alpha }_{j}+\ldots \end{array}$$
That is, if the activity at exon i increases, the predicted log odds of inclusion being upregulated increases and the log odds of being downregulated decreases. In an overexpression experiment, the changes in binding B(Sij; λ, m, d) are positive and upregulation is driven by gains in enhancing activity and downregulation by gains in suppressing activity. In knockdown experiments, the changes in binding are negative, flipping the signs of the change in activity: upregulation is driven by a loss of suppressing activity and downregulation is driven by a loss of enhancing activity.
Incorporating additional information into the predictions as linear features
Our model aims to explain the results of the differential splicing analysis in terms of SF binding, but other information unrelated to binding is likely to be predictive of which exons were called differentially spliced, for biological and/or technical reasons. We incorporate these features by adding a standard linear model into the predictions. The K linear features included in the model are organized as an I-by-K table, X. All features are mean-centered and standardized. Any quadratic terms are created using the standardized features to reduce the dependence structure of the target posterior distribution.
Because not every linear feature will be relevant to each of the three predictions (downregulated, unchanged and upregulated), we divide the linear features X into three (potentially overlapping) subsets, denoted X(−), X(0) and X(+). For example, the read count of an exon influences the statistical power to detect changes in splicing; hence, it should inform the predicted probability of being unchanged, \({p}^{(0)}\), and is, thus, assigned to X(0). The original inclusion level of an exon determines whether upregulation or downregulation is detectable because there is little room for inclusion to increase or decrease when the exon was already nearly 100% or 0% included. Thus, the inclusion level before SF perturbation is likely directly informative of \({p}_{i}^{(-)}\) and \({p}_{i}^{(+)}\) and is, therefore, assigned to both X(−) and X(+). These features are incorporated into the predictions as
$$\begin{array}{c}{q}_{i}^{(-)}=-\mathop{\sum }\limits_{j=1}^{J}B\left({S}_{ij};\lambda ,m,d\right){\alpha }_{j}+\mathop{\sum }\limits_{k=1}^{{K}^{(-)}}{\beta }_{k}^{(-)}{X}_{ik}^{(-)}+{\beta }_{0}^{(-)}\\ {q}_{i}^{(0)}=\mathop{\sum }\limits_{k=1}^{{K}^{(0)}}{\beta }_{k}^{(0)}{X}_{ik}^{(0)}\\ {q}_{i}^{(+)}=\mathop{\sum }\limits_{j=1}^{J}B\left({S}_{ij};\lambda ,m,d\right){\alpha }_{j}+\mathop{\sum }\limits_{k=1}^{{K}^{(+)}}{\beta }_{k}^{(+)}{X}_{ik}^{(+)}+{\beta }_{0}^{(+)}\end{array},$$
where \({\beta }_{1\ldots {K}^{(-)}}^{(-)}\), \({\beta }_{1\ldots {K}^{(0)}}^{(0)}\) and \({\beta }_{1\ldots {K}^{(+)}}^{(+)}\) denote the linear coefficients describing the effect of the linear features on the predictions and where \({\beta }_{0}^{(-)}\) and \({\beta }_{0}^{(+)}\) are the intercepts.
Connecting predictions to observed splicing changes
The probabilities for category z are computed as
$${\hat{p}}_{i}^{(z)}=\frac{\exp \left({q}_{i}^{(z)}\right)}{\exp \left({q}_{i}^{(-)}\right)+\exp \left({q}_{i}^{(0)}\right)+\exp \left({q}_{i}^{(+)}\right)}.$$
We connect these predictions to the observed splicing changes with a categorical distribution:
$${Y}_{i} \sim {\rm{Categorical}}\left(p=\left[{\hat{p}}_{i}^{(-)},{\hat{p}}_{i}^{(0)},{\hat{p}}_{i}^{(+)}\right]\right).$$
The probability of an observation under this distribution is simply equal to the predicted probability for that category. For example, if exon i is downregulated, the likelihood is \(P({Y}_{i}| {\hat{p}}_{i})={\hat{p}}_{i}^{(-)}\).
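A sketch of this predictive likelihood, combining the splicing impacts with precomputed linear-feature contributions (collapsing the three feature subsets into a single `q_linear` array is a simplification of the scheme described above):

```python
import numpy as np

def log_odds_to_probs(q):
    """Stabilized softmax over the three categories; q : (n_exons, 3) for (-, 0, +)."""
    e = np.exp(q - q.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(Y, B, alpha, q_linear):
    """Categorical log likelihood of the observed labels Y.

    Y        : (n_exons, 3) one-hot labels (down, unchanged, up)
    B        : (n_exons, J) relative binding changes per region
    alpha    : (J,) activity coefficients
    q_linear : (n_exons, 3) contribution of the linear features and intercepts
    """
    dA = B @ alpha                                        # splicing impact per exon
    q = q_linear + np.stack([-dA, np.zeros_like(dA), dA], axis=1)
    p = log_odds_to_probs(q)
    return np.sum(Y * np.log(p))
```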
Defining the model’s priors
We represent assumptions about our model’s parameters with Bayesian priors. Some of the priors we use reflect what we consider plausible given our understanding of molecular biology. Others reflect statistical considerations that aim to avoid unrealistically strong predictions or complications to statistical inference. For example, given the limitations of experimental data and modeling thereof, we do not believe KATMAP should ever make extremely confident predictions on the order of million-to-one odds. In other cases, we wish to avoid identifiability issues, where many combinations of distinct parameter values would yield essentially the same predictions. In describing the model’s priors, we note for each parameter which consideration is involved.
Priors on the binding parameters
The parameter λ scales the motif scores S to assign binding sites log affinities S × λ. When λ < 1, the scores are compressed, assigning a narrower range of affinities. When λ > 1, the scores are stretched, assigning a wider range of affinities. We choose a prior that asserts both to be equally plausible. Beyond this, the prior we place on λ largely reflects a practical consideration about operations on a log scale. Multiplication on the log scale corresponds to exponentiating the actual affinities:
$${\rm{Affinity}}={e}^{\;\lambda S}={{e}^{S}}^{\lambda }.$$
Even modest values of λ will yield extremely large or small affinities. Thus, we constrain λ to a relatively narrow range of values by using a unit-normal prior:
$${\log }_{2}\lambda \sim \rm{Normal}(0,1),$$
which places about 99% of the prior probability on λ between \(\frac{1}{6}\) and 6 (Extended Data Fig. 9a).
The most relevant affinity parameter m describes which affinity experiences the greatest change in binding. Assuming that the binding site model is accurate, motifs with moderate to high affinity should experience changes in binding; motifs with very low affinity are too weak to be appreciably bound in the first place. Thus, a biologically reasonable prior would favor values of m closer to the affinity of the optimal motif and disfavor very low affinities. Beyond this, it is difficult to precisely express how plausible a given value of m is.
However, m determines how much SF occupancy our model predicts was lost from near each exon. It is easier to assess the biological plausibility of these statements. Certain values of m might imply that the average exon has >20 copies of the SF bound within a few hundred nucleotides. For SFs that bind specific sequence motifs, this is unrealistically high. Thus, rather than explicitly define a prior on m, we instead define our prior on the average change in binding near each exon:
$$\bar{B}=\left\vert \frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{j=1}^{J}B({S}_{ij};\lambda ,m,d)\right\vert.$$
We divide this quantity by the number of regions per exon (J = 52 in our analyses) and place the prior on the log odds of this fraction:
$$\tilde{B}={\frac{\bar{B}+\epsilon }{J+2\epsilon }},$$
with small constants, \(\epsilon =1{0}^{-7}\), added to the numerator and denominator to avoid numerical issues because of computational precision. We then define our prior as
$$\log {\frac{\tilde{B}}{1-\tilde{B}}} \sim {\mathrm{Normal}}(\mu =-3.9,\sigma =0.84),$$
which places 99% of the prior probability on the average change in binding per exon between 0.12 and 7.8 (Extended Data Fig. 9b). To exclude extremely unrealistic values of m, we constrain the model to only consider values between the median log affinity and the maximum affinity per 10-nt region, \(\log (10)\) (or, more generally, \(\log ({\rm{region}}\;{\rm{size}})\)).
Assuming that activity varies smoothly with distance
A fundamental assumption of our model is that the effect of SF binding on exon inclusion depends on where the protein is bound relative to the exon. This is described by activity coefficients α1, …, αJ. We further assume that the activities of nearby regions, say j = 2 and j = 3, are similar but that distant positions may have very different activities. Statistically, we express this with a Gaussian process prior that assumes that the activity coefficients are spatially correlated65. The Gaussian process can be conceptualized as a multivariate normal (\({\mathcal{N}}\)) distribution, where the covariance between variables αj and αk is a decreasing function of how far apart they are ∣j − k∣.
$${\alpha }_{1\ldots J} \sim {\mathcal{N}}\left(0,\Sigma (L,\sigma )\right)$$
We choose the smooth yet flexible Matérn-3/2 covariance function to express these spatial correlations65. Because we further want to allow that the effects of exonic versus intronic binding may be very different, we restrict the spatial correlations to pairs of regions that fall on the same side of a given SS. We express this with a set of labels R1, …, RJ that indicate whether region j lies on the intronic or exonic side of the 3SS or on the exonic or intronic side of the 5SS. If a pair of regions j and k carry different labels, the spatial correlation between αj and αk is set to zero:
$${\Sigma }_{jk}(L,\sigma )=\left\{\begin{array}{ll}{\sigma }^{2}\left(1+\frac{\sqrt{3}| j-k| }{L}\right)\exp \left(-\frac{\sqrt{3}| j-k| }{L}\right)\quad &{\rm{if}}\,{R}_{j}={R}_{k}\\ 0\quad &{\rm{if}}\,{R}_{j}\ne {R}_{k}.\end{array}\right.$$
Assuming this spatial correlation structure requires introducing two additional parameters that must be learned. The length scale, L, determines how far apart a pair of regions j and k must be before they are weakly correlated. The stationary s.d., σ, determines the range of magnitudes the activity coefficients are expected to span. To prevent the inference algorithm from wasting computation by exploring large but practically equivalent regions of parameter space, we constructed priors that avoid both large and small values of these hyperparameters (Extended Data Fig. 9c,d; mathematical details in Supplementary Information). This places 99% of the prior probability for the length scale on the interval 3.2 nt < L < 226 nt (Extended Data Fig. 9c), which ranges from near independence to correlations that span the whole intronic window (Extended Data Fig. 9e). For the stationary s.d., 99% of the prior probability falls in the interval 0.14 < σ < 8.7 (Extended Data Fig. 9d).
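A sketch of this prior covariance (illustrative names; the hyperpriors on L and σ are omitted):

```python
import numpy as np

def matern32_covariance(J, L, sigma, labels):
    """Matern-3/2 prior covariance over the J activity coefficients.

    labels : length-J array marking which side of which SS each region lies on;
             correlations between regions with different labels are zeroed.
    """
    labels = np.asarray(labels)
    j = np.arange(J)
    dist = np.abs(j[:, None] - j[None, :])            # |j - k| in region units
    r = np.sqrt(3.0) * dist / L
    cov = sigma**2 * (1.0 + r) * np.exp(-r)           # Matern-3/2 kernel
    same_side = labels[:, None] == labels[None, :]
    return np.where(same_side, cov, 0.0)
```

The activity coefficients are then drawn as α ~ N(0, Σ(L, σ)), as in the equation above.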
Assumptions about the linear coefficients
Because we mean-center and standardize the linear features, βk indicates how much an increase in linear feature k by one s.d. changes the log odds of the prediction. Because the impact is on a log odds scale, coefficients of magnitude ∣β∣ > 10 would have extreme impacts on the prediction; increasing the linear feature by one s.d. would increase a predicted probability of P = 0.01 to P > 0.99. As the standardized features typically span a range of more than four s.d., we used a weakly informative prior for the linear coefficients and intercept, using a normal density with mean of 0 and s.d. of 5:
$${\beta }_{0\ldots K} \sim {\rm{Normal}}(\mu =0,\sigma =5).$$
The model is not sensitive to this choice of prior, as it spans the range of reasonable effect sizes, both strong and weak.
High-level overview of model inference
The goal of Bayesian inference is to determine which parameters are plausible given the observed data. In practice, this is often a harder task than formulating the statistical model itself. Evaluating the exact plausibility requires applying Bayes’ theorem, which expresses a reasonable proposition: to determine how plausible a given hypothesis h is, one must compare it to every other hypothesis admitted by the model, \(\tilde{h}\in \Theta\):
$$P(h| Y)=\frac{P(Y| h)P(h)}{{\int}_{\tilde{h}\in \Theta }P(Y| \tilde{h})P(\tilde{h})d\tilde{h}}.$$
This is typically impossible to compute exactly because it requires integrating over the entire parameter space. In the case of KATMAP’s regression model, this parameter space has >50 dimensions.
A common solution is Monte Carlo sampling. Rather than evaluate the exact posterior density P(θ, φ∣Y), we instead seek to obtain a sample of hundreds of parameter values that follow the posterior distribution:
$$({\theta }_{1},{\varphi }_{1}),\ldots ,({\theta }_{N},{\varphi }_{N}) \sim P(\theta ,\varphi | Y).$$
This posterior sample provides a representative set of parameters that are plausible given the observed data, from which we can estimate the posterior mean. We can also construct CIs describing the range of plausible parameter values. Because the model predictions for each exon are computed on the basis of these parameters, it is further possible to obtain interval estimates for all of the exon-level predictions.
Bayesian inference with the KATMAP model
When writing the posterior for KATMAP’s model, it is useful to divide the model parameters into two groups. The first group we term hyperparameters φ = (λ, m, L, σ) because, if these four parameters are held fixed, the model reduces to a much simpler generalized linear model (GLM). The second group comprises the regression coefficients θ = (α1, …, αJ, β0, …, βK) of this conditional GLM. Partitioning the parameters in this manner makes clear the hierarchical nature of the model:
$$P(\theta ,\varphi | Y)\propto P(Y| \theta ,\varphi )P(\theta | \varphi )P(\varphi ).$$
INLA provides a means to leverage this structure and divide the problem of posterior inference into two manageable, nested subproblems23. First, an inner step focuses on identifying which regression coefficients are plausible given the data and some specified values of hyperparameters. This approximates the conditional posterior P(θ∣Y, φ) for a specific hyperparameter value, φ. Second, an outer step focuses on determining which hyperparameters are plausible given the data. This approximates the marginal posterior for the hyperparameters P(φ∣Y).
Approximating the conditional posterior over the regression coefficients
The inner step takes advantage of how the model simplifies if the hyperparameters are held fixed to a particular value. The known binding parameters (λ, m) would transform the table of motif scores S into a fixed table of changes in binding B(λ, m). The known spatial correlation parameters would yield a fixed prior on the activity coefficients. The problem reduces to a GLM, albeit a Bayesian formulation with priors.
We use second-order optimization to identify the optimal regression coefficients as the mode of the conditional posterior, \({\theta }_{\varphi }^{* }\), determining the step size with a backtracking line search66. This requires evaluating the gradient ∇ and Hessian matrix H, which we compute using automatic differentiation with the JAX library (http://github.com/google/jax). The curvature of the log posterior around the mode is described by the Hessian matrix \({H}_{{\theta }^{* }}\), which can be used to describe the uncertainty about the coefficients with a covariance matrix, \({\Sigma }_{\varphi }^{* }={H}_{{\theta }_{\varphi }^{* }}^{-1}\). This provides a multivariate normal approximation to the conditional posterior:
$$P(\theta | Y,\varphi )\approx {\mathcal{N}}(\theta ;\mu ={\theta }_{\varphi }^{* },\Sigma ={\Sigma }_{\varphi }^{* }).$$
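A minimal JAX sketch of this inner step, assuming a `neg_log_posterior` closure over fixed hyperparameters (the backtracking line search used in practice is omitted for brevity):

```python
import jax
import jax.numpy as jnp

def laplace_approximation(neg_log_posterior, theta_init, n_steps=50):
    """Gaussian approximation to the conditional posterior P(theta | Y, phi).

    Newton steps locate the mode; the inverse Hessian at the mode supplies
    the covariance of the multivariate normal approximation.
    """
    grad = jax.grad(neg_log_posterior)
    hess = jax.hessian(neg_log_posterior)
    theta = theta_init
    for _ in range(n_steps):
        theta = theta - jnp.linalg.solve(hess(theta), grad(theta))  # Newton step
    Sigma = jnp.linalg.inv(hess(theta))        # covariance from curvature at the mode
    return theta, Sigma
```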
Approximating the marginal posterior over the hyperparameters
The outer step evaluates the plausibility of specific hyperparameters φ by approximating the marginal posterior, P(φ∣Y). The hyperparameters φ relate to the observations only through the regression coefficients θ. Hence, evaluating the plausibility of some specific φ requires considering all possible values the regression coefficients θ could take. Rue et al.23 accomplish this by using the normal approximation to the conditional posterior obtained in the inner step, evaluated at its mode, \({\theta }_{\varphi }^{* }\):
$$\begin{array}{rcl}P(\varphi | Y)\mathop{\propto }\limits^{ \sim }\frac{P(Y,{\theta }_{\varphi }^{* },\varphi )}{{\mathcal{N}}({\theta }_{\varphi }^{* };{\theta }_{\varphi }^{* },{\Sigma }_{\varphi }^{* })}&=&P(Y,{\theta }_{\varphi }^{* },\varphi )\det {\left(2\pi {\Sigma }_{\varphi }^{* }\right)}^{\frac{1}{2}}\\ &=&P(Y,{\theta }_{\varphi }^{* },\varphi ){\int}_{x\in {{\mathbb{R}}}^{d}}\exp \left(-\frac{1}{2}{\left({\theta }_{\varphi }^{* }-x\right)}^{T}{{\Sigma }_{\varphi }^{* }}^{-1}\left({\theta }_{\varphi }^{* }-x\right)\right)dx.\end{array}$$
This approximation to the marginal density multiplies the joint density \(P(Y,{\theta }_{\varphi }^{* },\varphi )\) by the volume of regression coefficients that are plausible given φ.
Monte Carlo sampling from the joint posterior with INLA
We couple adaptive importance sampling with these INLAs to draw samples of plausible parameters from the joint posterior. First, we use adaptive importance sampling and the INLA approximation to draw a representative sample of plausible hyperparameters from their marginal posterior, φ1, …, φN ~ φ∣Y. This explores the hyperparameters in a way that iteratively homes in on the true posterior by fitting mixture model approximations of the marginal posterior with importance-weighted expectation maximization22 (Extended Data Fig. 10a; mathematical details in Supplementary Information). Second, for each sampled hyperparameter, we solve the resulting conditional model by optimization, approximating the conditional posteriors (including their correlation structure) over the regression coefficients with a multivariate Gaussian. We then sample regression coefficients from these conditional posteriors, \({\theta }_{1},\ldots ,{\theta }_{N} \sim \theta | Y,{\varphi }_{1\ldots N}\). Together, these constitute an approximate sample from the joint posterior, (θ1, φ1), …, (θN, φN) ~ θ, φ∣Y. We run this algorithm twice for each knockdown experiment, generating two separate posterior samples. If both runs converge to the same distribution, we consider inference successful and merge the samples.
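Schematically, one round of this outer loop might look like the following, where `log_marginal`, `sample_conditional` and `proposal` are hypothetical callables standing in for the INLA approximation, the conditional Gaussian draw and the current mixture approximation (Pareto smoothing and the importance-weighted expectation-maximization updates are omitted):

```python
import numpy as np

def sample_joint_posterior(log_marginal, sample_conditional, proposal, n_draws=500):
    """Schematic importance-sampling pass over the hyperparameters phi.

    log_marginal(phi)      : INLA approximation to log P(phi | Y) (unnormalized)
    sample_conditional(phi): one draw of theta from the Gaussian P(theta | Y, phi)
    proposal               : current mixture approximation with .sample()/.logpdf()
    """
    phis = proposal.sample(n_draws)
    log_w = np.array([log_marginal(p) - proposal.logpdf(p) for p in phis])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                        # self-normalized weights
    idx = np.random.choice(n_draws, size=n_draws, p=w)  # resample by weight
    return [(sample_conditional(phis[i]), phis[i]) for i in idx]
```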
Evaluating convergence
We use two criteria to assess whether our Monte Carlo samples are reliable representations of the target posterior. First, we use the split \(\hat{r}\)-statistic67 to assess whether our two inference runs converged to the same distribution. Values of \(\hat{r}\) close to 1 indicate highly overlapping distributions; if the marginal distributions of all parameters yield \(\hat{r} < 1.05\), we consider the Monte Carlo sampling runs to have converged. Second, during adaptive importance sampling of the hyperparameters, we stabilize the importance weights with Pareto smoothing. This automatically provides a diagnostic statistic, \(\hat{k}\), indicating whether the resulting importance sample is reliable. We use the cutoff suggested by Vehtari et al.68, requiring \(\hat{k} < 1-\frac{1}{{\log }_{10}{N}_{{\rm{proposals}}}}\) or \(\hat{k} < 0.7\) (whichever is more stringent) for the posterior samples to be considered trustworthy.
Evaluating evidence of splice regulatory activity
To quantify evidence in favor of splice regulatory activity, we compare to a reduced model with the splicing impacts removed. We compare the full and reduced models using Pareto-smoothed importance sampling leave-one-out cross-validation68, a fully Bayesian approach to model comparison that uses the posterior samples from the full and reduced models to estimate whether the full model better predicts held-out data. We bootstrap observations to compute a z score describing our confidence that including splicing activity in the model improves performance.
Determining which activity maps are significant
We consider an activity map significant on the basis of two criteria. First, we require that at least two activity coefficients be significantly nonzero (98% CI does not overlap zero). Second, we require a model comparison z score > 2.
We additionally considered whether the splicing changes are well described by the expected motif. We excluded any activity models that chose binding parameters that assigned an unreasonable number of binding changes per exon (>20). This indicates that the knockdown-affected exons are distinguished by some characteristic sequence composition that is not well described by the perturbed factor’s expected motif. This criterion excluded SRSF9 (HepG2) from our analysis, which yielded a significant map by assigning appreciable occupancy to most sites in the transcriptome.
Clustering activity maps for visualization
When clustering the activity maps from different knockdowns \({\hat{\alpha }}^{(1)},{\hat{\alpha }}^{(2)},\ldots ,{\hat{\alpha }}^{(N)},\) we wanted to account for the uncertainty in the posterior samples. Each inferred activity map is represented by a posterior sample of M draws, with a value \({\hat{\alpha }}_{i,j}^{(f)}\) at each position j for each draw i from the posterior for knockdown f. We scale each draw i by the absolute value of the largest activity coefficient in that draw and compute the posterior mean and s.d. at each position j for the activity map of each knockdown f as
$${\mu }_{f,j}=\frac{1}{M}\mathop{\sum }\limits_{i=1}^{M}\frac{{\hat{\alpha }}_{i,j}^{(f)}}{\max | {\hat{\alpha }}_{i,\cdot}^{(f)}| }$$
and
$${\sigma }_{f,j}=\sqrt{\frac{1}{M}\mathop{\sum }\limits_{i=1}^{M}{\left(\frac{{\hat{\alpha }}_{i,j}^{(f)}}{\max | {\hat{\alpha }}_{i,\cdot}^{(f)}| }-{\mu }_{f,j}\right)}^{2}}.$$
To compute the difference between a pair of activity maps \({\hat{\alpha }}^{(x)}\) and \({\hat{\alpha }}^{(y)}\), we compute the absolute difference between the posterior means ∣μx,j − μy,j∣ and normalize this by the expected s.d. of the difference \(\sqrt{{\sigma }_{x,j}^{2}+{\sigma }_{y,j}^{2}}\), averaging this z score across all positions 1, …, J:
$${\rm{Distance}}_{xy}=\frac{1}{J}\mathop{\sum }\limits_{j=1}^{J}\frac{| {\mu }_{x,j}-{\mu }_{y,j}| }{\sqrt{{\sigma }_{x,j}^{2}+{\sigma }_{y,j}^{2}}}.$$
We perform hierarchical clustering with this distance metric using average linkage and order the dendrogram with optimal leaf sorting.
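A sketch of this clustering step using SciPy, assuming the posterior means and s.d.s have already been computed as above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering
from scipy.spatial.distance import squareform

def cluster_activity_maps(mu, sd):
    """Cluster knockdowns by their normalized activity maps.

    mu, sd : (n_factors, J) posterior means and s.d.s of the scaled maps
    Returns a linkage matrix reordered by optimal leaf sorting.
    """
    diff = np.abs(mu[:, None, :] - mu[None, :, :])        # |mu_x - mu_y| per position
    pooled = np.sqrt(sd[:, None, :]**2 + sd[None, :, :]**2)
    D = (diff / pooled).mean(axis=2)                      # average z-score distance
    y = squareform(D, checks=False)                       # condensed distance matrix
    Z = linkage(y, method="average")
    return optimal_leaf_ordering(Z, y)
```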
Extending the additive models to alternative splicing events
While the outlined additive model works well for cassette exons, there may be cases where more complicated but related models are desired. For example, alternative SS choice can be influenced by whether an exonic enhancer or silencer is located between the two alternative SSs. We structured KATMAP with enough flexibility to accommodate these cases; however, for the sake of simplicity, they are not the primary focus of this work. When computing the splicing impact for each exon i, we sum the positional impacts over different regions j in windows centered around the exon’s SSs.
$$\Delta A_{i}={\mathop{\sum }\limits_{j=1}^{J}}B({{{S}}}_{ij};\lambda ,m,d){\alpha }_{j}$$
We extend our model by truncating these windows of sequence when they extend past another SS associated with exon i. To do this, we introduce a precomputed set of weights, τij, equal to the fraction of coordinates within region j where the corresponding window does not extend past another SS associated with exon i. If region j does not extend past another SS, τij = 1; if region j overlaps another SS, τij takes a fractional value; if region j falls on the other side of another SS, τij = 0. These are incorporated into the splicing impacts as
$$\Delta {{{A}}}_{i}={\mathop{\sum }\limits_{j=1}^{J}}B({{{S}}}_{ij};\lambda ,m,d){\tau }_{ij}{\alpha }_{j}.$$
This can be turned on in the command line interface by setting ‘--annotation-based-activity’ to ‘True’.
When learning models from alternative 3SS and 5SS splicing events, we reduced the intronic and exonic window sizes to 100 and 60 nt, respectively, and only considered knockdowns with at least 50 significant splicing changes.
Evaluating a reduced model without splicing activity
To evaluate evidence of splice regulatory activity, we compare to a reduced version of the model that only includes the linear predictors. The predictions under this model are computed as
$$\begin{array}{c}{q}_{i}^{{{\emptyset}}(-)}=\mathop{\sum }\limits_{k=1}^{{K}^{(-)}}{\beta }_{k}^{(-)}{X}_{ik}^{(-)}+{\beta }_{0}^{(-)}\\ {q}_{i}^{{{\emptyset}}(0)}=\mathop{\sum }\limits_{k=1}^{{K}^{(0)}}{\beta }_{k}^{(0)}{X}_{ik}^{(0)}\\ {q}_{i}^{{{\emptyset}}(+)}=\mathop{\sum }\limits_{k=1}^{{K}^{(+)}}{\beta }_{k}^{(+)}{X}_{ik}^{(+)}+{\beta }_{0}^{(+)}\end{array}.$$
This is a standard multinomial logistic regression and has no hyperparameters that would necessitate the use of INLA. We approximate the posterior distribution over the regression coefficients using the same optimization and Laplace approximations used to evaluate the conditional posterior for the full model. We then draw a posterior sample from the reduced model, \({\theta }_{1\ldots N}^{({{\emptyset}})}\), of the same size as the posterior sample obtained for the full model.
Predicting direct targets
We consider an exon a direct target if its splicing impact is sufficiently large to improve upon a reduced model without splicing activity. To assess this, we compare the posterior distribution of predictions between the full and the reduced models, p and \({p}^{{{\emptyset}}}\). In a knockdown experiment, enhancing targets should be downregulated and suppression targets upregulated. Hence, we consider an exon an enhancing target if we are confident the splicing impact, Ai, is negative enough to improve the downregulation prediction,
$$P({A}_{i} < 0| Y) > 0.9\quad {\rm{and}}\quad P\left({p}_{i}^{(-)} > \left.{p}_{i}^{{{\emptyset}}(-)}\right\vert Y\right) > 0.9,$$
and a suppression target if the splicing impact is positive enough to improve the upregulation prediction,
$$P({A}_{i} > 0| Y) > 0.9\quad {\rm{and}}\quad P\left({p}_{i}^{(+)} > \left.{p}_{i}^{{{\emptyset}}(+)}\right\vert Y\right) > 0.9.$$
This assigns each exon to one of three mutually exclusive categories: enhancing target, suppression target and nontarget. In some analyses, we wish to evaluate the confidence of these classifications; we do so with a multinomial regression that asks what fraction of exons with a given splicing impact were assigned to each category. This assigns each exon a probability of being an enhancing target, a nontarget or a suppression target on the basis of its splicing impact. We allowed the fractions of direct targets to differ among upregulated, downregulated and nonsignificant exons by fitting separate intercepts for each category.
Evaluating eCLIP enrichment at targets
For SFs with significant activity maps and available eCLIP data, we downloaded the BED files representing peak calls for both replicates. We did not use the results of the irreproducible discovery rate analysis because we believe this to represent an overly stringent criterion for combining replicate information. We considered the presence of a peak >2-fold enriched over input in at least one of the eCLIP replicates as evidence of potential binding. The signals of differential splicing and eCLIP binding are clearest in highly expressed transcripts because of better representation in the sequencing reads; evaluating enrichment between differentially spliced and nonsignificant exons, therefore, risks spurious associations because of correlated statistical power. To avoid this, we compared binding at targets to nontargets that significantly changed in the same direction. We further focused on the positions predicted by KATMAP to impact splicing. For both targets and nontargets, we counted the number of exons with evidence of binding, ktarget and knontarget, and used a beta-binomial conjugate model with a uniform prior on the fractions (αprior, βprior = 1) to obtain 2,000 posterior samples of the fractions of bound target and nontarget exons, ftarget and fnontarget:
$$\begin{array}{c}{f}_{\rm{target}}| {k}_{\rm{target}} \sim \rm{Beta}(\alpha ={k}_{\rm{target}}+1,\beta ={N}_{\rm{target}}-{k}_{\rm{target}}+1)\\ {f}_{\rm{nontarget}}| {k}_{\rm{nontarget}} \sim \rm{Beta}(\alpha ={k}_{\rm{nontarget}}+1,\beta ={N}_{\rm{nontarget}}-{k}_{\rm{nontarget}}+1)\end{array}$$
where Ntarget and Nnontarget are the total number of predicted targets and nontargets that significantly changed in a given direction. We constructed a posterior sample of enrichment estimates as
$${\log }_{2}\frac{{f}_{\rm{target}}}{{f}_{\rm{nontarget}}}$$
and computed the means and CIs from this.
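A minimal sketch of this enrichment calculation (the 94% interval mirrors the level used for the splicing calls and is an assumption here):

```python
import numpy as np

def eclip_enrichment(k_target, n_target, k_nontarget, n_nontarget,
                     n_samples=2000, seed=0):
    """Posterior log2 enrichment of eCLIP peaks at targets versus nontargets,
    using a uniform Beta(1, 1) prior on each bound fraction."""
    rng = np.random.default_rng(seed)
    f_t = rng.beta(k_target + 1, n_target - k_target + 1, n_samples)
    f_n = rng.beta(k_nontarget + 1, n_nontarget - k_nontarget + 1, n_samples)
    log2_enrich = np.log2(f_t / f_n)
    lo, hi = np.percentile(log2_enrich, [3, 97])   # 94% CI from the posterior sample
    return log2_enrich.mean(), (lo, hi)
```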
Defining effect sizes for changes in splicing
Differential splicing analyses typically report inclusion changes as Δψ = ψkd − ψctrl. However, under the hood, the statistical analysis is typically performed with respect to the log odds transformation of inclusion, \(\phi ={\log }_{2}\frac{\psi }{1-\psi }\), with the effect size being the log odds ratio of inclusion before and after knockdown.
$$\Delta \phi ={\phi }_{kd}-{\phi }_{ctrl}={\log }_{2}\frac{{\psi }_{kd}/\left(1-{\psi }_{kd}\right)}{{\psi }_{ctrl}/\left(1-{\psi }_{ctrl}\right)}$$
While our model does not aim to predict the magnitude of inclusion changes, only the direction, our analyses make use of inclusion levels in two ways. First, we use the pre-knockdown inclusion level, ϕctrl, as a linear predictor because it is informative of whether an increase or decrease in inclusion is detectable. Second, we assess whether our splicing impacts can predict the magnitudes of inclusion changes, despite not being trained on them. In both cases, we favor the log odds representation.
The log odds representations, ϕ and Δϕ, offer two advantages. First, ψ by definition must be between 0 and 1; thus, the magnitude of Δψ only has meaning in the context of the original inclusion level. Here, Δψ = +0.1 is valid for ψctrl = 0.05 but not for ψctrl = 0.95 because ψkd = 1.05 is impossible. A second, related concern is that inclusion levels near 0 or 1 likely represent more extreme splicing contexts than does intermediate inclusion. A change in inclusion from 0.5 to 0.6 corresponds to Δψ = 0.1, as does a change from 0.01 to 0.11. However, an exon included in only 1% of transcripts likely represents a strong silencing context or has weak SSs; hence, the increase of 10% likely reflects a more substantial change in regulation than a change from 50% to 60%. The odds ratio reflects this, treating an inclusion change of 0.5 to 0.6 as representing a smaller effect (Δϕ ≈ +0.6) than a change of 0.01 to 0.11 (Δϕ ≈ +3.6).
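The worked comparison from this paragraph, as a small sketch:

```python
import numpy as np

def delta_phi(psi_ctrl, psi_kd):
    """Log odds ratio of inclusion (base 2), the effect size used here."""
    logit2 = lambda p: np.log2(p / (1 - p))
    return logit2(psi_kd) - logit2(psi_ctrl)

print(delta_phi(0.50, 0.60))   # ~ +0.58 : modest shift at intermediate inclusion
print(delta_phi(0.01, 0.11))   # ~ +3.61 : same delta-psi, much larger effect
```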
When comparing the effect sizes between predicted targets and nontargets, we only compared exons that were differentially spliced in the same direction. For ‘nonsignificant’ exons, we computed the average effect size in two ways. First, we simply averaged across all nonsignificant exons. However, these likely include constitutively included or excluded exons with very strong or weak splicing contexts. To focus on exons that could be subject to alternative splicing regulation by SFs, we also computed the average effect size excluding exons with confidently small Δψ values. We computed CIs for these point estimates using Bayesian bootstrapping.
GO analysis of QKI–RBFOX cooperatively regulated exons
We defined confident QKI–RBFOX cotargets as predicted targets of either QKI or RBFOX that were downregulated upon knockdown and were inferred to also be regulated by the other factor on the basis of splicing impacts. For the latter criterion, we chose a splicing impact cutoff of <−0.25 by examining the empirical cumulative distribution functions (eCDFs) (Fig. 5c,d). We constructed expression-matched background sets from all genes expressed in the QKI and RBFOX HepG2 knockdowns using importance sampling and used GOrilla to identify enriched Gene Ontology (GO) terms relative to this background69. We also performed this analysis using all differentially spliced QKI and RBFOX targets as the background. We used all three ontologies: process, function and component. We identified significantly enriched terms using a 10% FDR threshold.
Inferring SF perturbations from splicing changes
To explain a set of observed splicing changes in terms of changes in SF regulation, we use the splicing impacts previously inferred from the knockdown data as the linear features in a multinomial regression. For each SF, f, with a significant activity map, we use the posterior means of its activity (\({\hat{\alpha }}_{1, \ldots , J}^{(f)}\)) and binding parameters (\({\hat{\lambda }}^{(f)}\), \({\hat{m}}^{(f)}\)) to compute splicing impacts for every considered exon i:
$${{{A}}}_{i}^{(f)}={\mathop{\sum }\limits_{j=1}^{J}}B\left({{{S}}}_{ij};{\hat{\lambda }}^{(f)},{\hat{m}}^{(f)},-{\frac{1}{2}}\right){\hat{\alpha }}_{j}^{(f)}.$$
We incorporate these into a regression model, including linear coefficients to account for unperturbed inclusion levels, expression and their squares:
$$\begin{array}{c}{q}_{i}^{(-)}=-\mathop{\sum }\limits_{f\in SF}{\gamma }^{(f)}{{{A}}}_{i}^{(f)}+\mathop{\sum }\limits_{k=1}^{{K}^{(-)}}{\beta }_{k}^{(-)}{X}_{ik}^{(-)}+{\beta }_{0}^{(-)}\\ {q}_{i}^{(0)}=\mathop{\sum }\limits_{k=1}^{{K}^{(0)}}{\beta }_{k}^{(0)}{X}_{ik}^{(0)}\\ {q}_{i}^{(+)}=\mathop{\sum }\limits_{f\in SF}{\gamma }^{(f)}{{{A}}}_{i}^{(f)}+\mathop{\sum }\limits_{k=1}^{{K}^{(+)}}{\beta }_{k}^{(+)}{X}_{ik}^{(+)}+{\beta }_{0}^{(+)}\end{array},$$
where each γ(f) is a coefficient describing how predictive the splicing impacts for SF f are of the splicing changes. We describe the observed splicing changes \({Y}_{1\ldots N}\) as following a categorical distribution:
$$\begin{array}{rcl}{\hat{p}}_{i}^{(z)}&=&\frac{\exp \left({q}_{i}^{(z)}\right)}{\exp \left({q}_{i}^{(-)}\right)+\exp \left({q}_{i}^{\left(0\right.}\right)+\exp \left({q}_{i}^{(+)}\right)}\\ {Y}_{i}& \sim &{\rm{Categorical}}\left(p=\left[{\hat{p}}_{i}^{(-)},{\hat{p}}_{i}^{(0)},{\hat{p}}_{i}^{(+)}\right]\right)\end{array}$$
Because all of the splicing impacts A are precomputed, this is a standard multinomial logistic regression. We use second-order optimization to find the posterior mode and, as described previously, use the curvature at the mode to approximate the posterior with a multivariate Gaussian, obtaining the posterior mean and s.d. for each coefficient. When considering perturbations inferred from many knockdown datasets, we use ashr to apply shrinkage to the estimates and compute FDR-corrected P values. These shrinkage estimates reflect the assumption that only a subset of SFs were truly perturbed (that is, sparsity). We use the half-normal mixture option to reflect the possibility that the proportions of gains and losses in regulation are unequal.
We compared this to a simpler approach that instead aims to infer perturbations from inclusion changes rather than splicing impacts. Here, we set A equal to the ashr shrinkage estimate of Δψ observed in the knockdown dataset from which the splicing impacts would have been learned. We standardize but do not mean-center the inclusion changes to allow the scale of the coefficients to be compared across factors, while preserving the sign of the inclusion changes. If an exon i was not observed in knockdown f, we set \({{{A}}}_{i}^{(f)}\) equal to zero such that factor f simply does not contribute to the prediction at exon i.
In our analyses of the ENCODE data, we use the same training data to define the observations and linear predictors used to learn the KATMAP models. When applying this approach to the PDA cancer data, we used previous data (Supplementary Table 2 from another study62) to define the training data and sought to match the criteria used in the original analysis to define significant exons. We defined inclusion changes as metastatic versus primary, restricted our analysis to cassette exons observed in at least 75% of samples and defined exons as significantly upregulated or downregulated if inclusion changed by at least ±10% and the reported FDR-corrected P value was less than 0.05. We included all significant exons and 5,000 randomly selected nonsignificant exons in the training set. Because raw read counts were not reported, we could not include log read counts as a linear feature but included the average inclusion level (in log odds scale) from the primary tumor samples as a linear and quadratic predictor.
Comparing inferred perturbations to MAPP
To compare our perturbation analyses to MAPP, we downloaded the MAPP results for the ENCODE knockdowns from Zenodo70. For each knockdown, we extracted the 3SS and 5SS absolute z scores MAPP assigned to each PWM, selecting the maximum of the two splicing z scores. Typically, MAPP aims to distinguish among a large library of motifs, including multiple motifs per factor and paralogs. To make the comparison fairer to MAPP, when computing the rank of the z scores, we only required MAPP to distinguish between SFs also considered by KATMAP. If an SF had multiple motifs in the MAPP outputs, we selected the motif with the highest z score. If an SF had paralogs (for example, PCBP1 and PCBP2), we selected the paralog with the highest z score, considering MAPP to have identified the correct factor if it chose a paralog of the knocked down factor. Because KATMAP models are learned from knockdown data, we limited this analysis to SFs with knockdowns in both cell types and, for the knocked down SF, always used the model learned from the other cell type to infer perturbations (to avoid circularity). We considered SFs for which we had KATMAP models and which had PWMs in the MAPP library.
Simulation study of model inference
For our simulation studies, we wished to assess how reliably our inference algorithm could recover the parameters that truly generated the observed splicing changes. For our true parameters, (θ, φ)(true), we used the posterior means inferred from twenty-one knockdowns that yielded significant activity maps. These parameters coupled with the predictor variables (S and X) naturally assign each exon probabilities of being downregulated, unchanged or upregulated after knockdown. So that our simulated data provide enough statistical power to make confident inferences, we adjust the intercepts to ensure around 500 upregulated and downregulated exons. To generate the simulated observations Y(sim), we sample an outcome for each exon from a categorical distribution based on these predicted probabilities. We then use KATMAP to predict these simulated observations using the original predictor variables S and X to obtain a posterior sample (θ, φ)1, …, (θ, φ)N∣Y(sim). We computed posterior means and CIs and compared these to the true values, (θ, φ)(true).
Statistical analyses
We represent uncertainty in our analyses with CIs. For KATMAP’s inferences, these are obtained by Monte Carlo sampling. With estimates based on count or presence and absence data, we use the beta-binomial conjugate model, with uniform priors. For frequency estimates so obtained, we compute the mean and CI from the posterior beta distribution. For enrichment estimates, we sample frequencies from the posteriors of foreground and background to obtain a posterior sample of enrichments and construct the CI on this basis. Otherwise, we typically used Bayesian bootstrapping, wherein a posterior sample of estimates is obtained by repeatedly computing weighted averages, with the weights sampled from a uniform Dirichlet distribution71.
There are several technical details consistent throughout our approach to inference. We use the JAX scientific library to compute the gradients and Hessian matrices of our probability densities, which we need to optimize and quantify uncertainty in our model. We always work with log probabilities for the sake of numerical stability. Despite this, loss of numerical precision occasionally results in covariance matrices that are not positive semidefinite (that is, matrices with eigenvalues less than zero). In this case, we follow theorem 3.2 from Cheng and Higham72 to identify the nearest positive semidefinite matrix to Σ with eigenvalues greater than \(1{0}^{-5}\), which amounts to setting all eigenvalues less than \(1{0}^{-5}\) to this cutoff.
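A sketch of this eigenvalue-clipping repair:

```python
import numpy as np

def nearest_positive_semidefinite(Sigma, eps=1e-5):
    """Clip eigenvalues below eps, following theorem 3.2 of Cheng and Higham:
    the nearest symmetric matrix whose eigenvalues are all >= eps."""
    Sigma = (Sigma + Sigma.T) / 2                 # symmetrize first
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    eigvals = np.clip(eigvals, eps, None)         # set eigenvalues < eps to eps
    return eigvecs @ np.diag(eigvals) @ eigvecs.T
```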
Predicting the effects of deletions and ssASOs
To evaluate the effects of deletions and ASOs, we computed predictions for modified sequences. We assume that an ASO blocks SF interactions with the positions bound and one base on either side and we treat these positions as if they were the lowest-affinity nucleotide when scoring with the binding model. For deletions, we simply score the sequence with the deletion. We use the posterior mean binding and activity parameters to construct our predictions. Because we wanted to predict how each ASO or deletion alters regulation, rather than compute splicing impacts as
$$\Delta {{{A}}}_{i}={\mathop{\sum }\limits_{j=1}^{J}}B({{{S}}}_{ij};\lambda ,m,d){\alpha }_{j},$$
we instead computed predicted regulation for a given SF on the basis of occupancy:
$${{{A}}}_{i}=\mathop{\sum }\limits_{j=1}^{J}\Theta ({{{a}}}_{ij};F){\alpha }_{j},$$
where \({{{a}}}_{ij}={\rm{log}}\;{\rm{sum}}\;{\rm{exp}}(\lambda {{{S}}}_{ij})\) is the log of the total affinity in region j. We compute the change in regulation because of deletion or ASO i as
$$\Delta {{{A}}}_{i}=\frac{1}{\max (| \alpha | )}\mathop{\sum }\limits_{j=1}^{J}{\alpha }_{j}\left[\Theta \left({{{a}}}_{ij};F\right)-\Theta \left({{{a}}}_{0j};F\right)\right],$$
where a0j is the log affinity of region j in the reference sequence and aij is the log affinity given the deletion or ASO. To compare changes in regulation across models, we normalize the predictions by the maximum change in regulation that could result from a complete loss of occupancy at a binding site (that is, \(\max (| \alpha | )\)). Doing this requires assuming a free protein concentration, F, which is not known a priori. As a proxy, we estimate the free protein concentration in the cells from which each model was learned on the basis of the learned most relevant affinity, m, assuming that the free protein concentration decreased by half after knockdown:
$$F=\frac{1}{\sqrt{k}{e}^{m}},$$
with \(k=\frac{1}{2}\). Our conclusions are not sensitive to choice of k, with \(k=\frac{1}{2},\frac{1}{4},\frac{1}{8}\) all predicting similar disruptions to regulation.
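A sketch of these occupancy-based predictions (illustrative names; the per-region log affinities of the reference and modified sequences are computed as described above):

```python
import numpy as np

def predicted_regulation(a, alpha, F):
    """Occupancy-weighted regulation for one exon: sum_j Theta(a_j; F) * alpha_j."""
    theta = 1.0 / (1.0 + 1.0 / (F * np.exp(a)))   # Langmuir occupancy per region
    return np.sum(theta * alpha)

def aso_effect(a_ref, a_mut, alpha, m, k=0.5):
    """Normalized change in regulation caused by an ASO or deletion.

    a_ref, a_mut : per-region log affinities of the reference and modified sequence
    m            : learned most relevant affinity, used to derive F as a proxy
    """
    F = 1.0 / (np.sqrt(k) * np.exp(m))
    dA = predicted_regulation(a_mut, alpha, F) - predicted_regulation(a_ref, alpha, F)
    return dA / np.max(np.abs(alpha))             # normalize by max possible change
```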
Constructing mutant sequences
For each selected exon, we identified the 10-nt region most strongly contributing to the splicing impact (Fig. 4a,b). We then treated all regions that were at least half as impactful as potential SREs. For each SRE, we identified the three single-nucleotide substitutions most disruptive to binding. For exons with both upstream and downstream SREs, we constructed separate sequences with the upstream or downstream mutations alone, as well as with both upstream and downstream mutations.
Cell culture
HepG2 cells (American Type Culture Collection) were grown in DMEM (Gibco) supplemented with 10% bovine growth serum (Gibco) and 1× penicillin–streptomycin (Gen Clone, 25-512) in a 37 °C incubator under 5% CO2.
Minigene splicing reporter constructs
RHCglo minigene reporter plasmid DNA (Addgene, plasmid 80169) was digested with SalI-HF (New England Biolabs (NEB), R3138S) and XbaI (NEB, R0145S) restriction enzymes to remove the synthetic exon and the flanking human β-globin intronic and SS sequences. Digests were performed in a single reaction according to the manufacturer’s instructions and then purified using a DNA clean and concentrator kit (Zymo, D4003). DNA constructs consisting of target exonic sequences plus 250 nt of flanking upstream and downstream intronic sequence and sequences overlapping the RHCglo sequence (Supplementary Table 5) were synthesized by Twist Bioscience and inserted into the linearized RHCglo pDNA using NEBuilder HiFi DNA assembly master mix (NEB, E2621S). Reactions were used to transform Mix N Go DH5α cells (Zymo). For the NF2 exon 16 mutant reporters used in the RBFOX2–QKI coregulation analysis, sequence changes were introduced into the wild-type minigene using the QuikChange II site-directed mutagenesis kit (Agilent, 200523) following the manufacturer’s instructions and transformed into XL1-Blue supercompetent cells. Plasmid DNA from successful transformants was purified using a Qiagen Spin miniprep kit (Qiagen, 27104). RHCglo was a gift from T. Cooper.
SF knockdown and minigene experiments
For small interfering RNA (siRNA) knockdown minigene reporter experiments, 12-well plates were seeded 24 h in advance with 0.1 × 10⁶ HepG2 cells. Cells were treated for 48 h with 50 pmol of an siRNA pool targeting the SF of interest (Supplementary Table 5) or the ON-TARGETplus siRNA control pool (Dharmacon) using RNAiMAX reagent (Invitrogen). At 48 h, the medium was removed and cells were transfected with 50 ng of minigene reporter pDNA using Lipofectamine 3000 reagent and incubated for 24 h to allow transcript expression. Culture medium was removed, cells were washed twice with 1× PBS pH 7.4 and total RNA was extracted using the RNeasy mini kit (Qiagen, 74106) according to the manufacturer’s instructions. Each experiment was performed as three biological replicates. The PCBP1 upstream mutant reporter assay was performed without the siRNA knockdown step as a single replicate and was not used in later experiments because of the impact of the mutations on 3SS strength.
Minigene splicing quantification by semiquantitative reverse transcription (RT)–PCR
First-strand complementary DNA (cDNA) was generated from 100–150 ng of extracted total RNA using an Oligo d(T)20 primer (Invitrogen, 18418020) and SuperScript IV reverse transcriptase (Invitrogen, 18090050) according to the manufacturer’s instructions. cDNA was PCR-amplified using the RHCglo gene-product-specific primers RSV5U and RTRHC73 (universal to all constructs) and Q5 high-fidelity 2× master mix (NEB, M0492S) for 28 cycles. PCR products were resolved on 3% agarose gels and the band intensities of the exon inclusion and exon skipping isoforms were quantified using the ImageJ (version 1.53k) software package. PSI was calculated as the band intensity of the inclusion isoform divided by the total intensity of both isoforms.
Confirmation of SF knockdown by western blot
HepG2 cells were treated with each siRNA and minigene combination as described above. Cells were washed with 1× PBS and then lysed in RIPA buffer (50 mM Tris pH 7.0, 150 mM NaCl, 0.1% SDS, 0.5% sodium deoxycholate and 1% Triton X-100). Protein from each lysate was resolved on Bis–Tris NuPAGE acrylamide gels (Invitrogen) and then transferred to an iBlot nitrocellulose membrane (IB4301001) using the iBlot dry blot system. Membranes were blocked for 1 h in Azure fluorescent blot blocking buffer (Azure Biosystems). Rabbit primary antibodies for each SF (Supplementary Table 5) at 1:1,000 dilution were added to the membranes and incubated at 4 °C overnight with gentle agitation. Membranes were washed with 1× TBST and then incubated with a fluorescently labeled secondary antibody (Supplementary Table 5) at a 1:10,000 dilution for 1 h at room temperature, washed with 1× TBST and visualized using an Azure c600 (Azure Biosystems). Each membrane was then incubated with mouse anti-α-tubulin (1:2,500) for 1 h at room temperature, washed with 1× TBST, incubated with secondary antibody (Supplementary Table 5) at room temperature for 1 h, washed and visualized. Band intensity was quantified using ImageJ software and the knockdown efficiency was calculated as the α-tubulin-normalized intensity of the SF’s band in the SF siRNA-treated cells over that in the control siRNA-treated cells (Supplementary Table 5).
Computational resources
Analyses were performed on a Dell Precision 7740 laptop with an NVIDIA Quadro RTX 3000 GPU (5.9 GB, CUDA version 11.4), 16 CPUs (Intel(R) Xeon(R) E-2286M, 2.40 GHz) and 67.2 GB of RAM plus 137.4-GB swap and a Dell Precision 7960 tower with an NVIDIA Ada RTX 5000 GPU (33.7 GB, CUDA version 12.2), 72 CPUs (Intel(R) Xeon(R) w9-3475X) and 201.7 GB of RAM.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.