R2D2M2 prior and monotonic predictors

I’m currently trying to formulate a multilevel model which contains 46 ordinal predictors and 93 regression terms overall. The high number of predictors warranted the use of a shrinkage prior, and I was considering the R2D2M2 prior for this purpose (from this article by @javier.aguilar and @paul.buerkner). The problem is that I initially intended to model the predictors as monotonic effects, as described in this article by Bürkner and @Emmanuel_Charpentier.

After reading the article on the R2D2M2 prior, I noticed that predictors must be standardized based on the sample mean and variance. I’m guessing there’s an incompatibility between these modeling techniques, since it doesn’t seem to make sense to “standardize” monotonic effects. On the other hand, if these effects are not standardized, then Var(\mu) \neq \sigma^2 \tau^2.

Here’s a simplified reprex of the model I have in mind with only 3 predictors using brms:

# Simulate data
N <- 100 # number of observations
J <- 20 # number of levels
y <- rnorm(N) # Outcome variable
jj <- sample(1:J, N, replace = T) # Grouping index variable
# Ordinal predictors
x1 <- sample(1:3, N, replace = T) 
x2 <- sample(1:3, N, replace = T)
x3 <- sample(1:3, N, replace = T)

data <- data.frame(y, x1, x2, x3, jj)

# Generate stancode with brms 
library(brms)
formula <- bf(y ~ 1 + mo(x1) + mo(x2) + mo(x3) + (1 + mo(x1) + mo(x2) + mo(x3) | jj))
bprior <- prior(R2D2(), class = b)
make_stancode(formula, data, prior = bprior)

Notice that the Stan code generated by brms does not center or scale monotonic variables. This means that the global variance of the model, \tau^2, will vary based on the mean and standard deviation of each monotonic effect, which in turn depends on how each predictor variable is distributed (e.g. uniform, asymmetric, concentrated around a specific category, etc.) AND on the shape parameter.
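
To make that concrete, here is a small illustration (`mo_values` is a hypothetical helper, not part of brms) of how the implied SD of a monotonic term changes with the distribution of the predictor even when the simplex is held fixed:

```r
set.seed(1)
# brms assigns category i of a monotonic predictor the value D * sum(zeta[1:i]),
# where zeta is the simplex (shape) parameter of length D.
mo_values <- function(zeta) {
  D <- length(zeta)
  D * c(0, cumsum(zeta)) # values for categories 1 .. D + 1
}

zeta <- c(0.5, 0.5) # symmetric simplex for a 3-level predictor
x_unif <- sample(1:3, 1e5, replace = TRUE)
x_skew <- sample(1:3, 1e5, replace = TRUE, prob = c(0.9, 0.05, 0.05))
sd_unif <- sd(mo_values(zeta)[x_unif])
sd_skew <- sd(mo_values(zeta)[x_skew])
sd_unif # about 0.82
sd_skew # much smaller: the implied scale depends on how x is distributed
```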

In short, I’m wondering:

  1. Are the estimates from this model interpretable?
  2. If not, can monotonic predictors and the R2D2M2 prior (or even the R2D2 prior) be used together?
  3. If not, what else could be done?

My first idea is to remove the monotonic transformations and model the ordinal predictors simply as continuous predictors. I could then retrieve the implied effects for each category of the predictors using a linear transformation on posterior draws.
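
A sketch of that last idea (`draws_b` is a placeholder standing in for posterior draws of the slope, e.g. from `as_draws_df(fit)$b_x1` after fitting with brms):

```r
# Placeholder draws of the slope of an ordinal predictor treated as numeric
draws_b <- rnorm(4000, mean = 0.3, sd = 0.1)
# The implied effect of category k relative to category 1 is b * (k - 1)
implied <- sapply(1:3, function(k) draws_b * (k - 1))
colnames(implied) <- paste0("category_", 1:3)
# Posterior summaries of the implied effect per category
apply(implied, 2, quantile, probs = c(0.025, 0.5, 0.975))
```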


Yes, you can interpret the results of these models independently of whether or not you put an R2D2(M2) prior on them.

The main question is: Is the amount of shrinkage applied to each term “fair”?

The mean of the variables doesn’t really matter for this. The scale of the variables does, though. What I recommend is to have all the predictor variables on the same scale. So, in your case, if x1 to x3 are on the same scale (ranging from 1 … 3, say), you should be fine.

What does become challenging is when you are combining monotonic effects with regular linear effects. The latter you can scale as you wish of course, but as you note, the monotonic effects are harder to scale.

My suggestion is the following: go to the definition of the custom mo() function in the brms-generated Stan code. Change it from

// usual definition
real mo(vector scale, int i) {
  if (i == 0) {
    return 0;
  } else {
    return rows(scale) * sum(scale[1:i]);
  }
}

to

// new definition
real mo(vector scale, int i) {
  if (i == 0) {
    return 0;
  } else {
    return sum(scale[1:i]);
  }
}

This way, you scale your monotonic predictors to the unit interval (min value 0 to max value 1). Now, you can also scale all your regular predictors to the unit interval. That way, shrinkage should be fair across all terms.
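
For the regular predictors, the matching rescaling is a one-liner (`to_unit` is an illustrative helper, not a brms function):

```r
# Rescale a continuous predictor to the unit interval [0, 1]
to_unit <- function(x) (x - min(x)) / (max(x) - min(x))

z <- c(2, 5, 11, 7)
to_unit(z) # 0.0000000 0.3333333 1.0000000 0.5555556
```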


That’s an interesting idea. While thinking through the problem, I came up with a similar solution where, instead of scaling by the number of rows, I would scale all predictors by a constant such that, if they were symmetric (uniformly distributed across categories) and all elements of the shape parameter were equal (\zeta_i = 1/D), the scale of the overall monotonic effect would be 1. For a 3-level monotonic predictor, that constant turns out to be \sqrt{12/5}. I did a small simulation to confirm that my math was correct:

set.seed(1)
iter <- 10000
N <- 1000
c <- sqrt(12 / 5)
delta_1 <- 1 / 2
m <- c(0, delta_1, 1)
var_bmc <- numeric(iter) # preallocate instead of growing with c()
for (i in 1:iter) {
    mc_i <- sample(m * c, N, replace = TRUE)
    b_i <- rnorm(N)
    bmc_i <- b_i * mc_i
    var_bmc[i] <- var(bmc_i)
}
mean(var_bmc) # 1.001602 ≈ 1
hist(var_bmc)
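
The simulation agrees with a quick analytic check: assuming b is standard normal and independent of the scaled predictor mc,

\operatorname{Var}(b \cdot mc) = \mathbb{E}[b^2]\,\mathbb{E}[(mc)^2] = c^2\,\mathbb{E}[m^2] = c^2 \cdot \frac{0 + 1/4 + 1}{3} = \frac{5}{12}\,c^2,

which equals 1 exactly when c = \sqrt{12/5}.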

From what I understand, both solutions should give the same results (up to a constant) since the shrinkage applied to each term would be “fair”, as you say.

Still, I’m surprised this is the first time I’ve heard about this, because it looks like a big issue when formulating priors for models that combine discrete and continuous effects. I’d be curious to read more on the subject. Like, is the amount of shrinkage really fair if the distribution of one of the discrete predictors is super asymmetric (a kind of 0-inflated predictor)? I’m not sure how you would formalize that idea.

That is a very good question. I guess for a standard categorical predictor with dummy variables, I could scale the dummy variables to all have SD = 1 independently of class imbalances, and thus the same scale as all my standardized continuous predictors that I might add too. Of course, the amount of shrinkage of the categorical predictors would depend on the chosen coding (e.g., treatment vs. dummy coding), but that is something we always have to deal with when putting priors on categorical dummy variables.
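
That could look like this (an illustrative sketch; the rescaling is done manually on the design matrix, not via any brms option):

```r
set.seed(1)
# Imbalanced 3-level factor
g <- factor(sample(c("a", "b", "c"), 200, replace = TRUE, prob = c(0.7, 0.2, 0.1)))
X <- model.matrix(~ g)[, -1] # treatment-coded dummy variables, intercept dropped
# Rescale each dummy column to SD = 1 regardless of class imbalance
X_scaled <- scale(X, center = FALSE, scale = apply(X, 2, sd))
apply(X_scaled, 2, sd) # each dummy now has SD = 1
```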

As for the monotonic effects, things are a bit awkward since the implied SD of the predictor is not a priori fixed (as for continuous or categorical predictors) but dependent on the simplex shape parameter. Accordingly, if we want to choose a fixed scaling, we have to choose an a priori sensible value, such as the prior mean that you used in your example. Of course, we could also adjust the scaling on the fly by conditioning on the current simplex shape parameter value during sampling, but I am not sure what kind of posterior geometry we would imply by it. Worth investigating though if someone has time on their hands. May even be worth a paper in the end, if we come up with some interesting thoughts on shrinkage priors on monotonic (and similar) effects.
