R2D2M2 prior and monotonic predictors

I’m currently trying to formulate a multilevel model which contains 46 ordinal predictors and 93 regression terms overall. The high number of predictors warranted the use of a shrinkage prior, and I was considering the R2D2M2 prior for this purpose (from this article by @javier.aguilar and @paul.buerkner). The problem is that I initially intended to model the predictors as monotonic effects, as described in this article by Bürkner and @Emmanuel_Charpentier.

After reading the article on the R2D2M2 prior, I noticed that predictors must be standardized based on the sample mean and variance. I’m guessing there’s an incompatibility between these modeling techniques, since it doesn’t seem to make sense to “standardize” monotonic effects. On the other hand, if these effects are not standardized, then Var(\mu) \neq \sigma^2 \tau^2.

Here’s a simplified reprex of the model I have in mind with only 3 predictors using brms:

# Simulate data
N <- 100 # number of observations
J <- 20 # number of levels
y <- rnorm(N) # Outcome variable
jj <- sample(1:J, N, replace = T) # Grouping index variable
# Ordinal predictors
x1 <- sample(1:3, N, replace = T) 
x2 <- sample(1:3, N, replace = T)
x3 <- sample(1:3, N, replace = T)

data <- data.frame(y, x1, x2, x3, jj)

# Generate stancode with brms 
library(brms)
formula <- bf(y ~ 1 + mo(x1) + mo(x2) + mo(x3) + (1 + mo(x1) + mo(x2) + mo(x3) | jj))
bprior <- prior(R2D2(), class = b)
make_stancode(formula, data, prior = bprior)

Notice that the Stan code generated by brms does not center or scale monotonic variables. This means that the global variance of the model, \tau^2, will vary based on the mean and standard deviation of each monotonic effect, which in turn depends on how each predictor variable is distributed (e.g. uniform, asymmetric, concentrated around a specific category, etc.) AND on the shape parameter.
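
To make that concrete, here is a small illustration (`mo_values` is a hypothetical helper, not part of brms) of how the implied SD of a monotonic term changes with the distribution of the predictor even when the simplex is held fixed:

```r
set.seed(1)
# brms assigns category i of a monotonic predictor the value D * sum(zeta[1:i]),
# where zeta is the simplex (shape) parameter of length D.
mo_values <- function(zeta) {
  D <- length(zeta)
  D * c(0, cumsum(zeta)) # values for categories 1 .. D + 1
}

zeta <- c(0.5, 0.5) # symmetric simplex for a 3-level predictor
x_unif <- sample(1:3, 1e5, replace = TRUE)
x_skew <- sample(1:3, 1e5, replace = TRUE, prob = c(0.9, 0.05, 0.05))
sd_unif <- sd(mo_values(zeta)[x_unif])
sd_skew <- sd(mo_values(zeta)[x_skew])
sd_unif # about 0.82
sd_skew # much smaller: the implied scale depends on how x is distributed
```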

In short, I’m wondering:

  1. Are the estimates from this model interpretable?
  2. If not, can monotonic predictors and the R2D2M2 prior (or even the R2D2 prior) be used together?
  3. If not, what else could be done?

My first idea is to remove the monotonic transformations and model the ordinal predictors simply as continuous predictors. I could then retrieve the implied effects for each category of the predictors using a linear transformation on posterior draws.
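
A sketch of that last idea (`draws_b` is a placeholder standing in for posterior draws of the slope, e.g. from `as_draws_df(fit)$b_x1` after fitting with brms):

```r
# Placeholder draws of the slope of an ordinal predictor treated as numeric
draws_b <- rnorm(4000, mean = 0.3, sd = 0.1)
# The implied effect of category k relative to category 1 is b * (k - 1)
implied <- sapply(1:3, function(k) draws_b * (k - 1))
colnames(implied) <- paste0("category_", 1:3)
# Posterior summaries of the implied effect per category
apply(implied, 2, quantile, probs = c(0.025, 0.5, 0.975))
```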


Yes, you can interpret the results of these models independently of whether or not you put an R2D2(M2) prior on them.

The main question is: Is the amount of shrinkage applied to each term “fair”?

The mean of the variables doesn’t really matter for this. The scale of the variables does, though. What I recommend is to have all the predictor variables on the same scale. So, in your case, if x1 to x3 are on the same scale (ranging from 1 … 3, say), you should be fine.

What does become challenging is when you are combining monotonic effects with regular linear effects. The latter you can scale as you wish of course, but as you note, the monotonic effects are harder to scale.

My suggestion is the following: go to the definition of the custom mo() function in the brms-generated Stan code. Change it from

// usual definition
real mo(vector scale, int i) {
  if (i == 0) {
    return 0;
  } else {
    return rows(scale) * sum(scale[1:i]);
  }
}

to

// new definition
real mo(vector scale, int i) {
  if (i == 0) {
    return 0;
  } else {
    return sum(scale[1:i]);
  }
}

This way, you scale your monotonic predictors to the unit interval (min value 0 to max value 1). Now, you can also scale all your regular predictors to the unit interval. That way, shrinkage should be fair across all terms.
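
For the regular predictors, the matching rescaling is a one-liner (`to_unit` is an illustrative helper, not a brms function):

```r
# Rescale a continuous predictor to the unit interval [0, 1]
to_unit <- function(x) (x - min(x)) / (max(x) - min(x))

z <- c(2, 5, 11, 7)
to_unit(z) # 0.0000000 0.3333333 1.0000000 0.5555556
```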


That’s an interesting idea. While thinking through the problem, I came up with a similar solution where, instead of scaling by the number of rows, I would scale all predictors by a constant such that, if they were symmetric (uniformly distributed across categories) and all elements of the shape parameter were equal (\zeta_i = 1/D), the scale of the overall monotonic effect would be 1. For a 3-level monotonic predictor, that constant turns out to be \sqrt{12/5}. I did a small simulation to confirm that my math was correct:

set.seed(1)
iter <- 10000
N <- 1000
c <- sqrt(12 / 5)
delta_1 <- 1 / 2
m <- c(0, delta_1, 1)
var_bmc <- numeric(iter) # preallocate instead of growing with c()
for (i in 1:iter) {
    mc_i <- sample(m * c, N, replace = TRUE)
    b_i <- rnorm(N)
    bmc_i <- b_i * mc_i
    var_bmc[i] <- var(bmc_i)
}
mean(var_bmc) # 1.001602 ≈ 1
hist(var_bmc)
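
The simulation agrees with a quick analytic check: assuming b is standard normal and independent of the scaled predictor mc,

\operatorname{Var}(b \cdot mc) = \mathbb{E}[b^2]\,\mathbb{E}[(mc)^2] = c^2\,\mathbb{E}[m^2] = c^2 \cdot \frac{0 + 1/4 + 1}{3} = \frac{5}{12}\,c^2,

which equals 1 exactly when c = \sqrt{12/5}.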

From what I understand, both solutions should give the same results (up to a constant) since the shrinkage applied to each term would be “fair”, as you say.

Still, I’m surprised this is the first time I’ve heard about this, because it looks like a big issue when formulating priors for models that combine discrete and continuous effects. I’d be curious to read more on the subject. Like, is the amount of shrinkage really fair if the distribution of one of the discrete predictors is super asymmetric (a kind of 0-inflated predictor)? I’m not sure how you would formalize that idea.

That is a very good question. I guess for a standard categorical predictor with dummy variables, I could scale the dummy variables to all have SD = 1 independently of class imbalances, and thus the same scale as all my standardized continuous predictors that I might add too. Of course, the amount of shrinkage of the categorical predictors would depend on the chosen coding (e.g., treatment vs. dummy coding), but that is something we always have to deal with when putting priors on categorical dummy variables.
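
That could look like this (an illustrative sketch; the rescaling is done manually on the design matrix, not via any brms option):

```r
set.seed(1)
# Imbalanced 3-level factor
g <- factor(sample(c("a", "b", "c"), 200, replace = TRUE, prob = c(0.7, 0.2, 0.1)))
X <- model.matrix(~ g)[, -1] # treatment-coded dummy variables, intercept dropped
# Rescale each dummy column to SD = 1 regardless of class imbalance
X_scaled <- scale(X, center = FALSE, scale = apply(X, 2, sd))
apply(X_scaled, 2, sd) # each dummy now has SD = 1
```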

As for the monotonic effects, things are a bit awkward since the implied SD of the predictor is not a priori fixed (as for continuous or categorical predictors) but dependent on the simplex shape parameter. Accordingly, if we want to choose a fixed scaling, we have to choose an a priori sensible value, such as the prior mean that you used in your example. Of course, we could also adjust the scaling on the fly by conditioning on the current simplex shape parameter value during sampling, but I am not sure what kind of posterior geometry we would imply by it. Worth investigating though if someone has time on their hands. May even be worth a paper in the end, if we come up with some interesting thoughts on shrinkage priors on monotonic (and similar) effects.
