That’s an interesting idea. While thinking through the problem, I came up with a similar solution: instead of scaling by the number of rows, I would scale all monotonic predictors by a constant chosen so that, if the predictor were symmetric (i.e., uniformly distributed across its categories) and all elements of the simplex parameter were equal (\zeta_i = 1/D), the scale of the overall monotonic effect would be 1. For a 3-level monotonic predictor, that constant turns out to be \sqrt{12/5}. I ran a small simulation to confirm that my math was correct:
set.seed(1)
iter <- 10000
N <- 1000
scale_const <- sqrt(12 / 5)  # scaling constant for a 3-level monotonic predictor
delta_1 <- 1 / 2             # middle category sits at half the total effect
m <- c(0, delta_1, 1)        # monotonic transform of the 3 categories
var_bmc <- numeric(iter)     # preallocate instead of growing inside the loop
for (i in 1:iter) {
  mc_i <- sample(m * scale_const, N, replace = TRUE)  # uniform over categories
  b_i <- rnorm(N)                                     # standard normal coefficient
  var_bmc[i] <- var(b_i * mc_i)
}
mean(var_bmc) # 1.001602 ≈ 1
hist(var_bmc)
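For completeness, here is the short derivation the simulation is checking, assuming the coefficient b (standard normal) is independent of the scaled predictor value c\,m, which is uniform over \{0, 1/2, 1\}:

$$
\mathrm{Var}(b \cdot c\,m) = \mathbb{E}[b^2]\,\mathbb{E}[(c\,m)^2] = c^2 \cdot \frac{0^2 + (1/2)^2 + 1^2}{3} = \frac{5}{12}\,c^2,
$$

so setting $c = \sqrt{12/5}$ makes the variance exactly 1, which matches the simulated mean of about 1.0016.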
From what I understand, both solutions should give the same result (up to a constant), since the shrinkage applied to each term would be “fair”, as you say.
Still, I’m surprised this is the first time I’ve heard about this, because it looks like a big issue when formulating priors for models that combine discrete and continuous effects. I’d be curious to read more on the subject. For instance, is the amount of shrinkage really fair if the distribution of one of the discrete predictors is highly asymmetric (a kind of 0-inflated predictor)? I’m not sure how you would formalize that idea.
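To make the asymmetry worry concrete, here is a variation of the simulation above where the 3 categories are 0-inflated (the 90% / 5% / 5% frequencies are made up purely for illustration). With the same \sqrt{12/5} constant, the induced variance of the term drops far below 1, so the effective shrinkage is much stronger than for a uniformly distributed predictor:

```r
set.seed(1)
iter <- 10000
N <- 1000
scale_const <- sqrt(12 / 5)  # constant calibrated for the *uniform* case
m <- c(0, 1 / 2, 1)          # monotonic transform of the 3 categories
var_bmc <- numeric(iter)
for (i in 1:iter) {
  # hypothetical 0-inflated predictor: 90% of observations in the lowest category
  mc_i <- sample(m * scale_const, N, replace = TRUE,
                 prob = c(0.90, 0.05, 0.05))
  var_bmc[i] <- var(rnorm(N) * mc_i)
}
mean(var_bmc)  # ≈ 0.15, far from the target of 1
```

Analytically this matches E[(c m)^2] = c^2 (0.25 * 0.05 + 1 * 0.05) = (12/5) * 0.0625 = 0.15, so any calibration based only on the number of categories ignores how the predictor is actually distributed.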