I’m running a beta regression with 10 continuous predictors. I’m fairly new to Bayesian modeling (about a week in) and am trying to set weakly informative priors. After standardizing my predictors, I specified the following:
Intercept: Normal(-0.5, 1.2) — based on the mean and SD from a previous study
Beta coefficients: Normal(0, 1)
Phi: Gamma(0.1, 0.1) (the brms default)
When I simulate outcome data from these priors, most of the simulated y values pile up very close to 0 or 1. I assume this is because the variance contributed by ten predictors with fairly liberal betas adds up on the logit scale.
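For reference, here is a minimal version of that simulation (a sketch, not my exact code; I draw the predictors as fresh standard normals rather than using my real X):

```r
set.seed(1)
P <- 10
n_sims <- 10000
y_sim <- replicate(n_sims, {
  b0 <- rnorm(1, -0.5, 1.2)      # intercept prior
  beta <- rnorm(P, 0, 1)         # independent N(0, 1) slopes
  x <- rnorm(P)                  # one standardized predictor row
  plogis(b0 + sum(x * beta))     # implied mean of the beta outcome
})
# the linear predictor has sd around sqrt(1.2^2 + 10), about 3.4,
# so a large share of draws land near 0 or 1
mean(y_sim < 0.05 | y_sim > 0.95)
```

With ten independent unit-normal slopes the prior variance of the linear predictor is about 10 before the intercept even enters, which is why the inverse link pushes so much mass to the extremes.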
I tried tightening the beta priors, but I worry this makes them too informative. How do I balance keeping the priors weakly informative with keeping the prior predictive distribution reasonable? I would be really grateful for any advice :)
Good on you for checking the prior predictive distribution. Shrinkage priors were developed for exactly this situation. I would recommend looking into the R2D2 prior paradigm, which is nicely implemented in brms and offers an elegant way to decompose the explained variance across your predictors.
The problem you’re running into is that independent weakly informative priors do not add up to a multivariate weakly informative prior. This was the point of the (arguably misleadingly named) paper that led us to take prior predictive checks more seriously:
As @mhollanders pointed out, you probably need a better weakly informative joint prior from which to simulate. I don’t know if the R2D2 prior easily admits simulation, but @bgoodri will know.
Also, it’s OK to simulate from a more constrained prior than the one you will fit with, as long as there’s enough data to resolve the model. You can’t do strictly proper simulation-based calibration this way, but I believe it’s what most of us do in practice.
Thank you so much for the reply and explanation @mhollanders and @Bob_Carpenter :) I will try working it out with R2D2 prior! @Bob_Carpenter, I am not sure if I understood your last point correctly: if the data is strong enough, the posterior will be dominated by it, so is it okay to use a slightly different (more constrained) prior for prior predictive checks? I might be misunderstanding.
Simulating from R2D2 priors is really straightforward. Say you have P predictors. You can either put a prior on R^2 and transform it to the explained-variance ratio W = \frac{R^2}{1 - R^2}, or place a prior directly on W; the key part is that you decompose the explained variance between your P predictors. So you introduce a P-simplex \phi that partitions W across the P components.
For generating the simplex, the two straightforward options are a Dirichlet or a logistic normal. In both cases, I think it makes sense to introduce another parameter \theta: the shape of the Dirichlet (shared across variance components) or the scale of the logistic normal. For the logistic normal, larger values of \theta correspond to a sparser simplex, which I find intuitive, so I use the inverse of \theta as the Dirichlet shape to get the same “direction” of effect. Note that by default in brms, you supply \theta as data rather than estimating it.
In R:
P <- 10                   # number of predictors
dirichlet <- TRUE         # switch between the two simplex constructions

R2 <- rbeta(1, 1, 1)      # prior on R^2
W <- R2 / (1 - R2)        # explained-variance ratio
theta <- rgamma(1, 1, 1)  # Dirichlet inverse shape or logistic-normal scale

if (dirichlet) {
  phi <- rdirichlet(1, rep(1 / theta, P))  # e.g. gtools::rdirichlet
} else {
  u <- rnorm(P, 0, theta)      # unconstrained simplex parameters
  u <- u - mean(u)             # zero-sum for identifiability
  phi <- exp(u) / sum(exp(u))  # softmax
}
Now if you put normal priors on your P coefficients, the prior scales of the coefficients and the coefficients themselves are:
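Continuing the sketch (shown here with the logistic-normal simplex, and with the residual variance taken as 1, as for a standardized Gaussian outcome; for non-Gaussian likelihoods you would multiply in the appropriate pseudo-variance):

```r
set.seed(2)
P <- 10
R2 <- rbeta(1, 1, 1)         # prior on R^2
W <- R2 / (1 - R2)           # explained-variance ratio
u <- rnorm(P, 0, 1)          # unconstrained simplex parameters
u <- u - mean(u)             # zero-sum for identifiability
phi <- exp(u) / sum(exp(u))  # logistic-normal simplex
scales <- sqrt(W * phi)      # prior sd of each coefficient
beta <- rnorm(P, 0, scales)  # R2D2 draws of the P coefficients
sum(scales^2)                # recovers W: the simplex partitions the variance
```

Note that the squared scales sum back to W exactly, so the total explained variance stays fixed no matter how many predictors you add.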
Thank you, @mhollanders :) I do not have random effects.
I also found this resource, which explains that non-Gaussian models need an adjustment. Based on that, this is what I did to simulate values of the dependent variable:
df <- py$data                            # data frame passed in from Python (reticulate)
X <- as.matrix(df[, paste0("x", 1:10)])  # the predictors are already standardized
n_draws <- 10000
n_obs <- nrow(X)                         # n = 66
P <- 10
dirichlet <- TRUE
yrep <- matrix(NA, nrow = n_obs, ncol = n_draws)
for (i in 1:n_draws) {
  # explained-variance ratio W
  R2 <- rbeta(1, 1, 1)
  W <- R2 / (1 - R2)
  # intercept and beta-likelihood precision
  b0 <- rnorm(1, -0.5, 1.2)
  scale <- rgamma(1, 1, 1)               # precision (phi) of the beta likelihood
  # simplex partitioning W across predictors
  theta <- rgamma(1, 2, 2)
  if (dirichlet) {
    phi <- rdirichlet(1, rep(1 / theta, P))  # e.g. gtools::rdirichlet
  } else {
    u1 <- rnorm(P, 0, theta)
    u1 <- u1 - mean(u1)                  # zero-sum for identifiability
    phi <- exp(u1) / sum(exp(u1))        # softmax
  }
  # pseudo-variance adjustment for the non-Gaussian likelihood
  u <- plogis(b0)
  var <- u * (1 - u) / (1 + scale)       # Var(y) for a beta with mean u, precision scale
  s2b0 <- var / (u * (1 - u))^2          # delta-method variance on the logit scale
  # coefficient draws
  scales <- sqrt(W * phi * s2b0)
  beta <- rnorm(P, 0, scales)
  # linear predictor and implied means
  eta <- X %*% beta + b0
  yrep[, i] <- plogis(eta)               # prior draws of the mean mu
}
hist(as.vector(yrep), breaks = 100, main = "Predicted probabilities",
     xlab = "yrep", col = "skyblue")
The prior predictive still shows extreme values more frequently than I would expect, just as it did with the weakly informative priors. Extreme values are theoretically possible, but I do not expect them at this frequency. I assumed the R2D2 prior would shrink most betas to zero, so maybe I misunderstood, or maybe I implemented the extra pseudo-variance adjustment step incorrectly. I am thoroughly confused :(