How to specify weakly informative priors for beta regression with 10 predictors?

Hi all,

I’m running a beta regression with 10 continuous predictors. I’m fairly new to Bayesian modeling (about a week in), and I’m trying to set weakly informative priors. I specified the following after standardizing my predictors:

  • Intercept: Normal(-0.5, 1.2) — based on the mean and SD from a previous study
  • Beta coefficients: Normal(0, 1)
  • Phi: Gamma(0.1, 0.1) (the brms default)

When I simulate outcome data from these priors, most simulated y values are very close to 0 or 1. I assume this happens because the variance contributed by 10 predictors, each with a fairly liberal coefficient prior, adds up on the logit scale.
I tried tightening the beta priors, but I worry this might make them too informative. I’m confused about how to proceed and how to balance keeping the priors weakly informative with keeping the prior predictive distribution reasonable. I would be really grateful for any advice on this issue :)
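
For concreteness, here is roughly the kind of simulation I mean (a minimal sketch; the matrix X below is just a simulated stand-in for my standardized predictors):

P <- 10; n_obs <- 66; n_draws <- 1000
X <- matrix(rnorm(n_obs * P), n_obs, P)   # stand-in for the standardized predictors

yrep <- replicate(n_draws, {
  b0   <- rnorm(1, -0.5, 1.2)             # intercept prior
  beta <- rnorm(P, 0, 1)                  # coefficient priors
  phi  <- rgamma(1, 0.1, 0.1)             # precision prior (brms default)
  mu   <- plogis(as.vector(X %*% beta + b0))
  rbeta(n_obs, mu * phi, (1 - mu) * phi)  # outcome draws
})
hist(yrep, breaks = 100)                  # mass piles up near 0 and 1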

Good on you for checking the prior predictive distribution. Shrinkage priors were developed exactly for this reason. I would recommend looking into the R2D2 prior paradigm, which is nicely implemented in brms and is an elegant solution to decomposing the explained variance between your predictors.


Hi, @spn, and welcome to the Stan forums!

The problem you’re running into is that independent weakly informative priors do not add up to a multivariate weakly informative prior. This was the point of the (arguably misleadingly named) paper that led us to take prior predictive checks more seriously.

As @mhollanders pointed out, you probably need a better weakly informative joint prior from which to simulate. I don’t know if the R2D2 prior easily admits simulation, but @bgoodri will know.

Also, it’s OK to simulate from a more constrained prior than the one you will use to fit, as long as there’s enough data to resolve the model. You can’t do strictly proper simulation-based calibration this way, but I believe it’s what most of us do in practice.


Thank you so much for the reply and explanation, @mhollanders and @Bob_Carpenter :) I will try working it out with the R2D2 prior! @Bob_Carpenter, I am not sure if I understood your last point correctly: if the data are strong enough, the posterior will be dominated by them, so is it okay to use a slightly different (more constrained) prior for the prior predictive checks? I might be misunderstanding.

Simulating from R2D2 priors is really straightforward. Say you have P predictors. You can either put a prior on R^2 and transform it to the explained variance, W = \frac{R^2}{1 - R^2}, or place a prior directly on W, but the key part is that you decompose the explained variance among your P predictors. So you introduce a P-simplex \phi that partitions W across your P components.

For the simplex generation, the two straightforward options are a Dirichlet or a logistic normal. For both, I think it makes sense to introduce another parameter \theta that is either the shape of the Dirichlet (shared across variance components) or the scale of the logistic normal. In the logistic normal, larger values of \theta correspond to a sparser simplex, which I find intuitive, so I use the inverse of \theta as the Dirichlet shape to get the same “direction” of effect. Note that by default in brms, you supply \theta as data rather than estimating it.

In R:

P <- 10                   # 10 predictors
dirichlet <- TRUE         # TRUE for the Dirichlet simplex, FALSE for the logistic normal
R2 <- rbeta(1, 1, 1)      # prior draw for R^2
W <- R2 / (1 - R2)        # explained variance
theta <- rgamma(1, 1, 1)  # Dirichlet inverse shape or logistic-normal scale
if (dirichlet) {
  phi <- as.vector(rdirichlet(1, rep(1 / theta, P)))  # a random Dirichlet generator, e.g. gtools::rdirichlet
} else {
  u <- rnorm(P, 0, theta)      # unconstrained simplex parameters
  u <- u - mean(u)             # zero-sum for identifiability
  phi <- exp(u) / sum(exp(u))  # softmax
}

Now if you put normal priors on your P coefficients, the prior scales of the coefficients and the coefficients themselves are:

scales <- sqrt(W * phi)
beta <- rnorm(P, 0, scales)

Expanding this to random slopes or random intercepts for some predictors is straightforward as well; let me know if you want some code.


@javier.aguilar should also know :)

[2408.06504] Primed Priors for Simulation-Based Validation of Bayesian Models: the technique we describe here could also help with SBC and prior predictive checks.


Thank you, @mhollanders :) I do not have random effects.
I also found this resource that explains that non-Gaussian models need an adjustment. Based on that, this is what I did to simulate values of the dependent variable:

df <- py$data                              # data frame passed in from Python via reticulate
X <- as.matrix(df[, paste0("x", 1:10)])    # the predictors are already standardized
n_draws <- 10000
n_obs <- nrow(X)                           # n = 66
P <- 10
dirichlet <- TRUE

yrep <- matrix(NA, nrow = n_obs, ncol = n_draws)

for (i in 1:n_draws) {
  # explained variance W
  R2 <- rbeta(1, 1, 1)
  W <- R2 / (1 - R2)

  # intercept and Beta precision (both used below for s2b0)
  b0 <- rnorm(1, -0.5, 1.2)
  scale <- rgamma(1, 1, 1)

  # predictor fractions of variance (simplex)
  theta <- rgamma(1, 2, 2)
  if (dirichlet) {
    phi <- as.vector(rdirichlet(1, rep(1 / theta, P)))  # rdirichlet from e.g. gtools
  } else {
    u1 <- rnorm(P, 0, theta)
    u1 <- u1 - mean(u1)
    phi <- exp(u1) / sum(exp(u1))
  }

  # beta draws, with the pseudo-variance adjustment s2b0
  u <- plogis(b0)                   # mean at the intercept
  var <- u * (1 - u) / (1 + scale)  # Beta outcome variance at the intercept
  s2b0 <- var / (u * (1 - u))^2     # translated to the linear-predictor scale
  scales <- sqrt(W * phi * s2b0)
  beta <- rnorm(P, 0, scales)

  # combine with predictors (draws of mu, the predicted mean)
  eta <- X %*% beta + b0
  yrep[, i] <- plogis(eta)
}

hist(as.vector(yrep), breaks = 100, main = "Predicted probabilities",
     xlab = "yrep", col = "skyblue")



Here is what it looks like:

The prior predictive distribution still puts noticeable mass at extreme values, as it did with the independent weakly informative priors. Extreme values are theoretically possible, but I did not expect them to occur this frequently. I assumed the R2D2 prior would shrink most betas towards zero, or maybe I misunderstood. Or maybe I implemented the extra adjustment step incorrectly. I am thoroughly confused :(


Thank you for sharing this resource! I will look into it :)

Nice. This is why we recommend prior predictive checks.

To stop probability mass piling up at the tails, you need to lower the variance of eta = X * beta + b0. So that means lowering the variance of beta or of b0. But if your X varies too much, then you probably can’t solve this problem with this linear model in the prior unless you reduce the scale of beta to unrealistic levels.
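
For intuition, here’s a quick illustration (not your model, just normal draws for eta at a few scales):

eta_sd <- c(0.5, 1, 3)
sapply(eta_sd, function(s) {
  mu <- plogis(rnorm(1e5, 0, s))   # prior draws pushed through the inverse logit
  mean(mu < 0.05 | mu > 0.95)      # fraction of prior draws in the extremes
})

With sd(eta) around 3 (roughly what 10 standardized predictors with unit-normal coefficients give you), about a third of the prior mass lands below 0.05 or above 0.95; with sd(eta) around 1 it’s well under 1%.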

Having said that, it may not matter. This is like providing a beta(0.1, 0.1) prior. It’s so weak that any amount of data will overtake it.


I also found this resource that explains that non-Gaussian models need an adjustment.

Thank you for sharing this. I never fully understood how to implement the Yanchenko prior for W in my own models, but this blog post seems to suggest you can just rescale the variance by some factor involving the intercept. For the Poisson case, it seems to be the same adjustment as the pseudo-variance suggested by @avehtari in the regularised horseshoe paper.

Can you please explain your scale/var/s2b0 stuff? I assume this is the scaling factor for the Beta regression, but how did you get there?

To expand on what @Bob_Carpenter said, you might try tightening up the prior on R2. If memory serves, I was getting some “overdispersed” prior predictive checks when R2 could end up close to 1. Try rbeta(1, 1, 2) or something like that? Additionally, I haven’t done prior predictive checks on theta; like I said, the typical approach is to supply a known value (I think @javier.aguilar uses 1 or 0.1).
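
For example, here’s a quick check of what the two R2 priors imply for W:

R2_flat  <- rbeta(1e5, 1, 1)   # current prior on R2
R2_tight <- rbeta(1e5, 1, 2)   # suggested alternative
quantile(R2_flat  / (1 - R2_flat),  c(0.5, 0.9, 0.99))
quantile(R2_tight / (1 - R2_tight), c(0.5, 0.9, 0.99))

The upper quantiles of W shrink substantially under Beta(1, 2), which is what tames the coefficient scales.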


@mhollanders Here is the derivation. I have highlighted the variables from the code in purple. Just for clarity, the W in the code is different from the W in the text: the W in the text is the W in the code times s2b0.
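In case the image is hard to read, here is what the code computes, written out as well as I can: with \mu_0 = \text{logit}^{-1}(b_0) and precision \phi (called scale in the code), the outcome variance at the intercept is \text{Var}(y) = \frac{\mu_0 (1 - \mu_0)}{1 + \phi}. Dividing by the squared slope of the inverse-logit link at the intercept, (\mu_0 (1 - \mu_0))^2, translates this to the linear-predictor scale: s^2_{b_0} = \frac{1}{(1 + \phi)\,\mu_0 (1 - \mu_0)}. The coefficient scales are then \sqrt{W \, \phi_j \, s^2_{b_0}}, with \phi_j the simplex components (phi in the code).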


I do not have a math background, thus I may be missing some nuance.

@Bob_Carpenter and @mhollanders Thank you for the advice! This clarifies things conceptually.
There is not much variability in my X, but I will double-check one more time. I will also try tightening the prior on R2 to Beta(1, 2), i.e. rbeta(1, 1, 2). I suppose it will make the W values smaller, since higher R2 values are less favoured, which makes the sd of the betas smaller and thereby reduces the spread of eta?
I will try with a fixed theta too. Does brms default to 1?
I was hoping that a sensitivity analysis across reasonable priors would help assess how problematic any remaining misspecification is, especially if data dominates.

Hey, sorry, I’m not well placed to check your derivation, but thanks for sharing! In brms, you set theta with the cons_D2 argument, but I’m pretty sure it doesn’t allow you to set a prior on it. I also need to reiterate that I haven’t seen others estimate theta, so there may well be problems with it, but it makes sense to me. It becomes more intuitive when formulating the simplex as a zero-sum logistic normal, where it makes sense to estimate the scale.

Re: sensitivity, this all sounds good, but I’m also pretty sure the R2D2 prior is quite good at avoiding overfitting because of its hierarchical construction.


Thank you for the explanation :)
No worries about the derivation part. I tried to understand what brms is doing for this:

prior2 <- c(
  set_prior("R2D2(mean_R2 = 0.333, prec_R2 = 3,autoscale=FALSE,cons_D2=1)", class = "b"),
  set_prior("normal(-0.5,1.2)", class = "Intercept"),
  set_prior("gamma(2, 1)", class = "phi")
)

model2_prior <- brm(
  formula = form,
  data = df,
  prior = prior2,
  sample_prior = "only",
  file = "model2_prior",
  family = Beta()
)

stancode(model2_prior)

This was the output:

 // generated with brms 2.23.0
functions {
  /* compute scale parameters of the R2D2 prior
   * Args:
   *   phi: local weight parameters
   *   tau2: global scale parameter
   * Returns:
   *   scale parameter vector of the R2D2 prior
   */
  vector scales_R2D2(vector phi, real tau2) {
    return sqrt(phi * tau2);
  }

}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> Kc;  // number of population-level effects after centering
  int<lower=1> Kscales;  // number of local scale parameters
  // data for the R2D2 prior
  real<lower=0> R2D2_mean_R2;  // mean of the R2 prior
  real<lower=0> R2D2_prec_R2;  // precision of the R2 prior
  // concentration vector of the D2 prior
  vector<lower=0>[Kscales] R2D2_cons_D2;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] zb;  // unscaled coefficients
  real Intercept;  // temporary intercept for centered predictors
  // parameters of the R2D2 prior
  real<lower=0,upper=1> R2D2_R2;
  simplex[Kscales] R2D2_phi;
  real<lower=0> phi;  // precision parameter
}
transformed parameters {
  vector[Kc] b;  // scaled coefficients
  vector<lower=0>[Kc] sdb;  // SDs of the coefficients
  real R2D2_tau2;  // global R2D2 scale parameter
  vector<lower=0>[Kscales] scales;  // local R2D2 scale parameters
  // prior contributions to the log posterior
  real lprior = 0;
  // compute R2D2 scale parameters
  R2D2_tau2 = R2D2_R2 / (1 - R2D2_R2);
  scales = scales_R2D2(R2D2_phi, R2D2_tau2);
  sdb = scales[(1):(Kc)];
  b = zb .* sdb;  // scale coefficients
  lprior += normal_lpdf(Intercept | -0.5,1.2);
  lprior += beta_lpdf(R2D2_R2 | R2D2_mean_R2 * R2D2_prec_R2, (1 - R2D2_mean_R2) * R2D2_prec_R2);
  lprior += gamma_lpdf(phi | 2, 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept + Xc * b;
    mu = inv_logit(mu);
    target += beta_lpdf(Y | mu .* phi, (1 - mu) .* phi);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(zb);
  target += dirichlet_lpdf(R2D2_phi | R2D2_cons_D2);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

It seems to me like it is treating the equation on the logit scale as just a regular (Gaussian) equation with sigma = 1 (this is just my superficial understanding of the code, so maybe it is doing something different).
I thought that the Taylor approximation and the other methods described by Yanchenko are applied because, in beta regression and other models, the variance is not independent of the mean and the relationship is nonlinear. I am confused about how these two ways of setting the beta priors fit together. Clearly, I am not understanding something fundamental, and I could also have misread the Stan code. I apologize if this is too basic, and also for going in a different direction than my initial question. Any insight would be greatly appreciated.

EDIT by Aki: added code block ticks for Stan code

brms doesn’t currently accommodate different scaling of the variance in GLMs; is that the issue? You’ll have to write it up in Stan yourself, or introduce the appropriate scaling factor in the brms code. I’m not familiar enough with brms to do this, however.


Ahh okay! Yes, indeed, I was wondering whether brms does that or not, but also why it doesn’t account for it the way described in the Yanchenko paper. I thought maybe there was a different rationale for the way brms does it, and I was just curious to understand why. But if I understand you correctly, there’s no specific reason and we need to include it ourselves?

Yes, the Yanchenko paper came out after R2D2 was implemented in brms. There’s an open issue to include the pseudo-variance, which, as I mentioned, was a scaling factor recommended in the regularised horseshoe paper, but currently I believe you have to do it in Stan.


Okay! This clarifies things. Thank you :)