Pseudo-variance using intercepts in shrinkage priors

Hi everyone,

I am once again confused about implementing the pseudo-variance used to scale shrinkage priors. With R2D2 priors, the proportion of explained variance R2 is converted to a signal-to-noise ratio tau2 = R2 / (1 - R2), which, in Gaussian models, is multiplied by the residual variance to get the scales of the coefficients and/or random effects, e.g.:

data {
  int N, P;  // number observations and predictors
  matrix[N, P] X;  // predictors
  array[N] real y;  // observations
}
parameters {
  real alpha;  // intercept
  real<lower=0> sigma;  // residual SD
  real<lower=0, upper=1> R2;  // proportion explained variance
  simplex[P] phi;  // variance partitions
  vector[P] z;  // z-scores
}
transformed parameters {
  // per-coefficient scales: tau * sigma, partitioned across coefficients by phi
  vector[P] scales = sqrt(R2 / (1 - R2) * square(sigma) * phi);
  vector[P] beta = scales .* z;
}
model {
  alpha ~ std_normal();
  sigma ~ exponential(1);
  R2 ~ beta(1, 1);  // or a more informative beta prior on explained variance
  phi ~ dirichlet(rep_vector(1, P));
  z ~ std_normal();
  y ~ normal(alpha + X * beta, sigma);
}
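
(For context on the conversion: as I understand the R2D2 construction, with standardised predictors the coefficient variances \sigma^2 \tau^2 \phi_j sum to \mathrm{Var}(X \beta) = \sigma^2 \tau^2, so

R^2 = \frac{\sigma^2 \tau^2}{\sigma^2 \tau^2 + \sigma^2} = \frac{\tau^2}{1 + \tau^2}
\quad \Longleftrightarrow \quad
\tau^2 = \frac{R^2}{1 - R^2},

hence the tau2 = R2 / (1 - R2) line above.)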

Piironen and Vehtari (2017) suggest using a pseudo-variance in non-Gaussian models, which they define in their Table 1. For Poisson models where y \sim \mathrm{Poisson}(\mu), the pseudo-variance is \mu^{-1}. In practice we don't know \mu, so they suggest plugging in the sample mean \bar{y}.
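
For concreteness, here is how I read that suggestion as a full model, with the pseudo-variance computed once in transformed data and replacing square(sigma) above (the beta and dirichlet priors are just flat placeholders):

data {
  int N, P;  // number observations and predictors
  matrix[N, P] X;  // predictors
  array[N] int<lower=0> y;  // observations
}
transformed data {
  // plug-in pseudo-variance 1 / mu, approximating mu by the sample mean of y
  real pseudo_var = inv(mean(to_vector(y)));
}
parameters {
  real alpha;  // intercept
  real<lower=0, upper=1> R2;  // proportion explained variance
  simplex[P] phi;  // variance partitions
  vector[P] z;  // z-scores
}
transformed parameters {
  vector[P] scales = sqrt(R2 / (1 - R2) * pseudo_var * phi);
  vector[P] beta = scales .* z;
}
model {
  alpha ~ std_normal();
  R2 ~ beta(1, 1);
  phi ~ dirichlet(rep_vector(1, P));
  z ~ std_normal();
  y ~ poisson_log(alpha + X * beta);
}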

Alternatively, I thought we could use the intercept to get the pseudo-variance, so the Poisson version of the model above could be:

data {
  int N, P;  // number observations and predictors
  matrix[N, P] X;  // predictors
  array[N] int<lower=0> y;  // observations
}
parameters {
  real alpha;  // intercept
  real<lower=0, upper=1> R2;  // proportion explained variance
  simplex[P] phi;  // variance partitions
  vector[P] z;  // z-scores
}
transformed parameters {
  // pseudo-variance exp(-alpha) = 1 / mu at the baseline rate mu = exp(alpha)
  vector[P] scales = sqrt(R2 / (1 - R2) * exp(-alpha) * phi);
  vector[P] beta = scales .* z;
}
model {
  alpha ~ std_normal();
  R2 ~ beta(1, 1);
  phi ~ dirichlet(rep_vector(1, P));
  z ~ std_normal();
  y ~ poisson_log(alpha + X * beta);
}

I am mostly looking for feedback on this approach. When the baseline rate \exp(\alpha) gets really small, the pseudo-variance \exp(-\alpha) gets huge (e.g. \alpha = -5 gives \exp(-\alpha) \approx 148), and I'm not sure whether that's desirable. The same thing happens for Bernoulli models, where the pseudo-variance is \mu^{-1} (1 - \mu)^{-1} and blows up as \mu approaches 0 or 1 (a sketch is below). Alternatively, Yanchenko et al. (2025) suggest a different approach, where they use the sample means together with the Generalised Beta Prime distribution in a way I don't really understand. I've been using R2D2 priors for a while now, but this continues to be a stumbling block, so I'd love to sort it out.
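
Here is the Bernoulli sketch I mean, with the same parameters block as the Poisson model above; this is my guess rather than anything from the paper:

transformed parameters {
  // baseline success probability at the intercept
  real mu = inv_logit(alpha);
  // pseudo-variance 1 / (mu * (1 - mu)), cf. Table 1 of Piironen and Vehtari
  vector[P] scales = sqrt(R2 / (1 - R2) * inv(mu * (1 - mu)) * phi);
  vector[P] beta = scales .* z;
}

with y ~ bernoulli_logit(alpha + X * beta) in the model block.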

thanks!

Matt