Pseudo-variance using intercepts in shrinkage priors

Hi everyone,

I am once again confused about implementing the pseudo-variance to scale shrinkage priors. When using R2D2 priors, the proportion of explained variance R2 is converted to the explained variance tau2 = R2 / (1 - R2) which, in Gaussian models, is multiplied by the residual variance to get the scales of the coefficients and/or random effects, e.g.:

data {
  int N, P;  // number observations and predictors
  matrix[N, P] X;  // predictors
  array[N] real y;  // observations
}
parameters {
  real alpha;  // intercept
  real<lower=0> sigma;  // residual SD
  real<lower=0, upper=1> R2;  // proportion explained variance
  simplex[P] phi;  // variance partitions
  vector[P] z;  // z-scores
}
transformed parameters {
  vector[P] scales = sqrt(R2 / (1 - R2) * square(sigma)),
            beta = scales .* z;
}
model {
  alpha ~ std_normal();
  sigma ~ exponential(1);
  z ~ std_normal();
  y ~ normal(alpha + X * beta, sigma);
}

Piironen and Vehtari (2017) suggest using the pseudo-variance in non-Gaussian models, which they define in Table 1. For Poisson models where y \sim \mathrm{Poisson} (\mu), the pseudo-variance is \mu^{-1}. In practice we don’t know \mu so they suggest using the sample mean.
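As I understand Table 1, the pseudo-variance comes from mapping the observation variance onto the linear-predictor scale via a delta-method argument (my reading, not their exact wording):

\tilde{\sigma}^2 = \frac{\mathrm{Var}(y)}{(\mathrm{d}\mu / \mathrm{d}\eta)^2}

For Poisson with a log link, \mu = e^\eta gives \mathrm{d}\mu/\mathrm{d}\eta = \mu and \mathrm{Var}(y) = \mu, so \tilde{\sigma}^2 = \mu / \mu^2 = \mu^{-1}; for Bernoulli with a logit link, \mathrm{d}\mu/\mathrm{d}\eta = \mathrm{Var}(y) = \mu (1 - \mu), so \tilde{\sigma}^2 = \mu^{-1} (1 - \mu)^{-1}.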

Alternatively, I thought we could use the intercept to get the pseudo-variance. The above model as a Poisson model could then be:

data {
  int N, P;  // number observations and predictors
  matrix[N, P] X;  // predictors
  array[N] int<lower=0> y;  // observations
}
parameters {
  real alpha;  // intercept
  real<lower=0, upper=1> R2;  // proportion explained variance
  simplex[P] phi;  // variance partitions
  vector[P] z;  // z-scores
}
transformed parameters {
  vector[P] scales = sqrt(R2 / (1 - R2) * exp(-alpha)),
            beta = scales .* z;
}
model {
  alpha ~ std_normal();
  z ~ std_normal();
  y ~ poisson_log(alpha + X * beta);
}

I am mostly looking for feedback on this approach. When the baseline rate \exp(\alpha) gets really small, the pseudo-variance \exp(-\alpha) gets huge, and I'm not sure if that's desired. The same thing happens for Bernoulli models, where the pseudo-variance is \mu^{-1} (1 - \mu)^{-1}. Alternatively, Yanchenko et al. (2025) suggest a different approach, where they use the sample means together with the Generalised Beta Prime distribution in a way that I don't really understand. I've been using R2D2 priors for a while now, but this continues to be a stumbling block, so I'd love to sort it out.
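To put some of my own toy numbers on that blow-up: with \alpha = -3, the pseudo-variance is e^{-\alpha} = e^{3} \approx 20, and with \alpha = -6 it is e^{6} \approx 403, so the coefficient scales grow like e^{-\alpha/2} as the baseline rate shrinks.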

thanks!

Matt


Hi, at the moment we don't have a better answer than what Piironen and Vehtari (2017) and Yanchenko et al. (2025) provide. We are looking into this, but due to summer vacations, it may take some time before we have a better answer.


Hi @mhollanders, as you may remember from other threads, I'm also trying to work with R2D2 and ordinal models. Looking at your models above, I note you declared a phi variable but didn't do anything with it! Based on comments from @avehtari and @davkoh in the other thread, I have developed some Bernoulli regression code that uses a pseudo-variance based on the data, and it seems to work (there's a reason I say "seems", which I'll explain later).

The model:


data {
    int n;              // number of observations
    int nX;             // number of X variables

    matrix[n, nX] X;    // X variable data
    array[n] int y;     // binary outcome variable

    // data for the psR2D2 prior
    real<lower=0> psR2D2_mean;         // mean of the R2 prior
    real<lower=0> psR2D2_prec;         // precision of the R2 prior
    vector<lower=0>[nX] psR2D2_cons_D2;    // concentration vector of the D2 prior

}

transformed data {
    //Center X
    matrix[n, nX] Xc;  // centered version of X without an intercept
    vector[nX] means_X;  // column means of X before centering
    vector[nX] sds_X;  // column SDs of X
    for (i in 1:nX) {
        means_X[i] = mean(X[, i]);
        sds_X[i] = sd(X[, i]);
        Xc[, i] = X[, i] - means_X[i];
    }
    
    // Calculate pseudovariance from data
    real pseudovar = (1/mean(y)) * (1/(1 - mean(y)));
}

parameters {
    real Intcpt;         // regression intercept
    vector[nX] beta;     // X variable beta parameters

    // parameters of the psR2D2 prior
    real<lower=0, upper=1> psR2D2_R2;    // pseudoR2 value
    simplex[nX] psR2D2_phi;              // pseudoR2 partition proportions
}

transformed parameters {
    vector<lower=0>[nX] sd_b;  // SDs of the beta coefficients

    real psR2D2_tau2;  // global psR2D2 scale parameter

    // compute psR2D2 scale parameters
    psR2D2_tau2 = pseudovar * psR2D2_R2 / (1 - psR2D2_R2);
    // compute scales
    vector<lower=0>[nX] scales = sqrt(psR2D2_phi * psR2D2_tau2);
    // calculate beta SDs from the scales and the X column SDs
    sd_b = scales ./ sds_X;
}

model {

    // Likelihood
    target += bernoulli_logit_glm_lpmf(y | Xc, Intcpt, beta);

    // priors including constants
    target += normal_lpdf(beta | 0, sd_b);    // prior on betas
    target += student_t_lpdf(Intcpt | 3, 0, 1); //prior on intercept
    // prior on phi partition variable
    target += dirichlet_lpdf(psR2D2_phi | psR2D2_cons_D2);  
    // prior on pseudo R2
    target += beta_lpdf(psR2D2_R2 | psR2D2_mean * psR2D2_prec, (1 - psR2D2_mean) * psR2D2_prec);
}

Simulate some data and fit this model:

library(rstan)
library(posterior)

n <- 200
nX <- 5
Intcpt <- -0.1
beta <- c(-0.388, -0.185, -0.223,  0.477, -0.200)
sigma <- 1 # to add small amount of noise


# Simulate random X variable(s)
X <- sapply(1:nX, function(i) rnorm(n = n))
colnames(X) <- paste0("x", 1:nX)

# Combine to generate a y/ystar variable
ystar <- Intcpt + (X %*% beta) + rnorm(n, 0, sigma)

# convert ystar to probability
p <- 1/(1 + exp(-ystar))

# generate binomial outcome var
y <- rbinom(n = n, size = 1 , prob = p)
table(y)

#### Make a stan data list
data1 <- list(n = n,
              nX = nX,
              X = X,
              y = y)

GPfit <- stan(file='logreg_R2D2_A.stan',
              data = c(data1,
                       list(psR2D2_mean = 0.6, psR2D2_prec = 5, psR2D2_cons_D2 = rep(1, nX))),
              chains=4, seed=49485, iter = 2000, cores=4)

This fits without divergences or treedepth warnings and gives reasonable-looking pairs plots.

And it does an OK job recovering the true parameters (with the exception of beta[2] for some reason; some quick checks show recovery doesn't seem to improve with increased n, but it does improve with larger effect sizes):

draws <- as_draws_df(GPfit)
true <- c(Intcpt, beta)
bayesplot::mcmc_recover_hist(draws[, 1:6], true)

Now… why did I say "seems to work" above? According to Paul Allison (https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf), R2 for logistic regression has a maximum value.

So I didn't consciously set out to make a Cox and Snell R2, but, since he states that "the usual R2 for linear regression depends on the likelihoods for the models with and without predictors by precisely this formula", I think we have a Cox and Snell model anyway. Since he says it has a maximum value of 0.75 when p = 0.5, it seems problematic that my pairs plot above shows values up to 1.0. Indeed, I have seen circumstances where divergences concentrate at high R2 values with this type of model (though I don't understand this well enough to make a reprex). I have tried limiting the maximum value of R2 as per Allison's formula using a four-parameter beta prior, but got mixed results that I don't understand. (Perhaps this is related to @eyanchenko's Generalised Beta Prime distribution that you mentioned, but I also struggle to understand that work.)

This is as far as my explorations have gotten, but I hope it is useful to someone. I'd be very curious to hear any questions/comments/thoughts/critiques/corrections!


Hey,

Not placing an explicit prior on phi just gives a uniform prior over the simplex, due to the simplex constraint.
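To make that explicit, the implicit flat prior is equivalent to writing the symmetric Dirichlet out (a sketch, using the P from my model above):

model {
  phi ~ dirichlet(rep_vector(1.0, P));  // uniform over the simplex; same as no sampling statement
}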

I think using sample means for the pseudo-variance is problematic for more complicated models, such as the zero-inflated models that are common in ecology. FWIW, when I've simulated complex occupancy models with random effects, fixed effects, etc., with R2D2 priors using the pseudo-variance for occupancy (zero-inflation) and detection rates (Poisson), I've never had any issues recovering parameters, as determined by SBC. However, I'm mostly interested in knowing whether using the intercept for the pseudo-variance is a principled approach.

Hi Matt. I didn't mean not placing a prior on phi (although this may also be something worth discussing). I mean you didn't use it in any of the calculations: it's just a free parameter in the model, not related to the data or other parameters. To make it part of the R2D2 model you should multiply it by tau2, as in the scale definitions of the R2D2M2 paper:

So perhaps in your code you should do something like:

transformed parameters {
    vector[P] scales = sqrt( phi  * (R2 / (1 - R2) * square(sigma)));      // sqrt( phi * tau2 )
    vector[P] beta = scales .* z;
}

I haven't thought much about the models you are working with. About the intercept: it seems to me that it may work in terms of parameter recovery, but it is not truly a pseudo-variance, and so your model is not truly a pseudo-R2 prior model. Perhaps you could call it a pseudo-R2-like model? But I think that may also be true for my model above, given the R2 values > 0.75 issue.


Pseudo-variance is always an approximation. Sometimes approximations are good enough. So the question is whether it is principled to use something that is good enough or revealed by diagnostics to be not good enough.

I think the general idea of using decomposition priors has strong justification. Then we have different degrees of approximations on how to implement the decomposition, and that clearly needs more research for non-normal observation models. We’re working on it, but it takes time.
