Using projpred with a hierarchical Stan model with posterior-draw measurement error

Hi all,

I’m working with a Stan model where I explicitly propagate uncertainty from a previous model by using its posterior draws as inputs. I’m now trying to perform predictor selection among 9 covariates that exhibit some multicollinearity (assessed via variance inflation factors and correlation plots), and I’m looking for a robust, reproducible approach.
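For reference, the collinearity check I ran looks roughly like this (a minimal sketch, assuming a data frame `df` holding the 9 covariates plus some stand-in response `y`; VIF depends only on the predictor matrix, not on which response is used):

```r
library(car)        # for vif()

# Ordinary least squares on all 9 covariates; vif() then reports, for each
# predictor, 1 / (1 - R^2) from regressing it on the other predictors.
fit_lm <- lm(y ~ ., data = df)
vif(fit_lm)         # values above ~5-10 are commonly flagged as problematic

# Companion pairwise-correlation check on the covariates only:
round(cor(df[, setdiff(names(df), "y")]), 2)
```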

My current model structure is:

  • A latent process model for a site-level ‘true quantity’ (on the log scale)

  • A measurement model where, for each site (N = 32), I use ~500 posterior draws from a previous model as observations

In Stan, this looks like:

data {
  int<lower=1> N;                // number of sites (= 32)
  int<lower=1> S;                // posterior draws per site (= 500)
  int<lower=1> P;                // number of predictors (= 9)
  matrix[N, P] X;                // predictor matrix
  matrix<lower=0>[N, S] metric_draws; // metric posterior draws
}

parameters {
  real intercept;                // intercept (log-scale)
  vector[P] beta;                // all coefficients
  real<lower=0> sigma_proc;      // between-site process SD
  real<lower=0> sigma_meas;      // measurement SD
  vector[N] log_metric_true;     // latent true log(metric)
}

model {
  // Priors
  intercept ~ normal(0, 10);
  beta ~ normal(0, 5);      
  sigma_proc ~ normal(0, 5);
  sigma_meas ~ normal(0, 5);

  // Process model
  log_metric_true ~ normal(intercept + X * beta, sigma_proc);

  // Measurement model
  for (i in 1:N) {
    metric_draws[i] ~ lognormal(log_metric_true[i], sigma_meas);
  }
}

A key feature here is the measurement model over posterior draws, which allows me to propagate uncertainty from an upstream model.

What I’ve tried

I attempted predictor selection using a regularized horseshoe prior (see below):

  • This does shrink coefficients, but selecting based on the share of posterior mass away from 0 (e.g. 90%, 75%, 65%) feels arbitrary

  • Results are also not very stable: rerunning the same model slightly changes the selected predictors and effect sizes, and the difference between cutoff choices is substantial.
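To make the cutoff rule concrete, this is roughly how I compute it (a sketch with hypothetical object names: `draws_beta` is a draws-by-P matrix of posterior draws of beta, `covariate_names` the predictor names):

```r
# For each coefficient, the fraction of posterior mass on its dominant
# side of zero ("% of posterior away from 0"):
prob_away_from_zero <- apply(
  draws_beta, 2,
  function(b) max(mean(b > 0), mean(b < 0))
)

# Keep predictors exceeding the cutoff; 0.90 here is exactly the
# arbitrary threshold choice described above (vs. 0.75, 0.65, ...).
selected <- covariate_names[prob_away_from_zero > 0.90]
```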

data {
  int<lower=1> N;                
  int<lower=1> S;                
  int<lower=1> P;                
  matrix[N, P] X;                
  matrix<lower=0>[N, S] metric_draws; 
}

parameters {
  real intercept;                
  
  vector[P] z_beta;              // standardized coefficients
  vector<lower=0>[P] lambda;     // local shrinkage
  real<lower=0> tau;             // global shrinkage
  real<lower=0> c2;              // slab scale parameter
  
  real<lower=0> sigma_proc;      
  real<lower=0> sigma_meas;      
  vector[N] log_metric_true;     
}

transformed parameters {
  vector[P] beta;
  vector[P] lambda_tilde;
  
  // Regularized horseshoe transformation
  lambda_tilde = sqrt(c2 * square(lambda) ./ (c2 + square(tau) * square(lambda)));
  
  beta = z_beta .* lambda_tilde * tau;
}

model {
  // Priors
  intercept ~ normal(0, 5);
  
  z_beta ~ normal(0, 1);
  lambda ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);       // global shrinkage
  c2 ~ inv_gamma(2, 2);     // slab regularization
  
  sigma_proc ~ normal(0, 2);
  sigma_meas ~ normal(0, 2);

  // Process model
  log_metric_true ~ normal(intercept + X * beta, sigma_proc);

  // Measurement model
  for (i in 1:N) {
    metric_draws[i] ~ lognormal(log_metric_true[i], sigma_meas);
  }
}

What I’d like to do

I’d like to use projection predictive variable selection via projpred, since it appears to be a more principled and stable selection method, especially under multicollinearity.

However, projpred seems to expect a reference model fitted via rstanarm or brms, and I’m unsure whether my model can be translated faithfully, because:

  • My likelihood is defined over a matrix of posterior draws per observation

  • I explicitly model a latent site-level parameter (log_metric_true)

  • The measurement model is central to my workflow

Here’s an attempt at the conversion:

formula_str <- paste(
  "metric_draw ~ 1 +",
  paste(covariate_names, collapse = " + "),
  "+ (1 | site)"
)

form <- as.formula(formula_str)

fit <- brm(
  formula = form,
  data = df,
  family = lognormal(),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 5), class = b),
    prior(normal(0, 5), class = sd),     # sigma_proc
    prior(normal(0, 5), class = sigma)   # sigma_meas
  ),
  chains = 4,
  iter = 5000,
  warmup = 2000,
  thin = 12,
  cores = 1
)

My questions

  1. Is it possible to express this model in brms in a way that preserves the measurement model over posterior draws?

  2. If so, is the usual (1 | site) formulation equivalent to explicitly modeling log_metric_true[i]?

  3. Would using projpred on such a brms approximation still be theoretically sound in this context?

  4. Alternatively, are there better approaches for predictor selection under multicollinearity in this kind of measurement-error setting?

Any guidance, references, or examples would be greatly appreciated, especially if you’ve dealt with a similar combination of measurement models and covariate selection.

Thanks in advance!

Viktor