Hi all,
I’m working with a Stan model where I explicitly propagate uncertainty from a previous model by using its posterior draws as inputs. I’m now trying to perform predictor selection among 9 covariates that exhibit some multicollinearity (assessed via variance inflation factors and correlation plots), and I’m looking for a robust and reproducible approach.
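(For concreteness, the collinearity check I mean is along these lines; `covs` below is simulated stand-in data, not my real covariates, and the VIF only depends on the design matrix, so the response in the auxiliary `lm` is arbitrary:)

```r
library(car)  # for vif()

set.seed(1)
# Stand-in for the 9 site-level covariates (32 sites)
covs <- as.data.frame(matrix(rnorm(32 * 9), nrow = 32))

# VIFs from an auxiliary linear model; values above ~5-10 flag collinearity
vif(lm(rnorm(32) ~ ., data = covs))

# Pairwise correlations, as shown in the corr plots
round(cor(covs), 2)
```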
My current model structure is:
- A latent process model for a site-level ‘true quantity’ (on the log scale)
- A measurement model where, for each site (N = 32), I use ~500 posterior draws from a previous model as observations
In Stan, this looks like:
data {
  int<lower=1> N;                      // number of sites (= 32)
  int<lower=1> S;                      // number of metric posterior draws per site (= 500)
  int<lower=1> P;                      // number of predictors (= 9)
  matrix[N, P] X;                      // predictor matrix
  matrix<lower=0>[N, S] metric_draws;  // metric posterior draws
}
parameters {
  real intercept;                      // intercept (log scale)
  vector[P] beta;                      // regression coefficients
  real<lower=0> sigma_proc;            // between-site process SD
  real<lower=0> sigma_meas;            // measurement SD
  vector[N] log_metric_true;           // latent true log(metric)
}
model {
  // Priors
  intercept ~ normal(0, 10);
  beta ~ normal(0, 5);
  sigma_proc ~ normal(0, 5);
  sigma_meas ~ normal(0, 5);

  // Process model
  log_metric_true ~ normal(intercept + X * beta, sigma_proc);

  // Measurement model: each site's draws scatter around its latent value
  for (i in 1:N) {
    metric_draws[i] ~ lognormal(log_metric_true[i], sigma_meas);
  }
}
A key feature here is the measurement model over posterior draws, which allows me to propagate uncertainty from an upstream model.
What I’ve tried
I attempted predictor selection using a regularized horseshoe prior (see below):
- This does shrink coefficients, but selection based on “% of posterior mass away from 0” (e.g. 90%, 75%, 65%) feels arbitrary
- Results are not very stable: rerunning the same model slightly changes the selected predictors and effect sizes, and the difference between cutoff choices is substantial
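(Concretely, the cutoff criterion I used looks roughly like this; `beta_draws` below is a simulated stand-in for the coefficient draws extracted from the fit:)

```r
# Stand-in for a (draws x P) matrix of horseshoe coefficients from the fit
set.seed(1)
beta_draws <- matrix(rnorm(4000 * 9), ncol = 9)

# Fraction of posterior mass on the dominant side of zero, per predictor
p_away <- apply(beta_draws, 2, function(b) max(mean(b > 0), mean(b < 0)))

# Selection at an (arbitrary) cutoff -- this is the step that feels ad hoc
selected <- which(p_away > 0.90)
```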
data {
  int<lower=1> N;
  int<lower=1> S;
  int<lower=1> P;
  matrix[N, P] X;
  matrix<lower=0>[N, S] metric_draws;
}
parameters {
  real intercept;
  vector[P] z_beta;            // standardized coefficients
  vector<lower=0>[P] lambda;   // local shrinkage
  real<lower=0> tau;           // global shrinkage
  real<lower=0> c2;            // slab scale parameter
  real<lower=0> sigma_proc;
  real<lower=0> sigma_meas;
  vector[N] log_metric_true;
}
transformed parameters {
  vector[P] lambda_tilde;
  vector[P] beta;

  // Regularized horseshoe transformation
  lambda_tilde = sqrt(c2 * square(lambda) ./ (c2 + square(tau) * square(lambda)));
  beta = z_beta .* lambda_tilde * tau;
}
model {
  // Priors
  intercept ~ normal(0, 5);
  z_beta ~ normal(0, 1);
  lambda ~ cauchy(0, 1);       // half-Cauchy via the <lower=0> constraint
  tau ~ cauchy(0, 1);          // global shrinkage
  c2 ~ inv_gamma(2, 2);        // slab regularization
  sigma_proc ~ normal(0, 2);
  sigma_meas ~ normal(0, 2);

  // Process model
  log_metric_true ~ normal(intercept + X * beta, sigma_proc);

  // Measurement model
  for (i in 1:N) {
    metric_draws[i] ~ lognormal(log_metric_true[i], sigma_meas);
  }
}
What I’d like to do
I’d like to use projection predictive variable selection via projpred, since it seems a more principled and stable selection method, especially under multicollinearity.
However, projpred requires a reference model fitted with rstanarm or brms (or a custom reference-model object), and I’m unsure whether my model can be translated faithfully into brms, because:
- My likelihood is defined over a matrix of posterior draws per observation
- I explicitly model a latent site-level parameter (log_metric_true)
- The measurement model is central to my workflow
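(For completeness, the long-format data frame `df` passed to `brm()` below is built from the same N x S matrix, roughly like this; the matrices here are simulated stand-ins:)

```r
N <- 32; S <- 500
metric_draws <- matrix(rlnorm(N * S), nrow = N)  # stand-in for the draw matrix
X <- matrix(rnorm(N * 9), nrow = N)              # stand-in for the 9 covariates

# One row per (site, draw); as.vector() is column-major, so with
# rep(..., times = S) the site index varies fastest, matching it
df <- data.frame(
  site        = rep(seq_len(N), times = S),
  metric_draw = as.vector(metric_draws)
)
df <- cbind(df, X[df$site, ])  # attach each site's covariates to its rows
```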
Here’s an attempt at the conversion:
formula_str <- paste(
  "metric_draw ~ 1 +",
  paste(covariate_names, collapse = " + "),
  "+ (1 | site)"
)
form <- as.formula(formula_str)

fit <- brm(
  formula = form,
  data = df,
  family = lognormal(),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 5), class = b),
    prior(normal(0, 5), class = sd),    # analogue of sigma_proc
    prior(normal(0, 5), class = sigma)  # analogue of sigma_meas
  ),
  chains = 4,
  iter = 5000,
  warmup = 2000,
  thin = 12,
  cores = 1
)
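(And the projpred step I’d hope to run on that fit would be a sketch like the following; function names are from projpred 2.x and I haven’t verified this against my setup:)

```r
library(projpred)

# Cross-validated forward search over submodels of the reference fit
vs <- cv_varsel(fit, method = "forward", cv_method = "LOO")

plot(vs, stats = "elpd")       # predictive performance vs. submodel size
nsel <- suggest_size(vs)       # suggested number of predictors
solution_terms(vs)[seq_len(nsel)]  # the selected predictors, in search order
```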
My questions
- Is it possible to express this model in brms in a way that preserves the measurement model over posterior draws?
- If so, is the usual (1 | site) formulation equivalent to explicitly modeling log_metric_true[i]?
- Would using projpred on such a brms approximation still be theoretically sound in this context?
- Alternatively, are there better approaches for predictor selection under multicollinearity in this kind of measurement-error setting?
Any guidance, references, or examples would be greatly appreciated, especially if you’ve dealt with similar measurement-model setups and covariate selection.
Thanks in advance!
Viktor