Using projpred with a hierarchical Stan model with posterior-draw measurement error

Hi all,

I’m working with a Stan model where I explicitly propagate uncertainty from a previous model by using posterior draws as inputs. I’m now trying to perform predictor selection among 9 covariates that exhibit some multicollinearity (assessed via variance inflation factor + corr plots), and I’m looking for a robust and reproducible approach.

My current model structure is:

  • A latent process model for a site-level ‘true quantity’ (on the log scale)

  • A measurement model where, for each site (N = 32), I use ~500 posterior draws from a previous model as observations

In Stan, this looks like:

data {
  int<lower=1> N;                // number of sites --> = 32
  int<lower=1> S;                // number of metric posterior draws per site (set to 500)
  int<lower=1> P;                // number of predictors (= 9)
  matrix[N, P] X;                // predictor matrix
  matrix<lower=0>[N, S] metric_draws; // metric posterior draws
}

parameters {
  real intercept;                // intercept (log-scale)
  vector[P] beta;                // all coefficients
  real<lower=0> sigma_proc;      // between-site process SD
  real<lower=0> sigma_meas;      // measurement SD
  vector[N] log_metric_true;     // latent true log(metric)
}

model {
  // Priors
  intercept ~ normal(0, 10);
  beta ~ normal(0, 5);      
  sigma_proc ~ normal(0, 5);
  sigma_meas ~ normal(0, 5);

  // Process model
  log_metric_true ~ normal(intercept + X * beta, sigma_proc);

  // Measurement model
  for (i in 1:N) {
    metric_draws[i] ~ lognormal(log_metric_true[i], sigma_meas);
  }
}

A key feature here is the measurement model over posterior draws, which allows me to propagate uncertainty from an upstream model.

What I’ve tried

I attempted predictor selection using a regularized horseshoe prior (see below):

  • This does shrink coefficients, but selection based on “% of posterior away from 0” (e.g. 90%, 75%, 65%) feels arbitrary

  • + Results are not very stable: rerunning the same model slightly changes selected predictors and effect sizes - and the difference between cutoff choices is substantial.

data {
  int<lower=1> N;                
  int<lower=1> S;                
  int<lower=1> P;                
  matrix[N, P] X;                
  matrix<lower=0>[N, S] metric_draws; 
}

parameters {
  real intercept;                
  
  vector[P] z_beta;              // standardized coefficients
  vector<lower=0>[P] lambda;     // local shrinkage
  real<lower=0> tau;             // global shrinkage
  real<lower=0> c2;              // slab scale parameter
  
  real<lower=0> sigma_proc;      
  real<lower=0> sigma_meas;      
  vector[N] log_metric_true;     
}

transformed parameters {
  vector[P] beta;
  vector[P] lambda_tilde;
  
  // Regularized horseshoe transformation
  lambda_tilde = sqrt(c2 * square(lambda) ./ (c2 + square(tau) * square(lambda)));
  
  beta = z_beta .* lambda_tilde * tau;
}

model {
  // Priors
  intercept ~ normal(0, 5);
  
  z_beta ~ normal(0, 1);
  lambda ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);       // global shrinkage
  c2 ~ inv_gamma(2, 2);     // slab regularization
  
  sigma_proc ~ normal(0, 2);
  sigma_meas ~ normal(0, 2);

  // Process model
  log_metric_true ~ normal(intercept + X * beta, sigma_proc);

  // Measurement model
  for (i in 1:N) {
    metric_draws[i] ~ lognormal(log_metric_true[i], sigma_meas);
  }
}

What I’d like to do

I’d like to use projection predictive variable selection via projpred, as it seems a more principled and stable predictor selection method - especially under multicollinearity.

However, projpred requires a model fitted via brms, and I’m unsure whether my model can be translated faithfully, because:

  • My likelihood is defined over a matrix of posterior draws per observation

  • I explicitly model a latent site-level parameter (log_metric_true)

  • The measurement model is central to my workflow

Here’s an attempt at the conversion:

formula_str <- paste(
  "metric_draw ~ 1 +",
  paste(covariate_names, collapse = " + "),
  "+ (1 | site)"
)

form <- as.formula(formula_str)

fit <- brm(
  formula = form,
  data = df,
  family = lognormal(),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 5), class = b),
    prior(normal(0, 5), class = sd),     # sigma_proc
    prior(normal(0, 5), class = sigma)   # sigma_meas
  ),
  chains = 4,
  iter = 5000,
  warmup = 2000,
  thin = 12,
  cores = 1
)

My questions

  1. Is it possible to express this model in brms in a way that preserves the measurement model over posterior draws?

  2. If so, is the usual (1 | site) formulation equivalent to explicitly modeling log_metric_true[i]?

  3. Would using projpred on such a brms approximation still be theoretically sound in this context?

  4. Alternatively, are there better approaches for predictor selection under multicollinearity in this kind of measurement-error setting?

Any guidance, references, or examples would be greatly appreciated - especially if you’ve dealt with similar measurement model setups + covariate selection.

Thanks in advance!

Viktor

Hi @ViktorVdV,

I’m not familiar with measurement-error models, but perhaps this rather general advice helps: You write

but this is not entirely correct. With init_refmodel(), projpred allows to create custom reference model objects, which do not need to be based on a brms or rstanarm fit.

As @fweber144 said you don’t need brms. Measurement-error with posterior draws is close to missing data and latent modeling, and this was dicussed here Implementing projection predictive feature selection for custom model with `projpred` - #6 by avehtari, with the conclusion that it is possible but the projpred package doesn’t provide direct support for that and it would be major task to add that support. You could do it by making your own init_refmodel()

Hi all,

Thanks for these tips!

I have tried out this method in the meantime. As it is not 100% clear to me how to arrange the init_refmodel, I would like to ask if the following code is the best way to implement it.

data {
  int<lower=1> N;                // number of sites
  int<lower=1> S;                // number of metric posterior draws per site
  int<lower=1> P;                // number of predictors

  matrix[N, P] X;                // environmental predictors matrix
  matrix<lower=0>[N, S] metric_draws;
}

parameters {
  real intercept;
  vector[P] beta;

  real<lower=0> sigma_proc;
  real<lower=0> sigma_meas;

  vector[N] log_metric_true;
}

model {

  // Priors
  intercept ~ normal(0, 5);
  beta ~ normal(0, 2);

  sigma_proc ~ exponential(1);
  sigma_meas ~ exponential(1);

  // Latent process
  log_metric_true ~ normal(intercept + X * beta,
                           sigma_proc);

  // Measurement model
  for (i in 1:N) {
    metric_draws[i] ~ lognormal(log_metric_true[i],
                                sigma_meas);
  }
}

Then in R I can fit the reference model:

# Example:
# X_df <- data.frame(x1 = ..., x2 = ..., ...)
# metric_draws_mat <- ...

X_mat <- as.matrix(X_df)

N <- nrow(X_mat)
P <- ncol(X_mat)
S <- ncol(metric_draws_mat)

stan_data <- list(
  N = N,
  S = S,
  P = P,
  X = X_mat,
  metric_draws = metric_draws_mat
)

# Compile Stan model

mod <- cmdstan_model("measurement_error_model.stan")


# Fit reference model

fit <- mod$sample(
  data = stan_data,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  adapt_delta = 0.95,
  max_treedepth = 12
  # seed = 123
)

Then, extract latent draws of what I call “log_metric_true”:

post_draws <- fit$draws()

# latent true values
latent_array <- as_draws_matrix(post_draws,
                                variable = "log_metric_true")

# sigma_proc posterior draws
sigma_proc_draws <- as.numeric(
  as_draws_df(post_draws)$sigma_proc
)

# Dimensions:
# draws x N

dim(latent_array)

Then build projpred custom reference model and run projection predictive variable selection:

# Create dataframe for projpred

proj_data <- X_df

# projpred requires a response variable
# we use the posterior mean latent truth

proj_data$y <- colMeans(latent_array)

# Reference prediction function

my_ref_predfun <- function(fit, newdata = NULL) {

  latent_array <- as_draws_matrix(
    fit$draws(),
    variable = "log_metric_true"
  )

  # projpred expects N x draws
  t(latent_array)
}


# Create custom reference model

refmodel <- init_refmodel(

  object = fit,

  data = proj_data,

  formula = as.formula(
    paste(
      "y ~",
      paste(colnames(X_df), collapse = " + ")
    )
  ),

  family = gaussian(),

  ref_predfun = my_ref_predfun,

  dis = sigma_proc_draws
)

vs_forward <- cv_varsel(
  refmodel,
  method = "forward",
  cv_method = "LOO",
  verbose = TRUE
)
vs_l1 <- cv_varsel(
  refmodel,
  method = "L1",
  cv_method = "LOO",
  verbose = TRUE
)

# Inspect results

plot(vs_forward)

suggest_size(vs_forward)

ranking(vs_forward)

selected_terms <- head(
  ranking(vs_forward)$fulldata,
  suggest_size(vs_forward)
)

selected_terms

Refit reduced model:

selected_cols <- selected_terms

X_reduced <- X_df[, selected_cols, drop = FALSE]

stan_data_reduced <- list(
  N = nrow(X_reduced),
  S = ncol(metric_draws_mat),
  P = ncol(X_reduced),
  X = as.matrix(X_reduced),
  metric_draws = metric_draws_mat,
  n_fixed = 0,
  fixed_idx = array(1, dim = 0)
)

fit_reduced <- mod$sample(
  data = stan_data_reduced,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  adapt_delta = 0.95,
  max_treedepth = 12,
  seed = 456
)

My understanding is:

  • the measurement-error uncertainty is propagated into the posterior draws of log_metric_true,

  • projection is then performed conditional on those latent draws,

  • and this is therefore conceptually similar to a latent-data / multiple-imputation approximation discussed in earlier threads.

My main question is:

Is this currently the most principled practical way to use projpred with a latent measurement-error model like this, or am I missing something important in how init_refmodel() should be constructed?

In particular, I am unsure whether:

  • using log_metric_true as the projection target is appropriate,

  • using sigma_proc as dis is correct (and I have to forget about sigma_meas?)

  • and whether the dummy response proj_data$y is being used in a statistically coherent way here.

Thanks very much in advance for your clarifications.

Viktor