Accounting for measurement error during variable selection with projpred (possibly with rstanarm or brms?)

Abstract
Is there a way to use projpred while accounting for measurement uncertainty in the dependent (i.e., target) variable?

Body
Hi,
I have a fairly simple linear model I'm looking to do some variable selection on. One issue, though, is that my dependent variable exhibits measurement error that is heteroskedastic with respect to another variable, w. That is, my "observed" dependent variable is conditionally distributed around the "true" value, with a scale that depends on w, such that: y_obs ~ normal(y_true, b0 + b1 / w).

My initial desire was to take advantage of the built-in compatibility between projpred and rstanarm or possibly brms. However, I do not see an obvious way of accounting for measurement error in any of those packages. But perhaps I'm missing some functionality, or don't know which keywords to use to search the documentation properly.

I was then looking into the custom reference model functionality provided by projpred::init_refmodel. This looked promising at first, but upon closer inspection it appears that the argument y is static (i.e., the target variable is assumed to be perfectly measured or known). Am I interpreting this argument correctly?

Anyone know of a way of addressing measurement uncertainty in the context of projpred?


init_refmodel also has the argument dis, which for a Gaussian observation model is the noise standard deviation sigma (https://mc-stan.org/projpred/reference/init_refmodel.html). For non-Gaussian models the uncertainty depends on the location parameter (e.g., binomial and Poisson), and some non-Gaussian families also have an additional dispersion parameter (e.g., negative binomial). Theoretically, handling heteroscedastic noise in a Gaussian model is trivial, but we haven't run such experiments with projpred, so I'm not certain whether dis can accept posterior draws of a per-observation sigma vector. If it doesn't accept that now, the change in the code shouldn't be big. As projpred is being refactored right now, I ping @AlejandroCatalina and @paul.buerkner to comment on whether heteroscedastic dispersion has been considered in the new code.
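For the homoscedastic case that is currently supported, the usage is something like this (a minimal sketch; extractor stands for a user-supplied extract_model_data function as described in the init_refmodel documentation):

sigma_draws <- as.matrix(fit)[, "sigma"]  # one sigma per posterior draw
ref <- projpred::init_refmodel(
  fit, data = dat, formula = y ~ x1 + x2, family = gaussian(),
  extract_model_data = extractor,  # user-supplied, see the projpred docs
  dis = sigma_draws                # one value per draw, shared by all observations
)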


Adding to what Aki said: we indeed recover the sigma vector from the draws in the stanfit (or brms/rstanarm) object as the dispersion (so this can be whatever your model defines). So if you are passing a custom model, make sure to include a sigma field. If you are using rstanarm or brms, this should be automatic.

My understanding of dis (and the sigma field) is that they relate to the likelihood of the model after accounting for measurement uncertainty, so they are not the measurement uncertainty itself. In that context I've been following the strategy laid out in the Stan User's Guide (specifically, the second block of code in the measurement error example).

Using the notation from that example, there would be both a tau (the heteroskedastic part of the model; tau = b0 + b1 / w) and a sigma parameter, in which case I'm not sure how to pass tau to init_refmodel. Or am I misunderstanding something?
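As far as I can tell, the closest one gets in brms alone is a distributional model for the residual standard deviation (only a sketch with hypothetical predictors x1 and x2; note that brms models log(sigma) in distributional formulas, so this is not literally tau = b0 + b1 / w, and it does not separate tau from sigma):

fit_het <- brm(
  bf(y ~ x1 + x2, sigma ~ I(1 / w)),  # log(sigma) as a linear function of 1/w
  data = dat, family = gaussian()
)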

In any case, thank you both very much for your responses – very helpful.

I am facing a similar problem. I have described my dataset in detail elsewhere.

To construct my reference model, I performed SPCA using all of my 164,906 predictors. To account for measurement error in the dependent variable, I'm following the strategy from this book. The measurement error is not related to any of the predictors.

I used the principal components as predictors and I fit the model using brms as:

fit <- brm(y | resp_se(se_y, sigma = TRUE) ~ ., data = data.cv, family = gaussian(), prior = priors_me)

In projpred, I use init_refmodel for my custom reference model and I define dis, which, in my understanding, is what was discussed above. However, in the formula argument of init_refmodel, "function calls on the right-side of the | are currently not allowed in projpred" (from the manual), so I cannot use the same formula as in brms. I construct the reference model as:

ref.spca <- init_refmodel(fit, data = data.full, formula = y ~ ., family = gaussian(), extract_model_data = extractor_cust, dis = as.matrix(fit)[, "sigma"])

The performance of the reference model, when fit with resp_se(), is worse than that of the submodels:
[plot: predictive performance of the submodels vs. the reference model fit with resp_se()]

However, this is not the case when I follow exactly the same procedure but fit the reference model without resp_se():
[plot: the same comparison, with the reference model fit without resp_se()]

fit <- brm(y ~ ., data = data.cv, family = gaussian(), prior = priors_me)

The search procedure is always the same:

cvvsfast <- projpred::cv_varsel(ref.spca, method = "L1", nclusters_pred = 100, validate_search = FALSE, nterms_max = 100, search_terms = NULL, ndraws = 400)

Is my understanding correct? How can I properly account for measurement error on the dependent variable during the procedure?

Ping @fweber144

Currently, projpred requires the dispersion parameter to be the same for all observations. I have added a feature request for making this more flexible: Allow dispersion parameter to be observation-specific · Issue #481 · stan-dev/projpred · GitHub. As mentioned there, I guess we will need special submodel fitters for this (or at least adapt the currently available ones), so that the observation-specific dispersion parameter values are taken into account when performing the projection.
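Conceptually, for the Gaussian case with known observation-specific standard deviations, the projection would reduce to weighted least squares with weights 1 / sigma_i^2. A rough illustration only (not projpred code; mu_ref denotes the reference model's expected linear predictor and sigma_i the hypothetical observation-specific dispersions, both assumed to be columns of dat):

# illustration: project mu_ref onto a submodel with predictors x1, x2,
# weighting each observation by its inverse noise variance
proj_sub <- lm(mu_ref ~ x1 + x2, data = dat, weights = 1 / dat$sigma_i^2)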


Thank you for adding this feature request!

Until this is added, I think I will build the SPCA reference model and perform variable selection without accounting for measurement error. Once projpred has selected the most important variables, I will fit the brms model using only these variables, accounting for measurement error in the formula.
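In code, roughly (untested sketch; depending on the projpred version, ranking() may need to be used instead of solution_terms()):

nsel <- suggest_size(cvvsfast)
sel <- head(solution_terms(cvvsfast), nsel)  # the selected variables
form_me <- as.formula(paste("y | resp_se(se_y, sigma = TRUE) ~",
                            paste(sel, collapse = " + ")))
fit_final <- brm(form_me, data = data.cv, family = gaussian(), prior = priors_me)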


Hi @dafanast,

I'm also trying to replicate the SPCA-based reference model approach in Pavone (2023) – given that the formula interface in brms appears to struggle with large numbers of predictors – and wondered how you managed to get this to work with later versions of projpred, as the paper's GitHub code doesn't work as is. May I ask how you wrote the extract_model_data function to work with the SPCA reference model?

Thanks in advance!

Hi @rtnliqry,

I have paused this work since last year and I'm starting again in a few months. I hope what I'm telling you here is correct; I've done my best to remember it. According to my notes, the approach I used was the following:

  1. Data preparation, where I filtered out variables based on 4 criteria. I know that's not ideal, but I had to reduce the number of variables to a workable number to avoid the stack overflow problems, since I couldn't find another way to deal with them. My criteria were to exclude variables: i) with zero or near-zero variance, ii) above or below a certain threshold (a criterion appropriate for my dataset due to biological properties), iii) highly correlated with each other, and iv) showing a lack of correlation with my response variable. That's how I went from ~150,000 to ~15,000 variables.
  2. Reference model: I used supervised dimension reduction and I built a measurement error model using the latent features.
dri <- ispca(x, y)
zi <- predict(dri, x)
data <- data.frame(zi, y, se.y) # se.y comes from my dataset

Measurement error model using latent features:
fit.i <- brm(y | resp_se(se.y, sigma = TRUE) ~ ., data = data, family = gaussian(), save_pars = save_pars(all = TRUE))

Construct the reference model for projpred:

extractor_cust <- function(object, newdata, wrhs = NULL, orhs = NULL,
                           extract_y = TRUE) {
  return(y_wobs_offs(newdata = newdata, wrhs = wrhs, orhs = orhs,
                     resp_form = if (extract_y) ~ y else NULL))
}

fit <- fit.i
ref.ispca <- init_refmodel(
  fit,
  data = data.full,
  formula = as.formula(formula_str),
  family = gaussian(),
  extract_model_data = extractor_cust,
  dis = as.matrix(fit)[, "sigma"])

with the formula for the reference model (writing it this way sometimes helps with the stack overflow problem, but I don't think it did in this case):
formula_str <- paste("y ~", paste(colnames(data.full)[colnames(data.full) != "y"], collapse = " + "))

  3. I ran a preliminary projection predictive variable selection using the reference model and my full dataset with the ~15,000 variables.

Preliminary cv_varsel() run:

cvvs.l1.v2 <- projpred::cv_varsel(ref.ispca,
                                  method = "L1",
                                  nclusters_pred = 20,
                                  validate_search = FALSE,
                                  nterms_max = 100,
                                  search_terms = NULL,
                                  parallel = TRUE,
                                  seed = 123)

Now, because I wasn't able to include the measurement error in projpred (see my post above), I used the preliminary search only to choose my final variables.

  4. Using the selected variables, I built a measurement error model with brms, with which I ran all the checks.

Just a few comments on the projpred part:

Have you tried whether formula = y ~ . works in init_refmodel()? If I remember correctly, this should work, at least in projpred version 2.8.0.

This line (and hence also the definition of extractor_cust) should not be necessary anymore in projpred v2.8.0 (nor in v2.7.0). Note that if you want to use K-fold CV, you would also have to specify the arguments cvfun (or, alternatively, cvfits) and cvrefbuilder of init_refmodel(); see the rough skeleton below.
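A rough, untested skeleton (hypothetical names; the exact details, e.g. how the omitted observations are attached to each fold's refit, may depend on the projpred version):

cvfun <- function(folds, ...) {
  # refit the reference model K times, leaving out one fold each time
  lapply(seq_len(max(folds)), function(k) {
    update(fit, newdata = data.full[folds != k, , drop = FALSE], refresh = 0)
  })
}

cvrefbuilder <- function(cvfit) {
  # build a fold-specific reference model object from that fold's refit
  init_refmodel(cvfit, data = data.full[-cvfit$omitted, , drop = FALSE],
                formula = y ~ ., family = gaussian(),
                dis = as.matrix(cvfit)[, "sigma"])
}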


I'm pretty sure I did try formula = y ~ . because it's the simplest way to write the formula. However, because I had >150,000 variables, I was getting the "protection stack overflow" error in R, which can sometimes be avoided with paste("y ~", paste(colnames(data.full)[colnames(data.full) != "y"], collapse = " + ")). It wasn't actually solved in my case, so yes, better to change the formula to the simple form.

If I don’t include it, I get an error.

I just tried again now with the simple formula

ref.ispca <- init_refmodel(
  fit,
  data = data.full,
  formula = y ~ .,
  family = gaussian(),
  #extract_model_data = extractor_cust,
  dis = as.matrix(fit)[, "sigma"])

I get this error:

Error in init_refmodel(fit, data = data.full, formula = y ~ ., family = gaussian(),  : 
  argument "extract_model_data" is missing, with no default

I am using projpred 2.7.0 at the moment.

Wow! You could reduce the memory use by first splitting the variables into, e.g., 10 parts. Do the selection with validate_search=FALSE within each part, and then use a formula which combines the selected variables from each part.
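An untested sketch of what I mean (reusing extractor_cust and fit from above):

preds <- setdiff(colnames(data.full), "y")
parts <- split(preds, rep(1:10, length.out = length(preds)))
selected <- unlist(lapply(parts, function(p) {
  f <- as.formula(paste("y ~", paste(p, collapse = " + ")))
  ref_p <- init_refmodel(fit, data = data.full, formula = f,
                         family = gaussian(),
                         extract_model_data = extractor_cust,
                         dis = as.matrix(fit)[, "sigma"])
  cvvs_p <- cv_varsel(ref_p, method = "L1", validate_search = FALSE,
                      nterms_max = 100)
  head(solution_terms(cvvs_p), suggest_size(cvvs_p))
}))
final_formula <- as.formula(paste("y ~", paste(unique(selected), collapse = " + ")))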


Could you try with v2.8.0?

Thanks @dafanast and @fweber144 for your responses.

Using projpred v2.8.0.9000 from GitHub works out of the box with varsel(), but I'm struggling to get cv_varsel() to work. This must be because I'm missing something important from the custom ref_predfun function, so any advice would be welcome! Here's a reprex below (admittedly not the tidiest or most efficient code). Am I right to assume that by manually specifying search_terms here, the main effect that y ~ . would automatically add for the grouping variable will be ignored by projpred?

# Packages ----
library(brms)
library(dplyr)
library(dimreduce)
library(loo)
library(projpred)
library(purrr)
library(tidyr)

# Simulation parameters ----
set.seed(1234)
n <- 100 # number of observations
p <- 50 # total number of variables
k <- 5 # number of relevant variables

# Data simulation ----
vars <- matrix(rnorm(n * p), ncol = p)
colnames(vars) <- c(paste0("noise.", 1:(p - k)), paste0("pred.", 1:k))
betas <- c(rep(0, p - k), rnorm(k))
prob <- plogis(0 + vars %*% betas) # success probabilities (inverse logit)
df_sim <- data.frame(y = rbinom(n, 1, prob), vars, subject = as.factor(1:n))

# Add noise to simulate duplicate observations for 20% of n ----
dup_idx <- rbinom(n, size = 1, prob = 0.2)
df_sim <- df_sim |> 
  mutate(dup = dup_idx) |> 
  group_by(subject) |> 
  pivot_longer(-c(y, subject, dup), names_to = "var") |> 
  mutate(value2 = if_else(dup == 1, true = value + rnorm(p, 0, sd = 0.5), false = NA)) |> 
  select(-dup) |> 
  pivot_longer(-c(y, subject, var)) |> 
  pivot_wider(id_cols = c(y, subject, name), names_from = var, values_from = value) |> 
  select(-name) |> 
  ungroup() |> 
  na.omit()

# Supervised PCA ----
nc <- 5
x <- select(df_sim, starts_with("noise"), starts_with("pred"))
dr <- spca(x = x, y = df_sim$y, nctot = nc, screenthresh = 0.6, window = p, sup.only = TRUE, verbose = FALSE)
z <- predict(dr, df_sim |> select(-y, -subject)) |> 
  as_tibble() |> 
  mutate(subject = df_sim$subject)
colnames(z)[1:nc] <- paste0("PC.", 1:nc)

# Fit the reference model in brms (as per model.stan in Pavone 2023) ----
smax <- max(dr$sdev)
custom_stanvars <- stanvar(smax, "smax")+
  stanvar(nc, "nc")+
  stanvar(scode = " real<lower=0> tau;", block = "parameters")

priors <- set_prior("normal(0, tau ^ 2)", class = "b")+
  set_prior("normal(0, tau ^ 2)", class = "Intercept")+
  set_prior("target += student_t_lpdf(tau | nc + 1, 0, smax ^ (-2))", check = FALSE)+
  set_prior("student_t(3, 0, 1)", class = "sd")

form <- paste0("y ~ ", paste0(colnames(z)[1:nc], collapse = " + "), " + (1 | subject)")

fit_spca <- brm(
  form
  ,data = data.frame(y = df_sim$y, z)
  ,family = bernoulli
  ,prior = priors
  ,stanvars = custom_stanvars
  ,iter = 3000
  ,chains = 4
  ,cores = 4
  ,seed = 1234
  ,control = list(adapt_delta = 0.95)
)

# Make refmodel object (traditional projection); varsel runs OK ----
refm <- init_refmodel(
  fit_spca
  ,data = df_sim
  ,formula = y ~ . + (1 | subject)
  ,family = binomial()
)

custom_search_terms <- colnames(select(df_sim, starts_with("noise"), starts_with("pred")))

vsel <- varsel(
  refm
  ,seed = 1234
  ,method = "forward"
  ,nterms_max = 10
  ,search_terms = custom_search_terms
)

# cv_varsel ----
folds <- kfold_split_grouped(K = 2, x = df_sim$subject)

custom_ref_predfun <- function(fit, newdata = NULL) {
  set.seed(1234)
  lprd_args <- nlist(object = fit, newdata, allow_new_levels = TRUE)
  out <- do_call(posterior_linpred, lprd_args)
  if (length(dim(out)) == 2) {
    out <- t(out)
  }
  out
}

custom_cvrefbuilder <- function(cvfit) {
  refm <- init_refmodel(
    cvfit
    ,data = df_sim[-cvfit$omitted, , drop = FALSE]
    ,formula = y ~ . + (1 | subject)
    ,family = binomial()
    ,called_from_cvrefbuilder = TRUE
    ,ref_predfun = custom_ref_predfun
  )
  return(refm)
}

custom_cvfun <- function(folds, ...) {
  
  cvfits <- lapply(
    seq_len(max(folds))
    ,function(k) {
    
      df <- df_sim[folds != k, ]
      dr <- spca(
        select(df, -y, -subject)
        ,y = df$y
        ,nctot = nc
        ,screenthresh = 0.6
        ,window = p
        ,sup.only = TRUE
        ,verbose = FALSE
      )
      z <- predict(dr, df |> select(-y, -subject)) |> 
        as_tibble() |> 
        mutate(subject = df$subject)
      
      colnames(z)[1:nc] <- paste0("PC.", 1:nc)
      
      smax <- max(dr$sdev)
      custom_stanvars <- stanvar(smax, "smax")+
        stanvar(nc, "nc")+
        stanvar(scode = " real<lower=0> tau;", block = "parameters")
      
      priors <- set_prior("normal(0, tau ^ 2)", class = "b")+
        set_prior("normal(0, tau ^ 2)", class = "Intercept")+
        set_prior("target += student_t_lpdf(tau | nc + 1, 0, smax ^ (-2))", check = FALSE)+
        set_prior("student_t(3, 0, 1)", class = "sd")
      
      form <- paste0("y ~ ", paste0(colnames(z[,1:nc]), collapse = "+"), "+ (1 | subject)")
      
      model_k <- brm(
        form
        ,data = data.frame(y = df$y, z)
        ,family = bernoulli
        ,prior = priors
        ,stanvars = custom_stanvars
        ,iter = 3000
        ,chains = 4
        ,cores = 4
        ,seed = 1234
        ,control = list(adapt_delta = 0.95)
      )
      return(model_k)
      
  })
  
  return(cvfits)
  
}

refm <- init_refmodel(
  fit_spca
  ,data = df_sim
  ,formula = y ~ . + (1 | subject)
  ,family = binomial()
  ,ref_predfun = custom_ref_predfun
  ,cvfun = custom_cvfun
  ,cvrefbuilder = custom_cvrefbuilder
)

cv_fits <- run_cvfun(refm, folds = folds, seed = 1234)

cvsel <- cv_varsel(
  refm
  ,seed = 1234
  ,cv_method = "kfold"
  ,method = "forward"
  ,nterms_max = 10
  ,validate_search = TRUE
  ,search_terms = custom_search_terms
  ,cvfits = cv_fits
)

This throws the below error (either using the default ref_predfun without having a multilevel reference model or using the custom_ref_predfun). I’ve included the full traceback at the bottom of this post.

Error: The following variables can neither be found in 'data' nor in 'data2':
'PC.1', 'PC.2', 'PC.3', 'PC.4', 'PC.5'

Latent projection:
As a side note, trying to use the latent projection with the same model/data throws the error below:

# Make refmodel object (latent projection); errors ----
refm <- init_refmodel(
  fit_spca
  ,data = df_sim
  ,formula = y ~ . + (1 | subject)
  ,family = binomial()
  ,latent = TRUE
)

Error in eval(formula[[2]], data, environment(formula)) :
object '.y' not found

Full error traceback:

Error: The following variables can neither be found in 'data' nor in 'data2':
'PC.1', 'PC.2', 'PC.3', 'PC.4', 'PC.5'
23: stop(…, call. = FALSE)

22: stop2("The following variables can neither be found in ", "'data' nor in 'data2':\n", collapse_comma(missing_vars2))

21: validate_data(newdata, bterms = bterms, na_action = na.pass, drop_unused_levels = FALSE, attr_terms = attr_terms, data2 = current_data2(object, newdata2), knots = get_knots(object$data))

20: validate_newdata(newdata, object = object, …)

19: current_data(object, newdata, newdata2 = data2, re_formula = re_formula, …)

18: standata.brmsfit(x, newdata = newdata, re_formula = re_formula, newdata2 = newdata2, resp = resp, allow_new_levels = allow_new_levels, internal = TRUE, …)

17: standata(x, newdata = newdata, re_formula = re_formula, newdata2 = newdata2, resp = resp, allow_new_levels = allow_new_levels, internal = TRUE, …)

16: prepare_predictions.brmsfit(object, newdata = newdata, re_formula = re_formula, resp = resp, ndraws = ndraws, draw_ids = draw_ids, check_response = FALSE, …)

15: prepare_predictions(object, newdata = newdata, re_formula = re_formula, resp = resp, ndraws = ndraws, draw_ids = draw_ids, check_response = FALSE, …)

14: posterior_linpred.brmsfit(object = .x1, newdata = .x2, allow_new_levels = .x3)

13: .fun(object = .x1, newdata = .x2, allow_new_levels = .x3) at #1

12: eval(expr, envir, …)

11: eval(expr, envir, …)

10: eval2(call, envir = args, enclos = parent.frame())

9: do_call(posterior_linpred, lprd_args)

8: ref_predfun_usr(fit = fit, newdata = newdata)

7: fold$refmodel$ref_predfun(fold$refmodel$fit, newdata = refmodel$fetch_data(obs = fold$omitted), excl_offs = FALSE)

6: one_fold(fold = list_cv[[k]], rk = search_out_rks[[k]], …)

5: FUN(X[[i]], …)

4: lapply(seq_along(list_cv), function(k) { if (verbose) { on.exit(utils::setTxtProgressBar(pb, k)) } …

3: kfold_varsel(refmodel = refmodel, method = method, nterms_max = nterms_max, ndraws = ndraws, nclusters = nclusters, ndraws_pred = ndraws_pred, nclusters_pred = nclusters_pred, refit_prj = refit_prj, penalty = penalty, verbose = verbose, search_control = search_control, K = K, …

2: cv_varsel.refmodel(refm, seed = 1234, cv_method = "kfold", method = "forward", nterms_max = 10, validate_search = TRUE, search_terms = custom_search_terms, cvfits = cv_fits)

1: cv_varsel(refm, seed = 1234, cv_method = "kfold", method = "forward", nterms_max = 10, validate_search = TRUE, search_terms = custom_search_terms, cvfits = cv_fits)

Edit:
I can also confirm that using y ~ . works with projpred when the number of predictors is high enough to cause a stack overflow when trying to fit a non-dimension-reduced reference model.

Yes, it works!

Great idea as always, I will have to try it, thank you.

Sorry @rtnliqry, I don’t know how to help you with this error.


You should be able to fix this by replacing the definition of custom_ref_predfun() with

custom_ref_predfun <- function(fit, newdata = NULL) {
  set.seed(1234)
  lprd_args <- nlist(
    object = fit,
    newdata = if (!is.null(newdata) &&
                  !is.null(attr(fit, "projpred.predict_spca"))) {
      attr(fit, "projpred.predict_spca")(newdata)
    } else {
      newdata
    },
    allow_new_levels = TRUE
  )
  out <- do_call(posterior_linpred, lprd_args)
  if (length(dim(out)) == 2) {
    out <- t(out)
  }
  out
}

and by inserting

attr(model_k, "projpred.predict_spca") <- function(newdata) {
  new_z <- predict(
    dr,
    newdata[, !names(newdata) %in% c("y", "subject"), drop = FALSE]
  )
  colnames(new_z) <- paste0("PC.", seq_len(ncol(new_z)))
  new_z <- cbind(as.data.frame(new_z), subject = newdata$subject)
  return(new_z)
}

in the definition of custom_cvfun(), just above the line return(model_k). There might be better solutions; this is just the first one that came to my mind.

With this information, could you also try the latent part again (the one which resulted in error object ‘.y’ not found)?


Btw, the fix also shows that the ((I)S)PCA approach introduces a dependency between training and test data: when making predictions for the left-out observations from fold k, we use the ((I)S)PCA predict() method, which relies on estimates based on the training data for fold k. If I understand correctly from quickly re-inspecting their paper, Piironen and Vehtari (2018) did not investigate the performance of the ((I)S)PCA approach when using it for variable selection. In particular, they don't seem to have investigated variable selection with K-fold CV, i.e., your use case. I could imagine that variable selection with K-fold CV with a large K and PSIS-LOO CV are not too heavily affected by this dependency between training and test data, but K-fold CV with a small K might be. This is just a guess from my side, though. Perhaps @avehtari can say more about this. And there might also be other relevant papers apart from Piironen and Vehtari (2018).

And of course, supervised PCA (the iterative variant as well) also comes with a dependency between response and predictors, but I think that’s clear already from the name (and this has been pointed out by Piironen and Vehtari (2018): “We would like to point out that the two-step procedure discussed in this paper is not fully Bayesian as it uses the data twice: first when constructing the new feature representation and second time when fitting the predictive model. Nevertheless, we are sometimes willing to relax the full Bayesian view in the pursuit for a scalable method that allows us to handle high-dimensional problems in a computationally feasible manner.”).


Thanks again for your response, @fweber144!

This now works when parallel = FALSE but throws the following error when parallelised:

Error in { : task 1 failed - "could not find function "nlist""

When trying latent = TRUE, it unfortunately throws the same error, with the full traceback below:

Error in eval(formula[[2]], data, environment(formula)) :
object '.y' not found
7: eval(formula[[2]], data, environment(formula))
6: eval(formula[[2]], data, environment(formula))
5: eval_el2(resp_form, newdata)
4: y_wobs_offs(newdata = newdata, wrhs = wrhs, orhs = orhs, resp_form = if (extract_y) lhs(formula) else NULL)
3: extract_model_data_usr(object = object, newdata = newdata, …)
2: extract_model_data(object, newdata = NULL, extract_y = TRUE)
1: init_refmodel(fit_brm, data = df, formula = y ~ ., family = binomial(),
latent = TRUE, cvfun = custom_cvfun, cvrefbuilder = custom_cvrefbuilder,
ref_predfun = custom_ref_predfun_1)

To those coming across this question: I realised that modelling the duplicate predictor measurements with (1 | subject) isn't correct. However, other than calculating the mean of the duplicates prior to modelling, I'm not sure of the best way to handle this (e.g. here) in a way that is compatible with projpred.
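For reference, averaging the duplicates prior to modelling could look something like this (a sketch based on the simulated data above):

library(dplyr)
df_avg <- df_sim |>
  group_by(subject) |>
  summarise(y = first(y),                                  # y is constant within subject
            across(starts_with(c("noise", "pred")), mean), # average duplicate measurements
            .groups = "drop")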