Accounting for measurement error during variable selection with projpred (possibly with rstanarm or brms?)

Abstract
Is there a way to use projpred while accounting for measurement uncertainty in the dependent variable (e.g., target variable)?

Body
Hi,
I have a pretty simple linear model I’m looking to do some variable selection on. One issue, though, is my dependent expresses measurement error that is heteroskedastic with some other variable. That is, my "observed " dependent is conditionally distributed based on the “true” dependent value and some variable “w”, such that: y_true ~ normal(y_obs, b0 + b1 / w).

My initial desire was to take advantage of the built-in compatibility between projpred and rstanarm or possibly brms. However, I do not see an obvious way of accounting for measurement error in any of those packages. But perhaps I’m missing some set of functionality, or don’t know what key words to use to properly search the documentation.

I was then looking into the custom reference model functionality provided by projpred::init_refmodel. This looked promising at first, but upon closer inspection it appears that the argument y is static (i.e., the target variable is assumed to be perfectly measured or known). Am I interpreting this argument correctly?

Anyone know of a way of addressing measurement uncertainty in the context of projpred?

1 Like

init_refmodel has also argument dis which for Gaussian observation model this is the noise std sigma https://mc-stan.org/projpred/reference/init_refmodel.html. For non-Gaussian models the uncertainty depends on location parameter (e.g. binomial and Poisson) and some non-Gaussian have also additional dispersion parameter (e.g. negative-binomial). Theoretically having heteroscedastic noise in Gaussian model is trivial, but we haven’t made such experiments with projpred and thus I’m not certain whether dis can accept sample of sigma vector. If it doesn’t accept it now, the change in the code shouldn’t be big. As projpred is being refactored right now, I ping @AlejandroCatalina and @paul.buerkner to comment whether heteroscedastic dispersion has been considered in the new code?

1 Like

Adding to what Aki said, we indeed recover the sigma vector from the samples in the stanfit (or brms, rstanarm) object as the dispersion (so this can be whatever your model defines). So if you are passing a custom model, make sure to include a sigma field. If you are doing so with rstanarm or brms it should be automatic.

My understanding of dis (and the sigma field) are that they are related to the likelihood of the model after accounting for measurement uncertainty, so aren’t entirely coordinated with the measurement uncertainty itself. In that context I’ve been following the strategy laid out by the Stan User’s Guide (specifically, the second block of code in the example):

Using the notation from this example, there would be both a tau (the heteroskedastic part of the model; tau = b0 + b1 / w) and a sigma parameter, in which case I’m not sure how to pass tau to init_refmodel. Or am I not understanding something?

In any case, thank you both very much for your response – very helpful.

I am facing a similar problem. I have described in details my dataset elsewhere.

To construct my reference model, I performed a SPCA using all my 164906 predictors. To account for measurement error on the dependent variable, I’m following the strategy from this book. The measurement error is not related to any of the predictors.

I used the principal components as predictors and I fit the model using brms as:

fit <- brm(y | resp_se(se_y, sigma=TRUE) ~ ., data=data.cv, family = gaussian(),prior=priors_me)

In projpred, I use init_refmodel for my custom reference model and I define dis which, in my understanding, is what was discussed above. However, in the formula argument of init_refmodel, “function calls on the right-side of the | are currently not allowed in projpred” (from the manual), so I cannot use the same formula as in brms. I construct the reference model as:

ref.spca <- init_refmodel(fit, data = data.full, formula = y ~ ., family = gaussian(), extract_model_data = extractor_cust, dis = as.matrix(fit)[, "sigma"])

The performance of the reference model when fit with resp_se() is worse than the submodels.
image

However, this is not the case when I follow exactly the same procedure but my reference model was fit without resp_se().
image

fit <- brm(y ~ ., data=data.cv, family = gaussian(),prior=priors_me)

The search procedure is always the same:

cvvsfast <- projpred::cv_varsel(ref.spca, method = "L1", nclusters_pred = 100, validate_search=FALSE, nterms_max=100, search_terms = NULL, ndraws = 400).

Is my understanding correct? How can I properly account for measurement error on the dependent variable during the procedure?

Ping @fweber144

Currently, projpred requires the dispersion parameter to be the same for all observations. I have added a feature request for making this more flexible: Allow dispersion parameter to be observation-specific · Issue #481 · stan-dev/projpred · GitHub. As mentioned there, I guess we will need special submodel fitters for this (or at least adapt the currently available ones), so that the observation-specific dispersion parameter values are taken into account when performing the projection.

1 Like

Thank you for adding this feature request!

Until this is added, I think I will implement an approach of building the SPCA reference model and performing variable selection without accounting for measurement error. Once the most important variables have been selected by projpred, I will fit the brm model using only these variables and accounting for measurement error in the formula.

1 Like