Accounting for measurement error during variable selection with projpred (possibly with rstanarm or brms?)

Is there a way to use projpred while accounting for measurement uncertainty in the dependent variable (e.g., target variable)?

I have a pretty simple linear model I’m looking to do some variable selection on. One issue, though, is my dependent expresses measurement error that is heteroskedastic with some other variable. That is, my "observed " dependent is conditionally distributed based on the “true” dependent value and some variable “w”, such that: y_true ~ normal(y_obs, b0 + b1 / w).

My initial desire was to take advantage of the built-in compatibility between projpred and rstanarm or possibly brms. However, I do not see an obvious way of accounting for measurement error in any of those packages. But perhaps I’m missing some set of functionality, or don’t know what key words to use to properly search the documentation.

I was then looking into the custom reference model functionality provided by projpred::init_refmodel. This looked promising at first, but upon closer inspection it appears that the argument y is static (i.e., the target variable is assumed to be perfectly measured or known). Am I interpreting this argument correctly?

Anyone know of a way of addressing measurement uncertainty in the context of projpred?

1 Like

init_refmodel has also argument dis which for Gaussian observation model this is the noise std sigma For non-Gaussian models the uncertainty depends on location parameter (e.g. binomial and Poisson) and some non-Gaussian have also additional dispersion parameter (e.g. negative-binomial). Theoretically having heteroscedastic noise in Gaussian model is trivial, but we haven’t made such experiments with projpred and thus I’m not certain whether dis can accept sample of sigma vector. If it doesn’t accept it now, the change in the code shouldn’t be big. As projpred is being refactored right now, I ping @AlejandroCatalina and @paul.buerkner to comment whether heteroscedastic dispersion has been considered in the new code?

1 Like

Adding to what Aki said, we indeed recover the sigma vector from the samples in the stanfit (or brms, rstanarm) object as the dispersion (so this can be whatever your model defines). So if you are passing a custom model, make sure to include a sigma field. If you are doing so with rstanarm or brms it should be automatic.

My understanding of dis (and the sigma field) are that they are related to the likelihood of the model after accounting for measurement uncertainty, so aren’t entirely coordinated with the measurement uncertainty itself. In that context I’ve been following the strategy laid out by the Stan User’s Guide (specifically, the second block of code in the example):

Using the notation from this example, there would be both a tau (the heteroskedastic part of the model; tau = b0 + b1 / w) and a sigma parameter, in which case I’m not sure how to pass tau to init_refmodel. Or am I not understanding something?

In any case, thank you both very much for your response – very helpful.