Help interpreting projpred's predictive performance plot

I am fitting a model to proportion data (working on neonatal mortality), and the proportions contain a lot of zeros, so I chose the zero-inflated beta.

Here is a density plot of my outcome variable:

The model is:

model <- brm(
  bf(neonatal_mortality ~ x1  + ... + xp + (1 | county) + (1 | period)),
  prior = set_prior('horseshoe(df = 1)', class = 'b'),
  family = zero_inflated_beta(),
  data = data,
  seed = 7,
  iter = 10000,
  warmup = 2000,
  thin = 4,
  chains = 4,
  cores = 4
)

I am using the horseshoe prior on the coefficients because there are about 41 covariates but only 184 observations, and I would like to run variable selection using projection predictive inference (projpred).

Here is the posterior predictive check of the full model:


The projpred function is as below:

varmod <- varsel(model,
                 nterms_max = 200, # set high, as I keep changing the model size
                 validate_search = F,
                 method = 'forward',
                 refit_prj = T,
                 latent = T,
                 ndraws = 5000,
                 ndraws_pred = 5000,
                 verbose = T,
                 seed = 32)

The plots I obtain from running projpred are shown below (I get a similar shape using mlpd, elpd, and gmpd):

plot(varmod, resp_oscale = F, stat = 'mlpd')

However, for the RMSE I get a more familiar plot:

plot(varmod, resp_oscale = T, stat = 'rmse')

What could this imply?

Hi Stanley!

Concerning the outcome, do you have the absolute frequencies and the “numbers of trials” (i.e., the denominators of the relative frequencies) available? If yes, wouldn’t it be possible to fit a binomial model instead of a (zero-inflated) beta model? That way, you could avoid the latent projection and the complications (in particular, the less intuitive interpretation of projpred’s output) that it comes with.
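A binomial reference model could then look roughly like the following sketch, which mirrors your original formula (the column names deaths and births are hypothetical placeholders for your counts and denominators):

# Sketch only; 'deaths' (successes) and 'births' (trials) are placeholder names.
model_binom <- brm(
  deaths | trials(births) ~ x1 + ... + xp + (1 | county) + (1 | period),
  family = binomial(),
  prior = set_prior('horseshoe(df = 1)', class = 'b'),
  data = data
)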

For these plots, I usually find it helpful to create them with deltas = TRUE. This is shortly mentioned in projpred’s main vignette, I think. In that vignette, we also mention
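For example, with the varmod object from your question:

plot(varmod, stat = 'elpd', deltas = TRUE)

With deltas = TRUE, the submodels' performance is shown as the difference to the reference model's performance, which usually makes it easier to see where the curve levels off.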

Because of its most thorough protection against overfitting, cv_varsel() with validate_search = TRUE is recommended over varsel() and cv_varsel() with validate_search = FALSE. Nonetheless, a preliminary and comparatively fast run of varsel() or cv_varsel() with validate_search = FALSE can give a rough idea of the performance of the submodels and can be used for finding a suitable value for argument nterms_max in subsequent runs (argument nterms_max imposes a limit on the submodel size up to which the search is continued and is thus able to reduce the runtime considerably).

So the plots based on your varsel() call (btw, in that call, validate_search = FALSE is effectively unused because validate_search is an argument of cv_varsel()) should only be regarded as very rough preliminary results. I would recommend continuing as described in section Final cv_varsel() run (and the sections following it) in the main vignette. Don't hesitate to come back in case of further questions.
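For reference, such a final run could look roughly like this sketch (the value of nterms_max would be chosen based on the preliminary results):

cvvs <- cv_varsel(model,
                  method = 'forward',
                  validate_search = TRUE,
                  nterms_max = 10, # chosen from the preliminary run
                  latent = TRUE,
                  seed = 32)
plot(cvvs, deltas = TRUE)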

Hi Frank!

Regarding this: Yes, I do have the total number of trials, the number of positive outcomes, as well as the sampling weights. The reason I did not take this approach is that the model then becomes:

brm(success | trials(num_total) + weights(sampling_weights) ~ x1 + ... + xp, family = binomial(), [...])

And the projpred function throws an error saying it cannot handle weights and trials at the same time.
There might be a workaround where we weight the trials and successes by the sampling weights (so that the weights themselves can be discarded), but I do not have enough experience with that to know whether it is standard practice.
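If one did go that (non-standard) route, it might look like this sketch, where the counts are scaled by the sampling weights and rounded to stay integer-valued (whether this is statistically defensible is exactly my doubt):

data$success_w <- round(data$success * data$sampling_weights)
data$trials_w  <- round(data$num_total * data$sampling_weights)
# then: success_w | trials(trials_w) ~ x1 + ... + xp, without weights()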

Also, regarding using the absolute frequencies and the numbers of trials: the absolute frequencies in my data contain many zeros as well, so won't that call for a zero-inflated binomial? (I am not sure it is supported by projpred's traditional projection.)

Thanks for the advice about deltas = T, I will be using it.

I do know that cv_varsel is the best option for getting reliable results (compared to varsel); however, I was only doing a preliminary run for now (because cv_varsel can take quite a lot of time).

Ah okay, I didn’t know your model also had observation weights in addition to the numbers of trials. What you could try is to “de-aggregate” the dataset, so that you could use the brms::bernoulli() family instead of binomial() (you can use binomial() for 0-1 outcomes, too; I’m just pointing out brms::bernoulli() to emphasize that you need a 0-1 outcome). Then, observation weights should work even in projpred (I think). Just note that you might have to re-normalize the observation weights after “de-aggregating” the dataset (in general, re-normalizing the observation weights should not be necessary, but depending on the analyses you want to perform, it might be required).
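De-aggregating could be sketched in base R like this (assuming columns success and num_total as in your binomial formula; each aggregated row is expanded into num_total Bernoulli rows):

# One row per trial: 'success' ones and 'num_total - success' zeros per original row.
data_long <- data[rep(seq_len(nrow(data)), data$num_total), ]
data_long$outcome <- unlist(mapply(function(s, n) c(rep(1, s), rep(0, n - s)),
                                   data$success, data$num_total))
# Re-normalize the observation weights if needed, e.g. so they sum to the row count:
data_long$sample_weights <- data_long$sample_weights *
  nrow(data_long) / sum(data_long$sample_weights)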

The brms::zero_inflated_binomial() family is not supported by projpred’s traditional projection (and currently neither by the augmented-data projection). It is probably also not supported by the latent projection because brms will probably qualify this as a distributional model (but feel free to try out). However, if I were you, I would start with the non-zero-inflated binomial family and check if additional zero inflation is really necessary/helpful.

Thanks for the recommendations. Regarding this however:

I have tried using the individual level data (as I have it) where the observations have a (0=alive/1=dead) outcome in the following formulation:

b1 = brm(
  bf(outcome | weights(sample_weights) ~ x1 + ... + xp),
  data = data,
  family = bernoulli(),
  iter = 5000,
  warmup = 1000,
  thin = 4, 
  cores = 3,
  chains = 3,
  seed = 32385
)

And on running projpred using varsel as below:

vs_b1 <- varsel(b1,
                nterms_max = 50,
                method = 'L1', # quick
                refit_prj = T,
                ndraws = 2000, # won't be used with L1?
                ndraws_pred = 2000,
                verbose = T,
                seed = 32)

I do get the error:
Error: Projpred cannot handle ‘trials’ and ‘weights’ at the same time.

I have tried both the latent projection and forward search as well. Does this mean that projpred does not support observation weights for binomial and Bernoulli models?


(The initial aim of aggregating to the stratum level was to avoid dealing with missing observations at the individual level, since I would aggregate using:
aggregated_var = weighted.mean(x = individual_var, w = sample_weights, na.rm = T).
I do, however, know about the multiple-imputation functionality in brms.)
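For completeness, the multiple-imputation route in brms could be sketched as below (using the mice package; untested on my data, and I don't know whether projpred can consume brm_multiple fits):

library(mice)
imp <- mice(data, m = 5, seed = 32)
b_mi <- brm_multiple(
  outcome | weights(sample_weights) ~ x1 + ... + xp,
  data = imp, family = bernoulli()
)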

Hi Stanley, sorry for the late reply.

Ok, sorry then. I didn’t have that in mind when I suggested that you could try to “de-aggregate” the dataset.

This approach to handling missing values is new to me and I don’t fully understand it, so I’m afraid I cannot really help here.

In conclusion, for this Bernoulli / binomial outcome, if you really need to incorporate observation weights other than the numbers of trials, you probably cannot use projpred. There is a small chance that the latent projection works in such a case if you combine the observation weights and the numbers of trials yourself, but I haven’t tried that yet (on that occasion, you could also try the zero-inflated binomial family if you really need it).