Projpred: augmented and latent projection for hierarchical categorical models

Hi everyone,

First of all, thanks again to everyone, especially the developers, in this community for all their support. Projpred is such a great package and I’m excited to explore it more.

I have a hierarchical categorical model on which I would like to run projection predictive variable selection. I have successfully done this for the non-hierarchical model (see Projpred: selection of submodel that doesn't replicate predictive performance of reference model), but as the data are hierarchical, it would be preferable for the model to reflect this.

When I run `refm_obj.categorical <- get_refmodel(categorical.multilevel.fit)` (which automatically uses the augmented-data projection), I get:

“Warning message:
In .fun(object = .x1, data = .x2, formula = .x3, family = .x4, dis = .x5, :
For multilevel models, the augmented-data projection may not work properly.
The latent projection may be a remedy.”

And indeed, doing a preliminary cv_varsel run using this reference model resulted in many warnings about the instability of the projection.

When trying the latent projection with
`refm_obj.categorical <- get_refmodel(categorical.multilevel.fit, latent = TRUE)`,
I get:

Defining `latent_ilink` as a function which calls `family$linkinv`, but there is no guarantee that this will work for all families. If relying on `family$linkinv` is not appropriate or if this raises an error in downstream functions, supply a custom `latent_ilink` function (which is also allowed to return only `NA`s if response-scale post-processing is not needed).

When trying the preliminary cv_varsel run using the latent projection (but without defining the latent_ilink function), I get:
Error in `[<-.data.frame`(`*tmp*`, , response_name, value = c(2.89356639529538, : replacement has 3478 rows, data has 1739.

I have 1739 observations in my data set, and my response variable is categorical with three levels (i.e., one level is the reference category, and then two further levels). I assume this error message occurs because I have not defined the latent_ilink function. However, despite reading the latent projection vignette, I am not advanced enough to define this function correctly myself.

Could anyone help me with this problem?

Hi Melissa,

The reason for the warning message "For multilevel models, the augmented-data projection may not work properly" is essentially lme4 issue #682. The underlying issue seems to hold not only for the binomial family, but also for families for which projpred uses the augmented-data projection by default (at least that’s what I experienced in a multilevel cumulative model). In short, the augmented-data projection for multilevel models needs further investigation (and perhaps some modification) before we can recommend it.

Unfortunately, for categorical models (with more than two possible response values, as in your case), there is not a single latent predictor per observation, but multiple. That’s why the latent projection (as currently implemented in projpred) will not work for such categorical models.
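
To make the dimension argument concrete, here is a minimal base-R sketch. It assumes the usual reference-category parameterization (one latent predictor per non-reference category) and uses simulated numbers, not anything extracted from projpred or brms. Under that assumption, three response categories give two latent predictors per observation, so stacking them yields 2 × 1739 = 3478 values, which would be consistent with the row count in the error message above.

```r
# Illustration only: simulated linear predictors, not projpred or brms output.
set.seed(1)
n_obs <- 1739  # number of observations, as in the original post
n_cat <- 3     # response categories, one of which is the reference

# Assumption: one latent (linear) predictor per non-reference category
# and observation, stored as an n_obs x (n_cat - 1) matrix.
eta <- matrix(rnorm(n_obs * (n_cat - 1)), nrow = n_obs, ncol = n_cat - 1)

# Stacking the columns into a single vector gives (n_cat - 1) * n_obs values.
eta_stacked <- as.vector(eta)

# A data frame with one row per observation cannot hold that vector
# in a single response column, because the lengths differ.
dat <- data.frame(id = seq_len(n_obs))
c(stacked = length(eta_stacked), rows = nrow(dat))
```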

In conclusion, I don’t know how you could apply projpred to a multilevel categorical model. Perhaps @avehtari has an idea?

Unfortunately, there is no easy option, and adding code to support a multivariate target (in the original and the latent space) is quite a big task. You could use projpred separately for two binary targets to get two answers, which you can then combine and check for sensibility, but of course that is not as nice as being able to do the variable selection in one go.
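
One possible reading of the two-binary-targets suggestion is sketched below in base R. This is an assumption, not something specified in the reply: each non-reference level is compared against the reference level on the subset of rows belonging to those two levels. The toy data and the names `y`, `x1`, `group`, `y_bin`, and `make_binary` are hypothetical placeholders. The model-fitting lines are left as comments because they need brms, projpred, and a compiled Stan model, and they have not been tested here.

```r
# Hypothetical toy data; `y`, `x1`, and `group` are placeholder names.
set.seed(1)
dat <- data.frame(
  y = factor(sample(c("a", "b", "c"), 300, replace = TRUE)),
  x1 = rnorm(300),
  group = factor(sample(1:10, 300, replace = TRUE))
)
ref_level <- levels(dat$y)[1]
other_levels <- setdiff(levels(dat$y), ref_level)

# Keep only rows at the reference level or the level of interest,
# then code the level of interest as 1 and the reference level as 0.
make_binary <- function(data, level, ref) {
  sub <- data[data$y %in% c(ref, level), , drop = FALSE]
  sub$y_bin <- as.integer(sub$y == level)
  sub
}

# One binary data set per non-reference level.
binary_sets <- lapply(other_levels, function(lv) make_binary(dat, lv, ref_level))
names(binary_sets) <- other_levels

# Untested sketch of the per-target workflow (requires brms and projpred):
# fit_b  <- brms::brm(y_bin ~ x1 + (1 | group), data = binary_sets[["b"]],
#                     family = brms::bernoulli())
# refm_b <- projpred::get_refmodel(fit_b, latent = TRUE)
# cvvs_b <- projpred::cv_varsel(refm_b)
```

The same steps would be repeated for the second binary data set, and the two selected predictor sets compared afterwards. Whether the latent projection behaves well for each binary multilevel model would still need to be checked case by case.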


Thanks @fweber144 and @avehtari for your very quick responses!

I will continue playing around with the latent projection in projpred for multilevel models, and report back if I manage to create any code that is useful for others.
