Variable selection for an exploratory multilevel categorical model with weak priors

blokeman · January 14, 2022, 7:41am

I am conducting an exploratory study of a linguistic outcome with 4 unordered categories (logit link). There are two group-level effects (varying-intercept terms) and about 30 population-level covariates, for a total of 150 parameters. N is 2,220. I use weak N(0, 2.5) priors for all di-/polychotomous covariates, and equivalently weak normal priors for quantitative covariates. The group-level SD’s have exponential(2) priors and are not subject to variable selection because they are of inherent interest. The purpose of my priors is just to help the model converge and to rule out infinite/preposterous values. I am basically trying to fit a model that produces the same estimates as a frequentist model, without the computational and convergence problems.

I presently use brms to do all fitting.

Many covariates will likely not make the “final” the model due to excessively noisy/overfit coefficients, however, due to my exploratory emphasis I don’t want those coefficients shrunk to zero by strong priors. Rather, I want to witness every weak and unreliable signal of potential effect and make at least a descriptive comment on whether it accords with theory, before proceeding to remove the associated covariate.

I am now trying to decide on a reasonable method of variable selection that is not too likely to attract serious methodological criticism. The projection predictive approach doesn’t seem an option because it appears to be intended for parsimonious prediction rather than exploratory investigation. It is also not available for categorical responses.

I’m presently aware of two fairly appealing options (but feel free to suggest others!). Each of them was originally developed for frequentist analysis. Below, in outline, are my “improvised quasi-frequentist” and “improvised Bayesian” implementations of each procedure:

1a. Hosmer et al’s “Purposeful Selection” (2013: 89–124), improvised quasi-frequentist version:

Start with a null model containing only the fixed and group-level intercepts.
Fit 30 “univariate” models, adding to the null model each of the 30 population-level covariates in turn, noting each one that improves “Deviance” at p < 0.25 in a likelihood-ratio test (LRT).
Fit a “full model” containing every population-level covariate that was significant at p < 0.25.
Rank the population-level covariates by significance in the LRT.
One at a time, delete every population-level covariate that is neither significant at p < 0.05 nor a confounder, i.e. whose removal doesn’t result in a change of \geq 20% in any of the remaining coefficients.
After removing all non-significant non-confounders, go through each covariate that failed to make the cut in Step 2, one at a time, adding it to the model and checking whether it’s significant at p < 0.05 conditional on the other covariates in the model.
Once no more covariates can be added or removed, check for non-linear and interaction effects among the remaining covariates.

This procedure relies heavily on the LRT and “Deviance”, i.e. minus twice the in-sample log score calculated using the posterior means of the parameters. This quantity is not quite the true Deviance because it is not based solely on the likelihood, but it should be very close due to the weak priors.

1b. Hosmer et al’s “Purposeful Selection”, improvised Bayesian version:

Like 1a except we use elpd_loo comparisons instead of LRT p-values. In Step 2, our criterion is whether elpd_loo improves at all. In Steps 5 and 6, our criterion is whether elpd_loo improves by 4 or more. The confounding criterion is unchanged.

2a. Dunkler et al’s “Augmented Backward Elimination” (2014), improvised quasi-frequentist version:

Start with a “full model” containing the group-level intercepts and all population-level covariates of interest.
Rank the population-level covariates by p-value and start eliminating those not significant at p < 0.2 that are not confounders.
Once no more covariates can be eliminated, check for non-linear and interaction effects among the remaining covariates.

These authors don’t specify which significance test to use, but I would use the LRT based on “Deviance” here as well.

2b. Dunkler et al’s “Augmented Backward Elimination” (2014), improvised Bayesian version:

Like 2a except we use elpd_loo comparisons instead of “Deviance”-based LRT p-values. Simply, main effects are removed from the full starting model if they are not confounders and their removal improves elpd_loo by any amount, no matter how small.

In addition to the time saving of not having to compute LOO posteriors, the quasi-frequentist approaches are arguably more faithful to the original frequentist procedures because they approximate the p-value criteria more closely. AFAIK, with priors this weak, -2*elpd_loo is pretty much equivalent to AIC, so any improvement in elpd_loo that results from adding a parameter should imply a p-value \leq 0.157 in a LRT. This is much more stringent than the p < 0.25 criterion Hosmer et al use in their Step 2, and it is also more stringent than Dunkler et al’s backward-elimination criterion of p > 0.2 for removing parameters. I’m not sure how to go about relaxing it for better equivalence to these frequentist criteria. One idea might be to include a number of pure noise covariates in the model. Then, any covariate of interest whose addition/removal yielded an elpd_loo change within the same range as that observed for the pure noise covariates would be declared inconsequential/non-significant.

My main questions are the following:

Do you think either of the two selection philosophies (Hosmer et al 2013 or Dunkler et al 2014) is more likely than the other to attract serious methodological criticism?
Whichever of the two selection philosophies I choose, do you think I could get away with using the quasi-frequentist version of it?
Might there be some third approach that is better suited for my exploratory aims than these two?

Thanks in advance.

References
Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2013. Applied logistic regression (3rd ed.). Hoboken, N.J.: Wiley.

Dunkler, Daniela, Max Plischke, Karen Leffondré, and Georg Heinze. 2014. “Augmented Backward Selection: A Pragmatic and Purposeful Way to Develop Statistical Models”. PLoS ONE 2014; 9 (11)

avehtari · January 14, 2022, 10:40am

Whether it’s exploratory or not, the prior should reflect what you know, and not your utility/loss in decision making. If you are worried that the prior you’ve chosen to present your prior knowledge might be misspecified, you can do prior sensitivity checking (see, e.g. [2107.14054] Detecting and diagnosing prior and likelihood sensitivity with power-scaling). With 30 covariates, you probably have some prior knowledge how likely it is that all of them are relevant or just some or whether they are all relevant but correlating (Piranha theorem states that you can’t have many big effects unless the covariates are correlating [2105.13445] The piranha problem: Large effects swimming in a small pond, so you can include that assumption in your prior).

If the model can predict with less covariates, you have learned something about those covariates, and thus it’s fine for exploratory investigation, too.

It is, see, e.g. [2109.04702] Latent space projection predictive inference
However, it might be possible that the current projpred package doesn’t work yet with unordered categorical response.

Your N is quite big compared to the number of covariates, so it’s likely that quite many approaches would work for ordering the covariates in some relevance order, but this depends also on signal-noise-ratio, so can’t be sure. Any approach you choose, you can probably improve it with use of reference model approach even if you would not do the projection, see [2004.13118] Using reference models in variable selection

I recommend checking if you can formulate your problem as in [2109.04702] Latent space projection predictive inference

blokeman · January 14, 2022, 7:05pm

Thank you the exhaustive answer, Aki. Much appreciated. That is a wealth of new literature to go through. Your input will definitely help me make a better-informed choice.

Indeed, it says Error in get(extend_family_specific, mode = "function") : object 'extend_family_categorical' of mode 'function' was not found. Otherwise I would certainly give it a try out of pure academic curiosity.

Topic		Replies	Views
Symmetric weakly informative priors for categorical predictors Modeling prior-choice , priors , brms	6	1446	April 29, 2021
Variable selection in bayesian multilevel model General	9	1406	February 11, 2019
Using hierarchical prior with dummies Modeling specification	1	430	January 21, 2019
Projection predictive feature selection for multilevel phylogenetic models Interfaces projpred	8	1292	March 25, 2021
Truncating priors within a linear model with random intercepts Modeling brms	1	484	August 24, 2021

Variable selection for an exploratory multilevel categorical model with weak priors

Related topics