Projpred: Non-Convergence of Predictive Performance with Reference Model

I now managed to dig deeper into this (more precisely, into your reprex from GitHub - l-gorman/projpred_issue_reprex: A reproducible example for an issue being encountered in the ProjPred Package). Sorry that it took me so long.

First of all, in the reprex, I “wrote out” the iso_country_code:village interaction (by creating a new column iso_country_code_IA_village via paste(indicator_data$iso_country_code, indicator_data$village, sep = "_")) because I discovered potential problems with the : syntax between grouping variables. Until this is fixed in projpred, I would recommend following that workaround.
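In code, the workaround looks like this (column and object names as in the reprex):

```r
# "Write out" the `iso_country_code:village` interaction as a single new
# grouping variable:
indicator_data$iso_country_code_IA_village <- paste(
  indicator_data$iso_country_code,
  indicator_data$village,
  sep = "_"
)
# Then use `(1 | iso_country_code_IA_village)` in the model formula instead
# of `(1 | iso_country_code:village)`.
```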

Then I reduced your reprex to a varsel() call with a hold-out test dataset (consisting of a third of all observations) and tried out different ndraws_pred and nclusters_pred settings. This follows the approach described in the main vignette (projpred: Projection predictive feature selection • projpred). You mentioned above that you had already tried increasing ndraws_pred and nclusters_pred, but I wanted to check again with different values. It turned out that a very low number of projected draws used for the performance evaluation (in fact, I achieved this via nclusters = 3 combined with refit_prj = FALSE) indeed contributes to the gap between the submodel ELPD curve and the reference model ELPD. However, at the default of ndraws_pred = 400, there was still a gap, and going from that default to nclusters_pred = S_ref %/% 3 (with S_ref = 4000 in your case) did not reduce the gap considerably. So I guess the default of ndraws_pred = 400 should be fine in your case (at least in the reprex, not necessarily for your full dataset). The code for my hold-out approach is in holdout.R (3.2 KB); the plots are in holdout1_nclusters3.pdf (5.8 KB), holdout2_ndrawspred400.pdf (5.8 KB), and holdout3_nclusterspred1333.pdf (5.8 KB).
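For reference, here is a minimal sketch of that hold-out approach (the actual code is in holdout.R; refm_fit, the seed, and the response column name are assumptions standing in for the objects from the reprex):

```r
library(projpred)

set.seed(457211)  # hypothetical seed
n <- nrow(indicator_data)
idcs_test <- sample.int(n, size = n %/% 3)  # hold out a third of the observations
dat_test <- indicator_data[idcs_test, , drop = FALSE]

# `refm_fit` stands for the brms reference model refitted on the training part
# of the data, i.e., indicator_data[-idcs_test, , drop = FALSE]:
vs <- varsel(
  refm_fit,
  d_test = list(
    data = dat_test,
    offset = rep(0, nrow(dat_test)),
    weights = rep(1, nrow(dat_test)),
    y = dat_test$y  # replace `y` by the actual response column
  ),
  ndraws_pred = 400  # the default; compare to, e.g., nclusters = 3 with refit_prj = FALSE
)
plot(vs, stats = "elpd")
```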

Then, with K-fold CV, I played around with lme4’s tuning parameters because setting options(projpred.check_conv = TRUE) revealed that lme4 does indeed run into convergence problems sometimes (which is what I meant above). Note that while investigating this, I discovered a bug in projpred:::check_conv() which has been fixed on GitHub: Fix typo in the convergence check for submodel fits resulting from `l… · stan-dev/projpred@058df68 · GitHub. In the case I investigated, I was able to resolve the lme4 convergence issues by specifying control = lme4::lmerControl(optimizer = "Nelder_Mead"). However, in the K-fold CVs described below (with the default of nterms_max = NULL), this did not resolve all of the lme4 convergence issues, especially not the numerous ones at the last and the second-to-last submodel size (which is more or less expected due to projpred issue #323). Nevertheless, I would recommend trying control = lme4::lmerControl(optimizer = "Nelder_Mead") for your full dataset as well. You might also want to play around with lme4’s tuning parameters yourself, but control = lme4::lmerControl(optimizer = "Nelder_Mead") is the best I could come up with. In principle, you could also create your own divergence minimizer function (passed to projpred::init_refmodel() via its div_minimizer argument) and there adapt the lme4 tuning parameters conditionally on a preliminary lme4 fit.
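A sketch of how these settings can be passed (assuming that arguments in ... of cv_varsel() are forwarded to the submodel fitter, here lme4; K = 5 is a hypothetical choice):

```r
# Enable projpred's convergence check for the submodel fits:
options(projpred.check_conv = TRUE)

# Pass lme4 tuning parameters through cv_varsel()'s `...`:
cvvs <- cv_varsel(
  refm_fit,
  cv_method = "kfold",
  K = 5,
  nclusters = 3,
  control = lme4::lmerControl(optimizer = "Nelder_Mead")
)
```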

Then, with control = lme4::lmerControl(optimizer = "Nelder_Mead") and some other deviations from your reprex (in particular, nclusters = 3 and the “written out” iso_country_code:village interaction), I ran K-fold CV again with the default of nterms_max = NULL: first with the default of search_terms = NULL, which revealed that we can indeed observe the “jumpy” behavior at the last (full) submodel size that I mentioned above. Hence, afterwards, I ran K-fold CV again, but now forcing the two group-level terms to be selected first (via search_terms, see the sketch below) and still using the default of nterms_max = NULL. In that case, I did not observe a gap between the submodel ELPD curve and the reference model ELPD anymore.
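Forcing terms works by including them in every element of search_terms; here is a sketch with hypothetical population-level predictors x1 and x2 standing in for those from the reprex:

```r
# Every element contains both group-level terms, so the forward search has to
# include them right from the start:
search_terms_forced <- c(
  "(1 | iso_country_code) + (1 | iso_country_code_IA_village)",
  "(1 | iso_country_code) + (1 | iso_country_code_IA_village) + x1",
  "(1 | iso_country_code) + (1 | iso_country_code_IA_village) + x2",
  "(1 | iso_country_code) + (1 | iso_country_code_IA_village) + x1 + x2"
)
cvvs_forced <- cv_varsel(
  refm_fit,
  cv_method = "kfold",
  K = 5,
  nclusters = 3,
  control = lme4::lmerControl(optimizer = "Nelder_Mead"),
  search_terms = search_terms_forced
)
```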

My K-fold CV code is in kfold.R (5.4 KB), the plots are in kfold_cvvs3_full.pdf (6.4 KB), kfold_cvvs3_full_cvprops.pdf (9.5 KB), kfold_cvvs4_forcedGL.pdf (6.3 KB), and kfold_cvvs4_forcedGL_cvprops.pdf (9.2 KB).

The question is now: why is the group-level term (1 | iso_country_code) selected last, even though it leads to a jump in the ELPD towards the reference model (i.e., even though it seems to be relevant from a predictive point of view)? If I haven’t got anything wrong, this behavior means that in K-fold CV, this group-level term helps to improve out-of-sample predictive performance, but somehow it does not reduce the in-sample KL divergence during the forward search enough for it to be selected earlier. I guess this needs to be investigated further, but I’m currently lacking the time to do so, so I would leave this for future research.

At some point, I was wondering whether your reference model might be overfitting. However, when I refitted it with an R2D2 prior (once with parameters mean_R2 = 1 / 3.5, prec_R2 = 3.5, and once with mean_R2 = 1 / 5, prec_R2 = 5) and checked the loo() output, I didn’t get a markedly different ELPD estimate. So your reference model might be OK, but I would recommend investigating overfitting further (and perhaps also trying the R2D2M2 prior instead of “just” the R2D2 prior, see [2208.07132] Intuitive Joint Priors for Bayesian Linear Multilevel Models: The R2D2M2 prior), especially since I ran my quick overfitting check only for the N = 3000 observations from the reprex.
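The overfitting check was essentially the following (a sketch; refm_formula and the seed are placeholders, and the R2D2 prior is placed on the population-level coefficients):

```r
library(brms)

refm_fit_R2D2 <- brm(
  refm_formula,  # placeholder for the reference model's formula
  data = indicator_data,
  prior = prior(R2D2(mean_R2 = 1 / 3.5, prec_R2 = 3.5), class = "b"),
  seed = 734572  # hypothetical seed
)
loo(refm_fit_R2D2)  # compare the ELPD estimate to loo() of the original fit
```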

If you still observe a gap between the submodel ELPD curve and the reference model ELPD in your full-dataset analysis, it might also be worth trying to increase the value of K for the K-fold CV.
