Projpred: Non-Convergence of Predictive Performance with Reference Model

Hi again!

Apologies for yet another query, I seem to be a boomerang of projpred queries. Thanks so much for all the help so far @fweber144. It’s incredible to see such an active community!

I have been trying to fit a model using projpred. I have discussed the model I am working on here and here. I now have projpred running, and I now am beginning to generate results.

Non-convergence of predictive performance

The model does not appear to be converging with the reference model. I have looked through the documentation on non-convergence here. I have set refit_proj=TRUE, I have tried increasing ndraws_pred from 400 to 800. This is the result I am getting still. For the reference model, I have tried using more informative priors also, still no convergence.

I was wondering whether you have any advice for what else to try? I am a bit at a loss. I cannot understand why it would not converge to the reference model with all variables added.

Here are details of the model used:

Family: gaussian 
Link function: identity 

Formula: tva ~ hh_size + education + livestock_tlu + land_cultivated + 
    market_orientation + debts_have + off_farm_any + kitchen_garden + 
    number_income_sources + assisted_tillage + external_labour + 
    livestock_inputs_any + land_irrigated_any + use_fert + length_growing_period + 
    min_travel_time + gdl_country_shdi + (1 | iso_country_code) + 
    (1 | iso_country_code:village)
Observations: 25626
Projection method: traditional
CV method: K-fold CV with K = 5 and search included (i.e., fold-wise searches)
Search method: forward
Maximum submodel size for the search: 17
Number of projected draws in the search: 20 (from clustered projection)
Number of projected draws in the performance evaluation: 800

Thanks in advance if anyone has any ideas. Happy to send through the projpred results if that would be helpful. The results are about 10GB so happy to get in touch directly for anyone who wants access.

1 Like

I am increasing ndraws_pred to 2000 and running again. In the meantime, if anyone has any more suggestions they are always welcome!

Hi Léo,

No need to apologize. Use cases like yours help to improve projpred continuously.

For the Gaussian family that you have, this phenomenon is new to me, so I would have to debug the cv_varsel() run. Could you send me the code and the data?

1 Like

The predictor ranking(s) from ranking() might also help.

1 Like

Oh, wait. If I recall correctly, you are using custom search_terms, right? This in combination with the search being terminated before the full model is reached (due to CV Varsel Error: Infinite or missing values in 'x' - #8 by fweber144) could explain the non-convergence. Could you maybe try with the default of search_terms = NULL?

1 Like

Thanks very much!

No problem, I have updated the reprex to try and reproduce the issue. I have included the full dataset of 27,000 observations here. In the create_model.R file, you can refit the brms model, and subsample the dataset for speed. I am trying to run with 3,000 observations now just to see if I can reproduce the same issue.

I have the results from the full cv_varsel() run saved as an rda file too, would it be helpful for me to send that through too?

Okay will send that now!

Okay, so in the run I have presented here, I have only used search_terms to fix the grouping terms (i.e. (1 | iso_country_code) + (1 | iso_country_code:village). So all of the terms are still being used in the final model, which is why I was confused about the non-convergence. Could the fact that there is a multi-level term be affecting it?

I will try a rerun with search_terms=NULL :)

Thanks again!

1 Like

Here is the ranking object. Please not the variables included here are slightly different to those in the reprex. These are updated variables that I am using for my final analysis. The issue of non-convergence still occurred with the original variables present in the reprex.

$fulldata
 [1] "(1 | iso_country_code) + (1 | iso_country_code:village)"
 [2] "number_income_sources"                                  
 [3] "hh_size"                                                
 [4] "market_orientation"                                     
 [5] "use_fert"                                               
 [6] "land_cultivated"                                        
 [7] "land_irrigated_any"                                     
 [8] "education"                                              
 [9] "assisted_tillage"                                       
[10] "external_labour"                                        
[11] "length_growing_period"                                  
[12] "min_travel_time"                                        
[13] "debts_have"                                             
[14] "off_farm_any"                                           
[15] "gdl_country_shdi"                                       
[16] "livestock_tlu"                                          
[17] "livestock_inputs_any"                                   

$foldwise
     [,1]                                                      [,2]                    [,3]     
[1,] "(1 | iso_country_code) + (1 | iso_country_code:village)" "number_income_sources" "hh_size"
[2,] "(1 | iso_country_code) + (1 | iso_country_code:village)" "number_income_sources" "hh_size"
[3,] "(1 | iso_country_code) + (1 | iso_country_code:village)" "number_income_sources" "hh_size"
[4,] "(1 | iso_country_code) + (1 | iso_country_code:village)" "number_income_sources" "hh_size"
[5,] "(1 | iso_country_code) + (1 | iso_country_code:village)" "number_income_sources" "hh_size"
     [,4]                 [,5]       [,6]              [,7]                 [,8]              
[1,] "market_orientation" "use_fert" "land_cultivated" "land_irrigated_any" "education"       
[2,] "market_orientation" "use_fert" "land_cultivated" "land_irrigated_any" "external_labour" 
[3,] "market_orientation" "use_fert" "land_cultivated" "land_irrigated_any" "education"       
[4,] "market_orientation" "use_fert" "land_cultivated" "land_irrigated_any" "assisted_tillage"
[5,] "market_orientation" "use_fert" "land_cultivated" "land_irrigated_any" "external_labour" 
     [,9]               [,10]              [,11]                   [,12]                  
[1,] "external_labour"  "assisted_tillage" "min_travel_time"       "length_growing_period"
[2,] "assisted_tillage" "education"        "length_growing_period" "min_travel_time"      
[3,] "assisted_tillage" "external_labour"  "min_travel_time"       "length_growing_period"
[4,] "education"        "external_labour"  "length_growing_period" "min_travel_time"      
[5,] "assisted_tillage" "education"        "length_growing_period" "min_travel_time"      
     [,13]              [,14]              [,15]              [,16]                 
[1,] "debts_have"       "off_farm_any"     "gdl_country_shdi" "livestock_tlu"       
[2,] "off_farm_any"     "gdl_country_shdi" "debts_have"       "livestock_tlu"       
[3,] "debts_have"       "gdl_country_shdi" "off_farm_any"     "livestock_tlu"       
[4,] "off_farm_any"     "debts_have"       "gdl_country_shdi" "livestock_tlu"       
[5,] "gdl_country_shdi" "debts_have"       "off_farm_any"     "livestock_inputs_any"
     [,17]                 
[1,] "livestock_inputs_any"
[2,] "livestock_inputs_any"
[3,] "livestock_inputs_any"
[4,] "livestock_inputs_any"
[5,] "livestock_tlu"       

attr(,"class")
[1] "ranking"

@fweber, so I have run the reprex locally, and it seems to produce the same result. It took approximately 30mins to run, but imagine it could be run with a smaller dataset.

Hope this helps!

Thank you. The predictor rankings confirm that the custom search_terms work as intentioned, namely to put the multilevel terms first.

From the predictor rankings, we can see that apart from the two multilevel terms (which are merged into a single one here, but that’s not the point), the search goes through 16 non-multilevel terms. In the reference model, however, there are 17 non-multilevel terms. So the search is terminated before the full model is reached. Of course, as I said above, this is due to CV Varsel Error: Infinite or missing values in 'x' - #8 by fweber144, but currently, I don’t see a way to avoid that, apart from providing a custom divergence_minimizer function where you play around with lme4’s tuning parameters until the full model projection converges as it should. This is a tedious work, though (when I played around with your reprex the last time, I already tried to adjust lme4’s tuning parameters, but unsuccessfully). In theory, when running up to the full-model size (and not only until full-model size minus 1), the predictive performance curve could jump up to the reference model’s red dashed line at the last (full-model) size, so the custom search_terms in combination with the search being terminated before the full model is reached could explain the non-convergence. I have to admit, however, that such a jumpy behavior would be strange, too. So when I have the time, I will take a look at your (updated) reprex.

And yes: In general, multilevel terms make the whole problem a lot more complicated, more instable, and computationally heavier. Nevertheless, according to Projection Predictive Inference for Generalized Linear and Additive Multilevel Models, projpred should work for multilevel models. So you have the right to expect that it works ;)

Thanks :) But this is still with custom search_terms, right? I would be curious to see the search_terms = NULL results, but as far as I understood you, this is currently running.

Hi @fweber144 ,

Quick update.cI have upgraded to the University’s new HPC cluster, which has R 4.3 installed. I installed the most recent version of (master branch from GitHub) Interestingly when I a the model with search_terms=NULL and I don’t set nterms_max, I get this error:

Error in eigen(sigma, symmetric = TRUE) :
  infinite or missing values in 'x'
Calls: cv_varsel ... repair_re -> repair_re.merMod -> t -> <Anonymous> -> eigen
In addition: Warning message:
In parse_args_varsel(refmodel = refmodel, method = method, refit_prj = refit_prj,  :
  In case of the Gaussian family (also in case of the latent projection) and multilevel terms, the projection onto the full model can be instable and even lead to an error, see GitHub issue #323. If you experience this and may refrain from the projection onto the full model, set `nterms_max` to the number of predictor terms in the full model minus 1 (possibly accounting for submodel sizes skipped by custom `search_terms`).

I appreciate the added warning message, it wasn’t there in 2.6.0. I then try to run and I set nterms_max to N-1 as instructed in the warning message. I keep search_terms to null. It now runs successfully, but again still the non-convergence issue:

A quick question. My deadline for this analysis is getting slightly shorter (about 2 weeks). Just to be clear, I want you to spend any extra time on this, and I am not putting any pressure! I just want to know whether I should plan to use something else, or whether you think it will be possible to fix this convergence issue soon enough?

Again thanks so much for all the help!

1 Like

Thanks for the update. Given your short deadline, I would recommend to try something else. In view of the plenty of data you have, it might be worth trying out some frequentist method. Nevertheless, I will try to find out what’s going on in your case. But as you guessed correctly, this may take some time and I don’t think I’ll have it solved within 2 weeks.

1 Like

Thanks @fweber144 . Again, I really didn’t mean to sound pushy, it was just useful to know for planning purposes. I really appreciate all the help!

If I can help, I am more than happy to keep running these analyses to get to the bottom of the issue though, so please let me know if there is anything I can do to help :)

1 Like

Hi @fweber144,

Hope you’re well! Just thought I would give this quick update! So I am still experiencing the issues of non-convergence, but I wanted to do a bit more digging. I took the result from projpred (i.e. the ranking). I refit each of the submodels independently, adding one variable at a time. When i plot the result, it does seem to converge.

Not sure if I am misinterpreting the issue, or if this is useful for you. But I thought I would share just in case!

Thank you. Yes, this suggests that something with the projection is going wrong, i.e., that the lme4 fits to the reference model fit do not work as intended (I guess the lme4 estimation algorithm does not converge). This is also what I wanted to investigate as soon as I have a little more time to play around with the lme4 tuning parameters.

I now managed to dig deeper into this (more precisely, into your reprex from GitHub - l-gorman/projpred_issue_reprex: A reproducible example for an issue being encountered in the ProjPred Package). Sorry that it took me so long.

First of all, in the reprex, I “wrote out” the iso_country_code:village interaction (by creating a new column iso_country_code_IA_village via paste(indicator_data$iso_country_code, indicator_data$village, sep = "_")) because I discovered potential problems with : between grouping variables. Until this is fixed in projpred, I would recommend to follow that workaround.

Then I reduced your reprex to a varsel() call with a hold-out test dataset (consisting of a third of all observations) and tried out different ndraws_pred and nclusters_pred settings. This aimed at what is described at projpred: Projection predictive feature selection • projpred. You mentioned above that you already tried increasing ndraws_pred and nclusters_pred, but I wanted to check again with different values. It turned out that a very low value for ndraws_pred and nclusters_pred (in fact, I achieved this with nclusters = 3 and refit_prj = FALSE) indeed contributes to the gap between the submodel ELPD curve and the reference model ELPD. However, at the default of ndraws_pred = 400, there was still a gap between the submodel ELPD curve and the reference model ELPD and going from the default of ndraws_pred = 400 to nclusters_pred = S_ref %/% 3 (with S_ref = 4000 in your case) did not reduce the gap between the submodel ELPD curve and the reference model ELPD considerably. So I guess the default of ndraws_pred = 400 should be fine in your case (at least in the reprex, not necessarily for your full dataset). The code for my hold-out approach is in holdout.R (3.2 KB), the plots are in holdout1_nclusters3.pdf (5.8 KB), holdout2_ndrawspred400.pdf (5.8 KB), and holdout3_nclusterspred1333.pdf (5.8 KB).

Then, with K-fold CV, I played around with lme4’s tuning parameters because setting options(projpred.check_conv = TRUE) revealed that sometimes, lme4 indeed has convergence problems (which is what I meant above). Note that while investigating this, I discovered a bug in projpred:::check_conv() which has been fixed on GitHub: Fix typo in the convergence check for submodel fits resulting from `l… · stan-dev/projpred@058df68 · GitHub. In the case that I played around with, I was able to resolve the lme4 convergence issues by specifying control = lme4::lmerControl(optimizer = "Nelder_Mead"). However, in the K-fold CVs that I describe below (with the default of nterms_max = NULL), this did not resolve all of the lme4 convergence issues (especially not the numerous ones at the last and the second-to-last submodel size, which, however, is more or less clear due to projpred issue #323). Nevertheless, I would recommend to try control = lme4::lmerControl(optimizer = "Nelder_Mead") also for your full dataset. You might also want to play around with lme4’s tuning parameters yourself, but control = lme4::lmerControl(optimizer = "Nelder_Mead") is the best I could come up with (in principle, you could also create your own divergence_minimizer function which needs to be passed to projpred::init_refmodel(), and in such a customized divergence_minimizer function, you could adapt the lme4 tuning parameters conditionally on a preliminary lme4 fit).

Then, with control = lme4::lmerControl(optimizer = "Nelder_Mead") and some other deviations from your reprex (in particular, nclusters = 3, the “written out” iso_country_code:village interaction), I ran K-fold CV again with the default of nterms_max = NULL: First with the default of search_terms = NULL, which revealed that we can indeed observe the “jumpy” behavior at the last (full) submodel size that I mentioned above. Hence, afterwards, I ran K-fold CV again, but now forcing the two group-level terms to be selected first (and still using the default of nterms_max = NULL). In that case, I did not observe a gap between the submodel ELPD curve and the reference model ELPD anymore.

My K-fold CV code is in kfold.R (5.4 KB), the plots are in kfold_cvvs3_full.pdf (6.4 KB), kfold_cvvs3_full_cvprops.pdf (9.5 KB), kfold_cvvs4_forcedGL.pdf (6.3 KB), and kfold_cvvs4_forcedGL_cvprops.pdf (9.2 KB).

The question is now: Why is the group-level term (1 | iso_country_code) selected last, even though it leads to a jump in the ELPD towards the reference model (so even though it seems to be relevant from a predictive point of view)? If I don’t got anything wrong, this behavior means that in K-fold CV, this group-level term helps to improve out-of-sample predictive performance, but somehow it does not reduce the in-sample KL divergence during the forward search sufficiently so that this group-level term would be selected earlier. I guess this needs to be investigated further, but I’m currently lacking the time to do so, so I would leave this for future research.

At some point, I was wondering whether your reference model might be overfitting, but when I refitted it with an R2D2 prior (once with parameters mean_R2 = 1 / 3.5, prec_R2 = 3.5, but also with parameters mean_R2 = 1 / 5, prec_R2 = 5) and checked the loo() output, I didn’t get a markedly different ELPD estimate, so your reference model might be OK, but I would recommend to investigate overfitting further (and perhaps also try with the R2D2M2 prior instead of “just” the R2D2 prior, see [2208.07132] Intuitive Joint Priors for Bayesian Linear Multilevel Models: The R2D2M2 prior), especially since I ran my quick overfitting check only for the N = 3000 observations from the reprex.

If you still observe a gap between the submodel ELPD curve and the reference model ELPD in your full-dataset analysis, it might also be worth to try increasing the value of K for the K-fold CV.

2 Likes

Another update: We shortly discussed the question

in a group meeting, but without finding an answer to it. @n-kall suggested to exclude (1 | iso_country_code_IA_village) from search_terms and check how the remaining group-level term (1 | iso_country_code) behaves. I now tried this out: Then (1 | iso_country_code) gets selected first (without forcing it), but the predictive performance plot still shows the gap. So I guess both group-level terms are indeed relevant (from a reference-predictive point of view). I haven’t tried out whether including (1 | iso_country_code_IA_village) in search_terms and forcing (1 | iso_country_code) to be selected first would lead to (1 | iso_country_code_IA_village) being selected last (with the same gap in predictive performance and then a jump at the end) because I currently don’t see which new insights this might provide, but perhaps you might want to try that, too.

At some point, I thought that the small number of projected draws (i.e., nclusters = 3) might have been responsible for the strange behavior of the search (selecting (1 | iso_country_code) last), but when I tried nclusters = 100 and later ndraws = 400 (in varsel(), because above, the CV folds always agreed on the position of the group-level terms in their predictor rankings), (1 | iso_country_code) was still selected last. So that doesn’t seem to be the cause.

I’ll try to create a reprex on projpred’s GitHub issue tracker and post here when I have done that.

By the way, I also checked my comment

# According to my notes, `cvvs1` and `cvvs2` were the settings that I used for
# playing around with lme4's tuning parameters, but now that I re-ran `cvvs1`
# (on a different machine, though), I don't experience lme4 convergence problems
# anymore, so perhaps my notes were not correct. I am pretty sure, though, that
# at some point, `control = lme4::lmerControl(optimizer = "Nelder_Mead")`
# resolved the lme4 convergence problems.

from kfold.R (uploaded above) on that other machine again and there, I was indeed able to reproduce that the Nelder-Mead optimizer (used for cvvs2) avoided the lme4 convergence issues from cvvs1.

2 Likes

Hi @fweber144,

I cannot thank you enough for this! So I haven’t tried without including the group-level terms, as for my analysis it makes more sense to ensure they are fixed (I will explore adding and removing them another time).

I tried what you recommended. So rerwriting the group terms to exclude the semi-colon, changing the optimiser, and setting nterms_max to NULL. I reran on the whole dataset and it now seems to converge as expected:

3 Likes

Great! Thanks for the feedback :)