Conflicting inference from k-fold and LOO-CV comparison

I am conducting a longitudinal multivariate regression in brms of three count outcomes - days of heroin, cannabis, and alcohol use in the previous 28-day measurement period - based on days of amphetamine use at the start of treatment. I have run three models. The first is a Gaussian regression with the formula

fit_mvr_lin_ao_nm <- bf(mvbind(cann28, alc28, heroin28) ~ ats_baseline*yearsFromStart + (1|p|encID))

# pareto-k
Computed from 8000 by 1906 log-weights matrix.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 1.4]).
Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1776  93.2%   86      
   (0.7, 1]   (bad)       110   5.8%   <NA>    
   (1, Inf)   (very bad)   20   1.0%   <NA>  

The second is a linear beta-binomial with the formula

bform_betabinom_mvr <- bf(mvbind(cann28, alc28, heroin28) | trials(set) ~ ats_baseline*yearsFromStart + (1|p|encID))

# pareto-k
Computed from 8000 by 1906 log-weights matrix.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 2.2]).
Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1813  95.1%   104     
   (0.7, 1]   (bad)        92   4.8%   <NA>    
   (1, Inf)   (very bad)    1   0.1%   <NA>

The third is a spline beta-binomial with the formula

bform_binom_mvr_s <- bf(mvbind(cann28, alc28, heroin28) | trials(set) ~ s(yearsFromStart, by = ats_baseline) + (1|p|encID))

# pareto-k
Computed from 8000 by 1906 log-weights matrix.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.7]).
Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     1820  95.5%   87      
   (0.7, 1]   (bad)        84   4.4%   <NA>    
   (1, Inf)   (very bad)    2   0.1%   <NA> 
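
(For reference, each of these formulas was passed to brm(); a minimal sketch for the linear beta-binomial, with illustrative settings rather than my exact call, is below.)

library(brms)

# Illustrative fit of the linear beta-binomial model; beta_binomial() is the
# built-in family in recent brms versions, and save_pars(all = TRUE) is needed
# later for moment matching during loo()
fit_mvr_betabinom_ao_nm <- brm(
  bform_betabinom_mvr,
  data      = workingDF,
  family    = beta_binomial(),
  chains    = 4, cores = 4,
  save_pars = save_pars(all = TRUE)
)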

When I compare them using loo_compare(), the k-fold comparison (10 folds) looks like this:

                          elpd_diff se_diff 
fit_mvr_betabinom_ao_nm        0.0       0.0
fit_mvr_betabinom_s_ao_nm    -10.9      21.8
fit_mvr_lin_ao_nm         -11354.0     136.1

Both beta-binomials are much better than the Gaussian, but neither beta-binomial is clearly better than the other.
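
(For context, the k-fold comparison was produced along these lines, with randomly assigned folds; treat the exact arguments as indicative rather than my literal calls:)

# 10-fold CV with random folds for each model, then compare elpd estimates
kf_lin  <- kfold(fit_mvr_lin_ao_nm,         K = 10)
kf_bb   <- kfold(fit_mvr_betabinom_ao_nm,   K = 10)
kf_bb_s <- kfold(fit_mvr_betabinom_s_ao_nm, K = 10)
loo_compare(kf_bb, kf_bb_s, kf_lin)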

However, when I compare the re-loo'd models via LOO-CV, the comparison looks like this:

                          elpd_diff se_diff 
fit_mvr_betabinom_ao_nm        0.0       0.0
fit_mvr_betabinom_s_ao_nm    -74.3      14.8
fit_mvr_lin_ao_nm         -11386.7     135.4

Here the linear beta-binomial is a pretty clear winner over the spline.

Which model should I trust?

Note that you cannot readily compare the probability masses in the likelihood of a beta-binomial model with the probability densities in the likelihood of a Gaussian model. Doing this comparison properly would require some computation to figure out how much probability mass the Gaussian model is, in effect, allocating to each outcome (for example, by integrating the pdf from y - 0.5 to y + 0.5).
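
For example, here is a rough sketch of that computation for a single count outcome, assuming you have extracted posterior draws of the observation-level means (mu_draws, a draws-by-observations matrix) and residual SD (sigma_draws, one value per draw) from the Gaussian fit:

# Probability mass the Gaussian effectively assigns to the observed integer y[i]:
# the density integrated over (y[i] - 0.5, y[i] + 0.5)
discretized_loglik <- function(y, mu_draws, sigma_draws) {
  ll <- matrix(NA_real_, nrow(mu_draws), length(y))
  for (i in seq_along(y)) {
    ll[, i] <- log(
      pnorm(y[i] + 0.5, mu_draws[, i], sigma_draws) -
        pnorm(y[i] - 0.5, mu_draws[, i], sigma_draws)
    )
  }
  ll  # comparable to the log_lik() matrix of a discrete-outcome model
}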

With respect to the two beta-binomial models, both comparisons suggest that the non-spline model might be better, and it's not surprising that LOO would be more confident in this than 10-fold CV.

In your position, I would conclude that the linear model is better for prediction of new observations than the spline model, but I would not conclude without further investigation that either of them is better than an appropriately discretized Gaussian model. On the other hand, a discretized Gaussian is hacky, and I might abandon it anyway as clunky and difficult to work with as a generative model for the data.

Thanks for answering @jsocolar. A while ago I wondered if you could compare Gaussian to binomial/beta-binomial models, and in this notebook @avehtari demonstrates that in fact you can, so long as the outcome data are the same (i.e. positive integers/counts). So you think the linear is better? That's what it would seem. I just wondered if there were circumstances where k-fold is more reliable/less biased than LOO-CV.

Makes sense!

If the folds in k-fold CV are chosen at random, then k-fold CV is likely being used as an approximation to LOO-CV, and the LOO-CV estimate should be more precise and preferred when available. There are other situations where the data has (or potentially has) some natural structure such that leaving out non-random larger folds better tests performance on the prediction task of interest (for example, leaving out entire spatial blocks, or entire levels of a random effect).
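
In brms terms the distinction is roughly the following (a sketch, assuming a fitted model object called fit and the encID grouping variable from above):

# Random folds: approximates LOO-CV, so PSIS-LOO (when its diagnostics are clean)
# is the more precise estimate of the same quantity
kf_random  <- kfold(fit, K = 10)

# Grouped folds: each fold holds out whole levels of encID, which instead targets
# prediction for new, previously unseen groups
kf_grouped <- kfold(fit, K = 10, folds = "grouped", group = "encID")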


@jsocolar Do you worry that about 5% of the Pareto k values in both beta-binomial models are > 0.7 and that 1-2 are > 1?

@kholsinger I do worry about this, but the OP said he’d re-loo’d these models:

However, when I compare the re-loo'd models via LOO-CV, the comparison looks like this:

I also worried. But yes, I moment-matched and then re-loo'd these models. After re-looing, all Pareto k values were below 0.7. I just worried that the re-looing had somehow made the LOO-CV comparison less reliable than the k-fold, given the discrepancy in inference between the two methods.
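
(For reference, the moment matching and re-looing were along these lines; the exact arguments may have differed slightly:)

# moment_match = TRUE adjusts the importance sampling for moderately high Pareto k;
# reloo = TRUE refits the model without each remaining problematic observation,
# giving exact LOO for those points (requires save_pars(all = TRUE) at fitting time)
loo_bb   <- loo(fit_mvr_betabinom_ao_nm,   moment_match = TRUE, reloo = TRUE)
loo_bb_s <- loo(fit_mvr_betabinom_s_ao_nm, moment_match = TRUE, reloo = TRUE)
loo_compare(loo_bb, loo_bb_s)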

I missed the re-looing. Sorry about that. I am no expert on brms or loo, but my understanding is that reloo() computes exact cross-validation, meaning that you don't need to worry about reliability (unless you're worried about the reliability of Pareto-smoothed importance sampling compared to brute-force leave-one-out cross-validation).

First, one model having better performance than the others does not yet mean you can trust it. You need to do predictive checks and compare against other information, such as your causal assumptions, too.

If you did K-fold-CV with random folds as an approximation to LOO-CV, then the re-loo'd LOO-CV comparison is more accurate. If you did K-fold-CV with grouped folds as an approximation to leave-one-group-out CV, then K-fold-CV is more accurate for that purpose.

Preferably, an ordinal model with correlations. If implementing that model is too difficult, you may also compare the conclusions for the quantity of interest from the two above-mentioned models; if there is no difference, then the ordinal model with correlations would probably not change the conclusions either. You can report all this. You might also be able to slightly improve the normal model by truncating it to the interval of the target values (extended by -0.5 and +0.5, so that each bin has width 1, as in my case study).
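
Here is a sketch of how that truncation could be written in brms, assuming the outcomes run from 0 to 28 days so that the bounds are -0.5 and 28.5 (the formula object name is just illustrative):

# Truncated Gaussian: each integer outcome value effectively gets a bin of width 1
bform_lin_trunc <- bf(
  mvbind(cann28, alc28, heroin28) | trunc(lb = -0.5, ub = 28.5) ~
    ats_baseline * yearsFromStart + (1 | p | encID)
)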

Ah, I was hoping either model comparison or the diagnostics would trump the other. Never simple, is it?

Yes, an ordinal regression with correlations would be ideal. I have let my manual Stan skills atrophy over the last few years. I'd love to take that line of enquiry up again, but it seems like a long-term project of self-improvement rather than a short-term fix for this project. I have to present this analysis in a conference poster in early November, so I will just have to make do with what I have so far.

I will follow your advice. Thank you again.

While we’re here, could you tell me if I have coded the PIT-ECDF calibration plot above correctly? I can’t tell whether it looks that way because the Gaussian regression is inappropriate for the outcome measure or because of a simple coding mistake on my part. In your case study you created a function to process what was a discrete outcome. I assume the syntax is simpler when the outcome is treated as numeric, but there was a weighting argument in that function that I have not added here.

# Folded PIT-ECDF: compare |y - median(y)| with the same folding of the predictions
ppc_pit_ecdf(y = abs(workingDF$psych - median(workingDF$psych)),
             yrep = abs(posterior_predict(modFit)[, , "psych"] - median(workingDF$psych))) +
    labs(x = "LOO-PIT for folded predictions")

workingDF$psych is the vector of the outcome variable in the dataset, and posterior_predict(modFit)[,,"psych"] is the matrix of posterior predictions for the psych outcome taken from the brms multivariate regression object (indexed via an array).
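
My guess is that the weighting would come in via PSIS importance weights computed from the pointwise log-likelihood, something like the sketch below (untested; the resp argument and object names just mirror the model above, and the same abs(... - median(...)) folding could be applied to y and yrep before computing the PIT):

library(loo)

# Pointwise log-likelihood and posterior predictions for the psych response
ll   <- log_lik(modFit, resp = "psych")            # draws x observations
yrep <- posterior_predict(modFit, resp = "psych")
y    <- workingDF$psych

# PSIS weights for leave-one-out; the log importance ratios are -log_lik
# (passing r_eff from loo::relative_eff() would refine the diagnostics)
psis_fit <- psis(-ll)
w        <- exp(weights(psis_fit))  # normalised weights, one column per observation

# LOO-PIT: weighted probability that a replicated value is <= the observed value
pit <- colSums(w * sweep(yrep, 2, y, "<="))
ppc_pit_ecdf(pit = pit) + labs(x = "LOO-PIT")

Is that roughly the right idea?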

@vehtari I have mixed this post up with the many others I have running simultaneously. These last few replies (the ones below the reply I just deleted) belong to this thread dealing with ordinal vs Gaussian regression for health outcomes, not linear vs spline for substance-use outcomes. My apologies. I’m still interested in whether I got the syntax for the Gaussian PIT-ECDF calibration plot right, though.