Lower loo with interaction effects

I’ve fitted ordinal regression models with stan_polr and performed model comparison with loo. In most cases the selection makes sense, but when I include models with interaction effects, these exhibit lower elpd. This surprises me, because my understanding is that, for nested models, the more complex model rarely obtains a lower elpd; the additional parameters in the complex model are typically estimated as having little effect and therefore do little in terms of predictive ability.

What is likely causing this? As an example, one model with interactions obtained an elpd around 45 lower (se_diff = 9) than its no-interactions counterpart. I see four possibilities:

  1. I’ve picked a bad prior. The stan_polr function only permits an R2 prior, and I set it to 0.5 on the median. This is admittedly somewhat arbitrarily chosen, so I’m happy for any pointers on what might be more appropriate for ordinal regression models.

  2. The analysis isn’t doing what I think it’s doing, and the model without interaction effects isn’t actually nested within the model with interactions. Here is the script I’ve used to run the analysis:

# moddef holds one of the model formula strings shown below
fit <- stan_polr(as.formula(moddef), data = DF, method = "logistic",
                 prior = R2(0.5, what = "median"), init_r = 0.1, seed = 12346,
                 algorithm = "sampling")

where the model without interactions is

"Ans ~ Q + Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"

and the model with interactions is

 "Ans ~ Q * (Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + 
	   Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan)"
  3. Local optimum. All chains could have converged to some local optimum of the posterior that’s actually a poor fit. This seems somewhat unlikely with four chains, and all interaction coefficients appear to be within two MAD_SDs of zero, indicating that they have been estimated as having little effect rather than some (incorrect) effect that pulls the predictive ability way off.

  4. I (still) misunderstand LOO-CV and must (once again) change my expectations about its behavior for nested models.

Some info about the data and analysis: The analysis focuses on farmer questionnaire data and attempts, among other things, to rank responses to six questions (Q), taking into account county (Lan) and hectares growing different crops (the rest of the explanatory variables). The data in DF comprises 1332 observations. I’m a bit hesitant to publish the data here in its entirety because I need to think about how to make sure none of the respondents are identifiable from the data. (Unlikely, and even more unlikely anyone would care, but better safe than sorry.)
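
A minimal sketch of how such a comparison is typically run with rstanarm (fit_main and fit_inter are placeholder names for the two stan_polr fits, not the objects above):

# Compare the two fits with PSIS-LOO
loo_main  <- loo(fit_main)        # model without interactions
loo_inter <- loo(fit_inter)       # model with interactions
loo_compare(loo_main, loo_inter)  # reports elpd_diff and se_diff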


I think this is the key point. If you add additional parameters that don’t add predictive ability then loo will penalize this. Higher elpd is good, so it makes sense to have a lower elpd if loo “thinks” your model has complexity that doesn’t help predictive performance.

Thanks jonah. This used to be my expectation, but e.g. this post by avehtari in response to my previous confusion made me change my understanding: Compare with exact LOO for observations with too high Pareto k diagnostics
Confusion continues.


Hmm, I agree with what @avehtari was saying in that thread you linked to, but I would think it’s still possible to see ELPD decrease like this in some cases. Here you are introducing a lot of interactions (it looks like you’re interacting Q with every other variable in the model, right?).

But maybe @avehtari has a clearer idea of why you’re seeing a lower ELPD. @avehtari Can you help us reconcile this result with what you were saying in the post @tomli linked to?

Can you post the full loo output? I’d like to see p_loo and the diagnostics. Also tell us the total number of parameters in each model (this can’t be seen directly from the formulas as it depends on the number of levels in the factor predictors).


Thanks @avehtari for having a look. Here is the loo output from the simpler model:

         Estimate   SE
elpd_loo  -2030.5 15.3
p_loo        21.4  0.9
looic      4061.1 30.7
------
Monte Carlo SE of elpd_loo is 0.1.

It had no Pareto shape estimates >0.5. Here is the output from the one with interaction effects, after refitting with exact loo for the 14 observations with Pareto shape >0.7:

         Estimate   SE
elpd_loo  -2075.3 17.3
p_loo        90.9  7.0
looic      4150.6 34.5
------
Monte Carlo SE of elpd_loo is 0.4.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1307  99.2%   320       
 (0.5, 0.7]   (ok)         11   0.8%   124       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>    

The simpler model has 18 coefficients plus the 4 estimated cut-offs between the ordinal categories. The model with interaction effects has an additional 65 parameters. Lan and Q are categorical, all X_ha are continuous predictors. R-hat and n_eff look good for both models.
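
For context, the exact-LOO refits mentioned above can be requested through rstanarm’s k_threshold argument, roughly as in this sketch (fit_inter is a placeholder name for the interaction-model fit):

# loo() refits the model once for each observation with Pareto k above the threshold
loo_inter <- loo(fit_inter, k_threshold = 0.7)
print(loo_inter)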

Diagnostics look good. High p_loo compared to the total number of parameters hints that you have very wide fixed priors, in which case it is possible that the more complex model has worse elpd_loo. When I’ve written that the more complex model doesn’t necessarily perform worse, I’ve assumed priors that behave sensibly when more and more effects are added. See, e.g., Student - Models for regression coefficients for a related normal regression model. You could similarly try prior predictive simulations to check how weak your priors are.

If there are only a small number of predictors and interactions, the careful choice of priors is not that important, and it’s also possible that a better prior here doesn’t help if there really are no interactions. The current prior you use assumes the main effects and interactions have independent priors, so that interaction effects have the same prior variance as the main effects. It’s likely that interaction coefficients are big only if the corresponding main effects are big. This kind of behavior could be included with inheritance priors. An inheritance prior combined with something like the regularized horseshoe prior would also state that most interaction effects are small even if the corresponding main effects are big.

stan_polr has limited prior support, and the R2 prior currently implemented doesn’t work well with a larger number of coefficients (in the current version the degrees of freedom are the same as the number of coefficients, which is not a good choice beyond a few coefficients). Pinging @paul.buerkner as he knows what would be possible with brms (although brms doesn’t yet have a better R2 prior or an inheritance prior), and this is a relevant example for his research.
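
As an illustration of the prior predictive simulation idea, here is a minimal sketch with brms (the shortened formula and the normal(0, 1) prior are placeholders to probe, not recommendations):

# Draw from the prior only and inspect the implied outcome distribution
library(brms)
prior_fit <- brm(Ans ~ Q * (Vall_ha + Spann_ha + Lan),      # reduced formula for speed
                 data = DF,
                 family = cumulative(link = "logit"),
                 prior = prior(normal(0, 1), class = "b"),   # candidate prior to probe
                 sample_prior = "only",                      # ignore the likelihood
                 chains = 2, iter = 1000, seed = 1)

# Compare outcome categories simulated under the prior with the observed ones
# (use nsamples instead of ndraws in older brms versions)
pp_check(prior_fit, type = "bars", ndraws = 50)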


brms supports the horseshoe prior, but generally the support for multivariate priors on regression coefficients is still limited.


I refitted the model with brms. With the two caveats that (1) there were two divergent transitions after warmup, even when using adapt_delta = 0.99, and (2) there were 9 bad and 1 very bad Pareto k diagnostic values, here are the loo outputs:

         Estimate   SE
elpd_loo  -2031.0 15.2
p_loo        29.1  1.8
looic      4062.0 30.4
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1308  98.2%   3115      
 (0.5, 0.7]   (ok)         14   1.1%   9908      
   (0.7, 1]   (bad)         9   0.7%   570       
   (1, Inf)   (very bad)    1   0.1%   1777  

This has elpd very close to the -2030.5 obtained for the model without interaction effects above, and (unsurprisingly) @avehtari was right about the behavior under a more sensible prior. Is there any way to automate refitting for Pareto shape-flagged observations in brms, similar to rstanarm?

Here are the model and tuning specifications in case someone has suggestions for tweaks:

fitt<-brm("Ans ~ Q * (Lan +  Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha)",
             data=DF,
             family = cumulative(link = "logit", threshold = "flexible"),
             prior = prior(horseshoe(), class = "b"),
             control = list(adapt_delta = .99, max_treedepth=15 ),
             iter= 10000)

Check out ?reloo.
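
A rough sketch of what that looks like (the exact interface may vary between brms versions; see ?reloo):

# Refit once per observation with high Pareto k and replace its loo contribution
loo_inter <- loo(fitt)
loo_inter <- reloo(fitt, loo = loo_inter, k_threshold = 0.7)

# recent brms versions also accept the shortcut
# loo(fitt, reloo = TRUE)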


Thanks! Will take 3 days to rerun 10 times, I think

Ouch. I think there is still something suspicious about how long it takes to sample, as your data is not very big.

Interesting, this is good to know. @avehtari have you talked about this with @bgoodri? I thought the whole point of the R2 prior was for lots of coefficients!

OK, maybe. It took around 6 h to run 10000 iterations with adapt_delta = 0.99 and maximum tree depth 15. I had plenty of divergent transitions and exceeded tree depths when running with the default settings. Without knowing the exact behavior of the horseshoe prior, I think there is a lot of uncertainty here in the latent parameters, making many coefficients jump between having close to zero effect and some effect. My experience is that such behavior causes some issues in the sampling.

Yes. There was no conclusion for immediate action. @paul.buerkner and I are working on something better, but that will take a long time.


Do you mean 10000 posterior draws in total, or 10000 iterations per chain after warmup?

Looking at the effective sample sizes in the loo diagnostics, it seems you could do with 4 chains and 1000 iterations per chain after warmup, which would already make things faster.
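
Concretely, that could look something like this (a sketch, assuming update() is used to rerun the existing fit; iter = 2000 with warmup = 1000 gives 1000 post-warmup iterations per chain):

# 4 chains x 1000 post-warmup iterations = 4000 draws, usually plenty for loo
fitt_small <- update(fitt, chains = 4, iter = 2000, warmup = 1000)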

adapt_delta = 0.99 hints at problems in the posterior, which can be due to identifiability problems when using unnecessarily wide priors, but I’m not an expert on this model and your data, so I can’t right away say how to modify your priors.

Divergent transitions are worse as they indicate possible bias in the estimates. Exceeded tree depths are not that bad, as they just indicate potentially higher autocorrelation.

You reported long sampling time also without horseshoe, so I assumed this is not about the horseshoe.

10000 in total, 5000 after warmup, with 4 parallel chains. So the total number of posterior draws used for inference is 20000. I increased iterations because ESS was a bit low for some parameters with the default total of 2000 per chain. In the runs with 10000 iterations and 4 chains, the lowest Bulk ESS is ~4000 and the lowest Tail ESS is ~1300, but most Bulk ESS values are close to 20000.

The long sampling time was for the horseshoe example, which, unlike the R2 prior implemented in rstanarm, required a higher adapt_delta. And sorry, I realize my reply was written in a bit of haste and was probably unclear. By latent parameters I meant the local and global shrinkage parameters. My guess is that the joint posterior includes both very small and at least moderate values of the local shrinkage parameters, which would make individual coefficients take on both very small and moderate effects. Most visibly, I believe this can be seen in the bimodal marginal posterior of the coefficient with the low ESS alluded to above, for which the left mode reflects posterior regions where the focal variable has a pronounced effect and the right mode reflects regions where it does not. It’s not surprising the sampler has some trouble exploring both modes.

I wanted to plot this as pairs with the corresponding local shrinkage parameter, but I don’t think brms prints the prior’s shrinkage parameters, right?
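
If the shrinkage parameters are stored in the draws at all, something along these lines might work (the parameter names, including hs_local, are guesses and need to be checked against your brms version):

# List the stored parameter names; look for local shrinkage parameters,
# possibly named like "hs_local[...]" (use parnames(fitt) in older brms)
variables(fitt)

# If present, pair a coefficient with its (guessed) local shrinkage parameter
mcmc_plot(fitt, variable = c("b_Vall_ha", "hs_local[3]"), type = "pairs")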

You could probably do well with ~400 ESS, which would save you some computation time.

Good to know.

That’s also useful to know as it’s not always the case, and now we have one more example with challenges. You could try setting the horseshoe() parameters informatively instead of using the default values.
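
For example (a sketch; the par_ratio value is just an illustrative guess at the fraction of coefficients expected to be non-negligible):

# horseshoe() lets you encode how many of the ~83 coefficients you expect to matter
prior_hs <- prior(horseshoe(df = 1, par_ratio = 0.1), class = "b")
# then pass prior = prior_hs to brm()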

I’m a little hesitant to upload the data in its entirety here because it’s questionnaire data and I need to think about making the respondents non-identifiable. But if you think it would be a useful example for development purposes, I can send it to you.