LOO and bayes_R2 (seem to) contradict posterior predictive check

Hi,

I am trying to model data from a psych study on accuracy in two tasks that involve judging others. Each participant completed both tasks, but the targets are task-specific. I have a ground truth and the participant's guess, so I specified my dependent variable as the absolute difference between the truth and the guess. I took two different modeling approaches and would like to compare them.
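In code, the outcome is computed roughly like this (dat, truth, and guess stand in for my actual data frame and column names; for the ordered beta model the result additionally has to be mapped to [0, 1]):

dat$abs_d <- abs(dat$truth - dat$guess)  # absolute distance between ground truth and guess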

First, a multilevel model with a Gaussian family:

bf(abs_d ~ task*confidence + continuous_trait + (task*confidence|i|id) + (confidence|q|target),
    sigma ~ task)
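Schematically, the fitting call looks like this (dat stands for my data frame; priors are omitted and the sampler settings are just placeholders):

library(brms)
Mnorm <- brm(
  bf(abs_d ~ task*confidence + continuous_trait +
       (task*confidence|i|id) + (confidence|q|target),
     sigma ~ task),
  data = dat, family = gaussian(),
  chains = 4, cores = 4   # illustrative settings only
)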

Second, a multilevel model with Ordered Beta regression (a custom family described here):

bf(abs_d ~ 0 + Intercept + 
                 task*confidence + continuous_trait + (task*confidence|i|id) + (confidence|q|target),
         phi ~ task)
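I fit it via the ordered beta custom family; with the ordbetareg package wrapper the call would look roughly like this (a sketch only, since the exact arguments depend on the package version and on how the custom family is set up):

library(ordbetareg)
# depending on the package version, modeling phi may also require a phi_reg argument
Mbeta <- ordbetareg(
  formula = bf(abs_d ~ 0 + Intercept +
                 task*confidence + continuous_trait +
                 (task*confidence|i|id) + (confidence|q|target),
               phi ~ task),
  data = dat,
  chains = 4, cores = 4   # illustrative settings only
)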

To evaluate and compare the models I used bayes_R2, loo_compare, and pp_check. The first two are consistent, but pp_check tells a different story. I include my results here:

Normal

bayes_R2(Mnorm)
    Estimate  Est.Error      Q2.5     Q97.5
R2 0.3791823 0.01843937 0.3425403 0.4137769

Beta

bayes_R2(Mbeta)
    Estimate  Est.Error      Q2.5     Q97.5
R2 0.3645287 0.01736945 0.3287989 0.3968946

loo_compare(loo(Mnorm), loo(Mbeta))
      elpd_diff se_diff
Mnorm       0.0     0.0
Mbeta     -46.5    22.4

I believe the Beta model fits the data better, but I cannot dismiss the loo and R2 results. Could you please help me make sense of this?


Visually it seems to fit much better, yes. If you look at the 99% confidence interval for the difference, you'd get [-104.2024, 11.2024], so one could argue that there's not that much of a difference between them, perhaps?
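(That interval is just the normal approximation around the reported difference: -46.5 ± 2.576 × 22.4 ≈ [-104.2, 11.2].)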

Do you get high Pareto-k warnings? Can you show the loo output for each model separately? The models have a hierarchical structure flexible enough that it's possible the fast PSIS-LOO approximation fails. It would also be useful to look at the pointwise elpd values, as se_diff is quite high.
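Something along these lines would show that information (a sketch, assuming the fitted objects are called Mnorm and Mbeta):

loo_norm <- loo(Mnorm)
loo_beta <- loo(Mbeta)
print(loo_norm); print(loo_beta)   # per-model loo output and Pareto-k diagnostics
# pointwise elpd contributions and their difference
elpd_pw <- loo_norm$pointwise[, "elpd_loo"] - loo_beta$pointwise[, "elpd_loo"]
plot(elpd_pw)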


Thank you both, Richard and Aki. The loo output is as follows:
Beta

Computed from 2000 by 1392 log-likelihood matrix

         Estimate   SE
elpd_loo    144.4 33.1
p_loo       111.3  4.8
looic      -288.7 66.3
------
Monte Carlo SE of elpd_loo is 0.3.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1385  99.5%   182       
 (0.5, 0.7]   (ok)          7   0.5%   245       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Normal

Computed from 2000 by 1392 log-likelihood matrix

         Estimate   SE
elpd_loo    190.9 24.3
p_loo       108.8  4.0
looic      -381.8 48.5
------
Monte Carlo SE of elpd_loo is 0.3.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1384  99.4%   647       
 (0.5, 0.7]   (ok)          8   0.6%   231       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Hi @hector -

Thanks for sharing your code and for being thoughtful about models! A few points to consider:

  1. The default PPC (pp_check) plot for ordbetareg is not correct. I discuss it in another post here: Ordered Beta Regression model: is a custom posterior predictive check plot necessary? - #10 by saudiwin. I will be adding a correct PPC plot to the ordbetareg package in the next update, hopefully soon, but of course that depends on my time availability :).

  2. As far as R^2 goes, it's not really supposed to be a model selection criterion. It's also based, at least indirectly, on estimating the mean or average, which OLS is always going to be very good at. In any case, R^2 is meant to give you some general information about model fit, and perhaps to compare different linear model specifications, but LOO is really better suited for that. So,

  3. The LOO comparison is obviously helpful, but again it is something that should be interpreted with caution. The LOO papers (re: @avehtari) start from the assumption that we don't know the correct model. If you have a lot of background information suggesting that Beta regression should be optimal (which I think is entirely plausible, as I cover in my paper), then you shouldn't rely solely on LOO. When I simulate data from the ordbetareg distribution, LOO does not always identify the correct model vis-a-vis OLS.

The reason for this has to do with the number of observations near the boundaries. The more of these there are, and the more skewed the data are, the worse the fit OLS will give you. But it's not really a wise idea to pick a model based only on such sample characteristics. As I show in my blog post What To Do (And Not to Do) with Modeling Proportions/Fractional Outcomes | Robert Kubinec, the data can be almost Normal and OLS can still produce weird predictions that violate the bounds, etc.
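If you do want to see how much of your sample actually sits on the bounds, a quick check (assuming the outcome has been scaled to [0, 1]) is just to count them:

mean(dat$abs_d == 0)  # share of observations exactly at the lower bound
mean(dat$abs_d == 1)  # share of observations exactly at the upper bound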

To sum up, a relatively small difference in LOO (the CIs are close to overlapping) should be of little concern when you have strong prior knowledge that your outcome is bounded. OLS will not respect the bounds, and LOO is a relatively crude criterion for making really fine-grained distinctions between models. The PPC plot shows misfit, but that's an issue with the plot, not the model.

I hope that is helpful to you!

Thank you for the thorough response @saudiwin! I will take some time to digest this post and the others you linked. However, right off the bat, I'm not so sure how bounded my outcome is. I operationalized the distance (truth - guess) as absolute to fit the logic of beta regression better, but I have also modeled it with OLS as a signed distance (bounded from -100 to 100). In the latter case, even if OLS does not respect the bounds, the predictions may still be sensible (e.g., people might have chosen a number even greater than 100 if they could have, although the ground truth was 0; I'm not sure whether that makes the outcome censored or truncated).

For reference, this is how my plots look with the customized PPC, partitioned into discrete and continuous parts, for the Mbeta model:


I believe it conveys very similar information to the default PPC with regard to the match between model and data.
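Roughly, the idea of the partitioned check is to compare the predicted and observed proportions at the bounds separately from the density of the interior values. A minimal hand-rolled sketch (with abs_d assumed to be on the [0, 1] scale) would be something like:

yrep <- posterior_predict(Mbeta, ndraws = 200)   # draws x observations matrix
y <- dat$abs_d
# discrete part: replicated vs. observed share of observations at each bound
hist(rowMeans(yrep == 0), main = "P(y = 0)"); abline(v = mean(y == 0), lwd = 2)
hist(rowMeans(yrep == 1), main = "P(y = 1)"); abline(v = mean(y == 1), lwd = 2)
# continuous part: density overlay restricted to the interior observations
# (the replicated columns can still contain boundary draws, so this is only a rough check)
interior <- y > 0 & y < 1
bayesplot::ppc_dens_overlay(y[interior], yrep[, interior])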

No high Pareto k-hats, so that's good.

When I first responded I had not checked the actual model, but now that you mention that the PPC should be partitioned into discrete and continuous parts, I realize that the model has those discrete parts, which makes the direct loo comparison of the normal model and the ordered beta regression model invalid! This is also mentioned in the CV-FAQ, and loo_compare() tries to recognize these cases and give warnings, but I guess you didn't get a warning here? Specifically, for the observations with values 0 and 1, the normal model computes log densities while the ordered beta model computes log probabilities. There are some special cases where continuous and discrete models can be made comparable, but I don't see that it would be possible here.
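One way to see this concretely is to look at how the pointwise elpd differences split between the boundary and interior observations, for example (a sketch, assuming the loo objects and the data are at hand):

elpd_norm <- loo(Mnorm)$pointwise[, "elpd_loo"]
elpd_beta <- loo(Mbeta)$pointwise[, "elpd_loo"]
boundary  <- dat$abs_d %in% c(0, 1)
sum((elpd_norm - elpd_beta)[boundary])    # contribution of the 0/1 observations to the difference
sum((elpd_norm - elpd_beta)[!boundary])   # contribution of the interior observations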

There is also LOO-R^2, which is supported by brms::add_criterion().
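For example (assuming the same fitted objects):

loo_R2(Mnorm)
loo_R2(Mbeta)
# or stored on the fit object:
Mnorm <- add_criterion(Mnorm, "loo_R2")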

For me, OLS means ordinary least squares regression (i.e., not Bayesian inference), so I have to check: do you mean least squares regression here, or are you possibly referring to a normal linear regression model with Bayesian inference? When you did use LOO, did you take into account the continuous-discrete issue I mentioned above, or could it be that you were doing an incompatible comparison there, too?

I'm assuming OLS means the normal linear model. The normal model can be made to respect the bounds by adding truncation; there are cases where that works fine, and it can also be a valid approximation for discrete observations (a case study is coming), but the boundary violation is not the issue here.
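In brms the truncation goes into the formula, for example for the signed distance something along these lines (signed_d standing for the signed-distance outcome):

bf(signed_d | trunc(lb = -100, ub = 100) ~ task*confidence + continuous_trait +
     (task*confidence|i|id) + (confidence|q|target),
   sigma ~ task)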

I think it is less crude than you make it sound here, but maybe what you say is based on your experience of using loo in incompatible comparisons?

I think the PPC was showing misfit only for the normal model.

Note that if you scale the data in different ways for different models, the loo comparison is also incompatible (unless you add the Jacobian correction, although that doesn't solve the continuous-discrete mixture issue here).
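As a hypothetical example: if one model were fit to $z = (y + 100)/200$ while the other was fit to $y$ directly, then $\log p_y(y) = \log p_z(z) - \log 200$, so each pointwise log density of the rescaled model would need $-\log 200$ added before the elpd values are comparable (and even then the discrete parts remain a problem).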


Thank you @avehtari! When I wrote OLS (and I understood @saudiwin the same way), it was indeed shorthand for the normal linear regression model with Bayesian inference. When I changed from absolute distance to signed distance, I made sure to use loo only to compare apples with apples, but the same pattern remained between Mbeta and Mnorm.

From what you, @torkar, and @saudiwin have said, I gather that:

  • When the loo comparison is valid, a small difference such as the one I reported means I do not have clear evidence in favor of one model or the other (on the basis of loo, at least).
  • In my case, the loo comparison is not valid because of the discrete components of the ordered beta model.
  • LOO-adjusted R2 lets me compare R2 across different models (different structures? different predictors? different families?), but it also doesn't work this time because of the discrete portion of the beta model.
  • When I operationalize distance as signed, the normal model seems to fit better visually (and has other attractive characteristics). Still, adding truncation to account for the [-100, 100] boundaries would improve the model.

Please let me know if my understanding is correct.

Yes, I think your understanding is correct. Sorry this is getting so complicated :). It really doesn't have to be. In your case, both the Normal distribution and ordbetareg fit reasonably well. Normal is showing better fit primarily because you differenced two bounded scores, which results in a Normal-ish looking hump.
If this is starting to feel like a struggle, just know that most people would consider you fine using either model at this point. I'm of course a fan of ordbetareg, having written it, but a lot of people would use Normal/Gaussian regression here without thinking much about it.
@avehtari is correct regarding the issues with loo. These stem from the underlying implementation: brms doesn't recognize mixed continuous/discrete models, so the loo/ppc implementations need to be adjusted, and I'll add them to the issue list.
Also, @avehtari, I wasn't trying to diss the loo criterion :). I have used it extensively in my own work. My only point is that if you do know the true model, then LOO on its own can't change that fact. LOO is designed to help you find the best model when there is some uncertainty about specifications.


Or, to put it another way, @hector: LOO/R2 allow you to say which model shows better predictive validity with respect to the data. When we don't know where the data come from, that's our best clue. But it is entirely possible that the sample at hand is not representative of all the data the true distribution produces. In the case of ordbetareg, it's possible to have a sample with few, if any, observations on the bounds while future samples may indeed have them.

Of course it’s possible to truncate the normal distribution, but at least for me, that always seems like moving the goal posts…

Sorry @hector, one final comment: you could also consider fitting your model differently in a way that would make better use of ordbetareg. Instead of differencing the scores, consider stacking the two outcomes and adding a binary indicator coded 0 for one outcome and 1 for the second. You can then interact that indicator with other variables, which would allow you to model effects on each outcome type separately.
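In code the reshaping could look roughly like this (truth and guess are placeholders for your two score columns):

library(tidyr)
# stack the two scores into one outcome column with a 0/1 indicator for which score it is
long <- pivot_longer(dat, cols = c(truth, guess),
                     names_to = "which_score", values_to = "score")
long$which_score <- as.integer(long$which_score == "guess")
# then model `score` with which_score interacted with the other predictors, e.g.
# bf(score ~ which_score * task * confidence + (1 | id) + (1 | target), phi ~ task)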

I would expect it to be rare for LOO to strongly favor a wrong model (we can also discuss what it means if the parameter space does include parameter values corresponding to the true model but the prior has very small probability mass in that region, and we have only finite data). If you have interesting examples, please do send me a message.

I also found this result

loo_compare(loo(Mnorm), loo(Mbeta))
      elpd_diff se_diff
Mnorm       0.0     0.0
Mbeta     -46.5    22.4

suspicious. Even though elpd_diff is not large relative to se_diff, the magnitudes of elpd_diff and se_diff indicate that one of the models might be misspecified. In this case I assume the models disagree specifically on the observations at 0 and 1, as a comparison between log probabilities and log densities doesn't work well. But maybe you have also seen examples where the comparison is valid and LOO still favors the model you consider to be wrong? I'm not claiming LOO is perfect, but I'm eager to learn more about unexpected cases.


Thank you @saudiwin and @avehtari! I believe my original questions have been sorted.

@saudiwin, I take home the point that principled model selection is better than overfitting to my current sample. In this case, I believe only one of the two tasks will require me to think carefully about how likely it is that unseen samples will have many responses at the bounds. Ordbetareg fits the data well, but it is also challenging to tinker with if I cannot rely on the posterior predictive checks, the R2, or the loo. I think the reasonable fit of the normal model for the signed distance (not the absolute) is good enough for now. On the other hand, psych data is drenched in bounded measures, so I'm sure I will be using ordbetareg often, and I look forward to the loo/ppc adjustments.

Hi, I am a newbie and I have no math background. I just want to use your software to check the psychometric quality, in terms of IRT, of a non-cognitive scale. Would you be so kind as to advise me how to do it?
Thank you


Hi, welcome to the Stan forums! Please start a new thread asking for specific advice on your problem. Also, please try to give more details about what your goals are and what tools you intend to use.
