Gaussian vs. skew-normal model selection

I am fitting brms models on rating data that I collected.

The ratings range from 0 to 200 and are highly skewed as can be seen in the posterior predictive plots below.

As the response variable distribution is highly skewed, my intuition was to fit a skew_normal model to these data. However, as I haven’t used skew_normal models a lot in the past, I also fit a gaussian model to these data, just for comparison.
To simplify matters for this question I fit a model only estimating the overall intercept + random intercept adjustments per participant. The two models i used are:

m_gauss <- brm(m_conf ~ 1 + (1 | ppid)
              , data = d_h1
              , chains = 4, cores = 4)


m_SN <- brm(m_conf ~ 1 + (1|ppid)
              , data = d_h1
              , family = "skew_normal"
              , chains = 4, cores = 4)

In this case I use default priors, but I tried to increase the alpha prior of the skew-normal model in it does not make much of a difference.

As was to be expected the pp_check overlay plot looks quite bad for the gaussian model (left) and much more reasonable for the skew_normal model (right).


However, what surprised me was that the gaussian model (left) rather than the skew_normal model (right) did a better job at recovering the observed mean and standard-deviation in the data.


What I also realized is that the gaussian model looks better in terms of pareto-k diagnostic, i.e. shows less problematic/influential observations.


When however, comparing the models with both loo-ic and k-fold CV, the skew-normal model outperforms the gaussian model in both cases.

m_gauss 12273.47 104.38
m_SN 11786.12 82.09
m_gauss- m_SN 487.35 49.31

elpd_kfold is -6141.9 for the gaussian model and -5898.7 for the skew_normal.

My question is what I should do with this.
As I see it, the skew_normal model performs better in terms of prediction / fit-indices.
Why is it though, that the skew_normal model identifies more problematic observations in the data in terms of pareto-k and does not correcly recover the observed mean in the data?
Is that something with the data or is it something that can be expected with a skew_normal model in general?
Is there maybe another model-family / transformation that I could try that might work better?

I hope I included all relevant information and this is the right place to ask a question like this.

Thanks in advance!

  • Operating System: Windows
  • brms Version: 2.7.0

I am not entirely certain yet, what exactly drives these differences and I may need to see the fitted models myself and play around with it to come up with an explanation. Is it possible that you share the models / data or provide some fake data to reproduce the behavior you are describing?

Out of curiosity have you given an exponential() model ?
Might be able to capture that skew a little better than the skew_normal() as it looks like you have a ceiling effect thing going on.

Throw a rug plot on the observations responsible for tanking the performance of the skew normal, my guess is they’re the bump in the left tail. PSIS is an approximate CV so go with the CV overall

Given that your response data have hard boundaries at 0 and 200, wouldn’t it be more principled to scale the response value; e.g., scaled_response = (0.5 + (A-1)*response/200) / A, where A is some large number like 1000, to get it into the interval (0, 1) and then fit with a beta distribution?

@paul.buerkner Thank you for your quick reply and help ! I attached the data with this reply, let me know if you need anything else.

data_exp.csv (20.4 KB)

@haututu thank you for the suggestion. From my understanding the exponential model would be a better choice for data that are heavily skewed in the opposite direction, no? However, I gave it a try and the posterior predictive check does not look very promising.

@sakrejda good point, I think there is indeed a little bump (plot below). But from the posterior predictive check of the gaussian model it did not look like it was predicting these observations more accurately in any case. In any case, i guess you’re right and going with smaller elpd from CV is the way to go here.

@andymilne Interesting idea, i tried an inverse transformation before and fit an exponential on that, but that did not really work so well, but I gave your suggestion a try as well and it looks quite good.

Also in terms elpd_kfold is 1422.2, so much better than the other 2 models.

My only problem with this is that it appears quite unnatural to do this. The data represent ratings that people give on a scale that is - at least to some extent - representing meaningful values. In the beta model, the estimates could not readily be interpreted (e.g. how many points difference does a experimental manipulation result in). Can the posteriors of the beta-model readily be transformed back onto the original scale, and if so, is that a thing one should do or rather avoid?

Thanks for all your suggestions, they are really helpful.


If you have rating data I would not use a gaussian or a skew-gaussian.

It is hard to say what the proper model is, without more information about the data generating process.

Is your outcome a sum score? Then you could fit a hierarchical ordinal model with individual items as with-in person repeated measures (As @paul.buerkner proposed somewhere else). If this to slow, the beta-binomial model could be useful.

Somebody else mentioned re-scaling to the [0,1] interval. This will be problematic of you have any values directly on the bounds (0,1) for reasons described elsewhere in the list.

Anyhow, I think the most important thing is to first be clearer about what type of data you have.

Thanks for your reply. About the data-generating process: each observation of the DV represents the mean-score of 5 ratings of the same object, with each person rating 32 objects. I assumed that the ordinal model was more useful when the scale has rather few points (e.g. 7-point) compared to a 200-point scale. Do you think that ordinal/binomial is a good choice in this case?

In what sense are the actual values of 0 to 200 in a rating scale genuinely meaningful? If the ratings are scaled to (0, 1) and you used a logit link, you can interpret \beta_x as saying that a unit increase in x will result in ratings/(1 - ratings) being multiplied by \exp(\beta_x) or, put differently, an increase of 100(\exp(\beta_x) - 1) percent. I dare say with a bit of algebraic manipulation of the above formula, you could take a given ratings value in any scale you want and determine what a unit change in a predictor x does to it.

OK. My first suggestion would be to not take the mean of the ratings and rather use the raw data and a hierarchical model to account for multiple observations from the same person. (i.e. a random intercept per person)

I agree that an ordinal model with 201 levels could be a stretch (though @betanalpha has a nice new case study about setting priors for ordinal models when some categories are sparse). But you could still give it a try, I would not be surprised if it works.

If you don’t want to try the ordinal model or it indeed does not work, you can try a beta-binomial model, because the beta-binomial distribution is for integers with a lower and upper bound. Here is a description of how to implement a beta-binomial regression in brms.

Finally, you could also check out a model for censored data (e.g. using a gaussian model). If this makes sense depends on why you have so many responses on one boundary. Intuitively, this would be a good model if the reason you end up with many values at 200 is that your scale has a restricted range. (see here) for a quick into into censored vs. truncated data.


@andymilne thanks for the further explanation, I will play around with this a bit. To be fair, in this case the exact values on the scale are not genuinely meaningful, but I have another related data-set where they present real monetary value.

@Guido_Biele thanks for the help and references, I really appreciate it.

About this point; I am already using a MLM, the only reason why it’s mean scores here is that these ratings, are predicted by a variable from another task in the full model that only have 1 observation per level. But you’re right the aggregation is unnecessary here, though the estimates are really similar in any case. Thanks for pointing that out!

Your other three suggestions sound cool I’ll give them a try and then report back to you. Thanks for taking the time, I really appreciate the help.

Sorry Im an idiot, didnt think exponential doesnt work the other way !

well, he could just do 200 - x and then fit the exponential

I wouldn’t use exponential in this data set. It just doesn’t feel right inverting the data and even then exponential is too restrictive and one would perhaps rather use gamma or weibull (again not on this data set).

An ordinal model has some potential here and I have used ordinal models for similar purposes in the past. It may require a little bit of grouping of the categories though and I am not sure if that’s something you want to do.

Fair enough. I wasn’t trying to advocate for the exponential but just noting that the ‘direction’ issue could be easily overcome. It does seem like a gamma on the (200-x) data might be reasonable.

Again, thank you all for the suggestions.

I was going with the skew-normal in the beginning as I thought it fits the data-generating process quite well. The ratings on the scale represent people’s subjective confidence about a decision. In this scale, the scale is only anchored at 0 and 200 without additional ticks/categories.
Thus, I assume that each observation of the DV represents some true confidence of a person + noise. I thought that a gaussian model would capture that idea quite well, but as I knew these confidence indications are normally negatively skewed, I decided to go with the skew normal. Shouldn’t that, at least in principle, be accurate?

@paul.buerkner: Also, I would like to avoid binning/grouping the data if possible. For example, in another experiment the values represent real monetary values that people have to “recover” on the scale rather than giving ratings in the classical sense. Thus, in that case a gaussian model assmuing observations have a true value / mean + error seems to be most plausible to me at least (but then again maybe my intuition is wrong). Moreover, I also use the spread of ratings of the same objects, i.e. their sd, for predicting other variables in a later stage of the analysis. This probably does not make a lot of sense out of context, but it is to say that the solution of fitting an ordinal model on binned data would be something I would like to avoid. That said, if you think it might be the only reason out of this misery or

I tried some additional stuff in the meantime with some interesting but confusing results again:

I fit a truncated model with an upper bound of 200

m_gauss_trunc <- brm(m_conf | trunc(ub = 200) ~ 1
                 + (1|ppid)
              , data = d_h1, family = "gaussian"
              , chains = n_cores-1, cores = n_cores-1

in the pp_check the fit looks reasonable, and looic is better than for the previous models (looic = 11290):


However, for some reason I get an intercept in the summary(m_gauss_trunc) output that is around 212.5 with SE of 9.9 :

Group-Level Effects: 
~ppid (Number of levels: 57) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)    66.55      9.24    50.86    87.09       1151 1.01

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept   212.52      9.86   194.05   233.04        622 1.01

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma    36.02      1.58    33.17    39.34       6531 1.00

which seems odd as the intercept is > 200, or am I just misinterpreting was the intercept means here? If I look at pp_check(m_gauss_trunc, type="stat") I get this plot:

which does seem to recover the mean/intercept correctly. I do not really understand what is going on here. Does the intercept in this case represent something like: “if we assume the data are from a normal distribtution but have been truncated at 200, the best normal distribution to describe this would have a mean of 208”?

At least I was able to reconstruct that kind of behavior when I simulating a sort-like situation using a truncated normal with the estimates from m_gauss_trunc

d_trunc <-, 212, 36, 0, 200))
names(d_trunc)[1] <- "y"
mtest_trunc <- brm(y | trunc(ub = 200) ~ 1
              , data = d_trunc 
              , chains = n_cores-1, cores = n_cores-1

Again, I get an intercept of > 200 but an accurate mean-recovery for the data.

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept   212.59      7.74   200.13   229.76       1287 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma    37.07      2.74    32.30    43.04       1366 1.00

So I guess, the truncated model does allow for underlying distributions with mean values that are higher than the upper bound of the truncation? In that case, I wonder whether the truncated model would make sense in this case.

Additionally, I try a truncated skew-normal model

m_SN_trunc <- brm(m_conf | trunc(ub = 200) ~ 1 
                 + (1|ppid)
              , data = d_h1, family = "skew_normal"
              , chains = n_cores-1, cores = n_cores-1
              , control = list(adapt_delta = 0.999)

in which case I also get an intercept > 200 but cannot plot the posterior predictive check with pp_check() as I get the following error:

Using 10 posterior samples for ppc type 'dens_overlay' by default.
Error in if (any(lower > upper)) stop("lower>upper integration limits") : 
  missing value where TRUE/FALSE needed

Again, no idea why this happens there.

Sorry if this is too much information, but I thought it might help if people experience similar problems in the future maybe.

But to give a short summary of the remaining questions:

  1. As I see it, if I were to choose a model purely on a-priori theoretical considerations, I would go for the skew-normal model. However, that model shows some influential observations, which is of course not per se problematic but I do not exactly know where these come from compared to the gaussian model . Moreover, it does not recover the mean of the data in the posterior predictions which is at odds with the posterior predictive check of the response distribution looking better compared to the gaussian model and performing better in k-fold CV.

  2. If, on the other hand, I were to choose a model based purely on best model-fit (of the models I have tried so far) in terms of looic, I would go with the truncated normal model. However, that model would result in (that is, if my above explanation is sensible) a posterior with a mean/intercept that is higher than the upper bound of the scale, i.e 200, though it does indeed seem to recover the mean in the observed as well as simulated data correctly.

@paul.buerkner Do you think any of these models make (enough) sense to go on using them? Do, in your experience, results change drastically among the different specifications? In my case, the estimates for the predictors of interest were really similar with both the gaussian and skew-normal.

Thanks very much in advance!

This is what I’m suggesting your should plot. Put the values on the x axis, put their posterior likelihood on the y axis, the values with the highest posterior log likelihood are usual the most influential ones in a model this simple. You may also need to do the same at the individual level if they’re clustered within individuals (x axis becomes the individual intercept,y axis density can stay the same)

Ah okay, thanks for the clarification, I think I misunderstood the first time you suggested this, sorry!

I did this now, and as you expected, the low values are the ones with high log-likelihoods (I extracted them with log_lik(m_gauss) and log_lik(m_SN) for the gaussian and skew-normal model respectively and took the abs(colMeans()) of that).

left: gaussian model; right: skew-normal model. X-axis is outcome variable, y-axis is their log-likelihood, Labels show outcome scores for values with relatively high LL

I also did the same for the participant intercepts (participant means from the data)

left: gaussian model; right: skew-normal model. X-axis is outcome variable, y-axis is their log-likelihood, Labels show outcome scores for values with relatively high LL

@sakrejda Thanks for this suggestion, this is helpful. So the low values are the ones that are most influential in this case. This is probably due to their relatively low frequency in the model, right? However, as this happens for both models does that help me in any way with the model selection? Again, my idea would be that the skew-normal model should in theory be better equipped to handle these low values while this does not seem to be the case in this situation?

That’s true in the sense that they contribute more to the log sum of the log likelihoods than others so d(ll)/d(theta) for theta involved in the calculations of their log-likelihoods. More sophisticated approaches use this information but it’s nice to think it through too. If you checked which values were marked by LOO-PSIS as problematic my guess is that at least some of them would be out there.

I think you mean their low expected frequency according to the model, right? Yeah, I agree. You always have some values that you expect to see at a low frequency:

x = rnorm(100)
x = c(-5.5, x)
plot(x, -dnorm(x,0,1,TRUE))

I think this (and the CV results) are telling you that the skew normal isn’t really doing a better job by much, and apparently it can’t reproduce the mean correctly. Looking back at the first plots you posted I think what’s happening is that the only way to improve the fit using the skew normal is to move the mean to the right as you increase skeweness to handle the tail. … that probably explains everything you’re seeing here… you could see positive covariance in the posterior of the skew and mean parameters that support this, I think.

You might need a more detailed model of how the scale is constructed to get a better distribution than the ones you’re working with. You could consider non-normal (t-distributed or a two-component mixture) noise at the observation level and then the skew normal should do quite well.

After being busy with a lot of other stuff, I finally got around to fitting these data with a beta-binomial as you suggested and that seems to work really well, so thanks for that great suggestion. It outperforms the other models in model comparison and does not have any problematic pareto-k.
I just wanted to report this back and marked your answer as the solution, so other people may try the same if they ever find themselves with the same problem.

1 Like