Appropriate choice of likelihood for 0 to 100 self-report response variable?

I’m trying to model some psychological self-report data where the response variable was reported on a scale of 0 to 100. The distribution of the responses is left-skewed, with a mode around 75 and quite a few responses on the boundaries (0 and 100). I first tried modeling the data as truncated normal, but I got a lot of divergent transitions and a weird fit based on posterior predictive checks. I’ve seen some threads where people mentioned that divergent transitions are common with the truncated normal, especially with responses on the boundaries, so I instead tried the skew_normal() likelihood from brms, and with that I got a more reasonable posterior predictive distribution (see attached picture). However, there are still two problems with that: I’m still getting some divergent transitions, and the response isn’t bounded between 0 and 100.

I’ve looked around for different solutions that people have applied to similar data. I’ve seen some people divide the response variable by 100 so it’s between 0 and 1, and then use the beta likelihood to model it. However, as I mentioned, my data contain a few observations on the boundaries (0’s and 1’s, in this case), so brms throws an error if I try to use the Beta() or zero_inflated_beta() likelihood.

I’ve been using the default vague brms priors, so I think the divergent transitions should hopefully go away once I set more sensible priors. However, I’m more concerned about the choice of likelihood. Does anyone know a good likelihood for this sort of data? Or how to fix the problems with the truncated normal likelihood?

PS: There’s clearly over-representation of certain “nice” values (e.g. 50, 75) in the response, which I think makes sense for self-reported data. I’d be keen to model that kink in the data, but I suspect it’s a bit beyond my current modeling skills and time resources. Is there some easy way of modeling the over-represented values? If not, should I be worried about them affecting my overall model fit?

This sort of thing comes up a lot, and I usually recommend just fitting a linear model and not worrying about it. It depends on your goal. If your goal is to understand what predicts the response to be higher or lower, I think a straight normal model on the untransformed scale will work just fine. If your goal is to make predictions of the response, the problem becomes more challenging. But, in my experience, people usually aren’t trying to predict the response.

If you are trying to predict, I recommend you first transform the raw data to quantiles and then to z-scores using the inverse normal cdf. Then do linear regression, or whatever else you want, on the z-scores. At the end, when you want predicted responses, transform back to the empirical distribution of the raw data.
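The transform described above could be sketched roughly as follows. This is only an illustration using the Python standard library: the function names `to_normal_scores` and `from_normal_scores` are made up for this example, and the choices of average ranks for ties and of `(rank - 0.5) / n` as the quantile are assumptions, not something specified in the post.

```python
from statistics import NormalDist


def to_normal_scores(values):
    """Map raw responses to z-scores via their empirical quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    z = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values so they all share one average rank.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1      # 1-based average rank of the tied run
        q = (avg_rank - 0.5) / n        # assumed convention; stays strictly inside (0, 1)
        for k in range(i, j + 1):
            z[order[k]] = std_normal.inv_cdf(q)
        i = j + 1
    return z


def from_normal_scores(z_pred, raw_values):
    """Map predicted z-scores back onto the empirical distribution of the raw data."""
    sorted_raw = sorted(raw_values)
    n = len(sorted_raw)
    std_normal = NormalDist()
    out = []
    for z in z_pred:
        q = std_normal.cdf(z)
        idx = min(n - 1, int(q * n))    # position of the matching empirical quantile
        out.append(sorted_raw[idx])
    return out


raw = [0, 50, 75, 75, 100]
z = to_normal_scores(raw)
back = from_normal_scores(z, raw)
```

Because the back-transform only ever returns values that occur in the raw data, predictions automatically stay within 0 to 100 and reproduce the boundary spikes.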

But all that beta distribution stuff? No way, don’t go there.

Hey @andrewgelman, would you mind adding more colour to this? I’d be interested to know why the beta distribution should not be used even if it fits snugly in this case.

Yeah, I would be interested as well!

In the applications I’ve seen, the important part of the model is the deterministic part, not the error term. I recommend a simple normal model because then you can focus on the regression part, which is the most important thing. As noted above, if you want to get predictions of the outcomes and reproduce that distribution, you can always do an inverse cdf transform of the predictions. If you really want to use a beta distribution, fine, but my guess is you’re focusing on the least important part of the problem. Usually the goal is not to model the univariate distribution of the data; the goal is to understand what predicts the response to be higher or lower.

@andrewgelman is correct that a linear model actually fits this type of data quite well. If you want something more specialized, I have a model based on the beta distribution that is useful when observations on the boundaries, such as 0s or 100s in your case, are likely to be qualitatively different from other observations. I think this may be useful to you because you have a self-report variable, and people often assign a different meaning to complete certainty than they do to other responses (i.e. there can be a bigger jump in certainty/intensity between 99 and 100 than between 50 and 51).

You can read a working paper about the model here: https://osf.io/preprints/socarxiv/2sx6y/

And fit it in brms using the vignette here: https://htmlpreview.github.io/?https://github.com/saudiwin/ordbetareg/blob/master/estimate_with_brms.html

I think there are two questions one can ask before choosing a model:

  1. Is it reasonable to assume that differences of responses of plus/minus one can be measured reliably?
  2. Is it reasonable to assume that the answers are on an interval scale?

In my experience, the intuitive answer for most psychological scales that measure a state or personality trait is no to both questions. When this is the case, I tend to bin the data into 10 bins and use an item response theory (IRT) model to analyze the binned data.

One could argue that one is losing some information by binning the data. But I don’t think this is a big issue, because I assume that a 0-100 scale is much more fine-grained than necessary given the accuracy of self-reported behavior (well, except when one literally asks for counts) or feelings.

One question is of course how many bins to choose. I don’t have a good answer for that. I tend to use 10 bins and check whether the results change if I use 20 or more bins.
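As a rough sketch of the binning step, assuming equal-width bins over 0 to 100 (the post does not say how the bin edges are chosen), something like the following would produce ordinal categories that can then be passed to an ordinal or IRT model. The function name `bin_responses` is hypothetical.

```python
def bin_responses(values, n_bins=10, lo=0, hi=100):
    """Assign each response to one of n_bins equal-width ordinal categories (1..n_bins)."""
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        if not lo <= v <= hi:
            raise ValueError(f"response {v} is outside [{lo}, {hi}]")
        # A response exactly at the upper boundary goes into the last bin.
        category = min(n_bins, int((v - lo) / width) + 1)
        binned.append(category)
    return binned


responses = [0, 9, 10, 50, 99, 100]
bins_10 = bin_responses(responses, n_bins=10)
bins_20 = bin_responses(responses, n_bins=20)
```

Refitting the same model on `bins_10` and `bins_20` and comparing the substantive conclusions is one way to run the sensitivity check mentioned above.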

One advantage of using IRT models is that they can (to a degree) deal with responses at the extremes. They also automatically make you think about item difficulty, discrimination, and person parameters (ability) and highlight that responses are just indicators of a latent variable, but you can also use these concepts with linear indicators.

Hope this helps

PS: The density plot in the original post seems to indicate clustering of responses around multiples of 10.
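A quick descriptive check of this kind of clustering, before deciding whether it needs modeling, could look like the sketch below. The helper name `heaping_summary` and the example responses are made up for illustration. As a rough reference point, if integer responses were spread evenly over 0 to 100, about 11 of 101 values (roughly 11%) would be multiples of 10 and 21 of 101 (roughly 21%) would be multiples of 5.

```python
def heaping_summary(values, steps=(5, 10)):
    """Share of responses that fall exactly on a multiple of each step."""
    n = len(values)
    return {step: sum(1 for v in values if v % step == 0) / n for step in steps}


# Made-up responses, only to show the call.
example = [0, 50, 75, 73, 100, 42, 60, 75]
shares = heaping_summary(example)
```

Shares far above those reference values suggest respondents are rounding to "nice" numbers.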