Health score modeling

Hi,

I have a health score that ranges from 0 to 48. It comes from a questionnaire with 12 questions; each question has 5 answer options worth 4, 3, 2, 1, or 0 points, and the score is the sum. For an individual patient it normally only declines, and I’m interested in the slope. So far the likelihood is normal, which works fine but feels wrong since the data are discrete. What would you choose as the likelihood? Negative binomial?

Best,
Thorsten

Another idea was scaling the data to between 0 and 1 and using a beta distribution as the likelihood.

Could you model the 12 questions with some sort of multivariate (Poisson?) distribution and recover the total score as a generated quantity?

Thanks for your response. I only have access to the total score, not to the answers to the individual questions.

Hmm, ok, that makes things a little more difficult. You could potentially treat the question responses as latent values, though this will lead to some degree of non-identifiability unless you have covariates that strongly predict the specific responses. Alternatively, it might not be too difficult to work out the combinatorics of the possible summed values and write a custom _lpmf function (see the sketch below). Otherwise, yes, working with something like the negative binomial or Poisson is probably your best bet, and you’ll just have to live with the actual data showing more “clumping” than the model would predict.
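For the custom _lpmf route, here is a minimal sketch. It assumes (my assumption, not something stated in the thread) that the 12 questions are iid with a shared answer-probability simplex theta; the pmf of the total score is then the 12-fold convolution of the per-question pmf:

```stan
functions {
  // Log-pmf of the total score y in 0..48, assuming 12 iid questions,
  // each worth 0..4 points with probabilities theta[1..5] (a simplex).
  // The pmf of the sum is built by convolving the per-question pmf
  // 12 times; p[s + 1] holds Pr(sum = s).
  real sumscore_lpmf(int y, vector theta) {
    vector[49] p = rep_vector(0, 49);
    p[1] = 1;  // before any question is added, the sum is 0
    for (q in 1:12) {
      vector[49] p_next = rep_vector(0, 49);
      for (s in 0:(4 * (q - 1))) {   // sums reachable after q - 1 questions
        for (k in 0:4) {             // points contributed by question q
          p_next[s + k + 1] += p[s + 1] * theta[k + 1];
        }
      }
      p = p_next;
    }
    return log(p[y + 1]);
  }
}
```

With this in the functions block the likelihood is just `y[n] ~ sumscore(theta);`. The iid-questions assumption is what keeps the convolution simple; correlated questions would need a different construction.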

I would choose the beta-binomial. In my experience, it typically fits sum scores very well.

If you have access to the item-level data, another option is to model the items directly in a multilevel IRT model. I have a brms-based example here.

Trick question! You haven’t given us enough information to answer.

Is the idea that you have a bunch of observations in {0, 1, ..., 48} over time for a given subject and you fit a regression of some kind? If so, you have to be careful to keep your predictions in that range.

Negative binomial is a generalization of the Poisson, and hence unbounded above, so not appropriate.

At least the beta suggestion has the right range.

What I would recommend is doing the regression and then plotting a histogram of the residuals to see what the error looks like.

With scores in 0 to 48, you can often get away with a continuous approximation. The bigger issue is the bounds. You could build a linear regression that predicts values in (-inf, inf), or a scaled logistic regression that predicts values in (0, 48). Then you typically want to link that continuous approximation back to the discrete data. A typical way to do that, for an observation from subject n at time t with a regression that depends only on time, is:

```stan
y[n, t] ~ binomial(48, inv_logit(alpha[n] + beta[n] * t));
```

Alternatively, you can constrain alpha < 0 and beta < 0 and use exp(alpha[n] + beta[n] * t) as the probability (a log link instead of the logit link). The latter gives you exponential decay, which may make sense for a score that only declines.
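To make that concrete, here is a minimal sketch of the logit-link version (the data layout, the priors, and the assumption of one score per subject and time point are all mine, for illustration):

```stan
data {
  int<lower=1> N;                        // subjects
  int<lower=1> T;                        // time points
  array[T] real t;                       // observation times
  array[N, T] int<lower=0, upper=48> y;  // total scores
}
parameters {
  vector[N] alpha;                       // per-subject intercepts (logit scale)
  vector[N] beta;                        // per-subject slopes (logit scale)
}
model {
  alpha ~ normal(0, 2);                  // illustrative priors
  beta ~ normal(0, 1);
  for (n in 1:N) {
    for (j in 1:T) {
      y[n, j] ~ binomial_logit(48, alpha[n] + beta[n] * t[j]);
    }
  }
}
```

In practice you would probably pool alpha and beta across subjects with a hierarchical prior; that is omitted here to keep the sketch short.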

If the binomial is too restrictive, you can push out to the beta-binomial, but then you have an additional concentration parameter to estimate on top of the mean.
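A hedged sketch of what that likelihood line could look like in the mean/concentration parameterization (mu and kappa are my names; kappa would be declared as `real<lower=0> kappa;` in the parameters block):

```stan
// mu is the expected proportion of the 48 possible points;
// kappa is the concentration: smaller kappa means more overdispersion.
real mu = inv_logit(alpha[n] + beta[n] * t[j]);
y[n, j] ~ beta_binomial(48, mu * kappa, (1 - mu) * kappa);
```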

Sure, try my package ordbetareg. Should work just fine: CRAN: Package ordbetareg