What Response Distribution (Family) Should I be Using?

I am modeling a business phenomena based on survey data (repeated measures, those frequently used in Structural Equation Modeling). My response variable is generated using Bartlett score after the factor analysis is conducted (using STATA, but I am modeling using R + brms of course.) It is a multilevel model where the grouping variable is different country.

I have been getting some interesting results and trying to close my analysis after conducting the posterior predictive check and other steps. I have found that brm by default uses “gaussian” family distributions for the response variable. The default pp_check looked much better after I specified “student” family. But I am still feeling that things can be improved. Any suggestions on which type of family should be used?

Many thanks in advance.

The case of a “gaussian” (the default) family


pp_check(gaussian_model, type = ‘stat’, stat = ‘mean’)

         Estimate   SE
elpd_loo   -743.3 21.8
p_loo        35.1  4.4
looic      1486.6 43.7

It is the heavy deviation around -0.5 ~ -1.5 motivated me to explore a better response distribution. (Even though the gaussian family model is better in reconstructing the mean and with a smaller looic value.)

The case of a “student” family


pp_check(student_model, type = ‘stat’, stat = ‘mean’)

         Estimate   SE
elpd_loo   -770.6 19.4
p_loo        37.4  2.4
looic      1541.1 38.7

Visually, the deviation around -0.5 ~ -1.5 improved a little bit. But its ability to reconstruct the mean was weaker and has a bigger looic value.

I am fairly new to the Bayesian and Stan. I understand that model selection should not be rely on solely one single measure. But I rest my case… I am hoping to get your second opinion about the specific type of distribution family should be used here to bring the improvement on all measures.

Other information:
The response variable (y) is actually somewhat skewed.

mean = 0
sd = 1
skewness = -0.12
kurtosis = 3.89

Thank you!

1 Like

Choice of response distribution depends strongly on domain expertise. Can you explain a bit about the kind of data you’re dealing with?

1 Like

Thanks for taking time mentioning this. I am doing operations management research. Specifically, the outcome is a type of perceived level on operation outcome (how quickly and efficiently respond to changes).

How is the outcome actually measured? For example, is it a questionnaire with multiple questions?

1 Like

It was a five level Likert scale questionnaire.
The response variable is a second order construct with two sub-dimensions. Each sub-dimension has three questions.
I did a factor analysis and the final score is extracted (or predicted) using Bartlett score.

Generally it is best to move inference as close to the raw data as possible, so I suggest you should be doing an ordinal regression of the individual likert items (for a tutorial, see here), add your factor-analytic (CFA example here) on top, within which you can also put your repeated-measures/SEM structure.

1 Like

Possibly related, @mike-lawrence are you aware of any exploratory factor analysis Stan implementations?

1 Like

Thank you. I will look through those references.
But one thing is, after I extracted the factor score using the Barlett technique, the composite score is now in a continuous scale, rather than ordinal. Do you still recommend me using the ordinal approach?

When you model the raw likert responses, it will yield parameters on a continuous latent scale. You’ll then be doing all your multi-level stuff with that parameter as the “outcome”.