Dear Stan Community,

I’m looking for guidance on using non-Gaussian error models to analyze reading times in psycholinguistic research. Reading times are non-negative and right-skewed, but the usual log transformation conflicts with our theoretical model, which requires predictors to combine additively on the raw reading-time scale. We are considering custom Stan models with non-Gaussian error terms that follow distributions such as the gamma or lognormal. Right now I am a bit stuck on where to start. Insights into the practical effects of violating statistical assumptions, and coding experience with similar modeling problems, especially in psycholinguistics, would be greatly appreciated.

Thank you for your time and expertise.

Tom

Hi

It seems to me that the only constraints on your data are that they are non-negative and right-skewed. This does not rule out a Gaussian error model: you could use a truncated normal distribution (lower-bounded at 0). Another option is the lognormal, which can be viewed as a transformation of a normal distribution. That is, we can fit the model to the natural log of the data instead of to the data directly: $\ln(\text{data}) \sim \mathcal{N}(\mu, \sigma)$.
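To make the lognormal option concrete, here is a minimal Python sketch (standard library only). It shows that the lognormal log density is the normal log density of the logged data plus a Jacobian term, and that predictors which add on the log scale multiply on the raw scale, which may matter if additivity in raw reading times is required. The coefficients `b0`, `b1`, `b2`, and `sigma` are made-up illustration values, not estimates from any data.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def lognormal_logpdf(y, mu, sigma):
    """Log density of LogNormal(mu, sigma) at y > 0.

    This is the normal log density of log(y) plus the Jacobian term -log(y).
    """
    return normal_logpdf(math.log(y), mu, sigma) - math.log(y)

def lognormal_mean(mu, sigma):
    """Mean of LogNormal(mu, sigma) on the raw scale."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical coefficients on the log scale (illustration only).
b0, b1, b2, sigma = math.log(300.0), 0.10, 0.20, 0.4

def mean_rt(x1, x2):
    """Expected raw reading time when predictors add on the log scale."""
    return lognormal_mean(b0 + b1 * x1 + b2 * x2, sigma)

# The raw-scale effect of x1 depends on the value of x2 ...
effect_x1_when_x2_is_0 = mean_rt(1, 0) - mean_rt(0, 0)
effect_x1_when_x2_is_1 = mean_rt(1, 1) - mean_rt(0, 1)
# ... whereas the ratio is constant and equal to exp(b1).
ratio_x1 = mean_rt(1, 0) / mean_rt(0, 0)

print(effect_x1_when_x2_is_0, effect_x1_when_x2_is_1, ratio_x1)
```

The Jacobian term does not depend on the parameters, so it does not change the fit, but it does matter when comparing a lognormal model against models defined on the raw scale.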

Hello Tom,

You seem to be interested in a simple GLMM. The easiest way to get there is probably brms, which uses Stan under the hood. You can use the familiar lme4 formula notation and choose from substantially more error distributions.

As @Garren_Hermanus mentioned, non-negative, skewed data do not necessarily rule out a normal. However, transforming the data is usually my last resort; I mostly try to avoid it because you want the model to resemble the data-generating process, and transforming the response also changes the assumptions about the errors. Either way, if there is no existing literature on which distribution to use, you should probably try several error distributions. My guess would be that the Gamma distribution is usually a good bet for time variables.
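As a rough illustration of what a Gamma error model with additive predictors could look like, here is a small Python sketch (standard library only). It parameterizes the Gamma by its mean and a shape parameter and keeps the linear predictor on the millisecond scale, which I believe corresponds roughly to a Gamma family with an identity link in brms. The predictor names (`word_length`, `log_frequency`) and all coefficient values are hypothetical.

```python
import math
import random

def gamma_logpdf(y, mean, shape):
    """Log density of a Gamma distribution parameterized by mean and shape.

    The rate is shape / mean, so E[y] = mean and SD[y] = mean / sqrt(shape).
    """
    rate = shape / mean
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(y) - rate * y)

def mean_rt(word_length, log_frequency, b0=200.0, b_len=15.0, b_freq=-10.0):
    """Identity link: predictors add directly on the raw millisecond scale.

    All coefficients are hypothetical illustration values.
    """
    return b0 + b_len * word_length + b_freq * log_frequency

def simulate_rt(rng, word_length, log_frequency, shape=8.0):
    """Draw one reading time from the Gamma model above."""
    mu = mean_rt(word_length, log_frequency)
    if mu <= 0:
        # With an identity link nothing keeps the mean positive automatically.
        raise ValueError("linear predictor must be positive")
    # random.gammavariate takes (shape, scale); scale = mean / shape.
    return rng.gammavariate(shape, mu / shape)

rng = random.Random(1)
draws = [simulate_rt(rng, word_length=5, log_frequency=2.0) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean, mean_rt(5, 2.0))
```

One design point to keep in mind: with an identity link the mean can become non-positive for some parameter values, so priors or constraints that keep it positive are usually needed.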

You can easily assess models with different distributions using posterior predictive checks and model comparison with *loo*. All of these options are readily available in *brms*. One thing to consider is what prior knowledge you already have. Appropriate priors are very powerful and can improve your model considerably, so thinking about them in advance is a great approach. If you are unsure whether your priors make sense (and even if you are sure), you can use prior predictive checks to assess the possible outcomes of the model without any data (again integrated in brms).
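To show the idea behind a prior predictive check independently of brms, here is a minimal Python sketch (standard library only): draw parameters from the priors, simulate reading times from the model, and see how many simulated values fall in a range you would consider plausible. The priors, the single predictor `word_length`, and the 50 to 3000 ms plausibility window are all assumptions chosen for illustration.

```python
import math
import random

def prior_predictive_draws(rng, n_draws=2000, word_length=5):
    """Simulate reading times using parameters drawn from the priors only."""
    sims = []
    n_invalid = 0
    for _ in range(n_draws):
        # Hypothetical priors (illustration only).
        intercept = rng.gauss(250.0, 50.0)             # ms
        slope = rng.gauss(10.0, 10.0)                  # ms per letter
        shape = rng.lognormvariate(math.log(5.0), 0.5)
        mu = intercept + slope * word_length           # additive on the raw scale
        if mu <= 0:
            # The prior allows a non-positive mean; count it and skip the draw.
            n_invalid += 1
            continue
        # Gamma with mean mu: random.gammavariate takes (shape, scale).
        sims.append(rng.gammavariate(shape, mu / shape))
    return sims, n_invalid

rng = random.Random(2024)
sims, n_invalid = prior_predictive_draws(rng)

# Share of simulated reading times inside an assumed plausible window.
plausible = [s for s in sims if 50.0 <= s <= 3000.0]
frac_plausible = len(plausible) / len(sims)
print(frac_plausible, n_invalid)
```

If a large share of the simulated values looks implausible, or many prior draws give an invalid mean, that is a hint that the priors need tightening before fitting the model to data.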

For general advice on how to approach a question using such a workflow you can look at the paper by @jonah et al. This will help you to be more confident in your analysis outcomes and give you a hint on what has to be changed in your model.

Good luck!