Modelling small-sample, potentially skewed data with regression where CLT doesn't apply

Hi, sorry for not getting back to you earlier - the question is relevant and well written.

A few thoughts:

  • I am not completely sure why you are trying so hard to avoid working with the raw data - if performance is the concern, then at that sample size (unless you have a lot of predictors), even simple frequentist/maximum-likelihood tools, or approximations like INLA or ADVI/optimize in Stan, are quite likely to give you sensible results.
  • For determining a good response distribution, the population distribution matters less than the distribution of the response given your predictors. If you have good predictors, then even “stupid” distributions (e.g. normal / lognormal / gamma) can work pretty well even when the population distribution is very complex. I would advise against using loo/WAIC as the primary criterion for selecting a response; focus instead on qualitative properties of the fit (e.g. via posterior predictive checks or residual plots - see the first sketch after this list). If those fail to give clear answers, then cross-validation criteria like loo/WAIC might be a sensible next step.
  • If bin is the only predictor and there are repeated values in y, you can drastically speed up your inference with tricks similar to [Case-study preview] Speeding up Stan by reducing redundant computation. Even doing something like
for (i in 1:k) {
  // bin_indices[i] holds the indices of all observations in bin i, so this is
  // one vectorized call per bin with scalar mu[i] and sigma[i]
  y[bin_indices[i]] ~ normal(mu[i], sigma[i]);
}

(i.e. extracting all y values from the same bin and calling normal() or another distribution with scalar parameters) will likely lead to speedups. There are also speed-ups to be gained from normal_id_glm and similar functions that implement the full GLM in a single call (see the second sketch after this list).

  • The small number of data points in some bins is IMHO a relatively minor problem as long as the normal approximation holds - unless you have very few (say < 5), the standard error should account for most of it. If the normal is a bad approximation for the response, the low number of data points is likely a secondary concern.
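
To make the posterior-predictive-check suggestion concrete, here is a minimal sketch of a lognormal response with per-bin means and predictive draws for graphical checks. It assumes bin is your only predictor and y is strictly positive; all names (N, n_bins, bin, y, y_rep) are placeholders and the priors are purely illustrative.

data {
  int<lower=1> N;                          // number of observations
  int<lower=1> n_bins;                     // number of bins
  array[N] int<lower=1, upper=n_bins> bin; // bin membership of each observation
  vector<lower=0>[N] y;                    // strictly positive response
}
parameters {
  vector[n_bins] mu;                       // per-bin location on the log scale
  real<lower=0> sigma;                     // shared log-scale spread
}
model {
  mu ~ normal(0, 5);                       // illustrative priors only
  sigma ~ normal(0, 2);
  y ~ lognormal(mu[bin], sigma);           // vectorized over all observations
}
generated quantities {
  array[N] real y_rep;                     // posterior predictive draws
  for (n in 1:N) {
    y_rep[n] = lognormal_rng(mu[bin[n]], sigma);
  }
}

The y_rep draws can then be compared against the observed y (e.g. with bayesplot’s ppc_dens_overlay) to judge whether the lognormal is adequate before reaching for loo/WAIC.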
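
For the normal_id_glm point, a minimal sketch with a generic design matrix - x, K and the priors are my placeholders, not something taken from your model:

data {
  int<lower=1> N;          // number of observations
  int<lower=1> K;          // number of predictor columns
  matrix[N, K] x;          // design matrix
  vector[N] y;             // response
}
parameters {
  real alpha;              // intercept
  vector[K] beta;          // coefficients
  real<lower=0> sigma;     // residual scale
}
model {
  alpha ~ normal(0, 5);    // illustrative priors only
  beta ~ normal(0, 5);
  sigma ~ normal(0, 2);
  // one call covering the whole linear predictor + likelihood
  y ~ normal_id_glm(x, alpha, beta, sigma);
}

The single call replaces building mu = alpha + x * beta and then calling normal(mu, sigma), which is where the speedup comes from.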

Could you share what exactly y represents? Maybe somebody here has experience modelling this kind of data and can help you find a good model.

No pressure to disclose the reason, but that looks a bit suspicious - unless the randomization was uneven by design, how could you get uneven bins? Isn’t that a sign that there is additional structure in the data that should be taken into account?

Best of luck with your model!
