Modelling small-sample, potentially skewed data with regression where CLT doesn't apply

Hi, sorry for not getting back to you earlier - the question is relevant and well written.

A few thoughts:

  • I am not completely sure why you are trying so hard to avoid working with the raw data - if performance is the concern, then at that sample size (unless you have a lot of predictors), even simple frequentist/maximum-likelihood tools, or approximations like INLA or ADVI/optimize in Stan, are quite likely to give you sensible results.
  • For determining a good response distribution, the population distribution matters less than the distribution of the response given your predictors. If you have good predictors, then even “stupid” distributions (e.g. normal / lognormal / gamma) can work pretty well even when the population distribution is very complex. I would advise against using loo/WAIC as the primary criterion for selecting a response; focus instead on qualitative properties of the fit (e.g. via posterior predictive checks or residual plots - see the first sketch after this list). If those fail to give clear answers, then cross-validation criteria like loo/WAIC might be a sensible next step.
  • If bin is the only predictor and there are repeated values in y, you can drastically speed up your inference with tricks similar to [Case-study preview] Speeding up Stan by reducing redundant computation. Even doing something like
for (i in 1:k) {
  // bin_indices[i] holds the indices of all observations in bin i, so this is
  // one vectorized call per bin with scalar mu[i] and sigma[i]
  y[bin_indices[i]] ~ normal(mu[i], sigma[i]);
}

(i.e. extracting all y values from the same bin and calling normal() or another distribution with scalar parameters) will likely lead to speedups. There are also speed-ups to be gained from normal_id_glm and similar functions that implement the full GLM in a single call (see the second sketch after this list).

  • The small number of data points in some bins is IMHO a relatively minor problem as long as the normal approximation holds - unless you have very few (say < 5), the standard error should account for most of it. If the normal is a bad approximation for the response, the low number of data points is likely a secondary concern.
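
To make the posterior-predictive-check suggestion concrete, here is a minimal sketch of a lognormal response with per-bin means and predictive draws for graphical checks. It assumes bin is your only predictor and y is strictly positive; all names (N, n_bins, bin, y, y_rep) are placeholders and the priors are purely illustrative.

data {
  int<lower=1> N;                          // number of observations
  int<lower=1> n_bins;                     // number of bins
  array[N] int<lower=1, upper=n_bins> bin; // bin membership of each observation
  vector<lower=0>[N] y;                    // strictly positive response
}
parameters {
  vector[n_bins] mu;                       // per-bin location on the log scale
  real<lower=0> sigma;                     // shared log-scale spread
}
model {
  mu ~ normal(0, 5);                       // illustrative priors only
  sigma ~ normal(0, 2);
  y ~ lognormal(mu[bin], sigma);           // vectorized over all observations
}
generated quantities {
  array[N] real y_rep;                     // posterior predictive draws
  for (n in 1:N) {
    y_rep[n] = lognormal_rng(mu[bin[n]], sigma);
  }
}

The y_rep draws can then be compared against the observed y (e.g. with bayesplot’s ppc_dens_overlay) to judge whether the lognormal is adequate before reaching for loo/WAIC.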
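
For the normal_id_glm point, a minimal sketch with a generic design matrix - x, K and the priors are my placeholders, not something taken from your model:

data {
  int<lower=1> N;          // number of observations
  int<lower=1> K;          // number of predictor columns
  matrix[N, K] x;          // design matrix
  vector[N] y;             // response
}
parameters {
  real alpha;              // intercept
  vector[K] beta;          // coefficients
  real<lower=0> sigma;     // residual scale
}
model {
  alpha ~ normal(0, 5);    // illustrative priors only
  beta ~ normal(0, 5);
  sigma ~ normal(0, 2);
  // one call covering the whole linear predictor + likelihood
  y ~ normal_id_glm(x, alpha, beta, sigma);
}

The single call replaces building mu = alpha + x * beta and then calling normal(mu, sigma), which is where the speedup comes from.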

Could you share what exactly y represents? Maybe somebody here has experience modelling this kind of data and can help you find a good model.

No pressure to disclose the reason, but that looks a bit suspicious - unless the randomization was uneven by design, how could you get uneven bins? Isn’t that a sign that there is additional structure in the data that should be taken into account?

Best of luck with your model!
