Modelling small-sample, potentially skewed data with regression where CLT doesn't apply

Problem setup

Suppose I have run some experiment on a large number of people (say 10^6 for argument’s sake) in order to try to understand the relationship between some integer predictor (x) and a dependent variable (y).

People have been randomised into i \in [1,k] bins, each of which has a different value of the independent variable (x[i]). The randomisation means that I can assume that any measured relationship between x and y is driven by x itself (ie no confounding).

For reasons I can’t get into here, the distribution of people between the bins is highly uneven (some bins have many more people than other bins).

To try to collapse the problem down from a large number of data points to a manageable size, I calculate for each bin:

  • y_mean_obs as the mean of the observed y values within each bin
  • y_se_obs as the standard error of the observed y values within each bin.

If I define a parameter sigma_obs to represent any unexplained observational noise, then I can craft a linear model for the problem as follows:

for(i in 1:k){
   y_mean_obs[i] ~ normal(x[i] * beta + c, sqrt(square(y_se_obs[i]) + square(sigma_obs)));
}

…where the scale combines, in quadrature, the sampling uncertainty in the bin mean and the unexplained observational noise.

Provided that the sample size in each bin is large enough, the central limit theorem applies and so this model could be a good representation of the process.

Accounting for small sample size and skewness

However, in the present situation there are two problems:

  • The number of data points in each bin may be small
  • The population distribution of y is extremely skewed

Both of these combine to mean that my normal approximation for y_mean_obs may not be a good one.
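To get a feel for how slowly the skewness of the bin means decays, here is a quick Monte Carlo sketch. The lognormal(0, 1) population and the bin sizes are just illustrative assumptions, not part of the actual problem:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_skewness(n, draws=20000):
    # Sample skewness of the sampling distribution of the mean of n
    # lognormal(0, 1) observations (a stand-in for a skewed y).
    means = rng.lognormal(0.0, 1.0, size=(draws, n)).mean(axis=1)
    c = means - means.mean()
    return (c**3).mean() / (c**2).mean() ** 1.5

# The skewness of the mean decays only like 1/sqrt(n), so small bins
# retain much of the population skewness.
small_bin = mean_skewness(5)
large_bin = mean_skewness(500)
```

For this population the n = 5 bin means remain clearly skewed while the n = 500 bin means are close to symmetric, which is exactly the regime split described above.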

My questions:

  1. Is there any easy way to extend this model to account for the potential failure of the CLT at this sample size? The only idea I have had here is something around an Edgeworth series, to try to provide a more faithful PDF for the error distribution.
  2. If I wanted to work with the raw data (which is highly skewed) but it’s not clear that it obeys any simple parametric distribution (eg lognormal), how would you approach this? One option would be to try a number of parametric families and see which fits best (eg via WAIC). Alternatively, I could end up back at the Edgeworth series.
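For concreteness on question 1, a first-order Edgeworth correction to the normal density of a sample mean is easy to write down. This is only a sketch of the idea, with the population skewness assumed known (or estimated per bin):

```python
import math

def edgeworth_mean_pdf(x, mu, sigma, skew, n):
    """First-order Edgeworth approximation to the density of the mean
    of n iid draws with population mean mu, sd sigma, skewness skew."""
    se = sigma / math.sqrt(n)
    z = (x - mu) / se
    # Baseline normal density of the sample mean.
    phi = math.exp(-0.5 * z * z) / (se * math.sqrt(2.0 * math.pi))
    # He3 is the third (probabilists') Hermite polynomial.
    he3 = z**3 - 3.0 * z
    return phi * (1.0 + skew / (6.0 * math.sqrt(n)) * he3)
```

With skew = 0 this collapses to the usual normal approximation. One known caveat: the corrected "density" can go negative in the tails, which makes Edgeworth expansions awkward to use directly as likelihoods.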

I am really trying to avoid working with the raw data at all if possible, but it may not be possible.

Any thoughts welcomed!


Hi, sorry for not getting back to you earlier; the question is relevant and well written.

A few thoughts:

  • I am not completely sure why you are trying so hard to avoid working with the raw data - if performance is the issue, then at that sample size (unless you have a lot of predictors), even simple frequentist/maximum-likelihood tools, or approximations like INLA or ADVI/optimize in Stan, are quite likely to give you sensible results.
  • For determining a good response distribution, the population distribution is not what matters most - the important part is the distribution of y given your predictors. If you have good predictors, then even “stupid” distributions (e.g. normal / lognormal / gamma) can work pretty well even when the population distribution is very complex. I would advise against using loo/WAIC as the primary criterion for selecting a response distribution; focus instead on qualitative properties of the fit (e.g. via posterior predictive checks or residual plots). If those fail to give clear answers, then cross-validation criteria like loo/WAIC might be a sensible next step.
  • If bin is the only predictor and there are repeated values in y, you can drastically speed up your inference by tricks similar to [Case-study preview] Speeding up Stan by reducing redundant computation. Even doing something like

for(i in 1:k) {
  y[ bin_indices[i] ] ~ normal(mu[i], sigma[i]);
}

(i.e. extracting all y values from the same bin and calling normal() or another distribution with a scalar parameter) will likely lead to speedups. There are also speed-ups to be gained from normal_id_glm and similar functions that implement the full GLM in a single call.

  • The small number of data points in some bins is IMHO a relatively minor problem if the normal approximation holds: unless you have very few (say < 5), the standard error should account for most of it. If the normal is a bad approximation for the response, the low number of data points is likely a secondary concern.
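To illustrate the posterior-predictive-check suggestion, here is a minimal sketch of the mechanics in plain numpy. The "posterior draws" below are fabricated for illustration; in practice they would come from the fitted Stan model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skewed "observed" data standing in for one bin's y values.
y_obs = rng.lognormal(0.0, 1.0, size=200)

# Stand-ins for posterior draws of a lognormal response's parameters.
mu_draws = rng.normal(0.0, 0.05, size=500)
sigma_draws = np.abs(rng.normal(1.0, 0.05, size=500))

# Replicate the data under each draw and compare a test statistic
# (here the mean) between the replicates and the observation.
rep_means = np.array([
    rng.lognormal(m, s, size=y_obs.size).mean()
    for m, s in zip(mu_draws, sigma_draws)
])
ppp = (rep_means >= y_obs.mean()).mean()  # posterior predictive p-value
```

A p-value near 0 or 1 flags that the assumed response distribution cannot reproduce that feature of the data; for skewed y, repeating this with tail quantiles or the sample skewness as the test statistic is usually more revealing than the mean.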

Could you share what exactly y represents? Maybe somebody here has experience modelling this kind of data and can help you find a good model.

No pressure to disclose the reason, but that looks a bit suspicious - unless the randomization was uneven by design, how could you get uneven bins? Isn’t that a sign that there is additional structure in the data that should be taken into account?

Best of luck with your model!