Modelling small-sample, potentially skewed data with regression where CLT doesn't apply

Problem setup

Suppose I have run some experiment on a large number of people (say 10^6 for argument’s sake) in order to try to understand the relationship between some integer predictor (x) and a dependent variable (y).

People have been randomised into i \in [1,k] bins, each of which has a different value of the independent variable (x[i]). The randomisation means that I can assume that any measured relationship between x and y is driven by x itself (ie no confounding).

For reasons I can’t get into here, the distribution of people between the bins is highly uneven (some bins have many more people than other bins).

To try to collapse the problem down from a large number of data points to a manageable size, I calculate for each bin:

  • y_mean_obs as the mean of the observed y values within each bin
  • y_se_obs as the standard error of the observed y values within each bin.

If I define a parameter sigma_obs to represent any unexplained observational noise, then I can craft a linear model for the problem as follows:

for(i in 1:k){
   y_mean_obs[i] ~ normal(x[i] * beta + c, sqrt(square(y_se_obs[i]) + square(sigma_obs)));
}

…where the scale combines, in quadrature, the sampling uncertainty in the bin mean and the unexplained observational noise.

Provided that the sample size in each bin is large enough, the central limit theorem applies and so this model could be a good representation of the process.

Accounting for small sample size and skewness

However, in the present situation there are two problems:

  • The number of data points in each bin may be small
  • The population distribution of y is extremely skewed

Both of these combine to mean that my normal approximation for y_mean_obs may not be a good one.
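To get a feel for how slowly the skewness of the bin means decays, here is a quick Monte Carlo sketch. The lognormal(0, 1) population and the bin sizes are just illustrative assumptions, not part of the actual problem:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_skewness(n, draws=20000):
    # Sample skewness of the sampling distribution of the mean of n
    # lognormal(0, 1) observations (a stand-in for a skewed y).
    means = rng.lognormal(0.0, 1.0, size=(draws, n)).mean(axis=1)
    c = means - means.mean()
    return (c**3).mean() / (c**2).mean() ** 1.5

# The skewness of the mean decays only like 1/sqrt(n), so small bins
# retain much of the population skewness.
small_bin = mean_skewness(5)
large_bin = mean_skewness(500)
```

For this population the n = 5 bin means remain clearly skewed while the n = 500 bin means are close to symmetric, which is exactly the regime split described above.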

My questions:

  1. Is there any easy way to extend this model to account for the potential failure of the CLT at this sample size? The only idea I have had here is something around an Edgeworth series, to try to provide a more faithful PDF for the error distribution.
  2. If I wanted to work with the raw data (which is highly skewed) but it’s not clear that it obeys any simple parametric distribution (eg lognormal), how would you approach this? One option would be to try a number of parametric families and see which fits best (eg via WAIC). Alternatively, I could end up back at the Edgeworth series.
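For concreteness on question 1, a first-order Edgeworth correction to the normal density of a sample mean is easy to write down. This is only a sketch of the idea, with the population skewness assumed known (or estimated per bin):

```python
import math

def edgeworth_mean_pdf(x, mu, sigma, skew, n):
    """First-order Edgeworth approximation to the density of the mean
    of n iid draws with population mean mu, sd sigma, skewness skew."""
    se = sigma / math.sqrt(n)
    z = (x - mu) / se
    # Baseline normal density of the sample mean.
    phi = math.exp(-0.5 * z * z) / (se * math.sqrt(2.0 * math.pi))
    # He3 is the third (probabilists') Hermite polynomial.
    he3 = z**3 - 3.0 * z
    return phi * (1.0 + skew / (6.0 * math.sqrt(n)) * he3)
```

With skew = 0 this collapses to the usual normal approximation. One known caveat: the corrected "density" can go negative in the tails, which makes Edgeworth expansions awkward to use directly as likelihoods.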

I am really trying to avoid working with the raw data at all if possible, but it may not be possible.

Any thoughts welcomed!


Hi, sorry for not getting back to you earlier; the question is relevant and well written.

A few thoughts:

  • I am not completely sure why you are trying so hard to avoid working with the raw data - if performance is the issue, then at that sample size (unless you have a lot of predictors), even simple frequentist/maximum-likelihood tools, or approximations like INLA or ADVI/optimize in Stan, are quite likely to give you sensible results.
  • For determining a good response distribution, the population distribution is not what matters most - the important part is the distribution of y given your predictors. If you have good predictors, then even “stupid” distributions (e.g. normal / lognormal / gamma) can work pretty well even when the population distribution is very complex. I would advise against using loo/WAIC as the primary criterion for selecting a response distribution; focus instead on qualitative properties of the fit (e.g. via posterior predictive checks or residual plots). If those fail to give clear answers, then cross-validation criteria like loo/WAIC might be a sensible next step.
  • If bin is the only predictor and there are repeated values in y, you can drastically speed up your inference by tricks similar to [Case-study preview] Speeding up Stan by reducing redundant computation. Even doing something like

for(i in 1:k) {
  y[ bin_indices[i] ] ~ normal(mu[i], sigma[i]);
}

(i.e. extracting all y values from the same bin and calling normal() or another distribution with a scalar parameter) will likely lead to speedups. There are also speed-ups to be gained from normal_id_glm and similar functions that implement the full GLM in a single call.

  • The small number of data points in some bins is IMHO a relatively minor problem if the normal approximation holds: unless you have very few (say < 5), the standard error should account for most of it. If the normal is a bad approximation for the response, the low number of data points is likely a secondary concern.
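To illustrate the posterior-predictive-check suggestion, here is a minimal sketch of the mechanics in plain numpy. The "posterior draws" below are fabricated for illustration; in practice they would come from the fitted Stan model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skewed "observed" data standing in for one bin's y values.
y_obs = rng.lognormal(0.0, 1.0, size=200)

# Stand-ins for posterior draws of a lognormal response's parameters.
mu_draws = rng.normal(0.0, 0.05, size=500)
sigma_draws = np.abs(rng.normal(1.0, 0.05, size=500))

# Replicate the data under each draw and compare a test statistic
# (here the mean) between the replicates and the observation.
rep_means = np.array([
    rng.lognormal(m, s, size=y_obs.size).mean()
    for m, s in zip(mu_draws, sigma_draws)
])
ppp = (rep_means >= y_obs.mean()).mean()  # posterior predictive p-value
```

A p-value near 0 or 1 flags that the assumed response distribution cannot reproduce that feature of the data; for skewed y, repeating this with tail quantiles or the sample skewness as the test statistic is usually more revealing than the mean.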

Could you share what exactly y represents? Maybe somebody here has experience modelling this kind of data and can help you find a good model.

No pressure to disclose the reason, but that looks a bit suspicious - unless the randomization was uneven by design, how could you get uneven bins? Isn’t that a sign that there is additional structure in the data that should be taken into account?

Best of luck with your model!