We are modeling a cellular data experiment with Stan and BRMS. This pilot data has <10 subjects with measurements in each of ~3 conditions, but each measurement produces around 10,000 data points which have lognormal distribution. We cannot get the raw data but can get whatever quantiles we want from this distribution.
We would like to take advantage of the presence of these distributions and use them to inform posterior estimates (as opposed to our esteemed colleagues who take the means and run pairwise t tests). So we are running more or less the following standard error model (taking logs of various quantities removed for simplicity):
brm(formula = Mean | se(SE) ~ 1+Condition+(1 | Subject) ...)
Where Mean is the log mean signal from the cells of each Subject under each Condition, and SE is a conservative estimate of SE obtained by taking the larger of the two distances between the log-transformed 5 or 95 quantile and the mean (the distances are usually quite similar), over the number of samples.
I have grown suspicious of our approach for the following reason: Our colleagues have noticed that every pairwise t test they run is “significant” due to the 10K sample sizes. So they have switched to a likelihood ratio test between a model with the hypothesized effect by Condition and a null model, I guess because it strikes them as more believable (they are not statisticians).
With our BRMS standard error approach I am seeing something similar and it makes me skeptical. When the SE is not incorporated, the model has no strong evidence for an effect by Condition. When the SE is added, the HDIs for Condition are very tight and well above zero, so this is purely due to the inclusion of a tight standard error. Has this model sufficiently accounted for having a great many measurements one level, but sparse measurements on another? Is model comparison a better choice here for some reason?