I have what I think is a parallel problem, one that is very prevalent in biomedical research.
I want to analyse a multi-level data-set that includes both nested and crossed random effects. The data come from an experiment that measures the intensity of a marker in individual living cells. We observe cells in “subplots” (microscope fields, each of which may show several cells), which are samples of plots (culture dishes where the cells grow, usually containing hundreds of cells). We measure the marker intensity (in individual cells, nested in subplots, within plots) at different times. The cells come from different people (blocks), and we sometimes measure a person’s cells at several times, so there is partial overlap between people and times. The cells measured in a given dish for a given person are not the same from one time to the next.
It’s a complicated, unbalanced design, but it reflects the realities of when cells become available and how many of them grow at different times, in different culture dishes and different microscope fields (none of which we control). As I say, complex data of this kind are very common in biomedical research. For now, we ignore differences between microscope fields (subplots).
There are very many cells in the total study (~200k), and some individual microscope fields have no cells, so estimating the full model on all the data with the zero-inflated negative binomial family is practically impossible: it takes several hours to become reasonably stable using variational Bayes (VB), and even when I used the VB estimates as starting values for estimation with NUTS, the warm-up phase ran for 72 hours without reporting any progress at all.
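(For reference, the full cell-level model I attempted was along the following lines; the variable and data-frame names here are illustrative, not my exact code.)

```r
library(brms)

# Sketch of the (practically infeasible) full cell-level model:
# dishes nested in people, fields nested in dishes, times crossed with people.
fit_full <- brm(
  intensity ~ 1 + (1 | Person) +
    (1 | Person:culture_dish) +
    (1 | Person:culture_dish:field) +
    (1 | time),
  data      = cell_data,                    # ~200k rows, one per cell
  family    = zero_inflated_negbinomial(),
  algorithm = "meanfield"                   # the VB run mentioned above
)
```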
Since I cannot analyse the raw data, I wondered about using a meta-analysis approach, in line with the method suggested previously in this thread. Specifically, I obtained the mean and standard deviation of the intensity in each culture dish at each time, along with the number of cells whose intensity we measured. After this aggregation, the data look something like this:
| Person | culture_dish | time | Total_N_of_cells | mean_intensity_per_cell | sd_of_intensity_per_cell |
|--------|--------------|------|------------------|-------------------------|--------------------------|
| 1      | 1            | 1    | 123              | 3.5                     | 1.1                      |
| 1      | 2            | 2    | 322              | 4.2                     | 1.5                      |
| 2      | 1            | 2    | 126              | 2.1                     | 0.6                      |
| 2      | 2            | 1    | 139              | 1.4                     | 0.4                      |
etc. (there are roughly 200 cells per dish at each time, so the aggregated data-set has about 1000 observations).
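(For concreteness, the aggregation step looks something like this, assuming a cell-level data frame cell_data with one row per cell; the column names are illustrative.)

```r
library(dplyr)

# Collapse the ~200k cell-level rows to one row per person x dish x time.
dish_data <- cell_data %>%
  group_by(Person, culture_dish, time) %>%
  summarise(
    Total_N_of_cells         = n(),
    mean_intensity_per_cell  = mean(intensity),
    sd_of_intensity_per_cell = sd(intensity),
    .groups = "drop"
  )
```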
I want to estimate mean_intensity_per_cell for each person, taking account, as far as possible, of the variation between culture dishes and times. I take it that the problem is formally similar to XiaZhu111’s question; that is, it seems reasonable (to me) to conceptualise the design as a meta-analysis in which each person constitutes an ‘experiment’. The main difference from XiaZhu111’s original post is that here every culture dish contains cells, so every dish has a real, positive standard deviation for mean_intensity_per_cell. Given that there are no standard deviations of zero, and following Paul Buerkner’s post (above), I wrote the following brmsformula to separate the effects of individual people from the other design factors:
mean_intensity_per_cell | se(sd_of_intensity_per_cell / sqrt(Total_N_of_cells + 1)) + weights(Total_N_of_cells) ~ 1 + (1 | Person) + (1 | culture_dish) + (1 | time), family = gaussian(link = "log")
(I use family = gaussian(link = "log") for three reasons: (1) se() is not available with family = lognormal(); (2) the distribution of mean_intensity_per_cell is in fact right-skewed; (3) Paul Buerkner used it in his previous post.)
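In full, the call I have in mind is something like the following (assuming brms accepts this combination of se() and weights() addition terms, which is part of what I am asking; the settings are placeholders, not a definitive specification):

```r
library(brms)

# Dish-level meta-analytic model, fitted to the aggregated data above.
fit_meta <- brm(
  mean_intensity_per_cell |
    se(sd_of_intensity_per_cell / sqrt(Total_N_of_cells + 1)) +
    weights(Total_N_of_cells) ~
    1 + (1 | Person) + (1 | culture_dish) + (1 | time),
  data   = dish_data,
  family = gaussian(link = "log"),
  cores  = 4
)
```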
So, now I would like to ask:
- Does this formula seem appropriate to account for both the variation in mean_intensity_per_cell and the differing Total_N_of_cells between culture dishes?
- Is it reasonable to use family = gaussian(link = "log") to account for the skewness of the data (mean_intensity_per_cell), or would it be better to take log(mean_intensity_per_cell) and use family = gaussian() or family = student()?
- If I use log(mean_intensity_per_cell), should I also log-transform its standard error and the Total_N_of_cells? (See the sketch after this list for the kind of transformation I mean.)
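For the log-scale alternative, what I have in mind is a delta-method conversion of the dish-level standard errors, roughly as follows (my own rough derivation, so please correct me if it is wrong):

```r
library(dplyr)

# Delta method: Var(log m) ~= Var(m) / m^2, so se(log m) ~= se(m) / m.
dish_data <- dish_data %>%
  mutate(
    log_mean_intensity = log(mean_intensity_per_cell),
    se_mean            = sd_of_intensity_per_cell / sqrt(Total_N_of_cells),
    se_log_mean        = se_mean / mean_intensity_per_cell
  )

# which would then be used as, e.g.:
# log_mean_intensity | se(se_log_mean) + weights(Total_N_of_cells) ~
#   1 + (1 | Person) + (1 | culture_dish) + (1 | time)
# with family = gaussian()  (identity link)
```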
I suspect that many biomedical researchers have data similar to those I describe here. Figuring out the best way to analyse such complex, unbalanced and numerous data (initial N ~200k, or more) could therefore be very helpful to many other researchers.
With many thanks, in anticipation of your help