Dear Stan community - I have a modeling problem that I would love to get some input on. I am working with experimental data on plant performance across environmentally variable sites, and am trying to fit a model along the lines of:
y_i \sim \text{Binomial}(n_i, p_i)
\text{logit}(p_i) = \alpha_{\text{sp}[i], \text{trt}[i]} + \beta_{\text{sp}[i], \text{trt}[i]} \, x_i + \gamma_{\text{site}[i]}
\gamma_{\text{site}} \sim \text{Normal}(0, \sigma)

where x is an environmental variable, \text{sp} = 1, 2, \cdots, 8 (species), \text{trt} = 1, 2 (treatment), and \text{site} = 1, 2, \cdots, 7 (site). I have ~10 replicate observations for each species \times treatment \times site combination, so \gamma is a site-level random intercept intended to account for the repeated measures (ideally \alpha_{\text{sp}, \text{trt}} might also vary by site, but I'm not sure if I have enough data for that). I'm using the following formula with brms:
y | trials(n) ~ 0 + sp:trt + sp:trt:x + (1 | site)
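For context, the full call looks roughly like this (a sketch; the data frame `dat`, its column names, and the priors shown here are placeholders rather than my exact setup):

```r
library(brms)

# dat has one row per observation: successes y out of n trials,
# factors sp (8 levels), trt (2 levels), site (7 levels),
# and the site-level covariate x
fit1 <- brm(
  y | trials(n) ~ 0 + sp:trt + sp:trt:x + (1 | site),
  data   = dat,
  family = binomial(link = "logit"),
  prior  = c(
    prior(normal(0, 2), class = b),     # placeholder priors
    prior(exponential(1), class = sd)
  ),
  chains = 4, cores = 4
)
```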
The problem is that this model samples quite inefficiently, with ~10% of iterations hitting the max treedepth of 10 and up to ~1% divergences. The divergences appear to be concentrated at large values of \sigma. In fact, this happens even with just `y | trials(n) ~ (1 | site)`, seemingly related to strong posterior correlation between the global intercept and the random intercepts. These issues are slightly alleviated by rewriting the formula as `y | trials(n) ~ sp*trt*x + (1 | site)`, but they are not entirely resolved, and if possible I would much prefer the index-like parameterization for setting priors. The model works very well if `(1 | site)` is dropped, but then I have a pseudoreplication problem. My first question is: why might the inclusion of site as a random intercept be so problematic for this model, and is there any way I can improve the efficiency and/or validity of the model?
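The intercept/random-intercept correlation I mention above is what shows up in a pairs plot of the intercept-only fit; something like the following (a sketch, assuming that fit is stored in `fit0`; the parameter names follow brms defaults, and the `r_site[...]` names depend on how the site factor levels are labeled):

```r
library(bayesplot)

# joint posterior of the global intercept, the group-level sd,
# and a couple of the site-level offsets, with divergences marked
post <- as.array(fit0)
mcmc_pairs(
  post,
  pars = c("b_Intercept", "sd_site__Intercept",
           "r_site[1,Intercept]", "r_site[2,Intercept]"),
  np   = nuts_params(fit0)
)
```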
As I was working through this model, it also occurred to me that x is measured at the site level, such that there is a perfect correspondence between x and \text{site}. I'm not sure if this has anything to do with the issues described above, but I'm wondering if the data might be better represented by a multilevel model along the lines of:
Observation-level model:

y_i \sim \text{Binomial}(n_i, p_i)
\text{logit}(p_i) = \gamma_{\text{sp}[i], \text{trt}[i], \text{site}[i]}

Site-level model:

\gamma_{\text{sp}, \text{trt}, \text{site}} \sim \text{Normal}(\alpha_{\text{sp}, \text{trt}} + \beta_{\text{sp}, \text{trt}} \, x_{\text{site}}, \, \sigma)
The idea is to have an observation-level model that estimates the mean (\gamma) for each species \times treatment \times site combination, coupled with a site-level model that predicts \gamma according to the environmental variable (x) measured at each site. I'm not sure if this model can be specified with brms, but I tried implementing it with cmdstanr and it seems to sample quite well (although the ESS for \sigma was a little low). Thus, my second question is: does either of these models make more or less sense, and why might they be so different in their sampling performance? Both models are similar in their mean/median predictions, but differ somewhat in the uncertainty around those predictions. My (perhaps naive) intuition for this sort of data was to go with something like the first model, but I am interested in hearing what others think about the second model in comparison. Thank you!
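For reference, the cmdstanr implementation is roughly along the following lines (a simplified sketch: the priors and the data list `stan_data` are placeholders rather than my actual choices):

```r
library(cmdstanr)

stan_code <- "
data {
  int<lower=1> N;                        // observations
  int<lower=1> N_sp;                     // species
  int<lower=1> N_trt;                    // treatments
  int<lower=1> N_site;                   // sites
  array[N] int<lower=0> n;               // trials
  array[N] int<lower=0> y;               // successes
  array[N] int<lower=1, upper=N_sp> sp;
  array[N] int<lower=1, upper=N_trt> trt;
  array[N] int<lower=1, upper=N_site> site;
  vector[N_site] x;                      // site-level covariate
}
parameters {
  matrix[N_sp, N_trt] alpha;                  // intercepts of the site-level model
  matrix[N_sp, N_trt] beta;                   // slopes on x
  array[N_sp, N_trt] vector[N_site] gamma;    // sp x trt x site means (logit scale)
  real<lower=0> sigma;
}
model {
  // placeholder priors
  to_vector(alpha) ~ normal(0, 2);
  to_vector(beta) ~ normal(0, 1);
  sigma ~ exponential(1);

  // site-level model: gamma predicted from the site covariate
  for (s in 1:N_sp)
    for (t in 1:N_trt)
      gamma[s, t] ~ normal(alpha[s, t] + beta[s, t] * x, sigma);

  // observation-level model
  for (i in 1:N)
    y[i] ~ binomial_logit(n[i], gamma[sp[i], trt[i], site[i]]);
}
"

mod <- cmdstan_model(write_stan_file(stan_code))
# fit <- mod$sample(data = stan_data, chains = 4, parallel_chains = 4)
```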