Model specification with random intercept for group vs. predicting group-level mean

Dear Stan community - I have a modeling problem that I would love to get some input on. I am working with experimental data on plant performance across environmentally variable sites, and am trying to fit a model along the lines of:

y \sim \text{Binomial}(n, \theta) \\
\text{logit}(\theta) = \alpha_{\text{sp}, \text{trt}} + \beta_{\text{sp}, \text{trt}} \times x + \gamma_{\text{site}} \\
\alpha, \beta \sim \text{Normal}(0, 2) \\
\gamma \sim \text{Normal}(0, \sigma) \\
\sigma \sim \text{Exponential}(1)

where x is an environmental variable, \text{sp} = 1, 2, \cdots, 8 (species), \text{trt} = 1, 2 (treatment), and \text{site} = 1, 2, \cdots, 7 (site). I have ~10 replicate observations for each species \times treatment \times site combination, so \gamma is a site-level random intercept intended to represent repeated measures (ideally \alpha_{\text{sp}, \text{trt}} might vary by site, but I’m not sure if I have enough data for this). I’m using the following formula with brms:

y | trials(n) ~ 0 + sp:trt + sp:trt:x + (1 | site)
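
Expanded into a full brm() call, this is roughly what I’m running (the data frame name dat is a placeholder; the priors correspond to the Normal(0, 2) and Exponential(1) terms in the model above):

library(brms)

# roughly the brm() call for the first model; `dat` is a placeholder name for
# a data frame with columns y, n, sp, trt, x, site (sp/trt/site as factors)
fit1 <- brm(
  y | trials(n) ~ 0 + sp:trt + sp:trt:x + (1 | site),
  data   = dat,
  family = binomial(),
  prior  = c(
    prior(normal(0, 2), class = b),    # alpha and beta coefficients
    prior(exponential(1), class = sd)  # sigma for the site intercepts
  )
)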

The problem is that this model samples quite inefficiently, with ~10% of iterations hitting max treedepth of 10 and up to ~1% divergences. Divergences appear to be concentrated towards large values of \sigma. In fact, this happens even with just y | trials(n) ~ (1 | site), seemingly related to strong posterior correlation between the global intercept and random intercepts. These issues are slightly alleviated by rewriting the formula as y | trials(n) ~ sp*trt*x + (1 | site), but are not entirely resolved, and if possible I would much prefer the index-like parameterization for setting priors. The model works very well if (1 | site) is dropped, but then I have a pseudoreplication problem. My first question is: Why might the inclusion of site as a random intercept be so problematic for this model, and is there any way that I can try to improve the efficiency and/or validity of the model?

As I was working through this model, it also occurred to me that x is measured at the site level, such that there is a perfect correspondence between x and \text{site}. I’m not sure whether this has anything to do with the issues described above, but I’m wondering if the data might be better represented by a multilevel model along the lines of:

Observation-level model:

y \sim \text{Binomial}(n, \theta) \\
\text{logit}(\theta) = \gamma_{\text{sp}, \text{trt}, \text{site}}

Site-level model:

\gamma \sim \text{Normal}(\mu, \sigma) \\
\mu = \alpha_{\text{sp}, \text{trt}} + \beta_{\text{sp}, \text{trt}} \times x_{\text{site}} \\
\alpha, \beta \sim \text{Normal}(0, 2) \\
\sigma \sim \text{Exponential}(1)

The idea is to have an observation-level model that estimates the mean (\gamma) for each species \times treatment \times site combination, coupled with a site-level model that predicts \gamma from the environmental variable (x) measured at each site. I’m not sure if this model can be specified with brms, but I tried implementing it with cmdstanr and it seems to sample quite well (although ESS for \sigma was a little low). Thus, my second question is: Does either of these models make more sense than the other, and why might they be so different in their sampling performance? Both models are similar in their mean/median predictions, but differ somewhat in the uncertainty around those predictions. My (perhaps naive) intuition for this sort of data was to go with something like the first model, but I am interested in hearing what others think about the second model in comparison. Thank you!
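
One supplementary thought: if the second model can be specified in brms at all, my untested guess is that it would amount to giving each species \times treatment \times site cell its own intercept around the \alpha + \beta x mean, i.e. something like

y | trials(n) ~ 0 + sp:trt + sp:trt:x + (1 | sp:trt:site)

though I haven’t checked whether this actually corresponds to the model above.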

To supplement the above post, here is Stan code for the second, multilevel model:

data {
  // observation-level data
  int<lower=1> n; // number of observations
  int<lower=1> n_sp; // number of species
  int<lower=1> n_trt; // number of treatments
  int<lower=1> n_site; // number of sites
  array[n] int<lower=1, upper=n_sp> sp; // species ID
  array[n] int<lower=1, upper=n_trt> trt; // treatment ID
  array[n] int<lower=1, upper=n_site> site; // site ID
  array[n] int<lower=0> t; // number of trials
  array[n] int<lower=0> y; // outcome
  // site-level data
  vector[n_site] x; // environmental variable
}

parameters {
  array[n_sp, n_trt] real site_intercept;
  array[n_sp, n_trt] real site_b_x;
  real<lower=0> site_sd_intercept;
  array[n_sp, n_trt, n_site] real z_intercept;
}

transformed parameters {
  // site-level model
  array[n_sp, n_trt, n_site] real obs_intercept;
  for (i in 1:n_sp) {
    for (j in 1:n_trt) {
      for (k in 1:n_site) {
        real mu = site_intercept[i, j] + site_b_x[i, j] * x[k];
        obs_intercept[i, j, k] = mu + z_intercept[i, j, k] * site_sd_intercept;
      }
    }
  }
}

model {
  // site-level model
  to_array_1d(site_intercept) ~ normal(0, 2);
  to_array_1d(site_b_x) ~ normal(0, 2);
  to_array_1d(z_intercept) ~ normal(0, 1);
  site_sd_intercept ~ exponential(1);
  // observation-level model
  vector[n] theta;
  for (i in 1:n) {
    theta[i] = inv_logit(obs_intercept[sp[i], trt[i], site[i]]);
  }
  y ~ binomial(t, theta);
}

generated quantities {
  vector[n] log_lik;
  array[n] int yrep;
  {
    vector[n] theta;
    for (i in 1:n) {
      theta[i] = inv_logit(obs_intercept[sp[i], trt[i], site[i]]);
      log_lik[i] = binomial_lpmf(y[i] | t[i], theta[i]);
    }
    yrep = binomial_rng(t, theta);
  }
}
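
For completeness, the data list and fitting call on the R side look roughly like this (object and file names are placeholders; sp, trt, and site are assumed to be coded as consecutive integers):

library(cmdstanr)

# data list matching the data block above; `dat` is the observation-level
# data frame and `site_x` the length-n_site vector of x values (placeholders)
stan_data <- list(
  n      = nrow(dat),
  n_sp   = max(dat$sp),
  n_trt  = max(dat$trt),
  n_site = max(dat$site),
  sp     = dat$sp,
  trt    = dat$trt,
  site   = dat$site,
  t      = dat$trials,
  y      = dat$y,
  x      = site_x
)

mod <- cmdstan_model("multilevel_model.stan")
fit <- mod$sample(data = stan_data, chains = 4, parallel_chains = 4)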

In addition to the questions in the above post, I’m also unsure whether I’m calculating log_lik (for use with loo) and yrep (for posterior predictive checks) correctly. As I’ve set things up here, the site-level model plays no direct role in generating these quantities. Perhaps I could draw obs_intercept from the site-level model’s normal distribution when computing yrep, but I’m unsure how to think about the analogous choice when computing log_lik for loo.
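
To make that concrete, here is a rough sketch (reusing the placeholder fit and stan_data objects from the fitting sketch above) of drawing obs_intercept from the site-level normal when generating yrep; I’m not at all confident this is the right thing to do, and I don’t see what the analogue would be for log_lik:

library(posterior)

# posterior draws of the site-level parameters
post    <- as_draws_matrix(fit$draws(c("site_intercept", "site_b_x",
                                       "site_sd_intercept")))
n_draws <- nrow(post)

# yrep with a *fresh* site-level intercept drawn for each sp x trt x site cell,
# rather than the fitted obs_intercept
yrep_new <- matrix(NA_integer_, nrow = n_draws, ncol = stan_data$n)
for (i in seq_len(stan_data$n)) {
  idx   <- paste0("[", stan_data$sp[i], ",", stan_data$trt[i], "]")
  a     <- post[, paste0("site_intercept", idx)]
  b     <- post[, paste0("site_b_x", idx)]
  sigma <- post[, "site_sd_intercept"]
  gamma_new     <- rnorm(n_draws, mean = a + b * stan_data$x[stan_data$site[i]],
                         sd = sigma)
  yrep_new[, i] <- rbinom(n_draws, size = stan_data$t[i],
                          prob = plogis(gamma_new))
}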