Multilevel model implementation - tricks to solve low neff, low E-BFMI, mildly high Rhat?

Hi, I’m looking for suggestions on how to tackle low eff/low E-BFMI/Rhat issues in a multilevel model in an organized manner. I describe below my data and objective and what I have tried so far. Shall I try different prior distributions? A mixture of centered and non-centered parameterization? Is my model or code even wrong to start with? Is it due to the data?

Data: ecological dataset with multiple levels related to the controlling variables of interest (“site” and “species”) and the sampling design (“plot”). On each of the 50 sites, there were 4 plots, on each plot we measured 3 species. The measurement is on a real scale (could be anything from -15 to +15), I standardized it for modeling purposes (center-scaled).

Objective: to check the effects of site and species. I expect “plot” to have no effect, it’s replication at the site scale.

Tried: the following multilevel model with centered parameterization, where \bar{\alpha} is the grand mean, \alpha the deviation from the grand mean due to site, and \beta the deviation due to species. Then \alpha + \beta would provide the expected distribution for the measured variable for a given species at a given site.

data {
  int<lower=0> N;               // number of observations
  int<lower=0> N_site;          // number of sites
  int<lower=0> N_species;          // number of species
  int<lower=0> N_plot;          // number of plots
  array[N] int<lower=1> site;   // vector of site idx
  array[N] int<lower=1> species;   // vector of species idx
  array[N] int<lower=1> plot;   // vector of plot idx
  vector[N] y_obs;              // vector of observed d15N values
}

parameters {
  vector[N_site] alpha;
  real alpha_bar;
  real<lower=0> sigma_alpha;
  vector[N_species] beta;
  real<lower=0> sigma_beta;
  vector[N_plot] gamma;
  real<lower=0> sigma_gamma;
  real<lower=0> sigma;
}

model {
  
  // adaptive priors
  target += normal_lpdf(alpha | alpha_bar, sigma_alpha);
  target += normal_lpdf(beta | 0, sigma_beta);
  target += normal_lpdf(gamma | 0, sigma_gamma);
  target += exponential_lpdf(sigma | 1);
  
  // hyper-priors
  target += normal_lpdf(alpha_bar | 0, 10);
  target += exponential_lpdf(sigma_alpha | 1);
  target += exponential_lpdf(sigma_beta | 0.1);
  target += exponential_lpdf(sigma_gamma | 0.5);
  
  // likelihood
  target += normal_lpdf(y_obs | alpha[site] + beta[species] + gamma[plot], sigma);
  
}

generated quantities {
  
  // posterior predictive distribution for replications y_rep of the original data set y given model parameters
  array[N] real y_obs_rep = normal_rng(alpha[site] + beta[species] + gamma[plot], sigma);
  
  // compute log-lik
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y_obs[i] | alpha[site[i]] + beta[species[i]] + gamma[plot[i]], sigma);
  }
  
}

This centered parameterization leads to low E-BFMI, pretty bad “mcmc nuts energy” plots, highly correlated parameter estimates, and I can see some “departures” (not sure how to call that) in the trace and rank plots… The model reproduce the data pretty well though. I played a bit with the priors and they affect the spread of the posteriors, reasonably informative priors from expect knowledge help reducing issues but do not solve them all.

I then tried the non-centered parameterization below… but it creates divergences, persistent with increasing adapt_delta, so I assume it’s even worse.

data {
  int<lower=0> N;               // number of observations
  int<lower=0> N_site;          // number of sites
  int<lower=0> N_species;          // number of species
  int<lower=0> N_plot;          // number of plots
  array[N] int<lower=1> site;   // vector of site idx
  array[N] int<lower=1> species;   // vector of species idx
  array[N] int<lower=1> plot;   // vector of plot idx
  vector[N] y_obs;              // vector of observed d15N values
}

parameters {
  real alpha_bar;
  vector[N_site] a;
  vector[N_species] b;
  vector[N_plot] g;
  real<lower=0> sigma_alpha;
  real<lower=0> sigma_beta;
  real<lower=0> sigma_gamma;
  real<lower=0> sigma;
}

model {
  
  // priors
  target += normal_lpdf(alpha_bar | 0, 1);
  target += normal_lpdf(a | alpha_bar, sigma_alpha);
  target += normal_lpdf(b | 0, 1);
  target += normal_lpdf(g | 0, 1);
  target += exponential_lpdf(sigma_alpha | 1);
  target += exponential_lpdf(sigma_beta | 1);
  target += exponential_lpdf(sigma_gamma | 1);
  target += exponential_lpdf(sigma | 1);
  
  // likelihood
  target += normal_lpdf(y_obs | alpha_bar + a[site] * sigma_alpha + b[species] * sigma_beta + g[plot] * sigma_gamma, sigma);
  
}

generated quantities {
  
  // recover "decentered" parameters
  vector[N_site] alpha = alpha_bar + a * sigma_alpha;
  vector[N_species] beta = b * sigma_beta;
  vector[N_plot] gamma = g * sigma_gamma;
  
  // posterior predictive distribution for replications y_rep of the original data set y given model parameters
  array[N] real y_obs_rep = normal_rng(alpha_bar + a[site] * sigma_alpha + b[species] * sigma_beta + g[plot] * sigma_gamma, sigma);
  
  // compute log-lik
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y_obs[i] | alpha[site[i]] + beta[species[i]] + gamma[plot[i]], sigma);
  }
  
}

Hello,

One thing that can help folks answer this is the raw data going into the model (if possible) or simulated data. Writing out the full model can also be helpful (either plain language or mathematical symbols or both!). You have a similar experimental setup to what we use here at work and we use student_t priors on site effects.

Thank you @Ara_Winter for the suggestions to make better posting! I actually wanted to edit my post to make it shorter and more generic (and correct priors in the centered Stan code), but it ended up I could not edit, so here I am (lesson learned!).

Here is the model I envisioned for this post, in it’s centered parameterization:

\begin{aligned} y_{obs} &\sim Normal(\alpha_{site} + \beta_{species} + \gamma_{plot}, \sigma)\\ \alpha &\sim Normal(\overline{\alpha}, \sigma_{\alpha})\\ \beta &\sim Normal(0, \sigma_{\beta})\\ \gamma &\sim Normal(0, \sigma_{\gamma})\\ \overline{\alpha} &\sim Normal(0, 10)\\ \sigma_{\alpha} &\sim Exponential(1)\\ \sigma_{\beta} &\sim Exponential(1)\\ \sigma_{\gamma} &\sim Exponential(1)\\ \sigma &\sim Exponential(1) \end{aligned}

Here is some simulated data in R:

set.seed(879)

## set population parameters
alpha_bar <- 0.0
sigma <- 0.4
sigma_alpha <- 0.9
sigma_beta <- 0.7
sigma_gamma <- 0.1

## number of sites, plots per site, plots (total), species, and 
## observations (each species is observed at each plot of each site)
n_sites <- 50
n_plots_p_site <- 4
n_plots <- n_sites * n_plots_p_site
n_species <- 3
n_obs <- n_sites * n_plots_p_site * n_species

## simulate intercepts
alpha_site <- rnorm(n_sites, mean = alpha_bar, sd = sigma_alpha)
beta_species <- rnorm(n_species, 0, sigma_beta)
gamma_plot <- rnorm(n_plots_p_site*n_sites, 0, sigma_gamma)

## sequence of length n_obs for site id, plot id, and species id
site_id <- rep(1:n_sites, each = n_plots_p_site * n_species)
species_id <- rep(1:n_species, times = n_plots)
plot_id <- rep(1:n_plots, each = n_species)

## vectors of length n_obs for alpha, beta, and gamma
alpha <- alpha_site[site_id]
beta <- beta_species[species_id]
gamma <- gamma_plot[plot_id]

## simulate observations
y_sim <- rnorm(n_obs, alpha + beta + gamma, sigma)

## plot data
plot(y_sim ~ site_id, col = species_id)
abline(h = mean(y_sim))
abline(h = c(mean(y_sim)+sd(y_sim), mean(y_sim)-sd(y_sim)), lty = 2)

Here is the centered parameterization of the model (mcp.stan). Note I corrected the priors from the initial post.

data {
  int<lower=0> N;                // number of observations
  int<lower=0> N_site;           // number of sites
  int<lower=0> N_species;        // number of species
  int<lower=0> N_plot;           // number of plots
  array[N] int<lower=1> site;    // vector of site id
  array[N] int<lower=1> species; // vector of species id
  array[N] int<lower=1> plot;    // vector of plot id
  vector[N] y_obs;               // vector of observations
}

parameters {
  vector[N_site] alpha;
  real alpha_bar;
  real<lower=0> sigma_alpha;
  vector[N_species] beta;
  real<lower=0> sigma_beta;
  vector[N_plot] gamma;
  real<lower=0> sigma_gamma;
  real<lower=0> sigma;
}

model {
  
  // adaptive priors
  target += normal_lpdf(alpha | alpha_bar, sigma_alpha);
  target += normal_lpdf(beta | 0, sigma_beta);
  target += normal_lpdf(gamma | 0, sigma_gamma);
  
  // hyper-priors
  target += normal_lpdf(alpha_bar | 0, 10);
  target += exponential_lpdf(sigma_alpha | 1);
  target += exponential_lpdf(sigma_beta | 1);
  target += exponential_lpdf(sigma_gamma | 1);
  target += exponential_lpdf(sigma | 1);
  
  // likelihood
  target += normal_lpdf(y_obs | alpha[site] + beta[species] + gamma[plot], sigma);
  
}

generated quantities {
  
  // posterior predictive distribution for replications y_rep of the original data set y given model parameters
  array[N] real y_obs_rep = normal_rng(alpha[site] + beta[species] + gamma[plot], sigma);
  
  // log-lik
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y_obs[i] | alpha[site[i]] + beta[species[i]] + gamma[plot[i]], sigma);
  }
  
}

Here is how I run the model with cmdstanr:

library(cmdstanr)

dlist <- list(
  N = n_obs, 
  N_site = n_sites,
  N_species = n_species, 
  N_plot = n_plots, 
  site = site_id, 
  plot = plot_id,
  species = species_id, 
  y_obs = y_sim
)

mod <- cmdstan_model("mcp.stan")

fit <- mod$sample(
  data = dlist, 
  chains = 4, 
  parallel_chains = 4,
  refresh = 1000
)

I am going to try the Student-t prior on the site effect now.

1 Like

No better diagnostic so far hard-setting \nu at 3 or 7 in:

\begin{aligned} y_{obs} &\sim Normal(\mu, \sigma)\\ \mu &= \alpha_{site} + \beta_{species} + \gamma_{plot}\\ \alpha &\sim Student(\nu, \overline{\alpha}, \sigma_{\alpha})\\ \beta &\sim Normal(0, \sigma_{\beta})\\ \gamma &\sim Normal(0, \sigma_{\beta})\\ \overline{\alpha} &\sim Normal(0, 10)\\ \sigma_{\alpha}, \sigma_{\beta}, \sigma_{\gamma}, \sigma &\sim Exponential(1)\\ \end{aligned}

And then giving \nu a prior as following:

\begin{aligned} y_{obs} &\sim Normal(\mu, \sigma)\\ \mu &= \alpha_{site} + \beta_{species} + \gamma_{plot}\\ \alpha &\sim Student(\nu, \overline{\alpha}, \sigma_{\alpha})\\ \nu &\sim Gamma(2, 0.1)\\ \beta &\sim Normal(0, \sigma_{\beta})\\ \gamma &\sim Normal(0, \sigma_{\beta})\\ \overline{\alpha} &\sim Normal(0, 10)\\ \sigma_{\alpha}, \sigma_{\beta}, \sigma_{\gamma}, \sigma &\sim Exponential(1)\\ \end{aligned}

Was it somewhat what you suggested @Ara_Winter?

Let me get this up and running on my end so I can take a look. My general approach would be to pair the model down and see where the errors are creeping in.

But first what do your raw data plots look like? You don’t need to show them here but does the model you’ve picked support what your data is showing?

So how does the model behave when you do not include the plot effect?

Indeed, I have been pairing down the model quite a bit (the ultimate model has more levels). Actually, issues arise as soon as I introduce the species (y ~ \alpha + \beta), while it goes well with only \alpha.

The simulated data looks pretty similar to the real one plotted that way (the dispersion due to species is a bit higher in the simulated data though) and, so far, the model behave in a very similar way on both datasets. Of course, the real data has more “structure” as site represents a gradient that we aim to study through (non-)linear regressions, but for the example here I ignore that and only try to reproduce means (grand mean, deviation due to site, deviation due to species if that is a correct way to interpret \overline{\alpha}, \alpha, and \beta).

1 Like

What do you gain by trying to learn the parameters of the regularizing prior over species with just three species to learn from? What happens if you replace the hierarchical prior over the species with a non-hierarchical regularizing prior?

Thanks @jsocolar. Indeed, I have tried setting a non-hierarchical prior for species and the situation seemed even worse (Rhat > 1.6, still low E-BFMI, additional treedepth limit issue).

I guess that’s what ecologists would call a “mixed” model, with species as a “fixed” effect and site and plot as “random” effects… I have always been a bit confused by this terminology (+ I don’t know what are the maths behind lme4 for example, so it doesn’t help); hence, I would like to better get the rationale behind having a hierrachical vs non-hierarchical prior for a factor such as species in a Bayesian framework. I was thinking about letting the intercept vary by species. My idea to interpret the initial model was: there is a grand mean of all observations (\overline{\alpha}, which is 0 as I centered the observations), then every observation deviates from this grand mean because of the site (\alpha), the plot (\gamma), and the species (\beta).

Also, your comment makes me think that I could implement the model differently if I had more levels within species. Is that correct? Why is that?

I went back to textbooks and got the answer to my last question about whether we would implement hierarchical or non-hierarchical priors (a “random” effect or a “fixed” effect) depending on the number of levels measured for the said factor. That is because we assume these levels to be draws from a distribution of possible levels, a few draws are unlikely to be a good sample of this distribution. But what if the factor is something like size and we have both extremes of the range and something in the middle (3 levels: smallest, medium, largest)? Would a hierarchical/“random” implementation be more reasonable than the alternative?

Sorry in the middle of back to back field weeks. Will be a little slow responding here.

So you have data like this:


and this

One thing you can do is throw this into the brms package (I often do this as a double ands triple check) on my models, like this

model1 <- bf(y_sim~ (1|site_id))

k_fit_model1 <- brm(model1, 
                  data=y_sim_df, family = "gaussian",
                  cores=6, chains = 4, iter = 2000,
                  control = list(adapt_delta=0.99, max_treedepth = 12))
summary(k_fit_model1)

make_stancode(y_sim~ (1|site_id),
              data = y_sim_df, family = "gaussian")

model2 <- bf(y_sim~ (1|site_id) + (1|species))

k_fit_model2 <- brm(model2, 
                    data=y_sim_df, family = "gaussian",
                    cores=6, chains = 4, iter = 2000,
                    control = list(adapt_delta=0.99, max_treedepth = 12))
summary(k_fit_model2)

make_stancode(y_sim~ (1|site_id) + (1|species_id),
              data = y_sim_df, family = "gaussian")

and you will see the second model will throw the same errors that you are getting. But it’s worth comparing the structures of the brms out with your own code.

1 Like

Yep, brms uses the non-centered parameterization.

1 Like

So when I run your model as it is with the simulated data, right off the bat chain 1 is throwing this error:

Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/var/folders/t2/fvd9y0l92_gbtz2914llq_rc0000gn/T/RtmpLTHXnl/model-76883929b410.stan', line 29, column 2 to column 48)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Does that happen on your end as well?

Indeed, it happens on the first iteration of each chain (btw, I very often see that message with models I fit to the type of data I work with). Not sure what to get out of it, so far I have assumed I was in the case of the “warning occurs sporadically... then the sampler is fine” but I might be wrong!

Better late than never… Just had an “off course!” revelation reading that thread (Low effective sample size for random effects only - #5 by mike-lawrence) about my model with three factors (\mu = \alpha_{site} + \beta_{species} + \gamma_{plot}): each combination of levels has only one observation! Hence, we have to make a choice and take one factor out. In my example case, we can assume plot (which can be seen as intra-site spatial variability) is the least important factor.

The issues with the two factors model (\mu = \alpha_{site} + \beta_{species}), which has 4 observations per combination of levels, remain to be solved: high Rhats (>1.15), low ESS (< 35 for \alpha and \beta), high correlation between sampled parameters, poor trace plots… with the default 4 chains with 1000 iterations.

1 Like

I finally had time to come back to this post. I think this is a case of non-identifiability. The issue is described, and some solutions proposed, in Ogle & Barber (2020).

Here I implemented the “post-sweeping of random effects” (solution 4 from Ogle & Barber, 2020) on the model with site and species, with hierarchical prior on site only.

Math notation:

\begin{aligned} y_{\text{obs}} &\sim \text{Normal}(\alpha_{\text{site}} + \beta_{\text{species}}, \sigma)\\ \alpha_{\text{site}} &\sim \text{Normal}(0, \sigma_{\alpha}) && \text{non-identifiable random effect (don't monitor)}\\ \alpha_{\text{site}}^\ast &= \alpha_{\text{site}} - \overline{\alpha} && \text{identifiable random effect (monitor)}\\ & \text{with } \overline{\alpha} = \frac{1}{N_{\text{site}}} \sum_{site = 1}^{N_{\text{site}}} \alpha_{site}\\ \\ \beta_{\text{species}} &\sim \text{Normal}(0, 10) && \text{non-identifiable intercepts (don't monitor)}\\ \beta_{\text{species}}^\ast &= \beta_{\text{species}} + \overline{\alpha} && \text{identifiable intercepts (monitor)}\\ \sigma_{\alpha} &\sim \text{Exponential}(1)\\ \sigma &\sim \text{Exponential}(1) \end{aligned}

Stan code:

data {
  int<lower=0> N;                // number of observations
  int<lower=0> N_site;           // number of sites
  int<lower=0> N_species;        // number of species
  array[N] int<lower=1> site;    // vector of site id
  array[N] int<lower=1> species; // vector of species id
  vector[N] y_obs;               // vector of observations
}

parameters {
  vector[N_site] alpha;
  real<lower=0> sigma_alpha;
  vector[N_species] beta;
  real<lower=0> sigma;
}

model {
  
  // adaptive prior
  target += normal_lpdf(alpha | 0, sigma_alpha); // non-identifiable "random" effect (DON'T MONITOR)
  
  // hyper-prior
  target += exponential_lpdf(sigma_alpha | 1);
  
  // priors
  target += normal_lpdf(beta | 0, 10); // non-identifiable intercept (DON'T MONITOR)
  target += exponential_lpdf(sigma | 1);
  
  // likelihood
  target += normal_lpdf(y_obs | alpha[site] + beta[species], sigma);
  
}

generated quantities {
  
  // Post-sweeping of "random" effects (Ogle & Barber 2020, solution 4)
  real alpha_bar = mean(alpha); 
  vector[N_site] alpha_star = alpha - alpha_bar; // identifiable "random" effect (MONITOR)
  vector[N_species] beta_star = beta + alpha_bar; // identifiable intercept (MONITOR)
  
  // posterior predictive distribution for replications y_rep of the original data set y given model parameters
  array[N] real y_obs_rep = normal_rng(alpha[site] + beta[species], sigma);
  array[N] real y_obs_rep_star = normal_rng(alpha_star[site] + beta_star[species], sigma);
  
  // log-lik
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y_obs[i] | alpha_star[site[i]] + beta_star[species[i]], sigma);
  }
  
}

Results:

The identifiable parameters \beta^\ast and \alpha^\ast, have better \hat{R}, higher ess, and narrower credibility intervals than the non-identifiable \beta and \alpha:

> fit$summary(variables = c("beta", "beta_star"))
# A tibble: 6 × 10
  variable        mean  median     sd    mad      q5     q95  rhat ess_bulk ess_tail
  <chr>          <num>   <num>  <num>  <num>   <num>   <num> <num>    <num>    <num>
1 beta[1]       1.58    1.58   0.127  0.126   1.37    1.78    1.02     175.     315.
2 beta[2]       0.136   0.139  0.126  0.122  -0.0722  0.345   1.02     169.     243.
3 beta[3]      -0.0926 -0.0892 0.127  0.126  -0.303   0.119   1.02     173.     264.
4 beta_star[1]  1.57    1.57   0.0313 0.0321  1.51    1.62    1.00    4335.    2671.
5 beta_star[2]  0.124   0.124  0.0310 0.0312  0.0739  0.176   1.00    4177.    2975.
6 beta_star[3] -0.104  -0.104  0.0314 0.0306 -0.156  -0.0518  1.00    4295.    2381.
> fit$summary(variables = c("alpha[1]", "alpha[18]", "alpha[47]", "alpha_star[1]", "alpha_star[18]", "alpha_star[47]"))
# A tibble: 6 × 10
  variable         mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
  <chr>           <num>  <num> <num> <num>  <num>  <num> <num>    <num>    <num>
1 alpha[1]        1.19   1.19  0.176 0.179  0.909  1.49   1.01     322.     845.
2 alpha[18]      -0.922 -0.922 0.175 0.177 -1.22  -0.637  1.01     329.     792.
3 alpha[47]      -2.73  -2.74  0.175 0.175 -3.02  -2.44   1.01     339.     923.
4 alpha_star[1]   1.20   1.20  0.126 0.126  0.994  1.41   1.00   11120.    2477.
5 alpha_star[18] -0.911 -0.909 0.126 0.128 -1.12  -0.707  1.00    9301.    2703.
6 alpha_star[47] -2.72  -2.72  0.125 0.124 -2.92  -2.51   1.00    8847.    2612.

They also show better mixing and convergence:

And identifiability! The correlations among \beta and \alpha disappear with \beta^\ast and \alpha^\ast:

mcmc_pairs_2.2cp_postsweeping_nonid

mcmc_pairs_2.2cp_postsweeping_id