Proper use of sum_to_zero_vector in nested multilevel models

In nested multilevel models we model variability between experimental units within, rather than across, the groups of interest, hence "nested". In the simplest case, one group is known a priori to have much lower between-unit variability than another, and estimating a single shared variability across both groups would inflate the estimate for the first group.

Usually I would use a parameter matrix for simplicity, but an array of parameter vectors seems to be the only appropriate container for sum_to_zero_vector, as mentioned here. I am looking for a simple syntactic guideline to follow for such multilevel models.

Here is my reprex in R and Stan. First let’s generate some data according to my description above.

require(tidyverse)
set.seed(101)
data <- tibble(
  group = 1:2 %>% rep(each = 15 * 5) %>% str_c() %>% fct(),
  unit = 1:15 %>% rep(each = 5) %>% rep(2) %>% str_c() %>% fct()
  ) %>%
  mutate(
    value_mean = if_else(group == 1,
                         0.1, 2),
    value_sd = if_else(group == 1,
                       0.05, 0.5)
  ) %>%
  group_by(unit) %>%
  mutate(
    value_dev = if_else(group == 1,
                        rnorm( 1 , 0 , 0.02 ),
                        rnorm( 1 , 0 , 0.2 ))
  ) %>%
  ungroup() %>%
  mutate(
    value = rgamma( n() , 
                    (value_mean + value_dev)^2 / value_sd^2 ,
                    (value_mean + value_dev) / value_sd^2 )
  ) %>%
  select(group, unit, value)

Here’s what they look like.

data %>%
  ggplot(aes(unit, value)) +
    geom_point(shape = 16, size = 3, alpha = 0.2) +
    facet_grid(~ group) +
    theme_minimal()


Now let’s look at the centred (stan_c) and non-centred (stan_nc) parameterisations without sum_to_zero_vector. alpha_u, the intercept term capturing variation across unit within group, is a matrix in both cases.

stan_c <- "
    data{ 
      int n;
      vector[n] value;
      array[n] int group;
      int n_group;
      array[n] int unit;
      int n_unit;
    }

    parameters{
      // Intercepts in log space
      vector[n_group] alpha; 
      matrix[n_group, n_unit] alpha_u;
      
      // Inter-unit variability
      vector<lower=0>[n_group] sigma_u;
      
      // Likelihood uncertainty
      vector<lower=0>[n_group] sigma; 
    }

    model{
      // Hyperpriors
      sigma_u ~ exponential( 1 );
      
      // Priors
      alpha ~ normal( log(1) , 0.2 );
      for (i in 1:n_group) {
        alpha_u[i] ~ normal( 0 , sigma_u[i] );
      }
      sigma ~ exponential( 1 );
      
      // Model with link function
      vector[n] mu;
      for ( i in 1:n ) {
        mu[i] = exp( alpha[ group[i] ] + 
                     alpha_u[ group[i] , unit[i] ] );
      }

      // Gamma likelihood
      value ~ gamma( square( mu ) ./ square( sigma[group] ) ,
                     mu ./ square( sigma[group] ) );
    }
"

stan_nc <- "
    data{ 
      int n;
      vector[n] value;
      array[n] int group;
      int n_group;
      array[n] int unit;
      int n_unit;
    }

    parameters{
      // Intercepts in log space
      vector[n_group] alpha; 
      matrix[n_group, n_unit] z_u; // z-score
      
      // Inter-unit variability
      vector<lower=0>[n_group] sigma_u;
      
      // Likelihood uncertainty
      vector<lower=0>[n_group] sigma; 
    }

    model{
      // Hyperpriors
      sigma_u ~ exponential( 1 );
      
      // Priors
      alpha ~ normal( log(1) , 0.2 );
      to_vector(z_u) ~ normal( 0 , 1 );
      sigma ~ exponential( 1 );
      
      // Convert z_u to alpha_u
      matrix[n_group, n_unit] alpha_u;
      for (i in 1:n_group) {
        for (j in 1:n_unit) {
          alpha_u[i,j] = z_u[i,j] * sigma_u[i] + 0;
        }
      }
      
      // Model with link function
      vector[n] mu;
      for ( i in 1:n ) {
        mu[i] = exp( alpha[ group[i] ] + 
                     alpha_u[ group[i] , unit[i] ] );
      }

      // Gamma likelihood
      value ~ gamma( square( mu ) ./ square( sigma[group] ) ,
                     mu ./ square( sigma[group] ) );
    }
    
    generated quantities{
      // Save alpha_u
      matrix[n_group, n_unit] alpha_u;
      for (i in 1:n_group) {
        for (j in 1:n_unit) {
          alpha_u[i,j] = z_u[i,j] * sigma_u[i] + 0;
        }
      }
    }
"

model_c <- stan_c %>%
  write_stan_file() %>%
  cmdstan_model()

model_nc <- stan_nc %>%
  write_stan_file() %>%
  cmdstan_model()

require(tidybayes)
samples_c <- model_c$sample(
          data = data %>% compose_data(),
          chains = 8,
          parallel_chains = parallel::detectCores(),
          iter_warmup = 1e4,
          iter_sampling = 1e4,
        )

samples_nc <- model_nc$sample(
          data = data %>% compose_data(),
          chains = 8,
          parallel_chains = parallel::detectCores(),
          iter_warmup = 1e4,
          iter_sampling = 1e4,
        )

Let’s compare Rhat.

samples_c$summary() %>%
  left_join(samples_nc$summary(),
            by = "variable") %>%
  rename(rhat_c = rhat.x, rhat_nc = rhat.y) %>%
  ggplot(aes(rhat_c, rhat_nc)) +
    geom_abline(slope = 1) +
    geom_point() +
    theme_minimal()


Both models converged well but the non-centred model is slightly better.

But when we look at the pairs plot we see that alpha_u sits far from zero: it absorbs a substantial part of the effect of alpha, deflating alpha while inflating sigma_u, whose prior expects alpha_u to be centred on zero. In other words, alpha and alpha_u are not identified. The results are pretty much identical for both parameterisations, so here's the pairs plot for the non-centred model.

samples_nc$draws(format = "df") %>%
  mcmc_pairs(pars = c("alpha[1]", "sigma[1]",
                      "alpha_u[1,1]", "sigma_u[1]"))


Enter sum_to_zero_vector. Now we force the mean effect of alpha_u to be zero. stan_c_stz and stan_nc_stz are versions of the previous models where alpha_u is a pair of sum_to_zero_vectors. Here I am unsure of my notation: array[n_group] sum_to_zero_vector[n_unit] instead of matrix[n_group, n_unit].

stan_c_stz <- "
    data{ 
      int n;
      vector[n] value;
      array[n] int group;
      int n_group;
      array[n] int unit;
      int n_unit;
    }

    parameters{
      // Intercepts in log space
      vector[n_group] alpha; 
      array[n_group] sum_to_zero_vector[n_unit] alpha_u;
      
      // Inter-unit variability
      vector<lower=0>[n_group] sigma_u;
      
      // Likelihood uncertainty
      vector<lower=0>[n_group] sigma; 
    }

    model{
      // Hyperpriors
      sigma_u ~ exponential( 1 );
      
      // Priors
      alpha ~ normal( log(1) , 0.2 );
      for (i in 1:n_group) { // Note correction of sigma_u
        alpha_u[i] ~ normal( 0 , sigma_u[i] * sqrt( n_unit * inv( n_unit - 1 ) ) );
      }
      sigma ~ exponential( 1 );
      
      // Model with link function
      vector[n] mu;
      for ( i in 1:n ) {
        mu[i] = exp( alpha[ group[i] ] + 
                     alpha_u[ group[i] ][ unit[i] ] );
      }

      // Gamma likelihood
      value ~ gamma( square( mu ) ./ square( sigma[group] ) ,
                     mu ./ square( sigma[group] ) );
    }
"

stan_nc_stz <- "
    data{ 
      int n;
      vector[n] value;
      array[n] int group;
      int n_group;
      array[n] int unit;
      int n_unit;
    }

    parameters{
      // Intercepts in log space
      vector[n_group] alpha; 
      array[n_group] sum_to_zero_vector[n_unit] z_u; // z-score
      
      // Inter-unit variability
      vector<lower=0>[n_group] sigma_u;
      
      // Likelihood uncertainty
      vector<lower=0>[n_group] sigma; 
    }

    model{
      // Hyperpriors
      sigma_u ~ exponential( 1 );
      
      // Priors
      alpha ~ normal( log(1) , 0.2 );
      for (i in 1:n_group) {
        z_u[i] ~ normal( 0 , 1 );
      }
      sigma ~ exponential( 1 );
      
      // Convert z_u to alpha_u
      matrix[n_group, n_unit] alpha_u;
      for (i in 1:n_group) {
        for (j in 1:n_unit) { // Note correction of z_u
          alpha_u[i,j] = z_u[i][j] * sqrt( n_unit * inv( n_unit - 1 ) ) * sigma_u[i] + 0;
        }
      }
      
      // Model with link function
      vector[n] mu;
      for ( i in 1:n ) {
        mu[i] = exp( alpha[ group[i] ] + 
                     alpha_u[ group[i] , unit[i] ] );
      }

      // Gamma likelihood
      value ~ gamma( square( mu ) ./ square( sigma[group] ) ,
                     mu ./ square( sigma[group] ) );
    }
    
    generated quantities{
      // Save alpha_u
      matrix[n_group, n_unit] alpha_u;
      for (i in 1:n_group) {
        for (j in 1:n_unit) { // Note correction of z_u
          alpha_u[i,j] = z_u[i][j] * sqrt( n_unit * inv( n_unit - 1 ) ) * sigma_u[i] + 0;
        }
      }
    }
"

model_c_stz <- stan_c_stz %>%
  write_stan_file() %>%
  cmdstan_model()

model_nc_stz <- stan_nc_stz %>%
  write_stan_file() %>%
  cmdstan_model()

samples_c_stz <- model_c_stz$sample(
          data = data %>% compose_data(),
          chains = 8,
          parallel_chains = parallel::detectCores(),
          iter_warmup = 1e4,
          iter_sampling = 1e4,
        )

samples_nc_stz <- model_nc_stz$sample(
          data = data %>% compose_data(),
          chains = 8,
          parallel_chains = parallel::detectCores(),
          iter_warmup = 1e4,
          iter_sampling = 1e4,
        )

The centred model samples far less efficiently and the non-centred model samples like a breeze. Let’s compare Rhat again.

samples_c_stz$summary() %>%
  left_join(samples_nc_stz$summary(),
            by = "variable") %>%
  rename(rhat_c_stz = rhat.x, rhat_nc_stz = rhat.y) %>%
  ggplot(aes(rhat_c_stz, rhat_nc_stz)) +
    geom_abline(slope = 1) +
    geom_point() +
    theme_minimal()


The centred model’s performance has degraded dramatically, so that the non-centred model is now much better. The most likely explanation is a syntactic error on my part. Is array[n_group] sum_to_zero_vector[n_unit] supported? Am I doing something else wrong?

The pairs plot for the centred model confirms the sampling issues. But in the non-centred model sum_to_zero_vector seems to have mostly resolved the non-identifiability between alpha and alpha_u, albeit at the cost of some correlation between alpha and sigma, because this is a gamma likelihood and the mean and sd aren't independent.

samples_nc_stz$draws(format = "df") %>%
  mcmc_pairs(pars = c("alpha[1]", "sigma[1]",
                      "alpha_u[1,1]", "sigma_u[1]"))


It helps to look at the predictions of all models over the data.

samples_c %>%
  recover_types(data %>% select(group, unit)) %>%
  spread_draws(alpha[group], sigma_u[group], sigma[group]) %>%
  ungroup() %>%
  mutate(mu = exp( alpha + rnorm( n() , 0 , sigma_u ) ),
         obs = rgamma( n() , mu^2 / sigma^2 , mu / sigma^2 ),
         model = "c" %>% fct()) %>%
  select(group, obs, model) %>%
  bind_rows(
    samples_nc %>%
      recover_types(data %>% select(group, unit)) %>%
      spread_draws(alpha[group], sigma_u[group], sigma[group]) %>%
      ungroup() %>%
      mutate(mu = exp( alpha + rnorm( n() , 0 , sigma_u ) ),
             obs = rgamma( n() , mu^2 / sigma^2 , mu / sigma^2 ),
             model = "nc" %>% fct()) %>%
      select(group, obs, model),
    samples_c_stz %>%
      recover_types(data %>% select(group, unit)) %>%
      spread_draws(alpha[group], sigma_u[group], sigma[group]) %>%
      ungroup() %>%
      mutate(mu = exp( alpha + rnorm( n() , 0 , sigma_u ) ),
             obs = rgamma( n() , mu^2 / sigma^2 , mu / sigma^2 ),
             model = "c_stz" %>% fct()) %>%
      select(group, obs, model),
    samples_nc_stz %>%
      recover_types(data %>% select(group, unit)) %>%
      spread_draws(alpha[group], sigma_u[group], sigma[group]) %>%
      ungroup() %>%
      mutate(mu = exp( alpha + rnorm( n() , 0 , sigma_u ) ),
             obs = rgamma( n() , mu^2 / sigma^2 , mu / sigma^2 ),
             model = "nc_stz" %>% fct()) %>%
      select(group, obs, model)
  ) %>%
  ggplot() +
    geom_point(data = data, aes(value, group),
               shape = 16, size = 3, alpha = 0.2) +
    ggridges::geom_density_ridges(aes(obs, group, colour = model),
                                  from = 0, to = 4, fill = NA, scale = 0.8) +
    # geom_density(data = tibble(value = rnorm( 1e5 , log(1) , 0.2 ) %>% exp()),
    #              aes(value), colour = "grey") + # prior on alpha
    scale_x_continuous(limits = c(0, 4), oob = scales::oob_keep) +
    scale_colour_manual(values = c("orange", "goldenrod", "lightblue", "purple")) +
    theme_minimal() +
    theme(legend.position = "top",
          legend.justification = 0)


In each case the centred and non-centred parameterisations actually produce comparable predictions, but the sum_to_zero_vector models clearly outperform the previous ones for group == 1.

I think I’m on the right track with using sum_to_zero_vector because my stz models clearly yield better predictions given the same priors, but I would appreciate it if someone could validate my syntax. Or is there an easy solution that does not involve sum_to_zero_vector? Why does the c_stz model have such convergence issues? I would appreciate any other advice on such hierarchical models.

I am taking the liberty to tag people I know to be involved with sum_to_zero_vector: @mitzimorris @Bob_Carpenter @spinkney @martinmodrak

The two terms of art are “hierarchical” for when things are properly nested (e.g., students in classes in schools in school districts in states) and “multilevel” when they’re not (e.g., age group effects, ethnicity effects, state of residence effects, etc. in a polling model). The notion of a multilevel model includes hierarchical models as a subclass and they all work the same way. The key is identifying them, hence the sum-to-zero you found.

That’s how you have to code multiple sum-to-zero vectors for now. We also don’t have ragged arrays or matrices, so it’s challenging to do any kind of multilevel modeling with a matrix without blowing up the matrix, as, for example, lme4 does.

Yes. You should be able to verify that the elements of the array sum to zero in the output.
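
For example, something like the following generated quantities block (using the names from your models; alpha_u_sum is just a name for illustration, and for the nc_stz version the loop would go after alpha_u is computed in its existing generated quantities block) records the per-group sums, which should be zero up to floating-point error in every draw:

generated quantities {
  vector[n_group] alpha_u_sum; // per-group sum of unit effects; should be ~0
  for (i in 1:n_group) {
    alpha_u_sum[i] = sum(alpha_u[i]);
  }
}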

In my experience from reading these forums, gamma regressions can be challenging to fit.

That’s a good safety check. It also highlights the point that you don’t need really great sampling for pretty good inference.

The traditional approach is to pin one of the values to zero. That also identifies the model. The main trick is to make sure you only introduce a single intercept. Sum-to-zero enforces that because the values can’t all shift without breaking the sum-to-zero constraint.
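
In your notation that would look something like this (a sketch only; alpha_u_free is a made-up name, and the normal priors would then go on alpha_u_free rather than alpha_u):

parameters {
  // n_unit - 1 free effects per group; unit 1 is the reference level
  matrix[n_group, n_unit - 1] alpha_u_free;
  // ... other parameters as before
}
transformed parameters {
  matrix[n_group, n_unit] alpha_u;
  for (i in 1:n_group) {
    alpha_u[i, 1] = 0;                       // pinned to zero
    alpha_u[i, 2:n_unit] = alpha_u_free[i];  // remaining units unconstrained
  }
}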

This is splitting hairs. Inference will be fine for even much larger R-hat values. We don’t usually recommend running so long that you get these kinds of R-hat values, as the real uncertainty in your posterior is the posterior variance, not the standard error in estimating expectations.

You’re on the right track looking at pairs plots of the posterior and at what happens to the step size during adaptation. It pretty much always comes down to varying geometry causing a problem. Centered vs. non-centered efficiency can vary based on how informative the data and prior are together: as the data and prior get stronger, centered starts to work better.

The following code is problematic if both alpha and alpha_u are set to sum to zero because it doesn’t provide an intercept.

for (i in 1:n) {
  mu[i] = exp(alpha[group[i]] + alpha_u[group[i], unit[i]]);
}
value ~ gamma(square(mu) ./ square(sigma[group]),
              mu ./ square(sigma[group]));

You’re also going to want to make sure that the exponentiation doesn’t overflow. I would recommend introducing a new parameter beta with an appropriately scaled prior, keeping both alpha and alpha_u centered by group, and then using

mu[i] = exp(beta + alpha[group[i]] + alpha_u[group[i], unit[i]]);

I recommend this because as models get more complicated, it’s easiest to just keep all the varying effects zero-centered with a single intercept. Otherwise, one of the varying effects has to be allowed to be non-centered to account for the intercept. But it’s almost always more efficient to do it with a single parameter.
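
Concretely, something like this (parameters and model blocks only, reusing your data block; the prior scales are just placeholders, not recommendations):

parameters {
  real beta;                                            // single global intercept (log scale)
  sum_to_zero_vector[n_group] alpha;                    // group-level deviations
  array[n_group] sum_to_zero_vector[n_unit] alpha_u;    // unit-within-group deviations
  vector<lower=0>[n_group] sigma_u;
  vector<lower=0>[n_group] sigma;
}
model {
  beta ~ normal( log(1) , 0.2 );
  alpha ~ normal( 0 , 0.2 * sqrt( n_group * inv( n_group - 1 ) ) ); // placeholder scale
  for (i in 1:n_group) {
    alpha_u[i] ~ normal( 0 , sigma_u[i] * sqrt( n_unit * inv( n_unit - 1 ) ) );
  }
  sigma_u ~ exponential( 1 );
  sigma ~ exponential( 1 );

  vector[n] mu;
  for (i in 1:n) {
    mu[i] = exp( beta + alpha[ group[i] ] + alpha_u[ group[i] ][ unit[i] ] );
  }

  value ~ gamma( square( mu ) ./ square( sigma[group] ) ,
                 mu ./ square( sigma[group] ) );
}

With beta carrying the overall level, both alpha and alpha_u can stay zero-centered without fighting over the intercept.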

P.S. I’d strongly suggest storing Stan programs in their own files. You get syntax highlighting while editing, the line numbers in compiler messages make sense, and you can use either double quotes for printing or apostrophes for transposition without worrying about which quote symbol delimits the R string. It also saves on indentation.


Thanks for the thorough reply @Bob_Carpenter!

I did not know this. I always assumed the terms to be interchangeable. There seems to be some confusion on this, e.g.

This results in a model with multiple levels of uncertainty, each feeding into the next—a multilevel model. Multilevel models—also known as hierarchical, random effects, varying effects, or mixed effects models… (Richard McElreath 2019 Statistical rethinking: a Bayesian course with examples in R and Stan).

Thanks for confirming I’m on the right track with array[] sum_to_zero_vector[].

I usually don’t have issues with the mean-sd parameterisation of the gamma likelihood and I know mean-scale is also used successfully. What may be the issue here is that I am trying to model heteroscedasticity even though mu and sigma are not independent in the gamma likelihood. Not stratifying sigma by group would definitely simplify the model. Should modelling heteroscedasticity generally be avoided in generalised linear models? The obvious cases are Poisson and Bernoulli where there is only one parameter. Any thoughts?

Yes I am guilty of that. I usually look for R-hat < 1.001 with 8*1e4 iterations. Would you say R-hat is less important to look out for than E-BFMI? In my experience E-BFMI is at risk of being low when I try to model variance separately from the mean in the gamma likelihood.

But given that the data and priors stayed the same, I’m afraid I still don’t understand why the centred parameterisation went from reasonable to terrible convergence when I introduced sum_to_zero_vector.

I see. I simply chose the normal( log(1) , 0.2 ) prior for alpha based on the fact that on the response scale the midpoint between the two group means (0.1 vs. 2) is approximately 1, hence log(1). I did not consider that since log(1) = 0 this implies two zero-centred priors.

I typically don’t include a global intercept by default, since it is an additional parameter that often holds no intrinsic value (e.g. in my reprex an average across groups may be meaningless), and it is sometimes harder to think of a sensible prior for deviations from a global intercept than for the intercept itself. Following your suggestion the priors would become beta ~ normal( log(1) , 0.2 ) and alpha ~ normal( 0 , 0.2 ) etc., or similar, so this doesn’t really solve the issue: now there are three parameters centred on zero instead of two, with one summing to zero. Am I missing something?

Why is a non-stratified global intercept plus deviations more efficient than a stratified intercept, despite the need to estimate an additional parameter?

Is there an easy way to do this in RStudio? I noticed that there is a Stan File option in the New File dropdown, but I thought it was only to be used with rstan. I would be grateful for an easy solution for cmdstanr in RStudio. It would definitely make my R code more readable.

You can use the same files with cmdstanr. Any file ending in .stan can be used with any interface. You can create the file with RStudio or just put your Stan code in a text file and give it a .stan extension yourself.

The only issue is that the Stan parser used by RStudio to check syntax uses rstan, which is several versions of Stan behind cmdstanr and so may not recognize some newer additions to the language. But it’s certainly fine to create and edit the file using RStudio and then give it to cmdstanr, even if you’re using something in the language not supported by rstan yet (I do this all the time). Trying to compile the model with cmdstanr will then check the syntax without using rstan. Or you could also use the check_syntax method if you prefer to create the model object without compiling it.


Not related to the original post, but this info about RStudio using rstan to parse Stan files explains some odd things I’ve noticed for a while, like std_normal_lpdf() not getting highlighted “appropriately”. I am glad to know this now.

Is this factoid publicly documented anywhere? If not, should it be?


Hmm, I guess it’s not documented. It’s just that RStudio has always used rstan, since way before cmdstanr existed, but you’re right that we should put this information somewhere. Where do you think would be a helpful place to document this? Maybe in the main vignette for cmdstanr? Somewhere else?

If I recall correctly I think at one point either @mike-lawrence or @ssp3nc3r (I can’t remember which, sorry!) wrote a script that overrides it and allows it to use cmdstanr. Does that ring a bell, Mike and/or Scott?


I think this is what you’re referring to: Rstudio Stan syntax checker is outdated


You can also install rstan from the develop branch for reasonably up-to-date syntax highlighting.