SUG 1.13 ("Multivariate priors for hierarchical models"): missing prior & serious identification issue?

I discovered tonight that I’ve spent years misunderstanding the example in the
SUG 1.13 (“Multivariate priors for hierarchical models”), and now have a couple questions:

  1. The absence of a prior on sigma (observation-level noise scale) is surely an error, yes? I can’t imagine that the implied uniform(0,+Inf) could be expected to perform appropriately.

  2. I feel like there are serious identification issues in the model at present due to the combination of three structures:
    a) The data Y consist of single observations from each of N individuals, modelled as Y~normal(mu,sigma);
    b) mu is itself a vector of length N, each entry of which encodes the dot-product between the corresponding row in the N \times K contrast matrix x and the column in the K \times J coefficient matrix beta corresponding to the group of that individual.
    c) The K \times J coefficient matrix beta is modelled as a multivariate normal with the square of the scale vector tau forming the diagonal.

Thus, I don’t think that sigma and tau are actually identifiable with this structure. These quantities would be properly identified if we had multiple observations from each of the individuals (or even just one of the individuals), but that’s explicitly not how the data declarations are set up.

I actually think I long ago worked out (with help from other Stan gurus as I recall) code for a version of the model that doesn’t have this identification issue (old video walkthrough here, but the critical bit is the use of the get_coefs() function in the transformed data section to transform the data to the coefficients that are then modelled as multivariate normal in the model).

Or am I missing something (again)?

(Tagging @andrewgelman but happy to receive input/correction from anyone)

3 Likes
  1. We don’t currently recommend uniform priors. Times have changed! Back in the day, we were kind of embarrassed by using priors at all, and we used a uniform prior by default. In this example, though, it won’t make much difference, as the posterior for sigma is narrow even when the prior is uniform. I guess something like an exponential(1) prior would be a better default choice.

  2. What is identified depends on the data. If N is a lot bigger than J, then you’re basically OK. You don’t need multiple observations from individuals and groups, because mu and Sigma describe the distribution across the groups. That said, I think the notation mu_j in that document might be confusing you. In the basic model, mu_j is just mu, a shared mean vector for all the groups. More generally, mu_j could be a regression model of the form mu_j = u_j * gamma, where u_j is a fixed predictor matrix (i.e., data) and gamma is a lower-dimensional matrix. The point is that mu_j are not free parameters. Anyway, there’s no problem estimating sigma and tau.

If you’re not sure, you can always simulate fake data and check that the Stan model can recover the parameters. It could be that I got something wrong. We can mess up with math, but simulations never lie (except when they do).

3 Likes

Ah, I should have thought of that; I’ll do so to see if it helps my intuition for how the identification you describe works. Thanks for the input! I’ll report back with code/results…

1 Like

Hey @mike-lawrence, this example can be stripped down to it’s barest form and it might be clarifying. Suppose that there are no covariates, so x is a column vector of 1’s, each element of beta is just a scalar, tau is a scalar, and there is no covariance structure to worry about in the random effects. tau is the square root of the between-group variance and sigma is the square root of the within-group variance. mu is still a vector of length N, but within groups all elements of mu are identical because all rows of x are identical.

When we add covariates, it is no longer the case that all rows of x are identical and so mu now varies within groups, but this within-group variation doesn’t involve a free parameter per observation, but rather just the K free parameters encoded by beta for that group. At first look, you might worry about what happens if the number of observations per group is less than K, but remember that the random coefficients matrix contains fewer than J*K effective degrees of freedom (this is the regularization afforded by estimating the group-specific terms as random rather than fixed).

1 Like

Then I guess the basic model (in 1.13) should have \mu instead of \mu_j since the idea is to have a shared mean vector for all groups? Currently, in the basic model, \beta_j are independent of each other so there is no information sharing i.e. there is no shrinkage toward the population (conditional) mean as in the final model where \beta_j s are shrunk toward u_j\gamma.

I think that would just mean that your posterior will rely more on priors because of the lack of information in the data. As prof Gelman said, “…the concept of identification is less important in the Bayesian world than elsewhere…”. Here is a good link to a discussion on the identifiability issue in the context of Bayesian between Gelman and Ben Goodrich.

I agree the notation is ambiguous (and probably should be updated!) but in context it’s fairly clear that the basic model is a special case of the general model under the restriction that \mu_i = \mu_k for all i and k. The subscript j is introduced because it is useful later on in the more general case. Immediately below the ambiguous line, the SUG says “for now, we assume μ is a simple vector parameter.”

This likelihood is fully identified. That’s not to say it should be fit with overly vague priors in practice, but priors are not needed for identification here.

I agree with you. But at the moment someone read the first part of the document (basic model) she or he is not aware of that and may get confused about why there is no pooling in beta parameters. Each \beta_j has its own independent hyperparameter \mu_j for every j.

Yes, agree again but I was not aiming for that. Mike-lowrence said that quantities would be properly identified if we had multiple observations from each of the individuals so I just wanted to stress the fact that identification is not an issue in that situation. More or fewer observations per individual mean less or more reliance on priors.

1 Like

@mike-lawrence

I also had trouble understanding 1.13, even though I coded the model and put it into an R package. Here are some things I did to understand the model:

  1. I recoded the model using longer names rather than letters. This helped me avoid confusion around variables such as Ns and Ks.
  2. I have simulated fake data, but do not trust my simulation process, so I used known data. Specifically, the radon data from Gelman and Hill that is in the chapter 13. The data is also in Datasets for rstanarm examples — rstanarm-datasets • rstanarm (mc-stan.org)
  3. I ended up working through the model line-by-line with with lme4::lmer()and posted the solution in my answer to my own question: Understanding output from 1.13 Multivariate priors for hierarchical models example - Modeling - The Stan Forums (mc-stan.org)
  4. Realize that this is a regression on regression coefficients. This fact took me weeks to understand the first time and wrap my head around.
  5. Think about how matrices such as gamma and beta map from data to parameter or parameters to other parameters. I’ve had to think about linear algebra a lot to figure this out.

My sticking point was understanding the Cholesky factor and how to back transform, hence my question.

@Bob_Carpenter and the Stan Core team: Would the Stan team please consider show how to get Omega back in in the 1.13 tutorial. You include it for the Gaussian Process tutorial, but I found this but only through a Google search of the manual.

@andrewgelman Please consider using a side-by-side comparison between lme4::lmear() and the 1.16 example for Applied Regression and Multilevel Models See my linked, side by side example here: Understanding output from 1.13 Multivariate priors for hierarchical models example - Modeling - The Stan Forums (mc-stan.org)

1 Like

Thanks @Richard_Erickson. I’m trying to wrap my head around this “regression on regression coefficients” thing; when I see an lmer formula such as:

M3 <- lmer(log_radon  ~ floor + (1 + floor | jj), radon_2)

My head goes to the framework here. I see that in the radon data, floor is a two-level nominal variable but numerically-coded as a 0/1 indicator, and I assume jj encodes a nominal variable where each level is observed multiple times, usually multiple times in each level of floor. In that case, the lmer formula implies a model where each level of jj is associated with an intercept (radon levels when floor==0) and an “effect of floor” (radon levels when floor==1). With two variates and the “full” model implied by the formula, there are two across-levels-of-jj variances and one correlation, as well as the across-levels-of-jj means for the intercept and slope.

My usual approach to coding this model is here, and now that I am typing this I think the distinction from the SUG 1.13 (other than the performance optimizations) is that the SUG 1.13 uses the multivariate normal as a prior on the across-levels-of-jj means then models the by-jj intercepts and slopes as uncorrelated, whereas my version (and that of lmer4 so far as I’ve always understood) encodes independent normal priors on the means and uses the multivariate normal to achieve inference on the by-jj intercepts and slopes. Does that sound right?

@mike-lawrence It’s hard to think about this and there is no single, best, method to write mixed-effect models.

I am not 100% sure, but I think you are correct.

If you have it, check out Section 12.5 of Gelman and Hill. It is titled Five ways to write the same model.

If you don’t have the book, either find it cheaply used, or wait until the next edition comes out. It should be sometime this year based upon Home page for the book, “Applied Regression and Multilevel Models” (columbia.edu).

1 Like

I don’t think this is quite it. The SUG model is conceptually identical to your model except that the intercepts and slopes (which are sampled from a multivariate normal) are actually sampled from a multivariate normal whose mean is modeled as a function of group-level predictors. If we replace
u_gamma[j] = u[j] * gamma; with some more familiar prior on u_gamma, then unless I’m missing something we would have exactly your model, no?

Well, now this is embarassing. I just worked out a series of forgettings on my part that led to my wasting of folks’ time here. First, in the recent entries here I forgot that in the post that kicked off the thread I’d worked out that my code and that of SUG 1.13 were identical (as @jsocolar poipnts out), and it’s the data declaration that I thought implied a difference/identifiability issue. Furthermore, this latter was itself forgetting that SUG 1.13 uses data/terminology that’s different-but-deceptively-similar to my typical data/terminology.

Specifically, it speaks of individuals associated with individual-level predictors, and individuals are lumped together in groups that are in turn associated with group-level predictors. Importantly, there’s only one observation of each individual.

 I Ip1 Ip2 ...  G Gp1 Gp2 ...
i1   +   - ... g1   +   - ... 
i2   +   - ... g1   +   - ... // i1 & i2 share the same Ip's
i3   -   + ... g1   +   - ... 
i4   -   + ... g1   +   - ... // i1:i4 share the same Gp's; i3 & i4 share the same Ip's
i5   +   - ... g2   -   + ... 
i6   +   - ... g2   -   + ... // i5 & i6 share the same Ip's
i7   -   + ... g2   -   + ... 
i8   -   + ... g2   -   + ... // i5:i8 share the same Gp's; i7 & i8 share the same Ip's

Whereas the way I typically have data structured, I still speak of individuals and groups, but of within-individual predictors and between-group predictors. This works out to meaning what SUG 1.13 calls a group, I call an individual, and what SUG 1.13 calls an individual, I call a trial:

 t Ip1 Ip2 ...  I Gp1 Gp2 ...
t1   +   - ... i1   +   - ... 
t2   +   - ... i1   +   - ... // t1 & t2 share the same Ip's
t3   -   + ... i1   +   - ... 
t4   -   + ... i1   +   - ... // t1:t4 share the same Gp's; t3 & t4 share the same Ip's
t5   +   - ... i2   -   + ... 
t6   +   - ... i2   -   + ... // t5 & t6 share the same Ip's
t7   -   + ... i2   -   + ... 
t8   -   + ... i2   -   + ... // t5:t8 share the same Gp's; t7 & t8 share the same Ip's

Using my terminology, if there were only one observation per combination of individual and within-individual predictors, then there would be identifiability issues as I was concerned about at the outset of this thread and the alternative multivariate I posted would be more appropriate. But I see now that this isn’t an issue for the data as discussed in SUG 1.13 and I was just forgetting the terminology-mapping 🤦

1 Like

@mike-lawrence

Here is an example of how I think about the two-level regression. This is different from how you think about the regression, but is how I think about it.

For the example, I simulate mean fish lengths using an intercept for each location. Thus, each location has multiple fish species and we compare their mean lengths across sites. For sake of completeness with the simulation, assume lengths are on a log-10 scale so they are normal.

My post is a code dump, rather than a well written explanation, but hopefully helps you to see how I think about the problem.

library(fishStan)
library(tidyverse)
library(GGally)
library(lme4)
library(MASS)
options(mc.cores = parallel::detectCores())


species_list <- sort(c("red", "blue", "green", "gold"))
spp_hyper_mean <- c(1, 2, 4, 5)
n_spp <- length(species_list)
n_unit <- 50 # Number of units/sites/locations/places/etc. sampled
ind_per_spp_unit <- 15 # individual replicates per unit

unit_mean_table <-
  tibble(unit_mean = rnorm(n_unit, mean = 0, sd = 1)) %>%
  rowid_to_column("unit")

set.seed(1234)
cov_mat_raw <- 
  matrix(c(0.25,    0,    0,       0,
             0,  0.25,  -0.5,   -0.5, 
             0,  -0.5,   0.25,   0.01,
             0,  -0.5,   0.01,   0.25),
             nrow = n_spp, ncol = n_spp)


cov_mat <-
  3*t(cov_mat_raw) %*% cov_mat_raw
cov_mat
simulated_delta <- 
  mvrnorm(n_unit, mu = spp_hyper_mean, cov_mat)
cor(simulated_delta)
ggpairs(as.data.frame(simulated_delta))

colnames(simulated_delta) <- species_list

simulated_delta_long <-
  simulated_delta %>%
  as_tibble() %>%
  rowid_to_column("unit") %>%
  pivot_longer(., -unit,
               names_to = "species",
               values_to = "species_site_mean") %>%
  full_join(unit_mean_table, by = "unit")

simulated_delta_long

sample_table <-
  expand_grid(
    unit = seq_len(n_unit),
    individual = seq_len(ind_per_spp_unit)
    )

simulated_data <-
  sample_table %>%
  full_join(simulated_delta_long, by = "unit")  %>%
  mutate(global_error = rnorm(n = n(),
                              mean = 0,
                              sd = 0.5),
         y_no_unit = species_site_mean + global_error,
         y_unit = y_no_unit + unit_mean,
         unit_id = factor(unit))

unit_data <-
  simulated_data %>%
  group_by(unit, unit_id) %>%
  summarize(n = n(), .groups = "keep")

sim_fit <-
  h_lm_stan(
    y = simulated_data %>% pull(y_no_unit),
    jj = simulated_data %>% pull(unit),
    ind_dat = simulated_data,
    ind_formula = ~ species - 1,
    group_dat = unit_data,
    group_formula = ~ 1,
    n_proj = 30,
    x_projection = NULL,
    gamma_prior = NULL,
    gamma_sd_prior = 1,
    iter = 1600
  )

## Compare to lmer
lmer_out <- lmer(y_no_unit  ~ species - 1 + ((species - 1)| unit), simulated_data) 
summary(lmer_out)
cor(simulated_delta)

fixef(lmer_out)
print(sim_fit$model_par, pars = c("gamma"), probs = 0.5)

## "random-effects"
print(sim_fit$model_par,
      pars = c("beta[1,1]", "beta[1,2]", "beta[1,3]", "beta[1,4]",
               "beta[2,1]", "beta[2,2]", "beta[2,3]", "beta[2,4]"),
      probs = 0.5)

head(coef(lmer_out)$unit, 2)
## examine sigma
print(sim_fit$model_par, pars = c("sigma"), prob = 0.5)
sigma(lmer_out)

## examine correlation structure of model results
L_Omega_est_raw <-
  summary(sim_fit$model_par, pars = c("L_Omega"), prob = 0.5)$summary[, 1]
  
L_Omega_est <-
  matrix(L_Omega_est_raw, nrow = 4, ncol = 4, byrow = TRUE)

colnames(L_Omega_est) <- rownames(L_Omega_est) <- species_list
print(sim_fit$model_par, pars = c("tau"))
VarCorr(lmer_out)

cor(simulated_delta)
L_Omega_est %*% t(L_Omega_est)
VarCorr(lmer_out)

@mike-lawrence I discovered this paper yesterday that talks about mixed effect models and might also help you understand more if you want/need some more help Nested by design: model fitting and interpretation in a mixed model era - Schielzeth - 2013 - Methods in Ecology and Evolution - Wiley Online Library

1 Like