Mean and variance of Bayesian analysis of multivariate linear mixed models

I came across a few publications about Bayesian analysis of LMMs and got confused by the formulas.

Generally, from my previous knowledge, an LMM is Y=X\beta+Zu+e with two variance matrices, Var(u)=G and Var(e)=R.

Then the expectation and variance of Y are E[Y] = X\beta and Var[Y] = ZGZ^\top+R. I believe this is the case when \hat{\beta} and \hat{u}, also known as the BLUE and BLUP, are estimated by REML approaches. In Bayesian analysis, I found a paper here and another paper that use the same expression.

However, @paul.buerkner, in his brms paper, points out that E[Y] = X\beta + Zu, because

we want to make explicit that u is a model parameter in the same manner as \beta so that uncertainty in its estimates can be naturally evaluated. In fact, this is an important advantage of Bayesian MCMC methods as compared to maximum likelihood approaches, which do not treat u as a parameter, but assume that it is part of the error term instead.

Then Var[Y]=R. The same appears in this paper, which implements an LMM in Stan.

I am confused about this issue.

  • If the first expression is correct, then the quadratic form in the exponent of the likelihood is (Y-X\beta)^\top(ZGZ^\top+R)^{-1}(Y-X\beta);
  • If the second expression is correct, it becomes (Y-X\beta-Zu)^\top R^{-1}(Y-X\beta-Zu).
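
In other words, the two quadratic forms are the exponents of the two Gaussian densities (my rewriting of the two cases, just to make the comparison concrete):

N(Y \mid X\beta,\ ZGZ^\top+R) \propto \exp\{-\tfrac{1}{2}(Y-X\beta)^\top(ZGZ^\top+R)^{-1}(Y-X\beta)\}

and

N(Y \mid X\beta+Zu,\ R) \propto \exp\{-\tfrac{1}{2}(Y-X\beta-Zu)^\top R^{-1}(Y-X\beta-Zu)\}.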

Which one is correct, or are they both correct in Bayesian analysis?

Thank you for reading this post.

I wanted to write out the expressions from the bottom up, but I realized I really don’t have the time, so I’m going for a heuristic/intuitive reply instead; sorry if it’s not formal or precise in every aspect.

I believe the decision to write E[Y] = X\beta + Zu is really one of emphasis, since E[u]=0. The estimators you describe take into account the (co)variances R and G of the error terms \varepsilon and u, but omit their uncertainties. Since looking at those uncertainties is something inherently Bayesian, this notation emphasizes the outcomes for those parameters instead of, in a sense, integrating them out. Maybe that’s what @paul.buerkner meant by that statement, and of course he can convey his interpretation better than I could. (I’d possibly go further and look at the uncertainty in \varepsilon as well.)

The main point is that what you call the terms doesn’t change the likelihood, so you must get the exact same result if you have the same model. I think of an LMM like this: in addition to the predictor X\beta, the ZGZ^\top term arises because the noise u is structured across the groups specified in Z, while \varepsilon applies to all observations indistinctly. If you write that out for each data point as y_i = \mathbf{x}_i^\top\beta + \mathbf{z}_i^\top u + \varepsilon_i, you will arrive at the same likelihood, regardless of whether you are thinking of this as a least squares, maximum likelihood, or Bayesian estimation problem.
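
As a rough numerical illustration of the “same model” point (a minimal sketch I’m adding here, base R only, with made-up dimensions and diagonal G and R for simplicity), simulating y by drawing u and \varepsilon separately gives an empirical mean and covariance that match X\beta and ZGZ^\top + R:

set.seed(1)
n <- 200; k <- 4                                 # observations and groups
X <- cbind(1, rnorm(n))                          # fixed-effect design
beta <- c(1, 2)
group <- sample(1:k, n, replace = TRUE)
Z <- model.matrix(~ 0 + factor(group))           # group-membership (indicator) matrix
G <- diag(0.5, k)                                # Var(u)
R <- diag(0.1, n)                                # Var(eps)

# Simulate many replicate data sets from y_i = x_i' beta + u_{group(i)} + eps_i
nrep <- 5000
ysims <- replicate(nrep, {
  u <- rnorm(k, 0, sqrt(diag(G)))                # draw group effects (G is diagonal here)
  as.vector(X %*% beta + Z %*% u + rnorm(n, 0, sqrt(diag(R))))
})

# Empirical moments across replicates vs. the marginal formulas
max(abs(rowMeans(ysims) - X %*% beta))           # close to 0 (Monte Carlo error only)
max(abs(cov(t(ysims)) - (Z %*% G %*% t(Z) + R))) # small; shrinks as nrep grows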

Personally, something else Bayesian inference has helped me with is explicitly describing and understanding what I am modeling before setting up the entire inference problem; the matrix representations will simply follow from that, as opposed to trying to start from the high-level matrix form, which may not make sense (or may seem like it doesn’t).

Like I said, I wanted to go through the matrix expressions you show from that perspective, and try and answer your question more directly, but I hope this helps.

2 Likes

Dear @caesoma, thanks for replying to the post and addressing my questions. I ran a simulation in rstan for both expressions, say

  • A: E[Y]=X\beta+Zu, Var[Y]=R,
  • B: E[Y]=X\beta, Var[Y]=ZGZ^\top+R.

and got similar results.
One of the problems is that model B is misspecified, as shown by LOO.
LMM_A.stan (766 Bytes)
LMM_B.stan (1.1 KB)
LMM_C.stan (1.1 KB)

The \hat{\sigma} for the term R=\sigma I from File C is much smaller than the one from File A. However, File A is closer to the true \sigma.

File B converts the Cholesky factor into a vector of effects by multiplying it with z and uses normal_lpdf as the target; File C uses multi_normal_cholesky_lpdf directly. The results from File C are much closer to the results from File A (for expression A).

Results:

intercept and \beta

true beta: 4.553443 5.174803 5.074551 5.428167
apply(post.A$beta,2,median)
[1] 4.555137 5.178650 5.078968 5.428761
apply(post.B$beta,2,median)
[1] 4.530715 5.203898 4.848865 5.524263
apply(post.C$beta,2,median)
[1] 4.554859 5.178618 5.079023 5.428616

sd(u) (sd_u)

apply(post.A$sd_u,2,median)
[1] 0.3247672 0.5964836 0.7249493 1.4808232 1.3468040
apply(post.B$sd_u,2,median)
[1] 0.4260537 0.5675873 0.5788620 0.5070517 0.7498735
apply(post.C$sd_u,2,median)
[1] 0.3589458 0.5070350 0.5949263 1.4044897 1.2004613

\sigma: true 0.01

mean(post.A$sigma)
[1] 0.009063868
mean(post.B$sigma)
[1] 0.01823508
mean(post.C$sigma)
[1] 8.298258e-05

LOO results from expression A:

Computed from 2000 by 100 log-likelihood matrix
         Estimate   SE
elpd_loo    324.5  7.0
p_loo         9.4  1.4
looic      -649.0 13.9
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Results from expression B:

Computed from 2000 by 100 log-likelihood matrix

         Estimate   SE
elpd_loo   -124.8  6.1
p_loo        12.6  2.2
looic       249.6 12.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     68    68.0%   755       
 (0.5, 0.7]   (ok)       22    22.0%   317       
   (0.7, 1]   (bad)       6     6.0%   146       
   (1, Inf)   (very bad)  4     4.0%   179       
See help('pareto-k-diagnostic') for details.

In practice, expression A appears more correct and more robust than B.

I just wonder: is there a theoretical way to prove it? Thank you.

1 Like

I said the likelihood would be the same, but to be precise, the requirement for u to be Gaussian is actually enforced by the prior in a Bayesian implementation (and the same applies to the requirement on \varepsilon). The priors would otherwise be uniform to match the frequentist approach (not that this is a good idea in general, as most people here will tell you).

I’m intuitively convinced of the one-to-one correspondence between the two approaches, so I don’t think either is more robust (but the Bayesian approach should indeed allow the uncertainty in any parameter to be assessed directly, which won’t be the case in the frequentist formulation). That said, I’d like to see the posterior/likelihood for each of them to be sure, which I think can be, and is best, done analytically.

1 Like

Thanks again for replying to the post. I used the INLA package on the same data set and compared the outcomes to those from rstan.

library(INLA)
set.seed(123)
n = 100
nz = 5
nb = 4
tmp = matrix(rnorm(nz^2), nz, nz)
Qz = tmp %*% t(tmp)
Z = matrix(runif(n*nz), n, nz)
u = inla.qsample(1, Q=Qz)
beta = 5 + rnorm(nb)
X = matrix(runif(n*nb), n, nb)
X[,1] = 1 # want an intercept
eta = X %*% beta + Z %*% u
y = eta + rnorm(n, sd=0.01)
Qx = diag(nb)
formula = y ~ -1 + X + f(id.z, model="z", Cmatrix=Qz, Z=Z)
result = inla(formula, data = list(y=y, id.z = 1:n, X=X))
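
For anyone reproducing this, the posterior summaries used for the comparison with rstan can be read off the fitted object (these are the standard summary fields of an inla result; added here as a usage note):

summary(result)            # overview of the fit
result$summary.fixed       # posterior summaries of the fixed effects (beta)
result$summary.hyperpar    # posterior summaries of the hyperparameters (precisions)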

“Robust” here means: the results from A are the same as those from INLA, and the posterior predictive check is promising compared to B and C. The posterior predictive checks with uniform (vague) priors are shown below.

(\sigma=0.01 was too small)

This paper (at the top of page 486) says that “combining the joint distribution of (y,u) and integrating out u” yields the expression (Y-X\beta)^\top (ZGZ^\top+R)^{-1}(Y-X\beta), which corresponds to what you mentioned.
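
Written out, the marginalization step the paper refers to is the standard Gaussian identity (added here for completeness):

\int N(Y \mid X\beta+Zu,\ R)\, N(u \mid 0,\ G)\, du = N(Y \mid X\beta,\ ZGZ^\top+R),

so expression B is what is left after integrating u out of expression A.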

But I still don’t understand why “the likelihood would be the same”. From my perspective, the two likelihoods have different means and covariance matrices. Perhaps that is because expression A is the conditional distribution of a hierarchical model, y\mid u\sim N(X\beta+Zu,\sigma), u\mid \tau \sim N(0,\tau), \tau\sim N^+(0,1).

Regards with thanks.

Hi!

The discussion made me really curious, and I wanted to better understand this marginalization business.

I made a small simulation study to compare the two formulations, and, as you computed analytically, it does not give the same likelihood…

# Clean the environment
rm(list=ls())

## Load needed libraries
library(mvtnorm)
library(useful)

# Set seed
set.seed(123)

# Number of data points
N = 500
# Number of predictors
P = 5
# Number of categories
K = 10

# Create predictor matrix
X <- matrix(rnorm(N*P), nrow = N, ncol = P)
# Create hierarchical effect matrix
Z_vec <- data.frame(Z = rep(LETTERS[1:K], each = N/K))
Z <- build.x(~ Z, data = Z_vec, contrasts = F)[,-1]
# Sample hyper sigma
G <- matrix(0, K, K)
diag(G) <- 0.5
# Sample sigma
R <- matrix(0, N, N)
diag(R) <- 1
# Sample intercepts
u <- rmvnorm(1, sigma = G)
# Sample slopes
beta <- rnorm(P, 0, 2)

# Compute linear predictor
mu <- X %*% beta + Z %*% t(u)

# Sample from the mean equation
y_mean <- rmvnorm(1, mu, R)
# Sample from the marginalized equation
y_marg <- rmvnorm(1,  X %*% beta, Z %*% G %*% t(Z) + R)

# Evaluate log-likelihoods from the mean simulation
dmvnorm(y_mean, mu, R, log = T)
sum(dnorm(c(y_mean), c(mu), sqrt(diag(R)), log = T))  # elementwise equivalent (sd = sqrt of diag(R))
[1] -719.3105

dmvnorm(y_mean, X %*% beta, (Z %*% G) %*% t(Z) + R, log = T)
[1] -732.0707

# Evaluate log-likelihood from the marginalized simulation
dmvnorm(y_marg, mu, R, log = T)
sum(dnorm(c(y_marg), c(mu), sqrt(diag(R)), log = T))  # elementwise equivalent (sd = sqrt of diag(R))
[1] -855.9829
dmvnorm(y_marg, X %*% beta, (Z %*% G) %*% t(Z) + R, log = T)
[1] -720.5364
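
To connect the two numbers, here is one more check (a minimal sketch I’m adding, kept deliberately low-dimensional so that plain Monte Carlo integration is accurate; the toy objects below are made up and independent of the simulation above): averaging the conditional density over draws of u from its prior recovers the marginal density.

# Toy example: the marginal density equals the conditional density averaged over u ~ N(0, G)
set.seed(456)
Z_s <- cbind(c(1, 1, 0, 0), c(0, 0, 1, 1))    # 2 groups, 2 observations each
G_s <- diag(0.5, 2)                           # Var(u)
R_s <- diag(1, 4)                             # Var(eps)
mu_s <- rep(2, 4)                             # stands in for X %*% beta
y_s  <- c(rmvnorm(1, mu_s, Z_s %*% G_s %*% t(Z_s) + R_s))

M <- 1e5
cond_dens <- replicate(M, {
  u_draw <- c(rmvnorm(1, sigma = G_s))        # u ~ N(0, G)
  dmvnorm(y_s, c(mu_s + Z_s %*% u_draw), R_s) # conditional density N(y | mu + Z u, R)
})
log(mean(cond_dens))                                          # Monte Carlo estimate of the marginal log-density
dmvnorm(y_s, mu_s, Z_s %*% G_s %*% t(Z_s) + R_s, log = TRUE)  # exact marginal; the two should agree closely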

I repeated the above simulation 200 times and compared the log-likelihoods from the two solutions. The log-likelihood of the true generating process is always higher than the other one, by a fairly regular offset.

It seems quite logical and self-explanatory after all: we compare a solution in which we know (estimate) the actual realization of the hierarchical effects to a solution in which we treat them as “noise” and estimate only that noise. Should we see the second one as an approximation of the true data-generating process? I should probably read more, and dive into @andrewgelman’s books.

Then there is the shrinkage applied in the Bayesian case, which pulls the hierarchical parameters toward the mean depending on the information available to identify them. What effect would that have on the log-likelihood value? I guess the answer is here. According to this result, pooling should naturally decrease the residual sum of squares.

Sorry for being fuzzy; I am trying to assemble disconnected pieces of a big puzzle.

1 Like

As far as I understand, the model itself is the same, Y = X\beta + Zu + \varepsilon; the only difference is that the normality of the error terms is enforced by assumption in the frequentist case (which allows you to obtain an analytical estimate) and by the priors in the Bayesian formulation. There is no elaborate reason why they converge to the same model other than that they have the same structure and the same constraints/assumptions on the noise terms.

As I mentioned, the mean is easily shown to be the same, since E[u] = 0. The covariance matrices you are simulating with are clearly different, because ZGZ^\top is nonzero. The question is whether Var[Y] = R really holds; I suspect it does not. Beyond that, I don’t see any reason the two formulations should differ, in which case the matrix expressions should be equivalent. I’d try to compute the variance and/or the likelihood/posterior from the model directly, and I believe they will be the same, but I may be missing something.

1 Like

Just looking at Y = X\beta + Zu + \varepsilon, it seems clear to me that Var[Y] cannot be just Var[\varepsilon]=R, because Y picks up additional variance (but no shift in mean) from u.
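
Spelled out with the law of total variance (a standard identity, added for completeness):

Var[Y] = E_u[Var(Y \mid u)] + Var_u(E[Y \mid u]) = R + Var_u(X\beta+Zu) = ZGZ^\top+R,

whereas conditional on u, Var[Y \mid u] = Var[\varepsilon] = R. The two expressions are therefore statements about the marginal and the conditional distribution of Y, respectively.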