Whether to preserve posterior correlations in downstream inference

I haven’t tried your example, but when you are taking the difference between two parameters at the same row of output, are you considering whether the parameters may be correlated? What happens if you separately resample each column of output and then make your histogram?

This model yields an essentially uncorrelated posterior because it fits the two regressions independently. If the model didn’t fit the two regressions independently, then accurate inference would crucially require not resampling the columns, so that any posterior correlation is preserved in downstream inference.

Thanks. Does this mean that when considering the difference between two levels of a “random effect”, the relevant columns should not be resampled? If so, thanks for correcting my misunderstanding.


Short answer is yes, assuming your inferential goal is what I think you are saying.

Say that you have a posterior distribution for the two quantities of interest that is approximately bivariate normal. Here’s a graph of the margins of a bivariate normal distribution with means 0 and 0.5, and sds 1 and 1.

The important thing to note here is that the margins (given these means and sds) will look like this regardless of what the correlation is. Resampling within columns and then differencing tells you the probability that a random draw from one marginal distribution is higher than a random draw from the other. It doesn’t tell you the probability that x1 is higher than x2 in a random draw from the joint distribution. That probability depends on the correlation. So, for example, if the posterior is highly positively correlated, it can be virtually certain that x2 is larger than x1 even though the margins overlap substantially. Here’s the bivariate normal, with the same means and sds, and with correlation 0.9.
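To make this concrete, here’s a small self-contained simulation (not from the thread; the means, sds, and correlation of 0.9 match the figure described above) showing how shuffling one column changes the answer:

```r
# Two draws-vectors with margins N(0, 1) and N(0.5, 1), correlation 0.9.
set.seed(1)
n   <- 1e5
z1  <- rnorm(n)
z2  <- rnorm(n)
rho <- 0.9
x1  <- z1
x2  <- 0.5 + rho * z1 + sqrt(1 - rho^2) * z2

mean(x2 > x1)          # joint distribution: about 0.87
mean(sample(x2) > x1)  # shuffled (margins only): about 0.64
```

The margins of `x1` and `x2` are identical in both lines; only the joint structure differs.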

Now, in general we don’t expect random effects to be so strongly correlated, but under common parameterizations we actually do expect some positive correlation, because all random effect values tend to be negatively correlated with the intercept. Note that it’s also possible (though a bit trickier to get the prior right) to parameterize the random effect vector under a sum-to-zero constraint, but then the constrained parameters will tend to be negatively correlated with each other due to the constraint itself. In either case, these correlations (positive or negative) will tend to get stronger as the number of random effect levels decreases.
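If it helps to see the sum-to-zero effect in isolation, here’s a sketch (synthetic draws, not from any fitted model) of the negative correlation the constraint induces purely by construction:

```r
# Centring K = 5 independent effects within each draw forces them to
# compete: if one goes up, the others must come down on average.
set.seed(2)
K <- 5
theta  <- matrix(rnorm(4000 * K), ncol = K)  # unconstrained effects, one row per draw
thetac <- theta - rowMeans(theta)            # impose sum-to-zero within each draw
cor(thetac[, 1], thetac[, 2])                # about -1/(K - 1) = -0.25
```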

In any case, the correct procedure here is to preserve the correlation structure of the joint posterior by not resampling within columns.


Thanks. I’ll try to come up with a short synthetic example to illustrate why I thought resampling appropriate when considering the ranef for different levels of the same factor. Later.

thanks

library(brms)
library(hdi)

set.seed(1984)
rn<-rnorm(5)
x<-runif(400)
sd<-0.5
s<-0.3
f<-sample(1:5,400,replace=TRUE)
fac<-factor(f)
levels(fac)<-c("Liverpool","Manchester","Birmingham","Coventry","Newcastle")

y<-x^2+sd * rn[f]+s * rnorm(400)
plot(x,y,col=f,pch=15+f)

dat<-data.frame(x,y,fac)
bf<-bf(y~(1|fac)+s(x^2))
m1<-brm(bf,dat,chains=2,control=list(adapt_delta=0.99,max_treedepth=15))

vars<-variables(m1)
vars[7:11]
mat<-as.matrix(m1)

ranef(m1)
#compare Liverpool and Birmingham - large overlap of symmetric credible intervals
hdi(mat[,7])
hdi(mat[,9])
#likewise large overlap of highest posterior density intervals

crit<-(f%in%c(1,3))
plot(x[crit],y[crit],col=f[crit],pch=15+f[crit])

#plenty of overlap of the random effects

#BUT
hdi(mat[,7]-mat[,9])
##     lower     upper
## -0.219579 -0.021977
## attr(,"credMass")
## [1] 0.95

#so apparently we can be highly confident that the effect for Liverpool is below that for Birmingham
#in fact this is due to the high correlation of the samples.

cor(mat[,7],mat[,9])

#if we permute the samples the result is more plausible

hdi(sample(mat[,7],2000)-sample(mat[,9],2000))
##     lower     upper
## -1.088742  0.917237
## attr(,"credMass")
## [1] 0.95

This is a great example of exactly what I described above. In this model, the random effect distribution is a Gaussian whose mean \mu corresponds to the model intercept. The actual group-specific values are \mu + \theta, where \theta is the random effect parameter. No matter how data-rich the individual groups are, those data only serve to constrain the value of \mu + \theta; adding data within the groups doesn’t help to constrain the estimate of \mu itself. With only five groups, the true value of \mu is not all that well constrained, even though the values of \mu + \theta are well constrained.

In this situation, the joint distribution of \mu and each element of \theta features a strong negative correlation (you can verify this from your fitted model object), while the elements of \theta are all strongly positively correlated with one another. The correct marginal estimates for the group-specific intercepts \mu + \theta are obtained by summing within rows, and will yield much tighter uncertainty intervals than the uncertainty in either \mu or \theta alone. Again, this is expected because the data tightly constrain the sum \mu + \theta but do not tightly constrain \mu.
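A toy sketch of that identification argument (synthetic numbers, not your fit): suppose the data pin down \mu + \theta tightly while \mu itself stays loose; the implied \theta draws must then track \mu almost perfectly:

```r
# Pretend posterior draws for one group.
set.seed(3)
s_sum <- rnorm(4000, mean = 1, sd = 0.05)  # mu + theta: tightly constrained by the data
mu    <- rnorm(4000, mean = 0, sd = 0.5)   # mu alone: weakly constrained
theta <- s_sum - mu                        # implied random effect draws

cor(mu, theta)   # strong negative correlation, close to -1
sd(mu + theta)   # the sum stays tight: about 0.05
```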

Going back to the rn object, we see that indeed Birmingham is higher than Liverpool. So everything is as expected here. We have highly correlated samples because the intercept is only weakly identified with a sample of just 5 groups to estimate the population-level mean. We have correct inference that Birmingham is a bit higher than Liverpool. And we recover that inference precisely by NOT breaking apart the joint posterior into its marginals.

More broadly, when MCMC methods work properly, they yield samples from the joint posterior distribution. Downstream computation works precisely because of this key fact. The distribution whose marginals correspond to the posterior but whose correlations are zero is a completely different distribution that is not the posterior distribution anymore, and inference from that new distribution about quantities that depend on more than one margin of the joint posterior will be incorrect to a potentially arbitrarily severe degree.


Thanks for helping me to understand this. Unfortunately, the example was not actually what I intended: I had meant to delete the intercept using -1 in the formula. But in that case the columns are barely correlated, so the example does not illustrate the problem I had in mind, which does arise, just not in this example. For now, though, thanks again.

Thanks again. Below is an example in which the numerical covariate has some relation to the factor, leading to correlations in the sampled output, which is what led me to the error of permuting the factor outputs. Further below is how I now see it.

Greg

library(rstan)
rstan_options(auto_write = TRUE)
library(brms)
library(cmdstanr)
options(mc.cores=2, brms.normalize=TRUE, brms.backend="cmdstanr")
library(HDInterval)

set.seed(2001)
N=400
f<-sample(1:10,size=N,replace=TRUE)
fac<-factor(f)
levels(fac)<-LETTERS[1:10]
table(fac)
rnf<-0.3 * rnorm(10)
ef<-rnf[f]
v<-ef+0.2 * rnorm(N)
plot(fac,v,xlab="fac",ylab="v")

y<-0.5+ef+v^2+0.1 * rnorm(N)
plot(v,y,col=f)

bf1<-bf(y~-1+(1|fac)+s(v,k=7))
dat1<-data.frame(v,y,fac)

m1<-brm(bf1,dat1,chains=2,control=list(adapt_delta=0.95))
conditional_smooths(m1)
vars<-variables(m1)
mat<-as.matrix(m1)
cor(mat[,10],mat[,12])

#this led me, mistakenly, to permute the columns before forming the difference and applying hdi

#how I now see it
vars
matr<-mat[,5:14]
cr<-rowMeans(matr)
matrc<-matr-outer(cr,rep(1,10))
mnrc<-colMeans(matrc)
rnfc<-rnf-mean(rnf)
plot(rnfc,mnrc,ylim=c(-0.7,0.5),pch=16)
for (i in 1:10)
{
lines(rep(rnfc[i],2),hdi(matrc[,i]))
}

#the hdi obtained from the centred columns of matr do capture the centred values of rnf
#rnf is centred around its mean
#columns of matr are centred around cr, which is a distribution of means of samples

#to compare the random effects of F and H, take matr[,6]-matr[,8] = mat[,10]-mat[,12]
#(equivalently matrc[,6]-matrc[,8], since the same cr is subtracted from both columns)
mean(matr[,6]-matr[,8])
hdi(matr[,6]-matr[,8])
rnf[6]-rnf[8]
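One way to confirm the point in that last comment (subtracting a per-row constant like cr never changes within-row differences) is a quick self-contained check; the names below are stand-ins so they don’t clobber the objects above:

```r
set.seed(4)
draws  <- matrix(rnorm(20), nrow = 4)               # stand-in for a draws matrix like matr
rowm   <- rowMeans(draws)
drawsc <- draws - outer(rowm, rep(1, ncol(draws)))  # centre within each row, as above
all.equal(drawsc[, 1] - drawsc[, 3], draws[, 1] - draws[, 3])  # TRUE
```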