Bayesian standard error for an estimate of a mean

I’m running into some unexpected differences between a linear model fit in Stan and the same model fit via the lm function in R (rstan version 2.16.0).

I would expect that the standard error for a model coefficient fit via a linear model would roughly equal the standard deviation of the marginal posterior distribution of that parameter in a Bayesian linear model. This is not the case for the models I’m looking at, which I don’t understand. This may reflect a basic misunderstanding on my part of posterior distributions versus sampling distributions of maximum likelihood estimates. Any help is appreciated, more details follow:

I’m interested in the performance of linear models in settings where the number of parameters approaches the sample size. I’ve run the same linear model on a series of datasets with N = 60 units. The true data generating function is Xb plus Gaussian noise, where X is an N x p matrix. For p in {5, 20, 40, 50} I’ve fit a linear model in Stan and one in R and compared a) the average difference in coefficients (which should be zero if the posterior mean equals the maximum likelihood estimate) and b) the average ratio of standard errors (i.e. the standard deviation of the posterior distribution of each coefficient versus the default standard error from R’s lm function).

For p << N, the coefficient estimates agree between Stan and lm, and the ratio of standard errors is approximately 1.0. However, as p grows with N held constant, the ratio of standard errors increases (i.e. Stan’s posterior standard deviations are larger), up to a ratio of 1.2 at p = 50. This happens even when the parameters in Stan are given zero-centered normal priors.

I don’t have a good explanation for these observations: Stan consistently gives less precise answers than a standard MLE for the same model. I first noticed this when comparing Stan results to ridge regression, where ridge regression gave much smaller standard errors.

R + Stan code below

# does a stan model return a larger variance than a non-stan model?
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library(mvtnorm)
zz <- file('/dev/null', open = "wt")  # sink target for suppressing sampler output


# data generating function: correlated predictors, sparse coefficients
# (z is generated the same way as y but is unused below)
dg.f <- function(N, p){
  set.seed(1232)
  U = matrix(runif(p*p), nrow=p, ncol=p)
  S = 0.5*(U + t(U))            # symmetrize
  S = S + p*diag(rep(1, p))     # make positive definite
  X = rmvnorm(N, mean = runif(p, -1, 1), sigma = S)
  # coefficients: mostly zeros, with a few entries equal to one uniform draw
  z = cbind(rep(1, N), X) %*% sort(sample(c(rep(0, p), runif(1)), size = p+1, replace = TRUE)) + rnorm(N, 0, 1)
  y = cbind(rep(1, N), X) %*% sort(sample(c(rep(0, p), runif(1)), size = p+1, replace = TRUE)) + rnorm(N, 0, 1)
  list(X = X, y = as.numeric(y), z = as.numeric(z), N = N, p = p)
}


stanmod <- '
data{
 int<lower=1> N;
 int<lower=1> p;
 vector[N] y;
 matrix[N,p] X;
}
parameters{
 real b0;
 vector[p] b;
 real<lower=0> sig;
}
model{
 b0 ~ normal(0, 10);
 b ~ normal(0, 3);
 y ~ normal(b0 + X*b, sig);
}
'

compare_lm_stan <- function(p){
  dat = dg.f(N = 60, p)
  sink(zz)   # suppress sampler output
    # compile once with a single iteration, then reuse the compiled fit
    check.fit <- stan(model_code = stanmod, data = dat, chains = 1, iter = 1, refresh = -1)
    stanmod.fit <- stan(fit = check.fit, data = dat, chains = 4, iter = 30000, refresh = -1)
    linmod.fit = lm(dat$y ~ dat$X)
  sink()

  ord.stan = c('b0', paste0('b[', 1:p, ']'))
  b.stan <- summary(stanmod.fit)$summary[, 'mean'][ord.stan]
  se.stan <- summary(stanmod.fit)$summary[, 'sd'][ord.stan]
  b.lm <- linmod.fit$coefficients
  se.lm <- summary(linmod.fit)$coefficients[, 2]
  # mean coefficient difference and mean ratio of posterior sd to lm SE
  c(mean(b.stan - b.lm), mean(se.stan / se.lm))
}


stan_version()
# [1] "2.16.0"


# with N/p = 12
compare_lm_stan(5)
# [1] 7.040717e-05 1.029381e+00

# with N/p = 3
compare_lm_stan(20)
# [1] -0.0001195749  1.0378405119

# with N/p = 3/2
compare_lm_stan(40)
#[1] -0.0001041135  1.0873995925

# with N/p = 6/5
compare_lm_stan(50)
#[1] -6.553468e-05  1.207667e+00

Based on some simulations, I think I have the following explanation:

The standard errors from a maximum-likelihood fit of a linear model are not based on the ML estimate of the variance of the error term.

e.g. for the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$, the routines that compute the MLE of $\beta$ do not use the ML estimate of $\sigma$ for the standard errors.

In contrast, under a non-informative prior, a Bayesian approach lets the posterior distribution of $\sigma$ be determined only by the likelihood, so the MAP estimate of $\sigma$ coincides with the MLE of $\sigma$.
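A sanity check of that claim (my own sketch, not part of the original run): rstan’s optimizing() returns the posterior mode, so with the prior lines stripped out of the model it should reproduce the MLE, including the divide-by-N estimate of sigma. The name flatmod is mine.

flatmod <- '
data{
 int<lower=1> N;
 int<lower=1> p;
 vector[N] y;
 matrix[N,p] X;
}
parameters{
 real b0;
 vector[p] b;
 real<lower=0> sig;
}
model{
 y ~ normal(b0 + X*b, sig);  // no priors: posterior mode = MLE
}
'
dat <- dg.f(N = 60, p = 5)
map.fit <- optimizing(stan_model(model_code = flatmod), data = dat)
rss <- sum(residuals(lm(dat$y ~ dat$X))^2)
map.fit$par['sig']   # should be close to the ML estimate of sigma...
sqrt(rss / dat$N)    # ...which divides by N, not by the residual df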

I re-fit my Stan models using the estimate of $\sigma$ given by the lm function, and the standard deviations of the posterior distributions (from Stan) now match the standard error estimates from the linear model (from lm).
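For reference, a minimal sketch of that refit (the names stanmod.fixed and fixed.fit are mine): sigma enters as data, set to the residual standard error that lm reports, and only the coefficients remain as parameters.

stanmod.fixed <- '
data{
 int<lower=1> N;
 int<lower=1> p;
 vector[N] y;
 matrix[N,p] X;
 real<lower=0> sig;  // plugged-in residual SE from lm
}
parameters{
 real b0;
 vector[p] b;
}
model{
 y ~ normal(b0 + X*b, sig);
}
'
dat <- dg.f(N = 60, p = 5)
dat$sig <- summary(lm(dat$y ~ dat$X))$sigma
fixed.fit <- stan(model_code = stanmod.fixed, data = dat, chains = 4, iter = 30000)
# the posterior sds of b0 and b should now line up with lm's standard errors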

I had never heard of this before, but it makes sense, given that the MLE-based standard errors plug in a fixed value of $\sigma$, which (I believe) the ML approach estimates with a biased estimator.

In R, you have

var* = sum( (y - mean(y))^2 ) / N
var-hat = sum( (y - mean(y))^2 ) / (N - 1)

with var* being the maximum likelihood estimate of variance and var-hat being the traditional unbiased estimator. The MLE systematically underestimates variance because it plugs in the sample mean rather than the true mean, and the sample mean minimizes the sum of squared deviations. It turns out to be off by exactly a factor of N / (N - 1), so it’s easy to correct. The “sample variance” usually means var-hat, and it’s what most regression packages report.
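A quick base-R check of that factor:

# numeric check of the N / (N - 1) factor
set.seed(1)
y <- rnorm(60)
N <- length(y)
var.mle <- sum((y - mean(y))^2) / N        # ML estimate, biased low
var.hat <- sum((y - mean(y))^2) / (N - 1)  # unbiased estimator; what var() returns
var.hat / var.mle                          # exactly N / (N - 1)
N / (N - 1)
all.equal(var.hat, var(y))                 # TRUE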

Sorry there’s no MathJax on this site. It’s really annoying!

Thanks - that makes sense, and matches where I had gotten on my own.

This was motivated by seeing the same thing in logistic models.

There’s no error term in the logistic model, so I don’t understand why the variance of the posterior distribution for a model coefficient is larger in a Bayesian model than the variance of the ML estimate. The posterior distributions look normal under vague priors, so it doesn’t seem to me that there would be something terribly wrong with the ML estimates of the variance.

Example code that demonstrates this in logistic models.

# now with logistic

# data generating function for the logistic case
dg2.f <- function(N, p){
  set.seed(1232)
  U = matrix(runif(p*p), nrow=p, ncol=p)
  S = 0.5*(U + t(U))            # symmetrize
  S = S + p*diag(rep(1, p))     # make positive definite
  X = rmvnorm(N, mean = runif(p, -1, 1), sigma = S)
  # z is generated the same way as in dg.f but unused below
  z = cbind(rep(1, N), X) %*% sort(sample(c(rep(0, p), runif(1)), size = p+1, replace = TRUE)) + rnorm(N, 0, 1)
  # note: 1/(1+exp(eta)) is inv_logit(-eta), so the coefficient signs are flipped
  # relative to the draw; harmless here since Stan and glm see the same data
  y = rbinom(N, 1, 1/(1+exp(cbind(rep(1, N), X) %*% sort(sample(c(rep(0, p), runif(1)), size = p+1, replace = TRUE)))))
  list(X = X, y = as.numeric(y), N = N, P = p)
}


stanmod2 <- '
data{
 int<lower=1> N;
 int<lower=1> P;
 int<lower=0,upper=1> y[N];
 matrix[N,P] X;
}
parameters{
 real b0;
 vector[P] b;
}
model{
 b0 ~ normal(0, 10);
 b ~ normal(0, 10);
 y ~ bernoulli_logit(b0 + X*b);
}
'


compare_logit_stan <- function(p){
  dat = dg2.f(N = 60, p)
  # compile once with a single iteration, then reuse the compiled fit
  stanmod.check <- stan(model_code = stanmod2, data = dat, chains = 1, iter = 1, refresh = -1)
  stanmod.fit <- stan(fit = stanmod.check, data = dat, chains = 4, iter = 5000, refresh = -1)
  logitmod.fit = glm(dat$y ~ dat$X, family = binomial())
  ord.param = c('b0', paste0('b[', 1:p, ']'))
  b.stan <- summary(stanmod.fit)$summary[, 'mean'][ord.param]
  se.stan <- summary(stanmod.fit)$summary[, 'sd'][ord.param]
  b.lm <- logitmod.fit$coefficients
  se.lm <- summary(logitmod.fit)$coefficients[, 2]
  # mean coefficient difference and mean ratio of posterior sd to glm SE
  c(mean(b.stan - b.lm), mean(se.stan / se.lm))
}


stan_version()
# [1] "2.16.0"


zz <- file("/dev/null", open = "wt")
sink(zz,type = c("output", "message"), split=FALSE)
  results <- sapply(c(2,5,10,15), compare_logit_stan, USE.NAMES=TRUE)
sink()
results

# columns correspond to p = 2, 5, 10, 15
# row 1: mean coefficient difference; row 2: mean ratio of posterior sd to glm SE
#            [,1]        [,2]        [,3]       [,4]
#[1,] 0.005760191 -0.03406230 -0.01549622 -0.1438062
#[2,] 1.033591831  1.10562883  1.12184562  1.4267704

There’s no error parameter, but there’s still error in the linked distribution. For Bernoulli logistic regression, you get

y[n] ~ bernoulli_logit(x[n] * alpha);

there’s sampling variation in the y[n] due to the Bernoulli. Specifically,

var[y[n]] = p * (1 - p)

where

p = inv_logit(x[n] * alpha).

When you have

y[n] ~ normal(x[n] * alpha, sigma)

then

var[y[n]] = sigma^2.

So you get variance in the outcome either way.
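For concreteness, a quick simulation of that Bernoulli variance (the value of the linear predictor is arbitrary):

eta <- 0.7                            # arbitrary x[n] * alpha
p <- plogis(eta)                      # R's built-in inv_logit
p * (1 - p)                           # analytic variance of y[n]
var(rbinom(1e6, size = 1, prob = p))  # simulated; should agree closely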

Thanks for your continued explanations!

There’s no error parameter, but there’s still error in the linked distribution.

But isn’t the lack of an error parameter an important distinction here? As I understand it, the issue I was seeing with the linear model is that ML routines and Bayes estimate the error parameter differently. That difference leads to different precision in the other parameters, because the sampling distributions of the coefficient estimates depend on the error variance. Doing a pure ML estimation of the linear model (i.e. using the ML estimate of the error variance) eliminates the discrepancy I observed.

In contrast, for a logistic model I know of no analogous difference in how ML routines estimate variance. The variance

var[y[n]] = p * (1 - p)

does not appear as a separate parameter in the binomial likelihood, so it’s not clear why differences in variance estimation would affect the other model parameters in the logistic setting. However, that seems to run counter to my simulations.
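For reference, here is where glm’s standard errors come from, as far as I understand it (a sketch using dg2.f from above; p = 5 is arbitrary): the inverse Fisher information, in which p * (1 - p) enters as a weight matrix, so the outcome variance does affect the precision of the coefficients even without a separate parameter.

dat <- dg2.f(N = 60, p = 5)
fit <- glm(dat$y ~ dat$X, family = binomial())
Xd <- model.matrix(fit)                  # design matrix with intercept
ph <- fitted(fit)                        # fitted probabilities
W <- diag(ph * (1 - ph))                 # the p * (1 - p) term as IRLS weights
se.info <- sqrt(diag(solve(t(Xd) %*% W %*% Xd)))
cbind(se.info, summary(fit)$coefficients[, 2])  # the two columns should match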

Some other things I think I’ve ruled out:

  • non-normal parameters: based on the marginal posterior densities from the Stan models, the posterior distributions from even my sparsest models look very close to normal. A posterior mean under uniform priors should therefore give similar inference to the MLE (right?).
  • quirks with the glm function in R or hidden sparse-data techniques (e.g. silently using a Jeffreys prior). I’ve refit some models with different optimization routines (nlm, optim) in R and obtained the same answers.

If I’m totally missing the point, feel free to tell me to go read a book.

The bottom line is that MLE and Bayes give you very different things. Even Bayesian point estimates are posterior means (which minimize expected squared error relative to the model), and those are very different from (penalized) maximum likelihood estimates (which have no probabilistic interpretation but have nice limiting properties as data accumulate).

It’s implicit. Every distribution has a mean and a variance (though in some cases, like the Cauchy, they don’t exist). For example, poisson(lambda) has mean lambda and variance lambda, whereas the negative binomial adds an extra parameter for overdispersion.
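A quick simulation of those implicit moments (the values of lambda, mu, and size are arbitrary):

lambda <- 4
c(mean(rpois(1e6, lambda)), var(rpois(1e6, lambda)))  # both approximately lambda
mu <- 4; size <- 2                                    # R's mu/size parameterization
mu + mu^2 / size                                      # analytic neg. binomial variance
var(rnbinom(1e6, mu = mu, size = size))               # simulated; should agree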

No. It’s a common misconception. If your posterior is symmetric and unimodal, then the posterior mean will match the penalized MLE (the posterior mode), but that’s only the point estimate. If you use uniform priors on an interval, they can introduce bias when the data are consistent with values near the boundaries.

You’re probably not getting an MLE of the variance; it’s usually estimated by dividing by (N - 1) rather than N. Dividing by N gives you the MLE, but it’s biased, whereas dividing by (N - 1) corrects for using the sample mean rather than the actual mean in the squared differences.

Priors may also have an impact if they’re more than weakly informative.
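A small conjugate-normal illustration of that point (known sigma; the numbers are arbitrary): for y_1, ..., y_n ~ N(mu, sigma^2) with prior mu ~ N(0, tau^2), the posterior sd of mu is 1 / sqrt(n / sigma^2 + 1 / tau^2), which only matches the frequentist standard error sigma / sqrt(n) when tau is large.

n <- 60; sigma <- 1
post.sd <- function(tau) 1 / sqrt(n / sigma^2 + 1 / tau^2)
post.sd(10)      # weak prior: essentially sigma / sqrt(n)
sigma / sqrt(n)  # the frequentist standard error
post.sd(0.1)     # informative prior: noticeably smaller posterior sd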