I’m running into some unexpected differences between a linear model fit in Stan versus a linear model fit via the “lm” function in R (rstan version 2.16.0).
I would expect that the standard error for a model coefficient fit via a linear model would roughly equal the standard deviation of the marginal posterior distribution of that parameter in a Bayesian linear model. This is not the case for the models I’m looking at, which I don’t understand. This may be a basic misunderstanding of posteriors distributions versus sampling distributions of max. likelihood estimates on my part. Any help is appreciated, more details follow:
I’m interested in the performance of linear models in in models where the number of parameters is approaching the sample size. I’ve run the same linear model on a series of datasets with N=60 units. The true data generating function is given by b*X, where X is a NxP matrix. For P in {5,12,20,40,50} I’ve fit a linear model in Stan and one in R and compared a) the average difference in coefficients (should be zero if the posterior mean is equal to the maximum likelihood estimate) and b) the average ratio of standard errors (e.g. the standard deviation of the posterior distribution fo the coefficients versus the default standard error in R’s lm
function).
For p<<N, the parameters are the same between Stan and lm
, and the ratio of standard errors is approximately 1.0. However, as p grows and N stays constant, the ratio of standard errors increases (e.g. Stan gives less precise standard errors), up to a ratio of 1.2 at p=50. This happens even when the parameters in Stan are given null centered, normal priors.
I don’t have a good explanation for these observations. Stan gives consistently less precise answers than a standard MLE for the same model. I noticed this when comparing Stan results to ridge regression, where ridge regression performed much better in terms of standard error.
R + Stan code below
# does a stan model return a larger variance than a non-stan model?
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library(mvtnorm)
zz <- file('/dev/null', open = "wt")
# data generating function
dg.f <- function(N,p){
set.seed(1232)
U = matrix(runif(p*p), nrow=p, ncol=p)
S = 0.5*(U+t(U))
S = S + p*diag(rep(1,p))
X = rmvnorm(N, mean = runif(p, -1, 1), sigma = S)
z = cbind(rep(1, N), X)%*%sort(sample(c(rep(0,p), runif(1)), size = p+1, replace=TRUE)) + rnorm(N,0,1)
y = cbind(rep(1, N), X)%*%sort(sample(c(rep(0,p), runif(1)), size = p+1, replace=TRUE)) + rnorm(N,0,1)
list(X=X,y=as.numeric(y),z=as.numeric(z),N=N,p=p)
}
stanmod <- '
data{
int N;
int p;
row_vector[N] y;
matrix[N,p] X;
}
parameters{
real b0;
vector[p] b;
real<lower=0> sig;
}
model{
b0 ~ normal(0, 10);
b ~ normal(0, 3);
y ~ normal(b0 + X*b, sig);
}
'
compare_lm_stan <- function(p){
dat = dg.f(N=60, p)
sink(zz)
check.fit <- stan(model_code = stanmod, data=dat, chains=1, iter=1, refresh=-1)
stanmod.fit <- stan(fit = check.fit, data=dat, chains=4, iter=30000, refresh=-1)
linmod.fit = glm(dat$y~dat$X)
sink()
ord.stan=c('b0', paste0('b[',1:p,']'))
b.stan <- summary(stanmod.fit)$summary[,1][ord.stan]
se.stan <- summary(stanmod.fit)$summary[,3][ord.stan]
b.lm <- linmod.fit$coefficients
se.lm <- summary(linmod.fit)$coefficients[,2]
c(mean(b.stan-b.lm), mean(se.stan/se.lm))
}
stan_version()
# [1] "2.16.0"
# with N/p = 12
compare_lm_stan(5)
# [1] 7.040717e-05 1.029381e+00
# with N/p = 3
compare_lm_stan(20)
# [1] -0.0001195749 1.0378405119
# with N/p = 3/2
compare_lm_stan(40)
#[1] -0.0001041135 1.0873995925
# with N/p = 6/5
compare_lm_stan(50)
#[1] -6.553468e-05 1.207667e+00