Hi,
I’ve noticed some strange behavior from logistic regression in Stan. When the priors for \beta are improper uniform priors, I’d expect the fit to return the maximum likelihood estimates as posterior means. This is not the case, however, when n is comparable to p: the posterior means appear to be a constant multiple of the ML estimates. For example, in the program included below, with n=80 and p=15, the MLE coefficients are approximately 0.6 times the means from Stan sampling, with a very nearly linear relationship between the two. This behavior persists for various choices of n and p, with the results converging when n >> p.
What is happening? Is there a mistake in my thinking that the results should be equal? Is the Stan model wrong? Or is the problem somewhere else?
Thanks in advance for any replies. I’ve included the Stan model and the Python code used for the simulation below.
data {
  int<lower=1> n;
  int<lower=1> p;
  matrix[n, p] X;
  int<lower=0,upper=1> y[n];
}
parameters {
  vector[p] beta;
  real beta_0;
}
model {
  y ~ bernoulli_logit(X * beta + beta_0);
}
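As a sanity check on the model block: `bernoulli_logit(eta)` is just the Bernoulli likelihood with a logit link, so the log likelihood Stan works with is sum_i (y_i * eta_i - log(1 + exp(eta_i))). A small self-contained numpy check (toy data here, unrelated to the simulation below):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 2))
beta_toy = np.array([0.5, -1.0])
beta_0_toy = 0.2
eta = X_toy @ beta_toy + beta_0_toy
y_toy = np.array([1, 0, 1, 1, 0])

# bernoulli_logit lpmf, summed over observations: y*eta - log(1 + exp(eta))
ll_direct = np.sum(y_toy * eta - np.logaddexp(0.0, eta))

# The same thing written via probabilities, as in the verification code below
ll_probs = np.sum(y_toy * np.log(expit(eta)) + (1 - y_toy) * np.log1p(-expit(eta)))
```

The two expressions agree to numerical precision, so the likelihood the model encodes matches the one I evaluate by hand at the end of the script.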
import numpy as np
from scipy.stats import multivariate_normal as mvn
from scipy.stats import random_correlation
from scipy.special import expit
import pystan
import statsmodels.api as sm
lr_model = pystan.StanModel(file = "log_reg.stan")
# Setup of the problem
n = 80
p = 15
np.random.seed(140)
# Generating a random correlation matrix
top_bottom_ratio = p*p # larger means more correlation
unnormalized_range = np.exp(np.linspace(0.0, np.log(top_bottom_ratio), p))
eigenvals = unnormalized_range / sum(unnormalized_range) * p
X_cor = random_correlation.rvs(eigenvals)
mvn_gen = mvn(cov = X_cor, seed = 42)
X = mvn_gen.rvs(size = n)
beta = np.random.normal(size = p)
beta_0 = np.random.normal(size = 1)
y = np.random.binomial(1, expit(np.matmul(X, beta) + beta_0))
# Stan sampling
data_in = {'X':X, 'y':y, 'n':n, 'p':p}
fit = lr_model.sampling(data = data_in, iter = 2000, chains = 4)
# The same model evaluated using MLE
glm = sm.GLM(y, sm.add_constant(X, prepend = False), family = sm.families.Binomial())
res = glm.fit()
# Posterior means from the Stan samples (columns: chain, draw, warmup, then beta[1..p] and beta_0)
means_stan = fit.to_dataframe().iloc[:, 3:(3+p+1)].mean()
means_glm = res.params
# Least-squares slope (through the origin) of means_glm on means_stan ... should be 1.0 but isn't
coef = np.sum(means_stan * means_glm) / np.sum(means_stan**2)
means_stan_corr = coef * means_stan
# Verification that the Stan posterior means do not maximize the likelihood
p_stan = expit(np.matmul(X, means_stan[:-1]) + means_stan[-1])
p_glm = expit(np.matmul(X, means_glm[:-1]) + means_glm[-1])
p_stan_corr = expit(np.matmul(X, means_stan_corr[:-1]) + means_stan_corr[-1])
print("log-lik (Stan means):    ", np.sum(y*np.log(p_stan) + (1-y)*np.log(1-p_stan)))
print("log-lik (GLM MLE):       ", np.sum(y*np.log(p_glm) + (1-y)*np.log(1-p_glm)))
print("log-lik (rescaled Stan): ", np.sum(y*np.log(p_stan_corr) + (1-y)*np.log(1-p_stan_corr)))
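Incidentally, the same kind of gap can be probed without Stan at all. Below is a self-contained one-parameter sketch (hypothetical toy data; a uniform prior truncated to [-10, 10] stands in for the improper flat prior) that computes the exact posterior mean by brute-force quadrature on a grid and compares it with the MLE from direct optimization:

```python
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n_small = 20
x = rng.normal(size=n_small)
y_small = rng.binomial(1, expit(1.5 * x))

def loglik(b):
    # Bernoulli-logit log likelihood for a single slope, no intercept
    eta = b * x
    return np.sum(y_small * eta - np.logaddexp(0.0, eta))

# MLE by direct 1-D optimization
b_mle = minimize_scalar(lambda b: -loglik(b), bounds=(-10, 10), method="bounded").x

# Posterior mean under a flat prior on [-10, 10], by quadrature on a grid
grid = np.linspace(-10, 10, 4001)
logpost = np.array([loglik(b) for b in grid])
w = np.exp(logpost - logpost.max())  # unnormalized posterior density
b_mean = np.sum(grid * w) / np.sum(w)

print(b_mle, b_mean)  # the two typically differ at small n
```

This is the kind of comparison I have in mind in the full p-dimensional example above, just reduced to one coefficient so the posterior mean can be computed exactly.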