Reproduce results of bayesglm with stan_glm

I am cross-posting this question from stackexchange as suggested in a comment. The difference between the two results is not purely a question of theoretical interest. In psychophysical experiments, small linearly separable datasets are not uncommon, and Bayesian logistic regression is often recommended to deal with this problem (e.g. Kuss et al. 2005, JOV). Moreover, using bayesglm with uninformative default priors as a stand-in for glm has become quite popular. While I expect the results to be influenced by the choice of priors, the algorithm should not have a large effect. I tried to specify the Stan model (using stan_glm) to be equivalent to the bayesglm model with its default priors and obtained different results, which makes me ask whether I misspecified the model or whether the results are indeed influenced by the algorithm.

The bayesglm function uses the EM algorithm to provide point estimates of the unknown parameters, as described in Gelman et al. (2008). It uses t distributions with 1 degree of freedom as priors (also known as Cauchy priors). Continuous predictors are rescaled so that they have a standard deviation of 0.5. The scale of the t prior is set to 10 for the intercept and to 2.5 for the rescaled predictors. As a result, the scale of the default prior for a continuous predictor, expressed on the original scale, is:

\lambda = \text{prior.scale} / (2 \cdot \text{sd}(x))

with prior.scale = 2.5.
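
For reference, the Gelman et al. (2008) rescaling divides a centered continuous predictor by twice its standard deviation, so the rescaled predictor has standard deviation 0.5; that is what gives the formula above. A minimal sketch (the x values here are just for illustration):

# rescaling convention: a centered predictor divided by 2*sd(x) has sd 0.5
x <- c(0, 1, 2.5, 4, 5)
x_rescaled <- (x - mean(x)) / (2 * sd(x))
sd(x_rescaled)        # 0.5 by construction
2.5 / (2 * sd(x))     # prior scale implied on the original scale of x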

Here is a full example with data that are linearly separable and that yield a warning message when fitted with glm.

library(arm)      # bayesglm
library(rstanarm) # stan_glm

# simulated data set with linear separation
dat <- data.frame(
  x = rep(c(0,1,2.5,4,5),each=4),
  y = rep(c(0,1),each=10)
)

# compute proportions 
tmp <- aggregate(dat[,"y",drop=FALSE],dat[,"x",drop=FALSE],mean)
tmp
#     x   y
# 1 0.0 0.0
# 2 1.0 0.0
# 3 2.5 0.5
# 4 4.0 1.0
# 5 5.0 1.0

# fit glm 
fit.glm <- glm(y ~ x, family=binomial(link="logit"), data=dat)
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred

coef(fit.glm)
# (Intercept)           x 
#   -33.93562    13.57425 
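
The warning reflects the separation: with these coefficients the fitted probabilities are pushed to essentially 0 or 1 for every stimulus value except x = 2.5. This can be checked directly, for example:

# fitted probabilities from glm: essentially 0 or 1 except at x = 2.5
round(fitted(fit.glm), 3)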

# bayes glm
fit.bglm <- bayesglm(y ~ x, family=binomial(link="logit"), data=dat)
coef(fit.bglm)
# (Intercept)           x 
#   -4.756769    1.902708 

Additional information on the priors used by bayesglm:

# prior
tmp <- do.call("rbind",
  fit.bglm[paste0("prior.",c("mean","scale","df","sd"))])
colnames(tmp) <- c("(Intercept)","x")
tmp
#             Intercept         x
# prior.mean   0.000000 0.0000000
# prior.scale 10.000000 0.6607427
# prior.df     1.000000 1.0000000
# prior.sd     7.962467 1.5195179

It is easy to verify the prior scale for the continuous predictor:

lambda <- 2.5/(2*sd(dat$x))
lambda
# [1] 0.6607427
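
The same value is stored in the fitted bayesglm object (assuming prior.scale is ordered intercept first, then x, as in the table above):

# compare the computed lambda with the prior scale stored by bayesglm
all.equal(unname(fit.bglm$prior.scale[2]), lambda)   # should be TRUE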

The function stan_glm from the rstanarm package uses Markov chain Monte Carlo (MCMC) methods to obtain the full posterior distribution of the parameters.

fit.sglm <- stan_glm(y ~ x, data = dat,
   family = binomial(link = "logit"), 
   prior = student_t(df = 1, 0, lambda), 
   prior_intercept = student_t(1, 0, 10),
   cores = 2, seed = 12345)

prior_summary(fit.sglm)
# Intercept (after predictors centered)
#  ~ cauchy(location = 0, scale = 10)    
# Coefficients
#  ~ cauchy(location = 0, scale = 0.66)

print(fit.sglm,digits=5)
#             Median MAD_SD
# (Intercept) -9.407  7.034
# x            3.734  2.731

coef(fit.sglm)
# (Intercept)          x 
#  -9.407408    3.734188 

The coefficients correspond to the median of the posterior distribution. It is also possible to obtain the mean of the posterior distribution:

bayestestR::point_estimate(fit.sglm, centrality = "mean")$Mean
# [1] -21.928522   8.766372
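
For completeness, the full set of posterior draws can also be inspected directly (as.matrix on the stanreg object returns the draws), which makes it easy to compare the median, the mean, and the tail quantiles:

# inspect the posterior draws under the Cauchy priors
draws <- as.matrix(fit.sglm)
apply(draws, 2, quantile, probs = c(0.01, 0.1, 0.5, 0.9, 0.99))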

The results are very different from those obtained with bayesglm. I selected this example with a small linearly separable data set to emphasize these differences. I expect to obtain slightly different values due to the difference in the underlying algorithms, but I would not expect such a large difference if the priors are the same and the point estimates are computed correctly. I have also tried to rescale the intercept, but it did not help. I would like to know how to reproduce the results of bayesglm with stan_glm and, if it is not possible, why.


Hi,
this is a well-written and relevant question, thanks for all the effort. I can verify that this happens on my computer as well, also with the latest rstanarm. I vaguely recall @jonah or maybe @bgoodri mentioning there was some problem with the implementation of logistic regression in rstanarm that manifests in the case of (almost) linearly separable data, and that they are working to fix it, so this might be part of the issue.

To add to the confusion, implementing the same model in brms gives yet another set of results:

library(brms)
fit_brms <- brms::brm(y ~ 0 + Intercept + x, data = dat,  family = bernoulli(link = "logit"),
                      prior = set_prior(paste0("student_t(1, 0, ", lambda, ")"), class = "b") + 
                         prior(student_t(1, 0, 10), class = "b", coef = "Intercept"))

fit_brms
# Population-Level Effects: 
#           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept    -7.37      5.46   -21.09    -1.84 1.01      329      275
# x             2.98      2.16     0.88     8.31 1.01      332      273


samples <- rstan::extract(fit_brms$fit)
median(samples$b[,1]) # Intercept
# [1] -5.965337
median(samples$b[,2]) # x
# [1] 2.399894
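
(brms can also report posterior medians directly via fixef() with robust = TRUE, which should agree with the manual extraction above.)

# posterior medians (and MADs) for the population-level effects
fixef(fit_brms, robust = TRUE)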

However, there are some hints that this particular model might be difficult to sample well. Note the low ESS in the brms fit. Also, running confint on the bayesglm fit suggests that the fit itself may not have converged:

confint(fit.bglm)
# Waiting for profiling to be done...
# Error in profile.glm(object, which = parm, alpha = (1 - level)/4, trace = trace) : 
#  profiling has found a better solution, so original fit had not converged

So I am not completely sure what the cause is, but I hope that provides some more hints.


I think the Cauchy priors (student_t with df = 1) might just be too weak for this data when doing full MCMC. If I use normal(0, 1) and normal(0, 10) instead of cauchy(0, lambda) and cauchy(0, 10), I get more reasonable results that are close to the bayesglm results, and the mean and median are nearly the same:

fit.sglm <- stan_glm(y ~ x, 
                     data = dat,
                     family = binomial(link = "logit"), 
                     prior = normal(0, 1), 
                     prior_intercept = normal(0, 10))
coef(fit.sglm)

# (Intercept)           x 
#  -4.328186    1.731228 

summary(fit.sglm)[c("(Intercept)", "x"), "mean"]
# (Intercept)           x 
#  -4.500684    1.790628 

For future reference, you can get this from rstanarm itself using the summary() method, which has a mean column in its results. Or you can just do colMeans(as.matrix(fit)).
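
For example, with the fit above (column names assumed to match the coefficient names):

# posterior means computed directly from the draws
colMeans(as.matrix(fit.sglm)[, c("(Intercept)", "x")])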

Also, this is totally unrelated but a useful suggestion: for a model that samples this fast, using multiple cores (cores = 2 in your call) will be quite a bit slower than just using cores = 1, because the overhead of using multiple cores dominates the MCMC time.
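
If you want to see this on your own machine, a rough comparison could look like the following (timings will of course depend on hardware; cores = 4 is just an example):

# crude timing comparison: single core vs. parallel chains for a fast-sampling model
system.time(stan_glm(y ~ x, data = dat, family = binomial(link = "logit"),
                     prior = normal(0, 1), prior_intercept = normal(0, 10),
                     cores = 1, refresh = 0))
system.time(stan_glm(y ~ x, data = dat, family = binomial(link = "logit"),
                     prior = normal(0, 1), prior_intercept = normal(0, 10),
                     cores = 4, refresh = 0))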


Thank you for the responses. I am asking for several confirmations because bayesglm is often the go-to function in many analyses and papers.

  • First, if I understood correctly, should the results of bayesglm be considered the correct results for the default Cauchy prior?
  • Second, does the point estimate returned by bayesglm correspond to the mean of the posterior distribution, and not the median, which is reported by default by stan_glm?
  • Third, should we expect that future versions of Stan/brms will give the same results?

To provide a bit of context: although the total number of observations is small (N = 20, with 4 repetitions per stimulus value), datasets where a participant responds systematically 0 or 1 for all but one stimulus value are quite common in psychophysical experiments when the method of constant stimuli is used.
This also happens with a larger number of repetitions per stimulus value (e.g. 10 repetitions per stimulus value). While this indicates that the choice of stimulus values is not optimal for these participants, it is not always avoidable because of inter-subject variability and time constraints limiting the total number of trials. Obviously, one might use adaptive methods to avoid the problem or decide to fit all participants together using GLMMs, but there is also much in favor of using simple approaches.

Thanks for the good questions!

I think even the bayesglm authors would now recommend stan_glm as a better go-to function, but I know that there’s a lot of older code using bayesglm still.

I’m not really sure; I guess it depends on what you mean by “correct”. bayesglm is using an approximate EM algorithm that I haven’t personally studied in depth. But either way, even the bayesglm authors don’t really recommend Cauchy priors anymore. The latest recommendations from @andrewgelman for priors for regression coefficients are implemented in rstanarm.
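
If it helps, you can see the current defaults for your model by fitting without specifying any priors and calling prior_summary() (a quick sketch; in recent rstanarm versions the defaults are autoscaled normal priors):

# inspect rstanarm's default priors for this model
fit.default <- stan_glm(y ~ x, data = dat, family = binomial(link = "logit"),
                        refresh = 0)
prior_summary(fit.default)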

I’m not sure about bayesglm (my guess is that it’s probably an estimate of the posterior mean), but yeah, the estimates from stan_glm (via print or coef) are posterior medians. And instead of the posterior SD, stan_glm reports the posterior MAD, which is more robust for long-tailed distributions. This is just the default reporting, though; you can always still get means and SDs from stan_glm too.

The same results as bayesglm? Probably not, given that bayesglm isn’t doing MCMC (although the answers should be close for well-behaved models when using similar priors). Or do you mean the same results as other Stan-based software? In this case rstanarm and brms give me the same answer when using normal priors and different answers when using Cauchy priors. I’m guessing the difference when using Cauchy priors is because the Cauchy priors are too weak for this data, leading to suboptimal MCMC sampling (see e.g. the low ESS numbers @martinmodrak showed for the brms model).
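
One way to check this is to compare effective sample sizes between the two prior choices; the summary() method for stanreg objects includes an n_eff column. A minimal sketch, using hypothetical names fit.cauchy and fit.normal for the two fits discussed above:

# compare sampling efficiency under the two priors (object names are placeholders)
summary(fit.cauchy)[c("(Intercept)", "x"), "n_eff"]
summary(fit.normal)[c("(Intercept)", "x"), "n_eff"]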