I’ve been teaching myself Bayesian modeling and trying to learn new approaches to solving old problems. For a super basic toy example, let’s say I want to estimate the proportion of people who floss at a university, so I take a sample of 1000 people and 280 respond yes. My non-Bayesian modeling approach would be something like this:
> fdf <- data.frame(floss = c(rep(1, 280), rep(0, 1000 - 280)))
> mod1 <- glm(floss ~ 1, data = fdf, family = binomial)
> plogis(confint(mod1))
Waiting for profiling to be done...
    2.5 %    97.5 %
0.2527574 0.3083672
My confidence interval is about (0.25, 0.31). Pretty much what I get if I use the base R prop.test function.
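For reference, the prop.test comparison I have in mind is just something like this:
# base R interval for a proportion, for comparison
prop.test(280, 1000)$conf.int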
When implementing the same approach using stan_glm and the default priors, I get the following
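The call is something like this (a sketch; I’m using posterior_interval to pull out the interval and prior_summary to check which priors were actually used):
library(rstanarm)
bm0 <- stan_glm(floss ~ 1, data = fdf, family = binomial)
prior_summary(bm0)                            # shows the default priors that were used
plogis(posterior_interval(bm0, prob = 0.95))  # comes out far wider than the glm interval above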
I’m very surprised the posterior interval is so wide and uninformative. I know the default intercept prior is a weakly informative normal(0, 10), but I thought with 1000 observations it would produce an interval of a similar width to the one I got using glm. Am I doing something wrong or misunderstanding how stan_glm works? Thanks for any help!
Thanks for reporting this @Clay_Ford. Given that @lauren’s version works correctly with 2.18.2 but I’m having the same issue you are with 2.19.2, I think there may have been a bug introduced in version 2.19.2. I’m not yet sure how that happened, and it seems to affect only the logit link function. With version 2.19.2, if I change the family to binomial(link="probit") for both regular glm and stan_glm, then I get almost identical results when comparing them, so that seems reasonable (@Clay_Ford, can you confirm that it also works as expected for you if you change to the probit link?). But with the default logit link I’m seeing the same problem you are. I’ll try to track down this issue today because it’s concerning. (@bgoodri we may need to submit an rstanarm bugfix release to deal with this sooner than CRAN usually wants resubmissions, but they make exceptions for important fixes.)
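Here is roughly what I mean by the probit check (a sketch; pnorm just maps the probit-scale intervals back to the probability scale):
mod_probit <- glm(floss ~ 1, data = fdf, family = binomial(link = "probit"))
bm_probit  <- stan_glm(floss ~ 1, data = fdf, family = binomial(link = "probit"))
pnorm(confint(mod_probit))            # frequentist CI on the probability scale
pnorm(posterior_interval(bm_probit))  # with 2.19.2 these come out almost identical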
This problem also seems to happen only for intercept-only models. For example, if I add a variable x to the model, then even the default logit link works fine:
fdf <- data.frame(floss = c(rep(1, 280), rep(0, 1000 - 280)), x = rnorm(1000))
mod1 <- glm(floss ~ 1 + x, data = fdf, family = binomial)
bm1 <- stan_glm(floss ~ 1 + x, data = fdf, family = binomial)
# estimates are basically the same
coef(mod1)
(Intercept)           x 
-0.94440314  0.03278853 
coef(bm1)
(Intercept)           x 
-0.94687830  0.02933131 
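For completeness, the intervals can be compared the same way (quick sketch):
confint(mod1)            # profile-likelihood CIs from glm
posterior_interval(bm1)  # posterior intervals from stan_glm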
So there’s some strange interaction between the logit link and having only an intercept that causes the problem. Changing the link or adding a predictor makes the problem go away.
Ouch, I guess this is due to the change to using bernoulli_logit_glm, and we missed something for the intercept-only case! This means we should also check normal (with n < p), poisson, and neg_binomial_2.
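For instance, a quick intercept-only check for poisson could look something like this (a sketch with simulated data; the other families would be analogous):
set.seed(123)
pdat <- data.frame(y = rpois(1000, lambda = 3))
coef(glm(y ~ 1, data = pdat, family = poisson))
coef(stan_glm(y ~ 1, data = pdat, family = poisson))  # should agree closely if nothing is broken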
It’s weird. I’ve opened an issue where I explain more, but I’m confused (it may be a Stan issue, not an rstanarm issue). See here (and my comment below the original issue):
This turned out to be a bug in the Stan Math library that affects the glm functions like bernoulli_logit_glm_lpmf(y | X, alpha, beta) when X has zero columns and beta has length 0 (i.e., models that only have an intercept). The rstanarm implementation seems to be correct but is giving the wrong result for that reason. Those functions were introduced into the Stan programs in rstanarm in 2.19.2, which is why @lauren’s version wasn’t affected.
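To make the zero-column case concrete on the R side, this is roughly the design matrix an intercept-only model produces (sketch):
X <- model.matrix(floss ~ 1, data = fdf)[, -1, drop = FALSE]
dim(X)  # 1000 rows and 0 columns -- the case the _glm functions mishandled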