Using stan_glm to estimate a proportion

Hello.

I’ve been teaching myself Bayesian modeling and trying to learn new approaches to solving old problems. For a super basic toy example, let’s say I wanted to estimate the proportion of people who floss at a university and I take a sample size of 1000. 280 respond yes. My non-Bayesian modeling approach would be something like this:

> fdf <- data.frame(floss = c(rep(1, 280), rep(0, 1000 - 280)))
> mod1 <- glm(floss ~ 1, data = fdf, family = binomial)
> plogis(confint(mod1)) 
Waiting for profiling to be done...
    2.5 %    97.5 % 
0.2527574 0.3083672 

My confidence interval is about (0.25, 0.31). Pretty much what I get if I use the base R prop.test function.

When implementing the same approach using stan_glm and the default priors, I get the following

> bm1 <- stan_glm(floss ~ 1, data = fdf, family = binomial)
[sampling output omitted]
> plogis(posterior_interval(bm1))
                      5%       95%
(Intercept) 1.197878e-07 0.9999999

I’m very surprised the posterior interval is so wide and uninformative. I know the default intercept prior is a weakly information normal(0, 10), but I thought with 1000 observations it would produce an interval of a similar width to the one I got using glm. Am I doing something wrong or misunderstanding how stan_glm works? Thanks for any help!

  • Operating System: Win 10, 64-bit
  • rstanarm Version: 2.19.2
1 Like

Hi Clay,

Welcome to the community! Thanks for presenting such a nice repeatable example.

I ran it on my mac (rstanarm version 2.18.2) and got

0.2562871 0.303537

for the same chunk of code. I’m not sure what is exactly going on - but your understanding of the function is spot on. :)

3 Likes

Thanks for reporting this @Clay_Ford. Given that @lauren’s version works correctly with 2.18.2 but I’m having the same issue you are with 2.19.2, I think there may have been a bug introduced in version 2.19.2. But I’m not yet sure how that happened and it seems to only affect the logit link function. With version 2.19.2 if I change the family to binomial(link="probit") for both regular glm and stan_glm then I get almost identical results when comparing them so that seems reasonable (@Clay_Ford can you confirm that it also works as expected for you if you change to the probit link?). But with the default logit link I’m seeing the same problem you are. I’ll try to track down this issue today because it’s concerning. (@bgoodri we may need to submit an rstanarm bugfix release to deal with this sooner than CRAN usually wants resubmissions, but they make exceptions for important fixes.)

1 Like

Yes, I get a more reasonable interval when I set family = binomial(link="probit").

Thank you both for your quick responses!

Thanks for confirming. I’ll report back once I’ve tracked down the issue with the logit link.

This problem also only seems to happen for models that only have an intercept. For example, if I add a variable x to the model then even the default logit link works fine:

fdf <- data.frame(floss = c(rep(1, 280), rep(0, 1000 - 280)), x = rnorm(1000))
mod1 <- glm(floss ~ 1 + x, data = fdf, family = binomial)
bm1 <- stan_glm(floss ~ 1 + x, data = fdf, family = binomial)

# estimates are basically the same 
coef(mod1)
(Intercept)           x 
-0.94440314  0.03278853 

coef(bm1)
(Intercept)           x 
-0.94687830  0.02933131 

So there’s some strange interaction between using the logit link and only an intercept that causes the problem. Changing the link or adding a predictor removes the problem.

Ouch, I guess this is due to the change of using bernoulli_logit_glm and we missed something for only the intercept case! This means we should check also normal (with n<p), poisson and neg_binomial_2

2 Likes

It’s weird. I’ve opened an issue where I explain more, but I’m confused (it may be a Stan issue, not an rstanarm issue). See here (and my comment below original issue):

1 Like

To summarize what we’ve figured out now:

this turned out to be a bug in Stan math that affects glm functions like bernoulli_logit_glm_lpmf(y | X, alpha, beta) when X has zero columns and beta has length 0 (i.e., models that only have an intercept). The rstanarm implementation seems to be correct but is giving the wrong result for that reason. Those functions were introduced into the Stan programs in rstanarm in 2.19.2, which is why @lauren’s version wasn’t affected.

4 Likes