I’ve been teaching myself Bayesian modeling and trying to learn new approaches to solving old problems. For a super basic toy example, let’s say I want to estimate the proportion of people who floss at a university, so I take a sample of 1000 people and 280 respond yes. My non-Bayesian modeling approach would be something like this:
> fdf <- data.frame(floss = c(rep(1, 280), rep(0, 1000 - 280)))
> mod1 <- glm(floss ~ 1, data = fdf, family = binomial)
> plogis(confint(mod1))
Waiting for profiling to be done...
    2.5 %    97.5 %
0.2527574 0.3083672
My confidence interval is about (0.25, 0.31). Pretty much what I get if I use the base R prop.test function.
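For reference, the prop.test comparison I have in mind is just something like this:
# base R interval for a proportion, for comparison
prop.test(280, 1000)$conf.int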
When implementing the same approach using stan_glm and the default priors, I get the following
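The call is something like this (a sketch; I’m using posterior_interval to pull out the interval and prior_summary to check which priors were actually used):
library(rstanarm)
bm0 <- stan_glm(floss ~ 1, data = fdf, family = binomial)
prior_summary(bm0)                            # shows the default priors that were used
plogis(posterior_interval(bm0, prob = 0.95))  # comes out far wider than the glm interval above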
I’m very surprised the posterior interval is so wide and uninformative. I know the default intercept prior is a weakly informative normal(0, 10), but I thought with 1000 observations it would produce an interval of a similar width to the one I got using glm. Am I doing something wrong or misunderstanding how stan_glm works? Thanks for any help!
Thanks for reporting this @Clay_Ford. Given that @lauren’s version works correctly with 2.18.2 but I’m having the same issue you are with 2.19.2, I think there may have been a bug introduced in version 2.19.2. I’m not yet sure how that happened, and it seems to affect only the logit link function. With version 2.19.2, if I change the family to binomial(link="probit") for both regular glm and stan_glm, then I get almost identical results when comparing them, so that seems reasonable (@Clay_Ford, can you confirm that it also works as expected for you if you change to the probit link?). But with the default logit link I’m seeing the same problem you are. I’ll try to track down this issue today because it’s concerning. (@bgoodri we may need to submit an rstanarm bugfix release to deal with this sooner than CRAN usually wants resubmissions, but they make exceptions for important fixes.)
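Here is roughly what I mean by the probit check (a sketch; pnorm just maps the probit-scale intervals back to the probability scale):
mod_probit <- glm(floss ~ 1, data = fdf, family = binomial(link = "probit"))
bm_probit  <- stan_glm(floss ~ 1, data = fdf, family = binomial(link = "probit"))
pnorm(confint(mod_probit))            # frequentist CI on the probability scale
pnorm(posterior_interval(bm_probit))  # with 2.19.2 these come out almost identical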
This problem also seems to happen only for intercept-only models. For example, if I add a variable x to the model, then even the default logit link works fine:
fdf <- data.frame(floss = c(rep(1, 280), rep(0, 1000 - 280)), x = rnorm(1000))
mod1 <- glm(floss ~ 1 + x, data = fdf, family = binomial)
bm1 <- stan_glm(floss ~ 1 + x, data = fdf, family = binomial)
# estimates are basically the same
coef(mod1)
(Intercept)           x 
-0.94440314  0.03278853 
coef(bm1)
(Intercept)           x 
-0.94687830  0.02933131 
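For completeness, the intervals can be compared the same way (quick sketch):
confint(mod1)            # profile-likelihood CIs from glm
posterior_interval(bm1)  # posterior intervals from stan_glm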
So there’s some strange interaction between the logit link and having only an intercept that causes the problem. Changing the link or adding a predictor makes the problem go away.
Ouch, I guess this is due to the change to using bernoulli_logit_glm, and we missed something for the intercept-only case! This means we should also check normal (with n < p), poisson, and neg_binomial_2.
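For instance, a quick intercept-only check for poisson could look something like this (a sketch with simulated data; the other families would be analogous):
set.seed(123)
pdat <- data.frame(y = rpois(1000, lambda = 3))
coef(glm(y ~ 1, data = pdat, family = poisson))
coef(stan_glm(y ~ 1, data = pdat, family = poisson))  # should agree closely if nothing is broken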
It’s weird. I’ve opened an issue where I explain more, but I’m confused (it may be a Stan issue, not an rstanarm issue). See here (and my comment below the original issue):
This turned out to be a bug in the Stan Math library that affects the glm functions like bernoulli_logit_glm_lpmf(y | X, alpha, beta) when X has zero columns and beta has length 0 (i.e., models that only have an intercept). The rstanarm implementation seems to be correct but is giving the wrong result for that reason. Those functions were introduced into the Stan programs in rstanarm in 2.19.2, which is why @lauren’s version wasn’t affected.
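To make the zero-column case concrete on the R side, this is roughly the design matrix an intercept-only model produces (sketch):
X <- model.matrix(floss ~ 1, data = fdf)[, -1, drop = FALSE]
dim(X)  # 1000 rows and 0 columns -- the case the _glm functions mishandled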