Logistic regression with two random intercept terms

ermeel · April 14, 2018, 6:16am

Dear community,

I am new to the rstanarm package and I have a hard time to recover the true parameters of the synthetic (toy) model below. Is the model actually identifiable? If so how can I obtain posteriors for icpt and id1_values and id2_values?

library(tidyverse)
library(lme4)
library(rstanarm)
set.seed(42)
# Create synthetic data -------------------------------------------------------
ef1 <- rnorm(1000)
ef2 <- rnorm(1000)
id1 <- sample(c(1L,2L,3L), 1000, replace=TRUE)
id2 <- sample(c(1L,2L,3L), 1000, replace=TRUE)
id1_values <- rnorm(3)
id2_values <- rnorm(3)
icpt <- .4711
p1 <- 2.3
p2 <- 1.7
z = icpt + p1*ef1 + p2*ef2 + id1_values[id1] + id2_values[id2]
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr)
df = tibble(y=y,ef1=ef1,ef2=ef2, id1=id1, id2=id2)
# Fit and simulate -------------------------------------------------------------
rslt<-glmer(formula=y~ef1+ef2+(1|id1)+(1|id2),data=df,family=binomial)
rslt_stan <- stan_glmer(formula=y~ef1+ef2+(1|id1)+(1|id2),data=df,family=binomial,
                        chains = 1, cores = 1, seed = 12345, iter = 2000)

I recall that glmer and stan_glmer estimate varying intercepts in general by estimating a single global intercept and individual deviations from that intercept for each possible value of id. But what in case of two random intercepts terms? Or should I actually go and use maybe a formula term (1|id1:id2)?

jonah · April 15, 2018, 3:17am

I don’t see the results so I’ll have to run this and take a look at the parameter recovery when I have a chance (not at my computer at the moment). But in general two varying intercepts is ok. There will still only be one global intercept estimated, but two different group-specific deviations from that global intercept.

Also, I definitely recommend running more than one Markov chain. No magic number but 4 is a somewhat safe default.

ermeel · April 15, 2018, 5:02am

Thanks, I would appreciate this!

jonah · April 15, 2018, 6:58pm

When I run

rslt_stan <-
  stan_glmer(
    formula = y ~ ef1 + ef2 + (1 | id1) + (1 | id2),
    data = df,
    family = binomial,
    iter = 2000,
    chains = 4,
    cores = 4
  )

on my laptop with rstanarm v2.17.4, I get results very comparable to the results from glmer:

round(cbind(lmer = fixef(rslt), stan = fixef(rslt_stan)), digits = 2)

            lmer stan
(Intercept) 0.06 0.08
ef1         2.65 2.65
ef2         1.74 1.74

round(cbind(
  lmer = unlist(ranef(rslt)), 
  stan = unlist(ranef(rslt_stan))
), digits = 2)

                   lmer  stan
id1.(Intercept)1  -0.39 -0.37
id1.(Intercept)2   0.45  0.45
id1.(Intercept)3  -0.06 -0.05
id2.(Intercept)1   0.13  0.13
id2.(Intercept)2   0.45  0.45
id2.(Intercept)3  -0.58 -0.57

Those are just point estimates. Of course with rstanarm you also get the whole posterior.

ermeel · April 15, 2018, 8:30pm

Thanks Jonah. And can you actually “recover” the true values stored in id1_values, id2_values, icpt?

jonah · April 16, 2018, 12:16am

library("bayesplot")
color_scheme_set("brightblue")

draws <- as.matrix(rslt_stan, pars = c("(Intercept)", "ef1", "ef2"))
id1_draws <- as.matrix(rslt_stan, regex_pars = " id1:")
id2_draws <- as.matrix(rslt_stan, regex_pars = " id2:")
log_sigmas <- log(as.matrix(rslt_stan, regex_pars = "Sigma"))


mcmc_recover_hist(draws, true = c(icpt, p1, p2))
mcmc_recover_hist(id1_draws, true = id1_values)
mcmc_recover_hist(id2_draws, true = id2_values)
mcmc_recover_hist(log_sigmas, true = log(c(1,1)))

ermeel · April 16, 2018, 7:06am

Hi Jonah,

at first the width of the distributions seem very big too me. In fact this is also what I observed and I was rather surprised that this is the case for such a “simple” model with 1000 data points.

Additionally, if I run the following on my Windows machine

rslt_stan <- stan_glmer(formula=y~ef1+ef2+(1|id1)+(1|id2),data=df,family=binomial,
                        chains = 4, cores = 2, seed = 12345, iter = 2000)

and then execute

round(cbind(
  lmer = unlist(ranef(rslt)), 
  stan = unlist(ranef(rslt_stan))
), digits = 2)

I get

                  lmer  stan
id1.(Intercept)1 -0.20 -0.18
id1.(Intercept)2  0.03  0.02
id1.(Intercept)3  0.17  0.14
id2.(Intercept)1 -0.86 -0.84
id2.(Intercept)2  1.54  1.56
id2.(Intercept)3 -0.68 -0.66

and finally running

round(cbind(lmer = fixef(rslt), stan = fixef(rslt_stan)), digits = 2)

I get

             lmer  stan
(Intercept) -0.31 -0.33
ef1          2.35  2.34
ef2          1.84  1.84

However inspecting the values for icpt, id1_values and id2_values I get

c(id1_values,id2_values)
[1] -0.6856617 -0.7927145 -0.4070042 -1.1486706  1.1157605 -0.8794568

as well as icpt=0.4711. And now I am not sure how to recover the random intercepts from the stan or lmer point estimates. I guess it must be a linear combination (if it is possible at all).

ermeel · April 16, 2018, 11:59am

I just realised that there is a continuum of possibilities to get values for the group specific intercepts, since you can always shift the global intercept and then adjust the relativ difference of the group specific intercepts. In other words, the only thing I think you can recover is the sums `icpt+id1_values+id2_values´. Simpliyfing the problem I considered only one random intercept (whith a slightly adjusted toy model of course):

library(tidyverse)
library(lme4)
library(rstanarm)
set.seed(42)
# Create synthetic data -------------------------------------------------------
ef1 <- rnorm(1000)
ef2 <- rnorm(1000)
id1 <- sample(c(1L,2L,3L), 1000, replace=TRUE)
id1_values <- rnorm(3)
icpt <- .4711
p1 <- 2.3
p2 <- 1.7
z = icpt + p1*ef1 + p2*ef2 + id1_values[id1] 
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr)
df = tibble(y=y,ef1=ef1,ef2=ef2, id1=id1)
# Fit and simulate -------------------------------------------------------------
rslt<-glmer(formula=y~ef1+ef2+(1|id1),data=df,family=binomial)
rslt_stan <- stan_glmer(formula=y~ef1+ef2+(1|id1),data=df,family=binomial,
                        chains = 4, cores = 4, seed = 12345, iter = 2000, adapt_delta=.99)
mcmc_pairs(as.array(rslt_stan), pars=c("(Intercept)",sprintf("b[(Intercept) id1:%d]", 1:3)))

Is this a continuous non-identifiability in the sense of http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html ?

jonah · April 17, 2018, 4:19am

Yeah In this particular example there’s not really anything to distinguish the two varying intercept terms. But there are plenty of cases when it’s possible, e.g. if one or both are also correlated with varying slopes.

ermeel · April 17, 2018, 10:10am

For future reference I cite the relevant section of the Stan manual (w.r.t. v2.17.0):

Problematic Posteriors

There is in particular a section on redundant intercepts.

Topic		Replies	Views
Rstanarm - for my own learning, trying to match credible interval to lm() confidence interval rstanarm specification	5	1254	November 3, 2020
Recovering parameters from stanreg object rstanarm	1	32	June 13, 2025
R-squared from rstan (and not rstanarm) Modeling	5	1864	March 18, 2025
Intercept only model rstanarm rstan	7	1669	August 26, 2020
Having trouble recovering some model parameters when fitting the simulated data Modeling techniques , fitting-issues , irt	4	1141	June 11, 2022

Logistic regression with two random intercept terms

Related topics