Bias in main effects for logistic model with random intercept and small cluster sizes

I have the following random intercept logistic regression model, which is derived from brms.

functions {
}
data {
  int<lower=1> N;  // number of observations
  int Y[N];  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  int prior_only;  // should the likelihood be ignored?
  real b0nu;
  real b0mu;
  real b0sig;
  real b1nu;
  real b1mu;
  real b1sig;
  real signu;
  real sigmu;
  real sigsig;
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  // temporary intercept for centered predictors
  real Intercept;
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  // standardized group-level effects
  vector[N_1] z_1[M_1];
}
transformed parameters {
  // actual group-level effects
  vector[N_1] r_1_1 = (sd_1[1] * (z_1[1]));
}
model {
  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
  }
  // priors including all constants
  target += student_t_lpdf(Intercept | b0nu, b0mu, b0sig);  // 7 0 2
  target += student_t_lpdf(b | b1nu, b1mu, b1sig);       // 7 0 2
  target += student_t_lpdf(sd_1 | signu, sigmu, sigsig)  // 5, 0, 3
    - 1 * student_t_lccdf(0 | signu, sigmu, sigsig);     // 5, 0, 3
  target += normal_lpdf(z_1[1] | 0, 1);
  // likelihood including all constants
  if (!prior_only) {
    target += bernoulli_logit_lpmf(Y | mu);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

I generate three-armed clustered data, with a sample size of 400 clusters with the clusters size being truncated poisson distributed with mean of 3 (i.e. small). The varying intercepts are generated using a normal distribution with user specified standard deviation. I assume a \pi^2/3 residual for the logistic model and so can calibrate the data to a given ICC with \sigma_c^2/(\sigma_c^2 + \pi^2/3) where the \sigma_c^2 is the random intercept variance.

library(tidyverse)
getdat <- function(
  nclust = 400,
  sd_for_icc = 0.3454296,
  prob = c(0.15,0.15,0.15),
  clust_mu = 3){
  
  clust_size_1 <- rpois(
    nclust,
    lambda = clust_mu)  
  while(any(clust_size_1==0)){
    clust_size_1[clust_size_1==0] <- rpois(
      sum(clust_size_1==0),
      lambda = clust_mu)
  }  
  id <- cumsum(rep(1, sum(clust_size_1)))
  n <- length(id)  
  clustid <- rep(seq_along(clust_size_1), clust_size_1)  
  clust_u0 <- rnorm(nclust, 0, sd_for_icc)
  clust_u0 <- clust_u0[clustid]  
  narms <- length(prob)
  arm <- sample(1:narms, size = length(unique(clustid)), replace = T)
  arm <- arm[clustid]
  p <- plogis(qlogis(prob[arm]) + clust_u0)  
  y <- rbinom(n, size = 1, prob = p)
  d <- tibble::tibble(
    id = id,
    clustid = clustid,
    arm = arm,
    clust_u0 = clust_u0,
    p = p,
    y = y
  )
  d
                               
}

After simulating and fitting this data a large number of times I summarize the parameter estimates. Forgive the idiosyncratic code.

library(parallel)
library(rstan)
options(mc.cores = 1)

tg_env <- new.env()
tg_env$model_code <- rstan::stan_model(file = "logistic.stan", auto_write = TRUE)

if(.Platform$OS.type == "unix") {
  ncores <- parallel::detectCores()-1
} else {
  ncores <- 1
}

expit <- function(x){1/(1+exp(-x))}

# 3 should be changed >> 3 but it will take a (good) while to run
ld <- mclapply(X=1:3, mc.cores=ncores, FUN=function(x) {
  d = getdat()  
  ldat <- brms::make_standata(y ~ factor(arm)+(1|clustid),
                              data = d,
                              family = brms::bernoulli())
  
  ldat$b0nu <- 5     # default brms is 3, 0, 10 . I was using 5 0 3
  ldat$b0mu <- 0
  ldat$b0sig <- 3  
  ldat$b1nu <- 7       # default brms is to give no prior for b
  ldat$b1mu <- 0
  ldat$b1sig <- 2.5  
  ldat$signu <- 7    # default brms is 3, 0, 10, I was using 7 0 1
  ldat$sigmu <- 0
  ldat$sigsig <- 1  
  fit <- rstan::sampling(tg_env$model_code, data = ldat,
                          chains = 1, iter = 3000, refresh = 3000)  
  post <- do.call(cbind, rstan::extract(fit, pars = c("b_Intercept","b")))
  ncols <- ncol(post)
  post_mu <- cbind(post[, 1], sweep(post[,2:ncols], 1, post[,1], "+"))
  post_mu <- apply(post_mu, 2, expit)
  colMeans(post_mu)
  
})
dpost <- do.call(rbind, lapply(ld, function(x)x)) 
mu1 <- colMeans(dpost)

And then one hopes that mu1 isn’t far from 0.15, which is the value that the data was simulated with, i.e. each arm had the same probability of response.

For convenience I have attached the R and stan code.

I find that with small cluster sizes, the parameter estimates are negatively biased, smaller than the simulation parameters. Typically I am seeing estimates of slightly above 0.14 after 25k iterations of the sim when the simulation parameter was 0.15 in each arm. In the non-null case, I’d expect shrinkage towards the mean but I wasn’t expecting deviation to the extent that I am seeing here. My sense is that I am probably not going to be able to do anything about this (other than get bigger clusters) but I am wondering if anyone else has guidance? Thanks.

sim.R (2.2 KB)
logistic.stan (2.1 KB)

When I look at the same simulation with larger cluster size (poisson distributed with mean of 20), i get:

Number of simulations 5000
Mean of means reference value -1.7346
Mean of means for log odds -1.7359 -1.737 -1.7357
Mean of means for p 0.1502 0.1501 0.1502
Mean of means for sig 0.3334

which is completely fine for my requirements. Compared to using a mean of 3 for the cluster size:

Number of simulations 5000
Mean of means reference value -1.7346
Mean of means for lo -1.7676 -1.7706 -1.7671
Mean of means for p 0.1479 0.1475 0.148
Mean of means for sig 0.3645

And this is from an extended run looking at mean cluster size from 3 to 20.

Rplot01

I now think that I understand why I am seeing the ‘bias’ (in a classical sense). It arises due to the fact that the priors on the log odds scale are informative on the probability scale with weight towards zero and one. As the information in the data reduces the prior pulls the estimate towards zero and hence the behavior that I observe. An example of the prior on the intercept transformed to the probability scale shown below.

Rplot

2 Likes

Yes, that seems like the correct conclusion. Generally to evaluate if the model works, it is good to not rely on fixed values but instead draw the true parameters from the priors in your model, as in Simulation-based calibration.

Best of luck!

Many thanks Martin. I am proceeding down the path of examining the prior predictive distribution. For the case of a logistic regression with varying intercept and binary response, I haven’t seen an example anywhere, so I will post the model when I get back to it. I can see that it could be a little bit challenging constructing and visualising the results and that it would probably be necessary to do this on the basic of a summary statistic.

1 Like