Assessing priors for Bernoulli logistic model

I am exploring priors for Bernoulli logistic regression and am trying to find a sensible way of evaluating the prior predictive distribution. In terms of application, think of individual binary responses (responds/does not respond) to one or more interventions, with interest in the proportion that responds in each group. The model will eventually turn into a hierarchical logistic regression, but I am starting with the simplest example. An implementation is shown below.

functions {

}
data {
  int<lower=1> N;  // number of observations
  int<lower=1> K;  // number of arms
  int<lower=0, upper=1> Y[N];  // binary response variable
  int<lower=1, upper=K> X[N];  // arm index for each observation
 
  int prior_only;  // should the likelihood be ignored?
  real prior_par1;
  real prior_par2;
  real prior_par3;
  real prior_par4;
  
  int prior_to_use;
  
  int dbg;
}
parameters {
  vector[K] b;  // population-level effects
}
transformed parameters {
}
model {
  // initialize linear predictor term
  vector[N] mu;
  for (n in 1:N) {
    // add more terms to the linear predictor later, perhaps introduce 
   // groups into the data and add a varying intercept to the model
    mu[n] = b[X[n]];
  }
  
  if(prior_to_use == 1){
    if(dbg) print("Normal prior");
    target += normal_lpdf(b | prior_par1, prior_par2);  
  } else if(prior_to_use == 2){
    if(dbg) print("Student t prior");
    target += student_t_lpdf(b | prior_par1, prior_par2, prior_par3);  
  } else if(prior_to_use == 3){
    if(dbg) print("Logistic prior");
    target += logistic_lpdf(b | prior_par1, prior_par2);  
  } else if(prior_to_use == 4){
    if(dbg) print("Uniform prior");
    // note: a proper uniform prior needs b declared with matching
    // <lower=prior_par1, upper=prior_par2> bounds, otherwise draws
    // outside the interval have zero density and sampling can fail
    target += uniform_lpdf(b | prior_par1, prior_par2);
  } else if(prior_to_use == 5){
    if(dbg) print("Inverse gamma prior");
    // note: the inverse gamma has support on (0, inf) only, so this
    // option requires b to be declared with <lower=0>
    target += inv_gamma_lpdf(b | prior_par1, prior_par2);
  }

  // likelihood including all constants
  if (!prior_only) {
    target += bernoulli_logit_lpmf(Y | mu);
  }
}
generated quantities {
  int yrep[N];  // bernoulli_logit_rng returns integers
  {
    vector[N] eta;
    for(i in 1:N){
      // unnecessary for one group but makes extension 
      // to multiple groups easy.
      eta[i] = b[X[i]];
    }
    yrep = bernoulli_logit_rng(eta);
  }
  
}

Data can be simulated and then fit with this model. However, I noted that for assessing the prior this step might not be entirely necessary unless I was going to compare the generated data against the prior predictive (so maybe it is necessary):

library("tidyverse")
getdat <- function(
  n = 400,
  prob = c(0.15, 0.15)) {
  id <- 1:n
  narms <- length(prob)
  arm <- sample(1:narms, size = n, replace = TRUE)

  y <- rbinom(n, size = 1, prob = prob[arm])
  d <- tibble::tibble(
    id = id,             # subject
    arm = arm,           # group
    p = prob[arm],       # Pr(response)
    y = y                # response
  )
  d
}

The Stan model is parameterized in terms of the probability of response in each arm rather than the more typical approach of specifying an intercept and effects. In truth, both are of interest, but the former lets me focus on a single prior rather than having to consider separate priors on the intercept and the effects. Below, I create data with a single arm, so all we are doing is estimating the proportion that responds in that single group.
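For reference, this cell-means coding corresponds to a design matrix of arm indicators with no intercept, which is easy to inspect in R (a small illustrative check; d_demo is a hypothetical dataset built with getdat from above):

# cell-means coding: one indicator column per arm and no intercept,
# so b[k] is the log-odds of response in arm k
d_demo <- getdat(n = 8, prob = c(0.2, 0.3))
model.matrix(~ 0 + factor(arm), data = d_demo)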

I am interested in weakly informative priors, and let's say that I am confident that the response rate (denoted by \theta) is less than 0.25, but I still want to let the data speak if it really wants to. I know I do not need to generate the data at this stage, but it will be useful if I move to 2, 3, or 4 interventions.
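One quick way to quantify that belief under a candidate prior, before involving Stan at all, is to push simulated prior draws through the inverse logit (a minimal sketch; the Normal(qlogis(0.25), 1.2) settings anticipate the values used below):

# simulate from the candidate prior on the log-odds scale and
# transform to the probability scale
theta <- plogis(rnorm(1e5, mean = qlogis(0.25), sd = 1.2))
# prior probability that the response rate is below 0.25; centring
# the prior at qlogis(0.25) makes 0.25 the prior median, so this
# returns about 0.5
mean(theta < 0.25)
# implied 95% prior interval on the probability scale
quantile(theta, c(0.025, 0.975))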

d = getdat(n = 100, prob = c(0.2))
head(d)
# A tibble: 6 x 4
#     id   arm     p     y
# <int> <int> <dbl> <int>
# 1     1     1   0.2     0
# 2     2     1   0.2     0
# 3     3     1   0.2     0
# 4     4     1   0.2     0
# 5     5     1   0.2     0
# 6     6     1   0.2     0

ldat <- list(N = nrow(d),
             K = length(unique(d$arm)),
             Y = d$y,
             X = d$arm,
             dbg = 0)

# Normal distribution
ldat$prior_to_use <- 1
ldat$prior_only <- 1
ldat$prior_par1 <- qlogis(0.25)
plogis(ldat$prior_par1)

ldat$prior_par2 <- 1.2
ldat$prior_par3 <- 0
ldat$prior_par4 <- 0 
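Because the prior is normal on the log-odds scale and the logit transform is monotone, the implied 95% interval on the probability scale can also be read off analytically, which gives a sanity check for the histograms below:

# analytic 95% interval of the normal prior, mapped through the
# inverse logit (quantiles are preserved by monotone transforms)
plogis(qnorm(c(0.025, 0.975),
             mean = ldat$prior_par1, sd = ldat$prior_par2))
# roughly 0.03 to 0.78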

Compile the Stan model, then sample:

library("rstan")
options(mc.cores = 1)
mod <- rstan::stan_model(file = "logistic.stan", auto_write = TRUE)
fit <- rstan::sampling(mod, data = ldat,
                       chains = 1, iter = 2000, warmup = 1000, refresh = 0)

The following code visualizes the prior for the parameter of interest on both the log-odds and probability scales.

b <- rstan::extract(fit, pars = c("b"))
b <- b[[1]]
expit <- function(x){1/(1+exp(-x))}
par(mfrow = c(2, 1))
hist(b)
hist(expit(b))
par(mfrow = c(1, 1))
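The 95% interval quoted in the next paragraph can be read directly off the draws with quantile:

# empirical 95% intervals from the prior draws
quantile(b, c(0.025, 0.975))         # log-odds scale
quantile(expit(b), c(0.025, 0.975))  # probability scale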

While the prior heavily represents lower probabilities, it puts the parameter on the log-odds scale between roughly -3.7 and 1.2 with 95% probability, which translates to roughly 0.02 to 0.77 on the probability scale. Moreover, it places density across the full (0, 1) range, and so will not prevent high values of \theta if there is sufficient data.

What I am interested in is how to go about evaluating the prior predictive in this binary context.

The way I am doing it is to simulate data using the prior (giving me draws from the prior predictive distribution) and then extract and plot them.

yrep <- rstan::extract(fit, pars = c("yrep"))
m <- yrep[[1]]
dim(m)
# [1] 1000  100
# i.e. 1000 simulations x 100 observations (assuming my data will be
# somewhere around that size per group)
d2 <- as.data.frame(t(m)) %>%
  tidyr::gather("sim", "value") %>%
  dplyr::mutate(sim = gsub("V", "", sim, fixed = T),
                sim = as.numeric(sim))

prop.table(table(d2$value))

The table summary (the overall proportion of 0s and 1s pooled across all simulated datasets) gives me something around what I would expect.

d3 <- d2 %>%
  dplyr::group_by(sim) %>%
  dplyr::summarise(mu = mean(value)) %>%
  dplyr::ungroup()
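(As a check, d3$mu should match a direct rowMeans over the draws matrix; I keep the tidyverse route because it extends more easily to grouped summaries later on.)

# equivalent, without reshaping: one mean response per simulated dataset
all.equal(d3$mu, unname(rowMeans(m)))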

par(mfrow = c(3, 1))
hist(d$y)
hist(d2$value[d2$sim == 1])
hist(d3$mu)
par(mfrow = c(1, 1))

The above gives the following plots. The first summarizes the responses from the data originally simulated with getdat. The second summarizes the data generated from a single draw of the prior predictive, and the third summarizes the distribution of mean responses across all of the datasets simulated from the prior predictive (it looks suspiciously similar to the prior on the probability scale, so I suspect it might be the wrong thing to do).
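For what it is worth, the same comparison can be drawn with the bayesplot package, which has helpers designed for exactly this kind of check (a sketch, assuming the yrep matrix m and the data d from above):

library("bayesplot")
# distribution of the mean response across the prior predictive
# datasets, with the mean of the originally simulated data overlaid
ppc_stat(y = d$y, yrep = m, stat = "mean")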

I then repeat the process, tweaking my prior parameters, trying different prior distributions entirely, and reassessing.

Am I taking a reasonable approach or is there a better way to go about this?

What is your assessment of the current prior?

Thanks.


I would generally choose the conjugate distribution as the prior, unless I have a good reason to do otherwise (the inverse gamma is such an exception). The beta distribution is the conjugate prior of the Bernoulli distribution, see: https://en.wikipedia.org/wiki/Conjugate_prior
So that would be my choice. Since Stan does not implement a beta-Bernoulli distribution, I would choose the beta-binomial distribution with N = 1.
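For instance, the conjugate update is available in closed form, which gives an exact reference point to compare any logistic-scale prior against (a small sketch with hypothetical Beta(1, 3) prior parameters):

# conjugate beta-Bernoulli update: prior Beta(a0, b0), s successes in n trials
a0 <- 1; b0 <- 3                     # hypothetical prior with mean 0.25
y <- rbinom(100, size = 1, prob = 0.2)
s <- sum(y); n <- length(y)
# posterior is Beta(a0 + s, b0 + n - s); no sampling required
qbeta(c(0.025, 0.5, 0.975), a0 + s, b0 + n - s)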

Thanks Andre. I am familiar with the beta-binomial conjugate pair, although I am not sure I entirely understand your comment. As far as I am aware there are no conjugate priors for logistic regression, but I could well be wrong on that. My particular interest is in how to evaluate the 'reasonableness' of (any particular distributional form of) an informative prior chosen for a logistic regression. The prior predictive approach is one suggested by Gelman and more recently by Gabry (https://arxiv.org/abs/1709.01449), but it is typically presented for a non-binary response. My question is how to evaluate a prior in the context of a potentially hierarchical logistic regression with a Bernoulli response.

You can use simulation-based calibration (https://cran.r-project.org/web/packages/rstan/vignettes/SBC.html) to evaluate the 'reasonableness' of (any particular distributional form of) an informative prior.
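In rough outline, an SBC check for this model could look like the following (a hand-rolled sketch rather than the vignette's code; it assumes the compiled mod and the ldat list from above, and for a real run you would thin the posterior draws and use many more replications):

# simulation-based calibration sketch: draw b from the prior, simulate
# data given that draw, refit, and record the rank of the true value
# among the posterior draws; the ranks should be roughly uniform
ranks <- replicate(20, {
  b_true <- rnorm(1, ldat$prior_par1, ldat$prior_par2)
  ldat_i <- ldat
  ldat_i$Y <- rbinom(ldat$N, size = 1, prob = plogis(b_true))
  ldat_i$prior_only <- 0
  fit_i <- rstan::sampling(mod, data = ldat_i, chains = 1,
                           iter = 2000, warmup = 1000, refresh = 0)
  b_draws <- rstan::extract(fit_i, pars = "b")[[1]]
  sum(b_draws < b_true)  # rank statistic
})
hist(ranks)  # departures from uniformity flag prior/model inconsistency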

If we sample, that means treating y as a parameter,
y ~ normal(..., sigma);
we can always take inv_logit(y),