I am exploring priors for Bernoulli logistic regression and am trying to find a sensible way of evaluating the prior predictive. In terms of application, think of the individual binary responses (responds/does not respond) to an intervention for one or more interventions with interest in the proportion that respond in each group. The model will eventually turn into a hierarchical logistic regression, but I am starting with the simplest example. An implementation is shown below.
functions {
}
data {
int<lower=1> N; // number of observations
int<lower=1> K; // number of arms
int Y[N]; // response variable
int X[N]; // population-level design matrix
int prior_only; // should the likelihood be ignored?
real prior_par1;
real prior_par2;
real prior_par3;
real prior_par4;
int prior_to_use;
int dbg;
}
transformed data {
int doprint = 1;
}
parameters {
vector[K] b; // population-level effects
}
transformed parameters {
}
model {
// initialize linear predictor term
vector[N] mu;
for (n in 1:N) {
// add more terms to the linear predictor later, perhaps introduce
// groups into the data and add a varying intercept to the model
mu[n] = b[X[n]];
}
if(prior_to_use == 1){
if(dbg) print("Normal prior");
target += normal_lpdf(b | prior_par1, prior_par2);
} else if(prior_to_use == 2){
if(dbg) print("Student t prior");
target += student_t_lpdf(b | prior_par1, prior_par2, prior_par3);
} else if(prior_to_use == 3){
if(dbg) print("Logistic prior");
target += logistic_lpdf(b | prior_par1, prior_par2);
} else if(prior_to_use == 4){
if(dbg) print("Uniform prior");
target += uniform_lpdf(b | prior_par1, prior_par2);
} else if(prior_to_use == 5){
if(dbg) print("Inverse gamma prior");
target += inv_gamma_lpdf(b | prior_par1, prior_par2);
}
// likelihood including all constants
if (!prior_only) {
target += bernoulli_logit_lpmf(Y | mu);
}
}
generated quantities {
real yrep[N];
{
vector[N] eta;
for(i in 1:N){
// unnecessary for one group but makes extension
// to multiple groups easy.
eta[i] = b[X[i]];
}
yrep = bernoulli_logit_rng(eta);
}
}
For which data can be simulated and then fit. However, I noted that for assessing the prior this might not be entirely necessary unless I was going to make a comparison between the data generated and the prior predictive (so maybe it is necessary):
library("tidyverse")
getdat <- function(
n = 400,
prob = c(0.15,0.15))
{
id <- 1:n
narms <- length(prob)
arm <- sample(1:narms, size = n, replace = T)
y <- rbinom(n, size = 1, prob = prob[arm])
d <- tibble::tibble(
id = id, # subject
arm = arm, # group
p = prob[arm], # Pr(response)
y = y # response
)
d
}
The stan model is parameterized in terms of estimating the probability of response in each arm rather than the more typical approach that specifies an intercept and effects. In truth, both are of interest, but the former lets me focus on a single prior rather than having to consider separate priors on the intercept and effects. Below, I just create data with a single arm, so all we are doing is estimating the proportion that respond in that single group.
I am interested in weakly informative priors and lets say that I am confident that the response rate (denoted by \theta) is less than 0.25 but I still want to let the data speak if it really wants to. I know I do not need to generate the data at this stage but it would be useful if I move to 2, 3 4 interventions.
d = getdat(n = 100, prob = c(0.2))
head(d)
# A tibble: 6 x 4
# id arm p y
# <int> <int> <dbl> <int>
# 1 1 1 0.2 0
# 2 2 1 0.2 0
# 3 3 1 0.2 0
# 4 4 1 0.2 0
# 5 5 1 0.2 0
# 6 6 1 0.2 0
ldat <- list(N = nrow(d),
K = length(unique(d$arm)),
Y = d$y,
X = d$arm,
dbg = 0)
# Normal distribution
ldat$prior_to_use <- 1
ldat$prior_only <- 1
ldat$prior_par1 <- qlogis(0.25)
plogis(ldat$prior_par1)
ldat$prior_par2 <- 1.2
ldat$prior_par3 <- 0
ldat$prior_par4 <- 0
Compile the stan model then sample:
library("rstan")
options(mc.cores = 1)
mod <- rstan::stan_model(file = "logistic.stan", auto_write = TRUE)
fit <- rstan::sampling(tg_env$model_code, data = ldat,
chains = 1, iter = 2000, warmup = 1000, refresh = 0)
the following will visualize the prior on the parameter of interest on both the log-odds and probability scale.
b <- rstan::extract(fit, pars = c("b"))
b <- b[[1]]
expit <- function(x){1/(1+exp(-x))}
par(mfrow = c(2, 1))
hist(b)
hist(expit(b))
par(mfrow = c(1, 1))
While the prior heavily represents lower probabilities, it puts the parameter estimate on the log-odds scale between -3.7 and -1.2 with 95% probability, which translates to roughly 0.02 to 0.75 on the probability scale. Moreover, it does cover the full support on the probability scale and so will not prevent high values of \theta if there is sufficient data.
What I am interested in is how to go about evaluating the prior predictive in this binary context.
They way I am doing it is to simulate data using the prior (giving me the prior predictive) and then extract and plot.
yrep <- rstan::extract(fit, pars = c("yrep"))
m <- yrep[[1]]
dim(m)
# [1] 1000 simulations x 100 observations (assuming that my data is going to be somewhere around that size per group)
d2 <- as.data.frame(t(m)) %>%
tidyr::gather("sim", "value") %>%
dplyr::mutate(sim = gsub("V", "", sim, fixed = T),
sim = as.numeric(sim))
prop.table(table(d2$value))
The table summary gives me something around what I would expect.
d3 <- d2 %>%
dplyr::group_by(sim) %>%
dplyr::summarise(mu = mean(value)) %>%
dplyr::ungroup()
par(mfrow = c(3, 1))
hist(d$y)
hist(d2$value[d2$sim == 1])
hist(d3$mu)
par(mfrow = c(1, 1))
The above gives the following plots. The first summarizes the responses from the originally simulated data using getdat
. The second summarizes the data generated from the prior and the third plot summarizes the distribution of mean responses across all of the datasets simulated from the prior predictive (and looks suspiciously similar to the prior on the probability scale so I suspect might be the wrong thing to do).
I then repeat the process by tweaking my prior parameters and for different prior distributions entirely and then reassess.
Am I taking a reasonable approach or is there a better way to go about this?
What is your assessment of the current prior?
Thanks.