I have data where I am looking at predicting drug resistance (binary) by the presence/absence of different genes. Few genes are thought to have any real effect. In addition, the presence of some genes is very low, and the presence/absence can be highly correlated among genes. This can make for separation problems in the logistic regression. My goal is prediction. I want to predict the outcome (drug resistance) for a variety of drugs using the genes.
Let’s say I have a bunch of genes and a single drug, then I could simply fit a logistic regression model and specify a regularized horseshoe prior. However, I have many drugs, which are somewhat related, so I would actually like to specify a multilevel model, such that information is partially pooled across drugs for the effect of the gene on the outcome resistance.
I have simulated some data that looks pretty close to my real data, by specifying an uncorrelated varying intercepts and varying slopes model to simulate the data, where the varying effects follow a student t distribution (when I run a multilevel model on my real data, all of the correlations between varying effects are around 0.0 with credible intervals that broadly (and rather evenly) cover both sides of zero).
See this txt file for the code for the data simulation and model.
code simulation uncorrelated varying intercepts slopes resistance data.txt (3.3 KB)
The model that I have run looks like this:
#set up model, varying intercepts and slopes model where varying effects modeled via student distribution
bf_logistic <- bf(y ~ 1 + gene1 + gene2 + gene3 + gene4 + gene5 + gene6 + gene7 + gene8 + gene9 + gene10 +
gene11 + gene12 + gene13 + gene14 + gene15 + (1 + gene1 + gene2 + gene3 + gene4 + gene5 + gene6 + gene7 + gene8 + gene9 + gene10 +
gene11 + gene12 + gene13 + gene14 + gene15 | gr(drug_f, dist="student"))) + bernoulli()
#set up regularized hs prior for logistic model
p <- 5 #guess at number of relevant variables
D <- 16 #number of variables in the model (is this correct even for hierarchical model??? I assume it is number of population-level effects)
sigma <- 2 #as recommended for logistic regression in slide 14 https://github.com/avehtari/modelselection/blob/master/regularizedhorseshoe_slides.pdf
n <- nrow(d1) #number of observations in dataset
scale_global <- (p/(D-p))*(sigma/sqrt(n))
prior1 <- c(
prior(horseshoe(df = 1, scale_global = scale_global), class=b),
prior(cauchy(0, 1), class=sd),
prior(lkj(2), class=cor)
)
m1 <- brm(bf_logistic,
prior = prior1, data=d1, chains=4, iter=2000, warmup=1000,
control = list(adapt_delta = 0.999, max_treedepth=13), cores=4)
The model takes a little bit of time to run, so here are the results from the above sim and model run
sim model output.txt (13.2 KB)
Notice I have specified the regularized horseshoe prior for the ‘population-level’ effects, and a cauchy prior for the varying effects. What I would like to do is to model the sparsity in the population-level effects and allow the genes to effect the outcome differently for different drugs (but share information; partial pooling), but at the same time, I also want a similar sparse prior on the varying effects.
I am trying to wrap my mind around what it means to impose the hs prior on the population-level effects but then have all those varying slopes in the model… do the varying slopes sort of ‘water down’ the effect of the hs prior? In other words, I use this hs prior with the goal of pulling noise toward zero, but then I have a varying slope, so is it possible that the varying slope simply allows the noise to exist in the model even though the hs prior has done it’s job on the population-level effects?
Is my cauchy prior a good choice?
Based on this plot, it appears the hs prior has done it’s job for the population-level effects.
Also, using coef(m1)
in brms, I can view the slopes for each gene for each drug, and it appears that the combination of the hs prior and cauchy prior works as intended.
coef m1.txt (10.3 KB)
I would appreciate thoughts on this! It would be great if someone could confirm that using a hs prior in a multilevel model such as described above isn’t a poor idea. Thanks!