Hello,
First time user of both brms and stan here.
I am trying to fit hierarchical models of protein binding affinities in brms. The models have forms similar to
affinity ~ 1 + (1|protein) + (1|protein:organism)
so that they can be used for out-of-sample prediction. The dataset contains 30k+ combinations of protein:organism
, each with only few measurements (approx 1 to 10). Using default settings, a gaussian family and reasonable priors the models fit smoothly. No warnings are shown and Rhat/ESS look healthy.
I then noticed that different groups have different residuals, so I decided to fit sigma
as well (Estimating Distributional Models with brms (r-project.org)) to capture the uncertainty more precisely.
The updated model looks like this:
model <- brm(
bf(
affinity ~ 1 + (1|protein) + (1|protein:organism),
sigma ~ 1 + (1|protein) + (1|protein:organism), # note: in brms this is actually the log of sigma
decomp = "QR"
),
df, prior = c(...), ...
)
Suddenly fitting the model becomes much harder. I now get ~10% divergent transitions, ~90% max threedepth hits and moderately high Rhat (~1.5). Setting adapt_delta = 0.99
and increasing the number of steps reduces divergent transitions to 2 and Rhat to 1.15. I don’t doubt that pushing the parameters further I could eventually get a decent fit, but the runtime of the fit is already at 2 days with the current parameters, so this option is not ideal.
Digging through the forum and the stan docs it really looks like there is an issue in my model (or its parametrization). I tried to investigate divergent transitions to see if this is a funnel problem (Diagnosing Biased Inference with Divergences (mc-stan.org)) but I couldn’t spot anything interesting. So I checked whether the gaussian family is actually a good choice using QQ plots:
means <- fitted(model)
sigmas <- fitted(model, dpar = "sigma")
qqnorm(means[,1], main="QQ plot means")
qqline(means[,1])
qqnorm(log(sigmas[,1]), main="QQ plot log-sigmas")
qqline(log(sigmas[,1]))
hist(log(sigmas[,1]), main="log-sigmas distribution")
Now this looks like a problem: my data is normally distributed as I assumed, but the log-sigma are not. It rather looks like a student distribution.
And here finally come my questions:
- Can a poorly chosen family have such a strong impact on the fitting difficulty/time?
- If yes: I could fix this by using a student family for the sigmas (while keeping a gaussian for the means) but I could not find any information on how use multiple families in brms. Is this possible?
- If not: what other paths could I take to improve fitting performance? Should I consider a different parametrization of the problem? (how?)
Thank you!
Mattia