Results of NLMM in brms and rstanarm not same (brms longer running time)


#1

Please also provide the following information in addition to your question:

  • Operating System: Win10
  • brms Version: 2.3.1

Hi all,

I’m fitting a multilevel Bayesian non-linear mixed-effects model using both brms and rstanarm(for comparison). However, I found brms ran 30 times slower (over 12,000s each chain) than rstanarm (400s each chain) while failed getting to convergence, the rstanarm model converged perfectly due to both trace plot and R-hat.

The nonlinear function is the logistic function with 4 parameters: A, B, xmid, scal. Within the data set, there are 3 categorical predictors, we call them par1, par2, par3. The package rstanarm is automatically re-scaling the priors, so the final prior is approximately the same between the two models.

The Bayesian model I would like to fit is:
1st level: y~N(nonlin(A, B, xmid, scal), sigma_y^2)
2nd level: A~N(A_0, sigma_{A, par1}^2 + sigma_{A, par2}^2 + sigma_{A, par3}^2), B~…, xmid~…, scal~…
3rd level: A_0~N(mu_A, sigma_A^2), sigma_{A, par1}~half-cauchy(0, scale_A),…

The brms model is:

model_brms <- brm(bf(y ~ A + (B-A)/(1+exp((xmid-x)/scal)),
cbind(A,B,xmid,scal) ~ 1 + (1|par1) +(1|par2) + (1|par3), nl =TRUE),
data=data,
prior = c(prior(normal(30,3), nlpar = A, coef = Intercept),
prior(normal(20,3), nlpar = B,coef = Intercept),
prior(normal(60,3), nlpar = xmid,coef = Intercept),
prior(normal(10,3), nlpar = scal,coef = Intercept),
prior(cauchy(0,3.5), class=sigma),
prior(cauchy(0,3.5), class = sd, nlpar = A),
prior(cauchy(0,3.5), class = sd, nlpar = B),
prior(cauchy(0,3.5), class = sd, nlpar = xmid),
prior(cauchy(0,3.5), class = sd, nlpar = scal)),
inits = c(A=30,B=20,xmid=60,scal=10),
cores = 3,
algorithm = ‘sampling’,
iter = 10000,
chains = 3,
warmup = 5000,
control = list(adapt_delta = 0.9, max_treedepth = 20)
)

The rstanarm model is:

model_arm <- stan_nlmer(y ~ SSfpl(x,A,B,xmid,scal) ~
A + B + xmid + scal +
(A|par1) + (B|par1) + (xmid|par1) + (scal|par1) +
(A|par2) + (B|par2) + (xmid|par2) + (scal|par2) +
(A|par3) + (B|par3) + (xmid|par3) + (scal|par3),
data=data,
prior = normal(location = c(30, 20, 60, 10), scale = c(1, 1, 1, 1)),
prior_aux = cauchy(0,1),
algorithm = ‘sampling’,
adapt_delta = 0.9,
iter = 10000,
chains = 3,
warmup = 5000
)

I did not fit the model with correlated random effects, because my data set only has less than 400 data points. Please correct me if there’s anything going wrong in my process of specifying the models! I know it is hard for nonlinear models to converge when there are too many predictors, but I cannot figure out why the stan_nlmer function in rstanarm can converge with the same amount of parameters and almost the same priors.

By the way, it seems like there’s no way to specify random slope in rstanarm, is that correct?

Thanks,
Zoey


#2

Do you have a reproducible example for me to try out? I assume there is something wrong with the brms model.


#3

Hi Paul,

Thank you so much for your reply!

You may generate a data set like this:

A.rand <- rnorm(180, 30, 0.05)
B.rand <- rnorm(180, 20, 0.5)
xmid.rand <- rnorm(180, 60, 1)
scal.rand <- rnorm(180, 10, 1.5)
par1 <- rep(c(rep(1,4), rep(2,5), rep(3,5), rep(4,4)), 10)
par2 <- rep(c(1, 2, 4, 5, 1:5, 1:5, 1, 2, 3, 4), 10)
par3 <- rep(c(3,2,2,3,2,5,3,3,4,5,4,5,5,1,4,3,4,4), 10)
x <- rep(c(0, 5, 25, 35, 40, 50, 65, 70, 80, 90), each = 18)
y <- A.rand + (B.rand-A.rand)/(1+exp((xmid.rand-x)/scal.rand))
data <- cbind(x, y, par1, par2, par3)
data<- as.data.frame(data)

With this example, the difference between running time is not as huge as my data (70s/chain vs 450s/chain), but still, brms model does not converge within 10k iterations.

Thanks,
Zoey


#4

Sorry for the multiple times of reediting. Just tried to make the question clearer and reformatted it a little. The data generating part also had some minor changes due to my rerunning.


#5

When I run

model_brms <- brm(
  bf(y ~ A + (B - A) / (1 + exp((xmid - x) / scal)),
     cbind(A, B, xmid, scal) ~ 1 + (1|par1) + (1|par2) + (1|par3), 
     nl = TRUE),
  data=data,
  prior = c(prior(normal(30, 3), nlpar = A, coef = Intercept),
            prior(normal(20, 3), nlpar = B,coef = Intercept),
            prior(normal(60, 3), nlpar = xmid,coef = Intercept),
            prior(normal(10, 3), nlpar = scal, lb = 0),
            prior(cauchy(0, 3.5), class = sigma),
            prior(cauchy(0, 3.5), class = sd, nlpar = A),
            prior(cauchy(0, 3.5), class = sd, nlpar = B),
            prior(cauchy(0, 3.5), class = sd, nlpar = xmid),
            prior(cauchy(0, 3.5), class = sd, nlpar = scal)),
  algorithm = "sampling",
  iter = 2000,
  chains = 1
)

it works almost as fast as the equivalent model of rstanarm even if brms performs no rescaling. I think the trick is to set lb = 0 in the prior scal.


#6

Thank you so much Paul! I did not know I can specify boundaries for a normal prior.

The running time for a single chain is around 60s, but when I tried to run 3 chains parallelly on 3 cores, there was one chain finished in 60s, but the last 2 chains still spent 450s. My brief guess is the running time depends on different initial values. The running time for rstanarm is also varies from 60s to 250s, I consider there’s no significant difference.

However, when I tried the model, it still cannot get to the convergence within 10k iterations, while the rstanarm model converges perfectly. When it comes to the topic of ‘convergence’, the question has been confused me for months (especially on another project using NLMM, with larger dataset and more parameters). Do you know if there’s anything I can do to increase N_eff and make R_hat falls in a reasonable range? I tried to increase number of iter or warmup, not working and inflated running time. I tried to specify different (usually stronger) priors, the performance was improved a little but not enough to get to full convergence. I also tried to delete some parameters from the full model, but because of the nature of nonlinear model, it is hard to decide which parameter to drop. Although the model finally converges after multiple experiments, I cannot understand why I have to drop these parameters but not the others. May I ask for any suggestion or experiences about that?

By the way, what if I’d like to fit another model like:

bf(y ~ A + (B - A) / (1 + exp((xmid - x) / scal)), cbind(A, B, xmid, scal) ~ par1 + (1|par2) + (1|par3), nl = TRUE)

Apparently, the value of intercept should larger than zero, but for other categories, it is not. How can I specify the prior separately for intercept and other categories for ‘par1’? There is a warning message saying I cannot set a lower bound and at the same time having ‘coef’ specified.

Really appreciate your patient help! It’s the best to get some response from the package author, WOW!


#7

Hmm the model I showed you above (with just one chain for simplicity) converged nicely. You may want to try inits = 0 sometimes this helps. If you want to ensure that the whole of “A” is positive, replace A with exp(A).


#8

Many thanks for the inits = 0 trick, that works great and improves my convergence in this model.
For complex models, it seems like setting inits = 0 does not help much. Most of the cases I had was, every single chain converged, but when I tried 3 chains, they did not converge to a same global optimal.
When I tried exp(A), the model was harder to converge for even a single chain. Do you know if there’s any better prior other than normal() I may use to improve the behavior?
I’m sure someone has told you that, but I still have to say, brms is an amazing package, thank you so much for bringing that to us!


#9

Multiple chains not converging to the same posterior tells you that your priors are too wide or your model is somehow misspecified. You should think in more detail about the scale of your parameters so you can specifiy more informative priors.


#10

Really appreciate that. My problem was solved!


#11

How did you solve it?


#12

I used exp() to restrict my nonlinear parameters to be positive, set init=0, and specified narrower priors. I guess the most powerful part is exp(), it decreases the chance for the chains to be stuck in local optimal.