brms on a supercluster (MPI)

  • Operating System: “Red Hat Enterprise Linux Server 7.2 (Maipo)”
  • brms Version: 2.9.0
  • rstan version: 2.19.2

Recently I have been working on what are essentially generalized additive models (GAMs), initially using mgcv. However, for both computational and principled reasons I would like to move to a Bayesian framework. I am still pretty new to both GAMs and Bayesian inference, so specifying a GAM directly in rstan is beyond me, at least for now. I have therefore been using the amazing brms package to do things like this (reduced version):

```
 Family: exponential
  Links: mu = log
Formula: eeg06 ~ s(relright, k = 20) + s(relleft, k = 20)
   Data: subset(eeg, select = c(deps[d], "relright", "relleft")) (Number of observations: 5266)
```
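Roughly, the call producing that summary looks like the following (a minimal sketch: the formula, family, and data are taken from the output above, while the chains/cores settings are illustrative and `deps`/`d` come from my surrounding script):

```r
library(brms)

# Sketch of the brm() call behind the summary shown above; chains and
# cores are illustrative, and deps/d are defined elsewhere in my script.
fit <- brm(
  eeg06 ~ s(relright, k = 20) + s(relleft, k = 20),
  data   = subset(eeg, select = c(deps[d], "relright", "relleft")),
  family = exponential(),
  chains = 4, cores = 4
)
```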

However, it has quickly become clear that running these analyses on my office pc is not feasible: even a considerably reduced model takes about 4 days to run. I do have access to a supercluster, but my initial attempts don’t do much better.

From what I can find, the newest version of RStan supports MPI parallelization. My question is: does brms make use of this automagically (i.e., I cannot achieve any further performance gains)? And if not, can I set up brms for this? Or will I need to invoke rstan directly/‘manually’ to use the MPI features?

I would be most grateful for any guidance you can provide on this.

Joost Meekes

brm() has a future argument that, if set to TRUE, will use the future package for parallelization. The future package appears to support MPI (makeClusterMPI), so that should provide an easy way to achieve this, though I’ve never tried it.
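Something along these lines might work (an untested sketch: makeClusterMPI() comes from the future package and requires Rmpi, and the choice of 4 workers is an assumption):

```r
library(future)
library(brms)

# Untested sketch: spin up 4 MPI workers (requires Rmpi) and tell
# the future package to use them as a cluster.
cl <- makeClusterMPI(4)
plan(cluster, workers = cl)

# With future = TRUE, brm() submits each chain as a future, so the
# 4 chains run on the 4 MPI workers rather than on local cores.
fit <- brm(
  eeg06 ~ s(relright, k = 20) + s(relleft, k = 20),
  data = eeg, family = exponential(),
  chains = 4, future = TRUE
)

parallel::stopCluster(cl)  # shut the MPI workers down again
```

Note this still only parallelizes across chains, one chain per worker.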


Anecdotally, I’ve noticed much faster performance with GAMs in brms when specifying a smaller k. Unlike mgcv, which “finds” a reasonable k, I don’t believe we should expect brms to do the same. So it’s somewhat akin to passing a lot of linear predictors (in your case, at least 40).

So one approach, though I can’t speak to its “principledness”, might be to use mgcv to recommend a k and then use brms and LOO to compare models bracketing that k; a sketch of the mgcv step follows below.

Not sure what your data are but 20 knots seems like a very tortuous relationship!
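Something like this is what I had in mind (a rough sketch: mgcv has no exponential family, so a Gamma with a log link stands in here, and the formula is taken from your model above):

```r
library(mgcv)

# Rough sketch: a fast mgcv fit whose k-index diagnostics hint at
# whether k is large enough, before committing to a long brms run.
# Gamma is a stand-in for the exponential family (the exponential is
# a Gamma with shape 1), which mgcv does not offer directly.
m <- gam(eeg06 ~ s(relright, k = 20) + s(relleft, k = 20),
         data = eeg, family = Gamma(link = "log"), method = "REML")
gam.check(m)  # prints k', edf and a k-index for each smooth
```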

Thanks a lot for the helpful suggestion! I will try that, since I am having some trouble with the ‘cores’ argument, which is supposed to do the same.

However, both the ‘cores’ and the ‘future’ arguments seem to concern running the chains in parallel, whereas, if I understand correctly, the MPI implementation in the newest versions of Stan performs the gradient evaluation in parallel (which seems to be the most computationally expensive step?).

That’s a good point. I have to admit that I had assumed this would work the same, since brms just uses the mgcv functions to set up the splines. Thinking (and reading) about it in more detail, the effective number of degrees of freedom is indeed only determined during fitting, based on penalization. On the other hand, the help for brms (specifically, for brms::set_prior) states that smooths are implemented ‘as explained in gamm’, which in turn says that this is based on a ‘Bayesian model of spline smoothing’ (which suggests to me that a suitable prior should induce suitable ‘penalization’, to mix terminology from two different statistical approaches). In any case, you are certainly right that it’s worth checking, and I’ve seen at least one model fit that would be congruent with your suggestion.

Regarding my choice of k: while I am indeed relying on the general advice in the mgcv documentation that it’s better to choose k too high than too low, what I am modeling are EEG responses to trials which contain multiple events. Since the EEG response to even a single event may show several ups and downs, I think the ‘optimal’ k might well be relatively high.

That is an interesting idea. At the present stage of this project I am mainly trying to figure out whether it is at all feasible to fit these kinds of models, so I am not too worried about ‘principledness’ :-)

You have to reformulate your model to use map_rect to parallelize the gradient evaluation within Stan. Otherwise, brms/rstan/pystan only run multiple chains in parallel.

The sketch below may be helpful if you would like to use threading within the model.
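On the R side it looks roughly like this (a sketch, assuming you have already rewritten the likelihood around map_rect(); the file name model_with_map_rect.stan and the standata object are hypothetical placeholders):

```r
# Sketch, assuming model_with_map_rect.stan wraps its likelihood in
# map_rect(). Threading must be enabled at compile time, e.g. in
# ~/.R/Makevars:
#   CXX14FLAGS += -DSTAN_THREADS -pthread

Sys.setenv(STAN_NUM_THREADS = 8)  # threads per chain for the map_rect shards

library(rstan)
fit <- stan("model_with_map_rect.stan", data = standata,
            chains = 4, cores = 4)
```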


Thanks a lot! So that confirms my suspicion that brms does not parallelize the gradient evaluation, nor even RStan, if I understand correctly. On the plus side, it suggests that I can still reap the huge benefits that parallelizing the gradient evaluation should bring :-) Looks like I have some learning to do…

This isn’t how mgcv works; it doesn’t find k or knots or anything like that. The default k = -1 sets the upper limit on the possible complexity of the smooth, as it controls how many basis functions mgcv uses to represent the smooth function for covariate x. The -1 means the default for this kind of smooth, which for univariate smooths ends up being k = 10, and then you typically get fewer functions (usually one fewer) because mgcv applies identifiability constraints (we need to estimate an intercept too, so we adjust the basis to allow the intercept to be uniquely identified).

What mgcv and brms do is fit penalized splines; you might start with 9 (by default) basis functions and hence 9 model coefficients for each smooth, but the resulting model may use many fewer effective degrees of freedom than 9 for the smooth as the wiggliness penalty shrinks the estimates for the model coefficients.

This is slightly hidden in brms fits, as the entire GAM gets represented as a mixed model, with the wiggly basis functions being random effects and the smoothness parameter related to the variance parameter for the random effects. But it is estimating a penalized spline model, just like mgcv does. What brms is doing is just what gamm() and gamm4::gamm4() do: explicitly representing the GAM as a mixed model. gam() does this if you use the method = "REML" option (or "ML"). A different procedure for selecting smoothness parameters applies if you use the default of GCV selection, but you probably shouldn’t use this any longer as it can badly undersmooth.
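To make that concrete, here is a sketch with a hypothetical data frame d containing a response y and covariate x:

```r
library(mgcv)

# Three routes to the same penalized spline model; all of them relate
# the smoothness parameter to a variance parameter in a mixed model.
m1 <- gam(y ~ s(x, k = 10), data = d, method = "REML")  # REML, not GCV
m2 <- gamm(y ~ s(x, k = 10), data = d)        # explicit mixed-model form
m3 <- brms::brm(y ~ s(x, k = 10), data = d)   # fully Bayesian analogue
```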

There are differences between the mgcv and brms models; the priors implied by the empirical Bayes view of the penalized spline GAM fitted via REML/ML aren’t those used by brms, but that’s a good thing, as some of the priors implied by mgcv’s view are improper.

What is right is that choosing a lower k can speed things up; you have fewer coefficients to estimate. You should be very careful with this, though; the main thing a user has to do with a penalized spline GAM is choose a suitable size of basis expansion such that the true smooth function f() either lies in the span of the basis or a close approximation to it does. This is less likely to occur in a basis of size 5 than in one of size 10.

If you have a lot of data, not using the thin plate regression spline can also speed up the act of forming the basis before model fitting even starts. Use bs = "cr" when setting up the smooths via s() or t2(), for example; a sketch follows below.
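For instance (a sketch reusing the smooths from the model above):

```r
# Sketch: swap the default thin plate regression spline basis for a
# cubic regression spline, which is much cheaper to set up when there
# are many observations.
fit <- brms::brm(
  eeg06 ~ s(relright, k = 20, bs = "cr") + s(relleft, k = 20, bs = "cr"),
  data = eeg, family = exponential()
)
```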


Thanks for this clarification. I agree that it may be misleading to imply that mgcv finds k. Poor word choice.

Although this is not what the post was originally about, I do think there is definitely room for some detailed guidelines on how to set priors in brms for splines.

Yup; rather than bracketing k (as you mentioned previously), I think setting k sufficiently large but choosing the prior(s) appropriately for the variance terms for each smooth would be useful. That way you inform the model as to your prior expectations about the wiggliness of the function(s).
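In brms those variance terms have class "sds", so something like this would be the place to encode those expectations (a sketch; the half-normal scale of 0.5 is purely illustrative, not a recommendation):

```r
library(brms)

# Sketch: a tighter prior on the smooths' standard-deviation
# parameters (class "sds") encodes a prior expectation of less
# wiggly functions; the scale 0.5 is illustrative only.
priors <- set_prior("normal(0, 0.5)", class = "sds")
fit <- brm(eeg06 ~ s(relright, k = 20) + s(relleft, k = 20),
           data = eeg, family = exponential(), prior = priors)
```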

Sampling from the prior for a given smooth would also be useful; you’d see the kinds of functions implied by your prior, which might lead you to alter the prior before introducing the data and sampling.
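brms makes that easy via sample_prior = "only" (a sketch, reusing the illustrative priors object from above):

```r
# Sketch: with sample_prior = "only" the likelihood is ignored, so
# plotting the fitted smooth shows functions implied by the prior alone.
prior_fit <- brm(eeg06 ~ s(relright, k = 20), data = eeg,
                 family = exponential(), prior = priors,
                 sample_prior = "only")
plot(marginal_effects(prior_fit))  # marginal_effects() as of brms 2.9
```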
