Knots and basis dimension in brms

I’m confused about the choice of basis dimension and knot positions when using s() in brms. When I print the Stan code from my model, I see that the number of bases and the number of knots are in the data block, and that the only spline-related parameters in the parameter blocks are the standardized spline coefficients. I have a couple of questions about this and I’m hoping somebody can help answer them:

  1. How is brms choosing the number of basis functions and knots to pass to Stan? Is it using a penalized likelihood approach like mgcv does?
  2. Is it not possible to estimate the number of basis functions as a parameter in the model, using a regularizing prior to nudge the model towards a smaller number of basis functions?

Thanks!

Hi @maurice_goodman!

This is not possible using Stan, because the number of knots is a discrete parameter. HMC relies on gradients with respect to the parameters, and discrete parameters don’t have gradients. If I’m not mistaken, brms uses a default basis dimension of 10 (the basis functions themselves are built by the mgcv package). You can change this number.
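For example, here is a minimal sketch (with simulated data I made up) of overriding that default through the k argument of s(); the model formulas are the only brms-specific part, everything else is just set-up:

library(brms)

set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, 0, 0.3)

# default basis dimension of 10 vs. an explicitly larger basis
fit_default <- brm(y ~ s(x), data = dat)
fit_k20     <- brm(y ~ s(x, k = 20), data = dat)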

If you’re using 'optimize' then I guess this would result in a maximum penalized likelihood estimator. But in this case you’d probably be better off using mgcv straight away. When you’re using MCMC, the prior acts as the penalty or smoothing parameter. (I think you may not have noticed it in the code, but it should be there.) I hope this also answers your second question.
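To make that concrete, here is a hedged sketch of where that prior shows up (simulated data again; the exact parameter names in the generated Stan code may differ between brms versions):

library(brms)

set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, 0, 0.3)

# the sds class is the prior that plays the role of the smoothing penalty
get_prior(y ~ s(x), data = dat)

# the generated Stan code contains the corresponding standard deviation parameter
# (named something like sds_1_1) alongside the standardized spline coefficients
make_stancode(y ~ s(x), data = dat)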

There’s actually some discussion about GAMs on this board. For example, here.

Cheers! :)

brms estimates models using a mixed-model representation of the penalized regression spline model of mgcv (the latter does this when method = "REML"). What neither is doing is choosing the number of basis functions; the default is arbitrary and hard-coded in mgcv, and depends on the parameterisation of the specific basis you are using. If you are fitting a univariate smooth and simply go with the default thin plate regression spline (TPRS) basis (bs = 'tp') and don’t change m (which controls the order of the wiggliness penalty and hence the null space (the perfectly smooth parts) of the basis), i.e.

y ~ s(x)

then the basis dimension will be 10 when initially created, but one of these basis functions will get dropped because it is a constant function: you can’t estimate a coefficient for it and the model intercept, as the two parameters are not identifiable. So you should end up with an intercept/constant term and nine coefficients for the single smooth.
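A quick way to see this (a sketch with made-up data, using mgcv directly since it sets up exactly the kind of basis brms relies on):

library(mgcv)

set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, 0, 0.3)

m <- gam(y ~ s(x), data = dat, method = "REML")
length(coef(m))   # 10 coefficients in total: 1 intercept + 9 for s(x)
names(coef(m))    # "(Intercept)", "s(x).1", ..., "s(x).9"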

Note that in this TPRS basis, technically what is going on is that there is a knot at every data point (or at a randomly chosen subset of 2000 data points if there are more than 2000 unique covariate values), and the full TPRS basis is formed from these knots. This gives an optimal smoother in certain senses, but it is wasteful: you end up estimating a vast number of coefficients for the smooth, far more than would typically be needed to capture the nonlinear (partial) relationship between the response and the covariate.

So what Simon Wood does in mgcv is to eigen-decompose the full (or 2000-knot) basis and take the first k eigenvectors of this decomposition as the basis functions used to estimate the smooth. This has the effect of concentrating the main features of the basis in the first few decomposed basis functions, leaving the very wiggly parts to the later eigenvectors, which are dropped; since k = 10 by default as described above, we keep only the first 10. This seems to work exceedingly well, preserving most of the good properties of the full TPRS basis whilst addressing its major drawback (the sheer number of coefficients). As brms uses mgcv’s functions to set up the basis, this is also what happens during model set-up under brms.
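Here is a stripped-down illustration of the rank-reduction idea with made-up data (this is not mgcv’s actual algorithm; I’m ignoring the polynomial null-space part and the constraints, and the radial basis constant is for the 1-d, m = 2 case only):

set.seed(1)
x <- sort(runif(300))

# full 1-d thin plate basis: eta(|x_i - x_j|) with a knot at every data point
# (for d = 1, m = 2 the radial basis is eta(r) = r^3 / 12)
eta <- function(r) r^3 / 12
E <- outer(x, x, function(a, b) eta(abs(a - b)))

# eigen-decompose and keep the eigenvectors with the k largest-magnitude eigenvalues;
# these span the low-rank space used in place of the full 300-column basis
k  <- 10
ev <- eigen(E, symmetric = TRUE)
Uk <- ev$vectors[, order(abs(ev$values), decreasing = TRUE)[seq_len(k)]]
dim(Uk)   # 300 x 10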

In the case of other basis types, again, one typically doesn’t choose the knots themselves; for a cubic regression spline one just chooses how many basis functions one wants for a particular smooth (via the argument k) and, absent any other user intervention, mgcv will spread the required knots evenly over the observed range of the covariate.
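You can inspect this directly with mgcv’s smooth constructor; a sketch with made-up data (if I remember the slot name correctly, the 'cr' smooth stores its knot locations in the xp component):

library(mgcv)

set.seed(1)
dat <- data.frame(x = runif(200, min = 2, max = 7))

sm <- smoothCon(s(x, bs = "cr", k = 10), data = dat)[[1]]
sm$xp   # the 10 knot locations mgcv chose over the observed range of x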

The key point is that the user just needs to use as many basis functions as they think necessary to capture the wiggliness of the true function (or closely approximate it), plus a few more for good measure. If you expected a roughly cubic relationship you might set k = 6, which would give you some flexibility beyond a strictly cubic fit. That said, there is little penalty for choosing a larger number of basis functions, at least with modest sample sizes. In this setting, k = 20 would be overkill, but the resulting model would likely be similar to the k = 10 one, and even the k = 6 one if the true effect really were roughly cubic, because mgcv, as you say, penalises the model fit for excessive wiggliness. brms does exactly the same thing via the mixed-model representation of penalised splines. There may even be a benefit to choosing an overly large basis like k = 20, as that basis will contain many more functions of a given complexity (EDF < 10) than a basis with k = 10. The trade-off, however, is increased compute time.
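As a quick illustration of the penalty doing that job, here is a sketch using mgcv on one of its own simulated test data sets (brms behaves analogously through the sds prior on the smooth):

library(mgcv)

set.seed(2)
dat <- gamSim(eg = 1, n = 400, verbose = FALSE)   # mgcv's standard test data

m6  <- gam(y ~ s(x0, k = 6),  data = dat, method = "REML")
m10 <- gam(y ~ s(x0, k = 10), data = dat, method = "REML")
m20 <- gam(y ~ s(x0, k = 20), data = dat, method = "REML")

# the penalty shrinks the unneeded flexibility away, so the effective degrees of
# freedom end up similar despite the very different basis dimensions
sapply(list(k6 = m6, k10 = m10, k20 = m20), function(m) summary(m)$edf)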

Hence the aim is not to choose the right number of knots to yield the smooth function you want, but to choose a basis rich enough that you can be reasonably sure the set of functions spanned by the basis includes the true but unknown function, or a close approximation to it. And then shrink away the excessively wiggly bits of that basis using the wiggliness penalty.
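mgcv also lets you check, after fitting, whether the basis you chose was rich enough; a sketch (gam.check() prints a rough test of the basis dimension alongside its residual plots, and the same logic applies to the mgcv-built bases inside a brms model):

library(mgcv)

set.seed(2)
dat <- gamSim(eg = 1, n = 400, verbose = FALSE)
m <- gam(y ~ s(x2, k = 10), data = dat, method = "REML")

# prints k' (the maximum possible EDF), the estimated EDF, and a k-index with a
# p-value; a low k-index / small p-value suggests k may be too small
gam.check(m)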

If you want to know the gory details of the implementation in mgcv, and hence brms, then you need to consult the relevant Rd pages in mgcv:

?smooth.construct
?smooth.construct.ad.smooth.spec
?smooth.construct.bs.smooth.spec
?smooth.construct.cc.smooth.spec
?smooth.construct.cp.smooth.spec
?smooth.construct.cr.smooth.spec
?smooth.construct.cs.smooth.spec
?smooth.construct.ds.smooth.spec
?smooth.construct.fs.smooth.spec
?smooth.construct.gp.smooth.spec
?smooth.construct.mrf.smooth.spec
?smooth.construct.ps.smooth.spec
?smooth.construct.re.smooth.spec
?smooth.construct.sf.smooth.spec
?smooth.construct.so.smooth.spec
?smooth.construct.sos.smooth.spec
?smooth.construct.sw.smooth.spec
?smooth.construct.t2.smooth.spec
?smooth.construct.tensor.smooth.spec
?smooth.construct.tp.smooth.spec
?smooth.construct.ts.smooth.spec

The details section of these man pages describes how k is interpreted and what the resulting basis dimension will be. In some special cases, such as the MRF basis, the default is to fit the full-rank basis (number of groups - 1), whereas in others a low-rank basis is the default.

There are problems with using the number of basis functions as the means of controlling how smooth or wiggly the estimated model is. I don’t think you can easily have a discrete parameter like this in HMC, so you’d need some other way to choose the basis size during fitting. And models with 10 basis functions won’t be nested in ones with, say, 9 basis functions, so anything based on likelihood ratios would be out.

Rather than think about model selection as a discrete process of choosing the basis dimension, we can use a continuous selection process: mgcv fits models using log smoothness parameters log(λ), and brms does the same thing except that, instead of a smoothness parameter, it has a variance parameter which is constrained to be non-negative (perhaps even strictly positive; I should check that). Under either parameterisation, the wigglier tendencies of the basis are progressively muted if they do not substantially improve the fit to the data under the penalised likelihood criterion.
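For a concrete view of the two scales, here is a sketch using mgcv’s own simulated test data: gam.vcomp() re-expresses the estimated smoothing parameters as variance components, which is essentially the scale brms works on.

library(mgcv)

set.seed(3)
dat <- gamSim(eg = 1, n = 300, verbose = FALSE)

m <- gam(y ~ s(x2), data = dat, method = "REML")
m$sp           # mgcv's scale: the estimated smoothing parameter lambda
gam.vcomp(m)   # the same information as a standard deviation (mixed-model scale);
               # roughly, lambda = sigma^2 / sd^2, which is the form brms parameterises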


Ah the GAM expert has spoken. I really enjoy reading your posts! Thanks! :)

Rather than think about model selection as a discrete process of choosing the basis dimension, we can use a continuous selection process: mgcv fits models using log smoothness parameters log(λ), and brms does the same thing except that, instead of a smoothness parameter, it has a variance parameter which is constrained to be non-negative (perhaps even strictly positive; I should check that). Under either parameterisation, the wigglier tendencies of the basis are progressively muted if they do not substantially improve the fit to the data under the penalised likelihood criterion.

This is also the connection between random effects (or varying intercepts) and smooth splines, which is kind of trivial if you think about it, but was pretty eye-opening to me.
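That connection runs in both directions. As a small sketch with made-up grouped data, mgcv can even fit a plain random intercept as a "smooth" using the re basis, with the smoothing parameter playing the role of the group-level standard deviation:

library(mgcv)

set.seed(4)
d <- data.frame(g = factor(rep(1:10, each = 20)))
d$y <- rnorm(10)[as.integer(d$g)] + rnorm(200, 0, 0.5)

m_re <- gam(y ~ s(g, bs = "re"), data = d, method = "REML")
gam.vcomp(m_re)   # reports the random-intercept sd: a smoothing parameter in disguise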
