Divergent transitions with splines


Hi all. I’m encountering divergent transitions in what I thought was a fairly straight-forward model with b-splines. I know the standard recommendation is to increase adapt.delta until these go away, but I wanted to double-check that there wasn’t any issues with the formulation of the model itself. Any input would be greatly appreciated:

Here’s the model:

data {
    // n_y: number of observations in y
    int n_y ;
    // y: vector of observations for y
    vector[n_y] y ;
    // x: values of x
    vector[n_y] x ;
    //num_splines: number of splines
    int num_splines ;
    //splines: matrix of splines
    matrix[num_splines,n_y] splines ;
parameters {
    // noise: measurement noise
    real<lower=0> noise ;
    // volatility: wiggliness of the functions
    real<lower=0> volatility ;
    //intercept: intercept of the function
    real intercept ;
    //slope: slope of the function
    real slope ;
    // splines_helper: helper variable for splines
    row_vector[num_splines] splines_helper ;
transformed parameters{
    // f: latent functions
    vector[n_y] f ;
    f = intercept + slope*x + //linear part
        to_vector((volatility*cumulative_sum(splines_helper))*splines) ; //wiggly part (random-walk prior)
model {
    intercept ~ normal(0, 1) ;
    slope ~ normal(0, 1) ;
    volatility ~ normal(0, 1) ;
    noise ~ normal(0, 1) ;
    splines_helper ~ normal(0,1);
    y ~ normal(f,noise);

And here’s R code to make and fit some fake data:

# load packages
rstan_options(auto_write = TRUE)

# n_x: number of unique samples on x-axis
n_x = 100

# prep a tibble with combination of x, conditions & reps
dat = tibble(
    x = seq(-10,10,length.out=n_x)

# set random seed for reproducibility

# wiggly true function observed with noise
dat %>%
        true = sin(x)*dnorm(x,5,8) + x/100 + 1
        , obs =  true + rnorm(n(),0,.01)
    ) ->

# show the condition functions
dat %>%
        mapping = aes(
            x = x
            , y = true

# show the noisy observations
dat %>%
        mapping = aes(
            x = x
            , y = obs

#create splines
num_knots = 10
knots = seq(min(dat$x),max(dat$x),length.out = num_knots)
spline_degree = 3
splines <- t(bs(dat$x, knots=knots, degree=spline_degree, intercept = T))

# create the data list for stan
data_for_stan = list(
    n_y = nrow(dat)
    , y = dat$obs
    , x = dat$x
    , num_splines = nrow(splines)
    , splines = splines

spline_mod = rstan::stan_model('spline_fit.stan')

# start the chains
post = rstan::sampling(
    data = data_for_stan
    , object = spline_mod
    , chains = 4
    , cores = 4
    , iter = 2e3
#note warnings here

#check parameters
    , par = c(
        , 'volatility'
        , 'intercept'
        , 'slope'

#check pairs plot
    , pars = c(
        , 'volatility'
        , 'intercept'
        , 'slope'

And here’s the pairs plot:


The yellow dots means it exceeded the maximum treedepth. I can’t see if there are any red dots indicating a divergence.


And that usually means you have regions of varying curvature and you’re trying to explore a relatiely flat part of the posterior with a small step size.

Having a cumulative sum very much links the splines_helper variables together very tightly.

The only efficiency improvement I see would be declaring f as a row_vector so you don’t need the extra copy in to_vector.


Treedepth exceedence happens also with constant curvature but strong correlation as mass matrix adaptation uses diagonal matrix. See, e.g. “Gaussian linear model with adjustable priors” in https://rawgit.com/avehtari/BDA_R_demos/master/demos_rstan/rstan_demo.html with 2.6% iterations with saturated treedepth and “Gaussian linear model with standardized data” with 0 saturations, just because in the first case there is very strong correlation between alpha and beta.


Very good point. It’s very flat along the diagonal of correlation, but very curved going off diagonal. And that’s true even if variables are both unit scale.


I enjoyed reading the comments. Know I want to add my 2 cents:

You added the intercept twice. Here:

and here:

Removing one of them should result in better mixing.

Added later

You are fitting a random intercept model:
Do we need to put a hyper-prior on the variance of slope and intercept?
And since we see a linear dependency between slope and intercept: Do we need
a multivariate normal distribution on both parameters?


Yes, I’m copying that from the Splines In Stan demo, where it serves to implement a random-walk prior, which in turn achieves a bias towards smoother functions and less sensitivity to the number of splines used.


Thanks, good catch.


Thanks for the correction. You’re right—the cumulative sum is just going into calculating the location parameter for the normal for y. The actual prior on splines_helper is independent by component.