Scalar-on-function regression using Stan

Functional data analysis (e.g. scalar-on-function regression, as well as functional-response models) can be implemented in Stan quite easily, and indeed automated, by using the brms package, which itself calls mgcv to prepare the various design matrices. It turns out that mgcv can be prodded into doing several types of functional data analysis (see Wood (2017), Generalized Additive Models, 2nd ed., p. 391).

Wood outlines a scalar-on-function regression using mgcv, illustrated with data from gamair, the CRAN package that accompanies the text. Running it in Stan with brms simply requires replicating the formula that would be passed to mgcv::gam().

library(gamair)
data(gas)
library(brms)
library(mgcv)

mStan <- brm(octane ~ s(nm, by=NIR, k=50), data=gas)
mMgcv <- gam(octane ~ s(nm, by=NIR, k=50), data=gas)
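
For a quick side-by-side check of the two fits, something along these lines works (a sketch only; the "Estimate" column name is an assumption about the output of fitted() for brms models):

# Plot mgcv's fitted values against the posterior-mean fitted values from brms;
# points close to the 1:1 line indicate the two fits agree
plot(fitted(mMgcv), fitted(mStan)[, "Estimate"],
     xlab = "mgcv fitted", ylab = "brms fitted")
abline(0, 1)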

The Stan code can be retrieved as:

brms::make_stancode(octane ~ s(nm, by=NIR, k=50), data=gas)

Interestingly, in this example NIR is a matrix (each row represents a different curve over the nm variable), while octane is the scalar response.

The Stan code prepared by brms produces a fit very similar to mgcv's, though considerably slower. Of more concern, however, is that nearly every post-warmup sample exceeds max_treedepth. Increasing max_treedepth to 15 has only a small impact.

How concerned should I be about this, given that other diagnostics suggest a good fit (e.g. bayes_R2 = 0.97, fits closely resemble those produced by mgcv, Rhats are near one, etc.)?

Fruitful discussions with Paul Buerkner (see, e.g., this thread on the brms issue tracker) suggest that brms is unlikely to be able to resolve this issue on its own; custom modifications to the model will therefore be needed if a reduction in these warnings is desired, or even possible at all.


My experience with splines so far is that you indeed need to increase adapt_delta and max_treedepth from their defaults to get good sampling.
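
For reference, those settings can be raised directly in the brm() call from above; a minimal sketch (the specific values are illustrative, not tuned):

# Re-fit with a higher acceptance target and deeper trees;
# both settings are passed through to Stan via the control list
mStan2 <- brm(octane ~ s(nm, by = NIR, k = 50), data = gas,
              control = list(adapt_delta = 0.99, max_treedepth = 15))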

 > brms::make_stancode(octane ~ s(nm, by=NIR, k=50), data=gas)
// generated with brms 2.1.0
functions { 
} 
data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  // data of smooth s(nm,by=NIR,k=50)
  int nb_1;  // number of bases 
  int knots_1[nb_1]; 
  matrix[N, knots_1[1]] Zs_1_1; 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
  int Kc = K - 1; 
  matrix[N, K - 1] Xc;  // centered version of X 
  vector[K - 1] means_X;  // column means of X before centering 
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 
parameters { 
  vector[Kc] b;  // population-level effects 
  real temp_Intercept;  // temporary intercept 
  // parameters of smooth s(nm,by=NIR,k=50)
  vector[knots_1[1]] zs_1_1; 
  real<lower=0> sds_1_1; 
  real<lower=0> sigma;  // residual SD 
} 
transformed parameters { 
  vector[knots_1[1]] s_1_1 = sds_1_1 * zs_1_1; 
} 
model { 
  vector[N] mu = Xc * b + Zs_1_1 * s_1_1 + temp_Intercept; 
  // priors including all constants 
  target += student_t_lpdf(temp_Intercept | 3, 88, 10); 
  target += normal_lpdf(zs_1_1 | 0, 1); 
  target += student_t_lpdf(sds_1_1 | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma); 
  } 
} 
generated quantities { 
  // actual population-level intercept 
  real b_Intercept = temp_Intercept - dot_product(means_X, b); 
}

Any ideas why the temporary intercept has mean 88?

Curious! That’s a question for @paul.buerkner.

In brms, the temporary intercept is parameterized to represent the mean of the response distribution, which is roughly 88 in this case. That is, we can safely use a weakly informative prior on it when we center it around that mean. Such a prior often helps with convergence, and we don't risk "biasing" the results through that approach (with very few exceptions, perhaps).
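
For reference, that location can be checked directly from the data loaded above:

# The prior location for the temporary intercept is taken from the response mean
mean(gas$octane)  # roughly 88, as discussed above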

I want to understand it a bit more.

vector[N] mu = Xc * b + Zs_1_1 * s_1_1 + temp_Intercept;

Xc is centered, so I'd expect the mean of the intercept to be 0, and Zs_1_1 is the splines. How does that add up?

BTW, this doesn’t look so efficient:

target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10);

Why not?

real<lower=0> sigma;
sigma ~ student_t(3, 0, 10);

In my experience, not centering X also works in Stan and gives a speed bump. I know it would not in optimization.

I think this goes in a very different direction than what this thread is intended for and has nothing to do with the convergence problems of the model. Please ask these questions again on brms-users (https://groups.google.com/forum/#!forum/brms-users) or in a separate thread here on the Stan forums, and I am happy to answer them.

I'll respect that answer. However, I don't agree with the comment. If the model is misspecified, then of course the sampler has a hard time.

Sorry, I understand you are trying to help. Let me explain the model.

Since the design matrix Xc is centered, the mean of the response distribution is modeled directly via temp_Intercept. The spline terms may change that a bit, but not in a relevant manner here. That is, if you remove the prior on the intercept, the results won't change. Within b there are some linear components of the splines, while s_1_1 captures the non-linear components. This "random effects" parameterization of the splines is prepared by mgcv, and I just follow in its footsteps.
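
For anyone curious about that decomposition, here is a rough sketch of how mgcv exposes it (the function names are mgcv's; the exact calls brms makes internally may differ):

library(mgcv)
# Construct the smooth and re-express it in "random effects" form:
# Xf holds the unpenalized (linear) columns that end up in b, while
# rand holds the penalized basis whose coefficients roughly correspond to s_1_1
sm <- smoothCon(s(nm, by = NIR, k = 50), data = gas, absorb.cons = TRUE)[[1]]
re <- smooth2random(sm, "", type = 2)
str(re$Xf)    # fixed (linear) part of the spline
str(re$rand)  # random ("wiggly") part of the spline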

I use the _lpdf formulation in brms so that downstream functions can recover the log posterior including its constants (e.g., the bridgesampling package needs that). The efficiency loss is minimal and usually not observable in practice.
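
As an illustration, something like the following relies on those constants being present (a sketch; save_all_pars was the argument for this in older brms versions, newer versions use save_pars = save_pars(all = TRUE)):

# Bridge sampling estimates the marginal likelihood and needs the
# log posterior evaluated with all normalizing constants included
mStan_full <- brm(octane ~ s(nm, by = NIR, k = 50), data = gas,
                  save_all_pars = TRUE)
bridge_sampler(mStan_full)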

Both rstanarm and brms center the fixed-effects design matrix, as this usually improves convergence and sampling speed considerably. There may be cases where it doesn't, but overall I think centering is the right way to go.

We have to be careful about what “bias” means here. Bias is just expected estimation error. Or more generally, some effect on the posterior.

What happens when you use a data-dependent prior is that there's usually a small bias from using the data. This is similar to the bias you get in the MLE for variance by dividing the sum of squares by N; the unbiased estimator divides by N - 1 to adjust for the fact that the sum of squares was calculated with the sample mean rather than the true mean. What'll happen in your case is similar in that your estimates get pulled toward the data mean rather than the true mean. The amount will depend on that difference, the variance of the prior, and how much data you have.

[edit: this is also like the case where you standardize predictors to have sample mean zero and unit sample variance—it has the same property of the transform being based on your actual data and it converging to the right answer asymptotically.]
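
To make the variance analogy above concrete, a small self-contained illustration (not from the thread):

set.seed(1)
x <- rnorm(100)
mean((x - mean(x))^2)  # MLE: divides by N, biased downward
var(x)                 # divides by N - 1, unbiased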

Of course you are right, Bob. Sometimes I use frequentist wording to shorten what I want to say; that's some sort of laziness. The idea is that setting such a prior will improve convergence in more complex models without its effect on the posterior being 'visible', or at least while keeping that effect negligible.

The problem with the scalar-on-function regression remains, however, with or without this prior. I think the model is overparameterized. That is, k = 50 non-linear spline parameters may be too many, even if that is what the original mgcv example uses. With k = 5, convergence improves but the model fit deteriorates; with k = 10, the fit is comparable to k = 50 again, but unfortunately every iteration also hits the maximum treedepth again.
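
For completeness, the k comparisons above amount to refits along these lines (a sketch; only the basis dimension changes):

# Smaller k means fewer spline basis functions: easier sampling, rougher fit
mStan_k5  <- brm(octane ~ s(nm, by = NIR, k = 5),  data = gas)
mStan_k10 <- brm(octane ~ s(nm, by = NIR, k = 10), data = gas)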

I hope I wasn’t implying it’s a bad idea. We trade variance for bias all the time in Bayesian models.

Standardizing inputs does the same kind of thing, and it's built into RStanArm in order to provide both better conditioning and a scale for the default priors.