Constrained non-parametric smooths

Dear all,

I’m trying to model a dataset that looks like this:

The data is structured into 5 series from two groups. The thick lines in the figure are just empirical smooths of the groups.
I have theoretical reasons to assume that the series from the control group are just random (autocorrelated) variations around some constant value, to be estimated. Let’s call it the reference value. The quantity of interest is the mean curve for the test group, which must be at or below the reference value.

Data attached into a R file that can be sourced: model_data.r (22.5 KB)

Indeed, I’d like to assume that the test curve is at the reference value by default (kind of null hypothesis) unless there is sufficient evidence to pull it towards smaller values.

The question is how to encode this in a Stan program.

For reference, I started with a brms model with the following formula: value ~ 1 + s(location, by = x), where x is a binary variable indicative of the test group. This gives the following posterior estimates:

What I need is to prevent the posterior distribution of the red curve to climb beyond the reference level. The probabilistic model is crucial, since my goal is to estimate the posterior probability of the target being below the (as opposed to exactly at) reference level.

My first try was to use a non-linear model with brms as follows:

  value ~ mu0 - (exp(trt)-1)*(trt>0),
  mu0 ~ 1,
  trt ~ 0 + s(location, by = x),
  nl = TRUE

Essentially, using a spline as a latent curve that is collapsed to 0 all the negative values. The transformation exp(trt)-1 is not essential. It could have been simply trt.

This kind of works. Here are the posterior estimates of the group curves:

Nevertheless, there are a couple of things with which I’m not completely satisfied:

  1. I’d like to shrink the test curve a bit more towards the reference value. But I don’t have any parameter that controls that.
  2. The uncertainty band around the curve looks so tight. I would have expected much more uncertainty. With these results I’m getting almost 100% probability of departure from the reference level at every location.
  3. Looks like it is over-smoothing a bit. But I just used the default basis dimension (8). I guess I can simply adjust that by increasing this number.

But what worries me more is what happens when I try to account for the autocorrelation at the series level, which makes me think that collapsing the spline to zero is perhaps not a great idea.

Here is my second model that adds another term srs which fits a smooth curve to each series:

  value ~ mu0 - (exp(trt)-1)*(trt>=0) + srs,
  mu0 ~ 1,
  trt ~ 0 + s(location, by = x),
  srs ~ 0 + s(location, by = series, id = 1),
  nl = TRUE

Side question: notice that I specified id=1 in order to use a single standard-deviation parameter for all the series. However, the model still estimates a specific parameter for each series. Not sure what I’m doing wrong here, but ultimately I could fix it in the stan code.

In any case, the results are quite unstable. I get high Rhats and the estimates depend a lot on the starting values, sometimes giving wild results like a crazy estimate for the treatment curve compensated with highly varying series. Also giving bi-modal posteriors for the treatment group (some probability mass at zero and the rest centred at some positive value).

Perhaps that will be fixed by collapsing the standard deviations into a single parameter. Yet, it makes me think that the multiplication by 0 in the linear predictor simply breaks the flow of information from the data to the segments of the treatment curve below zero. In these segments, the curve can do whatever it wants without any impact in the likelihood, which surely isn’t a good idea.

Sorry for the long post. But I’d love to hear your thoughts about this modelling problem.
Thanks in advance.

By the way, for priors and other details, find below the full call to brms and the generated stan code for the second model.

priors <-  c(
    prior(normal(150, 100), nlpar = "mu0"),
    prior(normal(0, 5), nlpar = "trt"),
    prior(normal(0, 5), nlpar = "srs"),
    prior(normal(0, 3), class = "sds", nlpar = "trt"),
    prior(normal(0, 5), class = "sds", nlpar = "srs"),
    prior(normal(0, 3), class = "sigma")
formula <- bf(
  value ~ mu0 - (exp(trt)-1)*(trt>=0) + srs,
  mu0 ~ 1,
  trt ~ 0 + s(location, by = x),
  srs ~ 0 + s(location, by = series, id = 1),
  nl = TRUE
        formula = formula,
        data = model_data,   # attached above
        family = "gaussian",
        prior = priors,
        control = list(
          adapt_delta = .999,
          max_treedepth = 15
        seed = 20210331,
        cores = 4,
        iter = 400
// generated with brms 2.13.0
functions {
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K_mu0;  // number of population-level effects
  matrix[N, K_mu0] X_mu0;  // population-level design matrix
  // data for splines
  int Ks_trt;  // number of linear effects
  matrix[N, Ks_trt] Xs_trt;  // design matrix for the linear effects
  // data for spline s(location,by=is_test)
  int nb_trt_1;  // number of bases
  int knots_trt_1[nb_trt_1];  // number of knots
  // basis function matrices
  matrix[N, knots_trt_1[1]] Zs_trt_1_1;
  // data for splines
  int Ks_srs;  // number of linear effects
  matrix[N, Ks_srs] Xs_srs;  // design matrix for the linear effects
  // data for spline s(location,by=series,id=1)1
  int nb_srs_1;  // number of bases
  int knots_srs_1[nb_srs_1];  // number of knots
  // basis function matrices
  matrix[N, knots_srs_1[1]] Zs_srs_1_1;
  // data for spline s(location,by=series,id=1)6
  int nb_srs_2;  // number of bases
  int knots_srs_2[nb_srs_2];  // number of knots
  // basis function matrices
  matrix[N, knots_srs_2[1]] Zs_srs_2_1;
  // data for spline s(location,by=series,id=1)3
  int nb_srs_3;  // number of bases
  int knots_srs_3[nb_srs_3];  // number of knots
  // basis function matrices
  matrix[N, knots_srs_3[1]] Zs_srs_3_1;
  // data for spline s(location,by=series,id=1)4
  int nb_srs_4;  // number of bases
  int knots_srs_4[nb_srs_4];  // number of knots
  // basis function matrices
  matrix[N, knots_srs_4[1]] Zs_srs_4_1;
  // data for spline s(location,by=series,id=1)5
  int nb_srs_5;  // number of bases
  int knots_srs_5[nb_srs_5];  // number of knots
  // basis function matrices
  matrix[N, knots_srs_5[1]] Zs_srs_5_1;
  int prior_only;  // should the likelihood be ignored?
transformed data {
parameters {
  vector[K_mu0] b_mu0;  // population-level effects
  vector[Ks_trt] bs_trt;  // spline coefficients
  // parameters for spline s(location,by=is_test)
  // standarized spline coefficients
  vector[knots_trt_1[1]] zs_trt_1_1;
  real<lower=0> sds_trt_1_1;  // standard deviations of spline coefficients
  vector[Ks_srs] bs_srs;  // spline coefficients
  // parameters for spline s(location,by=series,id=1)1
  // standarized spline coefficients
  vector[knots_srs_1[1]] zs_srs_1_1;
  real<lower=0> sds_srs_1_1;  // standard deviations of spline coefficients
  // parameters for spline s(location,by=series,id=1)6
  // standarized spline coefficients
  vector[knots_srs_2[1]] zs_srs_2_1;
  real<lower=0> sds_srs_2_1;  // standard deviations of spline coefficients
  // parameters for spline s(location,by=series,id=1)3
  // standarized spline coefficients
  vector[knots_srs_3[1]] zs_srs_3_1;
  real<lower=0> sds_srs_3_1;  // standard deviations of spline coefficients
  // parameters for spline s(location,by=series,id=1)4
  // standarized spline coefficients
  vector[knots_srs_4[1]] zs_srs_4_1;
  real<lower=0> sds_srs_4_1;  // standard deviations of spline coefficients
  // parameters for spline s(location,by=series,id=1)5
  // standarized spline coefficients
  vector[knots_srs_5[1]] zs_srs_5_1;
  real<lower=0> sds_srs_5_1;  // standard deviations of spline coefficients
  real<lower=0> sigma;  // residual SD
transformed parameters {
  // actual spline coefficients
  vector[knots_trt_1[1]] s_trt_1_1;
  // actual spline coefficients
  vector[knots_srs_1[1]] s_srs_1_1;
  // actual spline coefficients
  vector[knots_srs_2[1]] s_srs_2_1;
  // actual spline coefficients
  vector[knots_srs_3[1]] s_srs_3_1;
  // actual spline coefficients
  vector[knots_srs_4[1]] s_srs_4_1;
  // actual spline coefficients
  vector[knots_srs_5[1]] s_srs_5_1;
  // compute actual spline coefficients
  s_trt_1_1 = sds_trt_1_1 * zs_trt_1_1;
  // compute actual spline coefficients
  s_srs_1_1 = sds_srs_1_1 * zs_srs_1_1;
  // compute actual spline coefficients
  s_srs_2_1 = sds_srs_2_1 * zs_srs_2_1;
  // compute actual spline coefficients
  s_srs_3_1 = sds_srs_3_1 * zs_srs_3_1;
  // compute actual spline coefficients
  s_srs_4_1 = sds_srs_4_1 * zs_srs_4_1;
  // compute actual spline coefficients
  s_srs_5_1 = sds_srs_5_1 * zs_srs_5_1;
model {
  // initialize linear predictor term
  vector[N] nlp_mu0 = X_mu0 * b_mu0;
  // initialize linear predictor term
  vector[N] nlp_trt = rep_vector(0, N) + Xs_trt * bs_trt + Zs_trt_1_1 * s_trt_1_1;
  // initialize linear predictor term
  vector[N] nlp_srs = rep_vector(0, N) + Xs_srs * bs_srs + Zs_srs_1_1 * s_srs_1_1 + Zs_srs_2_1 * s_srs_2_1 + Zs_srs_3_1 * s_srs_3_1 + Zs_srs_4_1 * s_srs_4_1 + Zs_srs_5_1 * s_srs_5_1;
  // initialize non-linear predictor term
  vector[N] mu;
  for (n in 1:N) {
    // compute non-linear predictor values
    mu[n] = nlp_mu0[n] - (exp(nlp_trt[n]) - 1) * (nlp_trt[n] >= 0) + nlp_srs[n];
  // priors including all constants
  target += normal_lpdf(b_mu0 | 150, 100);
  target += normal_lpdf(bs_trt | 0, 5)
    - 2 * normal_lccdf(0 | 0, 5);
  target += normal_lpdf(sds_trt_1_1 | 0, 3)
    - 1 * normal_lccdf(0 | 0, 3);
  target += std_normal_lpdf(zs_trt_1_1);
  target += normal_lpdf(bs_srs | 0, 5)
    - 5 * normal_lccdf(0 | 0, 5);
  target += normal_lpdf(sds_srs_1_1 | 0, 5)
    - 1 * normal_lccdf(0 | 0, 5);
  target += std_normal_lpdf(zs_srs_1_1);
  target += normal_lpdf(sds_srs_2_1 | 0, 5)
    - 1 * normal_lccdf(0 | 0, 5);
  target += std_normal_lpdf(zs_srs_2_1);
  target += normal_lpdf(sds_srs_3_1 | 0, 5)
    - 1 * normal_lccdf(0 | 0, 5);
  target += std_normal_lpdf(zs_srs_3_1);
  target += normal_lpdf(sds_srs_4_1 | 0, 5)
    - 1 * normal_lccdf(0 | 0, 5);
  target += std_normal_lpdf(zs_srs_4_1);
  target += normal_lpdf(sds_srs_5_1 | 0, 5)
    - 1 * normal_lccdf(0 | 0, 5);
  target += std_normal_lpdf(zs_srs_5_1);
  target += normal_lpdf(sigma | 0, 3)
    - 1 * normal_lccdf(0 | 0, 3);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_lpdf(Y | mu, sigma);
generated quantities {
1 Like

sorry for not getting to you earlier. Your question is relevant and well written.

For the most suspicious bit about your model is the (exp(trt)-1)*(trt>=0) part. Usually, when one wants to move from an unconstrained scale to only positive numbers, one would use either exp or log1p_exp (a.k.a. softmax) transforms. Those are smooth and should be much better behaved. Sharp cutoffs usually pose trouble for Stan and I am actually surprised the first model did anything :-) So I would try the different transformations first.

Never used that before, but might be a bug in brms - it would help if you could try to build a simple example that demonstrates that the behaviour of brms is different than a corresponding mgcv model and report that at Issues · paul-buerkner/brms · GitHub

Best of luck with your model!

1 Like

Martin, thank you so much for taking the time for answering.

Indeed, the component (exp(trt)-1)*(trt>=0) never seemed a good idea due to the discontinuity it created. The problem with an exponential transform of the smooth curve is that it maps all the probability mass to \mathbb{R}^+, thus excluding 0. Since the main objective of this study was to infer the posterior probability of the curve being different from the reference level, I really need to quantify the chance that the treatment has exactly 0 effect. Sure, I can establish some threshold below which I consider the effect to be negligible. But it seems a bit arbitrary and hacky.

I worked out another solution based on a mixture that I want to leave here for reference as it might help others. It’s mildly satisfying, since I also had to fix some arbitrary values which makes it not so different from the approach of fixing a threshold mentioned above. So I welcome comments and thoughts to improve it.

First, the main resulting figure:

It improves all my points in the first post.

  1. Shrinks more towards the reference level. Specially when it’s close. Effectively favouring the “null” hypothesis of null effect.
  2. The uncertainties are larger (thus, the posterior probabilities of interest are less extreme).
  3. Reduced oversmoothing (I increased the basis dimension).
    Moreover, the MCMC is stable, quick and solid.

The model is as follows:

\begin{equation} \begin{aligned} y(i) | g & \sim \mathcal{N}(\mu_g(i),\, \sigma_y) \\ \mu_g(i) & = \mu_0 - f_t(i) \delta_{\text{test}}(g)\\ f_t(i) & = z_i\cdot 0 + (1-z_i) e^{h(i)} \\ \end{aligned} \end{equation}

As in the previous model, the expected mean is a reference value \mu_0 for the control group and has an additional term f_t(i) which represents the effect of the test group.
This effect is considered a mixture of a point-mass at 0, and a strictly positive smooth curve implemented as the exponential of a spline h(i).
The mixture is controlled by binary variables z_i \sim \text{Ber}(\theta_i), which selects whether the effect comes from either component at each location.

In practice, for coding the model in Stan this binary variable needs to be marginalised. This is very clearly explained in the first section of this tutorial by @betanalpha.

Now, \theta_i represents the probability of null effect at each location. This is the parameter I looked for in my previous post, where I can put some prior that shrinks the effect towards the null (\theta \to 1). However, contrary to a standard mixture model, here it makes sense that \theta is a smooth curve that evolves with the location i. But estimating it non-parametrically seems too much! Besides, it should be obviously related with [f_t(i) | z_i = 0] = e^{h(i)}, i.e. the fitted effect assuming a positive effect. So, I proposed:

\theta_i = \frac1{1 + \exp\{a + b\,h(i)\}}

which only depends on two parameters a and b that control the level of shrinkage towards 0 or 1 as a function of the distance of the conditional test mean f_t(i) | z_i = 0 from the reference level.

This looked very good, except that a and b tend to grow as much as the priors allow, thus collapsing \theta to 0 and favouring the alternative, strictly positive mean for the test group.

Upon reflection, this is understandable, as the alternative model provides a good explanation of the observations. There is no need for the null model, in any proportion, anywhere, to explain any of the observations. Which is the whole idea of mixture models: explaining different sets of observations with different components.

What happens here is that I want to favour the null model, unless there is compelling evidence for the alternative model. So I started tweaking the priors for a and b and after many trials (they are not exactly independent) I realised I was simply picking values, fitting the model and iterating until I find a convincing result. Which is hardly justifiable as a prior-elicitation procedure. Besides, the whole thing was extremely sensitive and I had to define the priors very precisely. This seemed not much different from fixing some values that worked well! So I did.

So instead of estimating a and b, I chose values that made sense and were somewhat justifiable, based on the distance between the group means and the residual standard deviation \sigma_y. Specifically, I calibrated the model by computing values of a and b such that \theta = \Phi(-1/\sigma) when h = 0 (i.e. the curves are at a unit distance) and \theta = \Phi(-1) when h = \log(\sigma) (i.e. the curves are at a distance of \sigma).

And this gives the results above.

In conclusion,

  • When I went down the mixture-model way, I hoped to be able to infer \theta from the data (and some prior pushing for the null). So, I’m not very excited with the solution.

  • Fixing a model for \theta seems not much different from setting an arbitrary threshold below which I consider the effect null. But at least this approach uses the uncertainty on \sigma during the MCMC which seems better. Although I guess I could have done something equivalent by defining an uncertain threshold using the posterior for \sigma. Now I’m asking myself if this is not an overly complicated way of fixing a threshold that will be harder to explain :)

  • I wonder whether a Hidden Markov Model would have been better than a mixture.

  • In any event, the model seems justifiable and yields reasonable results. From a pragmatic point of view, it does the job.

Sorry for the long post. I hope it was clear and helpful for someone. Thanks for reading and sharing your thoughts!


Thanks for the detailed report. Just adding that your model is similar to zero-inflated models which might provide additional ideas for improving your model.

1 Like

The setup…

and the punchline.

The problem with point null hypotheses (in this case more formally null hypothesis that are nested within the alternative hypothesis but are infinitely small relative to the entire alternative hypothesis) is that the alternative model will almost always be favored without a ridiculous amount of data. In any narrow region around the null hypothesis there will be infinitely many alternative hypothesis configurations that will be consistent with any finite data set.

More formally the power of the hypothesis test is always zero for nested models. This is why everyone “cheats” and introduces a minimal effect size in their power calculations. A minimum effect size, however, is exactly one of those arbitrary and hacky thresholds! These thresholds are always hiding somewhere when you have a nested null hypothesis and finite data.

The reality is that if you want to be able to say anything nontrivial with only finite data then you have to introduce some kind of non-zero threshold. Many methods try to hide this threshold, but it’s always there. Politics aside, the best statistical approach is to embrace the threshold and incorporate into your analysis as transparently as possible.

Note that these problems also manifest in mixture models – if you’re trying to mix model with different dimensions (for example inflating a continuous outcome at a discrete value, also sometimes known as hurdle models) then all you really do is model all observations at that exact discrete value as coming from the inflated component and all of the other observations as coming from the continuous model. If there are any observations not at the inflated value, no matter how close to the inflated value, then any model selection will prefer the continuous model.