Identifiability issues with monotone splines?!

G’day Stanimals,

I am wondering if someone can help me with the following problem.

I have the following simple Stan model

data {
    int<lower=1> Npoints;                    // number of observations
    int<lower=1> m;                          // number of I-spline basis functions
    matrix[Npoints,m] i_spline_basis_evals;  // basis evaluated at the observation points
    vector[Npoints] y;                       // observations
}
parameters {
    vector<lower=0>[m] gammas; 
    real<lower=0> sigma;
}
model {
    gammas ~ normal(0, 2);
    sigma ~ cauchy(0,1);
    y ~ normal(i_spline_basis_evals*gammas, sigma);
}

When I run this on the following synthetic dataset:

library(splines2)
set.seed(42)
tmin <- 0
tmax <- 10
Npoints <- 100
ninterior_knots <- 5
sigma_noise <- .1
times <- seq(tmin,tmax,length.out = Npoints)
knots <- seq(tmin, tmax, length.out = ninterior_knots+2)[2:ninterior_knots+1]
nknots <- length(knots)
mspline_degree <- 3
i_spline_basis_evals <- iSpline(times, knots=knots, degree=mspline_degree+1,
                                intercept=TRUE)
m_spline_basis_evals <- deriv(i_spline_basis_evals)
nbasis <- dim(m_spline_basis_evals)[2]
gammas <- abs(rnorm(nbasis, 0, 1))
y_mean <- as.vector(i_spline_basis_evals %*% as.matrix(gammas))
y <- rnorm(Npoints, y_mean, sigma_noise)

the data look like this:

[figure: the simulated data]
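(For reference, a plot like this can be reproduced with a couple of base-R lines, using the objects from the simulation code above:)

# quick look at the simulated data and the underlying monotone mean curve
plot(times, y, pch = 16, cex = 0.6, xlab = "time", ylab = "y")
lines(times, y_mean, col = "red")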

Note that here I use the so-called I-spline basis functions, which are non-decreasing, so any linear combination with non-negative coefficients is non-decreasing as well (this is essential for another application I am working on; here I use a simplified model to highlight the actual problem).
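(Concretely: the derivative of the fitted curve \sum_j \gamma_j I_j(t) is \sum_j \gamma_j M_j(t), where the M-splines M_j = I_j' are non-negative, so \gamma_j \ge 0 is enough to guarantee a non-negative derivative and hence a non-decreasing fit.)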

Now when I run

library(rstan)
stan_data <- list(
  Npoints = Npoints,
  m = nbasis,
  i_spline_basis_evals = i_spline_basis_evals,
  y = y
)
# sm is the compiled Stan model above (e.g. from rstan::stan_model())
fit <- sampling(sm, data=stan_data, seed=42, chains=4, cores=2, iter=4000)

and inspect the results, I observe the following for the coefficients gammas:

[figure: posterior of the gammas]

This looks suspicious to me! So I did a web search and found that people usually suggest a sum-to-zero constraint on the coefficients in such settings. However, in my particular case I don’t see how that would do the job, since the gammas need to be non-negative (to guarantee that the linear combination of the spline basis functions is non-decreasing).

I know that I can adjust parameters like adapt_delta and max_treedepth to get rid of the divergences, but I still think the model is sub-optimal and that there should be a better way to get around this problem / parametrise the model… Any ideas?

Cheers,

Eren

Addendum:

> print(fit, pars=sprintf("gammas[%d]",1:nbasis), digits=5)
Inference for Stan model: toy_model_ispline.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

             mean se_mean      sd    2.5%     25%     50%     75%   97.5% n_eff    Rhat
gammas[1] 1.32603 0.00201 0.09157 1.15718 1.26174 1.32353 1.38789 1.50739  2076 1.00232
gammas[2] 0.59840 0.00505 0.18640 0.20282 0.47159 0.61565 0.73499 0.91839  1363 1.00404
gammas[3] 0.30071 0.00676 0.20332 0.01605 0.13805 0.26928 0.42530 0.76952   905 1.00553
gammas[4] 0.79245 0.00683 0.19371 0.33493 0.67583 0.81855 0.93443 1.09339   803 1.00512
gammas[5] 0.26692 0.00477 0.16055 0.01720 0.13876 0.25191 0.37577 0.61341  1133 1.00282
gammas[6] 0.27278 0.00321 0.16136 0.01507 0.14216 0.26341 0.38838 0.59912  2524 1.00033
gammas[7] 1.27305 0.00380 0.20059 0.83702 1.14307 1.29261 1.42352 1.60679  2787 0.99993
gammas[8] 0.33318 0.00380 0.19969 0.01880 0.18066 0.31568 0.45781 0.77493  2760 0.99994
gammas[9] 1.86549 0.00279 0.16263 1.53046 1.75967 1.87166 1.97916 2.16111  3406 0.99973

Samples were drawn using NUTS(diag_e) at Sat Oct 20 15:10:08 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

whereas the “true” values of gammas are

[1] 1.37095845 0.56469817 0.36312841 0.63286260 0.40426832 0.10612452 1.51152200 0.09465904 2.01842371

In situations like this where the coefficients have to be non-negative, it can often be good to decompose them into a simplex and a scale factor, like

parameters {
  real<lower = 0> alpha;  // scale factor
  simplex[m] pi;          // primitives of coefficients
  real<lower = 0> sigma;  // error standard deviation
}
transformed parameters {
  vector[m] gammas = alpha * pi;
}
model {
  alpha ~ something;
  // implicit: pi is uniform or you can give it a prior
  sigma ~ cauchy(0, 1);
  y ~ normal(i_spline_basis_evals * gammas, sigma);
}

You may get better sampling that way because the dependence among the gammas is mostly due to the magnitude of alpha.

1 Like

Great, many thanks! Any suggestions for the prior family of alpha? And for pi, would you go with the default (uniform) prior, or would a Dirichlet be better?

I almost always start with an exponential for positive scalars, in which case you could actually write

transformed parameters {
  // prior_mean: a known constant (e.g. passed in as data) giving the prior expectation of the total scale
  vector[m] gammas = prior_mean * alpha * pi;
}
model {
  alpha ~ exponential(1);
  ...
}

For pi, I would start uniform, but that is just a Dirichlet with a vector of ones for the hyperparameters.
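Written out explicitly in Stan, that implicit flat prior on the simplex is the same thing as

pi ~ dirichlet(rep_vector(1, m));  // uniform over the simplex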

2 Likes

Thanks! Another question: I was thinking of increasing the number of knots and using an autoregressive prior for gammas (autoregressive on the log scale). Would that somehow circumvent the above problem? This relates to the discussion we had here.

Maybe. You can also do random-walks on the positive real numbers with exponential priors. Something like

parameters {
  vector<lower = 0>[m] lambda;  // multiplicative innovations
}
transformed parameters {
  vector[m] gammas;
  gammas[1] = lambda[1];
  for (i in 2:m) gammas[i] = gammas[i - 1] * lambda[i];  // multiplicative random walk on the positive reals
}
model {
  lambda ~ exponential(1);
  ...
}
2 Likes

Just tried a variant of the first proposal above, i.e.:

data {
    int<lower=1> Npoints;                    // number of observations
    int<lower=1> m;                          // number of I-spline basis functions
    matrix[Npoints,m] i_spline_basis_evals;  // basis evaluated at the observation points
    vector[Npoints] y;                       // observations
}
parameters {
    simplex[m] pi;          // primitives of coefficients
    real<lower=0> alpha;    // scale factor
    real<lower=0> sigma;  // error standard deviation
}
transformed parameters {
  vector[m] gammas = alpha * pi;
}
model {
    alpha ~ cauchy(0,2);
    sigma ~ cauchy(0,1);
    y ~ normal(i_spline_basis_evals*gammas, sigma);
}

which still results in

[figure: pairs plot for the simplex parameterisation]

So multiplicative random walks?

It looks that way due to the non-centering, but it is equivalent to an exponential prior on the i-th value with expectation equal to the \left(i - 1\right)-th value.
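To spell out the equivalence: since \gamma_i = \gamma_{i-1}\,\lambda_i with \lambda_i \sim \text{Exponential}(1), conditioning on \gamma_{i-1} just rescales the exponential, so \gamma_i \mid \gamma_{i-1} \sim \text{Exponential}\!\left(1/\gamma_{i-1}\right) (rate parametrisation), which indeed has mean \gamma_{i-1}.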

1 Like

Got it, thanks!

Now, coming back to your first proposal above: it still results in a sub-optimal posterior from my perspective (see above). Is there any way to reparametrise it?

For your reference, I am using i-splines, see here.

I have long speculated that a polar decomposition might be more useful than a QR decomposition. In your case, you have \boldsymbol{\mu} = \mathbf{X} \boldsymbol{\gamma}, where \mathbf{X} is the spline basis and \boldsymbol{\gamma} is elementwise positive. If we write \mathbf{X} = \mathbf{U}\mathbf{P} in the form of a polar decomposition, then \boldsymbol{\mu} = \mathbf{U} \mathbf{P} \boldsymbol{\gamma} = \mathbf{U} \boldsymbol{\theta} where \boldsymbol{\theta} = \mathbf{P} \boldsymbol{\gamma}. I wonder if there are any simple constraints on \boldsymbol{\theta} that satisfy the monotonicity requirement, in which case we could do the regression with \mathbf{U} and \boldsymbol{\theta} instead of \mathbf{X} and \boldsymbol{\gamma} and then obtain \boldsymbol{\gamma} = \mathbf{P}^{-1} \boldsymbol{\theta} as a generated quantity.

2 Likes

Is there a particular intuitive geometric interpretation of P that we could utilize or would it make sense to adopt a spectral perspective (knowing P is a positive-semidefinite Hermitian matrix)?

\mathbf{P} can loosely be interpreted as a “scale” factor. That is why I thought it might work.

Is there R code to do this decomposition? Googling points to this, but playing around with it has me convinced it’s not doing what you (and the polar decomposition Wikipedia page) are suggesting. I know I could go to the source, but that seems like a wheel reinventing opportunity.

I guess one could initially check by propagating a whole lot of samples of \gamma through \mathbf{P} and looking at the resulting pairs plot to see if there are any obvious truncations / restrictions on \theta to impose.

Well, 2.5 years later… I was interested in monotone splines and did the polar decomposition. No divergences or treedepth issues with the simulated data above. ESS is way higher too.

The polar decomposition can be accomplished with an SVD. What’s really cool is that the Stan release candidate now has SVD in it! I pass the P and U matrices to Stan.
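(For reference: if the thin SVD of the basis matrix is \mathbf{X} = \mathbf{U}_s \mathbf{D} \mathbf{V}^\top, then the polar factors are \mathbf{U} = \mathbf{U}_s \mathbf{V}^\top and \mathbf{P} = \mathbf{V} \mathbf{D} \mathbf{V}^\top, which is exactly what spline_U and spline_P below compute.)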

library(splines2)
library(cmdstanr)
set.seed(42)
tmin <- 0
tmax <- 10
Npoints <- 100
ninterior_knots <- 5
sigma_noise <- .1
times <- seq(tmin,tmax,length.out = Npoints)
knots <- seq(tmin, tmax, length.out = ninterior_knots+2)[2:ninterior_knots+1]
nknots <- length(knots)
mspline_degree <- 3
i_spline_basis_evals <- iSpline(times, knots=knots, degree=mspline_degree+1,
                                intercept=TRUE)
m_spline_basis_evals <- deriv(i_spline_basis_evals)
nbasis <- dim(m_spline_basis_evals)[2]
gammas <- abs(rnorm(nbasis, 0, 1))
y_mean <- as.vector(i_spline_basis_evals %*% as.matrix(gammas))
y <- rnorm(Npoints, y_mean, sigma_noise)

spline_svd <- svd(i_spline_basis_evals)                               # X = U_s D V^T
spline_P <- spline_svd$v %*% diag(spline_svd$d) %*% t(spline_svd$v)   # P = V D V^T
spline_U <- spline_svd$u %*% t(spline_svd$v)                          # U = U_s V^T

stan_data <- list(
  Npoints = Npoints,
  m = nbasis,
  spline_U = spline_U,
  spline_P = spline_P,
  y = y
)
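A quick sanity check on the decomposition (a sketch, using the objects defined above):

# U should have orthonormal columns, and U %*% P should reconstruct the basis matrix
max(abs(crossprod(spline_U) - diag(nbasis)))            # ~ machine precision
max(abs(spline_U %*% spline_P - i_spline_basis_evals))  # ~ machine precision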

I have no idea if this decomposition trick is generalizable to other problems. I inspected P and P \gamma to see if any constraint makes sense, following Ben’s comment above. But it seems to sample fine without any extra constraints. However, gamma is not quite identified, if you’re really interested in that.

data {
    int<lower=1> Npoints;          // number of observations
    int<lower=1> m;                // number of I-spline basis functions
    matrix[Npoints, m] spline_U;   // orthogonal factor U of the basis matrix
    matrix[m, m] spline_P;         // symmetric positive-definite factor P
    vector[Npoints] y;
}
parameters {
    vector[m] theta;      // coefficients on the orthogonal basis U
    real<lower=0> sigma;  // error standard deviation
}
model {
    sigma ~ std_normal();
    y ~ normal(spline_U * theta, sigma);
}
generated quantities {
  vector[m] gamma = mdivide_left_spd(spline_P, reverse(theta));
}
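For completeness, a sketch of fitting this with cmdstanr (the file name ispline_polar.stan is just a placeholder for wherever you save the model above):

mod <- cmdstan_model("ispline_polar.stan")
fit <- mod$sample(data = stan_data, seed = 42, chains = 4, parallel_chains = 2)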

Here’s the summary of gamma from the model

> fit$summary("gamma")
# A tibble: 9 x 10
  variable    mean    median    sd   mad     q5   q95  rhat ess_bulk ess_tail
  <chr>      <dbl>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 gamma[1] 1.26     1.26     0.119 0.116  1.07  1.46   1.00    2670.    2339.
2 gamma[2] 0.802    0.807    0.298 0.296  0.306 1.28   1.00    2458.    2470.
3 gamma[3] 0.00609 -0.000957 0.402 0.394 -0.657 0.681  1.00    2444.    2359.
4 gamma[4] 1.07     1.07     0.420 0.417  0.391 1.76   1.00    2450.    2459.
5 gamma[5] 0.0849   0.0886   0.382 0.373 -0.522 0.701  1.00    2354.    2458.
6 gamma[6] 0.347    0.342    0.360 0.339 -0.240 0.939  1.00    2292.    2345.
7 gamma[7] 1.29     1.30     0.353 0.344  0.704 1.86   1.00    2281.    2154.
8 gamma[8] 0.261    0.259    0.297 0.290 -0.203 0.745  1.00    2256.    2396.
9 gamma[9] 1.92     1.92     0.200 0.199  1.60  2.24   1.00    2196.    2561.

vs true gamma

> gammas
[1] 1.37095845 0.56469817 0.36312841 0.63286260 0.40426832 0.10612452 1.51152200 0.09465904 2.01842371

[figure: pairs plot]

7 Likes