Identifiability issues with monotone splines?!

G’day Stanimals,

I am wondering if someone can help me with the following problem.

I have the following simple Stan model

data {
    int<lower=1> Npoints;                    // number of observations
    int<lower=1> m;                          // number of I-spline basis functions
    matrix[Npoints,m] i_spline_basis_evals;  // basis evaluated at the observation points
    vector[Npoints] y;                       // observations
}
parameters {
    vector<lower=0>[m] gammas; 
    real<lower=0> sigma;
}
model {
    gammas ~ normal(0, 2);
    sigma ~ cauchy(0,1);
    y ~ normal(i_spline_basis_evals*gammas, sigma);
}

When I run this on the following synthetic dataset:

library(splines2)
set.seed(42)
tmin <- 0
tmax <- 10
Npoints <- 100
ninterior_knots <- 5
sigma_noise <- .1
times <- seq(tmin,tmax,length.out = Npoints)
knots <- seq(tmin, tmax, length.out = ninterior_knots+2)[2:ninterior_knots+1]
nknots <- length(knots)
mspline_degree <- 3
i_spline_basis_evals <- iSpline(times, knots=knots, degree=mspline_degree+1,
                                intercept=TRUE)
m_spline_basis_evals <- deriv(i_spline_basis_evals)
nbasis <- dim(m_spline_basis_evals)[2]
gammas <- abs(rnorm(nbasis, 0, 1))
y_mean <- as.vector(i_spline_basis_evals %*% as.matrix(gammas))
y <- rnorm(Npoints, y_mean, sigma_noise)

the data look like this:

[figure: the simulated data]
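(For reference, a plot like this can be reproduced with a couple of base-R lines, using the objects from the simulation code above:)

# quick look at the simulated data and the underlying monotone mean curve
plot(times, y, pch = 16, cex = 0.6, xlab = "time", ylab = "y")
lines(times, y_mean, col = "red")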

Note that here I use the so-called I-spline basis functions, which are non-decreasing, so any linear combination with non-negative coefficients is non-decreasing as well (this is essential for another application I am working on; here I use a simplified model to highlight the actual problem).
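(Concretely: the derivative of the fitted curve \sum_j \gamma_j I_j(t) is \sum_j \gamma_j M_j(t), where the M-splines M_j = I_j' are non-negative, so \gamma_j \ge 0 is enough to guarantee a non-negative derivative and hence a non-decreasing fit.)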

Now when I run

library(rstan)
stan_data <- list(
  Npoints = Npoints,
  m = nbasis,
  i_spline_basis_evals = i_spline_basis_evals,
  y = y
)
# sm is the compiled Stan model above (e.g. from rstan::stan_model())
fit <- sampling(sm, data=stan_data, seed=42, chains=4, cores=2, iter=4000)

and inspect the results, I observe the following for the coefficients gammas:

[figure: posterior of the gammas]

This looks suspicious to me! So I did a web search and found that people usually suggest a sum-to-zero constraint on the coefficients in such settings. However, in my particular case I don’t see how that would do the job, since the gammas need to be non-negative (to guarantee that the linear combination of the spline basis functions is non-decreasing).

I know that I can adjust parameters like adapt_delta and max_treedepth to get rid of the divergences, but I still think the model is sub-optimal and that there should be a better way to get around this problem / parametrise the model… Any ideas?

Cheers,

Eren

Addendum:

> print(fit, pars=sprintf("gammas[%d]",1:nbasis), digits=5)
Inference for Stan model: toy_model_ispline.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

             mean se_mean      sd    2.5%     25%     50%     75%   97.5% n_eff    Rhat
gammas[1] 1.32603 0.00201 0.09157 1.15718 1.26174 1.32353 1.38789 1.50739  2076 1.00232
gammas[2] 0.59840 0.00505 0.18640 0.20282 0.47159 0.61565 0.73499 0.91839  1363 1.00404
gammas[3] 0.30071 0.00676 0.20332 0.01605 0.13805 0.26928 0.42530 0.76952   905 1.00553
gammas[4] 0.79245 0.00683 0.19371 0.33493 0.67583 0.81855 0.93443 1.09339   803 1.00512
gammas[5] 0.26692 0.00477 0.16055 0.01720 0.13876 0.25191 0.37577 0.61341  1133 1.00282
gammas[6] 0.27278 0.00321 0.16136 0.01507 0.14216 0.26341 0.38838 0.59912  2524 1.00033
gammas[7] 1.27305 0.00380 0.20059 0.83702 1.14307 1.29261 1.42352 1.60679  2787 0.99993
gammas[8] 0.33318 0.00380 0.19969 0.01880 0.18066 0.31568 0.45781 0.77493  2760 0.99994
gammas[9] 1.86549 0.00279 0.16263 1.53046 1.75967 1.87166 1.97916 2.16111  3406 0.99973

Samples were drawn using NUTS(diag_e) at Sat Oct 20 15:10:08 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

whereas the “true” values of gammas are

[1] 1.37095845 0.56469817 0.36312841 0.63286260 0.40426832 0.10612452 1.51152200 0.09465904 2.01842371

In situations like this where the coefficients have to be non-negative, it can often be good to decompose them into a simplex and a scale factor, like

parameters {
  real<lower = 0> alpha;  // scale factor
  simplex[m] pi;          // primitives of coefficients
  real<lower = 0> sigma;  // error standard deviation
}
transformed parameters {
  vector[m] gammas = alpha * pi;
}
model {
  alpha ~ something;
  // implicit: pi is uniform or you can give it a prior
  sigma ~ cauchy(0, 1);
  y ~ normal(i_spline_basis_evals * gammas, sigma);
}

You may get better sampling that way because the dependence among the gammas is mostly due to the magnitude of alpha.

1 Like

Great, many thanks! Any suggestions for the prior family of alpha? And for pi, would you go with the default (uniform) prior, or would a Dirichlet be better?

I almost always start with an exponential for positive scalars, in which case you could actually write

transformed parameters {
  // prior_mean: a known constant (e.g. passed in as data) giving the prior expectation of the total scale
  vector[m] gammas = prior_mean * alpha * pi;
}
model {
  alpha ~ exponential(1);
  ...
}

For pi, I would start uniform, but that is just a Dirichlet with a vector of ones for the hyperparameters.
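Written out explicitly in Stan, that implicit flat prior on the simplex is the same thing as

pi ~ dirichlet(rep_vector(1, m));  // uniform over the simplex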

2 Likes

Thanks! Another question: I was thinking of increasing the number of knots and using an autoregressive prior for gammas (autoregressive on the log scale). Would that somehow circumvent the above problem? This relates to the discussion we had here.

Maybe. You can also do random-walks on the positive real numbers with exponential priors. Something like

parameters {
  vector<lower = 0>[m] lambda;  // multiplicative innovations
}
transformed parameters {
  vector[m] gammas;
  gammas[1] = lambda[1];
  for (i in 2:m) gammas[i] = gammas[i - 1] * lambda[i];  // multiplicative random walk on the positive reals
}
model {
  lambda ~ exponential(1);
  ...
}
2 Likes

Just tried a variant of the first proposal above, i.e.:

data {
    int<lower=1> Npoints;                    // number of observations
    int<lower=1> m;                          // number of I-spline basis functions
    matrix[Npoints,m] i_spline_basis_evals;  // basis evaluated at the observation points
    vector[Npoints] y;                       // observations
}
parameters {
    simplex[m] pi;          // primitives of coefficients
    real<lower=0> alpha;    // scale factor
    real<lower=0> sigma;  // error standard deviation
}
transformed parameters {
  vector[m] gammas = alpha * pi;
}
model {
    alpha ~ cauchy(0,2);
    sigma ~ cauchy(0,1);
    y ~ normal(i_spline_basis_evals*gammas, sigma);
}

which still results in

[figure: pairs plot for the simplex parameterisation]

So multiplicative random walks?

It looks that way due to the non-centering, but it is equivalent to an exponential prior on the i-th value with expectation equal to the \left(i - 1\right)-th value.
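To spell out the equivalence: since \gamma_i = \gamma_{i-1}\,\lambda_i with \lambda_i \sim \text{Exponential}(1), conditioning on \gamma_{i-1} just rescales the exponential, so \gamma_i \mid \gamma_{i-1} \sim \text{Exponential}\!\left(1/\gamma_{i-1}\right) (rate parametrisation), which indeed has mean \gamma_{i-1}.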

1 Like

Got it, thanks!

Now, coming back to your first proposal above: it still results in a sub-optimal posterior from my perspective (see above). Is there any way to reparametrise it?

For your reference, I am using i-splines, see here.

I have long speculated that a polar decomposition might be more useful than a QR decomposition. In your case, you have \boldsymbol{\mu} = \mathbf{X} \boldsymbol{\gamma}, where \mathbf{X} is the spline basis and \boldsymbol{\gamma} is elementwise positive. If we write \mathbf{X} = \mathbf{U}\mathbf{P} in the form of a polar decomposition, then \boldsymbol{\mu} = \mathbf{U} \mathbf{P} \boldsymbol{\gamma} = \mathbf{U} \boldsymbol{\theta} where \boldsymbol{\theta} = \mathbf{P} \boldsymbol{\gamma}. I wonder if there are any simple constraints on \boldsymbol{\theta} that satisfy the monotonicity requirement, in which case we could do the regression with \mathbf{U} and \boldsymbol{\theta} instead of \mathbf{X} and \boldsymbol{\gamma} and then obtain \boldsymbol{\gamma} = \mathbf{P}^{-1} \boldsymbol{\theta} as a generated quantity.

2 Likes

Is there a particular intuitive geometric interpretation of P that we could utilize or would it make sense to adopt a spectral perspective (knowing P is a positive-semidefinite Hermitian matrix)?

\mathbf{P} can loosely be interpreted as a “scale” factor. That is why I thought it might work.

Is there R code to do this decomposition? Googling points to this, but playing around with it has me convinced it’s not doing what you (and the polar decomposition Wikipedia page) are suggesting. I know I could go to the source, but that seems like a wheel reinventing opportunity.

I guess one could initially check by propagating a whole lot of samples of \gamma through \mathbf{P} and looking at the resulting pairs plot to see if there are any obvious truncations / restrictions on \theta to impose.

Well, 2.5 years later… I was interested in monotone splines and did the polar decomposition. No divergences or treedepth issues with the simulated data above. ESS is way higher too.

The polar decomposition can be accomplished with an SVD. What’s really cool is that the Stan release candidate now has SVD in it! I pass the P and U matrices to Stan.
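(For reference: if the thin SVD of the basis matrix is \mathbf{X} = \mathbf{U}_s \mathbf{D} \mathbf{V}^\top, then the polar factors are \mathbf{U} = \mathbf{U}_s \mathbf{V}^\top and \mathbf{P} = \mathbf{V} \mathbf{D} \mathbf{V}^\top, which is exactly what spline_U and spline_P below compute.)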

library(splines2)
library(cmdstanr)
set.seed(42)
tmin <- 0
tmax <- 10
Npoints <- 100
ninterior_knots <- 5
sigma_noise <- .1
times <- seq(tmin,tmax,length.out = Npoints)
knots <- seq(tmin, tmax, length.out = ninterior_knots+2)[2:ninterior_knots+1]
nknots <- length(knots)
mspline_degree <- 3
i_spline_basis_evals <- iSpline(times, knots=knots, degree=mspline_degree+1,
                                intercept=TRUE)
m_spline_basis_evals <- deriv(i_spline_basis_evals)
nbasis <- dim(m_spline_basis_evals)[2]
gammas <- abs(rnorm(nbasis, 0, 1))
y_mean <- as.vector(i_spline_basis_evals %*% as.matrix(gammas))
y <- rnorm(Npoints, y_mean, sigma_noise)

spline_svd <- svd(i_spline_basis_evals)                               # X = U_s D V^T
spline_P <- spline_svd$v %*% diag(spline_svd$d) %*% t(spline_svd$v)   # P = V D V^T
spline_U <- spline_svd$u %*% t(spline_svd$v)                          # U = U_s V^T

stan_data <- list(
  Npoints = Npoints,
  m = nbasis,
  spline_U = spline_U,
  spline_P = spline_P,
  y = y
)
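A quick sanity check on the decomposition (a sketch, using the objects defined above):

# U should have orthonormal columns, and U %*% P should reconstruct the basis matrix
max(abs(crossprod(spline_U) - diag(nbasis)))            # ~ machine precision
max(abs(spline_U %*% spline_P - i_spline_basis_evals))  # ~ machine precision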

I have no idea if this decomposition trick is generalizable to other problems. I inspected P and P \gamma to see if any constraint makes sense, following Ben’s comment above. But it seems to sample fine without any extra constraints. However, gamma is not quite identified, if you’re really interested in that.

data {
    int<lower=1> Npoints;          // number of observations
    int<lower=1> m;                // number of I-spline basis functions
    matrix[Npoints, m] spline_U;   // orthogonal factor U of the basis matrix
    matrix[m, m] spline_P;         // symmetric positive-definite factor P
    vector[Npoints] y;
}
parameters {
    vector[m] theta;      // coefficients on the orthogonal basis U
    real<lower=0> sigma;  // error standard deviation
}
model {
    sigma ~ std_normal();
    y ~ normal(spline_U * theta, sigma);
}
generated quantities {
  vector[m] gamma = mdivide_left_spd(spline_P, reverse(theta));
}
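For completeness, a sketch of fitting this with cmdstanr (the file name ispline_polar.stan is just a placeholder for wherever you save the model above):

mod <- cmdstan_model("ispline_polar.stan")
fit <- mod$sample(data = stan_data, seed = 42, chains = 4, parallel_chains = 2)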

Here’s the summary of gamma from the model

> fit$summary("gamma")
# A tibble: 9 x 10
  variable    mean    median    sd   mad     q5   q95  rhat ess_bulk ess_tail
  <chr>      <dbl>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 gamma[1] 1.26     1.26     0.119 0.116  1.07  1.46   1.00    2670.    2339.
2 gamma[2] 0.802    0.807    0.298 0.296  0.306 1.28   1.00    2458.    2470.
3 gamma[3] 0.00609 -0.000957 0.402 0.394 -0.657 0.681  1.00    2444.    2359.
4 gamma[4] 1.07     1.07     0.420 0.417  0.391 1.76   1.00    2450.    2459.
5 gamma[5] 0.0849   0.0886   0.382 0.373 -0.522 0.701  1.00    2354.    2458.
6 gamma[6] 0.347    0.342    0.360 0.339 -0.240 0.939  1.00    2292.    2345.
7 gamma[7] 1.29     1.30     0.353 0.344  0.704 1.86   1.00    2281.    2154.
8 gamma[8] 0.261    0.259    0.297 0.290 -0.203 0.745  1.00    2256.    2396.
9 gamma[9] 1.92     1.92     0.200 0.199  1.60  2.24   1.00    2196.    2561.

vs true gamma

> gammas
[1] 1.37095845 0.56469817 0.36312841 0.63286260 0.40426832 0.10612452 1.51152200 0.09465904 2.01842371

[figure: pairs plot]

7 Likes