# Curve fitting -- including convexity constraint

Suppose one has an unknown univariate function f : \mathbb{R}^+ \to \mathbb{R} such that f is differentiable and strictly convex. Suppose further one has some data \{ X , Y = f(X) \}. The ultimate goal is to estimate some function g_\theta : \mathbb{R}^+ \to \mathbb{R} that approximates f, such that one can estimate the indexing parameters \theta and then fix them (to their posterior means, say) in a future Stan program.

I would appreciate insights and references to approaches that can accommodate what we know about f. I’m not very fluent on either Gaussian processes or splines, but I guess these would be good methods to check out. What I don’t know is how to enforce convexity, for instance.

Edit: we also have an important constraint that f(0) = 0.

Do you know what values of X you will want to evaluate the estimated function at in your second model? If this is the case, then you could construct an I-Spline basis at the X values of interest in the first and second model, and fix the basis function parameter estimates to their posterior means. Shape constrained GPs are certainly a thing (I don’t have a good reference for them, but Estimating Shape Constrained Functions Using Gaussian Processes seems relevant) and may be an option if you don’t know the X values of interest in your second model, but I have zero idea about getting the posterior estimate into Stan.
Shape constrained polynomials are also an option (I’ve written some Stan code to do them, but it’s pretty ugly and the parameterisation is not necessarily unique, which makes posterior means somewhat interesting).

There is a lot out there on shape constrained regression (if you want to dive in, maybe start here: Monotone Regression Splines in Action, and trawl through the many articles that cite it). This one also seems relevant: Monotonic Regression Based on Bayesian P–Splines, and I think their truncated gaussian distribution in Section 3.1 could maybe be substituted with an intercept term and a positive_ordered vector in Stan, although as you have f(0) = 0 intercepts are probably not that big of a concern for you :).

1 Like

I was interested in this, so here’s some code to do it with B-Splines

library(splines)
library(rstan)
library(ggplot2)

x <- seq(from = 0, to = 1, length.out = 1000)
bs_mat <- bs(x)

stan_data <- list(
spline_deg = ncol(bs_mat),
n_points = nrow(bs_mat),
x_mat = bs_mat
)

model_prefit <- stan_model("mono-spline.stan")
model_fit <- sampling(
model_prefit,
data = stan_data
)

prior_pred_samples <- rstan::extract(model_fit, "prior_pred_y")[[1]]
prior_pred_quantiles <- matrixStats::colQuantiles(prior_pred_samples, probs = c(0.1, 0.5, 0.9))

plot_df <- data.frame(
x = x,
lower = prior_pred_quantiles[, 1],
median = prior_pred_quantiles[, 2],
upper = prior_pred_quantiles[, 3]
)

ggplot(plot_df, aes(x = x)) +
geom_line(aes(y = median)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.25)


With Stan code:

data {
int spline_deg;
int n_points;
matrix [n_points, spline_deg] x_mat;
}

parameters {
positive_ordered [spline_deg] theta;
}

model {
theta ~ normal(0, 5);
}

generated quantities {
vector [n_points] prior_pred_y = x_mat * theta;
}


1 Like

I don’t. But I can bound the domain of f without much loss of generality. I suppose this would allow me to interpolate easily.

Thanks for the references.

@avehtari has published a paper about monotocity GP with code in GPstuff.

1 Like

This doesn’t quite work as it is written, here’s something that does (also demonstrates interpolation / fitting to data)

library(splines)
library(rstan)
library(ggplot2)
library(MonoPoly)

extra_width <- 0.1
x_pred <- seq(from = -1 - extra_width , to = 1 + extra_width, length.out = 1000)

n_knots <- 10
boundary_knots <- c(min(x_pred), max(x_pred))
internal_knots <- seq(
from = boundary_knots[1],
to = boundary_knots[2],
length.out = n_knots
)[2 : (n_knots - 1)]

bs_data_mat <- bs(
hawkins$x, knots = internal_knots, Boundary.knots = boundary_knots ) bs_pred_mat <- bs( x_pred, knots = internal_knots, Boundary.knots = boundary_knots ) # stan_data <- list( n_knots = ncol(bs_data_mat), n_pred_points = length(x_pred), x_pred_mat = bs_pred_mat, n_data_points = nrow(hawkins), x_data_mat = bs_data_mat, y = hawkins$y
)

model_prefit <- stan_model("mono-spline.stan")
model_fit <- sampling(
model_prefit,
data = stan_data,
cores = 4
)

post_pred_samples <- rstan::extract(model_fit, "post_pred_y")[[1]]
post_pred_quantiles <- matrixStats::colQuantiles(post_pred_samples, probs = c(0.01, 0.5, 0.99))

plot_df <- data.frame(
x = x,
lower = post_pred_quantiles[, 1],
median = post_pred_quantiles[, 2],
upper = post_pred_quantiles[, 3]
)

ggplot(plot_df, aes(x = x)) +
geom_point(data = hawkins, inherit.aes = F, aes(x = x, y = y)) +
geom_line(aes(y = median)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.25)


data {
int n_knots;

// the prediction spline
int n_pred_points;
matrix [n_pred_points, n_knots] x_pred_mat;

// the data spline
int n_data_points;
matrix [n_data_points, n_knots] x_data_mat;

vector [n_data_points] y;
}

parameters {
positive_ordered [n_knots] theta;
real <lower = 0> y_sd;
real intercept;
}

model {
theta ~ normal(0, 15);
y ~ normal(intercept + x_data_mat * theta, y_sd);
y_sd ~ normal(0, 4);
intercept ~ normal(0, 10);
}

generated quantities {
vector [n_pred_points] post_pred_y = intercept + x_pred_mat * theta;
}



The intercept term is needed for the spline to remain monotonic and fit the data in the regions where y < 0, but this does not satisfy your requirement for f(0) = 0. I don’t have any thoughts off the top of my head for how to build this in as well. Edit: If f only takes nonnegative values, and you want f(0) = 0 and f to be monotonic (convexity is slightly harder) you can drop the intercept, and it should be fine.

Thanks a lot. f is not monotonic, though. I’ll work through the code and see what works for me and what doesn’t.

1 Like

If you drop in:

bs_data_mat <- splines2::cSpline(
hawkins\$x,
knots = internal_knots,
Boundary.knots = boundary_knots
)

bs_pred_mat <- splines2::cSpline(
x_pred,
knots = internal_knots,
Boundary.knots = boundary_knots
)


you should get a convex spline instead (which is a terrible fit to this data, but actually sounds like your problem).

OK, here’s a question for @avehtari , @betanalpha , @anon79882417 or @rtrangucci: how exactly does one incorporate the convexity constraint in a GP?
Suppose I have a set of n measured inputs t and want to enforce convexity at a collection (grid) of m inputs s.
Drawing on papers by Riihimaki & Vehtari and Wang & Berger (linked above by @hhau), I can write the covariance matrix between the process Z and its second derivative, Z^{\prime\prime} as

\begin{bmatrix} Z(x) \\ Z^{\prime\prime}(s) \end{bmatrix} \sim \text{multivariate_normal}\left( \begin{bmatrix} m(x) \\ 0_m \end{bmatrix} , \begin{bmatrix} K(x,x) & K^{02}(x, s) \\ K^{20}(s, x) & K^{22}(s, s) \end{bmatrix} \right)

where we have K(t, t') = \text{quad_cov_exp}(t, t', \alpha, \rho), K^{02} = \left( k^{20}(t, s) \right), K^{20} = \text{transpose}(K^{02}), K^{22} = \left( k^{22}(s, s) \right), and

k^{20}(t, s) = \alpha^2 \left( - \frac{(t-s)^2}{2\rho^2} \right)\left( \frac{(t-s)^2}{\rho^4} - \frac{1}{\rho^2}\right),\\ k^{22}(s, s') = \alpha^2 \left( - \frac{(s-s')^2}{2\rho^2} \right)\frac{1}{\rho^4}\left( \frac{(s-s')^2 }{\rho^4} -\frac{6(s-s')^2}{\rho^2} + 3\right).

Here is my feeble attempt at fitting the GP with the constraint that Z^{\prime\prime} > 0\:\forall s:

functions{
/*Functions to create all three covariance (sub) matrices (fourth is just cov_quad_exp) */
real K22(real s, real sp, real alpha, real rho){
real sq_delta = square(s-sp);
real sq_rho = square(rho);
real cov = square(alpha) * exp(-sq_delta/(2*sq_rho)) * 1/square(sq_rho) * ( square(sq_delta)/square(sq_rho) - (6*sq_delta/sq_rho) + 3 );
return(cov);
}
matrix K22_cov_mat(real[] s,  real alpha, real rho){
int K = size(s);
matrix[K, K] X;
real diag_entry = 3*square(alpha/square(rho));
for (i in 1:K) {
X[i, i] = diag_entry;
for (j in 1:K) {
X[i, j] = K22(s[i], s[j], alpha, rho);
X[j, i] = X[i, j];
}
}
return(X);
}
real K20(real t, real s, real alpha, real rho){
real cov; // computation in log-space for numerical stability
real sq_delta = square(t-s);
real sq_rho = square(rho);
cov = square(alpha) *  exp(-sq_delta/(2*sq_rho)) * ( sq_delta / square(sq_rho) - 1/sq_rho);
return(cov);
}
matrix K20_cov_mat(real[] x, real[] s, real alpha, real rho){
int N = size(x);
int K = size(s);
matrix[N, K] Y;
for (i in 1:N) {
for (j in 1:K) {
Y[i, j] = K20(x[i], s[j], alpha, rho);
}
}
return(Y);
}
matrix ffprime_cov_mat(real [] x, real[] s, real alpha, real rho){
int N = size(x);
int K = size(s);
int J = N + K;
matrix[N, N] A;
matrix[N, K] B;
matrix[K, K] C;
matrix[J, J] M;
// | A, B|
// | B',C|
B = K20_cov_mat(x, s, alpha, rho);
C = K22_cov_mat(s, alpha, rho);
/*Filling in the matrix*/
M[1:N, 1:N] = A;
M[1:N, (N+1):J] = B;
M[(N + 1):J, (N+1):J] = C;
M[(N + 1):J, 1:N] = B';
return(M);
}
matrix full_cov_mat(real [] x, real[] s, real[] x_pred, real alpha, real rho){
int N = size(x);
int K = size(s);
int N_pred = size(x_pred);
int W = N + K + N_pred;
matrix[W, W] M;
matrix[N_pred, K] Q = K20_cov_mat(x_pred, s, alpha, rho);
/* building sub-matrices */
// | (f, f)     (f,f'')  (f, f_p)   |
// | (f'',f)  (f'', f'') (f'', f_p) |
// | (f_p, f) (f_p, f'') (f_p, f_p) |
M[1:(N + K), 1:(N + K)] = ffprime_cov_mat(x, s, alpha, rho);
M[(N + K + 1):W, 1:N] = cov_exp_quad(x_pred, x, alpha, rho);
M[(N + K + 1):W, (N + 1):(N + K)] = Q;
M[1:N, (N + K + 1):W] = cov_exp_quad(x, x_pred, alpha, rho);
M[(N + 1):(N + K), (N + K + 1):W] = Q';
M[(N + 1 + K):W, (N + 1 + K):W] = cov_exp_quad(x_pred, x_pred, alpha, rho);
return(M);
}
}
data {
int<lower=1> N1;
real x1[N1];
vector[N1] y1;
int<lower=1> N2;
real x2[N2];
int<lower=1> N3;
real x3[N3];
real<lower=0> nu;
}
transformed data {
real delta = 1e-9;
int<lower=1> N_total = N1 + N2 + N3;
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
vector[N_total] eta;
}
transformed parameters {
vector[N_total] f;
{
matrix[N_total, N_total] L_K;
matrix[N_total, N_total] K = full_cov_mat(x1, x2, x3, alpha, rho) + diag_matrix(rep_vector(delta, N_total));
L_K = cholesky_decompose(K);
f = L_K * eta;
}
}
model {
rho ~ inv_gamma(5, 5);
alpha ~ normal(0, 1);
sigma ~ normal(0, 1);
eta ~ normal(0, 1);
y1 ~ normal(f[1:N1], sigma);
target += log(Phi(f[(N1 + 1):(N1 + N2)]./nu)); // feeble attempt at enforcing convexity
}
generated quantities {
vector[N3] y_pred;
for (n3 in 1:N3)
y_pred[n3] = normal_rng(f[N1 + N2 + n3], sigma);
}


This leads to a myriad of initialisation problems and/or divergences. Besides, even when it works, the predictions do not match the measured curve in any appreciable degreee. If I run this program without the constraint, the fitted (predicted) curve resembles the measured one, but the whole idea of including convexity is to help the GP learn the function better and capture its curvature.
I can share R code and data if needed. I can also move this to a thread of its own if people judge it appropriate.

I’m sure someone else would have a better explanation…

I only took a quick look at your code, and it looks fine. Similar to what I’ve seen for monotonic GPs (covariance matrix expanded by partial derivatives). If you want to do a quick check to see if that’s implemented correctly, you could probably simulate some convex function + noise and then just visualize the fit, and don’t use HMC to fit it, just use MAP optimization (BFGS).

But I want to note that I’ve seen divergences with Stan’s adaptive HMC on simpler models, for example using a GP prior for a single parameter weibull or log-gaussian survival model. For that reason, I would often use MAP optimization (in Stan, it’s BFGS) (and many people will cringe at me saying this).

If you check out other packages that are specialized for GPs, such as GPstuff, you’ll notice that many are also using MAP optimization (laplace approximation, BFGS, sometimes EP, INLA), and not HMC. It’s because the audience of Stan is a bit catered toward people who fit hierarchical models, and understandably so, and Stan has yet to get good GP support.

My guess is that the posterior gets more complex than even in the funnel common in hierarchical models. For that reason, it the Hamiltonian trajectory is more often hitting a boundary causing the divergence.

It might be useful to consider adjusting some of the tuning parameters in Stan’s adaptive HMC, but this might require digging through C++ code (I’m guessing increased adapt delta, and the others you can check out in the cmdstan guide, but it doesn’t exactly describe what function they have in HMC, might require digging through C++ code or sifting through related pubs, apologies for my ignorance). With that said, what’s the best place to find info about the Stan’s HMC tuning parameters?

I’d be one of those people ;-)

I’m more than willing to experiment with that, but I wanted to make sure the model looked sane before I went about messing with the computational specs.

That’s good advice. I’ll have a look at what the simulations from this model look like. Thanks.

I think that paper is behind a paywall. I could not read it. Thus in the following I may say something stupid.

How is Z′′>0∀s ensured in the code? That’s the criteria for convexity, if I’m not wrong.
That could be addressed by reformulation to a Lagrange - Multiplier problem and
the keyword I read somewhere else is  bordered hessian matrix.

If this problem could not be solved easily maybe the algebra solver may a help for.

Updated. I think you can do it with:

You’re not wrong. In a 2010 paper (linked above), @avehtari proposes using a probit likelihood to enforce the constraint and that’s what I have tried to replicate here, without much success.

Yes, he used “something” to find a positive solution using the EP algorithmn. See that paper you
referenced on p. 647 right side:

“Since the likelihood for the derivative observations in
(8) is not Gaussian, the posterior is analytically intractable. We apply the EP algorithm, and compute
the Gaussian approximation for the posterior distribution.”

Where you have already have a solution and put something on it. For me that’s two different things.

Update: I simplified your code and used the sampling tMVN from @bgoodri
Sometimes it produces : Exception: Exception: Phi: x is -nan, but must not be nan!
real v; real u_star = Phi(z_star);

functions{
vector[] make_stuff(vector mu, matrix L, vector b, vector s, vector u) {
int K = rows(mu); vector[K] d; vector[K] z; vector[K] out[2];
for (k in 1:K) {
int km1 = k - 1;
if (s[k] != 0) {
real z_star = (b[k] -
(mu[k] + ((k > 1) ? L[k,1:km1] * head(z, km1) : 0))) /
L[k,k];
real v; real u_star = Phi(z_star);
if (s[k] == -1) {
v = u_star * u[k];
d[k] = u_star;
}
else {
d[k] = 1 - u_star;
v = u_star + d[k] * u[k];
}
z[k] = inv_Phi(v);
}
else {
z[k] = inv_Phi(u[k]);
d[k] = 1;
}
}
out[1] = z;
out[2] = d;
return out;
}
/*Functions to create all three covariance (sub) matrices (fourth is just cov_quad_exp) */
real K22(real s, real sp, real alpha, real rho){
real sq_delta = square(s-sp);
real sq_rho = square(rho);
real cov = square(alpha) * exp(-sq_delta/(2*sq_rho)) * 1/square(sq_rho) * ( square(sq_delta)/square(sq_rho) - (6*sq_delta/sq_rho) + 3 );
return(cov);
}
matrix K22_cov_mat(real[] s,  real alpha, real rho){
int K = size(s);
matrix[K, K] X;
real diag_entry = 3*square(alpha/square(rho));
for (i in 1:K) {
X[i, i] = diag_entry;
for (j in 1:K) {
X[i, j] = K22(s[i], s[j], alpha, rho);
X[j, i] = X[i, j];
}
}
return(X);
}
real K20(real t, real s, real alpha, real rho){
real cov; // computation in log-space for numerical stability
real sq_delta = square(t-s);
real sq_rho = square(rho);
cov = square(alpha) *  exp(-sq_delta/(2*sq_rho)) * ( sq_delta / square(sq_rho) - 1/sq_rho);
return(cov);
}
matrix K20_cov_mat(real[] x, real[] s, real alpha, real rho){
int N = size(x);
int K = size(s);
matrix[N, K] Y;
for (i in 1:N) {
for (j in 1:K) {
Y[i, j] = K20(x[i], s[j], alpha, rho);
}
}
return(Y);
}
matrix ffprime_cov_mat(real [] x, real[] s, real alpha, real rho){
int N = size(x);
int K = size(s);
int J = N + K;
matrix[N, N] A;
matrix[N, K] B;
matrix[K, K] C;
matrix[J, J] M;
// | A, B|
// | B',C|
B = K20_cov_mat(x, s, alpha, rho);
C = K22_cov_mat(s, alpha, rho);
/*Filling in the matrix*/
M[1:N, 1:N] = A;
M[1:N, (N+1):J] = B;
M[(N + 1):J, (N+1):J] = C;
M[(N + 1):J, 1:N] = B';
return(M);
}
matrix full_cov_mat(real [] x, real[] s, real[] x_pred, real alpha, real rho){
int N = size(x);
int K = size(s);
int N_pred = size(x_pred);
int W = N + K + N_pred;
matrix[W, W] M;
matrix[N_pred, K] Q = K20_cov_mat(x_pred, s, alpha, rho);
/* building sub-matrices */
// | (f, f)     (f,f'')  (f, f_p)   |
// | (f'',f)  (f'', f'') (f'', f_p) |
// | (f_p, f) (f_p, f'') (f_p, f_p) |
M[1:(N + K), 1:(N + K)] = ffprime_cov_mat(x, s, alpha, rho);
M[(N + K + 1):W, 1:N] = cov_exp_quad(x_pred, x, alpha, rho);
M[(N + K + 1):W, (N + 1):(N + K)] = Q;
M[1:N, (N + K + 1):W] = cov_exp_quad(x, x_pred, alpha, rho);
M[(N + 1):(N + K), (N + K + 1):W] = Q';
M[(N + 1 + K):W, (N + 1 + K):W] = cov_exp_quad(x_pred, x_pred, alpha, rho);
return(M);
}

}
data {
int<lower=1> K;
real x[K];
}
transformed data {
vector[K*2] b = rep_vector(0, 2*K);                // lower or upper bound
vector[K*2] mu = rep_vector(0, 2*K);
real alpha = 1.0;
real rho = 1.0;

// s[k] ==  0 implies no constraint; otherwise
// s[k] == -1 -> b[k] is an upper bound
// s[k] == +1 -> b[k] is a lower bound
vector<lower=-1,upper=1>[K*2] s = append_row(rep_vector(0, K), rep_vector(1, K));

matrix[K*2,K*2] Sigma =
append_row(
append_col(cov_exp_quad(x, alpha, rho), K20_cov_mat(x, x, alpha, rho))
, append_col( K20_cov_mat(x, x, alpha, rho), K22_cov_mat(x, alpha, rho))
) + diag_matrix(rep_vector(1e-5,2*K));
matrix[K*2,K*2] L = cholesky_decompose(Sigma);
}
parameters {
vector<lower=0,upper=1>[K*2] u;
}
model {
target += log(make_stuff(mu, L, b, s, u)[2]); // Jacobian adjustments
// implicit: u ~ uniform(0,1)
}
generated quantities {
matrix[K*2,K*2] Sigma_gen = Sigma;
vector[K*2] y = mu + L * make_stuff(mu, L, b, s, u)[1];
}


GPconv.stan (4.2 KB) GPconv.R (692 Bytes)

Hint: If you flip Z’’ and Z in the covariance matrix and adjust the constraints, you’ll get a better performance.

GPconv_fit.R (1.4 KB) GPconv_fit.stan (4.3 KB)

Update: Constraint a GP f(0) = 0
https://stackoverflow.com/questions/37539498/conditioning-gaussian-process-regression
Quoting Julien Bect:

• the samplepaths of your GP will satisfy the property f(0)=0 almost surely if, and only if, the mean function and the variance function vanish at x=0.
• Note that this is only possible for a non-stationary GP (or for the trivial GP with null mean and variance functions, of course).
• A typical example of this the standard Brownian motion over [0; +\infty), which has a zero mean function and covariance function k(x,y) = min(x,y).
• Starting from a generic covariance function k0, you can obtain this property using the conditioning formula: k(x,y) = k0(x,y) - k0(x,0) k0(0,y) / k0(0,0).

Let’s try:

x <- seq(-1,1,length.out=11)
k0<-function(xi, xj) exp(-(xi - xj)^2 / 2)

outer(x, x, function(x, y) { k0(x,y) - k0(x,0)* k0(0,y) / k0(0,0) })
}

[,1]       [,2]        [,3]        [,4]        [,5] [,6]        [,7]        [,8]        [,9]      [,10]      [,11]
[1,]  0.6321206  0.5397670  0.41649935  0.27537184  0.13162849    0 -0.10776829 -0.18458727 -0.22857969 -0.2425330 -0.2325442
[2,]  0.5397670  0.4727076  0.37366801  0.25279630  0.12349989    0 -0.10523966 -0.18356779 -0.23121956 -0.2492551 -0.2425330
[3,]  0.4164994  0.3736680  0.30232367  0.20914709  0.10438559    0 -0.09258172 -0.16452093 -0.21092407 -0.2312196 -0.2285797
[4,]  0.2753718  0.2527963  0.20914709  0.14785621  0.07536126    0 -0.06956721 -0.12599475 -0.16452093 -0.1835678 -0.1845873
[5,]  0.1316285  0.1234999  0.10438559  0.07536126  0.03921056    0 -0.03767309 -0.06956721 -0.09258172 -0.1052397 -0.1077683
[6,]  0.0000000  0.0000000  0.00000000  0.00000000  0.00000000    0  0.00000000  0.00000000  0.00000000  0.0000000  0.0000000
[7,] -0.1077683 -0.1052397 -0.09258172 -0.06956721 -0.03767309    0  0.03921056  0.07536126  0.10438559  0.1234999  0.1316285
[8,] -0.1845873 -0.1835678 -0.16452093 -0.12599475 -0.06956721    0  0.07536126  0.14785621  0.20914709  0.2527963  0.2753718
[9,] -0.2285797 -0.2312196 -0.21092407 -0.16452093 -0.09258172    0  0.10438559  0.20914709  0.30232367  0.3736680  0.4164994
[10,] -0.2425330 -0.2492551 -0.23121956 -0.18356779 -0.10523966    0  0.12349989  0.25279630  0.37366801  0.4727076  0.5397670
[11,] -0.2325442 -0.2425330 -0.22857969 -0.18458727 -0.10776829    0  0.13162849  0.27537184  0.41649935  0.5397670  0.6321206


Looks good. What’s left over?

• monotonicity
The covariances can be found in the papers. Same approach as convexity.

• f(0) = 0 and monotonicity / convexity