I would like to use Stan to estimate a regression model with inequality constraints on linear combinations of parameters. Here is a simplified example, a quadratic function:
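y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \frac{1}{2} \beta_{11} x_{1,i}^2 + \beta_{12} x_{1,i} x_{2,i} + \frac{1}{2} \beta_{22} x_{2,i}^2 + u_i,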
where subscript i = 1, …, N indicates the observation, y_i is the dependent variable, x_{1,i} and x_{2,i} are two explanatory variables, u_i is the error term with mean zero and variance \sigma^2, and \beta_0, \beta_1, \beta_2, \beta_{11}, \beta_{12}, \beta_{22}, and \sigma are 7 parameters to be estimated. We have prior knowledge that the partial derivatives of y_i with respect to x_{1,i} and x_{2,i} are non-negative for all observations i = 1, …, N:
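\partial y_i / \partial x_{1,i} = \beta_1 + \beta_{11} x_{1,i} + \beta_{12} x_{2,i} \geq 0
\partial y_i / \partial x_{2,i} = \beta_2 + \beta_{12} x_{1,i} + \beta_{22} x_{2,i} \geq 0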
What is the best way to estimate this model specification, including the 2 \cdot N inequality restrictions in Stan?
We also have prior knowledge about the range in which we expect the partial derivatives of most observations to lie, e.g., 90% of observations are expected to have 0.1 \leq \partial y_i / \partial x_{1,i} \leq 0.6, or something similar. Can this additional prior knowledge also be used?
Unfortunately, this method (declaring lower bounds on the partial derivatives as transformed parameters) can cause Stan to hang or to random walk, because it only rejects unsuitable draws rather than actually restricting the beta parameters themselves. It won’t work unless the beta parameters are nicely initialized and naturally fall within those constraints once sampling begins. So this is by no means a perfect solution to your problem, and someone with more Stan experience might have a better one.
Thank you, @andrjohns, for making me aware of this previous discussion! As my (simplified) example has 2 \cdot N inequality restrictions (with N = number of observations, which is usually many hundreds or a few thousand) and K = 5 parameters, the matrix D in the discussion that you mentioned is not square but has many more rows (2 \cdot N) than columns (K). Hence, as far as I can see, the matrix D cannot be inverted and, thus, I am afraid that I cannot use this method.
Thanks a lot, @Corey.Plate, for your response! I used your suggestion to specify the lower bound of zero for the partial derivatives. (The range 0.1 \leq \partial y_i / \partial x_{1,i} \leq 0.6 in my example was not a strict constraint but an expected interval that includes the partial derivatives of around 90% of the observations). As you pointed out, I had to carefully initialise the \beta parameters so that the inequality constraints are fulfilled at initial values and the sampling can start. However, the number of divergent transitions after warmup is huge (more than 80% even if I increase adapt_delta to 0.999) and the effective sample size is small. Does anybody have suggestions for solving the issue with the divergent transitions or for imposing the inequality constraints in a different way?
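For reference, the relevant part of this approach looks roughly like the following sketch (the variable names are only illustrative; x1 and x2 are the data vectors):
transformed parameters {
  // declared lower bounds on transformed parameters only validate: a draw that
  // violates them is rejected, so the initial values must already satisfy them
  vector<lower=0>[N] dydx1 = beta1 + beta11 * x1 + beta12 * x2;
  vector<lower=0>[N] dydx2 = beta2 + beta12 * x1 + beta22 * x2;
}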
I might have gotten a little closer to a solution: setting priors on the partial derivatives (transformed parameters). If I use priors with very low probabilities for values close to zero, e.g.:
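// a log-normal prior puts very little mass near zero; the values here are only illustrative
dydx1 ~ lognormal(-0.25, 1);
dydx2 ~ lognormal(-0.25, 1);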
there are no longer divergent transitions after warmup, but the estimates are largely driven by the (quite ‘spiky’) priors, with the estimated coefficients ending up quite far from their ‘true’ values. If I instead use rather flat priors that do not assign too little probability to values close to zero, e.g.:
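// again only illustrative values: a much wider prior with non-negligible mass near zero
dydx1 ~ lognormal(0, 2);
dydx2 ~ lognormal(0, 2);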
the estimated coefficients are rather close to their ‘true’ values, but there is again a huge number of divergent transitions after warmup. Perhaps I can find a ‘compromise’, e.g.:
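// only illustrative ‘compromise’ values
dydx1 ~ lognormal(-0.25, 1.5);
dydx2 ~ lognormal(-0.25, 1.5);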
with only a few divergent transitions after warmup and estimated coefficients that are not too far from their ‘true’ values. Does anybody have other suggestions or hints?
Note: as the partial derivatives (transformed parameters) are linear functions of the estimated parameters, I don’t think that I need to adjust the ‘target’ with the log absolute determinant of the Jacobian of the transform when specifying priors for the transformed parameters.
This seems to imply that your prior on the derivatives is in conflict with your “true” values. The prior is not even that spiky - it implies that 90% of values should lie between 0.15 and 4.14. I would double-check that the derivatives actually are mostly between 0.1 and 0.6 for your “true” values.
But since you are already out of the realm of generative models, you can basically add any regularization you want; as long as it is smooth, it doesn’t have to be a valid distribution. So, e.g., to penalize dydx1 < lower_threshold you should be able to do something like:
for (i in 1:N) {
  if (dydx1[i] < lower_threshold) {
    // penalize how far the derivative falls below the threshold
    target += normal_lpdf(dydx1[i] - lower_threshold | 0, regularization_sigma);
  } else {
    // Adding a constant to ensure smoothness around `lower_threshold`
    target += normal_lpdf(0 | 0, regularization_sigma);
  }
}
where regularization_sigma controls the strength of the regularization (lower values push harder against derivatives below the threshold). I’ve not tested this in any actual model, though; it’s just a guess that should work.
A “nicer” way to solve this is to find a parametrization of \beta that satisfies the constraints by construction. The admissible region is going to be an intersection of several half-spaces (possibly a bounded convex polytope, but it could also be unbounded) and I am not aware of a natural, well-behaved parametrization of such a space - the problem is that at the intersections of the half-spaces the constraints are not smooth, which could hinder sampling. Though one might exist; e.g., for quadrilaterals there is one, so maybe it generalizes.
The good part is that this is a problem that you can solve outside of Stan, since the x values are fixed. In principle, given x you should be able to solve for absolute lower and upper bounds on \beta_{1}, then find piecewise linear functions that provide upper and lower bounds on \beta_{11} given \beta_1, and then find piecewise linear functions that provide upper and lower bounds on \beta_{12} given \beta_1 and \beta_{11}. Those can then be directly implemented in Stan, and as long as the changes at the intersections are not that large, things could work OK. You can reduce the chance of problems by suitably scaling and rotating the parametrization so that the earlier coefficients align with the larger dimensions of your polygon-like shape…
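For example, you can declare \beta_1 with a lower bound that is computed from the other coefficients and the data; a sketch, assuming x1 and x2 are data vectors of length N:
parameters {
  real beta11;
  real beta12;
  // lower bound on beta1 computed from beta11, beta12, and the data
  real<lower = max(-beta11 * x1 - beta12 * x2)> beta1;
  // (beta0, beta2, beta22, and sigma are declared as before)
}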
and it will guarantee that \beta_1 + \beta_{11} \cdot x_{1, n} + \beta_{12} \cdot x_{2, n} > 0 for all n for any \beta_1 satisfying the constraint.
The posterior may not be easy to sample, but I’d give it a try.
@martinmodrak: I’m never quite sure what folks mean by “generative.” To me, it’s just a matter of being able to draw parameters from the prior \theta^\text{sim} \sim p(\theta) and then draw observations from the sampling distribution, y^\text{sim} \sim p(y \mid \theta^\text{sim}). I’m guessing what you mean here is that there’s not an easy way to simulate the parameters using standard built-in RNGs from p(\theta) when there are constraints. With Stan, you can set up the prior as usual, then move the likelihood to generated quantities and you have a generative process. It’s not going to generate i.i.d. simulations, but it will simulate from the prior and then generate data according to the sampling distribution. Or, you write a function to do something like rejection sampling and then you can take independent draws—some of the built-in RNGs work that way and we don’t call the resulting models non-generative.
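A sketch of that pattern, assuming mu is the quadratic mean function defined in transformed parameters:
model {
  // priors on the beta coefficients and sigma as usual; no likelihood statement here
}
generated quantities {
  // simulate data from the sampling distribution given the prior draw
  vector[N] y_sim;
  for (n in 1:N) {
    y_sim[n] = normal_rng(mu[n], sigma);
  }
}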
Although the vast majority of the ‘true’ values of the partial derivatives lie between the 0.05 and 0.95 quantiles of the prior, the estimated partial derivatives were always very close to the mode of the prior, with much less variation than the ‘true’ values. I think that the log-normal prior is indeed quite spiky: its density at the 0.05 quantile (0.15) is 0.17, while its density at the mode (0.29) is 0.82, i.e., almost 5 times as high. It seems that the ‘signal’ from the prior is too strong compared to the ‘signal’ from the data.
Great suggestion! These ‘soft constraints’ work very well: there are no divergent transitions, the mean and median estimates of the parameters are very close to their ‘true’ values, the mean and median estimates of the partial derivatives are positive at all observations and very close to their ‘true’ values, and only very few, very slightly negative partial derivatives occur during sampling (at only a few observations in only a few iterations). Thanks a lot, @martinmodrak!
Thank you also for this suggestion, @martinmodrak! As this approach seems quite complicated and potentially suffers from the non-smooth intersections of the half-spaces, I will for now use your suggestion to specify ‘soft constraints’, as it is easy to implement and works very well.
Thanks for the suggestion, @Bob_Carpenter! Yes, we can calculate the lower bound of \beta_1 based on \beta_{11}, \beta_{12}, and the data. However, as far as I understand, all \beta parameters are sampled simultaneously. Therefore, I am concerned that defining the lower bound of \beta_1 based on the currently sampled values of \beta_{11} and \beta_{12} creates sampling problems. Given that the ‘soft constraints’ suggested by @martinmodrak work very well, I will use ‘soft constraints’ (for now).
This is not how variables work in Stan. Sampling happens on the unconstrained scale, and if the constraints are well defined, that is guaranteed to produce values meeting the constraints on the constrained scale. It’s how simplexes and other “mutually dependent” constraints work in Stan. As with the one I defined, you can order the parameters so the constraints make sense. In this case, you need to declare beta11 and beta12 before beta1.
Basically, think of having unconstrained values for beta11_unc, beta12_unc, beta1_unc. These get transformed in the following order to
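// a sketch: the first two are identity transforms, while beta1 uses Stan’s
// standard lower-bound transform (lower bound plus exp of the unconstrained value)
beta11 = beta11_unc;
beta12 = beta12_unc;
beta1 = max(-beta11 * x1 - beta12 * x2) + exp(beta1_unc);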
Now the three variables satisfy your constraint. The values for the unconstrained parameters come from the Hamiltonian Monte Carlo sampler all at the same time—it’s the magic of constraining transforms that puts things in order.
Thanks a lot, @Bob_Carpenter, for the illustrative and intuitive explanations! I implemented your suggested approach and it works very well. It takes more time to estimate this specification than the one with ‘soft constraints’, but the estimates are basically identical – except, of course, that with this approach not a single partial derivative is negative in any iteration :-). It is good to have two well-working approaches to choose from :-) Unfortunately, it seems that I cannot mark two different posts as “Solutions”…
No worries—I have enough karma on the forums to last a lifetime :-).
This might be a good case study for @mitzimorris, who is writing case studies on different ways to enforce things like sum-to-zero constraints. As with your case, she has found that sometimes softer constraints are better.
We’ve just introduced a new technique for sum-to-zero vectors which we used to implement crudely (as in my suggestion above), but now have some fancier implementations. Unfortunately, I don’t know how to extend those simple ideas to your much more complicated case.
I forgot to add that what you want to measure is effective sample size per second, not just overall time. It may be that they take roughly the same amount of time, but one gives more precision in the estimates. If you don’t care about the extra precision, then you can reduce the number of sampling iterations (and often the number of warmup iterations).
regarding sum-to-zero constraints, I have a case study, the 2nd part of which shows how to implement the sum-to-zero constraining transform in Stan, which allows the model to put sum-to-zero constraints on slices of a vector. (case study is currently under revision, but here’s the relevant section: The Sum-to-Zero Constraint in Stan)
I ran both estimations (i.e., with soft and hard constraints, respectively) with 4 chains and each chain with 2,000 iterations (incl. warmup). In the estimation with soft constraints, the slowest chain needed 45 seconds, while in the estimation with hard constraints, the slowest chain needed 124 seconds. The effective sample sizes were somewhat similar in the two estimations (for some parameters, the estimations with soft constraints had a larger effective sample size, while for other parameters the estimation with hard constraints had a larger effective sample size).
this is not surprising - for some models/data regimes, the hard sum-to-zero constraint works better, for others, the soft sum-to-zero constraint. can you use the sum_to_zero_vector constrained parameter type or apply the constraining transform to a transformed parameter? it should be faster than either. if implemented correctly, all three should give you the same estimates, and comparable effective sample sizes. since the sum_to_zero_vector transform is fastest, you can afford to add sampling iterations to get a higher EFF, as needed.
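for reference, the constrained type is declared like this (K is just a placeholder for the length):
parameters {
  // K elements constrained to sum to zero
  sum_to_zero_vector[K] theta;
}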
Sounds good but unfortunately I do not understand how I should use sum_to_zero_vector in my (simplified) example as I don’t have a vector of numbers that sum to zero but two scalars (\beta_1, \beta_2) that must be larger than or equal to two certain values, respectively. Sorry for my lack of understanding!