How to specify that a subset of parameters must be positive

I am using the following prior specification for a set of contrasts in parameters to good effect. C is the contrast matrix and beta is the parameter vector,

data {
  int<lower = 1> p;   // number of predictors
  int<lower = 0> cn;  // number of contrasts given priors
  matrix[cn, p] C;    // contrasts
  // prior means and SDs for contrasts
  vector[cn] cmus;
  vector[cn] csds;
}
model {
  if (cn > 0) target += normal_lpdf(C * beta | cmus, csds);
  ...
}

Now I want to pass to Stan a logical vector (which may be all FALSE), say pos, that would force those elements of C * beta with pos=TRUE to be zero or positive. What is the best way to code this?

I believe this question and solution are what you’re after: Simple normal regression model with inequality constraints on the parameters - #2 by andrjohns

In your case the model would have only lower-bound constraints of either 0 or -infinity (i.e., each element is either constrained positive or unconstrained).

Thanks, that’s very helpful. I see now how to do it by passing a vector of lower and a vector of upper limits (assuming R infinite values are OK to pass to Stan). But I’m not seeing how to do that when the constraints are applied to a matrix product. Or are you saying that the only approach is to use the matrix inverse approach shown in that link and to use the transformed limits to put limits on beta, and those limits will carry forward to impose the desired limits on C * beta? Also I’m not clear how to handle \pm \infty in the matrix calculations.

Or are you saying that the only approach is to use the matrix inverse approach shown in that link and to use the transformed limits to put limits on beta, and those limits will carry forward to impose the desired limits on C * beta?

That’s right. You can’t apply a constraint directly to the result of a transformation of parameters; you have to instead specify constraints on the parameters themselves that will result in the desired constraint on the transformation.

As a minimal example:

data {
  int p;
  int cn;
  matrix[cn, p] C;
  array[p] int<lower=0, upper=1> pos; 
}

transformed data {
  vector[p] lbs;
  for (i in 1:p) {
    lbs[i] = pos[i] == 1 ? 0 : negative_infinity();
  }
}

parameters {
  // Implies lbs <= (C * beta) 
  vector<lower=(C \ lbs)>[p] beta; 
}

Fascinating. Does left matrix division allow for non-square C? The contrast matrix will usually have fewer rows than beta has elements.

I’m not seeing how the math works. If I want to restrict beta[1]-beta[2] > 0 there is no separate restriction on beta[1]. If there are 4 betas this would be for C=[1 -1 0 0] , a 1x4 matrix. So I’m not getting how the betas can be isolated in the first place.

Update

I see that the method requires the contrast matrix to span the space of all the betas (which is a problem for the applications I have, where I want non-referenced parameters to keep noninformative priors). Here’s an example where C is a 4x4 non-singular matrix. But you can see at the end that the constraint that beta[1] > beta[2] is not being enforced.

testc.stan

data {
  int p;
  int cn;
  matrix[cn, p] C;
  vector[cn] low;
  vector[cn] hi;
}

parameters {
  // Implies low <= (C * beta) <= hi
  vector<lower=(C \ low), upper=(C \ hi)>[p] beta;
}

transformed parameters {  // to return to R for checking
  vector[cn] Lower;
  vector[cn] Upper;

  Lower = C \ low;
  Upper = C \ hi;
}

R Code

cmdstanr::set_cmdstan_path('/usr/local/bin/cmdstan-2.32.2')
C=rbind(c(1,0,0,0), c(1,-1,0,0), c(0,0,1,0), c(0,0,0,1))
A <- -1.
B <- 1.
d <- list(p=4L, cn=4L, C=C,
          low=as.array(c(A, 0., A, A)),
          hi =as.array(c(B, B, B, B)),
          seed=1L)
mod <- cmdstanr::cmdstan_model('testc.stan')
f <- mod$sample(data=d)
betas <- f$draws()[, 1, 2:5, drop=TRUE]
round(betas[1:20, ], 2)

iteration beta[1] beta[2] beta[3] beta[4]
       1     0.78   -0.59    0.42   -0.69
       2    -0.84   -0.31   -0.48    0.67
       3     0.83   -0.55   -0.66    0.82
       4     0.13   -0.71    0.38    0.92
       5    -0.41   -0.67   -0.47    0.44
       6    -0.74   -0.38    0.54   -0.47
       7     0.64   -0.77   -0.88    0.00
       8    -0.80   -0.21    0.88   -0.21
       9    -0.30   -0.93   -1.00   -0.26
       10    0.70   -0.05    1.00    0.30
       11   -0.46   -0.32    1.00    0.28
       12    0.47   -0.48    0.43   -0.24
       13   -0.34   -0.60   -0.48    0.03
       14   -0.24   -0.51   -0.85    0.07
       15   -0.75   -0.17   -0.96    0.27
       16   -0.88   -0.06    0.47    0.43
       17   -0.42   -0.36   -0.81    0.28
       18    0.44   -0.64    0.60    0.49
       19    0.82   -0.37    0.65    0.55
       20    0.22   -0.60   -0.29    0.00

table(betas[,1] - betas[,2] >= 0.)

FALSE  TRUE 
  261   739

The computed values of Lower and Upper are

Lower = -1 -1 -1 -1
Upper = 1 0 1 1

which doesn’t make sense to me as applying to individual betas.
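For reference, here is a quick R check of those transformed bounds, plus a counterexample showing why element-wise bounds on beta cannot enforce the contrast bound:

C   <- rbind(c(1, 0, 0, 0), c(1, -1, 0, 0), c(0, 0, 1, 0), c(0, 0, 0, 1))
low <- c(-1, 0, -1, -1)
hi  <- c( 1, 1,  1,  1)
solve(C, low)   # -1 -1 -1 -1  (the Lower above)
solve(C, hi)    #  1  0  1  1  (the Upper above)
# beta = (-1, 0, 0, 0) satisfies these element-wise bounds,
# yet beta[1] - beta[2] = -1 < 0, violating the intended low[2] = 0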

Wouldn’t it be simpler to just put a positive-support prior on selected contrasts? Would there be sampling problems with this?

So you want something like this?

> betas
# A draws_matrix: 1000 iterations, 4 chains, and 4 variables
    variable
draw beta[1] beta[2] beta[3] beta[4]
  1    0.350   0.112    0.64   -0.21
  2   -0.276  -0.298   -0.52   -0.27
  3   -0.531  -0.690    0.44    0.47
  4    0.884  -0.211    0.85   -0.52
  5    0.244  -0.731    0.91    0.13
  6    0.204  -0.675    0.93   -0.60
  7    0.326  -0.066    0.95   -0.75
  8    0.412  -0.576   -0.88    0.93
  9   -0.053  -0.199    0.23    0.43
  10   0.683  -0.743   -0.96   -0.45
# ... with 3990 more draws
> table(betas[,1] - betas[,2] >= 0.)

TRUE 
4000 

The trouble is the dependence upon previous beta values. I think this does what you want. It gets more complicated if you do want unbounded lower or upper bounds because then we have to have a branch to handle each case.

data {
  int p;
  int cn;
  matrix[p, p] C;
  vector[p] low;
  vector[p] hi;
}

parameters {
  vector[p] beta_raw;
}

transformed parameters {  // to return to R for checking
  // Implies low <= (C * beta) <= hi
  vector[p] beta;
  real log_abs_jac = 0;

  beta[1] = low[1] + (hi[1] - low[1]) * inv_logit(beta_raw[1]);
  log_abs_jac += log(hi[1] - low[1]) + log_inv_logit(beta_raw[1])
                 + log1m_inv_logit(beta_raw[1]);

  for (i in 2:p) {
    int counter = 0;
    real low_c_raw = 0;
    real hi_c_raw = 0;
    real low_c;
    real hi_c;

    // build the bounds for beta[i] from the already-defined beta[j], j < i
    for (j in 1:i - 1) {
      if (C[i, j] == -1) low_c_raw += beta[j];
      else if (C[i, j] == 1) hi_c_raw += beta[j];
      else {
        counter += 1;
        if (counter == i - 1) {
          low_c_raw = low[i];
          hi_c_raw = hi[i];
        }
      }
    }
    low_c = low_c_raw == 0 ? low[i] : low_c_raw;
    hi_c = hi_c_raw == 0 ? hi[i] : hi_c_raw;

    beta[i] = low_c + (hi_c - low_c) * inv_logit(beta_raw[i]);
    log_abs_jac += log(hi_c - low_c) + log_inv_logit(beta_raw[i])
                   + log1m_inv_logit(beta_raw[i]);
  }
}
model {
  target += log_abs_jac;
}

Note that I removed the 0 in the low and hi arrays.

C = rbind(c(1,0,0,0), c(1,-1,0,0), c(0,0,1,0), c(0,0,0,1))
A <- -1.
B <- 1.
d <- list(p=4L, cn=4L, C=C,
          low=as.array(c(A, A, A, A)),
          hi =as.array(c(B, B, B, B)),
          seed=1L)

mod <- cmdstanr::cmdstan_model('test_frank_h.stan')

out <- mod$sample(
  data = d,
  parallel_chains = 4
)
out_sum <- out$summary("beta")

Thank you very much for this. Extremely educational. I think it fully exposes the difficulty of the task. Since I’m seeking solutions where positivity constraints can be placed on selected rows of general contrasts in the betas, I don’t think the approach will scale but I could use it in special cases. If you or anyone else thinks of another attack that will generalize to setting ranges on C*beta that would be most appreciated, if it’s possible to do. I wish that we could just put half-normal priors on selected rows. I tried putting an exponential distribution prior on one row and got Stan errors about negative arguments to the density function. I guess that positive support priors are only for parameters that are already guaranteed to be positive like variances? Side note: I wonder if a highly skewed prior distribution that places very small mass on negative values would work. (But I could not find such a distribution on a quick search).

Would something like

target += std_normal_lpdf( beta[cn] ) - 1 * normal_lccdf( 0 | 0, 1 );
target += std_normal_lpdf( beta[free] );

work as the half-normals? This is the way that brms explicitly enforces lower bounds on parameters, so if the rows that need to be positive are known beforehand (or could be determined within the model), then this seems like it would be similar to putting a half-normal prior on those rows. Perhaps that’s not what you had in mind, though – or maybe there’s some technical difference between this specification and a true half-normal.
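For the contrast rows, a rough sketch of that idea might look like the following, reusing C, cmus, and csds from the original prior specification; pos here is a hypothetical 0/1 data vector marking the rows to treat as half-normal. Note the lccdf term is a constant, so by itself this only encourages, rather than forces, the contrast to be positive:

model {
  vector[cn] cb = C * beta;
  for (i in 1:cn) {
    if (pos[i] == 1)
      // half-normal-style prior on this contrast (brms-style constant adjustment)
      target += normal_lpdf(cb[i] | cmus[i], csds[i])
                - normal_lccdf(0 | cmus[i], csds[i]);
    else
      target += normal_lpdf(cb[i] | cmus[i], csds[i]);
  }
}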

Did you declare the lower bound on the parameter? I’ve made this mistake before, assuming that since the prior distribution is strictly positive the parameter declaration doesn’t need to indicate that the parameter is lower-bounded at zero; but Stan will throw errors about impossible negative parameter values being tried, despite the strictly positive prior, unless the lower bound is in the parameter declaration. At least this was the case at some point in the not-too-distant past – recent updates or changes in Stan may have resolved that.

You can place constraints on any, none, or all rows using the above. What I don’t understand are the constraints you want to place on the individual betas. Do you want them to be between -1 and 1 or to be free? Are the only constraints then on those rows you want to be positive?

The constraint is on a function of the parameters, i.e., on a linear contrast of the betas, so there is no way to use <lower=0> on a contrast row such as C[1,]*beta.

The -1 to 1 range for betas was just for this example. In general I would leave the betas to have unrestricted ranges. My goal is to place constraints on linear combinations of parameters through a contrast matrix. This would cover a lot of bases were it to work, e.g.

  • A -1 1 positivity constraint for one row of the contrast matrix would force beta[2] > beta[1] (see the sketch after this list)
  • A constraint that a difference in slopes at different x-points be positive would constrain the 2nd derivative to be positive
  • A constraint on a double difference would prevent an interaction from being a qualitative interaction and may only allow it to be a quantitative (no direction change) interaction
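As a tiny R sketch of the first bullet (assuming four betas, as in the earlier example):

# one row of C; the positivity constraint C %*% beta > 0 then reads beta[2] - beta[1] > 0
C_row <- rbind(c(-1, 1, 0, 0))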

If you post an example constraint matrix that has those in it I can see if I can get it to work.

I’m assuming it will be a p x p matrix, like you have above, where a row without constraints has 1 on the diagonal. The rows with constraints will have 1 and -1 for the linear combo of parameters whose row sum should be > 0. Is that right?

Thanks. I’d like to leave un-referenced betas (those that always have zeros for them in the contrast matrix C) out of the picture so they will get non-informative priors. C will be c x p where c can be < p and p is the number of betas.

Here’s an example model: y ~ alpha + beta[1][x1=b] + beta[2][x1=c] + beta[3]x2

where x1 is a 3-level categorical variable and x2 is continuous. We may want to assume that E(y | x1=c) >= E(y | x1=b), i.e., that beta[2] - beta[1] > 0. Let’s also assume that beta[3] > 0. Then

C =    | -1   1  0 |
       |  0   0  1 |

I have much more complex examples fully implemented here where I capitalize on Stan’s ability to specify priors on contrasts. I specify small prior variance for elements of contrasts to constrain certain differences or double differences to be small. You’ll see in the examples that putting priors on contrasts opens us up to all kinds of advantages. The new need is to be able to restrict certain contrasts to be positive.

Thanks for taking a look into this.

You can probably do this, but I’m not understanding how C, in the form you have, can be used. You have to make some assumptions to do this and I’m not sure those are met with C, because Stan needs to be able to express each constraint in terms of previously defined betas. The assumption is that for row i all the previous betas (j < i) are known, and then we use those to construct the constraint on beta[i].

For example, if the constraints are

c1 + c3 - 2 * c2 > 0
c3 + c5 - 2 * c4 > 0

In Stan we want

c1 unconstrained
c2 unconstrained
c3 > 2 * c2 - c1
c4 unconstrained
c5 > 2 * c4 - c3

However, the constraints on c3 and c5 will induce non-uniform implied priors on c1, c2, and c4.

You can code this in the parameters block as

parameters {
 real c1, c2;
 real<lower=2 * c2 - c1> c3;
 real c4;
 real<lower=2 * c4 - c3> c5;
}

But that becomes unwieldy to do for many different combinations.

Let’s put the lower bounds in a matrix

M = | -1 2 0 0 0 |
    | 0 0 -1 2 0 |
I_c = {3, 5} // says row[1] is for index 3, row[2] is for index 5
I_uc = {1, 2, 4}
parameters {
  vector[K] x_raw;   // K = cols(M)
}
transformed parameters {
  real log_abs_jac = 0;
  vector[K] x;

  x[I_uc] = x_raw[I_uc];

  for (r in 1:rows(M)) {
    // lower bound built from the already-defined entries of x, plus a positive offset
    x[I_c[r]] = M[r, 1:I_c[r] - 1] * x[1:I_c[r] - 1] + exp(x_raw[I_c[r]]);
    log_abs_jac += x_raw[I_c[r]];   // log-Jacobian of the exp offset
  }
}

What does M look like if you just want a value to be bounded below by 0 on its own (with no dependence on other parameters)? You just make that row all 0s.

For example, let x[4] > 0. We now have

M = | -1 2 0 0 0 |
    | 0 0 0 0 0 |
    | 0 0 -1 2 0 |
I_c = {3, 4, 5} 
I_uc = {1, 2}

I’m sure you know that having many constraints could leave you without a feasible set.

I’d code this as an _lp function like:

functions {
  vector positive_constrain_contrasts_lp(vector y, matrix M, array[] int I_c,
                                         array[] int I_uc) {
    int R = rows(M);
    int K = cols(M);
    vector[K] x;
    
    x[I_uc] = y[I_uc];
    
    for (r in 1 : R) {
      // lower bound from the already-defined entries of x, plus a positive offset
      x[I_c[r]] = M[r, 1 : I_c[r] - 1] * x[1 : I_c[r] - 1] + exp(y[I_c[r]]);
    }
    
    // log-Jacobian of the exp offsets (one term per constrained entry)
    target += y[I_c];
    
    return x;
  }
}
data {
  int p;
  int cn;
  matrix[cn, p] M;
  array[cn] int I_c;
  array[p - cn] int I_uc;
}
parameters {
  vector[p] beta_raw;
}
transformed parameters {
  vector[p] beta = positive_constrain_contrasts_lp(beta_raw, M, I_c, I_uc);
}
model {
  beta ~ std_normal();
}

And you can test in R as

mod <- cmdstanr::cmdstan_model('test_c_fun.stan')

out1 <- mod$sample(
  data = list(
    p = 5,
    cn = 2,
    M = matrix(c(-1, 2, 0, 0, 0,
                 0, 0, -1, 2, 0), byrow = T, nr = 2),
    I_c = c(3, 5),
    I_uc = c(1, 2, 4)
  ),
  parallel_chains = 4
)
out1$summary("beta")
beta <- out1$draws("beta", format = "matrix")
table(beta[, 1] + beta[, 3] - 2 * beta[, 2] > 0)
table(beta[, 3] + beta[, 5] - 2 * beta[, 4] > 0)

out2 <- mod$sample(
  data = list(
    p = 5,
    cn = 3,
    M = matrix(c(-1, 2, 0, 0, 0,
                 0, 0, 0, 0, 0,
                 0, 0, -1, 2, 0), byrow = T, nr = 3),
    I_c = c(3, 4, 5),
    I_uc = c(1, 2)
  ),
  parallel_chains = 4
)

out2$summary("beta")
beta2 <- out2$draws("beta", format = "matrix")
table(beta2[, 1] + beta2[, 3] - 2 * beta2[, 2] > 0)
table(beta2[, 4] > 0)
table(beta2[, 3] + beta2[, 5] - 2 * beta2[, 4] > 0)

I really appreciate all this thinking and work. I’ve read over it and realize I’ll need to study it more intensively to understand what you’re doing. Among other things, I don’t understand the + exp() parts.

I’d like to be able to handle general contrasts, e.g., those for which parameters are not separable, such as a contrast that approximates the derivative of a spline function. I know that’s asking a lot.

I wish that we could specify strange prior distributions for selected rows of the contrasts. I wish that positive support priors would have worked. Is there a possibility of specifying a strange prior that has total mass 0.01 for x \in (-\infty, 0] and is half-normal-looking for x > 0 with continuity at x=0? Would this lead to sampling problems anyway?

The exp bit is the typical way to constrain parameters to the positive reals. Whenever you put a lower=0 in the parameters block Stan does the same transformation in the backend. Look up constraint transforms in the Stan user guide for more information.
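Roughly, a declaration like real<lower=0> sigma; corresponds to doing the transform by hand, something like:

parameters {
  real sigma_raw;               // unconstrained
}
transformed parameters {
  real sigma = exp(sigma_raw);  // maps the real line to (0, infinity)
}
model {
  // log absolute Jacobian of the exp transform: log |d sigma / d sigma_raw| = sigma_raw
  target += sigma_raw;
}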

I don’t know much about splines and contrasts so you will have to describe and show what you would like to do. It may be possible to do what you want but we need a better description and examples.

The prior idea is what we typically hear called a soft constraint. These crop up all the time when you want the sum of parameters to be 0 but don’t need to force it to be exactly 0. There are a few examples on the forum; look up sum-to-zero or soft vs hard constraints.
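For instance, a soft sum-to-zero constraint is just a tight prior on the sum rather than an exact restriction (a sketch):

model {
  // soft constraint: penalize departures of sum(beta) from 0
  // instead of reparameterizing so that the sum is exactly 0
  target += normal_lpdf(sum(beta) | 0, 0.001);
}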

Thanks. Here’s an example. To allow generality, also think of how this would be extended to handle constraints on the shape of an effect for a particular level of an interacting factor, or think of higher-order polynomials or regression splines (e.g., segmented cubic splines). Consider a quadratic effect y = alpha + beta[1]*x + beta[2]*x^2 and one row of a constraint matrix corresponding to a difference in predicted values between x=4 and x=2, i.e., beta[2]*(16 - 4) + beta[1]*(4 - 2) = 2*beta[1] + 12*beta[2]. Higher-order polynomials or splines would involve more betas, so best not to make any simplifications on the combination of coefficients.
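For concreteness, that contrast row can be built directly from the design; a small R sketch:

# difference in predicted values at x = 4 vs x = 2 for y = alpha + beta[1]*x + beta[2]*x^2
# (alpha cancels in the difference)
xrow  <- function(x) c(x, x^2)
C_row <- xrow(4) - xrow(2)   # c(2, 12), i.e. 2*beta[1] + 12*beta[2]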

This still looks like the same problem solved above. It is a linear combination of parameters, and what I gave above will constrain rows with such a combination to be > 0. The higher order comes from x, not from the parameters.

What won’t work is a nonlinear specification such as beta[1] * beta[2] > 0. You can see that if beta[1] < 0 then beta[2] must also be < 0. Similarly, if beta[1] > 0 then beta[2] must be > 0. You must decide a priori whether beta[1] is going to be constrained to the negative or positive reals to enforce the overall constraint. In this case, you either pick a sign or try a soft constraint with a prior, but I’m not sure how stable that will be.
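For example, enforcing beta[1] * beta[2] > 0 by deciding up front that both are positive is just (a sketch):

parameters {
  real<lower=0> beta1;   // sign chosen a priori
  real<lower=0> beta2;   // so beta1 * beta2 > 0 holds automatically
}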
