I would like to use Stan to estimate a regression model with inequality constraints on linear combinations of parameters. Here is a simplified example, a quadratic function:
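y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \frac{1}{2} \beta_{11} x_{1,i}^2 + \beta_{12} x_{1,i} x_{2,i} + \frac{1}{2} \beta_{22} x_{2,i}^2 + u_i,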
where subscript i = 1, …, N indicates the observation, y_i is the dependent variable, x_{1,i} and x_{2,i} are two explanatory variables, u_i is the error term with mean zero and variance \sigma^2, and \beta_0, \beta_1, \beta_2, \beta_{11}, \beta_{12}, \beta_{22}, and \sigma are 7 parameters to be estimated. We have prior knowledge that the partial derivatives of y_i with respect to x_{1,i} and x_{2,i} are non-negative for all observations i = 1, …, N:
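\partial y_i / \partial x_{1,i} = \beta_1 + \beta_{11} x_{1,i} + \beta_{12} x_{2,i} \geq 0
\partial y_i / \partial x_{2,i} = \beta_2 + \beta_{12} x_{1,i} + \beta_{22} x_{2,i} \geq 0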
What is the best way to estimate this model specification, including the 2 \cdot N inequality restrictions in Stan?
We also have prior knowledge about the range in which we expect the partial derivatives of most observations to lie, e.g., 90% of observations are expected to have 0.1 \leq \partial y_i / \partial x_{1,i} \leq 0.6, or something similar. Can this additional prior knowledge also be used?
Unfortunately, this method (declaring lower bounds on the partial derivatives as transformed parameters) can cause Stan to hang or to random walk, because it only rejects unsuitable draws rather than actually restricting the beta parameters themselves. It won’t work unless the beta parameters are nicely initialized and naturally fall within those constraints once sampling begins. So this is by no means a perfect solution to your problem, and someone with more Stan experience might have a better one.
Thank you, @andrjohns, for making me aware of this previous discussion! As my (simplified) example has 2 \cdot N inequality restrictions (with N = number of observations, which is usually many hundreds or a few thousand) and K = 5 parameters, the matrix D in the discussion that you mentioned is not square but has many more rows (2 \cdot N) than columns (K). Hence, as far as I can see, the matrix D cannot be inverted and, thus, I am afraid that I cannot use this method.
Thanks a lot, @Corey.Plate, for your response! I used your suggestion to specify the lower bound of zero for the partial derivatives. (The range 0.1 \leq \partial y_i / \partial x_{1,i} \leq 0.6 in my example was not a strict constraint but an expected interval that includes the partial derivatives of around 90% of the observations). As you pointed out, I had to carefully initialise the \beta parameters so that the inequality constraints are fulfilled at initial values and the sampling can start. However, the number of divergent transitions after warmup is huge (more than 80% even if I increase adapt_delta to 0.999) and the effective sample size is small. Does anybody have suggestions for solving the issue with the divergent transitions or for imposing the inequality constraints in a different way?
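For reference, the relevant part of this approach looks roughly like the following sketch (the variable names are only illustrative; x1 and x2 are the data vectors):
transformed parameters {
  // declared lower bounds on transformed parameters only validate: a draw that
  // violates them is rejected, so the initial values must already satisfy them
  vector<lower=0>[N] dydx1 = beta1 + beta11 * x1 + beta12 * x2;
  vector<lower=0>[N] dydx2 = beta2 + beta12 * x1 + beta22 * x2;
}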
I might have gotten a little closer to a solution: setting priors on the partial derivatives (transformed parameters). If I use priors with very low probabilities for values close to zero, e.g.:
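// a log-normal prior puts very little mass near zero; the values here are only illustrative
dydx1 ~ lognormal(-0.25, 1);
dydx2 ~ lognormal(-0.25, 1);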
there are no longer divergent transitions after warmup, but the estimates are largely driven by the (quite ‘spiky’) priors, with the estimated coefficients ending up quite far from their ‘true’ values. If I instead use rather flat priors that do not assign too little probability to values close to zero, e.g.:
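// again only illustrative values: a much wider prior with non-negligible mass near zero
dydx1 ~ lognormal(0, 2);
dydx2 ~ lognormal(0, 2);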
the estimated coefficients are rather close to their ‘true’ values, but there is again a huge number of divergent transitions after warmup. Perhaps I can find a ‘compromise’, e.g.:
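// only illustrative ‘compromise’ values
dydx1 ~ lognormal(-0.25, 1.5);
dydx2 ~ lognormal(-0.25, 1.5);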
with only a few divergent transitions after warmup and estimated coefficients that are not too far from their ‘true’ values. Does anybody have other suggestions or hints?
Note: as the partial derivatives (transformed parameters) are linear functions of the estimated parameters, I don’t think that I need to adjust the ‘target’ with the log absolute determinant of the Jacobian of the transform when specifying priors for the transformed parameters.
This seems to imply that your prior on the derivatives is in conflict with your “true” values. The prior is not even that spiky - it implies that 90% of values should lie between 0.15 and 4.14. I would double-check that the derivatives actually are mostly between 0.1 and 0.6 for your “true” values.
But since you are already out of the realm of generative models, you can basically add any regularization you want; as long as it is smooth, it doesn’t have to be a valid distribution. So, e.g., to penalize dydx1 < lower_threshold you should be able to do something like:
for (i in 1:N) {
  if (dydx1[i] < lower_threshold) {
    // penalize how far the derivative falls below the threshold
    target += normal_lpdf(dydx1[i] - lower_threshold | 0, regularization_sigma);
  } else {
    // Adding a constant to ensure smoothness around `lower_threshold`
    target += normal_lpdf(0 | 0, regularization_sigma);
  }
}
where regularization_sigma controls the strength of the regularization (lower values push harder against derivatives below the threshold). I’ve not tested this in any actual model, though; it’s just a guess that should work.
A “nicer” way to solve this is to find a parametrization of \beta that satisfies the constraints by construction. The admissible region is going to be an intersection of several half-spaces (possibly a bounded convex polytope, but it could also be unbounded) and I am not aware of a natural, well-behaved parametrization of such a space - the problem is that at the intersections of the half-spaces the constraints are not smooth, which could hinder sampling. Though one might exist; e.g., for quadrilaterals there is one, so maybe it generalizes.
The good part is that this is a problem that you can solve outside of Stan, since the x values are fixed. In principle, given x you should be able to solve for absolute lower and upper bounds on \beta_{1}, then find piecewise linear functions that provide upper and lower bounds on \beta_{11} given \beta_1, and then find piecewise linear functions that provide upper and lower bounds on \beta_{12} given \beta_1 and \beta_{11}. Those can then be directly implemented in Stan, and as long as the changes at the intersections are not that large, things could work OK. You can reduce the chance of problems by suitably scaling and rotating the parametrization so that the earlier coefficients align with the larger dimensions of your polygon-like shape…
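For example, you can declare \beta_1 with a lower bound that is computed from the other coefficients and the data; a sketch, assuming x1 and x2 are data vectors of length N:
parameters {
  real beta11;
  real beta12;
  // lower bound on beta1 computed from beta11, beta12, and the data
  real<lower = max(-beta11 * x1 - beta12 * x2)> beta1;
  // (beta0, beta2, beta22, and sigma are declared as before)
}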
and it will guarantee that \beta_1 + \beta_{11} \cdot x_{1, n} + \beta_{12} \cdot x_{2, n} > 0 for all n for any \beta_1 satisfying the constraint.
The posterior may not be easy to sample, but I’d give it a try.
@martinmodrak: I’m never quite sure what folks mean by “generative.” To me, it’s just a matter of being able to draw parameters from the prior \theta^\text{sim} \sim p(\theta) and then draw observations from the sampling distribution, y^\text{sim} \sim p(y \mid \theta^\text{sim}). I’m guessing what you mean here is that there’s not an easy way to simulate the parameters using standard built-in RNGs from p(\theta) when there are constraints. With Stan, you can set up the prior as usual, then move the likelihood to generated quantities and you have a generative process. It’s not going to generate i.i.d. simulations, but it will simulate from the prior and then generate data according to the sampling distribution. Or, you write a function to do something like rejection sampling and then you can take independent draws—some of the built-in RNGs work that way and we don’t call the resulting models non-generative.
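A sketch of that pattern, assuming mu is the quadratic mean function defined in transformed parameters:
model {
  // priors on the beta coefficients and sigma as usual; no likelihood statement here
}
generated quantities {
  // simulate data from the sampling distribution given the prior draw
  vector[N] y_sim;
  for (n in 1:N) {
    y_sim[n] = normal_rng(mu[n], sigma);
  }
}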
Although the vast majority of the ‘true’ values of the partial derivatives lie between the 0.05 and 0.95 quantiles of the prior, the estimated partial derivatives were always very close to the mode of the prior, with much less variation than the ‘true’ values. I think that the log-normal prior is indeed quite spiky: its density at the 0.05 quantile (0.15) is 0.17, while its density at the mode (0.29) is 0.82, i.e., almost 5 times as high. It seems that the ‘signal’ from the prior is too strong compared to the ‘signal’ from the data.
Great suggestion! These ‘soft constraints’ work very well: there are no divergent transitions, the mean and median estimates of the parameters are very close to their ‘true’ values, the mean and median estimates of the partial derivatives are positive at all observations and very close to their ‘true’ values, and only very few, very slightly negative partial derivatives occur during sampling (at only a few observations in only a few iterations). Thanks a lot, @martinmodrak!
Thank you also for this suggestion, @martinmodrak! As this approach seems quite complicated and potentially suffers from the non-smooth intersections of the half-spaces, I will for now use your suggestion to specify ‘soft constraints’, as it is easy to implement and works very well.
Thanks for the suggestion, @Bob_Carpenter! Yes, we can calculate the lower bound of \beta_1 based on \beta_{11}, \beta_{12}, and the data. However, as far as I understand, all \beta parameters are sampled simultaneously. Therefore, I am concerned that defining the lower bound of \beta_1 based on the currently sampled values of \beta_{11} and \beta_{12} creates sampling problems. Given that the ‘soft constraints’ suggested by @martinmodrak work very well, I will use ‘soft constraints’ (for now).
This is not how variables work in Stan. Sampling happens on the unconstrained scale, and if the constraints are well defined, that is guaranteed to produce values meeting the constraints on the constrained scale. It’s how simplexes and other “mutually dependent” constraints work in Stan. As with the one I defined, you can order the parameters so the constraints make sense. In this case, you need to declare beta11 and beta12 before beta1.
Basically, think of having unconstrained values for beta11_unc, beta12_unc, beta1_unc. These get transformed in the following order to
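// a sketch: the first two are identity transforms, while beta1 uses Stan’s
// standard lower-bound transform (lower bound plus exp of the unconstrained value)
beta11 = beta11_unc;
beta12 = beta12_unc;
beta1 = max(-beta11 * x1 - beta12 * x2) + exp(beta1_unc);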
Now the three variables satisfy your constraint. The values for the unconstrained parameters come from the Hamiltonian Monte Carlo sampler all at the same time—it’s the magic of constraining transforms that puts things in order.
Thanks a lot, @Bob_Carpenter, for the illustrative and intuitive explanations! I implemented your suggested approach and it works very well. It takes more time to estimate this specification than the one with ‘soft constraints’, but the estimates are basically identical – except, of course, that with this approach not a single partial derivative is negative in any iteration :-). It is good to have two well-working approaches to choose from :-) Unfortunately, it seems that I cannot mark two different posts as “Solutions”…
No worries—I have enough karma on the forums to last a lifetime :-).
This might be a good case study for @mitzimorris, who is writing case studies on different ways to enforce things like sum-to-zero constraints. As with your case, she has found that sometimes softer constraints are better.
We’ve just introduced a new technique for sum-to-zero vectors which we used to implement crudely (as in my suggestion above), but now have some fancier implementations. Unfortunately, I don’t know how to extend those simple ideas to your much more complicated case.
I forgot to add that what you want to measure is effective sample size per second, not just overall time. It may be that they take roughly the same amount of time, but one gives more precision in the estimates. If you don’t care about the extra precision, then you can reduce the number of sampling iterations (and often the number of warmup iterations).
regarding sum-to-zero constraints, I have a case study, the 2nd part of which shows how to implement the sum-to-zero constraining transform in Stan, which allows the model to put sum-to-zero constraints on slices of a vector. (case study is currently under revision, but here’s the relevant section: The Sum-to-Zero Constraint in Stan)
I ran both estimations (i.e., with soft and hard constraints, respectively) with 4 chains and each chain with 2,000 iterations (incl. warmup). In the estimation with soft constraints, the slowest chain needed 45 seconds, while in the estimation with hard constraints, the slowest chain needed 124 seconds. The effective sample sizes were somewhat similar in the two estimations (for some parameters, the estimations with soft constraints had a larger effective sample size, while for other parameters the estimation with hard constraints had a larger effective sample size).
this is not surprising - for some models/data regimes, the hard sum-to-zero constraint works better, for others, the soft sum-to-zero constraint. can you use the sum_to_zero_vector constrained parameter type or apply the constraining transform to a transformed parameter? it should be faster than either. if implemented correctly, all three should give you the same estimates, and comparable effective sample sizes. since the sum_to_zero_vector transform is fastest, you can afford to add sampling iterations to get a higher EFF, as needed.
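for reference, the constrained type is declared like this (K is just a placeholder for the length):
parameters {
  // K elements constrained to sum to zero
  sum_to_zero_vector[K] theta;
}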
Sounds good but unfortunately I do not understand how I should use sum_to_zero_vector in my (simplified) example as I don’t have a vector of numbers that sum to zero but two scalars (\beta_1, \beta_2) that must be larger than or equal to two certain values, respectively. Sorry for my lack of understanding!