QR decomposition with non-negativity constraint

Is it possible (and if so, recommended) to do a QR decomposition when one has non-negativity constraints on the regression coefficients? E.g. suppose I have X \gamma in a regression, where X is a design matrix and \gamma \geq 0 is a parameter vector, and I write X = QR and \tilde{\gamma} = R \gamma.

If you want \gamma \geq 0, then you need to transform that to R^{-1} \tilde{\gamma} \geq 0, assuming R is invertible, which may be difficult to encode with Stan’s constraints. I’m not good enough with the linear algebra to know if R will be invertible or if it’ll be easy to enforce that constraint.
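To make concrete where the difficulty sits, here is a rough sketch of the standard thin-QR reparameterization (as in the user’s guide) for a Gaussian outcome; the comment marks where \gamma \geq 0 would have to be enforced. The variable names and the normal likelihood are purely illustrative.

data {
  int<lower=1> N;                 // number of observations
  int<lower=1> K;                 // number of predictors
  matrix[N, K] X;                 // design matrix
  vector[N] y;                    // outcome
}
transformed data {
  // thin QR decomposition, scaled for numerical stability
  matrix[N, K] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  vector[K] gamma_tilde;          // coefficients on the Q_ast scale
  real<lower=0> sigma;
}
transformed parameters {
  // coefficients back on the original X scale
  vector[K] gamma = R_ast_inverse * gamma_tilde;
  // the non-negativity constraint would have to hold here:
  // R_ast_inverse * gamma_tilde >= 0, which cannot be declared
  // directly on a transformed parameter
}
model {
  gamma_tilde ~ normal(0, 1);
  sigma ~ exponential(1);
  y ~ normal(Q_ast * gamma_tilde, sigma);
}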

May I ask why there are non-negativity constraints?

If you add a lower bound of zero and there is probability mass around zero from the model, it will push that probability mass up (relative to an unconstrained parameter with the same prior). So we generally discourage hard bounds other than for physical constraints (such as a chemical concentration being positive).

Thanks Bob.

Sure. I work with monotone I-splines to model the cumulative baseline hazard for a survival model; since the cumulative baseline hazard is minus the log of the survival function, which in turn is non-increasing, the log cumulative baseline hazard needs to be non-decreasing. The \gamma are used to form a linear combination of the non-decreasing I-spline basis functions evaluated at the individual survival (or censoring) times.
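Schematically, with K basis functions, the construction is

\log H_0(t) = \sum_{k=1}^{K} \gamma_k I_k(t), \qquad \gamma_k \geq 0,

and since each I-spline basis function I_k(t) is non-decreasing in t, the non-negative weights guarantee that the combination is non-decreasing as well.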

Moreover, the likelihood has not only a term that depends on the log cumulative hazard but also a term that is proportional to the log of the derivative of the log cumulative hazard with respect to time.
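Concretely, writing h_0(t) for the baseline hazard,

h_0(t) = \frac{d}{dt} H_0(t) = H_0(t) \, \frac{d}{dt} \log H_0(t), \qquad \log h_0(t) = \log H_0(t) + \log\left(\frac{d}{dt} \log H_0(t)\right),

so (ignoring covariate effects) an observed event at time t_i contributes \log h_0(t_i) - H_0(t_i) to the log-likelihood, while a right-censored observation contributes only -H_0(t_i).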

For the model, see here.

Qualitatively, this is what the gamma vector of a particular model instance looks like; it seems suboptimal to me. It looks like I have funnel-like posteriors?!

They look absolutely fine to me (for what it’s worth). Constraints such as positivity, and other convex constraint sets, truncate the posterior. Here there is an unavoidable truncation at zero, so you only see some portion of the posterior you would expect to see in the unconstrained case. There is no easy way to avoid this, other than switching to a model that naturally encodes positivity.

A comparable example: attached is a pairs plot of the parameters from a model I fit that has a monotonicity constraint, which is also convex like the positivity constraint, but isn’t static, as the boundary of the feasible region moves with the other parameter values. The truncation of the parameter space is clearly visible (and expected) in the V2/V4 pair plot. These parameters also happen to correspond to the regression parameters of a design matrix that has had the QR decomposition applied to it. Such posteriors are absolutely acceptable, and the inference from this one was entirely sensible.

monoBoundaryPairs.pdf (2.7 MB)

You probably don’t want to do a QR reparameterization if you have a prior on the coefficients that is not centered at the origin. But in order to do what you are talking about, you would have to do some GHK stuff:
https://groups.google.com/d/msg/stan-users/GuWUJogum1o/LvxjlUBnBwAJ

The important point is that Stan’s sampler works on the unconstrained space. So any positive-constrained parameters are log transformed. Plotting the logs will show the true posterior shapes.
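Concretely, for a parameter declared as real<lower = 0> x, Stan samples y = \log x, and the (unnormalised) log density it actually explores is

\log p(e^{y}) + y,

where the extra + y is the log Jacobian of the inverse transform x = e^{y}.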

I don’t mean to be too much of a pedant, but this somewhat conflates the issue of model specification with the practicalities of sampling said model. The “true” posterior is the one defined by the model and the data. In the case of positivity, if your unconstrained posterior (the model where you forget about the constraint) has notable probability on either side of zero, then the constrained posterior will be the truncated, normalised version of the unconstrained posterior. This will result in a lot of probability “pushing up against” the constraint. The fact that Stan works in the unconstrained space has nothing to do with this.
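In symbols, in the simple case where the prior on \gamma doesn’t itself depend on other parameters: if p(\gamma \mid y) denotes the unconstrained posterior, the constrained model targets

p_{+}(\gamma \mid y) = \frac{p(\gamma \mid y) \, \mathbf{1}\{\gamma \geq 0\}}{\Pr(\gamma \geq 0 \mid y)},

which is exactly the truncated, renormalised version of the unconstrained posterior.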

As an aside, does this not result in a bunch of divergent transition warnings? If my constrained posterior has a lot of mass close to zero, then my Hamiltonian trajectory also has to get very close to zero at certain points, which is far out towards negative infinity on the log scale. I would suspect the step size and number of leapfrog steps are not simultaneously suitable for “smallish” values, say [-3, 0] on the log scale, and “very small” values of [-30, -20] and below. I guess I should go test this to see.

No worries. We love pedantry here.

Correct. That’s how Stan works. Stan samples on the unconstrained space. So to use Stan efficiently, you need to think about how sampling works on the unconstrained space.

That’s what you care about for inference. But Stan’s a programming language and to make it work, you also need to think about the “actual” posterior being sampled with HMC, and that’s the unconstrained version with the Jacobian for the constraining transform.

This is what our users often expect from bounded parameters, but it’s not what happens in the posterior.

The effect of a real<lower = 0> sigma constraint is simply to truncate the prior for sigma, provided the distribution given to sigma doesn’t depend on a parameter. If it does depend on a parameter, you need to include the truncation T[0, ] in the distribution statement to adjust for the non-constant CDF.
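A minimal sketch of the parameter-dependent case (mu here is just an illustrative hyperparameter):

parameters {
  real mu;                   // hyperparameter for sigma's prior location
  real<lower = 0> sigma;
}
model {
  mu ~ normal(0, 1);
  // the prior location depends on the parameter mu, so the truncation must
  // be written explicitly; T[0, ] normalises by the complementary normal
  // CDF at zero, which varies with mu
  sigma ~ normal(mu, 1) T[0, ];
}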

The problem is that this isn’t the same as truncating the posterior, which also has a factor for the likelihood.

Right, but the actual posterior being defined does. I’m just saying that if you care about whether Stan can actually fit the model and you want diagnostics for sampling, those are best done in the unconstrained space where the HMC sampling actually happens.

The Hamiltonians only exist in the unconstrained space. To get a lower-bounded parameter near zero means going toward negative infinity in the unconstrained space.

This isn’t as bad as you might suspect, because Hamiltonian dynamics doesn’t behave like Metropolis jumps. Instead, it builds up momentum across leapfrog steps, and that momentum is adjusted by the potential (the negative log density on the unconstrained scale). When you’re out in the tails, the potential energy is high and momentum builds up on the way back to the typical set.

It’s much trickier to deal with high curvature than with flat regions, as it can require an arbitrarily small step size to avoid divergences. That’s why the funnel’s so tricky (you’ll see that it’s also plotted on the unconstrained space in the manual).

Thank you for indulging me and writing all of that up! I appreciate the explanation.

Yes, I now realise my intuition was a little off here.

It is interesting (to me at least) that models like this one:

data {
}

parameters {
  real<lower = 0> x;
}

model {
  x ~ normal(0, 0.1);
}

also require a higher adapt_delta (\delta \geq 0.95) / smaller step size in order to avoid divergent transitions. To me this indicates that there is some kind of strangeness going on in the posterior, though I would expect it to be regularised out upon observing any amount of data. The “funnel” is obviously a far more pathological example, as additional data does not necessarily improve the shape of the posterior.

I just ran that model and didn’t see any divergent transitions in four default chains.

What’s problematic about it is that the lower = 0 constraint causes a log-transform on x. So as x goes to zero, the unconstrained parameter goes to minus infinity. There’s also probability mass above 0.2, which only transforms to -2 or so. So it’s hard to get a step size that is appropriate everywhere. The gradient and momentum help, though—the potential is very high out in the tails.
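Writing it out for that toy model: with y = \log x, the log density Stan explores is, up to a constant,

\log p(y) = y - \frac{e^{2y}}{2 \cdot 0.1^2},

which is essentially linear with slope 1 far out in the left tail but falls off very steeply once y gets above about -2, so no single step size is well matched to both regions.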