Non-centered parameterization for non-hierarchical parameters?

Hello all. I am very familiar with the non-centered reparameterization for hierarchical parameters aka the classic “Matt trick”; I use this all the time. Less commonly, I’ve also seen non-centering used in a non-hierarchical context. My conceptual understanding of this second usage is that it involves shifting and scaling a primitive parameter by the (known) approximate posterior mean and SD of its unscaled analog to make it closer to N(0,1) and improve adaptation of the mass matrix. I’ve never done this before, but now I have a use case with a few additional wrinkles and I’d appreciate some advice on whether I’m thinking about this right or (more likely) not.
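(To fix notation, here is a generic sketch of the two usages I mean, with made-up names — this is not the actual model, which is below:)

data {
  int<lower=1> J;
  real m;           // known approximate posterior mean of alpha
  real<lower=0> s;  // known approximate posterior SD of alpha
}

parameters {
  real mu;             // hierarchical mean
  real<lower=0> tau;   // hierarchical SD
  vector[J] theta_z;   // (1) primitive for the hierarchical "Matt trick"
  real alpha_z;        // (2) primitive for the non-hierarchical shift/scale
}

transformed parameters {
  vector[J] theta = mu + tau * theta_z;  // (1) implies theta ~ normal(mu, tau)
  real alpha = m + s * alpha_z;          // (2) alpha_z should be roughly N(0,1) a posteriori
}

model {
  theta_z ~ std_normal();  // (1)
  // (2) needs whatever prior on alpha_z implies the intended prior on alpha --
  //     which is exactly the question below
}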

A cartoon of the relevant portions of the model I’m fitting looks something like this (backstory here if anyone’s really interested, but it’s not necessary):

data {
  int<lower=1> N;        // sample size
  vector<lower=0>[N] S;  // observed spawners
  vector<lower=0>[N] C;  // known catch
  real<lower=0> sigma;   // penalty scale 
}

parameters {
  vector<lower=0,upper=1>[N] h;  // catch rate
}

transformed parameters {
  vector<lower=0>[N] S_hat;  // predicted spawners
  // do some stuff that includes using h and calculating S_hat
}

model {
  // likelihood (penalty)
  C ~ lognormal(log(S_hat) + logit(h), sigma);
}

The likelihood statement is really just a “penalty” to force \log(\widehat{C}) = \log(\widehat{S}) + \text{logit}(h) to be as close to \log(C) as possible. (I should note that \widehat{S} is constrained by a hierarchical state-space model as well as its own observation model, so the model is still identifiable, unlike the case where \widehat{S} and h are completely free.) The problem is that as I make \sigma smaller, especially < 0.05 or so, sampling starts to slow down dramatically, presumably because of (lack of) adaptation to the corresponding narrow posterior contours of h. This seems like a good case for non-centering, since I know the posterior should be close to \text{logit}(h) \sim N(\log(C) - \log(\widehat{S}), \sigma), and I can approximate \widehat{S} with the observed S. So is the following what I want to do? (BTW, I’m not using the <offset, multiplier> construct because vector arguments aren’t in the current rstan release yet.)

data {
  int<lower=1> N;        // sample size
  vector<lower=0>[N] S;  // observed spawners
  vector<lower=0>[N] C;  // known catch
  real<lower=0> sigma;   // penalty scale 
}

transformed data {
  vector[N] mu = log(C) - log(S);  // approximate posterior mean of logit(h)
}

parameters {
  vector[N] h_z;  // Z-scored logit(h) 
}

transformed parameters {
  vector<lower=0,upper=1>[N] h = inv_logit(mu + sigma * h_z);
  vector<lower=0>[N] S_hat;  // predicted spawners
  // do some stuff that includes using h and calculating S_hat
}

model {
  // prior
  // implies logit(h) ~ logistic(0,1) <=> h ~ uniform(0,1)
  h_z ~ logistic(-mu/sigma, 1/sigma);  
  
  // likelihood (penalty)
  C ~ lognormal(log(S_hat) + logit(h), sigma);
}

The prior on the primitive h_z seems wrong, but I’m not sure how else to ensure h ~ uniform(0,1) as in the original model. I could just specify that directly, or do logit(h) ~ logistic(0,1) (or compute logit_h in transformed parameters), but then I’d need a Jacobian. In any case, putting the usual unit prior on the primitive parameters like h_z ~ logistic(0,1) definitely does not give me the intended prior on h. Do I have the affine transformation inverted or something?
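For concreteness, here is the “sampling statement on logit(h)” version I have in mind (just a sketch, with the rest of the model as above). If I’m thinking about it right, since mu and sigma are data, the Jacobian of logit(h) = mu + sigma * h_z with respect to h_z is the constant N * log(sigma), so including or dropping it shouldn’t change the posterior:

transformed parameters {
  vector[N] logit_h = mu + sigma * h_z;  // affine in h_z, data-only offset and scale
  vector<lower=0,upper=1>[N] h = inv_logit(logit_h);
  vector<lower=0>[N] S_hat;  // predicted spawners
  // do some stuff that includes using h and calculating S_hat
}

model {
  // prior: implies h ~ uniform(0,1)
  logit_h ~ logistic(0, 1);
  target += N * log(sigma);  // constant Jacobian term; harmless either way

  // likelihood (penalty)
  C ~ lognormal(log(S_hat) + logit_h, sigma);
}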

Thanks for bearing with me!


Hi,
unless I am missing something, I would expect the transformation not to help you much - Stan in its default settings adapts the diagonal of the mass matrix, which AFAIK means it independently rescales individual parameters (on the unconstrained scale) as it sees fit. So rescaling by a constant is likely to make very little difference.

Yes, but my impression is that the Jacobian transform described at 10.4 Lower and upper bounded scalar | Stan Reference Manual should be easy to adapt.

Also, if you care about minimizing the penalty, why not just use the optimize method? I presume there is a reason, I just don’t see it :-)

I know I didn’t answer your question directly, but hope that helps at least a little.

Thanks for the reply @martinmodrak. I eventually tried a working version of the mock-up above (you have to make S_hat a parameter and include its observation likelihood) and sure enough it does not help. In fact, although it gives the intended posterior, it’s slower than the centered version. I didn’t try the versions that need a Jacobian, although yes, it is very simple in this case; I don’t imagine they’d be faster.

I’m a bit puzzled, though, as the <offset, multiplier> examples in the manual (and equivalent hand-coded syntax that’s been discussed in various threads on here) suggest that shifting and/or rescaling by a constant can help the adaptation find the posterior scales more quickly and sample more efficiently. I wonder if the slowdown in my model as \sigma gets small is due not just to the scales, but to the tightening (local) correlation between \log(\widehat{S}) and \text{logit}(h) which a diagonal mass matrix cannot account for.
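For reference, this is the declaration I would have written if vector offset/multiplier arguments were available in my rstan version; as far as I understand it, it is equivalent to the hand-coded h_z version above:

parameters {
  // shift/rescale logit(h) by the data-only mu and sigma
  vector<offset=mu, multiplier=sigma>[N] logit_h;
}

transformed parameters {
  vector<lower=0,upper=1>[N] h = inv_logit(logit_h);
  // ... S_hat as before ...
}

model {
  logit_h ~ logistic(0, 1);  // implies h ~ uniform(0,1)
}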

Sorry, I should have been more clear that this is just one small part of a large, complex state-space population dynamics model that I am fitting with full Bayes. In general the data contain observation error, but C represents a small absolute number of fish taken from the population that are individually handled, measured, sexed, and spawned in a hatchery, so it is effectively known without error. (I only called it catch in my toy example because I thought it would be easier to understand, LOL.)

When I first started developing this model, way back in the Google Group days, I simply subtracted C from \widehat{S}, which of course led to constraint violations and rejections when the difference was negative. Michael Betancourt advised me in no uncertain terms that this was verboten (it’s a very common practice in fisheries models, but that’s another matter) and so I had to resort to this kludge of estimating the removal rate and giving C an arbitrarily precise likelihood. The alternative would be to somehow incorporate the removals into the lower bounds of the hierarchically noncentered primitives that give rise to \widehat{S}, which is not feasible practically (you’d have to invert the complex process model, plus there’s the hassle of vector lower bounds) or even conceptually (the model is iterative, so you’d basically have to do the transformed parameters block within the parameters or transformed parameters declarations).

I’ve mostly made my peace with the kludge, but I wish I could get the penalty CV under 5% without taking a major performance hit. In some data sets the actual observation errors are around the same magnitude!


I can imagine a fixed multiplier could make some difference if it is very large or very small, but I think the interesting cases are when the offset/multiplier are parameters or functions of parameters.

I think issues as you try to move observation error towards zero are quite common - my understanding is that they often induce a sharp peak/ridge in the likelihood which is then problematic for the sampler. I guess you couldn’t use algebra_solver to enforce the equality by solving for h given C and S_hat instead of having it as a parameter (or solving for some other suitable subset of the parameters)…

It might also be the case that varying the sigma for individual elements could make the geometry better behaved, but I am just guessing here…


I can confirm that passing the (diagonal) inverse metric can dramatically speed up computation in some models (otherwise initial exploration requires large treedepths, and this lasts fairly deep into warmup). Offset-multiplier can achieve the same thing by transforming the posterior to one in which the initial state of the mass matrix (all ones) is good.

However, there is potential additional speedup (and potential risk) in achieving this with offset/multiplier rather than the inverse metric, because the offset/multiplier also results in inits that are directly based on the user’s initial guesses for the marginal posteriors. This can speed early adaptation, avoid bad initial updates to the mass matrix (if windowed adaptation otherwise begins before convergence), and shorten the overall length required for warmup. Of course, it also yields inits that are not overdispersed with respect to the user’s initial guess for the posterior, which can be dangerous.

True, I’ve seen this behavior in other state-space contexts and in other parts of this model as well. It’s amazingly nonlinear, though: going from \sigma = 0.1 to 0.05 doesn’t do much, but going from 0.05 to 0.01 slows down sampling by nearly 2X (measured as wall time or time/n_eff).

Interesting suggestion. I think solving for h directly would run into another version of the original constraints problem, in that h could exceed 1. In principle I could solve the system of equations to yield constraints on some combination of primitive and/or transformed parameters, either by hand or using algebra_solver (which I reckon would be slower for sure). But then I’d be back to the problem of how to encode those constraints in the parameter declarations, given that finding the constraints basically requires executing the transformed parameters block (or some inversion thereof). Maybe there’s some tortuous way to do it using local variables, but when I’ve thought it through in the past, I can’t see it.

Good point. This model has a custom inits function, but I hadn’t thought about how to modify those inits when using <offset, multiplier> or the equivalent. (If it helped, which it doesn’t.)

I know I’ve seen discussions about passing in a mass matrix, but haven’t followed closely. Maybe I should look into that.

Passing in the mass matrix works quite well, and conveniently it works even when you need to hierarchically non-center some of the parameters. Not sure it would help in your particular case, of course.

One question about your use case: is it possible that the remainder of the model believes very strongly that \hat{C} is not equal to C, so that as you tighten your constraint you end up forced into a weird part of the tail of the distribution of \hat{C}, which might present challenging geometry?


Leaving aside the “how to do it” question, one thing I’ve wondered about this approach is the workflow. Do you fit an initial short run to estimate the mass matrix for subsequent longer runs? Or just use the estimate from the first full run, assuming you’re going to be fitting the same model a bunch of times? What happens when you tweak the model or get new data, etc.? Seems like this would work best for stable models fitted to a single or a few data sets.

Good question; I’m pretty certain this is not the issue. For one thing, I see the same behavior with simulated data. And with real data, \widehat{C} generally ends up being very close to C, without any crazy outliers. It’s just that if I try to tighten up the relationship a wee bit more, sampling slows down dramatically. I do suspect that the local correlation I mentioned earlier plays a role – it’s not just the narrowing marginal posterior of h, but the tightening ridge that HMC struggles with.

Whatever strategy you had for estimating the multiplier term for \hat{S}, you can apply to the corresponding element of the mass matrix instead (and leave the rest as ones). My use cases for passing the full inverse metric have been models for which small parts of the structure and/or data are iteratively updated several times. Using the inverse metric from a previous fit can be a real time-saver in these cases.

Assuming that you have so many parameters that you need a diagonal (rather than dense) mass matrix, a better inverse metric probably won’t help you here (and neither would the equivalent bit of offset-multiplier tinkering).

If the inverse metric is the problem, then one telltale signal will be that the sampling improves pretty dramatically over the course of warmup, and the treedepths reached during late warmup and sampling are lower than those reached during the earlyish phases of warmup.

Ah, OK, that’s another thing I wasn’t sure about. This does make it more attractive to experiment with.

This is also very helpful w.r.t. “how to know when to do it”. Warmup in my model is certainly slower per iteration than sampling, but it seems to find the bulk pretty quickly. I haven’t systematically looked at how treedepth changes, though, so I’ll try that. Thanks!

The variance / covariance estimate gets regularized towards the identity matrix, see Specify metric regularization constant?

Not sure whether this may be the problem.

I am not sure I follow completely, but it is definitely possible to have algebra_solver honour at least some constraints on the solutions by using the same or similar transforms Stan uses to enforce its constraints: if e.g. we want 0 < h < 1, one can rewrite the equation system with h_bar = logit(h) as the free parameter to be determined, which automatically enforces the constraint on h. It is still possible that the solver calls would be too expensive to be practical, but just wanted to note the option.
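To sketch what I mean on the cartoon model (names like catch_eq, log_C and x_i_empty are made up; in the cartoon the solution is available in closed form, and in your real model S_hat itself depends on h, so this is only meant to show the shape of the call, assuming S_hat is available as a parameter as you described):

functions {
  // residual of log(C) = log(S_hat) + logit(h), with y[1] = logit(h) as the unknown,
  // so 0 < h < 1 holds by construction
  vector catch_eq(vector y, vector theta, real[] x_r, int[] x_i) {
    return [ theta[1] + y[1] - x_r[1] ]';  // theta[1] = log(S_hat[i]), x_r[1] = log(C[i])
  }
}

transformed data {
  real log_C[N] = to_array_1d(log(C));  // data-only solver arguments
  int x_i_empty[0];
}

transformed parameters {
  vector<lower=0,upper=1>[N] h;
  for (i in 1:N) {
    vector[1] logit_h_i = algebra_solver(catch_eq, rep_vector(0, 1),
                                         [ log(S_hat[i]) ]', { log_C[i] }, x_i_empty);
    h[i] = inv_logit(logit_h_i[1]);
  }
}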

Oh, I just meant that, whether using algebra_solver or actual algebra (which wouldn’t be mathematically complex, just very cumbersome) to solve for lower bounds on a “suitable subset of parameters” that jointly enforce the positivity constraint, the obstacle still would be computing those solutions in order to use them as parameter bounds. Like I say, I’ve tried to figure out a way of doing this with local variables to no avail. If I were just subtracting C from a (hierarchically re-centered) primitive parameter, it would be straightforward, but I’m “subtracting” it from a linear combination of multiple transformed vector and matrix parameters that include covariate effects, etc. etc. Figuring out at the parameter declaration stage that S_hat[i] is going to wind up being negative is a nontrivial problem.

OK, so I obviously don’t see the details, so I might be missing stuff or repeating things you already know, but some steps I’ve found practical:

  1. Maybe you could parametrize in terms of S_hat instead? If "valid S_hat" implies "valid h" and not vice versa, that might help
  2. In recent Stan you can define a vector of upper/lower bounds and then do stuff like
functions {
  vector compute_bounds(vector a, vector lot, real of, real params) {
    vector[num_elements(a)] bounds;
    // complex computation
    return bounds;
  }
}

parameters {
  vector<upper=compute_bounds(some, relevant, param, values)>[N] h;
}

which should let you do quite complex computations without much trouble.

EDIT:

And as a last resort, if you e.g. need to know the value of h[1] to determine bounds for h[2], you can always do the constraint transforms yourself (the transforms and Jacobians can be found at 10 Constraint Transforms | Stan Reference Manual), e.g.

parameters {
   vector[N] h_raw;
}

transformed parameters {
   vector[N] h;
   real log_jacobian = 0;
   for (n in 1:N) {
      real lb;  // lower bound for h[n]
      real ub;  // upper bound for h[n]
      real inv_logit_h_raw = inv_logit(h_raw[n]);
      // complex computation to get the bounds,
      // possibly depending on earlier elements of h
      h[n] = lb + (ub - lb) * inv_logit_h_raw;
      log_jacobian = log_jacobian + log(ub - lb) + log(inv_logit_h_raw) + log1m(inv_logit_h_raw);
   }
}

model {
   h ~ whatever you need;
   target += log_jacobian;
}

I actually tried something along these lines back when I was first wrestling with this problem. It was extremely awkward and totally mangled the generative interpretability of the model, but I did get a simplified version coded up correctly as far as I could tell. Sampling was disastrous; evidently the posterior geometry was much more difficult than the “natural” parameterization.

I am looking forward to the vector bounds making their way into rstan, although I’m not sure it will help in this case, at least not without concatenating many separate vector and matrix parameters with interdependent bounds into one big vector or array. It’s a feature I’ve wished for in general, though!

If the rstan version is the only thing holding you back, you can get the latest rstan (which includes vector bounds) via Repository for distributing (some) stan-dev R packages | r-packages. We are slowly working towards updating rstan on CRAN, but it will likely still be a while, for mostly stupid reasons.

And as I said above, you can always do the constraining manually, which leaves more room for error but will work for any case.
