Linearly scaled, unconstrained parameters

This is something @andrewgelman has been asking for and it finally dawned on me during yesterday’s meeting why we need it.

I created an issue for this in stan-dev/stan with all the details.

The basic idea is that if we allow something like

vector<loc = mu, scale = sigma>[K] beta;

and we transform unconstrained x to mu + sigma * x, then we can leave our parameters on their natural scales and still start in a much better place for adaptation. Even better, if we just add

beta ~ normal(mu, sigma);

then we get a non-centered parameterization in

beta_unc = (beta - mu) / sigma;

because the Jacobian on the transform cancels the -log(sigma) normalization.
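To spell that cancellation out for a single component (nothing here beyond the standard normal log density and the affine Jacobian):

log normal(beta | mu, sigma) + log | d beta / d beta_unc |
  = -0.5 * ((beta - mu) / sigma)^2 - log(sigma) - 0.5 * log(2 * pi) + log(sigma)
  = -0.5 * beta_unc^2 - 0.5 * log(2 * pi)
  = log normal(beta_unc | 0, 1)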

The plan is to get to this after refactoring the base model class to allow the algorithms to be precompiled.

Can you say more about this? What are mu and sigma defined as? Where does x come in?

Let me try again. mu and sigma are just arguments for the linear transform, like the arguments to lower = and upper = for bounded real values.

I might write something like this to generate a non-centered parameterization of beta.

parameters {
  real mu;
  real<lower = 0> sigma;
  int<lower = 0> K;
  vector<loc = mu, scale = sigma>[K] beta;
  ...
model {
  beta ~ normal(mu, sigma);
  ...

Given the transform, the “unconstrained” vector over which we sample is given by

beta_unc = (beta - mu) / sigma

The inverse transform from unconstrained to constrained is just

beta = mu + sigma * beta_unc

So the log Jacobian adjustment is K * log(sigma). The normalization term in the normal log density is -K * log(sigma), so it all cancels out.

As a result, beta_unc ~ normal(0, 1) as required for the non-centered parameterization.
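For reference, this is what that non-centered parameterization looks like written out by hand in current Stan (just a sketch, with the rest of the model elided):

data {
  int<lower = 0> K;
}
parameters {
  real mu;
  real<lower = 0> sigma;
  vector[K] beta_std;
}
transformed parameters {
  vector[K] beta = mu + sigma * beta_std;
}
model {
  beta_std ~ normal(0, 1);  // implies beta ~ normal(mu, sigma)
}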

The centered parameterization would just drop the linear transform,

vector[K] beta;

with no constraints (or defaults loc = 0 and scale = 1).

K should be data, right? Just double checking. I’m not quite sure what’s happening with this syntax… should the model block actually read: beta ~ normal(0, 1); ? Where do we get the 0, 1 standardized version? What does it mean that we have both the prior over beta as normal(mu, sigma) and the loc and scale of beta set to mu and sigma? What would happen if they weren’t the same? Sorry if I’m being dense here.

More questions about the feature being proposed here:

  1. What happens if you use loc and scale with beta and then write that beta has some distribution other than normal?
  2. Is this basically just a mechanism for standardizing the parameters? I think that could be made more clear syntactically. For example, we could add a plain distribution called something like standardized_normal, leave beta declared as is, and just change the prior line to beta ~ standardized_normal(mu, sigma);.
  3. I guess it just feels kludgy to me that loc and scale are both part of the same transform but the words loc and scale have very many meanings in the world of statistics and it is not at all clear reading this what it would do. You would need to spend a long, long time Googling to even find what it means. I would suggest at the very least we make these names Google-able and show that they are part of the same transform; something like real<standardizing_loc = mu, standardizing_scale = sigma>. But I might suggest we could go further and encapsulate this single transformation in one language concept.

When I had to construct ‘something like this’, I used the terms offset and multiplier.

The answer to all of your questions is that it’s just a linear transform. Our upper-bound is a log transform, as is a lower bound. If you have upper and lower bounds, it’s a logit transform. For the simplex, it’s stick breaking. The behavior is fully defined by the Jacobian that corrects for the transform.

All sizes have to be integer data.

No, it should read beta ~ normal(mu, sigma). That’s the cool part. You get the non-centered parameterization through the transform. The model stays in its more natural form (that is, the way everyone writes it in papers).

Non-centered parameterization

vector<loc = mu, scale = sigma>[K] beta;
...
beta ~ normal(mu, sigma);

Centered parameterization

vector<loc = 0, scale = 1>[K] beta;
...
beta ~ normal(mu, sigma);

loc = 0, scale = 1 is the default, like lower = -infinity, upper = infinity.

You need to work through the formula for the normal log density and the Jacobian of the transform to see how it all works out.

Nothing presupposes even a location-scale distribution (such as normal or Student-t) here. So you could use this with a lognormal, or a gamma, or anything else you want. This is similar for our lower and upper bounds. We can write real <lower = 0> sigma and then combine that with sigma ~ normal(0, 1) and we get a half-normal due to the combination of the full normal and constraint. That is, everything’s just compositional as defined.
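In code, that half-normal composition is just:

parameters {
  real<lower = 0> sigma;
}
model {
  sigma ~ normal(0, 1);  // with the lower = 0 constraint, this puts a half-normal prior on sigma
}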

It’s more general. We don’t need to apply it to just a normal. We can’t guarantee the parameter will be standard normal in the prior or in the posterior. We often “standardize” predictors by mapping a vector x to z(x) = (x - mean(x)) / sd(x), which makes sure that the vector z(x) has a sample mean of zero and a sample sd of 1. But what we’re really aiming for is something that’s roughly standard normal in the posterior.

I agree it’s clunky and potentially confusing with the names in statistics. But it’s not standardizing, so I don’t like standardizing_loc.

We could use intercept and slope to make it clear it’s linear and remain neutral and not confuse things with the normal parameter names.

I don’t think we want to go down the ggplot2 route for the name of something simple like this. We don’t use standardizing_lower or anything like that and it really adds verbosity where we don’t want it in the variable definitions (making it hard to scan the variable names). I’m thinking we need to go the other way and make things like pos_real for real<lower = 0> and prob or unit_real for real<lower = 0, upper = 1>.

Question: do we want to restrict the slope/scale to be positive? I’m implementing this now.

This is great! I have been doing this manually in some of my models and it seems to help a lot with the warmup. When the parameters are on wildly different scales, the first stage of the warmup before the mass matrix is set can go very slowly.

I was thinking about trying to just specify the mass matrix.

But this is a much cleaner way to do it and doesn’t require matching up parameter indices. Looking forward to this!

So have we, which is the motivation for adding this.

Yes. There’s nothing gained from allowing it to be negative.

In the multivariate version, the slope/scale/multiplier is a Cholesky factor of a positive-definite matrix. So restricting the 1-D case to be positive makes sense.
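Written out by hand in current Stan, that multivariate version would look something like this (a sketch; the variable names are just for illustration):

data {
  int<lower = 0> K;
}
parameters {
  vector[K] mu;
  cholesky_factor_cov[K] L_Sigma;
  vector[K] beta_std;
}
transformed parameters {
  vector[K] beta = mu + L_Sigma * beta_std;
}
model {
  beta_std ~ normal(0, 1);  // implies beta ~ multi_normal_cholesky(mu, L_Sigma)
}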

After more thought, I like @Charles_Driver’s suggestion of offset and multiplier because it works well for both the univariate and Cholesky factor version and doesn’t imply any underlying statistical model. We could also go with location/scale. As @bgoodri pointed out, slope/intercept doesn’t make much sense for the multivariate case.

Irrespective of what we call the keywords, this transformation only makes sense if the prior is among the location-scale family of distributions, so it seems like we might as well call the keywords location and scale.

scale can make sense even if location doesn’t, for example in a Gamma distributed parameter.

I think that location and scale are entirely accurate names here. Remember that any distribution can be made into a “location-scale” family of distributions by applying the same transformation that’s being considered here, so it’s consistent with the statistical usage and independent of any particular distribution. In particular this transformation is meaningful regardless of whether or not it “undoes” a preceding transformation implicit in an existing location-scale family of distributions. It also works for both the scalar and the multivariate case as location and scale have conventional generalizations to higher-dimensional spaces.

I implemented this now using the names location and scale in the PRs:

https://github.com/stan-dev/math/pull/1048

Will add some documentation and try to get it merged asap.

It’s not too late to change the names/syntax if people feel strongly about it.

Does anyone have strong feelings against location and scale? In my mind they would also be fine terms for the multivariate case of a vector and a Cholesky factor. The PRs are ready for review, but it’s not too late to change the names.

location/scale sound good to me; I also like offset/scale because offset sort of makes more sense with distributions defined on bounded domains.

After a lot of thought, I prefer location and scale, but would also be OK with offset and multiplier. Both of these work with the multivariate case.

Andrew brought up stacking multiple transforms, such as the non-centered parameterization on the log scale, which might look like this:

parameters {
  real<location = mu, scale = sigma><lower = 0> theta;
}

model {
  theta ~ lognormal(mu, sigma);
}

and would be equivalent to

parameters {
  real theta_std;
}

transformed parameters {
  real<lower = 0> theta = exp(mu + sigma * theta_std);
}

model {
  theta_std ~ normal(0, 1);
}

The transforms are arranged from real outward in order of transforming unconstrained parameters. Each transform is picked out by its arguments here, but @seantalts has been suggesting a less implicit syntax going forward.

The log Jacobians just add, so it’s trivial from a computational point of view.
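Concretely, for theta = exp(mu + sigma * theta_std) from the example above, the two contributions are the affine Jacobian and the exp Jacobian:

log | d theta / d theta_std | = log(sigma) + (mu + sigma * theta_std)
                              = log(sigma) + log(theta)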

We don’t want to roll out bigger changes like this, though, until we switch over parsing frameworks, so we’ll be sticking to just the single transform case for the time being.

I like the idea of stacking multiple transformations.

But I wonder if it could be problematic if someone writes:

parameters {
  real<lower = 0, location = mu> theta;

Since that breaks down if the offset mu pushes theta below the lower = 0 bound. Or would you impose an ordering, such as scale can’t follow location, and location and scale can’t follow the interval constraints?

That’s not inconsistent if done in the proper order,

real<location = mu><lower = 0> theta;

which yields

theta = exp(theta_raw + mu)

The other way around

real<lower = 0><location = mu> theta;

leads to

theta = exp(theta_raw) + mu

so we have to make sure the bounds are respected, as always. We have the same kind of problem in <lower = L, upper = U> if U < L.