Generated code for offset-multiplier constraints (with an eye towards multivariate Gaussians and GPs)

Hi all,
I’m obviously going isolation mad because I’m picking up all of my projects that I’ve been swearing I’d get around to when I’ve got uninterrupted time. So apologies for the sudden content surge!

I’ve been looking at the generated code for the offset_muliplier_constrain and I’ve got a couple of questions about what is possible if we’re going to be able to extend this to multivariate settings. I found a dormant branch @Matthijs wrote that has the code for doing this (PR, branch). There’s no discussion on why it’s dormant, but i think I can see the problem, so I’ll outline it here.

Background

The offset_multiplier_constrain funciton is the backbone of one of our fancy bijectors, which allows us to write non-centred code like this

// irrelevant code
parameters {
  real mu;
  real sigma;
  vector<offset = mu, multiplier = sigma>[10] a;
// other code
}
model {
  // irrelevant code
  a ~ normal(mu, sigma);
}

This is obviously lovely because it just involves a “type change” to go between centred and non-centred parameterizations. So cool! (Weird thing: vector<multiplier = sigma, offset = mu> throws an error, even though there’s no logical order for those named parameters)

Looking at the generated code, a (simplified) pseudocode version of what is generated is

// bunch of stuff including defs for mu and sigma 

Matrix<scalar> a = a_unconstr; // a_unconstr is the unconstrained value
a = offset_multiplier_constrain(a, mu, sigma, lp__);  // lp__ accumulates jacobian

// bunch of stuff

lp__ += normal_log<propto>(a, mu, sigma);

//more stuff

Mathematically, this is roughly the same as computing
lp__ += -size(a)*log(sigma) - 0.5*( (sigma*a_unconstr + mu) - mu)^2/sigma^2;

For the scalar case, the redundant calculation doesn’t really matter.

What would happen when we’re multivariate?

Well, nothing right now. But if we use the implementation on Matthijs’ branch, we would be able to write something like

// irrelevant code
parameters {
  vector[10] mu;
  matrix[10,10] sigma;
  vector<offset = mu, chol = cholesky_decompose(sigma)>[10] a;
// other code
}
model {
  // irrelevant code
  a ~ multi_normal(mu, sigma);
}

Note the slightly cool thing that the parameters to the bijector can be functions!

Using the same type of code geneation would be bad for two reasons:

  • We would need a second, unnecessary Cholesky decomposition inside multi_normal. Because of the way the blocks currently work, we can’t save the computed Cholesky for reuse in multi_normal_cholesky.
  • If we ignore mu, the generated code will be computing the equivalent of a^TL^T\Sigma^{-1}La, where \Sigma = LL^T. Although this is mathematically equivalent to a^Ta it will probably not be in floating point.

So we’ve got an expensive operation (a cholesky, a solve, and two matrix-vector mults) that is not floating point safe.

Questions

Is there a way to generate the code so it doesn’t do this? Basically, it would recognize that

  • a uses an offset-multiplier bijector
  • a is the “target” (lvalue? I’m bad on lingo) of a multi_normal(m, sigma) or multi_student_t(nu, m, sigma) distribution with the same sigma (m doesnt’ need to equal mu).

In this case the generated code for a slightly different case (to take into account that the offset may not equal the man of the multivarite normal!). There’s also slightly different syntax in the bijector to explicitly give the covariance matrix rather than the cholesky. I think this is fairly clean syntax for the multivariate non-centring.

// irrelevant code (possibly defining data mu)
parameters {
// irrelevant code (possibly defining parameter mu)
  vector[10] m;
  matrix[10,10] sigma;
  vector<offset = mu, covariance = sigma>[10] a;
// other code
}
model {
  // irrelevant code
  a ~ multi_normal(m, sigma);
}

would be more like

// bunch of stuff including defs for mu and sigma, m

Matrix<scalar> a = a_unconstr; // a_unconstr is the unconstrained value
lp__ += normal_log<propto>(a, m - mu, 1.0); // 

// bunch of stuff

a = offset_multiplier_constrain(a, mu, sigma, lp__);  // lp__ accumulates jacobian

//more stuff

Note that I had to re-sort the expression graph! This may not be possible! The generated code with the transformed parameters-based non-centring actually stores the unconstrained values separately, so it can keep the original order.

If something like this would be possible, it’d be great because it would mean we could use the same code for centring and non-centring GPs and multi_normals that we use for the scalar case!

3 Likes

Ack. I see the problem.

Not that I see while preserving sigma as a covariance matrix.

For the example at hand, it’s better all around to do this:

parameters {
  vector[10] mu;
  cholesky_factor[10] L_Sigma;
  vector<offset = mu, multiplier = L_Sigma> a;
  ...
}
model {
  y ~ multi_normal_cholesky(mu, L_Sigma);
}

I’d also strongly prefer to keep multiplier rather than chol to reduce number of names. In retrospect the whole lpdf vs. lpmf thing was a mistake as we’re about to introduce mixed distributions for things like HMMs.

That’s a bug. Would you mind raising an issue on stanc3 repo? There also shouldn’t be an order of upper and lower when both are present.

1 Like

Sorry for reviving a thread after two months, but I think @nhuurre has solved this.

See https://github.com/stan-dev/stanc3/pull/552

With develop cmdstan you can set this in make/local to test this PR:

STANC3_TEST_BIN_URL = https://jenkins.mc-stan.org/job/stanc3-test-binaries/106/artifac

1 Like

This would be awesome if it’s fixed!

For reasons I’m not going to try to get to the bottom of on a Friday night, I can’t build this.

But if someone manages to, could you please check the generated C++ code for

parameters {
  vector[10] mu;
  cholesky_factor[10] L_Sigma;
  vector<offset = mu, multiplier = L_Sigma> a;
  ...
}
model {
  a ~ multi_normal_cholesky(mu, L_Sigma);
}

to see what’s generated in by that last line. In particular, check that the transformation isn’t being done and undone like it is in the scalar case. (It should be a standard normal on the transformed variable. If it calls multi_normal_cholesky_lpdf(a, mu, L_Sigma) then this will be very very very inefficient).

(I’d check the OCaml but I can’t even work out where to look.)

Not quite. That only implements arrays of scalar multipliers. Cholesky factors are not supported.

1 Like

Ah ok, I missunderstood then. Sorry for wasting (part of) your friday evening @Daniel_Simpson.
I misremembered the thread and didnt reread it closely enough. My bad.