Feature request: Jacobians of unconstraining transform

I’m interested in using Stan’s modeling language and autodiff for general modeling tasks, not just the inference algorithms that Stan provides. The ambition is that a modeler could write the functions they need in the Stan language, compile them, and have access to a high-level interface that returns gradients, Hessians, and Jacobians of constrained and unconstrained parameterizations as needed, taking care of all the constraining and vectorization behind the scenes.

A lot of the pieces are already there, but one critical piece of functionality appears to be missing from the modeling language: the ability to calculate the Jacobian of the constraining and unconstraining transform. Sometimes modelers need derivatives in the constrained space. One example is getting the updates for the natural parameters of mean field coordinate ascent in VB, which can be expressed as partials with respect to the constrained moment parameters. Another is doing prior robustness calculations, which requires the gradient of the log prior with respect to the original prior parameters. With the Jacobian of the transform, these could be calculated from the already-provided gradients in the unconstrained space. And just to be extra clear, I’m not talking about the log determinant of the Jacobian that goes into the log likelihood – I mean the Jacobian matrix.

Any thoughts on how possible it would be to expose this for Stan models?

Thanks, Ryan


Mind putting together a concrete example? I’m not quite following.

You really want the Jacobians of the inverse transforms from unconstrained
to constrained space?

They’re all described in a chapter in the manual, but not implemented
as such, because we only needed the log determinant, which is often much
much simpler.

The Jacobians themselves can be hairy—the constraining (inverse) transform
for a covariance matrix literally maps R^((n choose 2) + n) -> R^(n x n),
which leads to a rather large Jacobian (though the result is symmetric, so you
really only need (n choose 2) + n output dimensions if you had the right
kind of expression template).

Then they all just combine together into a big block diagonal to give
you the transforms from all parameters.
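(That is, if the transform for the k-th parameter block has Jacobian J_k, the
Jacobian for the full parameter vector is just J = blockdiag(J_1, ..., J_K),
since each transform only touches its own block.)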

  • Bob

Yes, I understand that the Jacobian matrix may be large. Let me give a simple concrete example. Daniel suggested a potential solution off-thread that I’ll mention at the end with a small caveat.

Suppose I have a multivariate normal model with unknown mean theta and known covariance x_cov. Let’s assume a conjugate prior on theta with known prior_mu and prior_sigma:

x ~ MVN(x | theta, x_cov)
theta ~ MVN(theta | prior_mu, prior_sigma)

Here is a stan model:

// The generating model for data x.
data {
    int <lower=0> K;
    vector[K] prior_mu;
    cov_matrix[K] prior_sigma;
    cov_matrix[K] x_cov;
    vector[K] x;
}
parameters {
    vector[K] theta;
}
model {
    // The prior
    target += multi_normal_lpdf(theta | prior_mu, prior_sigma);

    // The likelihood
    target += multi_normal_lpdf(x | theta, x_cov);
}

Now, suppose I want to calculate the local sensitivity of the posterior expectation of theta to the prior parameters. For some prior parameter \alpha this can be expressed as

Cov(\theta, \frac{\partial}{\partial \alpha} \log p(\theta \mid \alpha))
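(Spelling out where that identity comes from, with all expectations and covariances taken under the posterior p(\theta \mid x, \alpha) \propto p(x \mid \theta)\, p(\theta \mid \alpha): differentiating the posterior expectation under the integral, and picking up a term from the normalizing constant, gives

\frac{\partial}{\partial \alpha} E[\theta] = E\left[\theta \, \frac{\partial}{\partial \alpha} \log p(\theta \mid \alpha)\right] - E[\theta] \, E\left[\frac{\partial}{\partial \alpha} \log p(\theta \mid \alpha)\right] = Cov\left(\theta, \frac{\partial}{\partial \alpha} \log p(\theta \mid \alpha)\right).)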

For each of the prior parameters, prior_mu and all the upper-triangular components of prior_sigma, I would like to calculate \frac{\partial}{\partial \alpha} \log p(\theta \mid \alpha) for each draw \theta from the model. Here is how I could do that with a Stan model:

// A model for differentiating the prior of data_model at a particular
// observation of theta.
data {
    // The "data" are the parameter at which we're evaluating the prior.
    int <lower=0> K;
    vector[K] theta;
}
parameters {
    // The "parameters" are the prior parameters that we are differentiating
    // with respect to.
    vector[K] prior_mu;
    cov_matrix[K] prior_sigma;
}
model {
    target += multi_normal_lpdf(theta | prior_mu, prior_sigma);
}

By getting the gradient of the log of the second model, I can calculate the gradient of the log prior in the unconstrained space. However, that space is hard to interpret – I actually want the gradient in the constrained space. To transform it back I need the Jacobian.

Off-thread, Daniel suggested simply making the parameters in the second model unconstrained, i.e.

vector[K] prior_mu;
matrix[K, K] prior_sigma;

One small problem is that prior_sigma is not actually unconstrained – it must be symmetric. I.e., the derivative with respect to prior_sigma[1, 2] is not meaningful – I would always want the derivative with respect to prior_sigma[1, 2] plus the derivative with respect to prior_sigma[2, 1]. I could do this by hand, of course, but it would be very nice and much less error-prone to simply get the right answer directly.

We’ve been talking about factoring the model in natural
parameters and in unconstrained parameters.

But I don’t see how just getting the Jacobian is going to
help here. We actually do two things: we transform
the unconstrained variable and add the log Jacobian.
The transform itself will have a derivative in it.

So if we have an unconstrained variable sigma_raw, the
unconstrained log density looks like:

f(sigma_raw, …) {
  sigma = exp(sigma_raw);
  if (include_jacobian)
    target += sigma_raw;
  … other transforms …

  model(sigma, …);
}

where the model function is what’s defined in the model
block in terms of constrained parameters. The include_jacobian
is a template parameter in the top-level function.
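(In the generated C++ this shows up roughly as the following member of the
model class — exact names and details vary a bit by Stan version:)

template <bool propto__, bool jacobian__, typename T__>
T__ log_prob(std::vector<T__>& params_r__,
             std::vector<int>& params_i__,
             std::ostream* pstream__ = 0) const;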

I think what you want is just what I’ve written as model(sigma,…).
Otherwise, you get the derivative of the exp() and other inverse
transforms back to constrained variables.

  • Bob

Yes, that’s a good point – in my example, I want the Hessian of the target with respect to the unconstrained variables without adding the log Jacobian, since I’m using the log probability for something other than sampling. But if you turned off that template parameter so that the log determinant of the Jacobian of the transform weren’t added, then I think the transform Jacobian * the unconstrained gradient would give me the constrained gradient, right?
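To spell out the chain rule behind that claim: writing \sigma = c(u) for the constraining transform and g(u) = h(c(u)) for the unconstrained target with the Jacobian adjustment turned off (and assuming matching dimensions and an invertible Jacobian),

\nabla_u g(u) = \left(\frac{\partial c}{\partial u}\right)^T \nabla_\sigma h(\sigma), \qquad \text{equivalently} \qquad \nabla_\sigma h(\sigma) = \left(\frac{\partial u}{\partial \sigma}\right)^T \nabla_u g(u),

so it’s the transposed Jacobian of the unconstraining transform that maps the unconstrained gradient back to the constrained one.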

Let me provide a better example and propose a more general solution. Stan autodiff has three really powerful pieces: gradients, Hessians, and Jacobians. Right now, the former two can be accessed from a Stan model by setting target to be equal to whatever you want to differentiate. (As Daniel points out, if necessary you can get unconstrained derivatives by unconstraining the inputs, modulo some gotchas like the one I mentioned above.) However, as-is, Jacobians can’t be calculated except through the C++ interface, and that limits what you can do.

Here’s an example model (from my research). It’s a simple mean field variational approximation to a multivariate normal model. There’s a variational distribution on the mean, mu, and one on the covariance, sigma. Both the ELBO and the posterior moments are defined as functions of the input parameters, which parameterize the variational distributions.

To calculate the linear response covariance of a posterior moment, you need to calculate

\frac{\partial\, \text{moment}}{\partial\, \text{parameters}^T} \left( \frac{\partial^2\, \text{ELBO}}{\partial\, \text{parameters}\, \partial\, \text{parameters}^T} \right)^{-1} \frac{\partial\, \text{moment}^T}{\partial\, \text{parameters}}

The middle term can be calculated with Stan from the model, but the Jacobian d moment / d parameters cannot. Furthermore, maybe the user doesn’t actually want all the moment parameters, and so doesn’t want to compute the whole (huge) Jacobian.

Here’s a solution that could work. Suppose there were an option to calculate the Jacobian of quantities defined in the generated quantities block. The columns of the Jacobian matrix could be referenced with param_names, and you could define something like generated_param_names to reference the rows, just like you do with unconstrained_param_names. Then the user could make a separate Stan program, copy the transformations they need into the generated quantities block, omit the rest of the model, and calculate the Jacobians they need.

In fact, if you exposed the functions that are used to convert an unconstrained parameter to a constrained parameter (maybe they already are exposed?) then this solution would also solve my original, more narrowly scoped feature request.

By the way, I know this is all a big ask, and I appreciate the consideration. I just think there would be a lot of benefit in having a high-level modeling language with full-featured, fast, compiled autodiff tools that also do all the constraining and indexing that Stan does. And it’s most of the way there as-is.

Below is a sketch of the variational model I described. WARNING: I wrote this up but didn’t check anything except that it compiles, so there are very likely a few typos.

functions {
    // The multivariate lgamma and digamma functions.
    real mv_digamma(real x, int n) {
        real ret;
        real i_real;
        ret = 0;
        for (i in 1:n) {
            i_real = i; 
            ret = ret + digamma(x + (1 - i_real) / 2);
        }
        return ret;
    }
    real mv_lgamma(real x, int n) {
        real ret;
        real i_real;
        ret = 0;
        for (i in 1:n) {
            i_real = i; 
            ret = ret + lgamma(x + (1 - i_real) / 2);
        }
        return ret;
    }
}
data {
    int <lower=0> K;
    int <lower=0> N;

    // Observations
    vector[K] x[N];
    
    // Prior parameters
    vector[K] mu_prior_mean;
    cov_matrix[K] mu_prior_info;

    cov_matrix[K] sigma_prior_v_inv;
    real <lower=0> sigma_prior_nu;
}
parameters {
    // Variational parameters

    // q(mu) = MVN(mean, info ^ {-1})
    vector[K] mu_q_mean;
    cov_matrix[K] mu_q_info;
    
    // q(sigma) = Wishart(v, nu)
    cov_matrix[K] sigma_q_v;
    real <lower=0> sigma_q_nu;
}
transformed parameters {

    /////////////////////////////
    // Variational mean parameters
    matrix[K, K] e_mu_mu_t;
    cov_matrix[K] e_sigma;
    real e_log_sigma;

    // Expected sufficient statistics for mu.
    // e_mu is mu_q_mean.
    e_mu_mu_t = mu_q_mean * mu_q_mean' + inverse_spd(mu_q_info);
    
    // Expected sufficient statistics for sigma.
    e_sigma = sigma_q_v * sigma_q_nu;
    e_log_sigma = mv_digamma(sigma_q_nu / 2, K) +
                  K * log(2) + log_determinant(sigma_q_v);
}
model {
    // The target will contain the ELBO.

    /////////////////////////////
    // The expected log priors
    // MVN prior for mu:
    target += -0.5 * (mu_q_mean' * mu_prior_info * mu_prior_mean +
                      mu_prior_mean' * mu_prior_info * mu_q_mean +
                      trace(mu_prior_info * e_mu_mu_t));

    // Wishart prior for sigma:
    target += 0.5 * (sigma_prior_nu - K - 1) * e_log_sigma
              - 0.5 * trace(sigma_prior_v_inv * e_sigma);

    /////////////////////////////
    // The entropy:
    // For mu:
    target += 0.5 * K + 0.5 * K * log(2 * 3.14159) - 0.5 * log_determinant(mu_q_info);
    
    // For sigma:
    target += 0.5 * (K + 1) * e_log_sigma +
              0.5 * K * (K + 1) * log(2) + mv_lgamma(sigma_q_nu / 2, K) +
              -0.5 * (sigma_q_nu - K - 1) * mv_digamma(sigma_q_nu / 2, K) +
              0.5 * sigma_q_nu * K;

    /////////////////////////////
    // The expected log likelihood
    for (n in 1:N) {
        target += -0.5 * (mu_q_mean' * e_sigma * x[n] +
                               x[n]' * e_sigma * mu_q_mean +
                                 trace(e_sigma * e_mu_mu_t) +
                                 trace(e_sigma * x[n] * x[n]')) +
                  -0.5 * e_log_sigma;    
    }
}

How about a very simple example:

parameters {
  real<lower=0> sigma;
}
model {
  sigma ~ gamma(1, 1);
}

The positive constraint on sigma means the unconstrained parameter
is

u_sigma = log(sigma)

so going the other way

sigma = exp(u_sigma)

This compiles down to the following two unconstrained functions:

  1. With Jacobian:

f(u_sigma) = log gamma(exp(u_sigma) | 1, 1) + u_sigma

  2. Without Jacobian:

g(u_sigma) = log gamma(exp(u_sigma) | 1, 1)

But I think what you want is just:

h(sigma) = log gamma(sigma | 1, 1)
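(Relating the three explicitly: g(u_sigma) = h(exp(u_sigma)) and f(u_sigma) = g(u_sigma) + u_sigma, so by the chain rule

g'(u_sigma) = h'(exp(u_sigma)) * exp(u_sigma),   i.e.   h'(sigma) = g'(log(sigma)) / sigma,

which is the one-dimensional version of the Jacobian bookkeeping above.)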

  • Bob

Yes, that’s right. I’d like to use Stan to differentiate arbitrary functions with respect to unconstrained parameters, which means I don’t want the log determinant of the Jacobian of the transform included in the function I’m differentiating.

Well, that’s easy. I thought you wanted the derivatives w.r.t.
the constrained parameters.

For the unconstrained parameter version, all you need to do is
set the Jacobian template parameter to false. You can see where
it gets included in the generated code and how the template
parameter controls it.
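For example, something like this from C++ (an untested sketch — my_model stands
in for whatever your generated model class is called):

#include <stan/model/log_prob_grad.hpp>

#include <iostream>
#include <vector>

// my_model: the class stanc generates from your .stan file (name is hypothetical here).
// Gradient of the unconstrained log density *without* the log-absolute-determinant
// adjustment: propto = false, jacobian_adjust_transform = false.
double grad_without_jacobian(const my_model& model,
                             std::vector<double>& params_r,  // unconstrained values
                             std::vector<int>& params_i,     // integer params (usually empty)
                             std::vector<double>& gradient) {
  return stan::model::log_prob_grad<false, false>(model, params_r, params_i,
                                                  gradient, &std::cout);
}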

  • Bob

P.S. Sorry about the formatting. This Discourse thing is mucking
with our emails. I wish we could just turn all these “helpful” features
off.

I’m a little confused. I agree that this is what I want:

“But I think what you want is just: h(sigma) = log gamma(sigma | 1, 1)”

That is, I want the derivative with respect to sigma (which is constrained, right?) without the log Jacobian of the transformation. Right now, if I understand correctly, if I set the template parameter to false, I will get the derivative of

h(u_sigma) = log gamma(exp(u_sigma) | 1, 1)

That is not what I want, since it’s with respect to u_sigma, not sigma. Or maybe I’m mistaken?

Maybe more importantly, though, I’m really requesting access to Stan’s autodiff Jacobians through the model interface, the same way you can now get gradients and Hessians (Hessians take a little hacking in RStan, but are available in principle). To motivate this, I’m suggesting two use cases:

  • Transforming gradients wrt the unconstrained variables into gradients wrt the constrained variables using the matrix d (unconstrained) / d (constrained) when the dimensions match up
  • Doing LRVB calculations using the unconstrained Hessian, which requires d (transformed variables) / d (unconstrained variables)

There are other use cases for Jacobians, but these are the ones I’ve come across. This request is really more in the spirit of making Stan an easy-to-use high-level full-featured autodiff tool than my particular use cases.

I am, too, since one message back you said this:

Yes, that’s right. I’d like to use Stan to differentiate
arbitrary functions with respect to unconstrained parameters,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The unconstrained parameter is u_sigma:

sigma = exp(u_sigma)

We can do derivatives w.r.t. u_sigma with or without the Jacobian of
the transform. But we cannot take derivatives w.r.t. sigma as Stan
is currently written.

It would be possible to expose the derivatives w.r.t. the constrained
parameters, but the code generator needs to be factored. And it leaves
the design issue of how you want to input the constrained parameters.
Right now it’s through the implementation of a virtual base class
that provides reader callbacks and we have implementations for files,
R, and Python.

  • Bob

Right, I am sorry – that was a mistake, and a pretty confusing one. I meant to say “constrained parameters”.

The design questions come back to the real request: the ability to calculate the Jacobian of generated quantities. Getting the derivatives with respect to the constrained parameters would be one potential application of such functionality.

So what happened here?

I wrote something like this:

   foo bar baz
           ^^^

but I didn’t use code-block backticks, so it decided to treat it like
a quote and remove all the aligned spaces. Same thing it does with code.

Is there a way I can somehow turn off all the places where it
tries to guess what I mean and change the characters I send? Or am I just
stuck with having to put everything I want to align in code blocks?

  • Bob

Andrew had the same request. It can be done (in Stan 2.13.1) with a user-declared function that is not defined in the .stan program, an external C++ file that provides the function definition and implements the autodiffed Jacobian, and a call to that function in the generated quantities block.

Do you have an example I could look at?

For RStan, there was a Discourse thread about the now released vignette. It is not implemented for PyStan yet, but that usually follows RStan. For CmdStan, it is a bit different but documented in the CmdStan guide.

In Andrew’s case, the Stan program was

functions {
  real log_kernel(real lambda, real b_linear, vector b_logit,
                  vector lambda_logit, vector lambda_kernel, vector b_kernel,
                  vector y, real mu) {
    return lambda * b_linear + dot_product(lambda_logit, b_logit) + 
                               dot_product(lambda_kernel, b_kernel) +
           lambda * cauchy_lpdf(y | mu, 1) + (1 - lambda) * normal_lpdf(mu | 0, 10);     
  }
  
  real[] log_kernel_gradient(real lambda, real b_linear, vector b_logit,
                             vector lambda_logit, vector lambda_kernel, 
                             vector b_kernel, vector y, real mu); // undefined now
  
}
data {
  int N;
  vector[N] y;
  real a_lower;
  real a_upper;
  real b_linear;
  int K_logit;
  vector[K_logit] mu_logit;
  vector[K_logit] sigma_logit;
  vector[K_logit] b_logit;
  int K_kernel;
  vector[K_kernel] mu_kernel;
  vector[K_kernel] sigma_kernel;
  vector[K_kernel] b_kernel;
}
parameters {
  real mu;
  real<lower=0,upper=2> a;
}
transformed parameters {
  real a_reflected;
  real a_scaled;
  real lambda;
  vector[K_logit] lambda_logit;
  vector[K_kernel] lambda_kernel;
  a_reflected = 1 - fabs(1 - a);
  a_scaled = (a_reflected - a_lower)/(a_upper - a_lower);
  if (a_scaled <= 0)
    lambda = 0;
  else if (a_scaled < 1)
    lambda = 0.5 + 0.25*(3*(2*a_scaled - 1) - (2*a_scaled - 1)^3);
  else
    lambda = 1;
  lambda_logit = 1 ./ (1 + exp(-(lambda - mu_logit) ./ sigma_logit));  
  lambda_kernel = exp(-.5*(lambda - mu_kernel) .* (lambda - mu_kernel) ./ (sigma_kernel .* sigma_kernel));
}
model {
  target += log_kernel(lambda, b_linear, b_logit,
                       lambda_logit, lambda_kernel, b_kernel,
                       y, mu);
}
generated quantities {
  real grad_lambda[1] =
    log_kernel_gradient(lambda, b_linear, b_logit,
                        lambda_logit, lambda_kernel, b_kernel,
                        y, mu);
}

and the log_kernel_gradient.hpp file that calculated the gradient was

template <typename T0__, typename T1__, typename T2__, typename T3__, typename T4__, typename T5__, typename T6__, typename T7__>
std::vector<typename boost::math::tools::promote_args<T0__, T1__, T2__, T3__, typename boost::math::tools::promote_args<T4__, T5__, T6__, T7__>::type>::type>
log_kernel_gradient(const T0__& lambda,
                    const T1__& b_linear,
                    const Eigen::Matrix<T2__, Eigen::Dynamic,1>& b_logit,
                    const Eigen::Matrix<T3__, Eigen::Dynamic,1>& lambda_logit,
                    const Eigen::Matrix<T4__, Eigen::Dynamic,1>& lambda_kernel,
                    const Eigen::Matrix<T5__, Eigen::Dynamic,1>& b_kernel,
                    const Eigen::Matrix<T6__, Eigen::Dynamic,1>& y,
                    const T7__& mu, std::ostream* pstream__) {

  auto lambda_v = to_var(lambda); // auto foo_v = to_var(foo) for anything unknown
  var lp = log_kernel(lambda_v, b_linear, b_logit, // using lambda_v, not lambda
                      lambda_logit, lambda_kernel, b_kernel,
                      y, mu, 0); // need extra 0 at the end
  std::vector<var> theta;
  theta.push_back(lambda_v); // call theta.push_back(foo_v) for anything unknown
  std::vector<double> g;
  lp.grad(theta, g);
  return g;
}

Calculating Jacobians is about the same and is discussed in section 2.4 of the Stan Math paper.
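For reference, here is a rough (untested) sketch of the Jacobian analogue using the stan::math::jacobian functional described there; toy_transform is just a hypothetical stand-in for whatever vector-to-vector function you want differentiated:

#include <stan/math.hpp>

#include <Eigen/Dense>

// Hypothetical functor mapping an unconstrained vector to the quantities of
// interest. It must be templated so it can be evaluated with stan::math::var.
struct toy_transform {
  template <typename T>
  Eigen::Matrix<T, Eigen::Dynamic, 1>
  operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& u) const {
    Eigen::Matrix<T, Eigen::Dynamic, 1> out(2);
    out(0) = u(0) * u(0);        // placeholder transforms; real code would apply
    out(1) = u(0) + 2.0 * u(1);  // the actual constraining / moment functions
    return out;
  }
};

int main() {
  Eigen::VectorXd u(2);
  u << 0.3, -1.2;
  Eigen::VectorXd fu;  // filled with toy_transform()(u)
  Eigen::MatrixXd J;   // filled with J(i, j) = d out_i / d u_j
  stan::math::jacobian(toy_transform(), u, fu, J);
  return 0;
}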

I see, that looks like it could work! One piece is missing, though. I presume Andrew’s example will give the derivative of log_kernel with respect to lambda. For both my examples, I would need access to the functions that constrain and unconstrain variables (e.g. writer__.cov_matrix_unconstrain for covariance matrices). Can I access those functions in the Stan language somehow?

Not yet. We want to roll those functions in soon but
they’re not there now.

  • Bob

Make the constrained parameters be arguments to your foo_jacobian function and then you don’t have to worry about the transformations to and from the unconstrained space.

There could certainly be times when that would be useful. But for either
of the other two use cases (converting the unconstrained gradient or Hessian
of the log probability to the constrained space, or getting the Jacobian
of the transform itself), I’d need access to the constraining functions or
at least the unconstrained variable values, right?