What is the point of the multi_normal_cholesky parametrization?

Basically title. When should I prefer multi_normal_cholesky over multi_normal? Is the lpdf calculation significantly more performant under some conditions? Or is there some advantage for autodiff? The documentation simply states it exists in the Stan math library without any elaboration.

Under all conditions. By starting from a pre-factored covariance matrix, the evaluation is quadratic in the dimension rather than cubic. It saves on both the determinant calculation (because the Cholesky factor is triangular, its log determinant is just the sum of the logs of its diagonal) and on the quadratic form, which otherwise involves solving with the full covariance matrix. The autodiff is similarly sped up because it follows the basic evaluation.
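To spell out the savings (a sketch of the algebra, writing $K$ for the dimension and $L$ for the Cholesky factor with $\Sigma = L L^\top$):

$$
\log \text{MultiNormal}(y \mid \mu, \Sigma) = -\tfrac{1}{2}\left( K \log(2\pi) + \log|\Sigma| + (y-\mu)^\top \Sigma^{-1} (y-\mu) \right),
$$

where $\log|\Sigma| = 2 \sum_{k=1}^{K} \log L_{kk}$ and $(y-\mu)^\top \Sigma^{-1} (y-\mu) = \lVert L^{-1}(y-\mu) \rVert^2$, with $L^{-1}(y-\mu)$ a single triangular solve in $O(K^2)$.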

There’s even more advantage within Stan because under the hood we parameterize a dense covariance matrix with a Cholesky factor (N choose 2 unconstrained elements below the diagonal, plus N diagonal elements that must be positive and are therefore log transformed). So it also saves a lot of work in just creating a well-formed covariance matrix.
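For concreteness, here’s a minimal sketch of the usual pattern (assuming `K` and a `vector[K] y` are declared in data; names like `L_corr` and `sigma` are just illustrative). The Cholesky factor of a correlation matrix is a parameter, gets an LKJ prior, is scaled, and is passed straight to multi_normal_cholesky without ever forming the full covariance matrix:

    parameters {
      vector[K] mu;
      cholesky_factor_corr[K] L_corr;   // Cholesky factor of a correlation matrix
      vector<lower=0>[K] sigma;         // marginal scales
    }
    model {
      // scale the correlation Cholesky factor to get the covariance Cholesky factor
      matrix[K, K] L_Sigma = diag_pre_multiply(sigma, L_corr);
      L_corr ~ lkj_corr_cholesky(2);
      sigma ~ normal(0, 1);
      y ~ multi_normal_cholesky(mu, L_Sigma);
    }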

We should probably hint that it’s both more numerically stable and more efficient. We go into that fairly early in the User’s Guide in the regression chapter.

OK, cool, thanks. Perhaps adding a link from the function documentation to that other part would be nice. It’s not obvious to look this up in the regression section; one may easily use the multivariate normal in contexts that wouldn’t be considered a regression model.

Definitely. Here’s the issue in our documentation repository:


I was thinking about this. If the inverse is the main computational bottleneck, why don’t we also have a parametrization that takes the precomputed inverse as a parameter? Of course, the determinant also needs to be computed for the density normalization, and that has complexity comparable to the inverse, so we would need to know both to take advantage of it.
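For reference, with the precision matrix $P = \Sigma^{-1}$ the log density is

$$
\log \text{MultiNormal}(y \mid \mu, P^{-1}) = -\tfrac{1}{2}\left( K \log(2\pi) - \log|P| + (y-\mu)^\top P (y-\mu) \right),
$$

so once $P$ and $\log|P|$ are known, the only per-evaluation work is the residual $y - \mu$ and the quadratic form, which is $O(K^2)$.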

Something like this:

    real multi_normal_invcov_lpdf(vector y, vector mu, matrix invcov, real invcov_det) {
        // maybe add some validity checks etc.
        int k = size(mu);
        vector[k] dist = y - mu;
        return -0.5 * (
              k * log(2 * pi())
            - log(invcov_det)        // log|Sigma| = -log|Sigma^-1|
            + dist' * invcov * dist
        );
    }

Of course, this is useless when we’re estimating the covariance or want a prior over it, because we can’t get the determinant without computing it, so both would have to be known in data or transformed data. But there is a use case where this happens for me: fitting a Gaussian process (as a latent variable) when the kernel parameters are known beforehand (i.e. not being estimated). This makes it possible to compute the inverse covariance (and thus also the determinant) already in transformed data, so my thinking is that this could be exploited for better performance instead of recomputing it at every step of the model. This should be a lot faster than the Cholesky decomposition trick.
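Roughly what I have in mind, as a sketch (using the hypothetical multi_normal_invcov from above; `x`, `alpha`, `rho`, and `K` are assumed to be declared in data, and in practice you’d add a small jitter to the diagonal of the kernel matrix):

    transformed data {
      matrix[K, K] Sigma = gp_exp_quad_cov(x, alpha, rho);  // kernel hyperparameters fixed as data
      matrix[K, K] invcov = inverse_spd(Sigma);             // computed once, before sampling
      real invcov_det = determinant(invcov);
    }
    parameters {
      vector[K] f;  // latent GP values
    }
    model {
      f ~ multi_normal_invcov(rep_vector(0, K), invcov, invcov_det);
    }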

Would that be an interesting case? Does it actually even make sense or am I making some mistake here?

btw, another somewhat interesting case could be when the covariance is diagonal. It could then be inverted in O(n). In principle this reduces to the univariate case, so it would be a bit redundant, but it might sometimes be more interpretable expressed as a diagonal multivariate normal, as in the one-liner below.
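For the diagonal case the reduction is literally just (a sketch, with `sigma` the vector of marginal standard deviations, i.e. the square roots of the covariance diagonal):

    // equivalent to y ~ multi_normal(mu, diag_matrix(square(sigma))), but O(K)
    target += normal_lpdf(y | mu, sigma);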

Actually, maybe that’s how multi_normal_prec already works? I’m not sure about the determinant, but maybe that’s not actually needed in the lupdf? If I have the precision matrix in transformed data, what is the complexity if I use multi_normal_prec in my model?

If you have the precision matrix as data, the best thing to do is to precompute the determinant. Even with multi_normal_prec, Stan will take the Cholesky factorization (well, the LDLT factorization); see math/stan/math/prim/prob/multi_normal_prec_lpdf.hpp at ac977e4ed79bd593440b9afbc25e1bba905e9276 · stan-dev/math · GitHub.

Whenever you have data, it’s good to tell the compiler the argument is data by adding data before the type, as in the function signature below. This tells Stan that it doesn’t need to autodiff through the data arguments.

// let's say you have an array of N vectors y, each of length K
// mu is size K
// log_det_precision is passed in as data
array[N] vector[K] y;
vector[K] mu;

//  you can easily modify this for an array or matrix of mu

real multi_normal_precision2_lpdf(array[] vector y, vector mu, data matrix precision_mat, data real log_det_precision) {
  int N = size(y);
  int K = size(mu);
  real log_prob = 0;

  // normalizing constant: 0.5 * N * log|P| - 0.5 * N * K * log(2 * pi)
  log_prob += 0.5 * log_det_precision * N;
  log_prob += -0.5 * N * K * (log2() + log(pi()));

  // -0.5 * sum_n (y[n] - mu)' * P * (y[n] - mu)
  for (n in 1:N) {
    log_prob += -0.5 * trace_quad_form(precision_mat, to_matrix(y[n] - mu));
  }

  return log_prob;
}
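Then you can compute the log determinant once and call it like this (a sketch, assuming `precision_mat` is declared in data):

    transformed data {
      real log_det_precision = log_determinant(precision_mat);
    }
    model {
      y ~ multi_normal_precision2(mu, precision_mat, log_det_precision);
    }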

Side question: @WardBrian, in the multi_normal_prec stan-math code we can pass a vector to trace_quad_form as the second argument. Is it possible to pass in a vector here (or can we update stanc to allow it?) instead of having to call to_matrix(y - mu)?

There is a function with the following signature: real quad_form(matrix A, vector B)

I think the signature you’re asking for already exists?

$ stanc --dump-stan-math-signatures | grep quad_form
quad_form(matrix, vector) => real
quad_form(matrix, matrix) => matrix
quad_form_diag(matrix, vector) => matrix
quad_form_diag(matrix, row_vector) => matrix
quad_form_sym(matrix, vector) => real
quad_form_sym(matrix, matrix) => matrix
trace_gen_quad_form(matrix, matrix, matrix) => real
trace_quad_form(matrix, vector) => real
trace_quad_form(matrix, matrix) => real
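So the loop in the function above can pass the vector directly, e.g. (a sketch):

    for (n in 1:N) {
      log_prob += -0.5 * trace_quad_form(precision_mat, y[n] - mu);
    }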

Cool! Need to update the Matrix Operations doc then.

Issue at Update docs for function signatures · Issue #860 · stan-dev/docs · GitHub