A sufficient implementation of the multivariate normal lpdf

I worked out a sufficient implementation of the multivariate normal. Compute the element-wise sums of \sum \mathbf{y} \mathbf{y}^{\text{T}} and \sum \mathbf{y}^\text{T} (xxts and xts in the function below), then pass to the function below (along with the number of observations, n) to get a dramatic speed boost (depending on dataset size and number of unique cells you can summarize to).

Wrote about the implementation here — there is very likely a better version of this function that could be written, but works as-is for now!

real multi_normal_sufficient_lpdf(
  matrix xxts,
  row_vector xts,
  real n,
  vector mu,
  matrix Sigma
) {
  int K = size(mu);
  matrix[K,K] invs = inverse(Sigma);
  real lp = 0.0;
  lp += (-n * K) / 2 * log(2 * pi());
  lp += -n/2 * log_determinant(Sigma);
  lp += -0.5 * trace(Sigma \ xxts);
  lp += trace(invs * mu * xts);
  lp += -n/2 * (mu' * invs * mu);
  return lp;
}

In the current version you have three calls (inverse, log_determinant, ) which all do the same O(n^3) matrix factorization. Following code does it only once

real multi_normal_sufficient_lpdf(
  matrix xxts,
  row_vector xts,
  real n,
  vector mu,
  matrix Sigma
) {
  int K = size(mu);

  matrix[K, K] L = cholesky_decompose(Sigma);
  vector[K] Linv_mu  = mdivide_left_tri_low(L, mu);
  vector[K] Linv_xts = mdivide_left_tri_low(L, xts');
  matrix[K, K] Linv_xxts = mdivide_left_tri_low(L, xxts);
  real log_det_Sigma = 2 * sum(log(diagonal(L)));
  real tr_Sinv_xxts = trace(mdivide_left_tri_low(L, Linv_xxts'));
  real lp = 0.0;
  lp += -0.5 * n * K * log(2 * pi());
  lp += -0.5 * n * log_det_Sigma;
  lp += -0.5 * tr_Sinv_xxts;
  lp += dot_product(Linv_xts, Linv_mu);
  lp += -0.5 * n * dot_self(Linv_mu);
  return lp;
}

so this should be 3 times faster.

@avehtari—when will this implementation be faster than what we actually do in Stan? For the Stan implementation, we do all the factoring, then iterate over the inputs, each of which spawns an mdivide_left_ldlt. Is there a size at which it’d be better to implement this way? If so, I’m guessing it’s something @stevebronder could knock out in no time.

Here’s what we do now (the autodiff’s all in prim now for functions that don’t involve autodiff):

Yeah, I did consider also mentioning multi_normal_lpdf, but in a hurry focused on pointing out that the original code did do matrix factorization three times. I didn’t have time to think how to use it for the sufficient statistic version, but that would be a useful addition for this thread

Ok, I did check and it really is the sufficient statistic part that is making multi_normal_lpdf not applicable or if it would be used inside it would add another unnecessary factorization

Otherwise multi_normal_lpdf is also doing just one factorization, but instead of Cholesky doing LDL-factorization (aka square-root-free Cholesky decomposition). I’m just used to use Cholesky, but LDL is often a better choice, so what multi_normal_lpdf is doing is just fine

This version also could be trivially changed to take L as an input instead of Sigma. And then this version could just calculate L and pass it to that version of the function.

And the blog post also has the math for doing a version of the function that takes the inverse of the covariance matrix as an input.