Multivariate normal Cholesky optimization?

This is about the code in multi_normal_cholesky_lpdf. Is there a good reason to compute the inverse of L on its own rather than as part of a solve? Specifically, this line in stan-math:

  const matrix_partials_t inv_L_dbl
      = mdivide_left_tri<Eigen::Lower>(value_of(L_ref));

I ask because it seems it would be more performant to use
\begin{aligned} (Y - \mu)^T \Sigma^{-1} (Y - \mu) &= (Y - \mu)^T L^{-T}L^{-1} (Y - \mu) \\ &= (L^{-1}(Y - \mu))^T (L^{-1}(Y - \mu)) \end{aligned}
that is, set X := Y - \mu and use Eigen's triangular view to solve for Z = L^{-1}X. Something like

Z = L.triangularView<Eigen::Lower>().solve(X)

and then do rows_dot_self(Z).
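
For concreteness, here's a rough double-only sketch of that idea (the function name and signature are made up, and it ignores the autodiff plumbing entirely); with the observations stored as columns, each column's squared norm is one quadratic form:

  #include <Eigen/Dense>

  // Sketch: sum of (y_i - mu)^T Sigma^{-1} (y_i - mu) given the Cholesky factor L
  // of Sigma, using a triangular solve instead of forming L^{-1} explicitly.
  // Column i of Y holds observation y_i.
  double sum_quad_forms(const Eigen::MatrixXd& Y, const Eigen::VectorXd& mu,
                        const Eigen::MatrixXd& L) {
    Eigen::MatrixXd X = Y.colwise() - mu;                           // X = Y - mu, broadcast over columns
    Eigen::MatrixXd Z = L.triangularView<Eigen::Lower>().solve(X);  // Z = L^{-1} X
    return Z.colwise().squaredNorm().sum();                         // sum of per-observation quadratic forms
  }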

There’s some precedent for this approach in PyTorch; see the source:

def log_prob(self, value):
    if self._validate_args:
        self._validate_sample(value)
    diff = value - self.loc
    M = _batch_mahalanobis(self._unbroadcasted_scale_tril, diff)
    half_log_det = self._unbroadcasted_scale_tril.diagonal(dim1=-2, dim2=-1).log().sum(-1)
    return -0.5 * (self._event_shape[0] * math.log(2 * math.pi) + M) - half_log_det

Good catch. I’d definitely prefer triangular solves here just for the better numerics.

It looks like the vectorization is handled with loops here, so I don’t know if we can go all the way to rows_dot_self (we still need to increment the adjoints).

It should be possible to get benchmarks on this using the scripts in the benchmarks/ directory. Something like:

benchmarks/benchmark.py multi_normal_cholesky_lpdf

Yeah, looking at the gradient calculations gives a reason for the way it's written. The code defines \operatorname{\mathtt{half}} and \operatorname{\mathtt{scaled-diff}}, where what I call Z^T is equal to \operatorname{\mathtt{half}}. Then

\begin{align} \operatorname{\mathtt{scaled-diff}} &= L^{-T}Z \\ \operatorname{\mathtt{scaled-diff}} \cdot \operatorname{\mathtt{half}} &= L^{-T} ZZ^{T}\end{align}

It would be interesting to test whether doing two solves prior to the loop, solving for Z and then for W = L^{-T}Z, is faster.

  const matrix_partials_t inv_L_dbl
      = mdivide_left_tri<Eigen::Lower>(value_of(L_ref));

  if (include_summand<propto, T_y, T_loc, T_covar_elem>::value) {
    for (size_t i = 0; i < size_vec; i++) {
      const auto& y_col = as_column_vector_or_scalar(y_vec[i]);
      const auto& mu_col = as_column_vector_or_scalar(mu_vec[i]);

      // half = Z^T = (L^{-1} (y - mu))^T, computed via the precomputed inverse
      const row_vector_partials_t half
          = (inv_L_dbl.template triangularView<Eigen::Lower>()
             * (value_of(y_col) - value_of(mu_col))
                   .template cast<T_partials_return>())
                .transpose();
      // scaled_diff = L^{-T} Z
      const vector_partials_t scaled_diff
          = (half * inv_L_dbl.template triangularView<Eigen::Lower>())
                .transpose();

      logp -= 0.5 * dot_self(half);

      if (!is_constant_all<T_y>::value) {
        ops_partials.edge1_.partials_vec_[i] -= scaled_diff;
      }
      if (!is_constant_all<T_loc>::value) {
        ops_partials.edge2_.partials_vec_[i] += scaled_diff;
      }
      if (!is_constant_all<T_covar>::value) {
        ops_partials.edge3_.partials_ += scaled_diff * half;
      }
    }
  }
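
As a middle ground, the same quantities could be computed inside the loop with two triangular solves per observation instead of the precomputed inverse. A rough double-only sketch (the names L_dbl and diff are mine, adjoint bookkeeping elided):

  // Sketch: L_dbl stands for value_of(L_ref) and diff for the centered
  // observation value_of(y_col) - value_of(mu_col), both as plain doubles.
  Eigen::VectorXd z = L_dbl.triangularView<Eigen::Lower>().solve(diff);           // z = L^{-1}(y - mu) = half^T
  Eigen::VectorXd w = L_dbl.transpose().triangularView<Eigen::Upper>().solve(z);  // w = L^{-T} z = scaled_diff

  logp -= 0.5 * z.squaredNorm();  // same value as 0.5 * dot_self(half)
  // the y and mu adjoints use w, and the L adjoint accumulates w * z.transpose()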

Ooooh yeah yeah, that could be big-time more efficient if you had multiple right-hand sides. So first copy everything into a matrix, then do the solves, and then pass the results on to whatever needs them.
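
Something along those lines might look like this (a sketch only, double precision, all names made up; the per-column adjoint increments would still need a loop, but both solves and the L contribution happen in one shot):

  #include <Eigen/Dense>

  // Sketch: batch all centered observations as columns of D, then two triangular
  // solves cover every right-hand side at once.
  double batched_terms(const Eigen::MatrixXd& Y, const Eigen::VectorXd& mu,
                       const Eigen::MatrixXd& L,
                       Eigen::MatrixXd& W_out,   // column i is scaled_diff for observation i
                       Eigen::MatrixXd& dL_out)  // accumulated contribution to the L partials
  {
    Eigen::MatrixXd D = Y.colwise() - mu;                            // all the diffs at once
    Eigen::MatrixXd Z = L.triangularView<Eigen::Lower>().solve(D);   // Z = L^{-1} D
    W_out = L.transpose().triangularView<Eigen::Upper>().solve(Z);   // W = L^{-T} Z
    dL_out = W_out * Z.transpose();                                  // sum_i scaled_diff_i * half_i
    return -0.5 * Z.squaredNorm();                                   // contribution to logp
  }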