Efficiency of multivariate models with partially missing data

I am working with longitudinal MMRM-like Bayesian models for clinical trial data. For each patient, a continuous outcome variable is measured at a pre-determined number of time points n > 2, and those outcome measurements are treated as multivariate normal for each patient. Outcomes from different patients are assumed independent conditional on the parameters.

It is common for patients to drop out during the study, so there are a lot of missing outcomes. To integrate out the missing outcomes, I am using the technique described at 3.5 Missing multivariate data | Stan User’s Guide because it works well with ELPD methods like the one by Silva and Zanella (2022). In addition, it is easy to generalize to the n-dimensional multivariate case because the marginal distribution of a multivariate normal over a subset of its components is itself multivariate normal, with the corresponding subset of the mean vector \mu and the corresponding rows and columns of the covariance matrix \Sigma. Overall, this approach works well. I am very happy with it, and it is much simpler and more computationally efficient than 3.1 Missing data | Stan User’s Guide in my case.
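In symbols: if y \sim N(\mu, \Sigma) and S is any subset of the indices 1, \dots, n, then

y_S \sim N(\mu_S, \Sigma_{S,S})

where \mu_S keeps the entries of \mu indexed by S and \Sigma_{S,S} keeps the corresponding rows and columns of \Sigma, so no extra parameters or integrals are needed for the missing components.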

However, I feel like the way I am coding it is not as efficient as it could be. Each patient uses a different subset of \Sigma, and I code their likelihood in Stan as

y[subset] ~ multi_normal(mu[subset], Sigma[subset, subset])

where:

  • y is the patient-specific vector of observations.
  • mu is the patient-specific mean vector.
  • Sigma is the shared covariance matrix among different measurements within a patient.
  • subset is a patient-specific int array to select only the observed components of y.
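Put together, the model block looks something like the following minimal sketch, where P, n_obs, and obs_idx are hypothetical names for however the ragged observed-index structure is stored (obs_idx[p] padded past n_obs[p]), and y and mu are stored as arrays of length-n vectors:

model {
  for (p in 1:P) {
    int k = n_obs[p];
    // indices of patient p's observed components
    array[k] int s = obs_idx[p, 1:k];
    // marginal likelihood over the observed components only
    y[p, s] ~ multi_normal(mu[p, s], Sigma[s, s]);
  }
}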

I want to use multi_normal_cholesky(), but subset is different from patient to patient, and I would have to re-compute the Cholesky factor of Sigma[subset, subset] for each patient and each MCMC iteration.

Is there a computational shortcut to compute the Cholesky factor of Sigma[subset, subset] given an existing Cholesky factorization \Sigma = L L^T of the full covariance matrix? I tried to derive one at linear algebra - Fast shortcut to get the Cholesky factor of a submatrix - Mathematics Stack Exchange, but my answer in that post is wrong because the Cholesky-like factor I came up with is not lower triangular.

I don’t know how to update the Cholesky factor of a subset, but I know how to update the inverse of Sigma and the log-determinant of Sigma for a given subset. This becomes especially helpful when the number of missing entries tends to be much smaller than the dimension of Sigma.
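The two identities this relies on are the standard blockwise-inversion (Schur complement) facts. Writing \Omega = \Sigma^{-1} and partitioning both matrices into observed (o) and missing (m) blocks,

(\Sigma_{oo})^{-1} = \Omega_{oo} - \Omega_{om} \Omega_{mm}^{-1} \Omega_{mo}

\log\det \Sigma_{oo} = \log\det \Sigma + \log\det \Omega_{mm}

so the only new factorization needed per missing-data pattern involves the small \Omega_{mm} block.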

In the function below:

  • Sigmainv is the inverse of the full Sigma matrix.
  • obsidx holds the indices of the observed entries (what you call subset), followed by the indices of the missing entries.
  • Nobs is the number of observed entries.
  • np is the dimension of Sigma.
  • logdet is the log-determinant of the full Sigma.

Then the upper-left Nobs × Nobs block of out contains the inverse of Sigma[subset, subset], and the last diagonal entry contains the log-determinant of Sigma[subset, subset].

matrix sig_inv_update(matrix Sigmainv, array[] int obsidx, int Nobs, int np, real logdet) {
    matrix[Nobs + 1, Nobs + 1] out = rep_matrix(0, Nobs + 1, Nobs + 1);
    int nrm = np - Nobs;  // number of missing entries
    matrix[nrm, nrm] H;   // missing-by-missing block of Sigmainv
    matrix[nrm, Nobs] A;  // missing-by-observed block of Sigmainv

    if (nrm == 0) {
      // nothing is missing: pass through the full inverse and log-determinant
      out[1:Nobs, 1:Nobs] = Sigmainv;
      out[Nobs + 1, Nobs + 1] = logdet;
    } else {
      H = Sigmainv[obsidx[(Nobs + 1):np], obsidx[(Nobs + 1):np]];
      A = Sigmainv[obsidx[(Nobs + 1):np], obsidx[1:Nobs]];

      // blockwise inversion: inv(Sigma[obs, obs]) = Omega_oo - Omega_om * inv(Omega_mm) * Omega_mo
      out[1:Nobs, 1:Nobs] = Sigmainv[obsidx[1:Nobs], obsidx[1:Nobs]] - A' * mdivide_left_spd(H, A);
      // log det Sigma[obs, obs] = log det Sigma + log det Omega_mm
      out[Nobs + 1, Nobs + 1] = logdet + log_determinant(H);
    }

    return out;
}

To make this more efficient, I arrange the rows of y by missing-data pattern and call the function above once per pattern. I then use a custom sufficient-statistics computation of the multivariate normal lpdf (below).

In the function below:

  • xbar and S are the mean and covariance matrix of the data sharing this missing-data pattern (with S divided by N instead of N - 1; if there is only one observation in this pattern, S is a matrix of 0s).
  • Mu is the multivariate normal mean vector for this missing-data pattern.
  • Supdate is out from the function above.
  • N is the number of observations in this missing-data pattern.

  real multi_normal_suff(vector xbar, matrix S, vector Mu, matrix Supdate, int N) {
    int Nobs = dims(S)[1];
    real out;

    // Sufficient-statistic form of the multivariate normal lpdf:
    //   sum_i (x_i - Mu)' * inv(Sigma_oo) * (x_i - Mu)
    //     = N * trace(inv(Sigma_oo) * (S + (xbar - Mu) * (xbar - Mu)'))
    // computed with elementwise multiplication + sum for efficiency.
    out = -0.5 * N * ( sum(Supdate[1:Nobs, 1:Nobs] .* (S + (xbar - Mu) * (xbar - Mu)'))
                       + Supdate[Nobs + 1, Nobs + 1] + Nobs * log(2 * pi()) );

    return out;
  }
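Putting the two functions together, the model block then needs only one update per missing-data pattern. A sketch, with hypothetical names (G patterns; obsidx[g] lists the observed indices first, then the missing ones; xbar[g] and S[g] are that pattern's sufficient statistics stored in obsidx order; N[g] is the number of patients in pattern g; mu is the full-length model mean, shared within a pattern):

model {
  matrix[np, np] Sigmainv = inverse_spd(Sigma);
  real logdet = log_determinant(Sigma);
  for (g in 1:G) {
    int k = Nobs[g];
    matrix[k + 1, k + 1] Supdate = sig_inv_update(Sigmainv, obsidx[g], k, np, logdet);
    // model mean restricted to this pattern's observed components
    vector[k] Mu_g = mu[obsidx[g, 1:k]];
    target += multi_normal_suff(xbar[g, 1:k], S[g, 1:k, 1:k], Mu_g, Supdate, N[g]);
  }
}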

The Eigen matrix library defines operations that allow updating Cholesky factors. These are not yet available in Stan-math, but I filed an issue for it: Update Cholesky factors · Issue #2855 · stan-dev/math · GitHub. You could reimplement these in Stan directly, I think.
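For instance, a rank-1 Cholesky update (the usual building block for this kind of factor updating) is short enough to write as a user-defined function. A sketch of the standard textbook algorithm, not taken from the issue:

functions {
  // Returns a lower-triangular L such that L * L' == L0 * L0' + x * x',
  // via the classic sequence of Givens-style rotations.
  matrix chol_rank1_update(matrix L0, vector x) {
    int n = rows(L0);
    matrix[n, n] L = L0;
    vector[n] w = x;
    for (k in 1:n) {
      real r = sqrt(square(L[k, k]) + square(w[k]));
      real c = r / L[k, k];
      real s = w[k] / L[k, k];
      L[k, k] = r;
      for (i in (k + 1):n) {
        L[i, k] = (L[i, k] + s * w[i]) / c;
        w[i] = c * w[i] - s * L[i, k];
      }
    }
    return L;
  }
}

Deleting row/column j of \Sigma, for example, leaves the leading (j - 1) block of L unchanged and reduces to exactly such a rank-1 update of the trailing block, with the part of column j below the diagonal playing the role of x.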

In brms, Cholesky factors are computed for every subset pattern “on demand” (have a look at the unstr facility of brms), but they are re-used as much as possible. One key thing to note is that only the Cholesky factor of the correlation matrix is needed, while the residual error terms (sigmas) can simply be used to scale that Cholesky factor. The approach from @edm looks very promising too.
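That scaling is the usual diag_pre_multiply idiom, and it carries over to subsets. A sketch (L_Omega_sub, sigma, and subset are hypothetical names):

// If L_Omega_sub is the Cholesky factor of the correlation submatrix
// Omega[subset, subset], then the Cholesky factor of the covariance
// submatrix Sigma[subset, subset] is just the row-scaled version:
matrix[k, k] L_Sigma_sub = diag_pre_multiply(sigma[subset], L_Omega_sub);

so only the correlation part ever needs refactorizing per pattern.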


Thank you both for the ideas, and thanks @wds15 for submitting that Stan-math issue!