Multivariate normal with varying design matrix

On an observation date t in 1,...,T, I have \theta_t, a latent J-dimensional vector drawn from a multivariate normal:

\theta_t \sim N(\mu, \Sigma)

On date t, I make N_t observations of \theta_t, say y_t := (y_{t,1},\ldots,y_{t,N_t}), each with i.i.d. noise of variance \tau^2:

y_{t,i} \sim \theta_{t, j(t, i)} + N(0, \tau^2)

for a design function j(t,i) that dictates which component of \theta_t is observed.

See model.stan for an implementation of this. The reason I’m posting is that the program becomes slow as T and J grow. I’m interested in T \approx 1000, J \approx 20, N_t \approx 40, which takes on the order of an hour for 500 samples. Not terrible, but I’d like it to be faster, since I’ll be adding regression and autoregression components (turning it into a DLM a la gaussian_dlm_obs, but with a time-varying design) and I need to run it thousands of times for various experiments.
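Roughly, the centered model looks like this (a simplified sketch, not the attached file verbatim; the index arrays tt and jj are my names for the flattened data, and tau is the noise variance):

data {
  int<lower=1> T;                      // number of dates
  int<lower=1> J;                      // dimension of theta_t
  int<lower=1> N;                      // total observations, N = sum_t N_t
  int<lower=1, upper=T> tt[N];         // date of each observation
  int<lower=1, upper=J> jj[N];         // component observed, j(t, i)
  vector[N] y;
}
parameters {
  vector[J] mu;
  cov_matrix[J] Sigma;
  real<lower=0> tau;                   // observation noise variance
  vector[J] theta[T];                  // one latent vector per date
}
model {
  // priors on mu, Sigma, tau omitted
  theta ~ multi_normal(mu, Sigma);     // vectorized over dates
  for (i in 1:N)
    y[i] ~ normal(theta[tt[i], jj[i]], sqrt(tau));
}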

A conversation with @rtrangucci led me to realize that I am unnecessarily sampling tens of thousands of theta parameters, so I wanted to try speeding things up by integrating them out.

Mathematically it’s straightforward. I write the design function j(t, \cdot) as an N_t \times J matrix F_t (one indicator row per observation) so that we have

y_t \sim F_t\theta_t + N(0, \tau^2 I_{N_t})

Then I can get rid of the thetas: y_t = F_t\theta_t + \text{noise} is a linear transformation of a Gaussian plus independent Gaussian noise, so it is itself Gaussian,

y_t \sim N(F_t\mu, F_t \Sigma F_t' + \tau^2 I_{N_t})

Implementing it in Stan has been less straightforward. I’ve tried two approaches:

For loop approach

Because y_t has varying dimension N_t, I write this in Stan by stacking the y_t into one long vector y of length N = \sum_t N_t and storing the design matrices side by side as one J \times N matrix F (so the t-th block of F is F_t', which is why the code below transposes it), and then doing:

...
model {
  int pos = 1;
  for (t in 1:T) {
    segment(y, pos, n[t]) ~ multi_normal(
        block(F, 1, pos, J, n[t])' * mu,              // F_t * mu
        quad_form(Sigma, block(F, 1, pos, J, n[t]))   // F_t * Sigma * F_t'
          + diag_matrix(rep_vector(tau, n[t])));      // tau is the noise variance
    pos += n[t];
  }
}

The full code is attached as model_no_theta_irreg.stan. For my dimensions (see above), this turns out to be basically no faster than estimating all the thetas!

Big covariance matrix approach

I also tried modeling the big vector y with one huge covariance matrix. I padded each F_t with zeros (offset to its block's position in the stacked vector) to get an N \times J matrix \tilde F_t, so that I could write

y \sim N(F\mu, \Sigma_{\text{full}})

with a big N\times N block diagonal covariance given by

\Sigma_{\text{full}} = \tau^2 I_N + \sum_t \tilde F_t \Sigma {\tilde F_t}'

This code is attached as model_no_theta_irreg2.stan (the augmented F_t are called F2 in that code). The big covariance matrix approach was too slow to even run: a single gradient evaluation took about 3 seconds.
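For concreteness, a rough sketch of the construction (again simplified, not the attached file verbatim; here F2 holds the F_t stacked vertically as an N \times J matrix, and the blocks are filled with plain loops):

data {
  int<lower=1> T;
  int<lower=1> J;
  int<lower=1> N;                        // total observations
  int<lower=1> n[T];                     // observations per date
  matrix[N, J] F2;                       // the F_t stacked vertically
  vector[N] y;
}
parameters {
  vector[J] mu;
  cov_matrix[J] Sigma;
  real<lower=0> tau;                     // observation noise variance
}
model {
  matrix[N, N] Sigma_full = rep_matrix(0, N, N);
  int pos = 1;
  for (t in 1:T) {
    // F_t Sigma F_t' fills the t-th diagonal block
    matrix[n[t], n[t]] block_t = quad_form(Sigma, block(F2, pos, 1, n[t], J)');
    for (r in 1:n[t])
      for (c in 1:n[t])
        Sigma_full[pos + r - 1, pos + c - 1] = block_t[r, c];
    pos += n[t];
  }
  for (i in 1:N)
    Sigma_full[i, i] += tau;
  // priors on mu, Sigma, tau omitted
  y ~ multi_normal(F2 * mu, Sigma_full);
}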

Questions

Any ideas for speeding up either of these approaches? I hope I’m missing something easy. I was thinking of the following:

  • I know the for loop is bad, but I’m not sure how to avoid it. Should I be worried about the other bits in there like segment, block and diag_matrix? They are potentially avoidable using more clever data structures.

  • In practice, even though T \approx 1000 in my current example, there are actually only about 50 distinct design matrices F_t, so I could shrink the loop to 50 iterations, one per unique F_t (see the sketch after this list). It occurs to me as I write this that this is likely my best option. I’d still appreciate feedback on the efficiency of block, segment, and to_matrix inside a sampling statement, which I’d probably still end up using here.

  • Switch to sparse (CRS) matrices for \tilde F_t in the big covariance matrix approach?

  • … try R-INLA?
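To make the unique-design idea concrete, here is one possible (untested) sketch. It assumes every date has the same number of observations n_obs, and the data structures design_of (which unique design each date uses) and jj_u (the components selected by each unique design) are hypothetical names. The point is that the observation covariance is factored once per unique design rather than once per date, and multi_normal_cholesky reuses the factor:

data {
  int<lower=1> T;
  int<lower=1> J;
  int<lower=1> n_obs;                    // assumes a common N_t across dates
  int<lower=1> U;                        // number of unique designs
  int<lower=1, upper=U> design_of[T];    // which unique design date t uses
  int<lower=1, upper=J> jj_u[U, n_obs];  // component selected by each row of F_u
  vector[n_obs] y[T];
}
parameters {
  vector[J] mu;
  cov_matrix[J] Sigma;
  real<lower=0> tau;                     // observation noise variance
}
model {
  matrix[n_obs, n_obs] L[U];
  // factor each unique F_u Sigma F_u' + tau * I once
  for (u in 1:U) {
    matrix[n_obs, n_obs] Sigma_u = Sigma[jj_u[u], jj_u[u]];
    for (k in 1:n_obs)
      Sigma_u[k, k] += tau;
    L[u] = cholesky_decompose(Sigma_u);
  }
  // priors on mu, Sigma, tau omitted
  for (t in 1:T)
    y[t] ~ multi_normal_cholesky(mu[jj_u[design_of[t]]], L[design_of[t]]);
}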

model.stan (997 Bytes)
model_no_theta_irreg.stan (1.0 KB)
model_no_theta_irreg2.stan (983 Bytes)

A loop is not bad. Even if you somehow were to eliminate the appearance of a loop, there would still be the same loop in the C++ code. Adding a diag_matrix to a matrix is bad. Just make a temporary matrix and loop over its diagonal elements.

Thanks @bgoodri. I eliminated the diag_matrix(rep_vector(...)) as you suggested, which improved performance by about 10%. I got an additional 30% boost by using multi-indexing to construct the covariance matrices “by hand” instead of computing F_t\Sigma F'_t explicitly. This is possible in my case because F_t has only one non-zero entry per row. In case anyone else finds this useful in the future, the relevant part of the model block looks like this; the full code is attached. I should also add that while the naive model that estimates theta runs its iterations faster, its effective sample size is much worse.

...
  pos = 1;
  for (t in 1:T) {
    int jj_t[n[t]];
    matrix[n[t], n[t]] Sigma_t;
    vector[n[t]] mu_t;

    jj_t = segment(jj, pos, n[t]);      // components of theta observed on date t
    mu_t = mu[jj_t];                    // F_t * mu via indexing
    Sigma_t = Sigma[jj_t, jj_t];        // F_t * Sigma * F_t' via indexing

    for (j in 1:n[t])
      Sigma_t[j, j] += tau;             // add the observation noise variance

    segment(logy, pos, n[t]) ~ multi_normal(mu_t, Sigma_t);
    pos += n[t];
  }

model_no_theta_irreg_manual.stan (1.2 KB)