Reduce_sum with multi_normals in for-loops

Hi everyone,

I’ve got a big model with about half a million observations. I’m modeling environmental variables associated with species occurrence records with multivariate normals, and I’m estimating separate MVNs for each species and decade hierarchically. Because multi_normal_cholesky requires a matrix for the Cholesky factor of the covariance matrix, I believe I have to call the likelihood inside for-loops. Even though all of the records for each species and decade can be vectorised within the for-loops, I’m still stuck with n_species * n_decade likelihood calls, leading to the following model template:

data {
  // ..
  int<lower=1> n_obs, n_species, n_decade, n_dim;
  array[n_obs] vector[n_dim] y;
  // ..
}

parameters {
  // ..
  array[n_species, n_decade] vector[n_dim] mu;
  array[n_species, n_decade] vector<lower=0>[n_dim] sigma;
  array[n_species, n_decade] cholesky_factor_corr[n_dim] Omega_L;
  // ..
}

model {
  // ..
  for (i in 1:n_species) {
    for (j in 1:n_decade) {
      // y[i, j] stands in for the subset of observations for species i in decade j
      y[i, j] ~ multi_normal_cholesky(mu[i, j], diag_pre_multiply(sigma[i, j], Omega_L[i, j]));
    }
  }
  // ..
}

At the moment, I’ve written a partial_sum function to parallelise the likelihood, but this happens for each species/decade combination separately. The model has become a bit more complex: first and n_obs now give the index of the first observation and the number of records for each species/decade, respectively, and there are also observation-level random effects, leading to the new array of mean vectors mu_f.

functions {
  real partial_sum_lpdf(array[] vector y_slice, int start, int end, array[] vector mu, matrix Sigma_L) {
    return multi_normal_cholesky_lupdf(y_slice | mu[start:end], Sigma_L);       
  }
}

model {
  // ..
  for (i in 1:n_species) {
    for (j in 1:n_decade) {
       target += reduce_sum(partial_sum_lupdf, 
                            segment(y, first[i, j], n_obs[i, j]), 
                            grainsize[i, j], 
                            mu_f[i, j, 1:n_obs[i, j]], 
                            diag_pre_multiply(sigma[i, j], Omega_L[i, j]));
    }
  }
  // ..
}

My question is: is there some obvious way to parallelise the model further, or more efficiently? TL;DR, my main concern is that the likelihood requires a single matrix for the Cholesky factor of the covariance, and I can’t think of an alternative to what I’m currently using, i.e. an array of matrices.

Thanks!

Matt

You really need to let reduce_sum handle the big loop! Why not write a partial sum function which takes all of these inputs and then does the sums over species and decades inside it?

If it is easier, you could even nest reduce_sum calls, which makes sense if there is a lot to sum up per species and decade.

Recall that the partial sum function itself can also have some for loops or whatever you need to write down the partial sum.

Given the large data set you refer to, it would make a lot of sense to have one big reduce_sum. I would also be very surprised if you need a species- and decade-specific grain size here! One grain size should do. Don’t spend too much time finding the perfect grain size; instead, cram everything into a single partial sum function (which would also handle the sigma * Cholesky factor multiplication, for example).

Hey, thanks a lot for your input. Do you have any examples of a more complicated partial sum function? I’ve been staring at the manual for a while but can’t quite grasp how to include more stuff in it with the slicing etc.!

Forget about the slicing. Have a look at the Stan models generated by brms when using threading. brms just passes in a dummy argument, which is a sequence of integers running from 1 to the number of elements to sum over. You then slice out the data you actually need from the shared arguments. No need to worry about efficiency here!
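
Roughly, the pattern looks like this (a hand-written sketch of that pattern, not actual brms output, and all names are made up):

functions {
  // the sliced argument is just a dummy sequence of integers; the data and
  // parameters come in as shared arguments and are indexed with start/end
  real partial_log_lik_lpmf(array[] int seq, int start, int end,
                            data array[] vector y, array[] vector mu, matrix Sigma_L) {
    real ptarget = 0;
    for (n in start:end) {
      ptarget += multi_normal_cholesky_lupdf(y[n] | mu[n], Sigma_L);
    }
    return ptarget;
  }
}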


Hi there, OK, I see. I had a look at some brms Stan code, but I don’t immediately understand how to apply it to my case, where the Cholesky factors of the covariance matrices sit in n_species * n_decade arrays.

Edit: just to reiterate, the main concern I see now is that multi_normal_cholesky takes a single Cholesky factor per call, but I don’t know how to guarantee that the slicing is done only on the observations within each species/decade. That’s why I originally nested reduce_sum inside the for-loops in the model.

One approach: instead of passing y as the sliced variable for the reduce_sum function, pass what you are calling index i, i.e. an integer array 1:n_species, as the sliced variable.

Then pass y like the remaining variables (i.e., it isn’t sliced for you) and use start, end, or whatever you need inside your partial sum function to slice it yourself.
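
The dummy sequence itself can be built once in transformed data, e.g. with linspaced_int_array (or a simple for loop); a sketch:

transformed data {
  // one entry per species; reduce_sum slices over this
  array[n_species] int seq_spp = linspaced_int_array(n_species, 1, n_species);
}

reduce_sum then gets seq_spp as its sliced argument, and y, the means, and the Cholesky factors come in as shared arguments after the grainsize.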


Thank you so much. I was just about to update this with an implementation I’ve tried. This stuff is hard to get!

Because I have around 200 species, I’m slicing over them. So, like you said, below I (1) give a sequence 1:n_spp, (2) work out the start and end indices into the array of observed vectors (the data, y) and mean vectors (mu), and then (3) update the (partial) target using the relevant subset of the data y together with the associated mu and Cholesky factors Sigma_L. Does this look right? Oh yeah, I’m using “hypervolume” (hyp) instead of “decade” here.

real partial_sum_lpmf(array[] int seq_spp, int start, int end, data array[] vector y, array[] vector mu, array[,] matrix Sigma_L, array[] int n_hyp, array[,] int n_obs) {
    // initialise partial log prob
    real ptarget = 0;
    // for each species in this slice (seq_spp holds the global species indices start:end)
    for (i in 1:size(seq_spp)) {
      int spp = seq_spp[i];
      // y and mu are shared (not sliced), so index them globally:
      // count the observations belonging to all species before this one
      int pos = 0;
      for (s in 1:(spp - 1)) {
        pos += sum(n_obs[s, 1:n_hyp[s]]);
      }
      // for each hypervolume
      for (j in 1:n_hyp[spp]) {
        // start and end of the data/mean arrays of vectors
        int y_start = pos + 1;
        int y_end = pos + n_obs[spp, j];
        // partial likelihood
        ptarget += multi_normal_cholesky_lupdf(y[y_start:y_end] | mu[y_start:y_end], Sigma_L[spp, j]);
        pos = y_end;
      } // j
    } // i
    return ptarget;
  }
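
For reference, the matching call in the model block looks something like this (a sketch: Sigma_L assembled once with diag_pre_multiply, e.g. in transformed parameters, and seq_spp and grainsize defined in (transformed) data):

model {
  // ..
  target += reduce_sum(partial_sum_lupmf, seq_spp, grainsize,
                       y, mu_f, Sigma_L, n_hyp, n_obs);
  // ..
}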

By the way, if I call my function partial_sum_lpdf which includes ptarget += multi_normal_cholesky_lupdf() and then in the model { } block call reduce_sum(partial_sum_lupdf, ... ), I get the error:

Probability density functions require real variates (first argument). Instead found type array int.

I can get around this by calling my function partial_sum_lpmf and subsequently calling reduce_sum(partial_sum_lupmf, ...), but this feels hacky and I’m not sure if it’s correct. Is this the desired behaviour of Stan, or am I missing something important?

Function names ending in _lpdf or _lpmf mean special things in Stan.

Partial sum functions are just ordinary functions with a specific signature; to be used inside reduce_sum they must return a real (e.g., the partial log likelihood that gets added to target).


Hey, OK, now I’m confused. In this recent post someone is also using _lupmf for a partial sum function, and the frequently referenced minimal example of reduce_sum uses it too, albeit for a simple expression that doesn’t do anything else inside the partial sum function. I also ran a small version of the model overnight, and everything seems to have sampled as usual, in less time…

The intended difference between _lpdf (“log probability density function”) and _lpmf (“log probability mass function”) is that, when used with tilde notation

variable ~ custom_distribution(parameter);

the left-hand side is integer-like for custom_distribution_lpmf and real/vector/continuous for custom_distribution_lpdf. The compiler enforces this by checking the type of the function’s first argument, which for a partial sum function is the sliced argument.
This convention interacts poorly with reduce_sum(), so sometimes you need to write a function you would never use with a tilde and give it the “wrong” suffix.
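
A minimal illustration of the renaming trick (names made up):

// fine: the _lpmf suffix matches the integer-valued sliced (first) argument,
// even though the body only evaluates a continuous density
real partial_sum_lpmf(array[] int seq, int start, int end, data vector y) {
  return normal_lupdf(y[start:end] | 0, 1);
}

// rejected: an _lpdf suffix requires a real-valued first argument
// real partial_sum_lpdf(array[] int seq, int start, int end, data vector y) { ... }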


OK, so using _lupmf in the name of a partial sum function to access a _lupdf function inside that function is fine?

Yes.
