Vectorising user-defined multivariate lpdf/lpmf functions

Hi! I’m a fairly new Stan user and this is my first time posting to the forum, so I hope I’m formulating everything correctly; apologies if not. (Also, thank you to everyone who asks and answers questions on this group; it’s an amazing resource!)

I’m using @bgoodri’s excellent parameterisation of the multivariate probit model (original available here).

To make the program integrate more easily with library(loo) and the new reduce_sum parallelisation, I tried to redefine the bulk of the model inside a single _lpmf function, but I can’t work out how to vectorise the function to improve computational performance.

The function as it stands is:

functions {
   real multiprobit_lpmf(
   int[] y,  // response data - integer array of 1s and 0s
   vector x, // covariate data
   matrix beta, // coefficients for covariates
   real[] u, // nuisance that absorbs inequality constraints
   int D,  // response dimension size
   matrix L_Omega // lower cholesky of correlation matrix
   ){
      vector[D] mu; // means
      vector[D] z; // latent continuous state
      vector[D] logLik; // log likelihood
      real prev;
      mu = beta * x;
      prev = 0;
     for (d in 1:D) { // Phi and inv_Phi may overflow and / or be numerically inaccurate
        real bound; // threshold at which utility = 0
        bound = Phi( -(mu[d] + prev) / L_Omega[d,d]  );
        if (y[d] == 1) {
          real t;
          t = bound + (1 - bound) * u[d];
          z[d] = inv_Phi(t);       // implies utility is positive
          logLik[d] = log1m(bound);  // Jacobian adjustment
        }
        else {
          real t;
          t = bound * u[d];
          z[d] = inv_Phi(t);     // implies utility is negative
          logLik[d]= log(bound);  // Jacobian adjustment
        }
        if (d < D) prev = L_Omega[d+1,1:d] * head(z, d);
        // Jacobian adjustments imply z is truncated standard normal
        // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
      }
      return sum(logLik);
   }
}

And the data being used:

data {
  int<lower=1> K;  // number of covariates + intercept
  int<lower=1> D;  // response dimension size
  int<lower=0> N; // sample size 
  int<lower=0,upper=1> y[N,D]; // response data - integer array of 1s and 0s
  vector[K] x[N]; // covariate data 
}

At the moment the model is retrieving the simulation parameters well, but it would be nice to be able to remove the for loop, particularly when wrapping the function into partial_sum functions. The model block is currently:

model {
  L_Omega ~ lkj_corr_cholesky(1);
  to_vector(beta) ~ normal(0, 10);
  // implicit: u is iid standard uniform a priori
  { // likelihood
  for(n in 1:N){
        target += multiprobit_lpmf(y[n] | x[n], beta, u[n], D, L_Omega);
  }
  }
}
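
For context on the loo side: having everything in a single _lpmf makes the pointwise log-likelihood easy to expose in generated quantities. A minimal sketch, reusing the parameter names from the model block above:

generated quantities {
  vector[N] log_lik;  // pointwise log-likelihood for library(loo)
  for (n in 1:N)
    log_lik[n] = multiprobit_lpmf(y[n] | x[n], beta, u[n], D, L_Omega);
}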

Any advice on how to vectorise the function (or if this is possible/desirable) is greatly appreciated- thanks!


I am not optimistic that you can get much speedup from parallelization in this case because you have to solve for the first element of z, before you can solve for the second, third, … D-th.


Thanks for the quick reply. I am probably misunderstanding (so apologies), but the idea was to split over the N rows (rather than the D columns) to parallelise. Wouldn’t the \boldsymbol{Z_{n}}’s be conditionally independent?

They are conditionally independent in the statistical sense (before you condition on the data), but you still have to know the value of z_1 before you can solve for z_2, etc.
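
To make that concrete in the notation of the function above (this is just the for loop written out): for one observation, the d-th threshold is

\text{bound}_d = \Phi\!\left( -\,\frac{\mu_d + \sum_{j < d} L_{d,j}\, z_j}{L_{d,d}} \right),

and z_d = \Phi^{-1}(t_d), where t_d = \text{bound}_d + (1 - \text{bound}_d)\, u_d if y_d = 1 and t_d = \text{bound}_d\, u_d otherwise. So within a single observation each \text{bound}_d depends on the z_j computed before it, and the loop over d cannot be collapsed into one vectorised statement.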

I hadn’t realised that was how it worked, thank you!

Just in case anyone is interested, my experiment with reduce_sum just finished and the speed up is fairly minor as Ben predicted (~20% with 4 threads per chain, N=6266).

If you have a multivariate probit with D dimensions and N observations, surely you can still parallelise over the N observations, even though the calculation of the likelihood for each dimension within a single observation depends on the previous one? I.e., if Y is an array of vectors, each of length D, I can’t see a reason you couldn’t parallelise across each of the vectors in Y. I think that’s what @fergusjchadwick was asking.


Thanks @dpascall, that is exactly what I meant but I wasn’t phrasing it very clearly!

@wds15, do you have any thoughts on whether this is a good candidate for parallelisation? I’ve managed to get a modest speed up and I’m wondering whether it’s worth pursuing further, or whether there is something intrinsic to the model structure (as suggested by Ben) that stops this from being a good candidate.

The partial_sum function (for use with reduce_sum) that gave me the speed up is:

  real partial_sum(
      int[,] y_slice,
      int start,
      int end,
      vector[] x,
      matrix beta,
      real[,] u,
      int K,
      int D,
      matrix L_Omega
  ) {
    real ll[(end - start) + 1];
    vector[K] x_slice[(end - start) + 1];
    real u_slice[(end - start) + 1, D];
    x_slice = x[start:end];
    u_slice = u[start:end];
    for (i in 1:(end - start) + 1) {
      ll[i] = multiprobit_lpmf(y_slice[i] | x_slice[i], beta, u_slice[i], D, L_Omega);
    }
    return sum(ll);
  }
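
For reference, this partial sum gets plugged into the model block along these lines (the grainsize of 1 is just a placeholder I have not tuned):

model {
  L_Omega ~ lkj_corr_cholesky(1);
  to_vector(beta) ~ normal(0, 10);
  // implicit: u is iid standard uniform a priori
  target += reduce_sum(partial_sum, y, 1, x, beta, u, K, D, L_Omega);
}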

In particular, I was wondering if getting rid of the for loop inside the partial_sum function would improve things (and then how to do that!). Thanks for any input!!


If you can avoid the for loop and write it as one vectorized statement … do that!

Also: if x is data, then do not assign it to a temporary variable, but rather use it directly with some additional index coding.

Finally, is u a parameter? If so, then slice over u instead of the data y.

Thanks @wds15. I’ve rewritten the code to slice over u (it is a parameter) and it’s currently running. Why is it better to slice over a parameter?

Great catch on the temporary variable, feel a bit silly for putting that in now!

I think to vectorise I’ll need to overhaul the code quite a lot (if indeed it’s possible). I’ll post here with the results of slicing over u and vectorising if I manage to!

Why it’s better to slice over parameters: the sliced argument is only copied once in total (each partial sum sees just its own slice), whereas the shared arguments are copied for every partial-sum evaluation, which for parameters also adds work to the autodiff stack. So a large parameter array like u is the thing to slice over.

It’s well worth trying to vectorize this thing!


@wds15 Thanks to your advice (removing temporary variables and slicing over u) the model is now twice as fast as the non-parallelised version. Here is my new function code for anyone trying to do something similar:

    real partial_sum(
      real[,] u,     // sliced argument: the slice u[start:end], locally indexed from 1
      int start,
      int end,
      vector[] x,    // shared arguments: x and y are the full arrays, indexed globally
      matrix beta,
      int[,] y,
      int K,
      int D,
      matrix L_Omega
    ) {
      real ll[(end - start) + 1];
      for (i in start:end) {
        ll[i - start + 1] = multiprobit_lpmf(y[i] | x[i], beta, u[i - start + 1], D, L_Omega);
      }
      return sum(ll);
    }
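
And the call in the model block now slices over the parameter u rather than the data (the grainsize is again just a placeholder):

  target += reduce_sum(partial_sum, u, 1, x, beta, y, K, D, L_Omega);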

Now to try vectorising!


Interesting, I am also using bgoodri’s parameterization of the MVP - I have a fairly large N and tried to use reduce_sum around a month ago but got no speedup whatsoever. The version I’m currently using is a bit of an extension - it is a mixture model (so it definitely cannot be vectorised, as Stan doesn’t support vectorised mixtures yet), hierarchical, and the likelihood is a mixture of ordered probit and standard (binary) probit, so maybe that’s why it doesn’t benefit as much. However, I don’t recall trying to slice over u, only the data, so this might work.

@CerulloE, one thing that might be affecting it (you probably already know) is that you can only split across conditionally independent parts of the model, so you’d need to find a way of splitting that reflects your hierarchy, if that’s possible.
