# Vectorising user-defined multivariate lpdf/lpmf functions

Hi! I’m a fairly new Stan user and this is my first time posting to the forum so I hope I’m formulating everything correctly and apologies if not (also thank you to everyone who asks and answers questions on this group, it’s an amazing resource!).

I’m using @bgoodri’s excellent parameterisation of the multivariate probit model (original available here).

To make the program integrate more easily with `library(loo)` and the new `reduce_sum` parallelisation I tried to redefine the bulk of the model inside a single _lpmf function but I can’t work out how to vectorise the function to improve the computational performance.

The function as it stands is:

``````functions {
real multiprobit_lpmf(
int[] y,  // response data - integer array of 1s and 0s
vector x, // covariate data
matrix beta, // coefficients for covariates
real[] u, // nuisance that absorbs inequality constraints
int D,  // response dimension size
matrix L_Omega // lower cholesky of correlation matrix
){
vector[D] mu; // means
vector[D] z; // latent continuous state
vector[D] logLik; // log likelihood
real prev;
mu = beta * x;
prev = 0;
for (d in 1:D) { // Phi and inv_Phi may overflow and / or be numerically inaccurate
real bound; // threshold at which utility = 0
bound = Phi( -(mu[d] + prev) / L_Omega[d,d]  );
if (y[d] == 1) {
real t;
t = bound + (1 - bound) * u[d];
z[d] = inv_Phi(t);       // implies utility is positive
logLik[d] = log1m(bound);  // Jacobian adjustment
}
else {
real t;
t = bound * u[d];
z[d] = inv_Phi(t);     // implies utility is negative
}
if (d < D) prev = L_Omega[d+1,1:d] * head(z, d);
// Jacobian adjustments imply z is truncated standard normal
// thus utility --- mu + L_Omega * z --- is truncated multivariate normal
}
return sum(logLik);
}
}
``````

And the data being used:

``````data {
int<lower=1> K;  // number of covariates + intercept
int<lower=1> D;  // response dimension size
int<lower=0> N; // sample size
int<lower=0,upper=1> y[N,D]; // response data - integer array of 1s and 0s
vector[K] x[N]; // covariate data
}
``````

At the moment the model is retrieving simulation parameters well but it would be nice to be able to remove the for loop, particularly when wrapping the function into partial_sums functions

``````model {
L_Omega ~ lkj_corr_cholesky(1);
to_vector(beta) ~ normal(0, 10);
// implicit: u is iid standard uniform a priori
{ // likelihood
for(n in 1:N){
target += multiprobit_lpmf(y[n] | x[n], beta, u[n], D, L_Omega);
}
}
}
``````

Any advice on how to vectorise the function (or if this is possible/desirable) is greatly appreciated- thanks!

2 Likes

I am not optimistic that you can get much speedup from parallelization in this case because you have to solve for the first element of `z`, before you can solve for the second, third, … D-th.

2 Likes

Thanks for the quick reply. I am probably misunderstanding so apologies but the idea was to split the N rows (rather than the D columns) to parallelise and wouldn’t the \boldsymbol{Z_{n}}'s be conditionally independent?

They are conditionally independent in the statistical sense (before you condition on the data), but you still have to know the value of z_1 before you can solve for z_2, etc.

I hadn’t realised that was how it worked, thank you!

Just in case anyone is interested, my experiment with reduce_sum just finished and the speed up is fairly minor as Ben predicted (~20% with 4 threads per chain, N=6266).

If you have a multivariate probit with D dimensions and N observations, surely you can still parallelise over the N observarions, even though the calculation for the likelihood over each of the dimensions within in a single observation depends on the previous one? i.e. if Y is a array of vectors, each of length D, I can’t see a reason you couldn’t paralellise across each of vectors in Y. I think that’s what @fergusjchadwick was asking.

2 Likes

Thanks @dpascall, that is exactly what I meant but I wasn’t phrasing very clearly!

@wds15, do you have any thoughts on whether this is a good candidate for paralellisation? I’ve managed to get a modest speed up and I’m wondering whether it’s worth pursuing it further or whether there is something intrinsic to the model structure (as suggested by Ben) that stops this from being a good parallelisation candidate.

The reduce_sum function that gave me the speed up is:

``````  real partial_sum(
int[,] y_slice,
int start,
int end,
vector[] x,
matrix beta,
real[,] u,
int K,
int D,
matrix L_Omega
){
// int slice_size;
real ll[(end-start)+1];
vector[K] x_slice[(end-start)+1];
real u_slice[(end-start)+1, D];
// slice_size = (end-start)+1;
x_slice = x[start:end];
u_slice = u[start:end];
for(i in 1:(end-start)+1){
ll[i] = multiprobit_lpmf(y_slice[i] | x_slice[i], beta, u_slice[i], D, L_Omega);
}
return sum(ll);
}
``````

In particular, I was wondering if getting rid of the for loop inside the partial_sum function would improve things (and then how to do that!). Thanks for any input!!

1 Like

If you can avoid the for loop and write it as one vectorize statement … do that!

Also: In case x is data, then do not assign it to a temporary variable, but rather use it directly with some additional index coding.

Finally, is u a parameter? If so, the slice over u instead of the data y.

Thanks @wds15. I’ve rewritten the code to slice over u (it is a parameter) and it’s currently running. Why is it better to slice over a parameter?

Great catch on the temporary variable, feel a bit silly for putting that in now!

I think to vectorise I’ll need to overhaul the code quite a lot (if indeed it’s possible). I’ll post here with the results of slicing over u and vectorising if I manage to!

Why its better to slice over parameters:

It‘s well worth trying to vectorize this thing!

@wds15 Thanks to your advice (removing temporary variables and slicing over u) the model is now twice as fast as the non-parallelised version. Here is my new function code for anyone trying to do similar:

``````    real partial_sum(
real[,] u,
int start,
int end,
vector[] x,
matrix beta,
int[,] y,
int K,
int D,
matrix L_Omega
){
real ll[(end-start)+1];
for(i in start:end){
ll[i-start+1] = multiprobit_lpmf(y[i] | x[i], beta, u[i-start+1], D, L_Omega);
}
return sum(ll);
}
``````

Now to try vectorising!

2 Likes

Interesting, I am also using bgoodri parameterization of MVP - I have fairly large N and have tried to use reduce_sum around a month ago but got no speedup whatsoever . The version I’m currently using is a bit of an extension - it is a mixture model (so definitely cannot be vectorised as stan doesnt support vectorised mixtures yet), hierarchical, and the likelihood is a mixture of ordered probit and standard (binary) probit, so maybe that’s why it doesnt benefit as much. However, I dont recall trying to slice over u, only the data, so this might work.

@CerulloE, one thing that might be affecting it (you probably already know) is that you can only split across conditionally independent parts of the model so you’d need to find a way to split that reflects your hierarchy, if that’s possible.

1 Like