# Help with multi-threading a simple ordinal probit model using reduce_sum

I’m parallelising some existing programmes but am struggling to code up a simple ordinal probit model using `partial_sum` and `reduce_sum`.

The non-`reduce_sum` likelihood statement is (for a 3-level ordinal response):

```stan
// linear predictor
mu = alpha + beta * x;

for (n in 1:N) { // loop over cases
  vector[K] theta; // probabilities of the K ordinal categories
  theta[1] = Phi((thresh[1] - mu[n]) / sigma);
  theta[2] = Phi((thresh[2] - mu[n]) / sigma) - Phi((thresh[1] - mu[n]) / sigma);
  theta[3] = 1 - Phi((thresh[2] - mu[n]) / sigma);

  target += categorical_lpmf(y[n] | theta);
}
```

I wrote a `partial_sum` function:

```stan
real partial_sum(int[] slice_y,
                 int start, int end,
                 matrix theta_slice) {
  return categorical_lpmf(slice_y[start:end] | theta_slice[start:end, :]');
}
```

and a `reduce_sum` likelihood:

```stan
matrix[N, K] theta_slice;

theta_slice[, 1] = Phi((thresh[1] - mu) / sigma);
theta_slice[, 2] = Phi((thresh[2] - mu) / sigma) - Phi((thresh[1] - mu) / sigma);
theta_slice[, 3] = 1 - Phi((thresh[2] - mu) / sigma);

target += reduce_sum(partial_sum, y, 1, theta_slice);
```

but I’m running into the problem that the outcome-probability argument of `categorical_lpmf` must be a column vector, whereas I’m passing `theta_slice` in as a matrix.

If I try to make a `for`-loop in `partial_sum`, it also doesn’t work:

```stan
real partial_sum(int[] slice_y,
                 int start, int end,
                 matrix theta_slice) {
  for (i in start:end)
    return categorical_lpmf(slice_y[i] | theta_slice[i, :]');
}
```

which causes the error:

```
Chain 3 TBB Warning: Exact exception propagation is requested by application but the linked library is built without support for it
Chain 4 Exception: Exception: categorical_lpmf: Number of categories is 0, but must be in the interval [1, 3] (in '/tmp/Rtmp5Xc0sm/model-24e17999fc70.stan', line 10, column 6 to column 62) (in '/tmp/Rtmp5Xc0sm/model-24e17999fc70.stan', line 10, column 6 to column 62) [origin: unknown original type]
```

Any good ideas?

Thanks!

Here is the full R script: ordinal-reduce_sum-script.R (1.4 KB)
and Stan code ordinal-reduce_sum.stan (1.1 KB)

Hi Conor,

Firstly, Stan has a built-in `ordered_probit` distribution which might be cleaner for you; there’s more information in the docs.

So your non-`reduce_sum` likelihood would simply be:

```stan
// linear predictor
mu = alpha + beta * x;

target += ordered_probit_lpmf(y | mu, thresh);
```

To get your `reduce_sum` likelihood working, you will need a loop in the `partial_sum` function, since `categorical_lpmf` isn’t vectorised in the way that you’re trying to apply it. However, when using a loop with `reduce_sum` you need to be careful with the indexing: the sliced argument `slice_y` is not the same length as `theta` (since it has been sliced), so the same index `i` refers to different individuals in each.

Then, to access the correct vector for `theta` without it returning a matrix, use `theta_slice[i]'` rather than `theta_slice[i, :]'`.

Putting those things together, your `partial_sum` function then becomes:

```stan
real partial_sum(int[] slice_y,
                 int start, int end,
                 matrix theta_slice) {
  real lp = 0;
  for (i in start:end)
    lp += categorical_lpmf(slice_y[i - start + 1] | theta_slice[i]');
  return lp;
}
```

Thanks @andrjohns!

I am using the ‘fixed threshold’ parameterisation of the cumulative probit, so that’s why I have it coded up directly.

The code runs but is actually much (~6x) slower than the non-`reduce_sum` version. I’ll investigate a bit more.

Yeah, that is kind of expected in this case. Since `reduce_sum` only slices the first argument, `theta` isn’t getting sliced, and the entire (5000 × 3) matrix gets copied to each thread/process. This copying can outweigh the performance benefit of splitting the loop between processes.

At least that’s how I understand things, @wds15 does that sound right?
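The copy issue can be illustrated with a toy serial sketch in Python (hypothetical code; Stan’s real `reduce_sum` partitions the work recursively and runs the pieces in parallel via TBB). Only the first argument gets sliced, while every trailing argument is handed whole to each `partial_sum` call — and it also shows the `i - start + 1` mapping between global and local indices:

```python
def reduce_sum(partial_sum, sliced_arg, grainsize, *shared_args):
    """Toy serial stand-in for Stan's reduce_sum: chop the first argument
    into chunks of size `grainsize`; pass every shared argument whole to
    each call (that's the per-thread copy)."""
    total = 0.0
    n = len(sliced_arg)
    for start in range(1, n + 1, grainsize):  # 1-based, like Stan
        end = min(start + grainsize - 1, n)
        chunk = sliced_arg[start - 1:end]     # only this argument is sliced
        total += partial_sum(chunk, start, end, *shared_args)
    return total

def partial_sum(slice_y, start, end, log_theta):
    """slice_y has length end - start + 1, but log_theta is the full
    N x K table, so the global index i must be shifted to a local one."""
    lp = 0.0
    for i in range(start, end + 1):
        y_i = slice_y[i - start]              # local index (0-based here)
        lp += log_theta[i - 1][y_i - 1]       # global row i of the full table
    return lp
```

Swapping the argument order so that `theta` is the sliced first argument (as suggested above) means the big table is divided across calls rather than duplicated in each one.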


Yes, that’s correct.

By all means, you should slice the theta and not the data y!

I implemented the respective `reduce_sum` approach for a model based on the categorical logit, and it ran crazy fast (like 4 cores = 4x as fast).

Oh good call, that would work better.

@cgoold So for your model, your `partial_sum` function would then look like:

```stan
real partial_sum(matrix theta_slice,
                 int start, int end,
                 int[] y) {
  real lp = 0;
  for (i in start:end)
    lp += categorical_lpmf(y[i] | theta_slice[i - start + 1]');
  return lp;
}
```

And your `reduce_sum` call would be:

```stan
target += reduce_sum(partial_sum, theta_slice, 1, y);
```

Thanks all. It makes much more sense to me now to slice theta.

@andrjohns I used your re-write, but ran into the following error at compilation:

```
Error in processx::run(command = make_cmd(), args = c(tmp_exe, cpp_options_to_compile_flags(cpp_options),  :
System command 'make' failed, exit status: 2, stderr (last 10 lines):
E> (T[,], int, int, ...) => real, T[,], int, ...
E> (T[,,], int, int, ...) => real, T[,,], int, ...
E> (T[,,,], int, int, ...) => real, T[,,,], int, ...
E> (T[,,,,], int, int, ...) => real, T[,,,,], int, ...
E> (T[,,,,,], int, int, ...) => real, T[,,,,,], int, ...
E> (T[,,,,,,], int, int, ...) => real, T[,,,,,,], int, ...
E> Where T is any one of int, real, vector, row_vector or matrix.
E> Instead supplied arguments of incompatible type: (matrix, int, int, int[]) => real, matrix, int, int[]
```

I tried to change `theta_slice` to a vector (i.e. `vector[N] theta_slice[K]`), and while it compiled, I haven’t gotten it to run yet. Here’s my code:

```stan
real partial_sum(vector[] theta_slice,
                 int start, int end,
                 int[] y) {
  real lp = 0;
  for (i in start:end) {
    lp += categorical_lpmf(y[i] | to_vector(theta_slice[1:3, i - start + 1]));
  }
  return lp;
}
```

where in the model block:

```stan
theta_slice[1] = Phi((thresh[1] - mu) / sigma);
theta_slice[2] = Phi((thresh[2] - mu) / sigma) - Phi((thresh[1] - mu) / sigma);
theta_slice[3] = 1 - Phi((thresh[2] - mu) / sigma);

target += reduce_sum(partial_sum, theta_slice, 1, y);
```

which returns the following error at run time, referring to the likelihood increment line in the `partial_sum` function:

```
Chain 1 Exception: Exception: array[multi,...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpF1dPiz/model-be61b76c42a.stan', line 17, column 6 to column 79) (in '/tmp/RtmpF1dPiz/model-be61b76c42a.stan', line 17, column 6 to column 79) [origin: unknown original type]
```

Thanks for any comments you have!

@wds15 Thanks. Is the categorical logit model code available somewhere (e.g. as a case study)?

Sorry for leading you astray there; I thought it would have sliced across matrix rows. You’re right that the next step is to use an array of vectors instead. So `theta_slice` is declared as:

```stan
vector[K] theta_slice[N];
```

And the reduce_sum likelihood call is:

```stan
if (use_reduce_sum == 1) {
  theta_slice[, 1] = to_array_1d(Phi((thresh[1] - mu) / sigma));
  theta_slice[, 2] = to_array_1d(Phi((thresh[2] - mu) / sigma) - Phi((thresh[1] - mu) / sigma));
  theta_slice[, 3] = to_array_1d(1 - Phi((thresh[2] - mu) / sigma));

  target += reduce_sum(partial_sum, theta_slice, 1, y);
}
```

The `to_array_1d` calls are there so that the types are compatible when accessing the first/second/third element for each individual.

And the partial_sum call is:

```stan
real partial_sum(vector[] theta_slice,
                 int start, int end,
                 int[] slice_y) {
  real lp = 0;
  for (i in start:end)
    lp += categorical_lpmf(slice_y[i] | theta_slice[i - start + 1]);
  return lp;
}
```

That ran for me in ~30secs, but I’ve got things compiling in the background so your results could be even better.


Thanks again! For me it now runs just as quick as the non-`reduce_sum` version, but not really any quicker. But maybe on a machine with more cores/threads it would?

How many (physical) cores do you have? Given that you’re running 4 chains, each with 2 threads, you would need an 8-core CPU to see a real benefit. Alternatively, you could try running just 2 chains at a time and seeing how the performance compares.

Also, I experimented a little with your model, and by moving more of the calculation/construction to the `partial_sum` function, I’ve gotten the runtime down to ~10secs:

```stan
functions {
  real partial_sum(int[] slice_y,
                   int start, int end,
                   vector x, vector thresh,
                   real alpha, real beta,
                   real sigma, int K) {
    real lp = 0;
    int N = end - start + 1;
    matrix[N, K] theta_slice;
    vector[N] mu = alpha + beta * x[start:end];
    theta_slice[, 1] = Phi((thresh[1] - mu) / sigma);
    theta_slice[, 2] = Phi((thresh[2] - mu) / sigma) - Phi((thresh[1] - mu) / sigma);
    theta_slice[, 3] = 1 - Phi((thresh[2] - mu) / sigma);
    for (i in 1:N)
      lp += categorical_lpmf(slice_y[i] | theta_slice[i]');
    return lp;
  }
}

data {
  int<lower=1> N;
  vector[N] x;
  int y[N];
  int K;
  vector[K - 1] thresh;
  int use_reduce_sum;
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model {
  alpha ~ normal(0, 5);
  beta ~ std_normal();
  sigma ~ std_normal();

  if (use_reduce_sum == 1) {
    target += reduce_sum(partial_sum, y, 1, x, thresh, alpha, beta, sigma, K);
  }
  else {
    // likelihood (non reduce_sum)
    vector[N] mu = alpha + beta * x;
    for (n in 1:N) {
      vector[K] theta;
      theta[1] = Phi((thresh[1] - mu[n]) / sigma);
      theta[2] = Phi((thresh[2] - mu[n]) / sigma) - Phi((thresh[1] - mu[n]) / sigma);
      theta[3] = 1 - Phi((thresh[2] - mu[n]) / sigma);

      target += categorical_lpmf(y[n] | theta);
    }
  }
}
```

Sorry… no, this code is not public.

Another important step is to move most calculations into the `partial_sum` function. So you should really move the `Phi` calls which prep the `theta` object into the `partial_sum` function. As I recall, the `Phi` function isn’t really vectorised in Stan anyway, so you don’t lose performance by doing it element-wise.


Thanks. I’ll give this a go later. I experimented with this before slicing `theta`, so it might work. I’ll also try 2 chains next time. Maybe I’m misunderstanding my system!

Only try 1 chain with multiple threads for now. Once that works, you can figure out the optimal chain/thread trade-off on your machine.

Thank you both for your time a few days ago and your very helpful comments. I managed to get it all working with the expected speedup.

I’ve implemented the `reduce_sum` approach on a hierarchical ordinal probit model now, and I’m finding about a 2x speedup. I’ll be applying this to a joint-likelihood ordinal probit model soon, with hurdle and zero-inflated components and cross-classified random effects. All going well, I can make a new post for that model (and preprint) if it would be helpful.
