Help with multi-threading a simple ordinal probit model using reduce_sum

I’m parallelising some existing programs but am struggling to code up a simple ordinal probit model using reduce_sum and a partial_sum function.

The non-reduce_sum likelihood statement is (for a 3-level ordinal response):

// linear predictor
mu = alpha + beta * x;

for(n in 1:N){// loop over cases
      vector[K] theta; // probability of ordinal categories
      theta[1] = Phi( (thresh[1] - mu[n])/sigma);
      theta[2] = Phi( (thresh[2] - mu[n])/sigma) - Phi( (thresh[1] - mu[n])/sigma);
      theta[3] = 1 - Phi( (thresh[2] - mu[n])/sigma);
   
      target += categorical_lpmf(y[n] | theta);
}

I wrote a partial_sum function:

real partial_sum(
    int[] slice_y,
    int start, int end,
    matrix theta_slice
  )
  {
    return categorical_lpmf(slice_y[start:end] | theta_slice[start:end,:]');
  }

and a reduce_sum likelihood:

matrix[N, K] theta_slice;

theta_slice[, 1] = Phi( (thresh[1] - mu)/sigma);
theta_slice[, 2] = Phi( (thresh[2] - mu)/sigma) - Phi( (thresh[1] - mu)/sigma);
theta_slice[, 3] = 1 - Phi( (thresh[2] - mu)/sigma);

target += reduce_sum(partial_sum, y, 1, theta_slice);

but I’m running into the problem that the probability argument to categorical_lpmf must be a column vector, whereas I’m passing theta_slice in as a matrix.
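For reference, the signature is categorical_lpmf(ints y | vector theta): y can be an int or an array of ints, but theta is a single probability vector shared by all of them, so there is no matrix form:

// OK: one probability vector, e.g. vector[K] theta
target += categorical_lpmf(y[n] | theta);
// but nothing like this exists:
// target += categorical_lpmf(y | theta_matrix); // matrix[N, K] won't compile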

If I try to make a for-loop in partial_sum, it also doesn’t work:

real partial_sum(
    int[] slice_y,
    int start, int end,
    matrix theta_slice
  )
  {
    for(i in start:end)
      return categorical_lpmf(slice_y[i] | theta_slice[i,:]');
  }

which causes the error:

Chain 3 TBB Warning: Exact exception propagation is requested by application but the linked library is built without support for it
Chain 4 Exception: Exception: categorical_lpmf: Number of categories is 0, but must be in the interval [1, 3] (in '/tmp/Rtmp5Xc0sm/model-24e17999fc70.stan', line 10, column 6 to column 62) (in '/tmp/Rtmp5Xc0sm/model-24e17999fc70.stan', line 10, column 6 to column 62) [origin: unknown original type]

Any good ideas?

Thanks!

Here is the full R script: ordinal-reduce_sum-script.R (1.4 KB)
and Stan code ordinal-reduce_sum.stan (1.1 KB)

Hi Conor,

Firstly, Stan has a built-in ordered_probit distribution which might be cleaner for you; there’s more information in the docs.

So your non-reduce loop would be:

// linear predictor
mu = alpha + beta * x;

target += ordered_probit_lpmf(y | mu, thresh);

To get your reduce_sum likelihood working, you will need a loop in the partial_sum function, since categorical_lpmf isn’t vectorised in the way that you’re trying to apply it. However, when using a loop with reduce_sum you need to be careful with the indexing: the ‘sliced’ argument slice_y is not the same length as theta (it’s been sliced), so the same index i refers to different individuals in each. For example, if start = 41 and end = 60, then slice_y[1] corresponds to y[41], while theta_slice is still indexed by the global i.

Then, to access the correct vector for theta without it returning a matrix, use theta_slice[i]' rather than theta_slice[i,:]'.

Putting those things together, your partial_sum function then becomes:

  real partial_sum(
    int[] slice_y,
    int start, int end,
    matrix theta_slice
  )
  {
    real lp = 0;
    for(i in start:end)
      // slice_y is already sliced, so shift back to local indexing;
      // theta_slice is a shared (unsliced) argument, so keep the global index i
      lp += categorical_lpmf(slice_y[i - start + 1] | theta_slice[i]');
    return lp;
  }

Thanks @andrjohns!

I am using the ‘fixed threshold’ parameterisation of the cumulative probit, so that’s why I have it coded up directly.

The code runs but is actually much (~6x) slower than the non-reduce_sum version. I’ll investigate a bit more.

Yeah that is kinda expected in this case. Since reduce_sum only slices the first argument, theta isn’t getting sliced and the entire (5000x3) matrix gets copied to each thread/process. This copy time can outweigh the performance benefit of splitting the loop between processes.
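To spell that out, the call pattern is reduce_sum(partial_sum_fn, x, grainsize, shared args...), and only x gets partitioned:

target += reduce_sum(partial_sum, y, 1, theta_slice);
// y (the 2nd argument) is sliced across threads;
// theta_slice is a shared argument, so the full N x K matrix
// is copied for every thread on every evaluation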

At least that’s how I understand things; @wds15, does that sound right?


Yes, that’s correct.

By all means, you should slice the theta and not the data y!

I implemented the corresponding reduce_sum approach for a model based on categorical logit, and it ran crazy fast (like 4 cores = 4x as fast).

Oh good call, that would work better.

@cgoold So for your model, your partial_sum function would then look like:

  real partial_sum(
    matrix theta_slice,
    int start, int end,
    int[] y
  )
  {
    real lp = 0;
    for(i in start:end)
      lp += categorical_lpmf(y[i] | theta_slice[i - start + 1]');
    return lp;
  }

And your reduce_sum call would be:

target += reduce_sum(partial_sum, theta_slice, 1,  y);

Thanks all. It makes much more sense to me now to slice theta.

@andrjohns I used your re-write, but ran into the following error at compilation:

Error in processx::run(command = make_cmd(), args = c(tmp_exe, cpp_options_to_compile_flags(cpp_options),  : 
  System command 'make' failed, exit status: 2, stderr (last 10 lines):
E> (T[,], int, int, ...) => real, T[,], int, ...
E> (T[,,], int, int, ...) => real, T[,,], int, ...
E> (T[,,,], int, int, ...) => real, T[,,,], int, ...
E> (T[,,,,], int, int, ...) => real, T[,,,,], int, ...
E> (T[,,,,,], int, int, ...) => real, T[,,,,,], int, ...
E> (T[,,,,,,], int, int, ...) => real, T[,,,,,,], int, ...
E> Where T is any one of int, real, vector, row_vector or matrix.
E> Instead supplied arguments of incompatible type: (matrix, int, int, int[]) => real, matrix, int, int[]

I tried to change theta_slice to an array of vectors (i.e. vector[N] theta_slice[K], an array of K vectors of length N), and while it compiled, I haven’t gotten it to run yet. Here’s my code:

real partial_sum(
    vector[] theta_slice,
    int start, int end,
    int[] y
  )
  {
    real lp = 0;
    for(i in start:end){
      lp += categorical_lpmf(y[i] | to_vector(theta_slice[1:3, i - start + 1]));
    }
    return lp;
  }

where in the model block:

theta_slice[1] = Phi( (thresh[1] - mu)/sigma); 
theta_slice[2] = Phi( (thresh[2] - mu)/sigma) - Phi( (thresh[1] - mu)/sigma);
theta_slice[3] = 1 - Phi( (thresh[2] - mu)/sigma);

target += reduce_sum(partial_sum, theta_slice, 1, y);

which returns the following error at run-time, referring to the lp += categorical_lpmf(...) line in the partial_sum function:

Chain 1 Exception: Exception: array[multi,...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpF1dPiz/model-be61b76c42a.stan', line 17, column 6 to column 79) (in '/tmp/RtmpF1dPiz/model-be61b76c42a.stan', line 17, column 6 to column 79) [origin: unknown original type]

Thanks for any comments you have!

@wds15 Thanks. Is the categorical logit model code available somewhere (e.g. as a case study)?

Sorry for leading you astray there; I thought reduce_sum would slice across the matrix rows. You’re right that the next step is to use an array of vectors instead: as the compiler error shows, the sliced argument has to be an array type (T[], T[,], and so on), not a matrix. So theta_slice is declared as:

  vector[K] theta_slice[N];

And the reduce_sum likelihood call is:

  if(use_reduce_sum == 1){
    theta_slice[, 1] = to_array_1d(Phi( (thresh[1] - mu)/sigma));
    theta_slice[, 2] = to_array_1d(Phi( (thresh[2] - mu)/sigma) - Phi( (thresh[1] - mu)/sigma));
    theta_slice[, 3] = to_array_1d(1 - Phi( (thresh[2] - mu)/sigma));

    target += reduce_sum(partial_sum, theta_slice, 1, y);
  }

The to_array_1d calls are there so that the types are compatible when assigning the first/second/third element for each individual.
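To make the types concrete:

vector[K] theta_slice[N];  // an array of N vectors, each of length K
// theta_slice[, 1] selects element 1 of every vector -> real[N] (an array)
// Phi((thresh[1] - mu) / sigma) returns a vector[N]
// to_array_1d() converts that vector to real[N], so the assignment type-checks
theta_slice[, 1] = to_array_1d(Phi((thresh[1] - mu) / sigma));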

And the partial_sum call is:

  real partial_sum(
    vector[] theta_slice,
    int start, int end,
    int[] slice_y // despite the name, this is the full y array: theta_slice is the sliced argument now
  )
  {
    real lp = 0;
    for(i in start:end)
      // y keeps the global index; theta_slice is re-indexed from 1
      lp += categorical_lpmf(slice_y[i] | theta_slice[i - start + 1]);
    return lp;
  }

That ran for me in ~30 secs, but I’ve got things compiling in the background, so your results could be even better.


Thanks again! For me, it now runs just as quickly as the non-reduce_sum version, but not really any quicker. But maybe on a machine with more cores/threads it would?

How many (physical) cores do you have? Given that you’re running 4 chains with 2 threads each, you would need an 8-core CPU to see a real benefit. Alternatively, you could try running just 2 chains at a time and seeing how the performance compares.

Also, I experimented a little with your model, and by moving more of the calculation/construction into the partial_sum function, I’ve gotten the runtime down to ~10 secs:

functions{

  real partial_sum(
    int[] slice_y,
    int start, int end,
    vector x, vector thresh,
    real alpha, real beta,
    real sigma, int K
  )
  {
    real lp = 0;
    int N = (end - start + 1);
    matrix[N,K] theta_slice;
    vector[N] mu = alpha + beta * x[start:end];
    theta_slice[, 1] = (Phi( (thresh[1] - mu)/sigma));
    theta_slice[, 2] = (Phi( (thresh[2] - mu)/sigma) - Phi( (thresh[1] - mu)/sigma));
    theta_slice[, 3] = (1 - Phi( (thresh[2] - mu)/sigma));
    for(i in 1:N)
      lp += categorical_lpmf(slice_y[i] | theta_slice[i]');
    return lp;
  }
}

data{
  int<lower=1> N;
  vector[N] x;
  int y[N];
  int K;
  vector[K-1] thresh;
  int use_reduce_sum;
}

parameters{
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model{
  alpha ~ normal(0, 5);
  beta ~ std_normal();
  sigma ~ std_normal();


  if(use_reduce_sum == 1){

    target += reduce_sum(partial_sum, y, 1, x, thresh, alpha, beta, sigma, K);
  }
  else{
    vector[N] mu;
    mu = alpha + beta * x;
    // likelihood (non reduce_sum)
    for(n in 1:N){
      vector[K] theta;
      theta[1] = Phi( (thresh[1] - mu[n])/sigma);
      theta[2] = Phi( (thresh[2] - mu[n])/sigma) - Phi( (thresh[1] - mu[n])/sigma);
      theta[3] = 1 - Phi( (thresh[2] - mu[n])/sigma);

      target += categorical_lpmf(y[n] | theta);
      }
  }
}


Sorry… no, this code is not public.

Another important step is to move most of the calculation into the partial_sum function, so you should really move the Phi calls which prepare the theta object into partial_sum (as in the example above). As I recall, the Phi function isn’t really vectorised internally in Stan anyway, so you don’t lose performance by computing it element-wise.


Thanks. I’ll give this a go later. I experimented with this before slicing theta, so it might work. I’ll also try 2 chains next time - maybe I’m misunderstanding my system!

Only try 1 chain with multiple threads for now. Once that works, you can figure out the optimal chain/thread trade-off on your machine.

@andrjohns @wds15

Thank you both for your time a few days ago and your very helpful comments. I managed to get it all working with the expected speedup.

I’ve implemented the reduce_sum approach on a hierarchical ordinal probit model now, and I’m finding about a 2x speedup. I’ll be applying this to a joint-likelihood ordinal probit model soon, with hurdle and zero-inflated components and cross-classified random effects. All going well, I can make a new post for that model (and preprint) if it would be helpful.
