Parallelization in cmdstanr: using reduce_sum() on a multinomial likelihood

Hello,

I am currently working on a multinomial regression model with many covariates and a large dataset. I am interested in reducing runtime by using within-chain parallelization. I initially thought the multinomial was a good candidate for reduce_sum(). However, I can't seem to find an approach that works, and I have reached a point where the model compiles properly but the chains don't start, and the error messages are more puzzling than helpful.
This is as far as I got, with a very simple example for clarity:

Stan code:

// multinomial.stan

functions{
  real partial_sum(int[,] y_slice,
                   int start,
                   int end,
                   vector[] theta) {
    real interm_sum = 0;
    for (i in start:end){
      interm_sum += multinomial_lpmf(y_slice[i] | theta[i]);
    }
    return interm_sum;
  }
}

data {
  int<lower=0> Ny;
  int<lower=0> K;
  int<lower=0> Y_obs[K,Ny];
  int grainsize;
}

parameters {
  simplex[Ny] theta[K];
}

model {
  target += reduce_sum(partial_sum,
                         Y_obs,
                         grainsize,
                         theta);
}

R code:

data <- t(rmultinom(5,500,c(0.5,0.3,0.2)))

stan_data <- list(Ny = 3,
                  K = 5,
                  Y_obs = data,
                  grainsize = 1)

mod <- cmdstan_model("./multinomial.stan",
                     cpp_options = list(stan_threads = TRUE))

fit3 <- mod$sample(data = stan_data,
                   seed = 5446,
                   chains = 3,
                   parallel_chains = 3,
                   threads_per_chain = 2,
                   iter_warmup = 500,
                   iter_sampling = 500,
                   refresh = 100)

Here is part of the error message:

Running MCMC with 3 parallel chains, with 2 thread(s) per chain...

Chain 1 Unrecoverable error evaluating the log probability at the initial value. 
Chain 1 Exception: Exception: array[uni, ...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 10, column 6 to column 60) (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 30, column 2 to line 33, column 32) 
Chain 1 Exception: Exception: array[uni, ...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 10, column 6 to column 60) (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 30, column 2 to line 33, column 32)
Chain 1 Exception: Exception: array[uni, ...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 10, column 6 to column 60) (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 30, column 2 to line 33, column 32)

The same error message is repeated for chains 2 and 3; I just truncated the output above.

I also tried hard-coding the multinomial log-likelihood equation instead of using the multinomial() function (this is what I do without parallelization because it allows for vectorisation), but I ran into incompatibility issues with data and variable types.

Did anyone manage to find a workaround with multinomial_lpmf() or hard-coded log-likelihood components?

Thank you

Joane

y_slice will only contain the elements from start to end, re-indexed from 1. Thus, you need to adjust the index accordingly (it should be i - start + 1).
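A minimal Python sketch (toy numbers, not from the model above) of the slice re-indexing that reduce_sum performs, showing how a global index i maps into the slice:

```python
# Emulate how reduce_sum hands the partial-sum function only a slice of y.
# start and end are 1-based, as in Stan.
y = [10, 20, 30, 40, 50]
start, end = 3, 5
y_slice = y[start - 1:end]  # the function sees only these elements, re-indexed from 1

for i in range(start, end + 1):  # loop over the original (global) indices
    # global index i maps to local index i - start + 1 within the slice
    assert y_slice[(i - start + 1) - 1] == y[i - 1]
```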


Oh. That was an easy fix, thank you.
Updating the model with code that works (updated after the comment below):

// simplest multinomial

functions{
  real partial_sum(int[,] y_slice,
                   int start,
                   int end,
                   vector[] theta) {
    real interm_sum = 0;
    for (i in 1:(end - start + 1)) {
      interm_sum += multinomial_lpmf(y_slice[i] | theta[start + i - 1]);
    }
    return interm_sum;
  }
}

data {
  int<lower=0> Ny;
  int<lower=0> K;
  int<lower=0> Y_obs[K,Ny];
  int grainsize;
}

parameters {
  simplex[Ny] theta[K];
}

model {
  target += reduce_sum(partial_sum,
                         Y_obs,
                         grainsize,
                         theta);
}

Not quite. theta is not sliced; it is a shared argument that is passed as a whole, so it needs the index start + i - 1.

It would be nice if you could suggest how to improve the documentation. Many people make the same indexing mistakes.


Thank you @wds15.

I implemented both the multinomial model with reduce_sum() and the equivalent non-parallelized model. Whatever parameter I tweak, all the parallelized runs take longer than the non-parallelized ones. I tried:

  • setting threads_per_chain to 1 → highest computation time, much higher than the non-parallelized model
  • increasing threads_per_chain from 2 to 10 → computation time increases with thread count; all times are higher than the non-parallelized example but lower than the run with threads_per_chain = 1
  • testing grainsize, starting at K/N_{cores} (where K is the number of log-likelihood sum components) and then iteratively dividing by 2 → increases computation time
  • increasing K → increases computation time

To sense-check that this wasn't a problem with my computing system, I ran the binomial example from Reduce Sum: A Minimal Example (mc-stan.org) and got results similar to those reported there.

Could there be something about the multinomial distribution that makes parallelization gains impossible?

Cheers

Code:

// simplest_multinomial.stan

data {
  int<lower=0> D;
  int<lower=0> K;
  int<lower=0> Y[K,D];
}

parameters {
  simplex[D] theta[K];
}

model {
  for(k in 1:K){
    target += multinomial_lpmf(Y[k] | theta[k]);
  }
}

multithread version:

// simplest_multinomial_redsum.stan

functions{
  real partial_sum(int[,] y_slice,
                   int start,
                   int end,
                   vector[] theta) {
    real interm_sum = 0;
    for (i in 1:(end - start + 1)) {
      interm_sum += multinomial_lpmf(y_slice[i] | theta[start + i - 1]);
    }
    return interm_sum;
  }
}

data {
  int<lower=0> D;
  int<lower=0> K;
  int<lower=0> Y[K,D];
  int grainsize;
}

parameters {
  simplex[D] theta[K];
}

model {
  target += reduce_sum(partial_sum,
                         Y,
                         grainsize,
                         theta);
}

R code to test a dummy example:

library(cmdstanr)
library(posterior)

ncores = 4
options(mc.cores = ncores)

# create fake data
K <- 500
Y <- t(rmultinom(K, 100, c(0.8,0.1,0.1)))

stan_data <- list(D = 3,
                  K = K,
                  Y = Y)

# basic
mod0 <- cmdstan_model("./simplest_multinomial.stan")

time0 = system.time(
  fit0 <- mod0$sample(data = stan_data,
                      seed = 5446,
                      chains = 2,
                      parallel_chains = 2,
                      iter_warmup = 500,
                      iter_sampling = 500,
                      refresh = 100)
)

# multithread
stan_data$grainsize <- 1

mod1 <- cmdstan_model("./simplest_multinomial_redsum.stan",
                      cpp_options = list(stan_threads = TRUE))


time1 = system.time(
  fit1 <- mod1$sample(data = stan_data,
                      seed = 5446,
                      chains = 2,
                      parallel_chains = 2,
                      threads_per_chain = 2,
                      iter_warmup = 500,
                      iter_sampling = 500,
                      refresh = 100)
)

# compare
time1[["elapsed"]] / time0[["elapsed"]]

Can you explain why you changed the loop range in this part? Should it always start at 1?

According to this fantastic tutorial for a binomial likelihood, the proper syntax inside the partial_sum function should just be:

 return multinomial_lpmf(y_slice | theta[start:end]);
  

I don’t know if you need to use a for-loop at all.

@sonicking, it seems it is not so simple; see @wds15's reply above. The problem is that for distributions such as the multinomial, the minimal example is not that useful. It is often convenient to deal with loops (usually theta is a complex function itself). Anyway, the problem that @joaneelleouet raises persists: I've used her code and obtained the same disappointing results. Still waiting for someone authoritative to check this case thoroughly. I also have no clue how this start/end logic works.
If you are able to refactor the code in the example to use vector notation, it would be much appreciated.

My apologies. I was thinking of a categorical distribution. Please disregard my previous reply.

@garej @sonicking Thanks for your interest in the issue.

@garej reduce_sum() resets the indexes for each slice. That is, a slice of y is taken from start to end but is then re-indexed from 1 to the length of the slice, which is end - start + 1. theta is not sliced; the original indexing is kept. Hence theta[start + i - 1].
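These mechanics can be emulated outside Stan. Here is a small Python sketch (made-up numbers, with a toy stand-in for the per-row log-likelihood term) showing that slicing Y while passing theta whole reproduces the full sum exactly when theta is indexed with start + i - 1:

```python
import math

# Toy stand-in for a per-observation log-likelihood term (not the real lpmf)
def term(y_k, theta_k):
    return sum(yi * math.log(ti) for yi, ti in zip(y_k, theta_k))

Y = [[8, 1, 1], [5, 3, 2], [2, 2, 6], [9, 0, 1]]
theta = [[0.8, 0.1, 0.1], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.9, 0.05, 0.05]]

full = sum(term(y, t) for y, t in zip(Y, theta))

# Emulate reduce_sum: each call gets a re-indexed slice of Y plus (start, end),
# while theta is passed whole and keeps its global indexing.
def partial_sum(y_slice, start, end, theta):
    s = 0.0
    for i in range(1, (end - start + 1) + 1):      # i = 1 .. end - start + 1
        s += term(y_slice[i - 1], theta[(start + i - 1) - 1])
    return s

parts = partial_sum(Y[0:2], 1, 2, theta) + partial_sum(Y[2:4], 3, 4, theta)
assert abs(full - parts) < 1e-12
```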

@sonicking the fact that the multinomial function is not vectorised complicates things. In the standard Stan model, hard-coding the multinomial (dropping constant terms) makes it quite easy and more computationally efficient. However, moving to multithreading, the constraint of having to pass y_slice as an array to partial_sum() makes vectorisation impossible for me; I did not find a way to avoid the loop.
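For reference, the "constant terms" being dropped are the factorial pieces of the multinomial log-pmf, which do not depend on theta. A Python sketch (toy numbers) illustrating why dropping them leaves the posterior unchanged:

```python
import math

# Full multinomial log-pmf vs. the hard-coded kernel sum(y * log(theta)).
# Their difference, lgamma(n+1) - sum(lgamma(y_i+1)), is constant in theta,
# so it can be dropped from target += without changing the posterior.
def multinomial_logpmf(y, theta):
    n = sum(y)
    return (math.lgamma(n + 1)
            - sum(math.lgamma(yi + 1) for yi in y)
            + sum(yi * math.log(ti) for yi, ti in zip(y, theta)))

def kernel(y, theta):
    return sum(yi * math.log(ti) for yi, ti in zip(y, theta))

y = [8, 1, 1]
d1 = multinomial_logpmf(y, [0.8, 0.1, 0.1]) - kernel(y, [0.8, 0.1, 0.1])
d2 = multinomial_logpmf(y, [0.5, 0.3, 0.2]) - kernel(y, [0.5, 0.3, 0.2])
assert abs(d1 - d2) < 1e-12  # the dropped term is identical for any theta
```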

My colleague performed some additional tests and found that for a high enough K, multithreading does give some marginal time gains, but we're talking hundreds of thousands here.


Hi. Is D, the number of alternatives, specific for your problem at hand? Or are you building some sort of production code?

Because if D is fixed, my suggestion is to decompose the closed-form multinomial likelihood calculation into the numerator and the denominator, which can be vectorized.

If D is not fixed, I am still trying to figure it out…

Good luck!

Hi @joaneelleouet. Thanks for your beautiful code. I have a problem with a nested for-loop when using reduce_sum.

If my code looks like this:

// N is the number of participants
// T is the number of trials per participant

for (i in 1:N) {
  for (t in 1:T) {
    X = alpha[i] * S[i,t] + beta[i] * O[i,t] - gamma[i] * P[i,t];
    Y = 0;

    Z[i,t] ~ bernoulli_logit(IT[i] * (X - Y));
  }
}

How could I use reduce_sum to speed up the code?

In other words, which part should I slice: the index of participants or the index of trials?

Thanks!

Start with the outer one… so i.

Thanks for your reply. But I still have no idea how to “split the participants”.

For example, I defined the total number of participants in the data block, like this:

data {
  int<lower=0> N;
  int<lower=0> T;
  // [skipped]
}

Now, I should slice the participants into an array, like this:

functions {
  real partial_sum_lpmf(array[] int participants_slice, int start, int end, ...[other parameter], int T) {
    real partial_log_lik = 0;
    for (i in 1:) { // I don't know how to loop in this part
      for (t in 1:T) {
        real X = alpha[i] * S[i,t] + beta[i] * O[i,t] - gamma[i] * P[i,t];
        real Y = 0;
        partial_log_lik += bernoulli_logit_lpmf(Z[i,t] | IT[i] * (X - Y));
      }
    }
    return partial_log_lik;
  }
}

I'm confused about the split of participants, could you give me some advice? Thanks!

Just think about how you can enumerate your likelihood terms from 1 to N (N being the number of terms) and then write a function that works with these indices. You do not need to slice over aspects of your data if that complicates your life; just use an arbitrary counting variable. Check out the Stan programs generated by brms when you add threading. There you can see how things can be coded with such a dummy index.

… once you have that function, you should be able to do (without reduce_sum):

target += my_function(1, N);

or

target += my_function(1, N/2) + my_function(N/2 + 1, N);

as a way to test if you got it right.
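That sanity check can be sketched in Python with a toy per-term function (hypothetical numbers, standing in for per-observation log-likelihood contributions) to see why the two expressions must agree:

```python
# Summing terms 1..N must equal summing (1..N/2) plus (N/2+1..N).
# my_function is a toy stand-in for the Stan partial-sum function.
def my_function(start, end, terms):
    return sum(terms[i - 1] for i in range(start, end + 1))

terms = [0.3, -1.2, 2.5, 0.7, -0.4, 1.1]  # made-up per-term contributions
N = len(terms)

full = my_function(1, N, terms)
split = my_function(1, N // 2, terms) + my_function(N // 2 + 1, N, terms)
assert abs(full - split) < 1e-12
```

If the two values disagree, the indexing inside the partial-sum function is off, which is exactly the bug reduce_sum would otherwise hide behind a confusing runtime error.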