# Parallelization in cmdstanr: using reduce_sum() on a multinomial likelihood

Hello,

I am currently working on a multinomial regression model with many covariates and a large dataset, and I would like to reduce runtime with within-chain parallelization. I initially thought the multinomial was a good candidate for reduce_sum(). However, I can't find a formulation that works: I got to the point where the model compiles properly, but the chains don't start and the error messages are more puzzling than helpful.
This is as far as I got, with a very simple example for clarity:

Stan code:

```stan
// multinomial.stan

functions {
  real partial_sum(int[,] y_slice,
                   int start,
                   int end,
                   vector[] theta) {
    real interm_sum = 0;
    for (i in start:end) {
      interm_sum += multinomial_lpmf(y_slice[i] | theta[i]);
    }
    return interm_sum;
  }
}

data {
  int<lower=0> Ny;
  int<lower=0> K;
  int<lower=0> Y_obs[K,Ny];
  int grainsize;
}

parameters {
  simplex[Ny] theta[K];
}

model {
  target += reduce_sum(partial_sum,
                       Y_obs,
                       grainsize,
                       theta);
}
```

R code:

```r
data <- t(rmultinom(5, 500, c(0.5, 0.3, 0.2)))

stan_data <- list(Ny = 3,
                  K = 5,
                  Y_obs = data,
                  grainsize = 1)

mod <- cmdstan_model("./multinomial.stan",
                     cpp_options = list(stan_threads = TRUE))

fit3 <- mod$sample(data = stan_data,
                   seed = 5446,
                   chains = 3,
                   parallel_chains = 3,
                   threads_per_chain = 2,
                   iter_warmup = 500,
                   iter_sampling = 500,
                   refresh = 100)
```

Here is part of the error message:

```
Running MCMC with 3 parallel chains, with 2 thread(s) per chain...

Chain 1 Unrecoverable error evaluating the log probability at the initial value.
Chain 1 Exception: Exception: array[uni, ...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 10, column 6 to column 60) (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 30, column 2 to line 33, column 32)
Chain 1 Exception: Exception: array[uni, ...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 10, column 6 to column 60) (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 30, column 2 to line 33, column 32)
Chain 1 Exception: Exception: array[uni, ...] index: accessing element out of range. index 2 out of range; expecting index to be between 1 and 1 (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 10, column 6 to column 60) (in '/tmp/RtmpVuSgYx/model-da9c97f8797.stan', line 30, column 2 to line 33, column 32)
```

The same error message is repeated for chains 2 and 3; I just truncated the output above.

I also tried hard-coding the multinomial log-likelihood equation instead of using the multinomial() function (this is what I do without parallelization because it allows for vectorisation), but I run into incompatibility issues with data and variable types.

Did anyone manage to find a workaround with multinomial_lpmf() or hard-coded log-likelihood components?

Thank you

Joane

y_slice will only contain the elements from start to end, re-indexed from 1. Thus, you need to adjust the index into y_slice accordingly (it should be i - start + 1).
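To make the index shift concrete, here is a tiny sketch in Python (not Stan; the data values are made up) mimicking how the sliced argument is re-indexed:

```python
# Sketch of the index shift inside partial_sum, using a dict so the
# 1-based indices match the Stan code.
y = {i: 10 * i for i in range(1, 6)}   # full data, global indices 1..5

start, end = 3, 5
# The slice covers global positions start..end but is re-indexed from 1:
y_slice = {local: y[g] for local, g in enumerate(range(start, end + 1), start=1)}

# Looping over the global index i in start:end, the sliced argument must
# therefore be accessed at the local position i - start + 1:
for i in range(start, end + 1):
    assert y_slice[i - start + 1] == y[i]
```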


Oh. That was an easy fix, thank you.
Updating the model with code that works (Updated after comment below):

```stan
// simplest multinomial

functions {
  real partial_sum(int[,] y_slice,
                   int start,
                   int end,
                   vector[] theta) {
    real interm_sum = 0;
    for (i in 1:(end - start + 1)) {
      interm_sum += multinomial_lpmf(y_slice[i] | theta[start + i - 1]);
    }
    return interm_sum;
  }
}

data {
  int<lower=0> Ny;
  int<lower=0> K;
  int<lower=0> Y_obs[K,Ny];
  int grainsize;
}

parameters {
  simplex[Ny] theta[K];
}

model {
  target += reduce_sum(partial_sum,
                       Y_obs,
                       grainsize,
                       theta);
}
```

Not quite. theta is not sliced; it is a shared argument that is passed as a whole, so it needs the index start + i - 1.

It would be nice if you could suggest how to improve the documentation. Many people make the same indexing mistakes.


Thank you @wds15.

I implemented both the multinomial model with reduce_sum() and the equivalent non-parallelized model. Whatever parameter I tweak, every parallelized run takes longer than the non-parallelized one. I tried:

• setting threads_per_chain to 1 → the highest computation time, far higher than the non-parallelized model
• increasing threads_per_chain from 2 to 10 → computation time increases with the number of threads; all times are higher than the non-parallelized example but lower than the run with threads_per_chain = 1
• testing grainsize, starting at K/N_cores (where K is the number of log-likelihood sum components), then iteratively halving it → computation time increases
• increasing K → computation time increases

To sanity-check that this wasn't a problem with my computing system, I ran the binomial example Reduce Sum: A Minimal Example (mc-stan.org) and got results similar to the ones reported there.

Could there be something about the multinomial distribution that makes parallelization gains impossible?

Cheers

Code:

```stan
// simplest_multinomial.stan

data {
  int<lower=0> D;
  int<lower=0> K;
  int<lower=0> Y[K,D];
}

parameters {
  simplex[D] theta[K];
}

model {
  for (k in 1:K) {
    target += multinomial_lpmf(Y[k] | theta[k]);
  }
}
```

```stan
// simplest_multinomial_redsum.stan

functions {
  real partial_sum(int[,] y_slice,
                   int start,
                   int end,
                   vector[] theta) {
    real interm_sum = 0;
    for (i in 1:(end - start + 1)) {
      interm_sum += multinomial_lpmf(y_slice[i] | theta[start + i - 1]);
    }
    return interm_sum;
  }
}

data {
  int<lower=0> D;
  int<lower=0> K;
  int<lower=0> Y[K,D];
  int grainsize;
}

parameters {
  simplex[D] theta[K];
}

model {
  target += reduce_sum(partial_sum,
                       Y,
                       grainsize,
                       theta);
}
```

R code to test a dummy example:

```r
library(cmdstanr)
library(posterior)

ncores = 4
options(mc.cores = ncores)

# create fake data
K <- 500
Y <- t(rmultinom(K, 100, c(0.8, 0.1, 0.1)))

stan_data <- list(D = 3,
                  K = K,
                  Y = Y)

# basic
mod0 <- cmdstan_model("./simplest_multinomial.stan")

time0 = system.time(
  fit0 <- mod0$sample(data = stan_data,
                      seed = 5446,
                      chains = 2,
                      parallel_chains = 2,
                      iter_warmup = 500,
                      iter_sampling = 500,
                      refresh = 100)
)

stan_data$grainsize <- 1

mod1 <- cmdstan_model("./simplest_multinomial_redsum.stan",
                      cpp_options = list(stan_threads = TRUE))

time1 = system.time(
  fit1 <- mod1$sample(data = stan_data,
                      seed = 5446,
                      chains = 2,
                      parallel_chains = 2,
                      threads_per_chain = 2,  # varied from 1 to 10 in the tests above
                      iter_warmup = 500,
                      iter_sampling = 500,
                      refresh = 100)
)

# compare
time1[["elapsed"]] / time0[["elapsed"]]
```

Can you explain why you changed the loop range in this part? Should it always start with `1`?

According to this fantastic tutorial for a binomial likelihood, the proper syntax inside the partial_sum function should just be:

```stan
return multinomial_lpmf(y_slice | theta[start:end]);
```

I don’t know if you need to use a for-loop at all.

@sonicking, it seems to be not so simple - see @wds15's reply above. The problem is that for distributions such as the `multinomial`, the minimal example is not that useful. It is often convenient to work with loops (usually `theta` is a complex function itself). Anyway, the problem that @joaneelleouet raises persists - I've used her code and obtained the same disappointing results. Still waiting for someone authoritative to check this case thoroughly. I also have no clue how this `start`, `end` logic works.
If you are able to refactor the code in the example to use vector notation, it would be much appreciated.

My apologies. I was thinking of a categorical distribution. Please disregard my previous reply.

@garej @sonicking Thanks for your interest in the issue.

@garej reduce_sum() resets the indexing for each slice. That is, a slice of y is taken from `start` to `end` but then re-indexed from 1 to the length of the slice, which is `end - start + 1`. theta is not sliced, so its original indexing is kept; hence `theta[start + i - 1]`.
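For anyone still puzzled by the `start`/`end` logic, here is a hypothetical simulation in Python (not Stan) of what reduce_sum() does with a sliced argument and a shared one; `partial_sum`, `grainsize`, and `theta` mirror the Stan names, and the data values are made up:

```python
import math

def partial_sum(y_slice, start, end, theta):
    # y_slice is re-indexed from 1 (0 in Python); theta is shared and
    # keeps its global indexing, accessed via start + i - 1.
    total = 0.0
    for i in range(1, end - start + 2):  # i = 1 .. end - start + 1
        total += y_slice[i - 1] * math.log(theta[(start + i - 1) - 1])
    return total

y = [2, 3, 5, 7, 11]                 # stand-in data, K = 5
theta = [0.1, 0.2, 0.3, 0.25, 0.15]  # stand-in parameters

# reduce_sum partitions 1..K into slices of roughly grainsize elements
# and sums the partial results.
grainsize = 2
partials = []
for start in range(1, len(y) + 1, grainsize):
    end = min(start + grainsize - 1, len(y))
    partials.append(partial_sum(y[start - 1:end], start, end, theta))

serial = sum(yi * math.log(t) for yi, t in zip(y, theta))
assert math.isclose(sum(partials), serial)
```

The assertion checks the key point: summing the slice-wise partial sums reproduces the serial loop no matter how 1..K is partitioned, which only works if the slice is indexed locally and the shared argument globally.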

@sonicking the fact that multinomial_lpmf() is not vectorised complicates things. In the standard Stan model, hard-coding the multinomial (dropping constant terms) makes it quite easy and more computationally efficient. However, moving to multithreading, the constraint that `y_slice` enters `partial_sum()` as an array made vectorisation impossible for me - I did not find a way to avoid the loop.

My colleague performed some additional tests and found that for a large enough K, multithreading does give some marginal time gains - but we're talking hundreds of thousands here.


Hi. Is D, the number of alternatives, specific to your problem at hand? Or are you building some sort of production code?

Because if D is fixed, my suggestion is to decompose the closed-form multinomial likelihood calculation into the numerator and the denominator, which can be vectorized.
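To illustrate the suggestion with a sketch in Python (not Stan; the numbers are made up): the multinomial log-pmf decomposes as log n! − Σ_d log y_d! + Σ_d y_d log θ_d, where the first two terms depend only on the data and can be precomputed once, leaving a plain dot product in θ:

```python
import math

def multinomial_logpmf(y, theta):
    # Full closed form: log n! - sum_d log y_d! + sum_d y_d log theta_d
    n = sum(y)
    return (math.lgamma(n + 1)
            - sum(math.lgamma(yd + 1) for yd in y)
            + sum(yd * math.log(td) for yd, td in zip(y, theta)))

y = [8, 1, 1]
theta = [0.8, 0.1, 0.1]

# Numerator/denominator term: a constant computed once from the data.
log_const = math.lgamma(sum(y) + 1) - sum(math.lgamma(yd + 1) for yd in y)

# Theta-dependent kernel: a dot product, which vectorizes across rows.
kernel = sum(yd * math.log(td) for yd, td in zip(y, theta))

assert math.isclose(log_const + kernel, multinomial_logpmf(y, theta))
```

Since the constant does not depend on θ, it can even be dropped entirely inside the sampler without changing the posterior, as mentioned above for the hard-coded likelihood.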

If D is not fixed, I am still trying to figure it out…

Good luck!