Multi-logit regression using reduce_sum

tl;dr I’m trying to use multiple threads when estimating a multi-logit regression and all my tries are slower than the Stan Users Guide implemented model, using 3 threads per chain and confirming that the parallelization is working.

Hello, all!

After reading some discussions about “reduce-sum” implementation, I understood that the slicing approach is effective in parameters vector/matrix and, in cases where it is not applicable, data slice is acceptable.

Based on the codes presented for a probit model and on the Minimal Example instructions, I was (or I think I was) able to write a code that runs 3 threads per chain but is slower than the one without parallelization.

As I’m forcing sum-to-zero constraint in my parameters, the updated multi-logi regression model without parallelization is

data {
  int K; // number of categories
  int N; // sample size
  int D; // design matrix column dimension (intercept + D-1 covariates)
  array[N] int y;
  matrix[N, D] x;
}

parameters {

  matrix [D, K-1] beta_raw; // Coefficients for each category K (except the reference)
  
}

transformed parameters {

  matrix [D, K] beta;
  
  for(d in 1:D){
  
    beta[d] = append_col(-sum(beta_raw[d]), beta_raw[d]); 
    
  }

}

model {

  matrix[N, K] eta= x * beta;
  
  to_vector(beta_raw) ~ normal(0, 5);
  
  for (n in 1:N) {

    target += categorical_logit_lpmf(y[n] | eta[n]');

  }
  
}

The fastest version I was able to create for a parallelized model calculate/construct linear predictor \eta matrix in the stan partial_sum function, according to what is shown in this answer:


functions{
  
  real partial_sum_lpmf(array[] int Y_slice,
                        int start, int end,
                        matrix beta,
                        matrix x,
                        int K){
    
    real lp = 0;
    
    int N = (end-start+1);
    

    matrix[N, K] eta_slice = x[start:end] * beta;
    
    for(i in 1:N)
      lp += categorical_logit_lupmf(Y_slice[i] | eta_slice[i]');
     
    return lp;
   
   }
  
}



data {
  int K;
  int N;
  int D;
  array[N] int y;
  matrix[N, D] x;
}

parameters {

  matrix [D, K-1] beta_raw; // Coefficients for each category K (except the reference)
  
}

transformed parameters {

  matrix [D, K] beta;
  
  for(d in 1:D){
  
    beta[d] = append_col(-sum(beta_raw[d]), beta_raw[d]); 
    
  }

}

model {

  to_vector(beta_raw) ~ normal(0, 5);
  
target += reduce_sum(partial_sum_lpmf, y, 1, beta, x, K); 
  
}

I’m running them on a server, so GPU capacity and number of cores are not a problem, but the reduced model is ~60% slower than the non reduced:

image

The estimated parameters are equal, which indicates that the parallelized code is making the right calculations, at least:

I’ve tried other parallelization approaches, but they were worse :/

image

The set of parameters I used for all runnings was this (with objects updated according to the models):

The codes and data simulation files are attached.
All kinds of help, comments, or suggestions about how to make the parallelized code the fastest would be appreciated :)

Thanks in advance!

P.S.: As an additional question, I saw much discussion about how to interpret parameters when the identifiability problem is fixed by forcing one of the \beta vector to be zero. Does anyone know how to interpret, or recover the original parameters that generated data?

Thanks so much!

non.stan (629 Bytes)
reduce.stan (981 Bytes)
reduce2.stan (924 Bytes)
reduce3.stan (936 Bytes)
reduce4.stan (950 Bytes)
Data.R (1.8 KB)

Sorry for the bad practice, but I’m tagging here @rok_cesnovar, @cwhittaker1000, and @andrjohns because you were part of the discussions and brought good solutions.

Thanks!

Have you looked into how many CPU are in use when running each iteration of the model, to check if the code is actually running with explicit parallelization?

Two suggestions:

  1. Have you tried specifying it in brms with the cmdstanr backend and seeing the code that it generates automatically for a reduce_sum model? This type of model is supported, I believe, and you could compare its code and results and speed to yours.
  2. Have you tried profiling to find the ideal grainsize for each of your reduce_sum threads? A poor choice can easily be two or three times slower than the best.

Good luck!

Hello, @Corey.Plate!

Yes, I checked the use of CPU’s and the code is running with parallelization:

Hello, @bachlaw, thanks for the suggestions!

I haven’t tried brms yet, but I will learn more about it and try to see what happens in the backend whe I run a categorical_logit model. I checked and this model is supported.

I didn’t do that, all the content I saw related to reduce_sum used grainsize=1 and I thought that was going to work fine.

When I get the answers I’ll come back with you ;)

Thanks!

Have you compared this CPU usage to the unparallelized model to ensure that it is, in fact, taking advantage of the explicit parallelization of reduce_sum? My own unparallelizd model ran on 28 CPU vs the 80 CPU when it was explicitly parallelized with reduce_sum and compiled with make STAN_THREADS=TRUE on cmdstan

Hello, @Corey.Plate!

I checked and compared both models. The unparallelized runs with a single thread for each chain and the parallelized with 3 threads per chain, as set in model$sample.

Was that your suggestion? I’m not sure if you meant that by “CPU”.

Thanks.

When I say CPU, I mean the actual number of CPU engaged by the parallelized version vs the unparallelized version. The more parallelized the code is, the more CPU it should theoretically use over the unparallelized version. This information would either come from your computer’s activity monitor, or, if you use a cluster, from the summary information about the batch.

The incentive for checking is to diagnose whether the code is indeed running with the explicit parallelization as it should be. If it is, then it’s a coding issue. If it’s not, then it’s a model compiling issue.