Multi-logit regression using reduce_sum

NicholasWE · December 24, 2023, 3:44pm

tl;dr I’m trying to use multiple threads when estimating a multi-logit regression and all my tries are slower than the Stan Users Guide implemented model, using 3 threads per chain and confirming that the parallelization is working.

Hello, all!

After reading some discussions about “reduce-sum” implementation, I understood that the slicing approach is effective in parameters vector/matrix and, in cases where it is not applicable, data slice is acceptable.

Based on the codes presented for a probit model and on the Minimal Example instructions, I was (or I think I was) able to write a code that runs 3 threads per chain but is slower than the one without parallelization.

As I’m forcing sum-to-zero constraint in my parameters, the updated multi-logi regression model without parallelization is

data {
  int K; // number of categories
  int N; // sample size
  int D; // design matrix column dimension (intercept + D-1 covariates)
  array[N] int y;
  matrix[N, D] x;
}

parameters {

  matrix [D, K-1] beta_raw; // Coefficients for each category K (except the reference)
  
}

transformed parameters {

  matrix [D, K] beta;
  
  for(d in 1:D){
  
    beta[d] = append_col(-sum(beta_raw[d]), beta_raw[d]); 
    
  }

}

model {

  matrix[N, K] eta= x * beta;
  
  to_vector(beta_raw) ~ normal(0, 5);
  
  for (n in 1:N) {

    target += categorical_logit_lpmf(y[n] | eta[n]');

  }
  
}

The fastest version I was able to create for a parallelized model calculate/construct linear predictor \eta matrix in the stan partial_sum function, according to what is shown in this answer:


functions{
  
  real partial_sum_lpmf(array[] int Y_slice,
                        int start, int end,
                        matrix beta,
                        matrix x,
                        int K){
    
    real lp = 0;
    
    int N = (end-start+1);
    

    matrix[N, K] eta_slice = x[start:end] * beta;
    
    for(i in 1:N)
      lp += categorical_logit_lupmf(Y_slice[i] | eta_slice[i]');
     
    return lp;
   
   }
  
}



data {
  int K;
  int N;
  int D;
  array[N] int y;
  matrix[N, D] x;
}

parameters {

  matrix [D, K-1] beta_raw; // Coefficients for each category K (except the reference)
  
}

transformed parameters {

  matrix [D, K] beta;
  
  for(d in 1:D){
  
    beta[d] = append_col(-sum(beta_raw[d]), beta_raw[d]); 
    
  }

}

model {

  to_vector(beta_raw) ~ normal(0, 5);
  
target += reduce_sum(partial_sum_lpmf, y, 1, beta, x, K); 
  
}

I’m running them on a server, so GPU capacity and number of cores are not a problem, but the reduced model is ~60% slower than the non reduced:

The estimated parameters are equal, which indicates that the parallelized code is making the right calculations, at least:

I’ve tried other parallelization approaches, but they were worse :/

The set of parameters I used for all runnings was this (with objects updated according to the models):

The codes and data simulation files are attached.
All kinds of help, comments, or suggestions about how to make the parallelized code the fastest would be appreciated :)

Thanks in advance!

P.S.: As an additional question, I saw much discussion about how to interpret parameters when the identifiability problem is fixed by forcing one of the \beta vector to be zero. Does anyone know how to interpret, or recover the original parameters that generated data?

Thanks so much!

non.stan (629 Bytes)
reduce.stan (981 Bytes)
reduce2.stan (924 Bytes)
reduce3.stan (936 Bytes)
reduce4.stan (950 Bytes)
Data.R (1.8 KB)

NicholasWE · December 26, 2023, 1:02pm

Sorry for the bad practice, but I’m tagging here @rok_cesnovar, @cwhittaker1000, and @andrjohns because you were part of the discussions and brought good solutions.

Thanks!

Corey.Plate · December 28, 2023, 9:53pm

Have you looked into how many CPU are in use when running each iteration of the model, to check if the code is actually running with explicit parallelization?

bachlaw · December 28, 2023, 11:20pm

Two suggestions:

Have you tried specifying it in brms with the cmdstanr backend and seeing the code that it generates automatically for a reduce_sum model? This type of model is supported, I believe, and you could compare its code and results and speed to yours.
Have you tried profiling to find the ideal grainsize for each of your reduce_sum threads? A poor choice can easily be two or three times slower than the best.

Good luck!

NicholasWE · December 30, 2023, 2:21am

Hello, @Corey.Plate!

Yes, I checked the use of CPU’s and the code is running with parallelization:

NicholasWE · December 30, 2023, 2:30am

Hello, @bachlaw, thanks for the suggestions!

I haven’t tried brms yet, but I will learn more about it and try to see what happens in the backend whe I run a categorical_logit model. I checked and this model is supported.

I didn’t do that, all the content I saw related to reduce_sum used grainsize=1 and I thought that was going to work fine.

When I get the answers I’ll come back with you ;)

Thanks!

Corey.Plate · December 31, 2023, 11:51pm

Have you compared this CPU usage to the unparallelized model to ensure that it is, in fact, taking advantage of the explicit parallelization of reduce_sum? My own unparallelizd model ran on 28 CPU vs the 80 CPU when it was explicitly parallelized with reduce_sum and compiled with make STAN_THREADS=TRUE on cmdstan

NicholasWE · January 4, 2024, 12:03am

Hello, @Corey.Plate!

I checked and compared both models. The unparallelized runs with a single thread for each chain and the parallelized with 3 threads per chain, as set in model$sample.

Was that your suggestion? I’m not sure if you meant that by “CPU”.

Thanks.

Corey.Plate · January 4, 2024, 12:54am

When I say CPU, I mean the actual number of CPU engaged by the parallelized version vs the unparallelized version. The more parallelized the code is, the more CPU it should theoretically use over the unparallelized version. This information would either come from your computer’s activity monitor, or, if you use a cluster, from the summary information about the batch.

The incentive for checking is to diagnose whether the code is indeed running with the explicit parallelization as it should be. If it is, then it’s a coding issue. If it’s not, then it’s a model compiling issue.

Topic		Replies	Views
Reduce_sum() no time saving for multilevel model Modeling techniques	1	453	December 24, 2020
Reduce_sum performance many threads Modeling	9	657	August 14, 2020
RStan parallelising using reduce_sum() Modeling	1	445	July 30, 2021
Parallelization in cmdstanr: using reduce_sum() on a multinomial likelihood Modeling cmdstanr , multinomial-response	14	1178	January 9, 2024
Help with reduce_sum Modeling	32	1431	August 4, 2020

Multi-logit regression using reduce_sum

Related topics