Trying to use multiple cores with reduce_sum, but it seems even less efficient than before

I am trying to run the Lee-Carter mortality model on a Core i7-12700 CPU. Because this CPU has 12 cores, I tried to rewrite the original model to use reduce_sum. Below, the first listing is the original model and the second is the model with reduce_sum added.
The original model is:

data {
  int<lower=0> T;                  // number of years
  int<lower=0> A;                  // number of age categories
  int<lower=0> Tf;                 // number of forecast years
  int<lower=0> death[A*T];        // deaths
  vector<lower=0>[A*T] expos;        // exposures
}
transformed data {
  vector[A*T] loge = log(expos);        // log exposures
}
parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b;                   // beta_x, strictly positive and sums to 1
  vector[T] ks;                 // vector of k_t (uncentered)
  vector[T] alpha_err;
  real<lower = 0> sigmasq;          
  real<lower = 0> tausq;
  real<lower=0> sigma_a;
}
transformed parameters  {
  vector[T] k;                   // vector of k_t (centered)
  vector[A*T] logm;
  vector[T] alpha;
  real<lower = 0> sigma = sqrt(sigmasq);
  real<lower = 0> tau = sqrt(tausq);
  real km;
  
  km = mean(ks);
  k = ks - km;
  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = a[i] + b[i] * k[j];
  }
  alpha[1] = alpha_err[1];
  for(t in 2:T){
    alpha[t] = alpha[t-1] + sigma_a * alpha_err[t];
  }
}
model {
  target += poisson_log_lpmf(death|logm+loge);
  target += normal_lpdf(ks|alpha,sigma);
  target += normal_lpdf(alpha_err[2:T]|0,1);
  target += normal_lpdf(a|0,10);
  target += dirichlet_lpdf(b|rep_vector(1, A));
  target += inv_gamma_lpdf(sigmasq| 0.001,0.001);  
  target += inv_gamma_lpdf(tausq| 0.001,0.001);
  target += student_t_lpdf(sigma_a|7, 0, 1);
}
generated quantities {
  vector[Tf] k_p;
  vector[Tf] alpha_p;
  vector[A*Tf] mfore; // predicted death rates
  vector[A*T] log_lik;
  vector[A*T] m = exp(logm);
  int pos1 = 1;
  alpha_p[1] = alpha[T] + sigma_a * normal_rng(0,1);
  for (t in 2:Tf){alpha_p[t] = alpha_p[t-1] + sigma_a * normal_rng(0,1);}
  for (t in 1:Tf){k_p[t] = alpha_p[t] - km  + sigma * normal_rng(0,1);}
  for (t in 1:Tf) for (i in 1:A) {
    mfore[pos1] = exp(a[i] + b[i] * k_p[t]);
    pos1 += 1;
  }
  for(i in 1:A*T){log_lik[i] = poisson_log_lpmf(death[i]|logm[i] + loge[i]);}
}

The model with reduce_sum added is:

functions {
  real partial_sum_lpmf(array[] int death,
                        int start, int end,
                        vector logm,
                        vector loge) {
    return poisson_log_lupmf(death | logm[start:end] +loge[start:end]);
  }
}
data {
  int<lower=0> T;                  // number of years
  int<lower=0> A;                  // number of age categories
  int<lower=0> Tf;                 // number of forecast years
  array[A*T] int<lower=0> death;        // deaths
  vector<lower=0>[A*T] expos;        // exposures
}
transformed data {
  vector[A*T] loge = log(expos);        // log exposures
}
parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b;                   // beta_x, strictly positive and sums to 1
  vector[T] ks;                
  vector[T] alpha_err;
  real<lower = 0> sigmasq;    
  real<lower = 0> tausq;
  real<lower=0> sigma_a;
}
transformed parameters  {
  vector[T] k;                   
  vector[A*T] logm;
  vector[T] alpha;
  real<lower = 0> sigma = sqrt(sigmasq);
  real<lower = 0> tau = sqrt(tausq);
  real km;
  
  km = mean(ks);
  k = ks - km;
  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = a[i] + b[i] * k[j];
  }
  alpha[1] = alpha_err[1];
  for(t in 2:T){
    alpha[t] = alpha[t-1] + sigma_a * alpha_err[t];
  }
}
model {
  int grainsize = 1;
  target += reduce_sum(partial_sum_lupmf, death, grainsize, logm, loge);
  //target += poisson_log_lpmf(death|logm+loge);
  target += normal_lpdf(ks|alpha,sigma);
  target += normal_lpdf(alpha_err[2:T]|0,1);
  target += normal_lpdf(a|0,10);
  target += dirichlet_lpdf(b|rep_vector(1, A));
  target += inv_gamma_lpdf(sigmasq| 0.001,0.001);  
  target += inv_gamma_lpdf(tausq| 0.001,0.001);
  target += student_t_lpdf(sigma_a|7, 0, 1);
}
generated quantities {
  vector[Tf] k_p;
  vector[Tf] alpha_p;
  vector[A*Tf] mfore; // predicted death rates
  vector[A*T] log_lik;
  vector[A*T] m = exp(logm);
  int pos1 = 1;
  alpha_p[1] = alpha[T] + sigma_a * normal_rng(0,1);
  for (t in 2:Tf){alpha_p[t] = alpha_p[t-1] + sigma_a * normal_rng(0,1);}
  for (t in 1:Tf){k_p[t] = alpha_p[t] - km  + sigma * normal_rng(0,1);}
  for (t in 1:Tf) for (i in 1:A) {
    mfore[pos1] = exp(a[i] + b[i] * k_p[t]);
    pos1 += 1;
  }
  for(i in 1:A*T){log_lik[i] = poisson_log_lpmf(death[i]|logm[i] + loge[i]);}
}

When I ran the original model, the CPU clock was around 4.3 GHz, utilization was around 50%, and it took about 700 seconds to draw 2,000 samples.
With the modified model I set threads_per_chain = 2 or 3; the CPU clock was around 3.7 GHz and utilization was close to 100%, but even after more than 1,400 seconds the 2,000 iterations were still not finished.
Maybe I have some misunderstanding about how reduce_sum should be set up. Have I coded this incorrectly?

A grainsize of 1 looks too small to me. The grainsize tells the library roughly how large the individual partial-sum terms should be. With just 1 you are asking for an element-by-element sum, which throws away vectorization. Try setting grainsize to roughly the number of data rows / (2 * threads_per_chain).
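
For concreteness, here is a minimal sketch of that rule of thumb written directly in the model block (the "3" assumes threads_per_chain = 3 and is purely illustrative):

model {
  // rule-of-thumb grainsize: number of data rows / (2 * threads_per_chain)
  int grainsize = (A * T) %/% (2 * 3);
  target += reduce_sum(partial_sum_lupmf, death, grainsize, logm, loge);
  // ... remaining priors as in the original model block ...
}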

… but looking at the model, you should also try the recently introduced optimizations that use SoA (struct-of-arrays) data structures instead of AoS (array-of-structs). That requires getting rid of the for loops and replacing them with vectorized expressions (which is a good thing anyway). Is it possible to write the second for loop as a cumulative sum? The AoS/SoA acronyms have been discussed on the forum recently and are mentioned in the Stan release notes (look for the memory-pattern debug option of stanc3).
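
For example, a minimal sketch of the cumulative-sum idea for the second for loop (the alpha random walk), using only the names already in your transformed parameters block:

transformed parameters {
  // alpha[1] = alpha_err[1]; alpha[t] = alpha[t-1] + sigma_a * alpha_err[t]
  // written as a single cumulative sum over the increments, no loop needed
  vector[T] alpha = cumulative_sum(append_row(alpha_err[1], sigma_a * alpha_err[2:T]));
}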

Thank you very much! The second loop is a re-parameterization I took from a tutorial a while ago to avoid the “transitions hit the maximum treedepth limit” warning (it doesn’t seem to work well, though; the warning still appears).
I have put the alpha-related calculations back into the model block, and the model now looks like this:

functions {
  real partial_sum_lpmf(array[] int death,
                        int start, int end,
                        vector logm,
                        vector loge) {
    return poisson_log_lupmf(death | logm[start:end] +loge[start:end]);
  }
}
data {
  int<lower=0> T;                  // number of years
  int<lower=0> A;                  // number of age categories
  int<lower=0> Tf;                 // number of forecast years
  array[A*T] int<lower=0> death;        // deaths
  vector<lower=0>[A*T] expos;        // exposures
}
transformed data {
  vector[A*T] loge = log(expos);        // log exposures
}
parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b;                   // beta_x, strictly positive and sums to 1
  vector[T] ks;                
  vector[T] alpha;
  real<lower = 0> sigma;    
  real<lower = 0> tau;
  real<lower=0> sigma_a;
}
transformed parameters  {
  vector[T] k;                   
  vector[A*T] logm;
  real km;
  
  km = mean(ks);
  k = ks - km;
  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = a[i] + b[i] * k[j];
  }
}
model {
  int grainsize = A*T%/%3;
  target += reduce_sum(partial_sum_lupmf, death, grainsize, logm, loge);
  //target += poisson_log_lpmf(death|logm+loge);
  target += normal_lpdf(ks|alpha,sigma);
  target += normal_lpdf(alpha[1]|0,sigma_a);
  target += normal_lpdf(alpha[2:T]|alpha[1:(T-1)],sigma_a);
  target += normal_lpdf(a|0,10);
  target += dirichlet_lpdf(b|rep_vector(1, A));
  target += student_t_lpdf(sigma|7, 0, 1);
  target += student_t_lpdf(tau|7, 0, 1);
  target += student_t_lpdf(sigma_a|7, 0, 1);
}
generated quantities {
  vector[Tf] k_p;
  vector[Tf] alpha_p;
  vector[A*Tf] mfore; // predicted death rates
  vector[A*T] log_lik;
  vector[A*T] m = exp(logm);
  int pos1 = 1;
  alpha_p[1] = alpha[T] + sigma_a * normal_rng(0,1);
  for (t in 2:Tf){alpha_p[t] = alpha_p[t-1] + sigma_a * normal_rng(0,1);}
  for (t in 1:Tf){k_p[t] = alpha_p[t] - km  + sigma * normal_rng(0,1);}
  for (t in 1:Tf) for (i in 1:A) {
    mfore[pos1] = exp(a[i] + b[i] * k_p[t]);
    pos1 += 1;
  }
  for(i in 1:A*T){log_lik[i] = poisson_log_lpmf(death[i]|logm[i] + loge[i]);}
}

I set the grainsize to one third of the total length and set threads_per_chain = 3, chains = 4, so that all 12 cores are assigned work (if my understanding of grainsize is not wrong again).
Now it only takes about 500 seconds to draw 2,000 samples. Really, thanks.

The important thing to also remember is that any additional arguments to the partial-sum function are copied in full to each parallel worker. This can cause a performance hit when large containers of parameters are being copied (copying int arrays, on the other hand, is much more efficient).

For your model, if you construct logm as an array[] real type, then you can pass it to reduce_sum as the argument to be sliced over, which could be more efficient:

functions {
  real partial_sum_lpdf(array[] real logm,
                        int start, int end,
                        array[] int death) {
    return poisson_log_lupmf(death[start:end] | logm);
  }
}
data {
  int<lower=0> T;                  // number of years
  int<lower=0> A;                  // number of age categories
  int<lower=0> Tf;                 // number of forecast years
  array[A*T] int<lower=0> death;        // deaths
  vector<lower=0>[A*T] expos;        // exposures
}
transformed data {
  vector[A*T] loge = log(expos);        // log exposures
}
parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b;                   // beta_x, strictly positive and sums to 1
  vector[T] ks;                
  vector[T] alpha_err;
  real<lower = 0> sigmasq;    
  real<lower = 0> tausq;
  real<lower=0> sigma_a;
}
transformed parameters  {
  vector[T] k;                   
  array[A*T] real logm;
  vector[T] alpha;
  real<lower = 0> sigma = sqrt(sigmasq);
  real<lower = 0> tau = sqrt(tausq);
  real km;
  
  km = mean(ks);
  k = ks - km;
  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = loge[A*(j-1)+i] + a[i] + b[i] * k[j];
  }
  alpha[1] = alpha_err[1];
  for(t in 2:T){
    alpha[t] = alpha[t-1] + sigma_a * alpha_err[t];
  }
}
model {
  int grainsize = 1;
  target += reduce_sum(partial_sum_lpdf, logm, grainsize, death);
  //target += poisson_log_lpmf(death|logm+loge);
  target += normal_lpdf(ks|alpha,sigma);
  target += normal_lpdf(alpha_err[2:T]|0,1);
  target += normal_lpdf(a|0,10);
  target += dirichlet_lpdf(b|rep_vector(1, A));
  target += inv_gamma_lpdf(sigmasq| 0.001,0.001);  
  target += inv_gamma_lpdf(tausq| 0.001,0.001);
  target += student_t_lpdf(sigma_a|7, 0, 1);
}

Thank you very much for your help. I have modified the model as you suggested, and it feels much faster than before.

functions {
  real partial_sum_lpmf(array[] int death,
                        int start, int end,
                        array[] real logm) {
    return poisson_log_lupmf(death | logm[start:end]);
  }
}
data {
  int<lower=0> T;                  // number of years
  int<lower=0> A;                  // number of age categories
  int<lower=0> Tf;                 // number of forecast years
  array[A*T] int<lower=0> death;        // deaths
  array[A*T] real<lower=0> expos;        // exposures
}
transformed data {
  array[A*T] real loge = log(expos);        // log exposures
}
parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b1;                   // beta_x, strictly positive and sums to 1
  simplex[A] b2;
  vector[T] ks;                 // vector of k_t (uncentered)
  vector[T+A-1] gs;
  vector[T] alpha_k;
  vector[T] delta_k;
  vector[T+A-1] alpha_g;
  vector[T+A-1] delta_g;
  real<lower = 0> sigma_k;          // standard deviation of the random walk
  real<lower = 0> sigma_g; 
  real<lower = 0> tau;
  real<lower=0> sigmak_a;
  real<lower=0> sigmak_d;
  real<lower=0> sigmag_a;
  real<lower=0> sigmag_d;
}
transformed parameters  {
  vector[T] k;   
  vector[T+A-1] g;
  array[A*T] real logm;
  real km;
  real gm;
  
  km = mean(ks);
  k = ks - km;
  gm = mean(gs);
  g = gs - gm;
  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = loge[A*(j-1)+i] + a[i] + b1[i] * k[j] + b2[i] * g[j-i+A];
  }
}
model {
  int grainsize = A*T%/%3;
  target += reduce_sum(partial_sum_lupmf, death, grainsize, logm);
  target += normal_lpdf(ks|alpha_k,sigma_k);
  target += normal_lpdf(alpha_k[1]|delta_k[1],sigmak_a);
  target += normal_lpdf(alpha_k[2:T]|alpha_k[1:(T-1)]+delta_k[2:T],sigmak_a);
  target += normal_lpdf(delta_k[1]|0,sigmak_d);
  target += normal_lpdf(delta_k[2:T]|delta_k[1:(T-1)],sigmak_d);
  target += normal_lpdf(gs|alpha_g,sigma_g);
  target += normal_lpdf(alpha_g[1]|delta_g[1],sigmag_a);
  target += normal_lpdf(alpha_g[2:(T+A-1)]|alpha_g[1:(T+A-2)]+delta_g[2:(T+A-1)],sigmag_a);
  target += normal_lpdf(delta_g[1]|0,sigmag_d);
  target += normal_lpdf(delta_g[2:(T+A-1)]|delta_g[1:(T+A-2)],sigmag_d);
  target += normal_lpdf(a|0,10);
  target += dirichlet_lpdf(b1|rep_vector(1, A));
  target += dirichlet_lpdf(b2|rep_vector(1, A));
  target += student_t_lpdf(sigma_k|7, 0, 1);
  target += student_t_lpdf(sigma_g|7, 0, 1);
  target += student_t_lpdf(tau|7, 0, 1);
  target += student_t_lpdf(sigmak_a|7, 0, 1);
  target += student_t_lpdf(sigmak_d|7, 0, 1);
  target += student_t_lpdf(sigmag_a|7, 0, 1);
  target += student_t_lpdf(sigmag_d|7, 0, 1);
}
generated quantities {
  vector[Tf] k_p;
  vector[Tf] alpha_k_p;
  vector[Tf] delta_k_p;
  vector[Tf] g_p;
  vector[T+Tf+A-1] g_f;
  vector[Tf] alpha_g_p;
  vector[Tf] delta_g_p;
  vector[A*Tf] mfore; // predicted death rates
  vector[A*T] log_lik;
  array[A*T] real m;
  int pos1 = 1;
  delta_k_p[1] = delta_k[T] + sigmak_d * normal_rng(0,1);
  alpha_k_p[1] = alpha_k[T] + delta_k_p[1] + sigmak_a * normal_rng(0,1);
  for (t in 2:Tf){
    delta_k_p[t] = delta_k_p[t-1] + sigmak_d * normal_rng(0,1);
    alpha_k_p[t] = alpha_k_p[t-1] + delta_k_p[t] + sigmak_a * normal_rng(0,1);
  }
  for (t in 1:Tf){k_p[t] = alpha_k_p[t] - km + sigma_k * normal_rng(0,1);}
  delta_g_p[1] = delta_g[T+A-1] + sigmag_d * normal_rng(0,1);
  alpha_g_p[1] = alpha_g[T+A-1] + delta_g_p[1] + sigmag_a * normal_rng(0,1);
  for (t in 2:Tf){
    delta_g_p[t] = delta_g_p[t-1] + sigmag_d * normal_rng(0,1);
    alpha_g_p[t] = alpha_g_p[t-1] + delta_g_p[t] + sigmag_a * normal_rng(0,1);
  }
  for (t in 1:Tf){g_p[t] = alpha_g_p[t] - gm + sigma_g * normal_rng(0,1);}
  g_f = append_row(g, g_p);
  for (t in 1:Tf) for (i in 1:A) {
    mfore[pos1] = exp(a[i] + b1[i] * k_p[t] + b2[i] * g_f[T+t-i+A]);
    pos1 += 1;
  }
  for(i in 1:A*T){
    m[i] = exp(logm[i] - loge[i]);
    log_lik[i] = poisson_log_lpmf(death[i]|logm[i]);
  }
}

I am now trying to resolve the warning messages below.

Warning: 113 of 40000 (0.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.

Warning: 39776 of 40000 (99.0%) transitions hit the maximum treedepth limit of 10.
See https://mc-stan.org/misc/warnings for details.

It seems as if including a local level or local trend component in a Stan model inevitably triggers the “hit the maximum treedepth limit” warning. I have tried to replace the for loops in the model with vectorized expressions as much as possible, and I have tried re-parameterization before, but that didn’t stop the warning. It seems the only way to stop it is to increase max_treedepth and adapt_delta, but that makes the sampling too slow.

Copies are done once per thread… which is often efficient enough. I also wanted to suggest adding in the exposures up front, but I am not sure it is worth the effort since they are data!

The model has a lot of latent structure and weak priors. The treedepth problems suggest the sampler spends a lot of time exploring different parameter configurations. Flexible models can be difficult to fit.

However, it may be possible to remove some problematic latent structure without losing too much flexibility.

Subtracting the mean removes its effect on the posterior probability, but the sampler will still treat it like a parameter that needs to be estimated.
Sometimes it’s better to declare ks with one fewer component and build a vector whose mean is guaranteed to be zero (the entries telescope, so they always sum to zero), for example

parameters {
  vector[T-1] ks;
}
transformed parameters  {
  vector[T] k = append_row(ks, 0.0) - append_row(0.0, ks); 
}

Do you need these latent random walks? This code roughly says that the difference between alpha_k[i] and alpha_k[i+1] is about the same as the difference between alpha_k[i-1] and alpha_k[i], but allows the actual difference to slightly[1] decouple from the “latent trend” delta_k[i]. Even alpha_k is not known exactly and must be estimated, so delta_k is quite disconnected from the data and difficult to make inferences for.
If you do want lots of latent parameters then you need stronger priors for them.

An alternative would be to omit delta_k and just use the actual difference in alpha_k as the local trend.

target += normal_lpdf(alpha_k[3:T]| alpha_k[2:T-1] + (alpha_k[2:T-1] - alpha_k[1:T-2]), sigma);

And even alpha_k itself just tracks k, so I’m not sure it does anything other than add noise with tricky correlations.

So, here’s a model that I believe is qualitatively similar but has a much more tractable parameter structure.

parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b1;                   // beta_x, strictly positive and sums to 1
  simplex[A] b2;
  vector[T-1] ks;                 // vector of k_t differences
  vector[T+A-2] gs;
  real<lower = 0> sigma_k;          // standard deviation of the random walk
  real<lower = 0> sigma_g; 
}
transformed parameters  {
  vector[T] k = append_row(ks, 0.0) - append_row(0.0, ks); 
  vector[T+A-1] g = append_row(gs, 0.0) - append_row(0.0, gs);
  array[A*T] real logm;

  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = loge[A*(j-1)+i] + a[i] + b1[i] * k[j] + b2[i] * g[j-i+A];
  }
}
model {
  int grainsize = A*T%/%3;
  target += reduce_sum(partial_sum_lupmf, death, grainsize, logm);

  // k[1:2], g[1:2] ~ improper flat priors; or whatever
  target += normal_lpdf(k[3:T]| 2*k[2:T-1] - k[1:T-2], sigma_k);
  target += normal_lpdf(g[3:T+A-1]| 2*g[2:T+A-2] - g[1:T+A-3], sigma_g );

  target += normal_lpdf(a|0,10);
  target += dirichlet_lpdf(b1|rep_vector(1, A));
  target += dirichlet_lpdf(b2|rep_vector(1, A));

  target += student_t_lpdf(sigma_k|7, 0, 1);
  target += student_t_lpdf(sigma_g|7, 0, 1);
}

  1. or a lot, if sigmak_a is large, which is also allowed by the priors…

Thanks for the help. I took the model structure for the ks and gs from chapter 12 of the book “Bayesian Demographic Estimation and Forecasting”; that book uses not Stan but a different package called “demest”. I am ashamed to have tried to use it without understanding what these structures really mean.
Anyway, thanks, and I will try to modify my model today as you suggest.