Slow Stan Code

Hello everyone. I am trying to fit a model. This is the Stan code produced by the make_stan function. However, the model takes hours to run, and even in simpler versions of the model the effective sample size is low. The model I am trying to fit is the following:

log1p(Number of units sold) = 0 + (1 + log1p(Price) + log1p(Distribution) + log1p(Distribution on Display) + log1p(Distribution on Leaflet) + log1p(Distribution Leaflet + Display) | Brand)

  • fixed effects of all these variables.
functions { 
} 
data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  // data for group-level effects of ID 1
  int<lower=1> J_1[N];
  int<lower=1> N_1;
  int<lower=1> M_1;
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  vector[N] Z_1_5;
  vector[N] Z_1_6;
  int<lower=1> NC_1;
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
} 
parameters { 
  vector[K] b;  // population-level effects 
  real<lower=0> sigma;  // residual SD 
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // unscaled group-level effects
  // cholesky factor of correlation matrix
  cholesky_factor_corr[M_1] L_1;
} 
transformed parameters { 
  // group-level effects 
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
  vector[N_1] r_1_3 = r_1[, 3];
  vector[N_1] r_1_4 = r_1[, 4];
  vector[N_1] r_1_5 = r_1[, 5];
  vector[N_1] r_1_6 = r_1[, 6];
} 
model { 
  vector[N] mu = X * b; 
  for (n in 1:N) { 
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n] + r_1_5[J_1[n]] * Z_1_5[n] + r_1_6[J_1[n]] * Z_1_6[n];
  } 
  // priors including all constants 
  target += normal_lpdf(b[1] | 0.5,0.5); 
  target += normal_lpdf(b[3] | 0.5,0.5); 
  target += normal_lpdf(b[4] | 0.5,0.5); 
  target += normal_lpdf(b[5] | 0.5,0.5); 
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += inv_gamma_lpdf(sd_1 | .5,.5); 
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma); 
  } 
} 
generated quantities { 
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // take only relevant parts of correlation matrix
  cor_1[1] = Cor_1[1,2]; 
  cor_1[2] = Cor_1[1,3]; 
  cor_1[3] = Cor_1[2,3]; 
  cor_1[4] = Cor_1[1,4]; 
  cor_1[5] = Cor_1[2,4]; 
  cor_1[6] = Cor_1[3,4]; 
  cor_1[7] = Cor_1[1,5]; 
  cor_1[8] = Cor_1[2,5]; 
  cor_1[9] = Cor_1[3,5]; 
  cor_1[10] = Cor_1[4,5]; 
  cor_1[11] = Cor_1[1,6]; 
  cor_1[12] = Cor_1[2,6]; 
  cor_1[13] = Cor_1[3,6]; 
  cor_1[14] = Cor_1[4,6]; 
  cor_1[15] = Cor_1[5,6]; 
} 

You can find a subset of the dataset here:
https://expirebox.com/download/ec988977cd40671e7916aed7e02e1b5a.html

[edit: escaped code]

There is no prior on b[2]. The same happens for b[6], …, b[K] when K > 5. Conversely, if K < 5, the priors on b[K+1], …, b[5] refer to components that don't exist.

This doesn’t look good either:
target += student_t_lpdf(sigma | 3, 0, 10) - 1 * student_t_lccdf(0 | 3, 0, 10);
Why the pdf minus the lower-tail ccdf? You might go with the simpler sigma ~ student_t(3, 0, 10);

target += lkj_corr_cholesky_lpdf(L_1 | 1);
You might try values greater than 1 first.

target += inv_gamma_lpdf(sd_1 | .5,.5);
Not wrong, but note there is no sqrt on sd_1.

Should it be like that?

data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> J_1[N];
  int<lower=1> N_1;
  int<lower=1> M_1;
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  vector[N] Z_1_5;
  vector[N] Z_1_6;
  int<lower=1> NC_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  vector[K] b;  // population-level effects
  real<lower=0> sigma;  // residual SD
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // unscaled group-level effects
  // cholesky factor of correlation matrix
  cholesky_factor_corr[M_1] L_1;
}
transformed parameters {
  // group-level effects
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
  vector[N_1] r_1_3 = r_1[, 3];
  vector[N_1] r_1_4 = r_1[, 4];
  vector[N_1] r_1_5 = r_1[, 5];
  vector[N_1] r_1_6 = r_1[, 6];
}
model {
  vector[N] mu = X * b;
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n] + r_1_5[J_1[n]] * Z_1_5[n] + r_1_6[J_1[n]] * Z_1_6[n];
  }
  // priors including all constants
  target += normal_lpdf(b[1] | 0.5,0.5);
  target += normal_lpdf(b[2] | -0.5,0.5);
  target += normal_lpdf(b[3] | 0.5,0.5);
  target += normal_lpdf(b[4] | 0.5,0.5);
  target += normal_lpdf(b[5] | 0.5,0.5);
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10);
  target += inv_gamma_lpdf(sd_1 | .5,.5);
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_lpdf(Y | mu, sigma);
  }
}
generated quantities {
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // take only relevant parts of correlation matrix
  cor_1[1] = Cor_1[1,2];
  cor_1[2] = Cor_1[1,3];
  cor_1[3] = Cor_1[2,3];
  cor_1[4] = Cor_1[1,4];
  cor_1[5] = Cor_1[2,4];
  cor_1[6] = Cor_1[3,4];
  cor_1[7] = Cor_1[1,5];
  cor_1[8] = Cor_1[2,5];
  cor_1[9] = Cor_1[3,5];
  cor_1[10] = Cor_1[4,5];
  cor_1[11] = Cor_1[1,6];
  cor_1[12] = Cor_1[2,6];
  cor_1[13] = Cor_1[3,6];
  cor_1[14] = Cor_1[4,6];
  cor_1[15] = Cor_1[5,6];
}

You added:
target += normal_lpdf(b[2] | -0.5,0.5);
Is that your belief about b[2]?

I’m also not clear: what do you do with N_1 and M_1?

I’m sure the Stan program requires some more rework.
Can you post the calling routine?

I know for sure that b[2] should be negative; I want to avoid positive values. What do you mean by the calling routine?

edit: I use brm to run my code

Stan provides several interfaces: PyStan, RStan, CmdStan, and MatlabStan.
Since you are using brms, I assume you are using R.

I don’t recommend using make_stan in your case. Try to fit your model directly with a call to
brm.
I’m not familiar with the syntax of brm, but the documentation of the brms package
is very good. If you still struggle, Paul-Buerkner might help you out. Tag brms in your
message and refer to him.

I have tried so many different combinations and it is still super slow. That’s why I am trying to write the Stan code on my own.

In addition to what @andre.pfeuffer wrote, which absolutely has to be taken care of for the model to work (the posterior must be proper), I have a few more comments on how to speed it up.

There’s a chapter on efficiency in the manual. It’s partly a matter of making the iterations fast and partly making the mixing fast.

One thing to do is drop the target += form, which adds all the normalizing constants.
The other thing to do is vectorize. Combining these two, you’d replace the four target += statements with:

b ~ normal(0.5, 0.5);

Drop the 1 * as it’s not doing anything.

lkj_corr_cholesky is also a no-op with eta = 1. You probably want to have some normalization there.
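
Putting those points together, a sketch of the priors written in sampling-statement form (eta = 2 is just one value greater than 1, in the spirit of the earlier suggestion, not a specific recommendation):

b ~ normal(0.5, 0.5);            // replaces the four target += statements for b
sigma ~ student_t(3, 0, 10);     // constant truncation term not needed in the ~ form
sd_1 ~ inv_gamma(0.5, 0.5);
to_vector(z_1) ~ normal(0, 1);
L_1 ~ lkj_corr_cholesky(2);      // eta > 1 regularizes the correlations toward zero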

I’m not sure why you’re stripping out parts of the correlation matrix, but you can do this more easily as

vector<lower = -1, upper = 1> corr[15]
  = { Cor_1[1, 2], Cor_1[1, 3], ...... };

without having to worry about indexing.

You probably want to put the calculation of mu down in the scope of !prior_only as that’s where it’s used.

I don’t follow the calculation of mu here or why you’re adding all those things to it. That could all be vectorized down to something like

vector[N] mu
  = X * b
    + r_1_1[J_1] .* Z_1_1
    + ...;
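
Filling in the remaining terms from the original model block, and folding in the point above about prior_only, a sketch of how that could look:

if (!prior_only) {
  vector[N] mu
    = X * b
      + r_1_1[J_1] .* Z_1_1
      + r_1_2[J_1] .* Z_1_2
      + r_1_3[J_1] .* Z_1_3
      + r_1_4[J_1] .* Z_1_4
      + r_1_5[J_1] .* Z_1_5
      + r_1_6[J_1] .* Z_1_6;
  Y ~ normal(mu, sigma);
}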

The transformed parameters can be defined as local variables in the model block so they’re not saved. Presumably you only need r_1 itself rather than six named copies of its columns:

matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';

Then you access column k as r_1[, k] rather than r_1_k.

Thank you for your answers. I tried to improve my code based on your comments and this is the new version:

data {
  int<lower=1> N; 
  real npack[N]; 
  matrix[N, 4] X; 
  int<lower=1> J; 
  // int<lower=1> K; //number of nunits
  int<lower=1, upper=J> brandname[N];
  // int<lower=1, upper=K> nunit[N]; 
}

parameters {
  vector[4] beta; //intercept and slope
  real<lower=0> sigma_e; 
  vector<lower=0>[4] sigma_u; 
  cholesky_factor_corr[4] L_u;
  matrix[4,J] z_u;
}

transformed parameters{
  matrix[4,J] u;

  u = diag_pre_multiply(sigma_u, L_u) * z_u; //brandname random effects

}

model {
  vector [N] mu = X * beta;
  //priors
  L_u ~ lkj_corr_cholesky(3.0);
  to_vector(z_u) ~ normal(0,1);
  sigma_u  ~ student_t( 3, 0, 10);
  beta ~ normal(0, 5);
  //likelihood
}

It’s much faster now. However, I had more than 3800 divergent transitions out of 4000 samples. The n_eff of most variables is really low. I ran the code with max_treedepth = 15 and adapt_delta = 0.99.

The new model already shows some progress. Good. However,
please read what @Bob_Carpenter wrote carefully. Compare your model with the model
in the Stan reference manual, section 9.13, especially pp. 152/153. That model comes close
to yours.

You omitted the log-likelihood, something like:
y ~ normal(rows_dot_product(beta[jj] , x), sigma);

In the last model, u is disconnected from mu. This has to be fixed.
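
As a rough sketch only (reusing the names from the model posted above, and assuming the brand effects in u are meant to act as brand-specific deviations of all four slopes), the model block could look like:

model {
  vector[N] mu = X * beta;
  for (n in 1:N)
    mu[n] += X[n] * u[, brandname[n]];   // add the brand-specific deviations
  // priors
  L_u ~ lkj_corr_cholesky(3.0);
  to_vector(z_u) ~ normal(0, 1);
  sigma_u ~ student_t(3, 0, 10);
  beta ~ normal(0, 5);
  // likelihood
  npack ~ normal(mu, sigma_e);
}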

I have one more concern about your data. Some of it shows a bathtub curve. To get the best
fit, a transform may be needed. Just keep that in mind.

So, should I add something like this:
y ~ normal(mu, sigma_e)?

When y is the dependent variable, e.g. the transform of your price, then a
measure of the quality of the fit is needed.

Thank you! I will try to run everything again and I will post again.

I changed the code to be similar to that on pages 152/153. However, I don’t have group-level regressors. I only want to estimate different betas for each level without using regressors. I am using the code below, but it is still too slow. Should I also define a prior for beta similar to that of gamma?

data {
  int<lower=0> N; // num individuals
  int<lower=1> K; // num ind predictors
  int<lower=1> J; // num groups
  // int<lower=1> L; // num group predictors
  int<lower=1,upper=J> jj[N]; // group for individual
  matrix[N, K] x; // individual predictors
  // row_vector[L] u[J]; // group predictors
  vector[N] y; // outcomes
}
parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0,upper=pi()/2>[K] tau_unif;
  // matrix[L, K] gamma; // group coeffs
  real<lower=0> sigma; // prediction error scale
}
transformed parameters {
  matrix[J, K] beta;
  vector<lower=0>[K] tau; // prior scale
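  // tau_unif is implicitly uniform(0, pi/2), so 2.5 * tan(tau_unif) gives tau a half-Cauchy(0, 2.5) prior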
  for (k in 1:K) tau[k] = 2.5 * tan(tau_unif[k]);
  // beta = u * gamma + (diag_pre_multiply(tau,L_Omega) * z);
   beta =   (diag_pre_multiply(tau,L_Omega) * z)';
}
model {
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(2);
// to_vector(gamma) ~ normal(0, 5);
y ~ normal(rows_dot_product(beta[jj] , x), sigma);
}
data {
  int<lower=0> N; // num individuals
  int<lower=1> K; // num ind predictors
  int<lower=1> J; // num groups
  // int<lower=1> L; // num group predictors
  int<lower=1,upper=J> jj[N]; // group for individual
  matrix[N, K] x; // individual predictors
  // row_vector[L] u[J]; // group predictors
  vector[N] y; // outcomes
}
parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] tau; // prior scale
  real<lower=0> sigma; // prediction error scale
}
transformed parameters {
  matrix[J, K] beta =   (diag_pre_multiply(tau,L_Omega) * z)';
}
model {
  sigma ~ normal(0, 1);
  tau ~ exponential(1.0);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(2);
  y ~ normal(rows_dot_product(beta[jj] , x), sigma);
}

Since you have enough data, we try something easy first. If it is still too slow, we have to change
the beta matrix to an array. However, Stan will be nowhere near as fast as a standard regression model.
NUTS in Stan explores the whole posterior distribution, not only a point estimate. This comes at
a price.

The dataset is not that big. It has ~26000 observations. I will try the new code and I will let you know. Thank you again for your help. But why was the code I ran this morning so fast? Did I make a mistake or ignore something important? (I am talking about the code in the 8th comment.)

There was no likelihood. Informally speaking, there was nothing to optimize.

The speed problem is being addressed: parallel CPU and GPU support is on its way.
You may also check out the GNU R lme4 package or R-INLA, which have a different focus and shorter
overall runtime.

Still running. Could it be because of the data? I have a lot of 0s for some regressors…

Also, even when I included y, the code was still really fast.
In the code above I have defined priors for the betas, but in the code from the book there is no prior for beta. How could I define a prior for the mean of the betas?

Yes, that’s Andrew’s Folk Theorem. If the model is misspecified for the data, HMC in particular is going to struggle with slow iterations as it properly reconciles mismatched priors and data.

Do any of you have the time to try to run it with these data?
https://expirebox.com/download/5fc702bab44ff01ebb98f4c04aee167f.html

Maybe I am doing something wrong when I call Stan. I just want to have brand-specific parameters for the regressors (my response variable is the number of packages).