Slow Stan Code

Hello everyone. I am trying to fit a model. This is the Stan code produced by the make_stan function. However, the model takes hours to run, and even in simpler versions of the model the effective sample size is low. The model I am trying to fit is the following:

log1p(Number of units sold) = 0 + (1 + log1p(Price) + log1p(Distribution) + log1p(Distribution on Display) + log1p(Distribution on Leaflet) + log1p(Distribution Leaflet + Display) | Brand)

  • fixed effects of all these variables.
functions { 
} 
data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  // data for group-level effects of ID 1
  int<lower=1> J_1[N];
  int<lower=1> N_1;
  int<lower=1> M_1;
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  vector[N] Z_1_5;
  vector[N] Z_1_6;
  int<lower=1> NC_1;
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
} 
parameters { 
  vector[K] b;  // population-level effects 
  real<lower=0> sigma;  // residual SD 
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // unscaled group-level effects
  // cholesky factor of correlation matrix
  cholesky_factor_corr[M_1] L_1;
} 
transformed parameters { 
  // group-level effects 
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
  vector[N_1] r_1_3 = r_1[, 3];
  vector[N_1] r_1_4 = r_1[, 4];
  vector[N_1] r_1_5 = r_1[, 5];
  vector[N_1] r_1_6 = r_1[, 6];
} 
model { 
  vector[N] mu = X * b; 
  for (n in 1:N) { 
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n] + r_1_5[J_1[n]] * Z_1_5[n] + r_1_6[J_1[n]] * Z_1_6[n];
  } 
  // priors including all constants 
  target += normal_lpdf(b[1] | 0.5,0.5); 
  target += normal_lpdf(b[3] | 0.5,0.5); 
  target += normal_lpdf(b[4] | 0.5,0.5); 
  target += normal_lpdf(b[5] | 0.5,0.5); 
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += inv_gamma_lpdf(sd_1 | .5,.5); 
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma); 
  } 
} 
generated quantities { 
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // take only relevant parts of correlation matrix
  cor_1[1] = Cor_1[1,2]; 
  cor_1[2] = Cor_1[1,3]; 
  cor_1[3] = Cor_1[2,3]; 
  cor_1[4] = Cor_1[1,4]; 
  cor_1[5] = Cor_1[2,4]; 
  cor_1[6] = Cor_1[3,4]; 
  cor_1[7] = Cor_1[1,5]; 
  cor_1[8] = Cor_1[2,5]; 
  cor_1[9] = Cor_1[3,5]; 
  cor_1[10] = Cor_1[4,5]; 
  cor_1[11] = Cor_1[1,6]; 
  cor_1[12] = Cor_1[2,6]; 
  cor_1[13] = Cor_1[3,6]; 
  cor_1[14] = Cor_1[4,6]; 
  cor_1[15] = Cor_1[5,6]; 
} 

You can find a subset of the dataset here:
https://expirebox.com/download/ec988977cd40671e7916aed7e02e1b5a.html

[edit: escaped code]

There is no prior on b[2]. The same happens for b[6], …, b[K] when K > 5. Conversely, if K < 5, the priors on b[K+1], …, b[5] refer to components that don't exist.

This doesn’t look good either:
target += student_t_lpdf(sigma | 3, 0, 10) - 1 * student_t_lccdf(0 | 3, 0, 10);
Why the pdf minus the lower-tail ccdf? You might go with the simpler sigma ~ student_t(3, 0, 10);

target += lkj_corr_cholesky_lpdf(L_1 | 1);
You might try values greater than 1 first.

target += inv_gamma_lpdf(sd_1 | .5,.5);
Not wrong, but note there is no sqrt on sd_1.

Should it be like that?

data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> J_1[N];
  int<lower=1> N_1;
  int<lower=1> M_1;
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  vector[N] Z_1_5;
  vector[N] Z_1_6;
  int<lower=1> NC_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  vector[K] b;  // population-level effects
  real<lower=0> sigma;  // residual SD
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // unscaled group-level effects
  // cholesky factor of correlation matrix
  cholesky_factor_corr[M_1] L_1;
}
transformed parameters {
  // group-level effects
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
  vector[N_1] r_1_3 = r_1[, 3];
  vector[N_1] r_1_4 = r_1[, 4];
  vector[N_1] r_1_5 = r_1[, 5];
  vector[N_1] r_1_6 = r_1[, 6];
}
model {
  vector[N] mu = X * b;
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n] + r_1_5[J_1[n]] * Z_1_5[n] + r_1_6[J_1[n]] * Z_1_6[n];
  }
  // priors including all constants
  target += normal_lpdf(b[1] | 0.5,0.5);
  target += normal_lpdf(b[2] | -0.5,0.5);
  target += normal_lpdf(b[3] | 0.5,0.5);
  target += normal_lpdf(b[4] | 0.5,0.5);
  target += normal_lpdf(b[5] | 0.5,0.5);
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10);
  target += inv_gamma_lpdf(sd_1 | .5,.5);
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_lpdf(Y | mu, sigma);
  }
}
generated quantities {
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // take only relevant parts of correlation matrix
  cor_1[1] = Cor_1[1,2];
  cor_1[2] = Cor_1[1,3];
  cor_1[3] = Cor_1[2,3];
  cor_1[4] = Cor_1[1,4];
  cor_1[5] = Cor_1[2,4];
  cor_1[6] = Cor_1[3,4];
  cor_1[7] = Cor_1[1,5];
  cor_1[8] = Cor_1[2,5];
  cor_1[9] = Cor_1[3,5];
  cor_1[10] = Cor_1[4,5];
  cor_1[11] = Cor_1[1,6];
  cor_1[12] = Cor_1[2,6];
  cor_1[13] = Cor_1[3,6];
  cor_1[14] = Cor_1[4,6];
  cor_1[15] = Cor_1[5,6];
}

You added:
target += normal_lpdf(b[2] | -0.5,0.5);
Is that your belief about b[2]?

I’m also not clear: what do you do with N_1 and M_1?

I’m sure the Stan program requires some more rework.
Can you post the calling routine?

I know for sure that b[2] should be negative; I want to avoid positive values. What do you mean by the calling routine?

edit: I use brm to run my code

Stan provides several interfaces: PyStan, RStan, CmdStan, and MatlabStan.
Since you are using brms, I assume you are using R.

I don’t recommend using make_stan in your case. Try to fit your model directly with a call to
brm.
I’m not familiar with the syntax of brm, but the documentation of the brms package
is very good. If you still struggle, Paul-Buerkner might help you out. Tag brms in your
message and refer to him.

I have tried so many different combinations and it is still super slow. That’s why I am trying to write the Stan code on my own.

In addition to what @andre.pfeuffer wrote, which absolutely has to be taken care of for the model to work (the posterior must be proper), I have a few more comments on how to speed it up.

There’s a chapter on efficiency in the manual. It’s partly a matter of making the iterations fast and partly making the mixing fast.

One thing to do is drop the target += form, which adds all the normalizing constants.
The other thing to do is vectorize. Combining these two, you’d replace the four target += statements with:

b ~ normal(0.5, 0.5);

Drop the 1 * as it’s not doing anything.

lkj_corr_cholesky is also a no-op with eta = 1. You probably want to have some normalization there.
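
Putting those points together, a sketch of the priors written in sampling-statement form (eta = 2 is just one value greater than 1, in the spirit of the earlier suggestion, not a specific recommendation):

b ~ normal(0.5, 0.5);            // replaces the four target += statements for b
sigma ~ student_t(3, 0, 10);     // constant truncation term not needed in the ~ form
sd_1 ~ inv_gamma(0.5, 0.5);
to_vector(z_1) ~ normal(0, 1);
L_1 ~ lkj_corr_cholesky(2);      // eta > 1 regularizes the correlations toward zero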

I’m not sure why you’re stripping out parts of the correlation matrix, but you can do this more easily as

vector<lower = -1, upper = 1> corr[15]
  = { Cor_1[1, 2], Cor_1[1, 3], ...... };

without having to worry about indexing.

You probably want to put the calculation of mu down in the scope of !prior_only as that’s where it’s used.

I don’t follow the calculation of mu here or why you’re adding all those things to it. That could all be vectorized down to something like

vector[N] mu
  = X * b
    + r_1_1[J_1] .* Z_1_1
    + ...;
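
Filling in the remaining terms from the original model block, and folding in the point above about prior_only, a sketch of how that could look:

if (!prior_only) {
  vector[N] mu
    = X * b
      + r_1_1[J_1] .* Z_1_1
      + r_1_2[J_1] .* Z_1_2
      + r_1_3[J_1] .* Z_1_3
      + r_1_4[J_1] .* Z_1_4
      + r_1_5[J_1] .* Z_1_5
      + r_1_6[J_1] .* Z_1_6;
  Y ~ normal(mu, sigma);
}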

The transformed parameters can be defined as local variables in the model block so they’re not saved. Presumably you only need r_1 itself rather than six named copies of its columns:

matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';

Then you access column k as r_1[, k] rather than r_1_k.

Thank you for your answers. I tried to improve my code based on your comments and this is the new version:

data {
  int<lower=1> N; 
  real npack[N]; 
  matrix[N, 4] X; 
  int<lower=1> J; 
  // int<lower=1> K; //number of nunits
  int<lower=1, upper=J> brandname[N];
  // int<lower=1, upper=K> nunit[N]; 
}

parameters {
  vector[4] beta; //intercept and slope
  real<lower=0> sigma_e; 
  vector<lower=0>[4] sigma_u; 
  cholesky_factor_corr[4] L_u;
  matrix[4,J] z_u;
}

transformed parameters{
  matrix[4,J] u;

  u = diag_pre_multiply(sigma_u, L_u) * z_u; //brandname random effects

}

model {
  vector [N] mu = X * beta;
  //priors
  L_u ~ lkj_corr_cholesky(3.0);
  to_vector(z_u) ~ normal(0,1);
  sigma_u  ~ student_t( 3, 0, 10);
  beta ~ normal(0, 5);
  //likelihood
}

It’s much faster now. However, I had more than 3800 divergent transitions out of 4000 samples. The n_eff of most variables is really low. I ran the code with max_treedepth = 15 and adapt_delta = 0.99.

The new model already shows some progress. Good. However,
please read what @Bob_Carpenter wrote carefully. Compare your model with the model
in the Stan reference manual, section 9.13, especially pp. 152/153. That model comes close
to yours.

You omitted the log-likelihood, something like:
y ~ normal(rows_dot_product(beta[jj] , x), sigma);

In the last model, u is disconnected from mu. This has to be fixed.
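
As a rough sketch only (reusing the names from the model posted above, and assuming the brand effects in u are meant to act as brand-specific deviations of all four slopes), the model block could look like:

model {
  vector[N] mu = X * beta;
  for (n in 1:N)
    mu[n] += X[n] * u[, brandname[n]];   // add the brand-specific deviations
  // priors
  L_u ~ lkj_corr_cholesky(3.0);
  to_vector(z_u) ~ normal(0, 1);
  sigma_u ~ student_t(3, 0, 10);
  beta ~ normal(0, 5);
  // likelihood
  npack ~ normal(mu, sigma_e);
}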

I have one more concern about your data. Some of it shows a bathtub curve. To get the best
fit, a transform may be needed. Just keep that in mind.

So, should I add something like this:
y ~ normal(mu, sigma_e)?

When y is the dependent variable, e.g. the transform of your price, then a
measure of the quality of the fit is needed.

Thank you! I will try to run everything again and I will post again.

I changed the code to be similar to that on pages 152/153. However, I don’t have group-level regressors. I only want to estimate different betas for each level without using regressors. I am using the code below, but it is still too slow. Should I also define a prior for beta similar to that of gamma?

data {
  int<lower=0> N; // num individuals
  int<lower=1> K; // num ind predictors
  int<lower=1> J; // num groups
  // int<lower=1> L; // num group predictors
  int<lower=1,upper=J> jj[N]; // group for individual
  matrix[N, K] x; // individual predictors
  // row_vector[L] u[J]; // group predictors
  vector[N] y; // outcomes
}
parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0,upper=pi()/2>[K] tau_unif;
  // matrix[L, K] gamma; // group coeffs
  real<lower=0> sigma; // prediction error scale
}
transformed parameters {
  matrix[J, K] beta;
  vector<lower=0>[K] tau; // prior scale
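  // tau_unif is implicitly uniform(0, pi/2), so 2.5 * tan(tau_unif) gives tau a half-Cauchy(0, 2.5) prior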
  for (k in 1:K) tau[k] = 2.5 * tan(tau_unif[k]);
  // beta = u * gamma + (diag_pre_multiply(tau,L_Omega) * z);
   beta =   (diag_pre_multiply(tau,L_Omega) * z)';
}
model {
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(2);
// to_vector(gamma) ~ normal(0, 5);
y ~ normal(rows_dot_product(beta[jj] , x), sigma);
}
data {
  int<lower=0> N; // num individuals
  int<lower=1> K; // num ind predictors
  int<lower=1> J; // num groups
  // int<lower=1> L; // num group predictors
  int<lower=1,upper=J> jj[N]; // group for individual
  matrix[N, K] x; // individual predictors
  // row_vector[L] u[J]; // group predictors
  vector[N] y; // outcomes
}
parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] tau; // prior scale
  real<lower=0> sigma; // prediction error scale
}
transformed parameters {
  matrix[J, K] beta =   (diag_pre_multiply(tau,L_Omega) * z)';
}
model {
  sigma ~ normal(0, 1);
  tau ~ exponential(1.0);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(2);
  y ~ normal(rows_dot_product(beta[jj] , x), sigma);
}

Since you have enough data, we try something easy first. If it is still too slow, we have to change
the beta matrix to an array. However, Stan will be nowhere near as fast as a standard regression model.
NUTS in Stan explores the whole posterior distribution, not only a point estimate. This comes at
a price.

The dataset is not that big. It has ~26000 observations. I will try the new code and I will let you know. Thank you again for your help. But why was the code I ran this morning so fast? Did I make a mistake or ignore something important? (I am talking about the code in the 8th comment.)

There was no likelihood. Informally speaking, there was nothing to optimize.

The speed problem is being addressed: parallel CPU and GPU support is on its way.
You may also check out the GNU R lme4 package or R-INLA, which have a different focus and shorter
overall runtime.

Still running. Could it be because of the data? I have a lot of 0s for some regressors…

Also, even when I included y, the code was still really fast.
In the code above I have defined priors for the betas, but in the code from the book there is no prior for beta. How could I define a prior for the mean of the betas?

Yes, that’s Andrew’s Folk Theorem. If the model is misspecified for the data, HMC in particular is going to struggle with slow iterations as it properly reconciles mismatched priors and data.

Do any of you have the time to try to run it with these data?
https://expirebox.com/download/5fc702bab44ff01ebb98f4c04aee167f.html

Maybe I am doing something wrong when I call Stan. I just want to have brand-specific parameters for the regressors (my response variable is the number of packages).