# Slow Stan Code

Hello everyone. I am trying to fit a model. This is the Stan code produced by the `make_stan` function. However, the model takes hours to run, and even in simpler versions of the model the effective sample size is low. The model I am trying to fit is the following:

log1p(Number of units sold) = 0 + (1 + log1p(Price) + log1p(Distribution) + log1p(Distribution on Display) + log1p(Distribution on Leaflet) + log1p(Distribution Leaflet + Display) | Brand)

• fixed effects of all these variables.
``````functions {
}
data {
int<lower=1> N;  // total number of observations
vector[N] Y;  // response variable
int<lower=1> K;  // number of population-level effects
matrix[N, K] X;  // population-level design matrix
// data for group-level effects of ID 1
int<lower=1> J_1[N];
int<lower=1> N_1;
int<lower=1> M_1;
vector[N] Z_1_1;
vector[N] Z_1_2;
vector[N] Z_1_3;
vector[N] Z_1_4;
vector[N] Z_1_5;
vector[N] Z_1_6;
int<lower=1> NC_1;
int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
vector[K] b;  // population-level effects
real<lower=0> sigma;  // residual SD
vector<lower=0>[M_1] sd_1;  // group-level standard deviations
matrix[M_1, N_1] z_1;  // unscaled group-level effects
// cholesky factor of correlation matrix
cholesky_factor_corr[M_1] L_1;
}
transformed parameters {
// group-level effects
matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
vector[N_1] r_1_1 = r_1[, 1];
vector[N_1] r_1_2 = r_1[, 2];
vector[N_1] r_1_3 = r_1[, 3];
vector[N_1] r_1_4 = r_1[, 4];
vector[N_1] r_1_5 = r_1[, 5];
vector[N_1] r_1_6 = r_1[, 6];
}
model {
vector[N] mu = X * b;
for (n in 1:N) {
mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n] + r_1_5[J_1[n]] * Z_1_5[n] + r_1_6[J_1[n]] * Z_1_6[n];
}
// priors including all constants
target += normal_lpdf(b | 0.5,0.5);
target += normal_lpdf(b | 0.5,0.5);
target += normal_lpdf(b | 0.5,0.5);
target += normal_lpdf(b | 0.5,0.5);
target += student_t_lpdf(sigma | 3, 0, 10)
- 1 * student_t_lccdf(0 | 3, 0, 10);
target += inv_gamma_lpdf(sd_1 | .5,.5);
target += normal_lpdf(to_vector(z_1) | 0, 1);
target += lkj_corr_cholesky_lpdf(L_1 | 1);
// likelihood including all constants
if (!prior_only) {
target += normal_lpdf(Y | mu, sigma);
}
}
generated quantities {
corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
vector<lower=-1,upper=1>[NC_1] cor_1;
// take only relevant parts of correlation matrix
cor_1[1] = Cor_1[1,2];
cor_1[2] = Cor_1[1,3];
cor_1[3] = Cor_1[2,3];
cor_1[4] = Cor_1[1,4];
cor_1[5] = Cor_1[2,4];
cor_1[6] = Cor_1[3,4];
cor_1[7] = Cor_1[1,5];
cor_1[8] = Cor_1[2,5];
cor_1[9] = Cor_1[3,5];
cor_1[10] = Cor_1[4,5];
cor_1[11] = Cor_1[1,6];
cor_1[12] = Cor_1[2,6];
cor_1[13] = Cor_1[3,6];
cor_1[14] = Cor_1[4,6];
cor_1[15] = Cor_1[5,6];
}
``````

You can find a subset of the dataset here:

[edit: escaped code]

Each of these repeated statements puts the same prior on the whole vector `b`. If they were meant to be per-coefficient priors (`b[1]`, `b[2]`, …), then for K > 5 some coefficients have no prior, and for K < 5 priors on `b[K+1]`, … would be undefined.

This doesn’t look good either:
`target += student_t_lpdf(sigma | 3, 0, 10) - 1 * student_t_lccdf(0 | 3, 0, 10);`
Why pdf minus lower ccdf? You might go with the simpler `sigma ~ student_t(3, 0, 10);`
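For context: the subtracted term is just the log normalizing constant of a half-Student-t. Because the t(3, 0, 10) density is symmetric around 0, the mass above 0 is exactly 1/2, so `student_t_lccdf(0 | 3, 0, 10) = log(0.5)`, a constant that doesn't affect sampling. A quick numerical check in Python (a standalone sketch, not part of the model):

```python
import math

def student_t_pdf(x, nu, mu, sigma):
    # Student-t density with nu degrees of freedom, location mu, scale sigma
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma)
    return c * (1 + ((x - mu) / sigma) ** 2 / nu) ** (-(nu + 1) / 2)

# integrate the upper tail P(X > 0) with the trapezoid rule
nu, mu, sigma = 3, 0.0, 10.0
hi, n = 5000.0, 100000
h = hi / n
tail = 0.5 * (student_t_pdf(0, nu, mu, sigma) + student_t_pdf(hi, nu, mu, sigma))
tail += sum(student_t_pdf(i * h, nu, mu, sigma) for i in range(1, n))
tail *= h
print(round(tail, 3))           # 0.5, by symmetry around the location 0
print(round(math.log(0.5), 3))  # the constant being subtracted, about -0.693
```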

`target += lkj_corr_cholesky_lpdf(L_1 | 1);`
You might try values greater than 1 first.

`target += inv_gamma_lpdf(sd_1 | .5,.5);`
Not wrong, but there is no sqrt on `sd_1` in the code.

Should it be like that?

``````data {
int<lower=1> N; // total number of observations
vector[N] Y; // response variable
int<lower=1> K; // number of population-level effects
matrix[N, K] X; // population-level design matrix
// data for group-level effects of ID 1
int<lower=1> J_1[N];
int<lower=1> N_1;
int<lower=1> M_1;
vector[N] Z_1_1;
vector[N] Z_1_2;
vector[N] Z_1_3;
vector[N] Z_1_4;
vector[N] Z_1_5;
vector[N] Z_1_6;
int<lower=1> NC_1;
int prior_only; // should the likelihood be ignored?
}
transformed data {
}
parameters {
vector[K] b; // population-level effects
real<lower=0> sigma; // residual SD
vector<lower=0>[M_1] sd_1; // group-level standard deviations
matrix[M_1, N_1] z_1; // unscaled group-level effects
// cholesky factor of correlation matrix
cholesky_factor_corr[M_1] L_1;
}
transformed parameters {
// group-level effects
matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
vector[N_1] r_1_1 = r_1[, 1];
vector[N_1] r_1_2 = r_1[, 2];
vector[N_1] r_1_3 = r_1[, 3];
vector[N_1] r_1_4 = r_1[, 4];
vector[N_1] r_1_5 = r_1[, 5];
vector[N_1] r_1_6 = r_1[, 6];
}
model {
vector[N] mu = X * b;
for (n in 1:N) {
mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n] + r_1_5[J_1[n]] * Z_1_5[n] + r_1_6[J_1[n]] * Z_1_6[n];
}
// priors including all constants
target += normal_lpdf(b | 0.5,0.5);
target += normal_lpdf(b | -0.5,0.5);
target += normal_lpdf(b | 0.5,0.5);
target += normal_lpdf(b | 0.5,0.5);
target += normal_lpdf(b | 0.5,0.5);
target += student_t_lpdf(sigma | 3, 0, 10)
- 1 * student_t_lccdf(0 | 3, 0, 10);
target += inv_gamma_lpdf(sd_1 | .5,.5);
target += normal_lpdf(to_vector(z_1) | 0, 1);
target += lkj_corr_cholesky_lpdf(L_1 | 1);
// likelihood including all constants
if (!prior_only) {
target += normal_lpdf(Y | mu, sigma);
}
}
generated quantities {
corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
vector<lower=-1,upper=1>[NC_1] cor_1;
// take only relevant parts of correlation matrix
cor_1[1] = Cor_1[1,2];
cor_1[2] = Cor_1[1,3];
cor_1[3] = Cor_1[2,3];
cor_1[4] = Cor_1[1,4];
cor_1[5] = Cor_1[2,4];
cor_1[6] = Cor_1[3,4];
cor_1[7] = Cor_1[1,5];
cor_1[8] = Cor_1[2,5];
cor_1[9] = Cor_1[3,5];
cor_1[10] = Cor_1[4,5];
cor_1[11] = Cor_1[1,6];
cor_1[12] = Cor_1[2,6];
cor_1[13] = Cor_1[3,6];
cor_1[14] = Cor_1[4,6];
cor_1[15] = Cor_1[5,6];
}
``````

` target += normal_lpdf(b | -0.5,0.5);`
Is that your belief about `b`?

I’m also not clear on what you do with `N_1` and `M_1`.

I’m sure the Stan program requires some more rework.
Can you post the calling routine?

I know for sure that `b` should be negative; I want to avoid positive values. What do you mean by the calling routine?

edit: I use brm to run my code

Stan provides several interfaces: PyStan, RStan, CmdStan, and MatlabStan.
Since you are using brms, I assume you are using R.

I do not recommend using `make_stan` in your case. Try to fit your model directly with a call to
`brm`.
I’m not familiar with the syntax of `brm`, but the documentation of the brms package
is very good. If you still struggle, Paul-Buerkner might help you out. Flag brms in your
message and refer to him.

I have tried so many different combinations and it is still super slow. That’s why I am trying to write the Stan code on my own.

In addition to what @andre.pfeuffer wrote, which absolutely has to be taken care of for the model to work (the posterior must be proper), I have a few more comments on how to speed it up.

There’s a chapter on efficiency in the manual. It’s partly a matter of making the iterations fast and partly making the mixing fast.

One thing to do is drop the `target +=` form, which adds all the normalizing constants.
The other thing to do is vectorize; combining these two, you’d replace the four `target +=` statements with:

``````b ~ normal(0.5, 0.5);
``````
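As an aside, repeating `target += normal_lpdf(b | 0.5, 0.5)` four times is not merely redundant: each statement adds the same log density again, which amounts to a much sharper prior (a normal with half the standard deviation, up to a constant). A small Python illustration (my own toy check, not from the model):

```python
import math

def normal_lpdf(x, mu, sigma):
    # log density of a normal distribution
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# adding the same lpdf four times has the shape of a normal(0.5, 0.25) prior:
# the difference to that sharper lpdf is the same constant at every x
d1 = 4 * normal_lpdf(1.3, 0.5, 0.5) - normal_lpdf(1.3, 0.5, 0.25)
d2 = 4 * normal_lpdf(-0.7, 0.5, 0.5) - normal_lpdf(-0.7, 0.5, 0.25)
print(abs(d1 - d2) < 1e-9)  # True: only a constant offset, i.e. the sd shrinks to 0.25
```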

Drop the `1 *` as it’s not doing anything.

`lkj_corr_cholesky` is also a no-op with `eta = 1`. You probably want some regularization there.

I’m not sure why you’re stripping out parts of the correlation matrix, but you can do this more easily as

``````vector<lower = -1, upper = 1>[NC_1] cor_1
= [Cor_1[1, 2], Cor_1[1, 3], ...]';
``````

without having to worry about indexing.
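For reference, the extraction order in the generated code walks the upper triangle column by column: (1,2), (1,3), (2,3), (1,4), and so on. A one-liner to reproduce that ordering (Python, purely illustrative):

```python
M = 6  # number of group-level effects
# column-major walk over the upper triangle, 1-based as in Stan
pairs = [(i, j) for j in range(2, M + 1) for i in range(1, j)]
print(len(pairs))  # 15 correlations, i.e. M * (M - 1) / 2
print(pairs[:4])   # [(1, 2), (1, 3), (2, 3), (1, 4)]
```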

You probably want to put the calculation of `mu` down in the scope of `!prior_only` as that’s where it’s used.

I don’t follow the calculation of `mu` here or why you’re adding all those things to it. That could all be vectorized down to something like

``````vector[N] mu
= X * b
+ r_1_1[J_1] .* Z_1_1
+ ...;
``````
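What `r_1_1[J_1] .* Z_1_1` computes is a gather followed by an elementwise product. This pure-Python sketch with toy numbers (my own, not the thread’s data) shows it matches the loop form:

```python
J_1 = [1, 2, 1, 2]           # group index per observation (1-based, as in Stan)
r_1_1 = [0.5, -0.2]          # one group-level effect per group
Z_1_1 = [1.0, 2.0, 3.0, 4.0]

# loop form, as in the model block
loop = [r_1_1[J_1[n] - 1] * Z_1_1[n] for n in range(len(Z_1_1))]

# gather-then-multiply form, what the vectorized expression does
gathered = [r_1_1[j - 1] for j in J_1]
vectorized = [r * z for r, z in zip(gathered, Z_1_1)]

print(loop == vectorized)  # True
print(loop)                # [0.5, -0.4, 1.5, -0.8]
```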

The transformed parameters can be defined as local variables in the model block so they’re not saved. Presumably you only need the matrix `r_1` itself:

``````matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
``````

Then you access columns as `r_1[, 1]` rather than `r_1_1`.
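Background on why `diag_pre_multiply(sd_1, L_1) * z_1` works: with standard-normal `z_1`, the scaled draws have covariance `diag(sd) * L * L' * diag(sd)`, i.e. exactly the intended standard deviations and correlations. A deterministic 2x2 check in Python (toy numbers of my own):

```python
import math

sd = [1.5, 0.8]   # group-level standard deviations
rho = 0.3         # target correlation
# Cholesky factor of the 2x2 correlation matrix [[1, rho], [rho, 1]]
L = [[1.0, 0.0], [rho, math.sqrt(1.0 - rho ** 2)]]

# DL plays the role of diag_pre_multiply(sd, L); implied covariance is DL * DL'
DL = [[sd[i] * L[i][j] for j in range(2)] for i in range(2)]
Sigma = [[sum(DL[i][k] * DL[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]

print(round(Sigma[0][0], 6), round(Sigma[1][1], 6))  # 2.25 0.64, i.e. sd[0]**2 and sd[1]**2
print(round(Sigma[0][1] / (sd[0] * sd[1]), 6))       # 0.3, the target correlation
```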

``````data {
int<lower=1> N;
real npack[N];
matrix[N, 4] X;
int<lower=1> J;
// int<lower=1> K; //number of nunits
int<lower=1, upper=J> brandname[N];
// int<lower=1, upper=K> nunit[N];
}

parameters {
vector[4] beta; //intercept and slopes
real<lower=0> sigma_e;
vector<lower=0>[4] sigma_u;
cholesky_factor_corr[4] L_u;
matrix[4,J] z_u;
}

transformed parameters{
matrix[4,J] u;

u = diag_pre_multiply(sigma_u, L_u) * z_u; //brandname random effects

}

model {
vector[N] mu = X * beta;
//priors
L_u ~ lkj_corr_cholesky(3.0);
to_vector(z_u) ~ normal(0,1);
sigma_u  ~ student_t( 3, 0, 10);
beta ~ normal(0, 5);
//likelihood
}
``````

It’s much faster now. However, I had more than 3800 divergent samples out of 4000, and the n_eff of most variables is really low. I ran the code with maxtreedepth=15 and delta=0.99.

The new model already shows some progress. Good. However, take a look at
the Stan reference manual, section 9.13, especially pp. 152/153. This model comes close to the example there.

You omitted the log-likelihood, something like:
`y ~ normal(rows_dot_product(beta[jj] , x), sigma);`
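`rows_dot_product(beta[jj], x)` gathers the coefficient row for each observation’s group and takes the per-row dot product with that observation’s predictors. A Python sketch with toy numbers (illustrative only, 0-based indices):

```python
beta = [[0.5, -1.0],
        [2.0, 0.25]]           # J x K matrix of group coefficients
jj = [0, 1, 1]                 # group of each observation (0-based here)
x = [[1.0, 2.0],
     [3.0, 4.0],
     [0.0, 1.0]]               # N x K individual predictors

# per-observation mean: dot product of the gathered beta row with the x row
mu = [sum(b * v for b, v in zip(beta[j], row)) for j, row in zip(jj, x)]
print(mu)  # [-1.5, 7.0, 0.25]
```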

Within the last model `u` is disconnected from `mu`. This has to be fixed.

Depending on the fit, a transform may be needed. Just keep this in mind.

So, should I add something like this:
`y ~ normal(mu, sigma_e);`?

When `y` is the dependent variable, e.g. the transform of your price, then a
measure of the quality of the fit is needed.

Thank you! I will try to run everything again and I will post again.

I changed the code to be similar to that on pages 152/153. However, I don’t have group-level regressors; I only want to estimate different betas for each level without using regressors. I am using the code below, but it’s still too slow. Should I also define a prior for beta similar to that of gamma?

``````data {
int<lower=0> N; // num individuals
int<lower=1> K; // num ind predictors
int<lower=1> J; // num groups
// int<lower=1> L; // num group predictors
int<lower=1,upper=J> jj[N]; // group for individual
matrix[N, K] x; // individual predictors
// row_vector[L] u[J]; // group predictors
vector[N] y; // outcomes
}
parameters {
matrix[K, J] z;
cholesky_factor_corr[K] L_Omega;
vector<lower=0,upper=pi()/2>[K] tau_unif;
// matrix[L, K] gamma; // group coeffs
real<lower=0> sigma; // prediction error scale
}
transformed parameters {
matrix[J, K] beta;
vector<lower=0>[K] tau; // prior scale
for (k in 1:K) tau[k] = 2.5 * tan(tau_unif[k]);
// beta = u * gamma + (diag_pre_multiply(tau,L_Omega) * z);
beta =   (diag_pre_multiply(tau,L_Omega) * z)';
}
model {
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(2);
// to_vector(gamma) ~ normal(0, 5);
y ~ normal(rows_dot_product(beta[jj] , x), sigma);
}
``````
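The `tau_unif`/`tan` construction in the block above is the standard reparameterization of a half-Cauchy(0, 2.5) prior: if `U ~ uniform(0, pi/2)`, then `2.5 * tan(U)` is half-Cauchy with scale 2.5. A quick round-trip check against the analytic half-Cauchy CDF (pure Python, illustrative):

```python
import math

scale = 2.5

def halfcauchy_cdf(x, s):
    # CDF of |Cauchy(0, s)|: F(x) = (2 / pi) * arctan(x / s)
    return 2.0 / math.pi * math.atan(x / s)

# push uniform(0, pi/2) quantiles through the tan transform;
# the half-Cauchy CDF recovers the original probability
for p in [0.1, 0.5, 0.9]:
    u = p * (math.pi / 2)        # p-quantile of uniform(0, pi/2)
    tau = scale * math.tan(u)    # the transform used in the model
    print(round(halfcauchy_cdf(tau, scale), 6))  # prints p: 0.1, 0.5, 0.9
```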
``````data {
int<lower=0> N; // num individuals
int<lower=1> K; // num ind predictors
int<lower=1> J; // num groups
// int<lower=1> L; // num group predictors
int<lower=1,upper=J> jj[N]; // group for individual
matrix[N, K] x; // individual predictors
// row_vector[L] u[J]; // group predictors
vector[N] y; // outcomes
}
parameters {
matrix[K, J] z;
cholesky_factor_corr[K] L_Omega;
vector<lower=0>[K] tau; // prior scale
real<lower=0> sigma; // prediction error scale
}
transformed parameters {
matrix[J, K] beta =   (diag_pre_multiply(tau,L_Omega) * z)';
}
model {
sigma ~ normal(0, 1);
tau ~ exponential(1.0);
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(2);
y ~ normal(rows_dot_product(beta[jj] , x), sigma);
}
``````

Since you have enough data, we’ll try something easy first. If it’s still too slow, we have to change
the beta matrix to an array. However, Stan will be far from as fast as a standard regression model:
NUTS in Stan explores the whole posterior distribution, not only a point estimate. This comes at
a price.

The dataset is not that big; it has ~26000 observations. I will try the new code and I will let you know. Thank you again for your help. But why was the code I ran this morning so fast? Did I make a mistake or ignore something important? (I am talking about the code in the 8th comment.)

There was no likelihood. Nothing to optimize, informally said.

The speed problem is being addressed; parallel CPU and GPU support is on its way.
You may also check out the R package lme4 or R-INLA, which have a different focus and shorter
overall runtime.

Still running. Could it be because of the data? I have a lot of 0s for some regressors…

Also, even when I included `y` the code was still really fast.
In the code above I have defined priors for the betas, but in the code from the book there is no prior for beta. How could I define a prior for the mean of the betas?

Yes, that’s Andrew’s Folk Theorem. If the model is misspecified for the data, HMC in particular is going to struggle with slow iterations as it tries to reconcile mismatched priors and data.

Do any of you have the time to try to run it with these data?