Stuck at warmup

Hi again, it is actually sampling now. It is just very slow. Would it be possible to delete my post please? My apologies for posting too fast, I will be more patient next time. Thank you.


Hi!

I am having trouble with sampling stopping during warm-up. I consistently obtain the following output when I run the model:

SAMPLING FOR MODEL 'stanmodel' NOW (CHAIN 1).

Gradient evaluation took 0.000346 seconds
1000 transitions using 10 leapfrog steps per transition would take 3.46 seconds.
Adjust your expectations accordingly!

Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 200 / 2000 [ 10%] (Warmup)
Iteration: 400 / 2000 [ 20%] (Warmup)
Iteration: 600 / 2000 [ 30%] (Warmup)

I use Ubuntu, I have disk space available, and an almost identical model was running slowly but without problems yesterday. Would you know why this is happening? Is it due to an inefficient parameterization of the model?

Thank you very much for your attention!

No crime committed, no apologies necessary

Slow is a pretty good diagnostic. If things sample slowly, that's usually either because of huge amounts of data or because something interesting is happening in the posterior.

If you definitely don't have huge amounts of data, then the standard thing is to try to reparameterize. The first places to look are the divergence/treedepth diagnostics and pairs plots, for clues on how to do that.
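In rstan that's roughly the following (a sketch; fit and the parameter names are placeholders for your fit object and your own parameters):

library(rstan)

# One-stop summary of divergences, max-treedepth hits, and E-BFMI
check_hmc_diagnostics(fit)

# Or pull the raw sampler parameters and count divergent transitions per chain
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(x) sum(x[, "divergent__"]))

# Pairs plot of a few suspect parameters to look for funnels / strong correlations
pairs(fit, pars = c("mu", "Sigma"))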


Thank you for your answer! My session crashed mid-sampling and I could not obtain any results. I will try again today and see if I can share some convergence diagnostics. I indeed don't have much data: I am doing simulations and I am using very small samples at first. I have used a non-centered parameterization for the normally distributed parameters, and I use adapt_delta = 0.99 and max_treedepth = 25.

I have seen on this forum that it is not recommended to define distributions using multi_normal with a diagonal covariance matrix, so I will try and change that. Regardless, I am pretty sure my code is very inefficient. I am copying it below, and any feedback on it is very much appreciated.

Is there any way to build this model up piece by piece? This looks really complicated. If you can get this working in smaller bits and glue 'em together, you might find it easier to figure out what's going on.

list(adapt_delta = 0.99, max_treedepth = 25)

I only go for these options if I’ve run out of reparameterization ideas. These will slow down the model, but sometimes there are ways to avoid them.
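For reference, these are passed through the control argument of the rstan call, roughly like this (a sketch; the file and data names are placeholders):

library(rstan)

fit <- stan(
  file = "stanmodel.stan",   # placeholder file name
  data = stan_data,          # placeholder data list
  chains = 4, iter = 2000,
  control = list(adapt_delta = 0.99, max_treedepth = 25)
)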

Yes, I did build the model block by block (with just beta as an unknown parameter, with just delta, with a less complicated Pi matrix, etc.), and the previous steps work.
Also, this version is still an intermediate step. More complicated versions, with a third, lower-level hierarchical parameter and other types of observations, actually work well.
I will try with lower values of adapt_delta and max_treedepth, thank you.

Should this be L * beta_delta_tilde instead of beta_delta_tilde * L (i.e., multiply(beta_delta_tilde, cholesky_decompose(Sigma)) in your code)? Other examples of NCP for the multivariate normal I've seen have used the former.

Thank you for your answer.
beta_delta_tilde is J x 2 and the Cholesky factor is 2 x 2, so it would not be possible to do it the other way around. I think that what I did is equivalent to doing the following (in R), which might be the formulation that is most frequently used:

# mu, Sigma, and beta_b_delta_tilde are assumed to be defined already
beta_b_delta <- data.frame(matrix(NA, nrow = J, ncol = 3))
for (j in 1:J){
  beta_b_delta[j,] <- c(as.matrix((mu + chol(Sigma) %*% as.matrix(beta_b_delta_tilde)[j,]))[,1])
}

Please let me know if I missed something, and thanks again for taking the time to read the code, there must be something wrong somewhere.

Fair point re dimensionality; I was thinking the parameters were the columns of beta_delta_tilde, so in my suggestion above just transpose beta_delta_tilde.

Looks like that R example has Lb, while you (currently) have bL, which is (L'b')'; though both have the same dimensionality, they are different. Pretty sure you want the former.

Hi!

If I understand your recommendation correctly, it is that the transposed version of the code could improve efficiency. I am willing to try it, but I don't understand what difference it could make. I will think about it.
Also, as a practical argument in favor of the current version, it yielded convergence and decent run time in less complicated versions of the code.

I found that transforming the multi_normal into univariate normal distributions saves a lot of run time in simpler versions, so I will definitely try that on the posted version today.
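For what it's worth, with a diagonal covariance the two are exactly equivalent: the MVN density factors into independent univariate normals. A quick sanity check in R (this uses the mvtnorm package and made-up numbers):

library(mvtnorm)

mu  <- c(1, 2)
sds <- c(0.5, 2)
y   <- c(0.8, 3.1)

# Joint log-density under an MVN with diagonal covariance ...
dmvnorm(y, mean = mu, sigma = diag(sds^2), log = TRUE)
# ... equals the sum of independent univariate normal log-densities
sum(dnorm(y, mean = mu, sd = sds, log = TRUE))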

Hi Nestor, I think one is probably the right answer and the other is a bug. Still not quite sure which is which; maybe one of the more expert users can weigh in. But to the best of my understanding the non-centered parameterization of the multivariate normal goes something like:

y ~ MVN(mu, Sigma)

implies

L^-1 * (y - mu) ~ MVN(0, I)
where L is the Cholesky factor of Sigma (Sigma = LL').

(y - mu) * L^-1 would be distributed completely differently and is probably not what you want.
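A quick simulation check of that in R (note that chol() returns the upper factor, so L = t(chol(Sigma)); Sigma is made non-diagonal here so the two forms actually differ):

set.seed(1)
n     <- 1e5
mu    <- c(1, 2)
Sigma <- matrix(c(1, 0.8, 0.8, 2), 2, 2)
L     <- t(chol(Sigma))              # lower Cholesky factor, Sigma = L L'
Z     <- matrix(rnorm(2 * n), nrow = 2)

Y1 <- mu + L %*% Z                   # y = mu + L z: cov(y) recovers Sigma
cov(t(Y1))

Y2 <- mu + t(L) %*% Z                # y = mu + L' z: cov(y) is L' L, not Sigma
cov(t(Y2))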

Another point: if mu00 and sigma00 are a zero vector and an identity matrix respectively, you can speed up sampling (not by a huge amount in the 2x2 case) by doing

beta_delta_tilde[j] ~ normal(0,1)

This saves a matrix factorization and a multiplication at each step.

Hi,
The following two versions yield the same result, so this is definitely not the problem:

set.seed(1)
J <- 30
mu <- c(1, 1, 1)
Sigma <- diag(0.2, 3)
beta_b_delta_tilde <- matrix(rnorm(J * 3, 0, 1), nrow = J, ncol = 3)

# Version 1: loop over rows, column form mu + chol(Sigma) %*% z
beta_b_delta <- data.frame(matrix(NA, nrow = J, ncol = 3))
for (j in 1:J){
  beta_b_delta[j,] <- c(as.matrix((mu + chol(Sigma) %*% as.matrix(beta_b_delta_tilde)[j,]))[,1])
}
print(head(beta_b_delta))

# Version 2: vectorized, row form z' %*% chol(Sigma)
beta_b_delta2 <- matrix(1, nrow = J, ncol = 3) + beta_b_delta_tilde %*% chol(Sigma)
print(head(beta_b_delta2))

If Sigma is diagonal, L = L'; in general it won't be.
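You can see that directly in R: with a diagonal Sigma like the one in your check above the Cholesky factor equals its own transpose, but add any off-diagonal term and it doesn't (small illustrative example):

Sigma_diag <- diag(0.2, 2)
all.equal(chol(Sigma_diag), t(chol(Sigma_diag)))    # TRUE: upper and lower factors coincide

Sigma_corr <- matrix(c(0.2, 0.1, 0.1, 0.2), 2, 2)
all.equal(chol(Sigma_corr), t(chol(Sigma_corr)))    # not TRUE: the factors differ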
