Stuck at warmup

Hi again, it is actually sampling now. It is just very slow. Would it be possible to delete my post please? My apologies for posting too fast, I will be more patient next time. Thank you.


Hi!

I am having trouble with sampling stopping during warm-up. I consistently obtain the following output when I run the model:

SAMPLING FOR MODEL 'stanmodel' NOW (CHAIN 1).

Gradient evaluation took 0.000346 seconds
1000 transitions using 10 leapfrog steps per transition would take 3.46 seconds.
Adjust your expectations accordingly!

Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 200 / 2000 [ 10%] (Warmup)
Iteration: 400 / 2000 [ 20%] (Warmup)
Iteration: 600 / 2000 [ 30%] (Warmup)

I use Ubuntu, I have disk space available, and an almost identical model was running slowly but without problems yesterday. Would you know why this is happening? Is it due to an inefficient parameterization of the model?

Thank you very much for your attention!

No crime committed, no apologies necessary

Slow is a pretty good diagnostic. If things sample slowly, that's usually either because of huge amounts of data or because something interesting is happening in the posterior.

If you definitely don't have huge amounts of data, then the standard thing is to try to reparameterize. The first places to look are the divergence/treedepth diagnostics and pairs plots, for clues on how to do that.
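In rstan that's roughly the following (a sketch; fit and the parameter names are placeholders for your fit object and your own parameters):

library(rstan)

# One-stop summary of divergences, max-treedepth hits, and E-BFMI
check_hmc_diagnostics(fit)

# Or pull the raw sampler parameters and count divergent transitions per chain
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(x) sum(x[, "divergent__"]))

# Pairs plot of a few suspect parameters to look for funnels / strong correlations
pairs(fit, pars = c("mu", "Sigma"))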


Thank you for your answer! My session crashed mid-sampling and I could not obtain any results. I will try again today and see if I can share some convergence diagnostics. I indeed don't have much data: I am doing simulations and I am using very small samples at first. I have used a non-centered parameterization for the normally distributed parameters, and I use adapt_delta = 0.99 and max_treedepth = 25.

I have seen on this forum that it is not recommended to define distributions using multi_normal with a diagonal covariance matrix, so I will try and change that. Regardless, I am pretty sure my code is very inefficient. I am copying it below, and any feedback on it is very much appreciated.

Is there any way to build this model up piece by piece? This looks really complicated. If you can get this working in smaller bits and glue 'em together, you might find it easier to figure out what's going on.

list(adapt_delta = 0.99, max_treedepth = 25)

I only go for these options if I’ve run out of reparameterization ideas. These will slow down the model, but sometimes there are ways to avoid them.
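For reference, these are passed through the control argument of the rstan call, roughly like this (a sketch; the file and data names are placeholders):

library(rstan)

fit <- stan(
  file = "stanmodel.stan",   # placeholder file name
  data = stan_data,          # placeholder data list
  chains = 4, iter = 2000,
  control = list(adapt_delta = 0.99, max_treedepth = 25)
)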

Yes, I did build the model block by block (with just beta as an unknown parameter, with just delta, with a less complicated Pi matrix, etc.), and the previous steps work.
Also, this version is still an intermediate step. More complicated versions, with a third, lower-level hierarchical parameter and other types of observations, actually work well.
I will try with lower values of adapt_delta and max_treedepth, thank you.

Should this be L * beta_delta_tilde instead of beta_delta_tilde * L (i.e., multiply(beta_delta_tilde, cholesky_decompose(Sigma)) in your code)? Other examples of NCP for the multivariate normal I've seen have used the former.

Thank you for your answer.
beta_delta_tilde is J x 2 and the Cholesky factor is 2 x 2, so it would not be possible to do it the other way around. I think that what I did is equivalent to doing the following (in R), which might be the formulation that is most frequently used:

# mu, Sigma, and beta_b_delta_tilde are assumed to be defined already
beta_b_delta <- data.frame(matrix(NA, nrow = J, ncol = 3))
for (j in 1:J){
  beta_b_delta[j,] <- c(as.matrix((mu + chol(Sigma) %*% as.matrix(beta_b_delta_tilde)[j,]))[,1])
}

Please let me know if I missed something, and thanks again for taking the time to read the code, there must be something wrong somewhere.

Fair point re dimensionality; I was thinking the parameters were the columns of beta_delta_tilde, so in my suggestion above just transpose beta_delta_tilde.

Looks like that R example has Lb, while you (currently) have bL, which is (L'b')'; though both have the same dimensionality, they are different. Pretty sure you want the former.

Hi!

If I understand your recommendation correctly, it is that the transposed version of the code could improve efficiency. I am willing to try it, but I don't understand what difference it could make. I will think about it.
Also, as a practical argument in favor of the current version, it yielded convergence and decent run time in less complicated versions of the code.

I found that transforming the multi_normal into univariate normal distributions saves a lot of run time in simpler versions, so I will definitely try that on the posted version today.
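For what it's worth, with a diagonal covariance the two are exactly equivalent: the MVN density factors into independent univariate normals. A quick sanity check in R (this uses the mvtnorm package and made-up numbers):

library(mvtnorm)

mu  <- c(1, 2)
sds <- c(0.5, 2)
y   <- c(0.8, 3.1)

# Joint log-density under an MVN with diagonal covariance ...
dmvnorm(y, mean = mu, sigma = diag(sds^2), log = TRUE)
# ... equals the sum of independent univariate normal log-densities
sum(dnorm(y, mean = mu, sd = sds, log = TRUE))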

Hi Nestor, I think one is probably the right answer and the other is a bug. Still not quite sure which is which; maybe one of the more expert users can weigh in. But to the best of my understanding the non-centered parameterization of the multivariate normal goes something like:

y ~ MVN(mu, Sigma)

implies

L^-1 * (y - mu) ~ MVN(0, I)
where L is the Cholesky factor of Sigma (Sigma = LL').

(y - mu) * L^-1 would be distributed completely differently and is probably not what you want.
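A quick simulation check of that in R (note that chol() returns the upper factor, so L = t(chol(Sigma)); Sigma is made non-diagonal here so the two forms actually differ):

set.seed(1)
n     <- 1e5
mu    <- c(1, 2)
Sigma <- matrix(c(1, 0.8, 0.8, 2), 2, 2)
L     <- t(chol(Sigma))              # lower Cholesky factor, Sigma = L L'
Z     <- matrix(rnorm(2 * n), nrow = 2)

Y1 <- mu + L %*% Z                   # y = mu + L z: cov(y) recovers Sigma
cov(t(Y1))

Y2 <- mu + t(L) %*% Z                # y = mu + L' z: cov(y) is L' L, not Sigma
cov(t(Y2))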

Another point: if mu00 and sigma00 are a zero vector and an identity matrix respectively, you can speed up sampling (not by a huge amount in the 2x2 case) by doing

beta_delta_tilde[j] ~ normal(0,1)

This saves a matrix factorization and a multiplication at each step.

Hi,
The following two versions yield the same result, so this is definitely not the problem:

set.seed(1)
J <- 30
mu <- c(1, 1, 1)
Sigma <- diag(0.2, 3)
beta_b_delta_tilde <- matrix(rnorm(J * 3, 0, 1), nrow = J, ncol = 3)

# Version 1: loop over rows, column form mu + chol(Sigma) %*% z
beta_b_delta <- data.frame(matrix(NA, nrow = J, ncol = 3))
for (j in 1:J){
  beta_b_delta[j,] <- c(as.matrix((mu + chol(Sigma) %*% as.matrix(beta_b_delta_tilde)[j,]))[,1])
}
print(head(beta_b_delta))

# Version 2: vectorized, row form z' %*% chol(Sigma)
beta_b_delta2 <- matrix(1, nrow = J, ncol = 3) + beta_b_delta_tilde %*% chol(Sigma)
print(head(beta_b_delta2))

If Sigma is diagonal, L = L'; in general it won't be.
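You can see that directly in R: with a diagonal Sigma like the one in your check above the Cholesky factor equals its own transpose, but add any off-diagonal term and it doesn't (small illustrative example):

Sigma_diag <- diag(0.2, 2)
all.equal(chol(Sigma_diag), t(chol(Sigma_diag)))    # TRUE: upper and lower factors coincide

Sigma_corr <- matrix(c(0.2, 0.1, 0.1, 0.2), 2, 2)
all.equal(chol(Sigma_corr), t(chol(Sigma_corr)))    # not TRUE: the factors differ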
