Sorry if the question is too basic. I am running a model which appears to run very very slow (I get 17 iterations per day), so it was suggested to me that I should split it into a number of smaller chains so that it runs faster. However, I am not sure that this is the proper thing to do, so I would like to have some insight.
Also, if that is the way to go, then how would one go about having warmup in the calculation? If the chain length that I’m aiming for is lower than the number of iterations that constitute the warmup. Thanks.
17 iterations per day is extremely slow, it would take you almost half a year to get to 2500 iterations. Since in all likelihood there will be things that need to be fixed/improved after the initial run(s) it is probably a very bad idea to go for that and then spend months repeating the run until you actually get a result that is useful.
I’d suggest looking into what can be improved (or fixed) in you implementation to make it practical; I don’t see how breaking down into multiple chains will give you the minimum performance you need there.
But to answer your question directly: yes, but not without some considerations/caveats.
If the warmup optimization is fast enough that chains are quickly mixing properly, all parallel chains will be sampling from the same distribution and you can substitute more short chains for fewer long ones; however, each chain will need its own warmup, so you will be wasting computations there in exchange for parallel sampling. You could shorten warmup enough that you get just enough iterations to get the sampling, or you could (in theory) run the warmup extract the stepsize and mass matrix them and fix them for later runs (in practice it’s hard to verify that these are good values, which is why relying on separate optimizations that still manage to get chains to mix is a reasonable validation).
In any case it’s best to be able to run your model for a reasonably long chain (more than one, actually) to make sure warmup is consistently getting to reasonable values that allow sampling to be done properly. If you are confident that is the case maybe you can work with the length and number of chains to get enough samples.
Thank you for the insights! We already tried implementing a lot of changes in the model, following all the reparametrization instructions etc. but some parts of the model still have trouble converging. It seems that the estimation improves with increased sample size, though, so we are trying to estimate the model with a large number of participants (10000). However, that is too slow (17 iterations per day), so I was looking for a way to split my current 4 chains into hundreds of smaller chains in order to have it estimated in a few days instead of a few years. It should be noted that with a ‘reasonable’ number of participants, such as 500, the model estimation lasts just under one day, which is acceptable.
However, I am still not sure what I would do with the warmup if I wanted to split my 4 chains into hundreds of smaller chains. Your comment seems to describe the situation the other way around, when I want to merge more short chains into fewer longer ones. Instead of that, I am looking to split my fewer longer chains into a very large number of shorter ones. Do you perhaps have any insight into how one could go on about that?
I think what @caesoma wrote is still directly applicable to your situation.
In theory, running many shorter chains to yield the same number of posterior iterations should be at least as good as running fewer longer chains (provided that the short chains are still long enough to reliably estimate convergence diagnostics). But the problem is that there is a serious limit to how short you can make the warmup. If warmup is too short, then the samples that you draw post-warmup will not be useful for inference, because in order to reliably infer convergence you’ll need a decent effective sample size post-warmup per chain, and with poor warmup you will need to run each chain for a very large number of post-warmup iterations in order to get an acceptably large per-chain effective sample size.
So there’s some minimum amount of warmup, below which short post-warmup chains will not be helpful. And this minimum amount of warmup is likely to be large enough that you still run into the same problem.
With all of that said, there are a couple of points that I think are worthwhile here, and might provide you with a way forward.
Sometimes things run much slower for the first 100-200 iterations and then begin to speed up dramatically. It might be worth letting this model run for a week or so, and then checking back. Prior to iteration 100, it’s possible that as chains find the region of nontrivial posterior probability mass, they will start to hit the no-u-turn criterion in fewer iterations, which can speed things up a lot. Then, after iteration 100 you get your first update to the inverse metric, and this can again allow things to speed up. Thereafter you will continue to get inverse metric updates during the windowed phase of adaptation, which can continue to yield improvements. Prior to waiting a week to reach 100 iterations, two clues can suggest whether there is any hope of a useful speed-up deeper into warmup, and both are related to understanding whether the model is slow due to the number of gradient evaluations or due to the time per gradient evaluation. If you save the warmup, then you can look at the CSV output. Big speedups are unlikely to ever happen unless the treedepths during warmup are fairly large. The second clue is (using rstan) to look at the reported time per gradient evaluation. Ultimately, for a model that is tricky but not too pathological, you might hope that the time per iteration settles down near 2^7 or 2^8 times the time per gradient eval.
Because you see things getting much slower with increasing sample size, and because it seems you have HPC resources available (since you are talking about running many short chains), it is possible that you can parallelize the computation within-chains for a big savings. See reduce_sum. The savings are most likely to be very large if increasing the sample size is not associated with increasing the length of the parameter vectors.
Finally, I agree with @caesoma that 17 iterations per day seems extremely slow. In general, my experience with this forum is that if you’re willing to post your Stan code there’s a strong chance that somebody will spot important optimizations, or else will be able to explain why your model is pathological and unlikely to be tractable.
I agree, and I think it’s safe to say that it’s not recomendable in general (actually only very rarely, if ever) or efficient to try to get more samples by simply running chains in parallel, and especially not by massively parallelizing, since there’s a limit and lack of knowledge about how short you can make warmup without making the rest of the chain a very poor or useless set of samples.
I also agree that the information about sample sizes being significant in the final speed can be exploited in different ways. Depending on how your likelihood is computed, doing it in parallel is probably highly recommended in your case, and someone around here will probably have experience with similar models and be able to help if you post your code.
Many thanks for the great and very informative answers. I will definitely also make sure to re-run the model again and to see if it speeds up after 100-200 iterations. Below is my code. The data (we’re using simulated data at this point) consists of respondents who give their responses to binary items. The run which had 17 iterations per day had 10.000 respondents, but I can’t remember the exact number of items that were used - I think it was 30. The reason why we want such a large sample size is because certain parameters of the model (mainly, the variance of the second member of the PersPar vector and its correlations with other parameters) don’t quite converge and the estimates are off. We would like to see if this is only due to the sample size, i.e. will increasing the sample size make these parameters converge. Feel free to ask me about any other previous results that you feel would help your understanding of the problem. One last note: removing the generated quantites block, naturally, does speed things up, but not to a significant extent, so I decided to keep it in the code because then it’s easier for me to evaluate my parameters after the estimation is done. Here is the code:
data {
int<lower=0> n_person; // number of examinees
int<lower=0> n_item; // number of items
int<lower = 1> Dim; // number of dimensions
int<lower = 1> N; // number of data points
int<lower = 1> jj[N]; // item id
int<lower = 1> ii[N]; // person id
int<lower=0, upper = 1> Y[N];
real<lower=0> t[N]; // response time data
real<lower=0, upper=1> X[N]; // time scale variable
vector[Dim] Zero; // vector of 0s for person parameter means
}
parameters {
vector[Dim] PersParStar[n_person];
cholesky_factor_corr[Dim] choleskyP; // Cholesky decomposition of correlation of person parameters
vector<lower=0>[Dim] sigmaP; // variances of person parameters
vector[n_item] b; // item difficulty parameters
vector[n_item] beta; // item time intensity parameters
real<lower=0> alpha; // common variance of RTs
}
transformed parameters {
vector[Dim] PersPar[n_person];
for(i in 1:n_person){
PersPar[i] = Zero + sigmaP .* (choleskyP * PersParStar[i]);
}
}
model {
for(j in 1:n_person){
PersParStar[j] ~ std_normal();
}
sigmaP ~ student_t(3, 0, 2.5);
choleskyP ~ lkj_corr_cholesky(1);
b ~ normal(0, 100);
beta ~ normal(0, 100);
alpha ~ student_t(3, 0, 2.5);
for(n in 1:N){
target += (bernoulli_logit_lpmf(Y[n] | (PersPar[ii[n],1] + PersPar[ii[n],2]*X[n]) - b[jj[n]])+
lognormal_lpdf(t[n] | beta[jj[n]] - (PersPar[ii[n],3] + PersPar[ii[n],4]*X[n]), alpha));
}
}
generated quantities {
matrix[Dim, Dim] correlP;
matrix[Dim, Dim] SigmaP;
correlP = multiply_lower_tri_self_transpose(choleskyP);
SigmaP = quad_form_diag(correlP, sigmaP);
}
Thanks for the tip. I am unsure about how I would vectorize it, so I am going to seek further help on this forum by opening a new topic, since this is now a completely different question.
In my experience, when a model samples very slowly, there is an extremely good chance that there is a problem with your model specification. Using the model to some prior predictive check would be helpful to check that you are at least generating values that are not completely unrealistic, because even just having too vague priors (I’m looking at those 100 standard deviations in your normal priors, except if you are NOT working on a standardized scale), can make life quite difficult for the sampler. I have seen an order of magnitude speedup in sampling (with much less divergent transitions), by just specifying more informative priors to keep the sampler away from pathological regions in the posterior.