I have already fitted my models using Rstan across fifteen candidate models, with Rhat and LOOIC applied as model selection criteria. The next step I would like to take is model recovery within the six most representative models selected from the fifteen, which should enhance the identification of the winning model compared to other models. However, the average time to fit one model is around 3 to 4 hours using Rstan. This means that if I choose six alternative models and simulate 100 iterations for each model during the model recovery procedure, it will take approximately 3,600 hours—nearly half a year. Given this, I am wondering if there are faster methods available to accomplish model recovery. Each candidate model is fitted with four independent MCMC chains using 1,000 iterations (after 2,000 iterations for initial warmup per chain), resulting in 4,000 valid posterior samples.
What do you mean by “model recovery” and “enhance identification”?
If you already fit the model, why not just save the posterior samples?
I get the six models, but what do you mean by “100 iterations”? If you already fit each model with 4000 iterations, why run another 100? Or are these some different kind of iteration?
You will get 4K draws from the posterior if everything has converged, but they don’t have the same power as 4K independent posterior draws. The effective sample size measures their equivalent size of independent draws (it can be more or less than the actual sample size depending on the anti-correlation or correlation in the Markov chains).
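To make the "more or less than the actual sample size" point concrete, here is a minimal sketch for an idealized chain. It assumes the draws behave like a stationary AR(1) process with lag-1 autocorrelation `phi`, which is an illustration only and not how Stan estimates ESS from real chains. The function name `ess_ar1` is made up for this example.

```python
def ess_ar1(n_draws: int, phi: float) -> float:
    """ESS of n_draws from a stationary AR(1) chain with lag-1 autocorrelation phi.

    Uses ESS = N / (1 + 2 * sum_{t>=1} rho_t) with rho_t = phi**t,
    which simplifies to N * (1 - phi) / (1 + phi) for |phi| < 1.
    """
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    return n_draws * (1.0 - phi) / (1.0 + phi)


# Positive correlation shrinks the ESS, zero correlation leaves it at N,
# and anti-correlation pushes it above N.
for phi in (0.5, 0.0, -0.2):
    print(f"phi={phi:+.1f}: ESS ~ {ess_ar1(4000, phi):.0f} of 4000 draws")
```

In practice you would read the ESS off the fit summary rather than compute it this way; the sketch only shows why 4K draws are not automatically worth 4K independent draws.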
What ESS do you actually need for the downstream task? Are you sure you need 2K iterations for warmup? That’s a lot. Our use of 1K by default is usually a very conservative choice.
What have you done to optimize the speed of the model? Is the autodiff as vectorized as possible, are calculations only done once and reused, are constants defined in transformed data, etc.?
If the models are related, it’s possible to use warm starts from one model to initialize another.
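One way to read "warm starts" is to build each chain's initial values from posterior draws of an already fitted, related model. The sketch below does only that bookkeeping. It assumes the two models share some parameter names and that the draws have been extracted into a plain dictionary. The helper `warm_start_inits` and the parameter names `alpha`, `beta`, and `tau` are hypothetical.

```python
import random


def warm_start_inits(draws, target_params, n_chains=4, seed=1):
    """Build one init dict per chain from posterior draws of a fitted model.

    draws: dict mapping parameter name -> list of posterior draws (equal lengths).
    target_params: names of the parameters declared in the model to initialize.
    Only parameters present in both are initialized; the others are left to
    the sampler's default initialization.
    """
    shared = [p for p in target_params if p in draws]
    if not shared:
        return [{} for _ in range(n_chains)]
    n_draws = len(draws[shared[0]])
    rng = random.Random(seed)
    # A different draw per chain keeps the starting points dispersed.
    idx = rng.sample(range(n_draws), n_chains)
    return [{p: draws[p][i] for p in shared} for i in idx]


# Toy draws from a hypothetical fitted model with parameters alpha and beta.
fitted_draws = {
    "alpha": [0.11, 0.15, 0.09, 0.13, 0.12],
    "beta": [2.0, 2.3, 1.9, 2.1, 2.2],
}
# The hypothetical target model declares alpha and tau; only alpha is shared.
inits = warm_start_inits(fitted_draws, ["alpha", "tau"], n_chains=4)
print(inits)
```

In rstan, the `init` argument of `stan()` and `sampling()` accepts a list of named lists, one per chain, so a structure like this maps onto it directly. Note that initial values alone do not carry over the adapted step size or mass matrix, so warmup still has work to do and the saving may be modest.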
Thanks for the comments, Bob. I apologize for the unclear description of the model recovery. I have compared fifteen candidate models using both R-hat and LOOIC. The winning model had an R-hat value below 1.1 and the lowest LOOIC of the fifteen models. Additionally, I performed a posterior predictive check (PPC), using the 4K posterior samples, to assess whether the winning model fit the actual behavioral data well. During the model fitting process, each candidate model is fitted with four independent MCMC chains using 1K iterations (after 2K iterations for initial warmup per chain), resulting in 4K valid posterior samples.
To guard against model misidentification in the model comparison, I would like to perform a model recovery analysis. In this analysis, the six most representative models will be selected, and for each model, one draw selected at random from the 4K valid posterior samples will be used to generate a synthetic dataset of 139 valid participants. Each synthetic dataset from a given generating model is then fitted with each of the six candidate models, and the best-fitting model is identified by its LOOIC score. I will repeat this procedure 100 times and calculate, across all synthetic datasets from a given generating model, the percentage of datasets for which each model is identified as the best. If the highest percentage goes to the fitted model that matches the generating model, the model is considered identifiable.
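The tallying step of this procedure can be sketched as follows. This is only the bookkeeping after the fits are done, under the assumption that each synthetic dataset yields one LOOIC per fitted model. The model names `M1` and `M2`, the LOOIC numbers, and the helper names are placeholders for illustration, not results.

```python
from collections import Counter, defaultdict


def best_by_looic(looic):
    """Return the name of the fitted model with the lowest LOOIC."""
    return min(looic, key=looic.get)


def recovery_table(results):
    """results: iterable of (generating_model, {fitted_model: LOOIC}) pairs.

    Returns {generating_model: {fitted_model: percent of datasets won}}.
    """
    wins = defaultdict(Counter)
    for generating, looic in results:
        wins[generating][best_by_looic(looic)] += 1
    table = {}
    for generating, counter in wins.items():
        total = sum(counter.values())
        table[generating] = {m: 100.0 * c / total for m, c in counter.items()}
    return table


def is_recovered(table, model):
    """True if the generating model is also the most frequent winner."""
    row = table[model]
    return max(row, key=row.get) == model


# Made-up LOOIC values for two placeholder models, for illustration only.
toy_results = [
    ("M1", {"M1": 100.0, "M2": 105.0}),
    ("M1", {"M1": 98.0, "M2": 101.0}),
    ("M1", {"M1": 103.0, "M2": 102.0}),
    ("M1", {"M1": 99.0, "M2": 104.0}),
    ("M2", {"M1": 110.0, "M2": 100.0}),
    ("M2", {"M1": 108.0, "M2": 97.0}),
]
table = recovery_table(toy_results)
print(table)
print(is_recovered(table, "M1"), is_recovered(table, "M2"))
```

With six models and 100 datasets each, the same table becomes the usual 6 by 6 confusion matrix, with the recovery rates on its diagonal.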
However, as mentioned in my previous comment, this process may take approximately 3,600 hours. Given this, I’m wondering if there are faster methods to achieve model recovery. In your reply, you suggested reducing the warmup iterations from 2K to 1K, which is a good approach. However, as I’m a newbie in computational modeling, I’m not sure how to use warm starts from one model to initialize another in RStudio, though I understand that this could also be an effective strategy. Thanks again for your feedback, and please feel free to let me know if anything else is unclear.
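A quick check of the cost arithmetic may help when planning. If the design is read as 6 generating models, 100 synthetic datasets per generating model, and 6 fitted models per dataset, then 3,600 is the number of fits rather than the number of hours. The sketch below multiplies that out at 3 and 4 hours per fit. The worker count of 50 is a hypothetical value, and dividing by it assumes the fits are independent jobs that can run side by side with perfect scaling.

```python
def recovery_cost(n_generating, n_datasets, n_fitted, hours_per_fit, n_workers=1):
    """Return (number of fits, serial hours, idealized wall-clock hours)."""
    n_fits = n_generating * n_datasets * n_fitted
    serial_hours = n_fits * hours_per_fit
    # Idealized: assumes independent fits spread evenly over the workers.
    wall_hours = serial_hours / n_workers
    return n_fits, serial_hours, wall_hours


for hours in (3.0, 4.0):
    n_fits, serial, wall = recovery_cost(6, 100, 6, hours, n_workers=50)
    print(
        f"{n_fits} fits at {hours:.0f} h each: "
        f"{serial:.0f} h serial, about {wall:.0f} h on 50 workers"
    )
```

Because every dataset and model combination is a separate fit, the whole analysis is a natural candidate for a cluster or a pool of cores, in addition to any per-fit speedups from shorter warmup or a more efficient model.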