I have found that the time spent simulating data is well invested. When I have not done this initially, I’ve always ended up doing it later to debug the model. Often that code is quite similar to the code I’d write anyway for the posterior predictive checks, so I can reuse a lot of it.
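As a minimal sketch of that reuse (the model, names, and parameter values here are all made up, not anyone’s actual code): the same simulator that generates fake data up front can later be fed posterior draws for the predictive checks.

```r
# Hypothetical simulator for a simple linear regression; the same
# function can later be applied to posterior draws for PPCs.
sim_y <- function(x, alpha, beta, sigma) {
  rnorm(length(x), mean = alpha + beta * x, sd = sigma)
}

# Fake data with known parameters: fit the model to this first and
# check that alpha, beta, and sigma are recovered.
x      <- runif(100, 0, 10)
y_fake <- sim_y(x, alpha = 1, beta = 2, sigma = 0.5)
```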
One exception is when I’m porting a model from an MLE framework, in which case I can check whether the results are reasonably close to what I was expecting. But if they aren’t, I end up doing the simulation anyway…
I’ve always found it better to simulate the data independently in R: it’s easier for me to work with, and easier to debug since it doesn’t require compiling the model. In fact, I have stopped using the generated quantities block entirely. It’s incredibly frustrating when your model finishes sampling but there’s a bug in the generated quantities, and RStan barfs when it finds an NA there and throws out your results.
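To illustrate, here is a sketch of doing that post-processing in R instead of in a generated quantities block. The `post` list would come from `rstan::extract(fit)` on a real fit; here it’s faked so the snippet stands alone, and the parameter names are assumptions:

```r
# Stand-in for rstan::extract(fit) on a hypothetical fit with scalar
# parameters alpha, beta, and sigma.
post <- list(alpha = rnorm(200, 1, 0.1),
             beta  = rnorm(200, 2, 0.1),
             sigma = abs(rnorm(200, 0.5, 0.05)))
x <- runif(100, 0, 10)

# One replicated data set per posterior draw, done outside Stan.
# A bug here only breaks this loop; the sampler's output is untouched.
y_rep <- sapply(seq_along(post$sigma), function(i)
  rnorm(length(x), post$alpha[i] + post$beta[i] * x, post$sigma[i]))
```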
I would strongly argue that if you can’t simulate data then you don’t understand the generative process, so framing this as a choice between spending effort on building a better model or on simulating data is a false dichotomy. Simulating data is critical for understanding the consequences of your modeling assumptions.
I totally second this. Sooner or later I have resorted to simulation in every project. I also prefer simulating in R, not only because the workflow is faster, but also because it serves as an independent double check that the model is correctly implemented in Stan.
I was curious whether simulating data is practically mandatory, even if it costs some time.
So far I have tended to build a model, test it on some real-world data, and see whether the inferred parameters match known facts about the data set or the science that have been published previously.
I mainly used simulated data when I had to solve some issue and needed to identify which part of the model was problematic, if any, or whether the real-world prior knowledge I was using was itself insufficient or unrepresentative of the observations whose quantities I wanted to infer.
@jonah an example: for my deconvolution problem (inferring the proportions of cell types within a tissue from the overall “gene production” observed in that tissue, across many replicates), the challenging part was not the principle itself, which is really quite straightforward, but how to treat the available prior information, which is often not representative of the observed data. In this case the tricky part was modelling the noise introduced by misleading prior knowledge, rather than the mathematical model of the deconvolution itself.
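A rough sketch of what simulating that generative process might look like in R. The sizes, distributions, and noise model are all assumptions for illustration, not the actual model:

```r
# Expression of G genes in a tissue as a mixture of K cell-type
# signatures, for R replicates. All names and sizes are made up.
set.seed(1)
G <- 50; K <- 4; R <- 20

# Hypothetical per-cell-type gene signatures.
signatures <- matrix(rgamma(G * K, shape = 2, rate = 1), G, K)

# True cell-type proportions per replicate (Dirichlet via gammas).
props <- t(apply(matrix(rgamma(R * K, shape = 1), R, K), 1,
                 function(g) g / sum(g)))

# Observed "gene production": mixture plus multiplicative noise.
# Misleading prior knowledge could then be mimicked by perturbing
# 'signatures' before handing them to the model as priors.
expr <- signatures %*% t(props) * exp(matrix(rnorm(G * R, 0, 0.1), G, R))
```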
I realised that I had to backtrack many times, and that building a solid simulated data set first is always worth the time. I just wanted to know the general opinion on best practices, so I can get rid of some bad habits for good.