Bayes work flow - simulated data sets?


#1

Hello,

when developing a model, I often wonder whether it is better:

  • to test it on some real data whose generative numerical process I think I know, or
  • to invest hours of work in building a simulated data set

(Are there cases where you decide not to invest time in producing simulated data?)

And is it better to:

  • generate the simulated data set from Stan directly, or
  • produce it independently, with R for example?

Thanks


#2

Hey Stefano,

What are examples of data for which you really know the generative numerical process but that aren’t simulated? Do you mean you know the generative biological mechanism or something like that?

On whether there are cases where simulating data isn’t worth the time: not really! (Maybe there are some super special cases that aren’t occurring to me at the moment.)

A few resources that touch on this topic:

As for simulating in Stan versus independently in R: either way is fine. Depending on the project, one may make more sense than the other, but you can always get the same result either way (up to PRNG noise).


#3

I have found that the time spent simulating data is well invested. When I have not done this initially, I’ve always ended up doing it later to debug the model. Often the code is quite similar to code I will be writing anyway to do the posterior predictive checks, so I can reuse a lot of it.

One exception is when I’m porting a model from an MLE framework, because then I can check whether the results are reasonably close to what I expected. But if they aren’t, I end up doing the simulation anyway…

I’ve always found it better to simulate the data independently in R. It is easier for me to work with and easier to debug, since it doesn’t require compiling the model. In fact, I have stopped using the generated quantities block entirely. It’s incredibly frustrating when your model finishes but there is a bug in the generated quantities block: RStan barfs when it finds an NA in the generated quantities and throws out your results.
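To make the idea concrete, here is a minimal sketch of simulating independently of the fitting code, written in Python rather than R and using only the standard library. The linear regression, the parameter values, and the function names are all hypothetical, and closed-form least squares stands in for the actual Stan fit. The point is only the shape of the check: pick parameters, simulate, fit, compare.

```python
import random

def simulate_linear(n, alpha, beta, sigma, seed=1):
    """Simulate y = alpha + beta * x + Normal(0, sigma) noise (hypothetical model)."""
    rng = random.Random(seed)
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
    return x, y

def least_squares(x, y):
    """Closed-form simple linear regression, used here as a stand-in for the real fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = mean_y - beta_hat * mean_x
    return alpha_hat, beta_hat

# Known "true" values chosen for the simulation (assumed, purely illustrative).
TRUE_ALPHA, TRUE_BETA, TRUE_SIGMA = 1.0, 2.5, 0.5

x, y = simulate_linear(2000, TRUE_ALPHA, TRUE_BETA, TRUE_SIGMA)
alpha_hat, beta_hat = least_squares(x, y)
print(f"alpha: true={TRUE_ALPHA}, estimated={alpha_hat:.3f}")
print(f"beta:  true={TRUE_BETA}, estimated={beta_hat:.3f}")
```

If the estimates land far from the values used to simulate, either the simulator or the fitting code is wrong, and you find that out before touching real data.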


#4

I would strongly argue that if you can’t simulate data then you don’t understand the generative process, hence the decision to focus effort on either building a better model or simulating data is a false dichotomy. Simulating data is critical for understanding the consequences of your modeling assumptions.
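A small sketch of what "consequences of your modeling assumptions" can mean in practice, assuming a hypothetical count model whose expected count is exp(intercept) with a Normal(0, prior_sd) prior on the intercept. None of this comes from a specific model in this thread. It only shows how simulating from the prior exposes what the prior implies on the data scale.

```python
import math
import random

def implied_expected_counts(prior_sd, n_draws=5000, seed=2):
    """Draw intercepts from Normal(0, prior_sd) and map them to expected counts."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(0.0, prior_sd)) for _ in range(n_draws)]

def share_above(values, threshold):
    """Fraction of simulated values that exceed the threshold."""
    return sum(v > threshold for v in values) / len(values)

# Two candidate priors on the log scale (both hypothetical).
wide = implied_expected_counts(prior_sd=10.0)
narrow = implied_expected_counts(prior_sd=1.0)

share_wide = share_above(wide, 1000.0)
share_narrow = share_above(narrow, 1000.0)
print(f"prior sd 10: {share_wide:.1%} of draws imply an expected count above 1000")
print(f"prior sd 1:  {share_narrow:.1%} of draws imply an expected count above 1000")
```

A prior that looks harmless on the log scale can put a large share of its mass on expected counts that make no scientific sense, and that is hard to see without simulating.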


#5

Yeah I completely agree. I was just being (overly) nitpicky about the language regarding really knowing the numerical process, and trying to emphasize the importance of simulated data.


#6

I totally second this. I have always resorted to simulation sooner or later for all projects. I also prefer simulating in R not only because the workflow is faster, but also because it serves as a double check that the model is correctly implemented in Stan.


#7

Thanks for the discussion,

I was curious whether simulating data is practically mandatory, even if it costs some time.

So far I have tended to build a model, test it on some real-world data, and see whether the inferred parameters match previously published facts about the data set or the science.

I mainly used simulated data when I had to solve some issue and needed to identify which part of the model was problematic, if any, or whether the real-world prior knowledge I was using was itself insufficient or not representative of the observations I wanted to make inferences about.

@jonah an example is my deconvolution problem (inferring the proportions of cell types within a tissue from the overall “gene production” observed for that tissue, across many replicates). The challenging part was not the principle itself, which is really quite straightforward, but how to treat the available prior information, which is often not representative of the observed data. In this case the tricky part was modelling the noise introduced by misleading prior knowledge, rather than the mathematical model of the deconvolution.

I realised that I had to backtrack many times, and that building a solid simulated data set first is always worth the time. I just wanted to know the general opinion on best practices, to rule out some bad habits for good.
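For the deconvolution case, a toy simulation along these lines might look like the sketch below. Everything in it is an assumption made for illustration: two cell types only, uniform random signatures, Gaussian noise, a reference for cell type A inflated by a factor of 1.5, and a plain least-squares estimator rather than a Bayesian model. It is not the actual model discussed here.

```python
import random

def simulate_tissue(sig_a, sig_b, prop_a, noise_sd, rng):
    """Mix two cell-type signatures with a known proportion and add Gaussian noise."""
    return [prop_a * a + (1.0 - prop_a) * b + rng.gauss(0.0, noise_sd)
            for a, b in zip(sig_a, sig_b)]

def estimate_prop_a(tissue, ref_a, ref_b):
    """Least-squares proportion of type A, given reference signatures for A and B."""
    num = sum((y - b) * (a - b) for y, a, b in zip(tissue, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    return num / den

rng = random.Random(4)
N_GENES = 500        # hypothetical number of genes
TRUE_PROP_A = 0.3    # known proportion used to simulate

# True signatures of the two cell types (arbitrary expression units).
sig_a = [rng.uniform(5.0, 15.0) for _ in range(N_GENES)]
sig_b = [rng.uniform(5.0, 15.0) for _ in range(N_GENES)]
tissue = simulate_tissue(sig_a, sig_b, TRUE_PROP_A, noise_sd=0.5, rng=rng)

# Case 1: the reference matches the signatures that generated the tissue.
prop_good = estimate_prop_a(tissue, sig_a, sig_b)

# Case 2: the reference for type A is not representative (inflated by 1.5).
skewed_ref_a = [1.5 * a for a in sig_a]
prop_skewed = estimate_prop_a(tissue, skewed_ref_a, sig_b)

print(f"true proportion:           {TRUE_PROP_A}")
print(f"estimate, matching ref:    {prop_good:.3f}")
print(f"estimate, skewed ref:      {prop_skewed:.3f}")
```

Because the true proportion is known, the simulation separates the two failure modes: the mixing model itself recovers the proportion, while the non-representative reference is what pulls the estimate away.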


#8

@stemangiola -

Not always, but sometimes it’s possible to simulate fake data from the same Stan code you use to fit your model using this approach: http://modernstatisticalworkflow.blogspot.com/2017/04/an-easy-way-to-simulate-fake-data-from.html

This can save you some time.


#9

The simulation has to be per model. The point is that you want to simulate data from the model, not from something roughly like the model.
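One way to see why this matters, sketched with a hypothetical conjugate normal model so the posterior is available in closed form (the prior, noise level, and sample size are all made up for illustration). When the parameter is drawn from the prior and the data from the likelihood of the same model, a 90% posterior interval should contain the true parameter in about 90% of simulations.

```python
import random
from statistics import NormalDist

# Hypothetical model: mu ~ Normal(0, 1), y_i ~ Normal(mu, 2), with 10 observations.
PRIOR_MEAN, PRIOR_SD = 0.0, 1.0
SIGMA = 2.0
N_OBS = 10

def simulate_from_model(rng):
    """Draw mu from the prior, then data from the likelihood of the same model."""
    mu = rng.gauss(PRIOR_MEAN, PRIOR_SD)
    y = [rng.gauss(mu, SIGMA) for _ in range(N_OBS)]
    return mu, y

def conjugate_posterior(y):
    """Exact posterior mean and sd of mu for a normal prior and known sigma."""
    prior_prec = 1.0 / PRIOR_SD ** 2
    data_prec = len(y) / SIGMA ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * PRIOR_MEAN + sum(y) / SIGMA ** 2)
    return post_mean, post_var ** 0.5

def interval_coverage(n_reps=2000, level=0.9, seed=3):
    """Fraction of simulations whose central posterior interval contains the true mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = 0
    for _ in range(n_reps):
        mu, y = simulate_from_model(rng)
        post_mean, post_sd = conjugate_posterior(y)
        hits += post_mean - z * post_sd <= mu <= post_mean + z * post_sd
    return hits / n_reps

coverage = interval_coverage()
print(f"coverage of the 90% posterior interval: {coverage:.3f}")
```

This calibration only holds when the simulator and the fitted model are the same model. If the data come from something merely similar, a coverage mismatch no longer tells you whether the code is buggy or the simulator is simply different.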