Bayes work flow - simulated data sets?

stemangiola · April 28, 2018, 6:14pm

Hello,

When developing models, I am often unsure whether it is better:

to test on some data whose generative numerical process I think I know, or
to invest hours of work in building a simulated data set

(Are there cases where you decide not to invest time in producing simulated data?)

And is it better to:

generate the simulated data set from Stan directly, or
produce the simulated data set independently, with R for example?

Thanks

jonah · April 29, 2018, 7:25pm

Hey Stefano,

What are examples of data for which you really know the generative numerical process but that aren’t simulated? Do you mean you know the generative biological mechanism or something like that?

As for cases where I’d skip simulating data: not really! (Maybe some super special cases that aren’t occurring to me at the moment.)

A few resources that touch on this topic:

Best Practices wiki
Visualization in Bayesian workflow paper and code
many of @betanalpha’s case studies

As for simulating in Stan versus R: either way is fine. Depending on the project, one may make more sense than the other, but you can always get the same result either way (up to PRNG noise).

aaronjg · April 29, 2018, 9:13pm

I have found that the time spent simulating data is well invested. When I have not done this initially, I’ve always ended up doing it later to debug the model. Often the code is quite similar to code I will be writing anyway to do the posterior predictive checks, so I can reuse a lot of it.

One exception is when I’m porting a model from an MLE framework, since then I can check whether the results are reasonably close to what I was expecting. But if they aren’t, I end up doing the simulation anyway…

I’ve always found it better to simulate the data independently in R. It is easier for me to work with and easier to debug, since it doesn’t need to compile the model. In fact, I have stopped using the generated quantities block entirely. It’s incredibly frustrating when your model finishes but there is a bug in the generated quantities block: RStan barfs when it finds an NA in the generated quantities and throws out your results.
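A minimal sketch of simulating data independently of the model code, written in Python rather than R. The linear regression, the parameter values, and the helper names `simulate_linear_data` and `least_squares_fit` are all assumptions made for illustration, not anything from this thread. A closed-form least-squares fit stands in for the Stan fit so that the example runs on its own.

```python
import random


def simulate_linear_data(n, alpha, beta, sigma, rng):
    """Draw n (x, y) pairs from y = alpha + beta * x + Normal(0, sigma) noise."""
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def least_squares_fit(xs, ys):
    """Closed-form ordinary least squares for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical "true" parameters: because we chose them, we know the right answer.
rng = random.Random(2018)
true_alpha, true_beta, true_sigma = 1.0, 2.5, 0.5

xs, ys = simulate_linear_data(500, true_alpha, true_beta, true_sigma, rng)
alpha_hat, beta_hat = least_squares_fit(xs, ys)

print(f"alpha: true={true_alpha}, estimated={alpha_hat:.3f}")
print(f"beta:  true={true_beta}, estimated={beta_hat:.3f}")
```

In a real project the fitting step would be the Stan model, and the simulator could be reused to draw replicated data for posterior predictive checks.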

betanalpha · April 30, 2018, 1:38am

I would strongly argue that if you can’t simulate data then you don’t understand the generative process, hence the decision to focus effort on either building a better model or simulating data is a false dichotomy. Simulating data is critical for understanding the consequences of your modeling assumptions.
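One way to see the consequences of modeling assumptions is to simulate data from the priors alone, before touching real data. The sketch below is a hypothetical example, not a model from this thread: the height model, the prior scales, and the 0 to 300 cm plausibility range are all assumptions chosen for illustration.

```python
import random


def prior_predictive_heights(n_draws, mu_prior_sd, rng):
    """Draw parameters from the priors, then one simulated observation per draw."""
    draws = []
    for _ in range(n_draws):
        mu = rng.gauss(170.0, mu_prior_sd)   # assumed prior on the mean height (cm)
        sigma = abs(rng.gauss(0.0, 20.0))    # assumed half-normal prior on the spread
        draws.append(rng.gauss(mu, sigma))   # simulated observation given mu, sigma
    return draws


def fraction_implausible(heights):
    """Share of simulated heights that are negative or above 300 cm."""
    bad = sum(1 for h in heights if h < 0.0 or h > 300.0)
    return bad / len(heights)


rng = random.Random(42)
wide = fraction_implausible(prior_predictive_heights(10_000, 100.0, rng))
tight = fraction_implausible(prior_predictive_heights(10_000, 10.0, rng))

print(f"implausible share with prior sd 100 on the mean: {wide:.3f}")
print(f"implausible share with prior sd 10 on the mean:  {tight:.3f}")
```

A prior that looks harmlessly vague on the parameter scale can put a noticeable share of simulated observations in impossible territory, which is hard to spot without simulating.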

jonah · April 30, 2018, 6:09am

Yeah I completely agree. I was just being (overly) nitpicky about the language regarding really knowing the numerical process, and trying to emphasize the importance of simulated data.

martinmodrak · April 30, 2018, 6:51am

I totally second this. I have always resorted to simulation sooner or later for all projects. I also prefer simulating in R not only because the workflow is faster, but also because it serves as a double check that the model is correctly implemented in Stan.
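A rough sketch of how an independent simulator can act as a double check on an implementation. Everything here is an assumption for illustration: a toy conjugate model (mu ~ Normal(0, 1), y ~ Normal(mu, 1)) whose posterior is known in closed form stands in for a Stan fit, and `posterior_buggy` contains a deliberately planted error. If the implementation is right, 90% posterior intervals should contain the simulated true value about 90% of the time.

```python
import math
import random
from statistics import NormalDist

# Half-width multiplier for a central 90% interval of a normal distribution.
Z_90 = NormalDist().inv_cdf(0.95)


def posterior_correct(ys):
    """Posterior mean and sd of mu for prior Normal(0, 1) and unit-variance data."""
    n = len(ys)
    return sum(ys) / (n + 1), 1.0 / math.sqrt(n + 1)


def posterior_buggy(ys):
    """Same posterior with a planted bug: the variance is returned as the sd."""
    n = len(ys)
    return sum(ys) / (n + 1), 1.0 / (n + 1)


def coverage(posterior, n_reps, n_obs, rng):
    """Fraction of simulated data sets whose 90% interval contains the true mu."""
    hits = 0
    for _ in range(n_reps):
        mu = rng.gauss(0.0, 1.0)                          # true value, from the prior
        ys = [rng.gauss(mu, 1.0) for _ in range(n_obs)]   # data, from the likelihood
        mean, sd = posterior(ys)
        if abs(mu - mean) <= Z_90 * sd:
            hits += 1
    return hits / n_reps


rng = random.Random(7)
good = coverage(posterior_correct, 2000, 10, rng)
bad = coverage(posterior_buggy, 2000, 10, rng)

print(f"coverage with correct posterior: {good:.3f}")
print(f"coverage with buggy posterior:   {bad:.3f}")
```

Because the simulator and the inference code are written separately, a mistake in either one tends to show up as coverage far from the nominal level.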

stemangiola · April 30, 2018, 3:40pm

Thanks for the discussion,

I was curious whether simulating data is practically mandatory, even if it costs some time.

So far I have tended to build a model, test it on some real-world data, and see whether the inferred parameters match previously published facts about the data set or the science.

I mainly used simulated data when I had to solve some issue and needed to identify which part of the model was problematic, if any, or whether the real-world prior-knowledge data I was using was itself insufficient or not representative of the observations I wanted to make inferences about.

@jonah an example could be my deconvolution problem (understanding the proportions of cell types within a tissue from the overall “gene production” observed for that tissue, across many replicates). The challenging part was not the principle itself, which is really quite straightforward, but how to treat the available prior information, which is often not representative of the observed data. In this case a tricky part was modeling the noise introduced by misleading prior knowledge, rather than the mathematical model of the deconvolution.

I realised that I had to backtrack many times, and that building a solid simulated data set first is always worth the time. I just wanted to know the general opinion on best practices, to rule out some bad habits for good.

James_Savage · April 30, 2018, 7:55pm

@stemangiola -

Not always, but sometimes it’s possible to simulate fake data from the same Stan code you use to fit your model using this approach: http://modernstatisticalworkflow.blogspot.com/2017/04/an-easy-way-to-simulate-fake-data-from.html

This can save you some time.

Bob_Carpenter · May 24, 2018, 5:22am

The simulation has to be per model. The point is that you want to simulate data from the model, not from something roughly like the model.
