How to check the quality of simulated data?

Hello. I have a conceptual question.

I have a hierarchical Bayesian model that looks like this:

\alpha_i \sim \mathcal{N}\left(\mu_\alpha, \sigma_\alpha\right) \tag{1}
\beta_i \sim \mathcal{N}\left(\mu_\beta, \sigma_\beta\right) \tag{2}
\gamma_i \sim \mathcal{N}\left(\mu_\gamma, \sigma_\gamma\right) \tag{3}

So, my model has lower-level parameters \alpha_i, \beta_i, \gamma_i for each participant i. Also, there are upper level parameters like \mu_\alpha \ldots \sigma_\gamma.

Assume I have specified an observation model so that I can fit these parameters with the NUTS sampler (let’s say 4 chains, 1000 posterior samples each).

After fitting the model, I will get 4000 posterior samples for each parameter, let’s denote them by \alpha_i^{\text{est}}, \beta_i^{\text{est}}, \gamma_i^{\text{est}}, \mu_\alpha^{\text{est}}, \ldots \sigma_\gamma^{\text{est}}.

Now, I want to simulate lower-level parameters using the upper-level parameters, and check if the simulated parameters are “close” to the estimated ones.

The way I have done this is described below:

  1. Use \mu_\alpha^{\text{est}}, \ldots \sigma_\gamma^{\text{est}} in (1) -(3) to sample lower-level parameters, let’s denote them by \alpha_i^{\text{sim}}, \beta_i^{\text{sim}},\gamma_i^{\text{sim}}.

  2. Now, we can use PCA to visualize these parameters on a cartesian plane. We can fit the PCA map on \alpha_i^{\text{est}}, \beta_i^{\text{est}}, \gamma_i^{\text{est}} which would be fit on an array of size (4000, 3). After fitting the PCA map, we can use it to reduce the dimension to (4000, 2).

  3. We can also use the fitted PCA map to reduce the dimension of simulated parameters \alpha_i^{\text{sim}}, \beta_i^{\text{sim}},\gamma_i^{\text{sim}} from (4000, 3) to (4000, 2).

  4. Now, we can plot both of these and see if there’s an overlap.

I want to know if this is an ok thing to do or am I violating any assumptions?

Note that, I can also plot the prior parameters by simply sampling from (1)-(3) and then using the fitted PCA map to reduce them. Ideally, on the PCA plot, the estimated and simulated parameters would have a large overlap and they will both be within the area colored by prior parameters.

I want to know if this a reasonable way to assess the quality of simulated data. Any suggestions or guidance will be highly appreciated. Thanks!

Maybe I’m missing something, but are you trying to perform a parameter recovery to see if your Stan program returns parameter values close to the ones underlying the data of the generative model? Because you can assess that by plotting scatter plots of each ground truth value vs estimated value, one parameter at a time (including hyperparameters if your generative model uses them), getting a deming regression line from the pairs, and comparing that line to unity. You could hypothetically run the same generative model parameters for ten data sets and plot all of the deming regressions and see if they aggregate around the line of unity. You can also get a correlation value for the ground truth vs recovered parameters in each run, though for many non-normal hyperparameters (like lognormal(0,1) for a sigma hyperparameter), you’d either have to use a rank-based correlation like a Spearman’s rho, or a copula.