Generating mock data from estimated quantities


My project has two parts. First, fit a model (using stan) to the data, and retrieve estimates of the parameters of interest. (This part goes alright, though the “theta” parameters get strongly shrunk to 0, sd of the thetas about 0.3 despite specification of 1. I chalked this up to shrinkage, which is mostly expected given I’m intentionally investigating limits of data sufficiency.). See stan code below:

   data {
     int<lower=0> N;
     int<lower=0> n_t;
     int<lower=0> n_s;
     real w[N]; //scores
     int v[N]; // sIDs
     int u[N]; // tIDs
     //s parameters
        real<lower=0> aparams[n_s];
        real<lower=0> alphaparams[n_s];
     //t parameters
        real thetas[n_t];
     //optional helpers
        real<lower=0> a_mean;
        real<lower=0> a_sd;
        real<lower=0> alpha_mean;
        real<lower=0> alpha_sd;
     thetas ~ normal(0,1);
     aparams ~ lognormal(a_mean, a_sd);
     alphaparams ~ lognormal(alpha_mean, alpha_sd);
     for (k in 1:N)
       w[k] ~ normal(alphaparams[v[k]]*thetas[u[k]], 1 / aparams[v[k]]);

Second step is to use those parameters to simulate similar data sets as the original data (I don’t use stan for this; just generate things in R), and subject them to a similar analysis (essentially, split given data into subsets, and compare stability of estimates for theta across subsets). For those who use R, the data generating code is:

(rnorm(nrow(matchmatrix), 0, 1)/params$a + (params$alpha*params$theta))

This is where I’m running into big issues. On the aggregate, the scores (i.e. w vector) look relatively similar. But the simulated data produces very different results (originally correlation of thetas between subsets was about 0.5-0.6; now they’re consistently at 0.8-0.9+).

I think I found the culprit once I ran the above stan code on the simulated data. The suggested a parameters and alpha parameters are way off; nowhere near the skew of the original a parameters, and nearly an order of magnitude higher for alpha parameters, (theta parameters are mostly similar). I’ve run into the same issue using both the data generating process given by the model and the suggested by estimates step (1), as well as just sampling from the s and t parameters (the mean estimates) from step (1) without imposing a prior distribution. Issue exists in both among individual chains, and in the aggregate estimates after the chains are combined.

Is this just the result of stan shrinking my estimates? If so, is there a way to “unshrink” the estimates, or rigorously unsteady the estimates so I can get a more realistic data set? Or do you suspect there’s something else going on?

Apologies if this has been similarly asked elsewhere; references would be just as appreciated as answers.

P.S. For reference, I’m running this in R, under the options of nchains = 4, standelta= 0.95, stantreedepth = 15, stansteps = 3000.

Hi, welcome to the Stan forum.

I had a brief look at you code, and the first thing that caught my eye is that your R-code

does seem to use parameters that are not in the Stan model, which does not have a or alpha as parameters. Could this be part of the problem?

More general observation are:

  • It’s easier to see what is going on if you also post how you generated the data in R
  • some of your parameters (a_mean, a_sd, alpha_mean, alpha_sd) do not have priors (except the lower bound). It think should be easier with priors on these parameters
  • you are specifying a precision instead of a standard deviation (aparams). I find it easier to understand and specify priors for standard deviations.
  • an indicator that something is already going wrong during parameter estimation are Rhat values and divergences. Do these look OK (given the very weak prior information for everything expect theta I imagine this model is not easy to fit )

(On a side note: It looks as if you are implementing some type of IRT model. These can be implemented without much difficulty in brms, here is more info on that:

Hello Guido,

I’m confused by one of your comment that the stan model doesn’t use a or alpha parameters. Does the following model code not do the trick?

     thetas ~ normal(0,1);
     aparams ~ lognormal(a_mean, a_sd);
     alphaparams ~ lognormal(alpha_mean, alpha_sd);
     for (k in 1:N)
       w[k] ~ normal(alphaparams[v[k]]*thetas[u[k]], 1 / aparams[v[k]]);

And yes indeed, this is based on the continuous response model proposed by Samejima; the link you provided does have some useful code for me to try and adapt for this setting.

What I mean is that you have params$alpha in the R code, but no parameter in you Stan model is literally called alpha. I see that you have alphaparams, alpha_mean, and alpha_sd, but for somebody who does not see the complete code, it is not possible to know if params$alpha corresponds to any of these. (I would recommend to give parameters the same name in R and Stan to avoid any confusion)