Hello,
My project has two parts. First, fit a model (using stan) to the data, and retrieve estimates of the parameters of interest. (This part goes alright, though the “theta” parameters get strongly shrunk to 0, sd of the thetas about 0.3 despite specification of 1. I chalked this up to shrinkage, which is mostly expected given I’m intentionally investigating limits of data sufficiency.). See stan code below:
data {
int<lower=0> N;
int<lower=0> n_t;
int<lower=0> n_s;
real w[N]; //scores
int v[N]; // sIDs
int u[N]; // tIDs
}
parameters{
//s parameters
real<lower=0> aparams[n_s];
real<lower=0> alphaparams[n_s];
//t parameters
real thetas[n_t];
//optional helpers
real<lower=0> a_mean;
real<lower=0> a_sd;
real<lower=0> alpha_mean;
real<lower=0> alpha_sd;
}
model{
thetas ~ normal(0,1);
aparams ~ lognormal(a_mean, a_sd);
alphaparams ~ lognormal(alpha_mean, alpha_sd);
for (k in 1:N)
w[k] ~ normal(alphaparams[v[k]]*thetas[u[k]], 1 / aparams[v[k]]);
}
Second step is to use those parameters to simulate similar data sets as the original data (I don’t use stan for this; just generate things in R), and subject them to a similar analysis (essentially, split given data into subsets, and compare stability of estimates for theta across subsets). For those who use R, the data generating code is:
(rnorm(nrow(matchmatrix), 0, 1)/params$a + (params$alpha*params$theta))
This is where I’m running into big issues. On the aggregate, the scores (i.e. w vector) look relatively similar. But the simulated data produces very different results (originally correlation of thetas between subsets was about 0.5-0.6; now they’re consistently at 0.8-0.9+).
I think I found the culprit once I ran the above stan code on the simulated data. The suggested a parameters and alpha parameters are way off; nowhere near the skew of the original a parameters, and nearly an order of magnitude higher for alpha parameters, (theta parameters are mostly similar). I’ve run into the same issue using both the data generating process given by the model and the suggested by estimates step (1), as well as just sampling from the s and t parameters (the mean estimates) from step (1) without imposing a prior distribution. Issue exists in both among individual chains, and in the aggregate estimates after the chains are combined.
Is this just the result of stan shrinking my estimates? If so, is there a way to “unshrink” the estimates, or rigorously unsteady the estimates so I can get a more realistic data set? Or do you suspect there’s something else going on?
Apologies if this has been similarly asked elsewhere; references would be just as appreciated as answers.
P.S. For reference, I’m running this in R, under the options of nchains = 4, standelta= 0.95, stantreedepth = 15, stansteps = 3000.