Simulation to ensure model accuracy with covariate effects, how to ensure all parameters match true values?

Currently I am trying to run a simulation to check that my model is correct. When checking the estimates, I noticed that my “item_difficulty” matches the true values. However, this was achieved with different “difficulty” and “beta_difficulty” estimates than the true values. Even though I end up obtaining the same “item_difficulty” estimate, I feel like this indicates some type of problem. In particular, if I wanted to look at the covariate-effect coefficients or the difficulty estimates on their own, they would be different from what they should be.

Any potential reasons for this occurring?

data {
  int<lower=0> n_examinee; // Number of examinees
  int<lower=0> n_item; // Number of items

  int<lower=1> Z;    // number of observations (= n_examinee * n_item)
  int<lower=0> Y[Z]; // count data, flattened to a vector
  
  int<lower=1> K;  // number of item-level covariates (e.g., item types)
  matrix[n_item, K] X; 
}
parameters {
  vector<lower=-6, upper=6>[n_examinee] theta;  // examinee ability
  vector<lower=-6, upper=6>[n_item] difficulty; // baseline item difficulty
  vector[K] beta_difficulty;                    // covariate effects on difficulty
}

transformed parameters {
  
  vector[n_item] item_difficulty;
  for (i in 1:n_item) {
    item_difficulty[i] = difficulty[i] + dot_product(X[i], beta_difficulty);
  }
}

model {
  theta ~ normal(0, 3);
  beta_difficulty ~ normal(0,1);
  difficulty ~ normal(0,3);

  matrix[n_examinee, n_item] lambdas;
  for(i in 1:n_examinee){
    for (j in 1:n_item){
      lambdas[i,j] = exp(theta[i] + item_difficulty[j]);
    }
  }
  Y ~ poisson(to_vector(lambdas));
}


I wasn’t sure what you meant by this:

> Currently I am trying to run a simulation to ensure my model is correct. When checking the estimates, I noticed that my “item_difficulty” matches the true values. However, this was achieved through the estimation with different “difficulty” and “beta_difficulty” values than the true ones.

To do this test, you should simulate according to the way the model is written. So you simulate difficulty and theta and beta_difficulty from their priors, then simulate the data from the parameters. As long as you do that in a way that matches the way your model is defined, you should be able to recover your model parameters in their posterior intervals at nominal coverage.
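As a concrete sketch of that workflow in Stan itself (assumptions: you run this as a standalone program with the fixed_param algorithm for a single iteration; the truncated-normal draws use rejection so the “true” values respect the declared bounds; with priors this wide, some draws can produce very large Poisson rates):

    data {
      int<lower=0> n_examinee;
      int<lower=0> n_item;
      int<lower=1> K;
      matrix[n_item, K] X;
    }
    generated quantities {
      vector[n_examinee] theta;
      vector[n_item] difficulty;
      vector[K] beta_difficulty;
      int Y[n_examinee * n_item];
      // draw "true" parameters from the model's priors
      for (i in 1:n_examinee) {
        theta[i] = normal_rng(0, 3);
        while (theta[i] < -6 || theta[i] > 6)
          theta[i] = normal_rng(0, 3);
      }
      for (j in 1:n_item) {
        difficulty[j] = normal_rng(0, 3);
        while (difficulty[j] < -6 || difficulty[j] > 6)
          difficulty[j] = normal_rng(0, 3);
      }
      for (k in 1:K)
        beta_difficulty[k] = normal_rng(0, 1);
      // draw the data from the likelihood, column-major to match to_vector()
      for (j in 1:n_item)
        for (i in 1:n_examinee)
          Y[(j - 1) * n_examinee + i]
            = poisson_log_rng(theta[i] + difficulty[j] + X[j] * beta_difficulty);
    }

Feed the resulting Y back into the fitting program and check that the posterior intervals cover the drawn parameter values.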

Some more general Stan/stats comments:

  1. You can rewrite the transformed parameters much more efficiently and compactly as

    vector[n_item] item_difficulty = difficulty + X * beta_difficulty;
    
  2. The interval constraints can be challenging. Is there a reason you bounded the items? The problem is that exactly the place where the bounds matter is where they cause problems by bunching up probability mass near the boundaries.

  3. I would strongly recommend not taking in Z as a data argument, because it has to be equal to n_examinee * n_item. Instead, just use n_examinee * n_item directly and then the data can never be inconsistent. For the same reason, I’d be inclined to take in Y as an n_examinee x n_item 2D array and do the conversion to a 1D array inside the Stan program, so that it’s easier to use from outside the Stan program.

  4. Stan has a built-in poisson_log distribution that takes its argument on the log scale and is much more arithmetically stable than exponentiating yourself. The arithmetic defining the matrix lambdas can also be vectorized. Together, the last few lines of code become

    matrix[n_examinee, n_item] log_lambdas;
    for (j in 1:n_item) log_lambdas[, j] = theta;                 // ability varies down each column
    for (i in 1:n_examinee) log_lambdas[i] += item_difficulty';   // difficulty varies across each row
    Y ~ poisson_log(to_vector(log_lambdas));
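To make point 3 concrete, the data block could be restructured along these lines (a sketch; the 1D array y declared here would then replace Y in the model block):

    data {
      int<lower=0> n_examinee;
      int<lower=0> n_item;
      int<lower=0> Y[n_examinee, n_item];  // counts, one row per examinee
      int<lower=1> K;
      matrix[n_item, K] X;
    }
    transformed data {
      int<lower=0> y[n_examinee * n_item];
      for (j in 1:n_item)
        for (i in 1:n_examinee)
          y[(j - 1) * n_examinee + i] = Y[i, j];  // column-major, matching to_vector()
    }

This way Z disappears entirely and the flattening order is fixed in one place inside the program.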