Currently I am trying to run a simulation to ensure my model is correct. When checking the estimates, I noticed that my “item_difficulty” matches the true values. However, this was achieved through the estimation with different “difficulty” and “beta_difficulty” values than the true ones. Even though I end up obtaining the same “item_difficulty” estimate, I feel like this indicates some type of problem, especially if I wanted to look at the covariate effect coefficients or the difficulty estimates on their own: they would be different from what they should be.
Any potential reasons for this occurring?
data {
  int<lower=0> n_examinee;   // number of examinees
  int<lower=0> n_item;       // number of items
  int<lower=1> Z;            // number of observations (n_examinee * n_item)
  int<lower=0> Y[Z];         // the count data matrix, flattened to a vector
  int<lower=1> K;            // number of item-level covariates (e.g., item types)
  matrix[n_item, K] X;       // item-level covariate matrix
}
parameters {
  vector<lower=-6, upper=6>[n_examinee] theta;     // person ability
  vector<lower=-6, upper=6>[n_item] difficulty;    // baseline item difficulty
  vector[K] beta_difficulty;                       // covariate effects on difficulty
}
transformed parameters {
  vector[n_item] item_difficulty;
  for (i in 1:n_item) {
    item_difficulty[i] = difficulty[i] + dot_product(X[i], beta_difficulty);
  }
}
model {
  matrix[n_examinee, n_item] lambdas;
  theta ~ normal(0, 3);
  beta_difficulty ~ normal(0, 1);
  difficulty ~ normal(0, 3);
  for (i in 1:n_examinee) {
    for (j in 1:n_item) {
      lambdas[i, j] = exp(theta[i] + item_difficulty[j]);
    }
  }
  Y ~ poisson(to_vector(lambdas));
}
I wasn’t sure what you meant by this:
Currently I am trying to run a simulation to ensure my model is correct. When checking the estimates, I noticed that my “item_difficulty” matches the true values. However, this was achieved through the estimation with different “difficulty” and “beta_difficulty” values than the true ones.
To do this test, you should simulate according to the way the model is written. That is, you simulate difficulty, theta, and beta_difficulty from their priors, then simulate the data from those parameters. As long as you do that in a way that matches how your model is defined, you should be able to recover your model parameters in their posterior intervals at nominal coverage.
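Here is a minimal sketch of what that simulation could look like as a Stan program run with the fixed_param method (the program and the flattening order are my own suggestion, not the only way to do it). It draws the parameters from your priors, ignoring the interval constraints on theta and difficulty (which I'd drop anyway; see below), and fills Y in column-major order so it lines up with to_vector(lambdas) in your model:

data {
  int<lower=0> n_examinee;
  int<lower=0> n_item;
  int<lower=1> K;
  matrix[n_item, K] X;
}
generated quantities {
  vector[n_examinee] theta;
  vector[n_item] difficulty;
  vector[K] beta_difficulty;
  vector[n_item] item_difficulty;
  int<lower=0> Y[n_examinee * n_item];
  // draw the parameters from the priors in the model
  for (i in 1:n_examinee) theta[i] = normal_rng(0, 3);
  for (j in 1:n_item) difficulty[j] = normal_rng(0, 3);
  for (k in 1:K) beta_difficulty[k] = normal_rng(0, 1);
  item_difficulty = difficulty + X * beta_difficulty;
  // simulate the data; column-major order matches to_vector(lambdas)
  for (j in 1:n_item)
    for (i in 1:n_examinee)
      Y[(j - 1) * n_examinee + i] = poisson_log_rng(theta[i] + item_difficulty[j]);
}

Fit your model to each simulated Y and check that, say, 90% posterior intervals cover the simulating values of difficulty and beta_difficulty about 90% of the time.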
Some more general Stan/stats comments:
- You can rewrite the transformed parameters much more efficiently and compactly as
vector[n_item] item_difficulty = difficulty + X * beta_difficulty;
- The interval constraints can be challenging. Is there a reason you bounded theta and difficulty? The problem is that exactly the place where the bounds matter is where they cause problems, by bunching up probability mass near the boundaries.
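Concretely, dropping the bounds is just this (the normal(0, 3) priors already keep these parameters in a plausible range without hard constraints):

vector[n_examinee] theta;
vector[n_item] difficulty;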
- I would strongly recommend not taking in Z as an argument, because it has to be equal to n_examinee * n_item. Instead, just use n_examinee * n_item directly, and then the data can never be inconsistent. For the same reason, I'd be inclined to take in Y as an n_examinee by n_item 2D array and then do the conversion to a 1D array inside the Stan program, so that it's easier to use from outside the Stan program.
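For example, the data and transformed data blocks might look like this (a sketch; Y_flat is a name I made up, and the flattening is column-major to match to_vector() in the model block):

data {
  int<lower=0> n_examinee;
  int<lower=0> n_item;
  int<lower=1> K;
  matrix[n_item, K] X;
  int<lower=0> Y[n_examinee, n_item];   // counts as a 2D array
}
transformed data {
  int<lower=0> Y_flat[n_examinee * n_item];
  for (j in 1:n_item)
    for (i in 1:n_examinee)
      Y_flat[(j - 1) * n_examinee + i] = Y[i, j];   // column-major, matching to_vector()
}

The sampling statement then uses Y_flat in place of Y.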
- Stan has a built-in poisson_log function that takes its argument on the log scale and is much more arithmetically stable than exponentiating. The arithmetic defining the matrix lambdas can also be vectorized. Together, the last few lines of the model block become

matrix[n_examinee, n_item] log_lambdas;
for (i in 1:n_examinee) log_lambdas[i] = item_difficulty';  // each row gets the item effects; declaring item_difficulty as a row vector would avoid the transpose
for (j in 1:n_item) log_lambdas[, j] += theta;  // each column gets the person effects
Y ~ poisson_log(to_vector(log_lambdas));
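If you'd rather avoid the loops entirely, rep_matrix can broadcast both vectors to full matrices in one expression:

matrix[n_examinee, n_item] log_lambdas
  = rep_matrix(theta, n_item) + rep_matrix(item_difficulty', n_examinee);
Y ~ poisson_log(to_vector(log_lambdas));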