Model fails to run with a relatively high number of missing observations

Hello,

referring to a previous post (Missing data in a 2PL (IRT) model - #37 by Panagiotis_Arsenis), we currently have the following Stan code:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
    vector[rows(kappa) + 1] unsummed;
    vector[rows(kappa) + 1] probs;
    unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
    probs = softmax(cumulative_sum(unsummed));
    return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(real theta, real beta, vector kappa) {
    vector[rows(kappa) + 1] unsummed;
    vector[rows(kappa) + 1] probs;
    unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
    probs = softmax(cumulative_sum(unsummed));
    return categorical_rng(probs);
  }
}
data {
  int<lower=1> I;                // # items
  int<lower=1> J;                // # persons
  int<lower=1> N;                // # observations
  int<lower=1> N_mis;            // # missing observations
  int<lower=1, upper=I> ii[N];   // item for n
  int<lower=1, upper=J> jj[N];   // person for n
  int<lower=0> y[N];             // response for n
}
transformed data {
  int m;                         // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa);
}

The issue here is that when N_mis > N, Stan will come back with this:

Exception: : accessing element out of range. index 7455 out of range; expecting index to be between 1 and 7454; index position = 1jj (in ‘model52b1217b4c_1e9f8627a1bbd76d859458272d0dfc57’ at line 54)

We are not sure how to deal with this. Any suggestions?

Panos

I’m just speculating here:

Are y_mis[n] and y[n] related in some way?

Would it achieve the same goal if you randomly picked an integer between 1 and N on each iteration and used that as n?

The problem is that jj and ii have length N, but in generated quantities you index into them up to N_mis, which as you say is > N.
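One way around this, as a rough sketch rather than a tested fix: build a second pair of index arrays for the missing cells when preparing the data, so that generated quantities has something of length N_mis to index into. The names `ii_mis` and `jj_mis` are hypothetical (they are not in the model above), and the sketch assumes the responses start out as a persons-by-items table with `None` marking a missing response. On the Stan side you would then, presumably, declare `ii_mis[N_mis]` and `jj_mis[N_mis]` in the data block and use `theta[jj_mis[n]]` and `beta[ii_mis[n]]` in the `y_mis` loop.

```python
def split_long_format(responses):
    """Turn a persons x items table (None = missing) into long-format index lists.

    The keys ii, jj, y, N, N_mis, I, J match the data block of the model;
    ii_mis and jj_mis are hypothetical extra arrays for the missing cells.
    """
    data = {"ii": [], "jj": [], "y": [], "ii_mis": [], "jj_mis": []}
    # start=1 because Stan indexes from 1
    for j, row in enumerate(responses, start=1):
        for i, value in enumerate(row, start=1):
            if value is None:
                data["ii_mis"].append(i)
                data["jj_mis"].append(j)
            else:
                data["ii"].append(i)
                data["jj"].append(j)
                data["y"].append(value)
    data["J"] = len(responses)
    data["I"] = len(responses[0])
    data["N"] = len(data["y"])
    data["N_mis"] = len(data["ii_mis"])
    return data


# Toy table: 2 persons, 3 items, more missing cells than observed ones.
toy = [[0, None, None],
       [None, 1, None]]
stan_data = split_long_format(toy)
```

With this layout every array has exactly the length of the size variable it would be declared with, so N_mis > N is no longer a problem.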

Well, both N_mis and N come from the same data set: whatever is not counted in N_mis (NA, missing data) is counted in N.

I see, this sounds right. If I simply redeclare ii and jj with length N_mis, as ii[N_mis] and jj[N_mis], I get the following:

Error in new_CppObject_xp(fields$.module, fields$.pointer, …) :
Exception: mismatch in dimension declared and found in context; processing stage=data initialization; variable name=ii; position=0; dims declared=(10285); dims found=(7454) (in ‘model79fbfba1a_d0f40f95b93802bc6d0fcad905f25cf3’ at line 24)

A bit more insight here if possible?

The error message is telling you that the data you passed in is the wrong size. I’ll try to break it down.

  • “mismatch in dimension declared and found”: it’s telling you that the size you declared and what it found were different.

  • “processing stage=data initialization”: it’s telling you this happens while it’s initializing data, i.e. reading data from R / Python / the command line.

  • “variable name=ii”: check ii

  • “position=0”: this might throw you off and is unimportant for this error. It’s telling you in which dimension of the array the problem was found.

  • “dims declared=(10285)”: in your program, it’s expecting ii to be length 10285

  • “dims found=(7454)”: in your data, check the size of ii. I bet it’s length 7454

The error messages contain a lot of info if you unpack them.
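The same comparison can be made before the data ever reaches Stan. Below is a minimal sketch, assuming the data is held in a Python dict; `check_dims` is a hypothetical helper (not part of any Stan interface), and the `declared` mapping is written out by hand from the data block. The sizes are the ones from the error above; the array contents are placeholders.

```python
def check_dims(data, declared):
    """Return (name, declared_len, found_len) for every array whose length
    differs from the size variable it is declared with."""
    problems = []
    for name, size_name in declared.items():
        expected = data[size_name]
        found = len(data[name])
        if expected != found:
            problems.append((name, expected, found))
    return problems


# Declarations after the change: ii[N_mis], jj[N_mis], y[N].
declared = {"ii": "N_mis", "jj": "N_mis", "y": "N"}

# Data as passed in: the index arrays still have the observed length.
data = {
    "N": 7454,
    "N_mis": 10285,
    "ii": [1] * 7454,  # placeholder values, only the length matters here
    "jj": [1] * 7454,
    "y": [0] * 7454,
}

problems = check_dims(data, declared)
for name, expected, found in problems:
    print(f"{name}: dims declared=({expected}); dims found=({found})")
```

Here `ii` and `jj` are both flagged as declared with 10285 elements but found with 7454, which is the mismatch the exception reports for `ii`.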

It’s also telling you which line in the model is causing problems: see the “at line 54” and “at line 24” at the end of the two exceptions.

That’s always a good place to start debugging. (Just thought I’d call that out explicitly in addition to @syclik’s comments.)

Thank you for this helpful information.