Missing data in a 2PL (IRT) model

Hi,

@Panagiotis_Arsenis: Sorry for asking this again. Suppose that we observe Y: 0, 1, 1, 0, NA, NA, 1, 1. You have said that y should contain only the observed data, but I still do not fully understand.

Could you let me know what your y would look like in this example? Thank you so much!

Tran.

Hi Tran,

My data include NAs, like your example. However, y includes only the observed data; to achieve this, I converted my database-like data structure into "long form".

The process is described in section 16.1 of the Stan manual.

Panos
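
To make this concrete, here is a minimal sketch (not from the original posts) that assumes Tran's Y = 0, 1, 1, 0, NA, NA, 1, 1 comes from two persons each answering four items. In long form, only the six observed responses are kept, so the NAs never enter y:

// Hypothetical wide layout (two persons by four items):
//   person 1: 0  1  1  0
//   person 2: NA NA 1  1
// Long form drops the NAs, leaving N = 6 observed responses:
//   ii (item):   1 2 3 4 3 4
//   jj (person): 1 1 1 1 2 2
//   y  (value):  0 1 1 0 1 1
data {
  int<lower=1> I;               // # items (4 in this sketch)
  int<lower=1> J;               // # persons (2 in this sketch)
  int<lower=1> N;               // # observed, non-NA responses (6 here)
  int<lower=1, upper=I> ii[N];  // item for observation n
  int<lower=1, upper=J> jj[N];  // person for observation n
  int<lower=0> y[N];            // observed response for observation n
}

Because the missing cells are simply never listed in ii, jj, or y, they contribute nothing to the likelihood.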

@Panagiotis_Arsenis: Thank you, now I get it!

Yes, signatures need to match. Looks like I missed a vector argument or something.

Yes, so is there a way I can rewrite y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa) to match the signature rsm_rng(vector, real, real, vector)? Or is there another alternative?

I'm not exactly sure what you're trying to do, but you need to make sure that the output of your function has the same type as the variable you are trying to assign it to. I'm not sure what you want to match with what.

Ok, the code as it stands right now is the following:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(vector y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(y + 1);
  }
}
data {
  int<lower=1> I;               // # items
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1> N_mis;           // # missing observations
  int<lower=1, upper=I> ii[N];  // item for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
transformed data {
  int m;                        // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa);
}

The tricky part is the very last line, where I try to generate missing values using the rsm_rng function defined at the top of the code. However, the above does not work, since the call does not match the function's definition. Given the restrictions of the generated quantities block (e.g. it cannot include sampling statements), is there a way to rewrite the last line so that it runs?

I'm not sure what you expect when you try to send three arguments to a function you wrote to require four arguments.

Yes, I know. If I write it like this, target += rsm_rng(y_mis[n], theta[jj[n]], beta[ii[n]], kappa), it won't work either. And this is my problem: I do not know how to write it within this block so that it runs.

If you are going to send it three arguments, you need to write a three-argument function. It's not a syntax problem, it's a conceptual problem: there's no way for Stan to guess what the missing arguments should be.

Indeed, according to the definition of the function, four arguments are necessary, and this works in the model block for the rsm function. It cannot work, however, in the generated quantities block for rsm_rng. Is there an alternative to target += that works in the generated quantities block?

I think there's some misunderstanding here. Let me try to clarify. If you have a density like the normal, there are three arguments, normal(y | mu, sigma). When you write down a sampling statement it's

y ~ normal(mu, sigma);

with only two arguments to what looks like a normal() function. But that's just because it's shorthand for

target += normal_lpdf(y | mu, sigma);

where the 3 arguments are clear.

Now in generated quantities, it looks like this:

real y;
y = normal_rng(mu, sigma);

Here, normal_rng() is a two-argument function, despite the fact that the normal distribution is a three-argument function. That's because it returns the y value.
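
For the categorical distribution used inside rsm, the same contrast looks roughly like this (an illustrative sketch, not part of the original reply; probs is the simplex computed in rsm and y an observed response):

// model block: the lpmf takes the outcome plus its parameters
// and returns a log probability (a real)
target += categorical_lpmf(y + 1 | probs);

// generated quantities block: the rng takes only the parameters
// and returns a simulated outcome (an int in 1:rows(probs))
int y_sim;
y_sim = categorical_rng(probs);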

I see, thanks. I need to somehow redefine the function then.

Following up from the above, consider the following code:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(vector y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(y + 1);
  }
}
data {
  int<lower=1> I;               // # items
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1> N_mis;           // # missing observations
  int<lower=1, upper=I> ii[N];  // item for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
transformed data {
  int m;                        // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(y[n], theta[jj[n]], beta[ii[n]], kappa);
}

The above code includes two user-defined functions (at the top of the code); the second, rsm_rng, uses the Stan function categorical_rng, which requires a vector as its argument.

However, y should be an integer, since that is the nature of the data, so it cannot be of type vector in rsm_rng.

Maybe I could use a different function instead of categorical? Or any other suggestions?

Categoricals return numbers in 1:K. Multinomials return size-K arrays of counts. You're trying to return a real value from rsm_rng, which doesn't make sense, as categorical returns an integer.

I can't follow your rsm_rng function. It's returning a real when it should be returning an int, and I think probs is what you want as the argument to categorical_rng, so I don't know what y is supposed to be doing.
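
For reference, a minimal sketch of the kind of signature being described here (not code from the original posts) would drop the y argument, pass probs to categorical_rng, and return an int:

int rsm_rng(real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(probs);  // integer draw in 1:(rows(kappa) + 1)
}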

Many thanks for the input.

The above model is an implementation of a simple rating scale model (without regression) by Daniel C. Furr (http://mc-stan.org/users/documentation/case-studies/rsm_and_grsm.html). y is the response of person j to item i. The data involve Likert scale responses.

Regarding the return type of the categorical function: the rsm function, which uses categorical_lpmf, does work though. This is a bit confusing for me too.

Maybe this is a complicated way to implement this model (I am not sure about that), but I am happy to adopt another case study if one is recommended.

It seems that we have resolved our issue. The code that actually works is the following:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(probs);
  }
}
data {
  int<lower=1> I;               // # items
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1> N_mis;           // # missing observations
  int<lower=1, upper=I> ii[N];  // item for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
transformed data {
  int m;                        // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa);
}

It seems that we were using the wrong argument for categorical_rng inside rsm_rng: the probability vector probs should have been passed, i.e. return categorical_rng(probs). Also, rsm_rng is now defined as rsm_rng(real theta, real beta, vector kappa).

All in all, we had to figure out how the random number generating function would work in this context.

Many thanks for your comments. This issue has now been resolved.

Thanks for reporting back.
