Missing data in a 2PL (IRT) model

Hi,

@Panagiotis_Arsenis: Sorry for asking this again. Suppose that we observe Y: 0, 1, 1, 0, NA, NA, 1, 1. You have said that y should contain only the observed data, but I still do not fully understand.

Could you let me know what your y would look like in this example? Thank you so much!

Tran.

Hi Tran,

My data include NAs, like your example. However, y includes only the observed data; to achieve this, I converted my database-like data structure into "long form".

The process is described in section 16.1 of the Stan manual.

Panos
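
To make this concrete, here is a minimal sketch (not from the original posts) that assumes Tran's Y = 0, 1, 1, 0, NA, NA, 1, 1 comes from two persons each answering four items. In long form, only the six observed responses are kept, so the NAs never enter y:

// Hypothetical wide layout (two persons by four items):
//   person 1: 0  1  1  0
//   person 2: NA NA 1  1
// Long form drops the NAs, leaving N = 6 observed responses:
//   ii (item):   1 2 3 4 3 4
//   jj (person): 1 1 1 1 2 2
//   y  (value):  0 1 1 0 1 1
data {
  int<lower=1> I;               // # items (4 in this sketch)
  int<lower=1> J;               // # persons (2 in this sketch)
  int<lower=1> N;               // # observed, non-NA responses (6 here)
  int<lower=1, upper=I> ii[N];  // item for observation n
  int<lower=1, upper=J> jj[N];  // person for observation n
  int<lower=0> y[N];            // observed response for observation n
}

Because the missing cells are simply never listed in ii, jj, or y, they contribute nothing to the likelihood.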

@Panagiotis_Arsenis: Thank you, now I get it!

Yes, signatures need to match. Looks like I missed a vector argument or something.

Yes, so is there a way I can rewrite y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa) to match the signature rsm_rng(vector, real, real, vector)? Or is there another alternative?

I'm not exactly sure what you're trying to do, but you need to make sure that the output of your function has the same type as the variable you are trying to assign it to. I'm not sure what you want to match with what.

Ok, the code as it stands right now is the following:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(vector y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(y + 1);
  }
}
data {
  int<lower=1> I;               // # items
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1> N_mis;           // # missing observations
  int<lower=1, upper=I> ii[N];  // item for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
transformed data {
  int m;                        // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa);
}

The tricky part is the very last line, where I try to generate missing values using the rsm_rng function defined at the top of the code. However, the above does not work, since the call does not match the function's definition. Given the restrictions of the generated quantities block (e.g. it cannot include sampling statements), is there a way to rewrite the last line so that it runs?

I'm not sure what you expect when you try to send three arguments to a function you wrote to require four arguments.

Yes, I know. If I write it like this, target += rsm_rng(y_mis[n], theta[jj[n]], beta[ii[n]], kappa), it won't work either. And this is my problem: I do not know how to write it within this block so that it runs.

If you are going to send it three arguments, you need to write a three-argument function. It's not a syntax problem, it's a conceptual problem: there's no way for Stan to guess what the missing arguments should be.

Indeed, according to the definition of the function, four arguments are necessary, and this works in the model block for the rsm function. It cannot work, however, in the generated quantities block for rsm_rng. Is there an alternative to target += that works in the generated quantities block?

I think there's some misunderstanding here. Let me try to clarify. If you have a density like the normal, there are three arguments, normal(y | mu, sigma). When you write down a sampling statement it's

y ~ normal(mu, sigma);

with only two arguments to what looks like a normal() function. But that's just because it's shorthand for

target += normal_lpdf(y | mu, sigma);

where the 3 arguments are clear.

Now in generated quantities, it looks like this:

real y;
y = normal_rng(mu, sigma);

Here, normal_rng() is a two-argument function, despite the fact that the normal distribution is a three-argument function. That's because it returns the y value.
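
For the categorical distribution used inside rsm, the same contrast looks roughly like this (an illustrative sketch, not part of the original reply; probs is the simplex computed in rsm and y an observed response):

// model block: the lpmf takes the outcome plus its parameters
// and returns a log probability (a real)
target += categorical_lpmf(y + 1 | probs);

// generated quantities block: the rng takes only the parameters
// and returns a simulated outcome (an int in 1:rows(probs))
int y_sim;
y_sim = categorical_rng(probs);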

I see, thanks. I need to somehow redefine the function then.

Following up from the above, consider the following code:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(vector y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(y + 1);
  }
}
data {
  int<lower=1> I;               // # items
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1> N_mis;           // # missing observations
  int<lower=1, upper=I> ii[N];  // item for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
transformed data {
  int m;                        // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(y[n], theta[jj[n]], beta[ii[n]], kappa);
}

The above code includes two user-defined functions (at the top of the code); the second, rsm_rng, uses the Stan function categorical_rng, which requires a vector as its argument.

However, y should be an integer, since that is the nature of the data, so it cannot be of type vector in rsm_rng.

Maybe I could use a different function instead of categorical? Or any other suggestions?

Categoricals return numbers in 1:K. Multinomials return size-K arrays of counts. You're trying to return a real value from rsm_rng, which doesn't make sense, as categorical returns an integer.

I can't follow your rsm_rng function. It's returning a real when it should be returning an int, and I think probs is what you want as the argument to categorical_rng, so I don't know what y is supposed to be doing.
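
For reference, a minimal sketch of the kind of signature being described here (not code from the original posts) would drop the y argument, pass probs to categorical_rng, and return an int:

int rsm_rng(real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(probs);  // integer draw in 1:(rows(kappa) + 1)
}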

Many thanks for the input.

The above model is an implementation of a simple rating scale model (without regression) by Daniel C. Furr (http://mc-stan.org/users/documentation/case-studies/rsm_and_grsm.html). y is the response of person j to item i. The data involve Likert scale responses.

Regarding the return type of the categorical function: the rsm function, which uses categorical_lpmf, does work though. This is a bit confusing for me too.

Maybe this is a complicated way to implement this model (I am not sure about that), but I am happy to adopt another case study if one is recommended.

It seems that we have resolved our issue. The code that actually works is the following:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(probs);
  }
}
data {
  int<lower=1> I;               // # items
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1> N_mis;           // # missing observations
  int<lower=1, upper=I> ii[N];  // item for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
transformed data {
  int m;                        // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa);
}

It seems that we were using the wrong argument for categorical_rng inside rsm_rng: the probability vector probs should have been passed, i.e. return categorical_rng(probs). Also, rsm_rng is now defined as rsm_rng(real theta, real beta, vector kappa).

All in all, we had to figure out how the random number generating function would work in this context.

Many thanks for your comments. This issue has now been resolved.

Thanks for reporting back.
