Missing data in a 2PL (IRT) model

We are looking to develop an hierarchical ordered two-parameter logistic model with missing data, and we are starting with this case-study model as on stan website (http://mc-stan.org/users/documentation/case-studies/hierarchical_2pl.html). We would like to consider missing data as well, thus we have made an adjustment following section 11.1 of the latest Stan manual. The code is the following (highlighted the problematic line):

data {
int<lower=1> I; // # items
int<lower=1> J; // # persons
int<lower=1> N; // # observations
int<lower=1> N_mis; // # missing observations
int<lower=1, upper=I> ii[N]; // item for n
int<lower=1, upper=J> jj[N]; // person for n
int<lower=0, upper=1> y[N]; // correctness for n
}
parameters {
vector[J] theta; // abilities
vector[2] xi[I]; // alpha/beta pair vectors
vector[2] mu; // vector for alpha/beta means
vector<lower=0>[2] tau; // vector for alpha/beta residual sds
cholesky_factor_corr[2] L_Omega;
real y_mis[N_mis];
}
transformed parameters {
vector[I] alpha;
vector[I] beta;
for (i in 1:I) {
alpha[i] = exp(xi[i,1]);
beta[i] = xi[i,2];
}
}
model {
matrix[2,2] L_Sigma;
L_Sigma = diag_pre_multiply(tau, L_Omega);
for (i in 1:I)
xi[i] ~ multi_normal_cholesky(mu, L_Sigma);
theta ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(4);
mu[1] ~ normal(0,1);
tau[1] ~ exponential(.1);
mu[2] ~ normal(0,5);
tau[2] ~ exponential(.1);
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[ii] .* (theta[jj] - beta[ii]));
for (n in 1:N_mis)
y_mis[n] ~ bernoulli_logit(alpha[ii] .* (theta[jj] - beta[ii]));
}
generated quantities {
corr_matrix[2] Omega;
Omega = multiply_lower_tri_self_transpose(L_Omega);
}
"

The parser then generates the following error:

No matches for:

real ~ bernoulli_logit(vector)

Available argument signatures for bernoulli_logit:

int ~ bernoulli_logit(real)
int ~ bernoulli_logit(real[])
int ~ bernoulli_logit(vector)
int ~ bernoulli_logit(row vector)
int[] ~ bernoulli_logit(real)
int[] ~ bernoulli_logit(real[])
int[] ~ bernoulli_logit(vector)
int[] ~ bernoulli_logit(row vector)

require real scalar return type for probability function.

ERROR at line 42

40: y[n] ~ bernoulli_logit(alpha[ii] .* (theta[jj] - beta[ii]));
41: for (n in 1:N_mis)
42: y_mis[n] ~ bernoulli_logit(alpha[ii] .* (theta[jj] - beta[ii]));
^
43: }

Apparently, the real argument cannot work for the bernoulli_logit function, while an integer is not accepted in the parameter block. Any suggestions on how to get round this issue?

Stan does not accept integer parameters. The normal procedure is to try to marginalise them out of the log likelihood (i.e. sum over 0 and 1). I think in your current case the missing data would not add any information to the parameters alpha, theta, and beta. You can just generate y_mis in the generated quantities.

generated_quantities{
vector[N_mis] y_mis;
for (n in 1:N_mis){
  y_mis[n] = bernoulli_rng(inv_logit(alpha[ii] .* (theta[jj] - beta[ii])))
  }
}

Not tested!

Thanks for the suggestion! I think it makes sense.

I have added these lines to the generated quantities block (also removed the highlighted line from the parameters block and the relevant bit from the model block), but I got this error message from the parser:

variable “vector” does not exist.

ERROR at line 45

43: corr_matrix[2] Omega;
44: Omega = multiply_lower_tri_self_transpose(L_Omega);
45: vector[N_mis] y_mis;
^
46: for (n in 1:N_mis)

Any ideas what’s the problem here?

You can only declare local variables at the top of a block. So this is bad:

transformed parameters {
  real x;
  x = 10;
  real y; // BAD!!!
  ...
}

But if you change order of declaration it’s OK

transformed parameters {
  real x;
  real y;  // OK!
  x = 10;
  ...
}

And if you put it in a local scope with braces, it’s also OK:

transformed parameters {
  real x;
  x = 10;
  {
    real y; // OK
    ...
  }
}

In this latter case, y will not be exposed in the model block or saved, whereas in the former case it will be.

I see, thank you.

Implementing all the above I have the following code:

data {
int<lower=1> I; // # items
int<lower=1> J; // # persons
int<lower=1> N; // # observations
int<lower=1> N_mis; // # missing observations
int<lower=1, upper=I> ii[N]; // item for n
int<lower=1, upper=J> jj[N]; // person for n
int<lower=0, upper=1> y[N]; // correctness for n
}
parameters {
vector[J] theta; // abilities
vector[2] xi[I]; // alpha/beta pair vectors
vector[2] mu; // vector for alpha/beta means
vector<lower=0>[2] tau; // vector for alpha/beta residual sds
cholesky_factor_corr[2] L_Omega;
}
transformed parameters {
vector[I] alpha;
vector[I] beta;
for (i in 1:I) {
alpha[i] = exp(xi[i,1]);
beta[i] = xi[i,2];
}
}
model {
matrix[2,2] L_Sigma;
L_Sigma = diag_pre_multiply(tau, L_Omega);
for (i in 1:I)
xi[i] ~ multi_normal_cholesky(mu, L_Sigma);
theta ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(4);
mu[1] ~ normal(0,1);
tau[1] ~ exponential(.1);
mu[2] ~ normal(0,5);
tau[2] ~ exponential(.1);
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[ii] .* (theta[jj] - beta[ii]));
}
generated quantities {
corr_matrix[2] Omega;
vector[N_mis] y_mis;
Omega = multiply_lower_tri_self_transpose(L_Omega);
for (n in 1:N_mis)
y_mis[n] = bernoulli_rng(inv_logit(alpha[ii] .* (theta[jj] - beta[ii])));
}
"

The message from parser is now:

No matches for:

bernoulli_rng(vector)

Available argument signatures for bernoulli_rng:

bernoulli_rng(real)

ERROR at line 46

44: Omega = multiply_lower_tri_self_transpose(L_Omega);
45: for (n in 1:N_mis)
46: y_mis[n] = bernoulli_rng(inv_logit(alpha[ii] .* (theta[jj] - beta[ii])));
^
47: }

It seems that the argument here has to be a real variable. Do I have to define new alpha, beta and theta within the generated quantities block for this operation?

It’s probably because I relied on the vectorization of the indexing which does not work with the loop. Does it work if you change ii to ii[n] and jj to jj[n]in line 46?

Indeed it works, thanks!

Following up on the previous discussion, I’m trying now to implement a rating scale model (Rating scale and generalized rating scale models with latent regression) incorporating missing values as above. The code is the following:

functions {
real rsm(int y, real theta, real beta, vector kappa) {
vector[rows(kappa) + 1] unsummed;
vector[rows(kappa) + 1] probs;
unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
probs = softmax(cumulative_sum(unsummed));
return categorical_lpmf(y + 1 | probs);
}
}
data {
int<lower=1> I; // # items
int<lower=1> J; // # persons
int<lower=1> N; // # observations
int<lower=1> N_mis; // # missing observations
int<lower=1, upper=I> ii[N]; // item for n
int<lower=1, upper=J> jj[N]; // person for n
int<lower=0, upper=1> y[N]; // correctness for n
}
transformed data {
int m; // # steps
m = max(y);
}
parameters {
vector[I] beta;
vector[m-1] kappa_free;
vector[J] theta;
real<lower=0> sigma;
}
transformed parameters {
vector[m] kappa;
kappa[1:(m-1)] = kappa_free;
kappa[m] = -1*sum(kappa_free);
}
model {
beta ~ normal(0, 3);
target += normal_lpdf(kappa | 0, 3);
theta ~ normal(0, sigma);
sigma ~ exponential(.1);
for (n in 1:N)
target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
vector[N_mis] y_mis;
for (n in 1:N_mis)
target += rsm(y_mis[n], theta[jj[n]], beta[ii[n]], kappa);
}

However, the following error message comes up:

Sampling statements (~) and increment_log_prob() are
only allowed in the model block or lp functions.

ERROR at line 47

45: vector[N_mis] y_mis;
46: for (n in 1:N_mis)
47: target += rsm(y_mis[n], theta[jj[n]], beta[ii[n]], kappa);
^
48: }

The message is pretty clear. But how can I perform this operation (incrementing the log posterior) in the generated quantities block?

I think you want to define y_mis as a parameter. Then put tha sampling in the model block. There’s a chapter in the manual on missing data that you might find helpful that lays out a simple example.

You need to create two different functions. rsm_lpmf to calculate the loglikelihood contribution. That is probably what you have done with rms and what you need in the parameter block. In the generated quantities block you need a rsm_rng function that samples a random variable. You should use categorical_rng in that function.

I think Bob’s suggestion won’t work because rsm_lpmf requires y to be an integer.

Right—if it’s an integer, you can’t declare it as a parameter. Instead, you have to marginalize it out of the likelihood if it’s missing.

Thanks for your suggestions!

Indeed, I have used the missing data section of the manual to adapt the example above. I have already tried the rsm in the model block but it does not work because y_mis has to be an integer, as you have pointed out.

The rsm function is defined at the top of the code. I’m not sure how to use categorical_rng in the generated quantities. Could you be a bit more specific? An example?

Moreover, I’m not sure about the comment of the rsm function in the parameter block. If you could elaborate on this a bit (why and how to do this), it would be very helpful.

Sorry I have not a lot of time today. I’ll try to quickly get the idea across.

Bob’s idea of marginalizing out means you have to average the loglikelihood over the all the potential integers so that you do not need an integer parameter anymore. This is feasible with simple models and when there is not too many integer options. I am not sure whether that is feasible with the rating scale model because I am not familiar with it. No mistake, if feasible this is the best solution in general. There are chapters in the Stan manual that deal with marginalization.

What I think you can do is to simplify things. You do not need to marginalize when the missing values are the outcomes of your model. You can just treat them as you would a hold-out sample: estimate the model with complete observations + generate new outcomes from the model. That is my suggestions.

To do that you need two different functions with your custom distribution. If you go back to the simpler bernouilli model at the start of this thread you will see that in the model block, you have y[n] ~ bernouilli_logit(...) which is stan shorthand for target += target + bernouilli_logit_lpmf(y[n]| ...). In the generated quantities block, you have y_mis[n] = bernouilli_rng(inv_logit(...)). You use a different function for building the loglikelihood and for generating new outcomes for the missing values. You need the same structure for your custom function, which means you need two functions: rms_lpmf and rms_rng. I think you can use your rms as rms_lpmf and if you replace categorical_lpmf with categorical_rng, you will have the appropriate rms_rng. It would be a good idea to check this with the stan manual chapter on functions though.

I think I have applied all the steps you mention. The resulting code is this:

functions {
  real rsm(int y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_lpmf(y + 1 | probs);
  }
  real rsm_rng(vector y, real theta, real beta, vector kappa) {
  vector[rows(kappa) + 1] unsummed;
  vector[rows(kappa) + 1] probs;
  unsummed = append_row(rep_vector(0, 1), theta - beta - kappa);
  probs = softmax(cumulative_sum(unsummed));
  return categorical_rng(y + 1);
  }
}
data {
  int<lower=1> I;               // # items
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1> N_mis;           // # missing observations
  int<lower=1, upper=I> ii[N];  // item for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
transformed data {
  int m;                        // # steps
  m = max(y);
}
parameters {
  vector[I] beta;
  vector[m-1] kappa_free;
  vector[J] theta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[m] kappa;
  kappa[1:(m-1)] = kappa_free;
  kappa[m] = -1*sum(kappa_free);
}
model {
  beta ~ normal(0, 3);
  target += normal_lpdf(kappa | 0, 3);
  theta ~ normal(0, sigma);
  sigma ~ exponential(.1);
  for (n in 1:N)
    target += rsm(y[n] | theta[jj[n]], beta[ii[n]], kappa);
}
generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    target += rsm_rng(y_mis[n] | theta[jj[n]], beta[ii[n]], kappa);
}

I get the following error from parser:

ERROR at line 49

 47:      sigma ~ exponential(.1);
 48:      for (n in 1:N)
 49:        target += rsm(y[n] | theta[jj[n]], beta[ii[n]], kappa);
                         ^
 50:    }

PARSER EXPECTED: "("

Using the “target +=” seems to be in conflict with the definition of rsm. Any insights here?

Many thanks for your comments so far!

You need to have a function ending in _lpdf, _lpmf, _lcdf, or _lccdf in order to use the vertical bar notation or to use the sampling notation. Otherwise, it works like an ordinary function. So replace the | with a comman ,. I tried to explain all this in the user-defined function chapter of the manual.

Sorry about the error message not being so helpful. We’re working on making them better.

Also, you can’t manipulated the target in generated quantities. I think you want

y_mis[n] = rsm_rng(...)

but that’s not enough becuase you included the y in the argument list for rsm_rng, which is probably eithr a bug, or you meant to include it as an argument.

What you could try to do here is model the probability of response as missing data (i.e., beta-binomial distributed random variable) instead of the responses themselves, which is a binary integer and for that reason something you have to marginalize over. You could then generate the actual missing responses in the generated quantities block, but ideally you want there to be some kind of joint estimation of the missing and non-missing outcomes in the model block, otherwise the estimated parameters will not take into account uncertainty over missingness.

But that is just a thought and the wise Stan gurus on this thread may see problems with it.

I have an ideal point IRT model that handles missingness, but the application is limited to legislatures, so unless that is what you are dealing with it wouldn’t be of much help.

Sorry my comment only applies to your original binomial model… for a rating scale model, you could think about the probabilities of each category as a separate random variable. But I believe that is essentially what the marginalization would do.

@Panagiotis_Arsenis,

Hi, have you run the codes properly? I am wondering if you have missing values then how can you specify:

in your stan codes because Stan does not allow any missing data as input.

Tran.

Many thanks for your comments!

Regarding the latter, y[N] defines the output as a function of actual observations; I do not have missing data as input. These are in the generated quantities block.

I followed Bob’s comment regarding the rsm function in the model block and it works well know:

for (n in 1:N)
    target += rsm(y[n], theta[jj[n]], beta[ii[n]], kappa);

However, the second comment regarding the rsm_rng function does not seem to work. I think Bob suggests that I do the following:

generated quantities {
  vector[N_mis] y_mis;
  for (n in 1:N_mis)
    y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa);
}

The rsm_rng expects a vector y as the first argument in its list. So, if I do this I get an error message:

No matches for: 

  rsm_rng(real, real, vector)

Available argument signatures for rsm_rng:

  rsm_rng(vector, real, real, vector)


ERROR at line 54

 52:      vector[N_mis] y_mis;
 53:      for (n in 1:N_mis)
 54:        y_mis[n] = rsm_rng(theta[jj[n]], beta[ii[n]], kappa);
                                                                ^
 55:    }

Should I redefine the rsm_rng function then? Or is there an alternative to line 54’s specification?