Hi,
I will first describe the model we have successfully fit with reasonably good results. Then I will ask how we want to improve it by relaxing an assumption. I believe this involves sampling discrete parameters in Stan, but I couldn't figure it out myself after exploring and reading what I could find in the forums and the manual. I know there are examples, but I can't see how to apply them to this problem. Any help is really appreciated. Also, I am happy to discuss co-authorship for a paper if we can figure this out; it would be an interesting paper and a good contribution to the field I am working in.
First, let me briefly describe the problem we are trying to address. Think of a large-scale certification test with n items taken by N people. A subgroup of people had access to a subset of items before the test, so they had item preknowledge and are cheating. The purpose of the model is to identify these fraudulent examinees with item preknowledge using the time each examinee spends on each item.
The inputs are an N \times n matrix of log response times and a vector of length n for item compromise status (0s and 1s; 1 indicates an item is compromised, 0 that it is not). So we assume that we know which items are compromised.
Below is the model we are fitting.
Y_{ij} is the logarithm of the response time for the i^{th} person on the j^{th} item, and is normally distributed with a mean of (\beta_j - \tau_i) and variance of \sigma_j^2.
- \beta_j : item-specific time-intensity parameter; the larger this parameter is, the more time it takes an average examinee to respond to the item.
- \tau_i : person-specific latent speed parameter; the higher this parameter is, the less time it takes the examinee to respond to an average item. This is like cognitive speed for individuals: some individuals process information faster than others.
- \sigma_j^2 : item-specific error variance; this accounts for noise in response times, representing anything that contributes to variation in response times other than individuals' cognitive speed.
For those familiar with IRT models, this is similar to an IRT model with two item parameters but for modeling response time instead of modeling response accuracy.
For our purpose, we slightly modified the model by assuming that there are two latent speed parameters for each individual, \tau_{ti} and \tau_{ci}.
- \tau_{ti} : true person-specific latent speed parameter that the individuals operationalize when responding to items they hadn’t seen before taking the test.
- \tau_{ci} : cheating person-specific latent speed parameter that the individuals operationalize when responding to items they had seen before taking the test.
We assume that the individuals who had seen some items before taking the test respond faster to those items they had seen before compared to the items they had not seen before. So, assuming that we know which items are compromised, \tau_{ci} estimated based on compromised items should be larger than \tau_{ti} estimated based on uncompromised items for those fraudulent individuals who had seen these items before. On the other hand, \tau_{ci} and \tau_{ti} should be very similar for those honest individuals who didn’t see any item before the test.
Therefore, we modified the model as follows:
Y_{ij} \sim N(\mu_{ij}, \sigma_j^2)
\mu_{ij} = (\beta_j - \tau_{ti})^{1-T_i} \times [(1 - C_j) \times (\beta_j - \tau_{ti}) + C_j \times (\beta_j - \tau_{ci})]^{T_i}
- C_j is an input value of 0 or 1 indicating whether or not an item is compromised.
- T_i is a parameter we want to estimate for every individual. If T_i is 1, the person has item preknowledge; otherwise T_i is 0.
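Because T_i and C_j are both 0 or 1, the exponent form simply selects one of two means:

\mu_{ij} = \beta_j - \tau_{ci} if T_i = 1 and C_j = 1, and \mu_{ij} = \beta_j - \tau_{ti} otherwise.

That is, the cheating speed applies only when an examinee with preknowledge responds to a compromised item.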
The way we implement this is to compare \tau_{ci} and \tau_{ti} at every iteration and assign T_i = 1 if \tau_{ci} > \tau_{ti}, and T_i = 0 otherwise. At the end of model fitting, we take the average of the posterior draws of T_i for every individual and use it as the probability of item preknowledge. So, by fitting this model we obtain P(T_i = 1) = P(\tau_{ci} > \tau_{ti}). Then, we can make a decision about each individual using a cut-off value for the posterior probability (e.g., 0.95).
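Concretely, with S post-warmup draws T_i^{(1)}, \dots, T_i^{(S)} of the indicator defined above, the estimate is

\hat{P}(T_i = 1) = \frac{1}{S} \sum_{s=1}^{S} T_i^{(s)}.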
Below is the code we are using to fit this model. We get good convergence, and we have applied it to both simulated and real data. It worked reasonably well for identifying fraudulent individuals with item preknowledge.
The follow-up question is this: can we also estimate C_j for each item during model fitting, so we don't have to assume that the compromise status of each item is known? In practice, we typically don't know whether items have been stolen, so the model has limited use in its current form, where C enters the model as known data rather than as a parameter to estimate.
data {
  int<lower=1> I;           // number of examinees
  int<lower=1> J;           // number of items
  array[I, J] real RT;      // matrix of log response times
  array[J] int C;           // item compromise status (0s and 1s)
}
parameters {
  vector[J] beta;               // time-intensity parameters
  vector<lower=0>[J] alpha;     // time-discrimination parameters
  vector[I] tau_t;              // latent speed for honest responses
  vector[I] tau_c;              // latent speed for fraudulent responses
  real<lower=0> sigma_taut;     // sd for tau_t
  real<lower=0> sigma_tauc;     // sd for tau_c
}
transformed parameters {
  vector[I] T;
  for (i in 1:I) {
    // T[i] depends only on the examinee, so no inner loop over items is needed
    if (tau_t[i] < tau_c[i])
      T[i] = 1;
    else
      T[i] = 0;
  }
}
model {
  sigma_taut ~ exponential(1);
  sigma_tauc ~ exponential(1);
  tau_t ~ normal(0, sigma_taut);
  tau_c ~ normal(0, sigma_tauc);
  beta ~ normal(4, 1);          // these numbers can be estimated from observed data
  alpha ~ inv_gamma(100, 128);  // these numbers can be estimated from observed data
  for (i in 1:I) {
    for (j in 1:J) {
      real p_t = beta[j] - tau_t[i];
      real p_c = beta[j] - tau_c[i];
      // since T[i] and C[j] are 0/1, this picks p_c only when T[i] = 1 and C[j] = 1
      real p = (p_t^(1 - T[i])) * (((1 - C[j]) * p_t + C[j] * p_c)^T[i]);
      RT[i, j] ~ normal(p, 1 / alpha[j]);
    }
  }
}
I defined C as a parameter, but I can't figure out how to sample it.
parameters {
  ...
  vector[J] C; // vector for item compromise status
}
Then, I tried C ~ uniform(0,1) and C ~ beta(1,1) in the model statement, but I didn't get any sensible result. In hindsight that makes sense: declared as a continuous vector, C is no longer a 0/1 indicator, so the selector in \mu_{ij} no longer picks one of the two means.
I was imagining doing something like this.
parameters {
  ...
  real<lower=0, upper=1> pr; // proportion of compromised items
  vector[J] C;               // vector for item compromise status
}
Then, in the model statement,
model {
  ...
  pr ~ beta(1, 1);
  C ~ bernoulli(pr);
}
However, I understand that this will not work because Stan doesn't handle discrete parameters. I have read a few things about marginalizing over discrete parameters to handle this kind of situation, but I can't implement it for this problem.
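Based on my reading of the latent discrete parameters chapter of the manual, I think the key observation is that, conditional on all the continuous parameters (and on T, which is a deterministic function of them), the likelihood factors by item, so each C_j can be summed out independently:

P(Y_{.j} | \cdot) = pr \times \prod_i N(Y_{ij} | T_i(\beta_j - \tau_{ci}) + (1 - T_i)(\beta_j - \tau_{ti}), \sigma_j) + (1 - pr) \times \prod_i N(Y_{ij} | \beta_j - \tau_{ti}, \sigma_j)

Below is my best sketch of the marginalized model, keeping the deterministic T from above. The names pr and prob_C are mine, and I replaced the power-based selector with the equivalent linear form T_i \times p_c + (1 - T_i) \times p_t, which is the same when T_i is 0 or 1. I am not sure the whole thing is right, so corrections are very welcome.

data {
  int<lower=1> I;           // number of examinees
  int<lower=1> J;           // number of items
  array[I, J] real RT;      // log response times; C is no longer data
}
parameters {
  vector[J] beta;               // time-intensity parameters
  vector<lower=0>[J] alpha;     // time-discrimination parameters
  vector[I] tau_t;              // latent speed for honest responses
  vector[I] tau_c;              // latent speed for fraudulent responses
  real<lower=0> sigma_taut;     // sd for tau_t
  real<lower=0> sigma_tauc;     // sd for tau_c
  real<lower=0, upper=1> pr;    // proportion of compromised items
}
transformed parameters {
  vector[I] T;
  for (i in 1:I)
    T[i] = tau_t[i] < tau_c[i] ? 1 : 0;
}
model {
  sigma_taut ~ exponential(1);
  sigma_tauc ~ exponential(1);
  tau_t ~ normal(0, sigma_taut);
  tau_c ~ normal(0, sigma_tauc);
  beta ~ normal(4, 1);
  alpha ~ inv_gamma(100, 128);
  pr ~ beta(1, 1);
  for (j in 1:J) {
    real lp0 = 0; // log-likelihood of column j given C[j] = 0
    real lp1 = 0; // log-likelihood of column j given C[j] = 1
    for (i in 1:I) {
      real p_t = beta[j] - tau_t[i];
      real p_c = beta[j] - tau_c[i];
      lp0 += normal_lpdf(RT[i, j] | p_t, 1 / alpha[j]);
      // if the item is compromised, only examinees with T[i] = 1 switch to tau_c
      lp1 += normal_lpdf(RT[i, j] | T[i] * p_c + (1 - T[i]) * p_t, 1 / alpha[j]);
    }
    // marginalize C[j] out: pr * exp(lp1) + (1 - pr) * exp(lp0), on the log scale
    target += log_mix(pr, lp1, lp0);
  }
}
generated quantities {
  vector[J] prob_C; // posterior probability that each item is compromised
  for (j in 1:J) {
    real lp0 = bernoulli_lpmf(0 | pr);
    real lp1 = bernoulli_lpmf(1 | pr);
    for (i in 1:I) {
      real p_t = beta[j] - tau_t[i];
      real p_c = beta[j] - tau_c[i];
      lp0 += normal_lpdf(RT[i, j] | p_t, 1 / alpha[j]);
      lp1 += normal_lpdf(RT[i, j] | T[i] * p_c + (1 - T[i]) * p_t, 1 / alpha[j]);
    }
    prob_C[j] = exp(lp1 - log_sum_exp(lp0, lp1));
  }
}

If this is correct, the posterior mean of prob_C[j] would play the same role for items that the posterior mean of T[i] plays for examinees.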
I am sorry for the long post. It comes out of frustration after a couple of days of trying, and I am almost ready to give up. I hope someone can point me toward some potential solutions.
Thank you so much for your time and for reading to the end.