Handling discrete parameter in a very specific response time model

Hi,

I will first try to describe the model we successfully fit and from which we get reasonably good results. Then, I will ask how we want to improve it by relaxing an assumption. I believe this involves sampling discrete parameters in Stan, but I couldn’t figure it out myself after exploring and reading what I could find in the forums and the manual. I know there are examples, but I can’t figure out how to apply them to this problem. Any help is really appreciated. Also, I am happy to discuss co-authorship for a paper if we can figure this out. This would be an interesting paper and a good contribution to the field I am working in.

First, let me briefly talk about the problem we are trying to address. Think about a large-scale certification test. There are n items in the test and N people are taking the test. A subgroup of people had access to a subset of items before the test, so they had item preknowledge and they are cheating. The purpose of the model is to identify these fraudulent examinees with item preknowledge using the time each examinee spends on each item.

The inputs are an N \times n matrix of log response times and a vector of length n for item compromise status (0s and 1s; 1 indicates an item is compromised, and 0 indicates it is not). So, we assume that we know which items are compromised.

Below is the model we are fitting.

Y_{ij} is the logarithm of the response time for the i^{th} person on the j^{th} item, and is normally distributed with a mean of (\beta_j - \tau_i) and variance of \sigma_j^2.

  • \beta_j : item-specific time-intensity parameter; the larger this parameter is, the more time it takes for an average examinee to respond to the item.

  • \tau_i : person-specific latent speed parameter; the higher this parameter is, the less time it takes for an examinee to respond to an average item. This is like cognitive speed for individuals: some individuals may process information faster than others.

  • \sigma_j^2 : item-specific error variance; this accounts for the noise in response time, representing anything other than individuals’ cognitive speed that contributes to variation in response times.

For those familiar with IRT models, this is similar to an IRT model with two item parameters but for modeling response time instead of modeling response accuracy.
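
For readers who prefer code, below is a minimal Stan sketch of this baseline model (the priors shown are only placeholders; the full program for the modified model appears later in this post):

data{
  int <lower=1> I;             // number of examinees
  int <lower=1> J;             // number of items
  real RT[I,J];                // matrix of log response times
}

parameters {
  vector[J] beta;              // item time-intensity parameters
  vector[I] tau;               // person latent speed parameters
  vector<lower=0>[J] sigma;    // item-specific residual SDs
}

model {
  beta  ~ normal(4,1);         // placeholder prior
  tau   ~ normal(0,1);         // placeholder prior
  sigma ~ exponential(1);      // placeholder prior

  for (i in 1:I)
    for (j in 1:J)
      RT[i,j] ~ normal(beta[j] - tau[i], sigma[j]);
}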

For our purpose, we slightly modified the model by assuming that there are two latent speed parameters for each individual, \tau_{ti} and \tau_{ci}.

  • \tau_{ti} : true person-specific latent speed parameter that the individuals operationalize when responding to items they hadn’t seen before taking the test.
  • \tau_{ci} : cheating person-specific latent speed parameter that the individuals operationalize when responding to items they had seen before taking the test.

We assume that the individuals who had seen some items before taking the test respond faster to those items they had seen before compared to the items they had not seen before. So, assuming that we know which items are compromised, \tau_{ci} estimated based on compromised items should be larger than \tau_{ti} estimated based on uncompromised items for those fraudulent individuals who had seen these items before. On the other hand, \tau_{ci} and \tau_{ti} should be very similar for those honest individuals who didn’t see any item before the test.

Therefore, we modified the model as follows:

Y_{ij} \sim N(\mu_{ij},\sigma_j)

\mu_{ij} = (\beta_j - \tau_{ti})^{1-T_i} \times[ (1- C_j) \times (\beta_j - \tau_{ti}) + C_j \times (\beta_j - \tau_{ci})]^{T_i}

  • C_j is an input value of 0 or 1 indicating whether or not an item is compromised.
  • T_i is a parameter we want to estimate for every individual. If T_i is 1, the person has item preknowledge; T_i is 0 otherwise.

The way we implement this is to compare \tau_{ci} and \tau_{ti} at every iteration, and then assign a value of 1 to T_i if \tau_{ci} > \tau_{ti}, and 0 otherwise. At the end of the model fitting process, we take the average of the posterior draws of T_i for every individual and use it as a probability of item preknowledge. So, we obtain P(T_i = 1) = P(\tau_{ci} > \tau_{ti}) by fitting this model. Then, we can make a decision about each individual using a cut-off value for the posterior probability (e.g., 0.95).
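
Written out case by case, the expression for \mu_{ij} simply uses the 0/1 values of T_i and C_j as switches to select one of two conditional means:

  • If T_i = 0 (honest examinee): \mu_{ij} = \beta_j - \tau_{ti} for every item.
  • If T_i = 1 and C_j = 0 (examinee with preknowledge, uncompromised item): \mu_{ij} = \beta_j - \tau_{ti}.
  • If T_i = 1 and C_j = 1 (examinee with preknowledge, compromised item): \mu_{ij} = \beta_j - \tau_{ci}.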

Below is the code we are using to fit this model. We get great convergence, and we have applied it to both simulated and real data. It worked reasonably well for identifying fraudulent individuals with item preknowledge.

The follow-up question is this: can we also estimate C_j for each item simultaneously during the model fitting process, so we don’t have to assume that the compromise status of each item is known? This is what happens in practice; we don’t typically know whether or not items have been stolen. So, the model has limited use in its current form, where C enters the model as known data rather than as a parameter to estimate.

data{
  int <lower=1> I;          // number of examinees          
  int <lower=1> J;         // number of items
  real RT[I,J];               // a matrix for log of response time
  int C[J];                    // vector for item compromise status (0s and 1s)
}

parameters {
  vector[J] beta;                        // vector for time intensity parameters
  vector <lower=0> [J]  alpha;  // vector for time discrimination parameters
  vector[I] tau_t;                       // vector for latent speed for honest responses
  vector[I] tau_c;                      // vector for latent speed for fraudulent responses
  real<lower=0> sigma_taut;              // sd for tau_t
  real<lower=0> sigma_tauc;              // sd for tau_c
}

transformed parameters {
  vector[I] T;

  // T[i] depends only on the person, so there is no need to loop over items
  for (i in 1:I) {
    if (tau_t[i] < tau_c[i])
      T[i] = 1;
    else
      T[i] = 0;
  }
}


model{
  
  sigma_taut ~ exponential(1);
  sigma_tauc ~ exponential(1);
  
  tau_t    ~ normal(0,sigma_taut);
  tau_c    ~ normal(0,sigma_tauc);
  
  beta     ~ normal(4,1);                  // these numbers can be estimated from observed data
  alpha    ~ inv_gamma(100,128); // these numbers can be estimated from observed data
      
  for (i in 1:I) {
    for(j in 1:J) {
      
      real p_t = beta[j]-tau_t[i];
      real p_c = beta[j]-tau_c[i];
      real p   = (p_t^(1-T[i]))*(((1-C[j])*p_t + C[j]*p_c)^T[i]);
    
       RT[i,j] ~ normal(p,1/alpha[j]);
    }
  }
}

I defined C as a parameter, but I can’t figure out how to sample it.

parameters {
...
  vector[J] C;                       // vector for item compromise status
}

Then, I tried C ~ uniform(0,1) and C ~ beta(1,1) in the model statement, but I didn’t get any sensible result.

I was imagining doing something like this.

parameters {
...
  real<lower=0,upper=1>  pr;       // proportion of compromised items
  vector[J] C;                    // vector for item compromise status
}

Then, in the model statement,

model {
...
  pr ~ beta(1,1);
  C ~ bernoulli(pr);
}

However, I understand that this will not work because Stan doesn’t handle discrete parameters. I have read a few things about marginalizing over the discrete parameters to handle this kind of situation, but can’t implement it in this problem.

I am sorry for the long post. This comes out of frustration after a couple of days of trying, and I am almost about to give up. I hope someone can guide me toward some potential solutions to overcome this.

Thank you so much for your time and reading till the end.


I think this will be conceptually easiest if you marginalize over both whether the item is compromised and whether the test-taker is cheating. The way I’d approach this is to:

  1. Write down a function that gives the likelihood of a single observed response time given {compromised and cheating}, times the probability of {compromised and cheating}.
  2. Same thing, but for {compromised and not cheating}, {not compromised and cheating}, {not compromised and not cheating}.
  3. Recognize that the four situations above are mutually exclusive, so add their likelihoods (function log_sum_exp is useful here).
  4. Loop (or vectorize) this function over all of the observations; a generic single-indicator sketch of these steps follows below.
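
In generic terms, for a single binary indicator z with P(z = 1) = \theta, these steps look like the sketch below (a minimal illustration with made-up names such as mu0, mu1, and theta, not code for this particular model):

data {
  int<lower=1> N;
  vector[N] y;
}

parameters {
  real mu0;                        // mean of y when z = 0
  real mu1;                        // mean of y when z = 1
  real<lower=0> sigma;
  real<lower=0, upper=1> theta;    // P(z = 1)
}

model {
  mu0   ~ normal(0, 5);
  mu1   ~ normal(0, 5);
  sigma ~ exponential(1);
  theta ~ beta(1, 1);

  for (n in 1:N) {
    // the two states are mutually exclusive, so add their
    // probability-weighted likelihoods on the log scale
    target += log_sum_exp(log1m(theta) + normal_lpdf(y[n] | mu0, sigma),
                          log(theta)   + normal_lpdf(y[n] | mu1, sigma));
  }
}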

Thank you for taking the time to read and respond, Jacob!

Unfortunately, this is where I feel really dumb and can’t operationalize the verbal descriptions and put them into code. Would you be able to kindly provide some sample code showing how to implement this?

I am sorry that it takes longer than typical for people like me, without a strong background in math and statistics, to translate these ideas into code.


I believe you are more than capable of deriving these quantities, with the right nudge.
The main idea is to take

real p = (p_t^(1-T[i]))*(((1-C[j])*p_t + C[j]*p_c)^T[i]);

and substitute the four combinations of T_i \in \{0, 1\} and C_j \in \{0, 1\} in turn.
Let’s follow @jsocolar’s suggestion for T_i = 1 and C_j = 0:

real p2   = (p_t^(1-T[i]))*((p_t)^T[i]);

Now, let’s consider T_i = 0 and C_j = 1:

real p3   = p_t;

Now you can complete the pattern to take care of the remaining situations (T_i =1, C_j = 1 and T_i =0, C_j = 0).

You’ll then need a line somewhere with

// likelihood
target += log_sum_exp([lp1, lp2, lp3, lp4]);

where lp1 = normal_lpdf(RT[i, j] | p1, 1/alpha[j]).

Maybe @jsocolar has suggestion on how to vectorise efficiently.


@Cengiz_Zopluoglu as far as I know this stuff is hard for absolutely everybody until they’ve seen it a few times. Nothing dumb here.

By the way, I wouldn’t worry about trying to vectorize this as you get off the ground, but if you find that sampling is too slow there might be some efficiency gains available vectorizing the computation of p2, p3, etc, and likewise the call to normal_lpdf.


Second that.


Thank you @maxbiostat and @jsocolar for providing a potential solution. And, yes, I am not worried about efficiency; I just need something that runs and converges. I can worry about vectorizing later.

I am also reading these documents.

I feel better but I am still stuck. Here is what I currently have based on your suggestions and other resources I can find on the web. Could you please have a look to see if I am getting it right?

One thing I am still stuck on is how to sample 0s and 1s for each individual and each item at each iteration (the vector C and the vector T).

Thank you again.

data{
  int <lower=1> I;          // number of examinees          
  int <lower=1> J;          // number of items
  real RT[I,J];             // Response time matrix
}

parameters {
  vector[J] beta;                        // vector for time intensity parameters
  vector <lower=0> [J]  alpha;           // vector for time discrimination parameters
  vector[I] tau_t;                       // vector for latent speed for honest responses
  vector[I] tau_c;                       // vector for latent speed for fraudulent responses
  real<lower=0> sigma_taut;              // sd for tau_t
  real<lower=0> sigma_tauc;              // sd for tau_c
  vector[I] T;                           // vector for probability of preknowledge
  vector[J] C;                           // vector for probability of compromised
}

model{
  
  sigma_taut ~ exponential(1);
  sigma_tauc ~ exponential(1);
  
  tau_t    ~ normal(0,sigma_taut);
  tau_c    ~ normal(0,sigma_tauc);
  
  beta     ~ normal(4,1);
  alpha    ~ inv_gamma(100,128);
  
  T ~ beta(1,1); // ?????
  C ~ beta(1,1); // ?????
      
  for (i in 1:I) {
    for(j in 1:J) {
      
      real p_t = beta[j]-tau_t[i];
      real p_c = beta[j]-tau_c[i];
  
      real lp1 = log(0.25) + normal_lpdf(RT[i,j] | p_t, 1/alpha[j]);  // T = 0, C=0
      real lp2 = log(0.25) + normal_lpdf(RT[i,j] | p_t, 1/alpha[j]);  // T = 1, C=0
      real lp3 = log(0.25) + normal_lpdf(RT[i,j] | p_t, 1/alpha[j]);  // T = 0, C=1
      real lp4 = log(0.25) + normal_lpdf(RT[i,j] | p_c, 1/alpha[j]);  // T = 1, C=1 
      
      target += log_sum_exp([lp1, lp2, lp3, lp4]);
      
    }
  }
}

I can take a more detailed pass over this sometime next week, perhaps. In the meantime, if you want to keep tinkering, the next step is to recognize that the probability of being in any of the four regimes is not simply 0.25 for every regime and every observation. Instead, this is a quantity that you want to model. So, for example, you’ll be modeling the probability that question j was compromised, and you’ll be modeling the probability that subject i was cheating, and these modeled probabilities are the pieces that you need to get the correct contribution from each of the four possible states.

For example, you might declare parameters vector<lower=0, upper=1>[J] p_compromised; and vector<lower=0, upper=1>[I] p_cheater; and then you would use log(p_compromised[j]) + log(p_cheater[i]) in place of log(0.25) for the {compromised, cheater} case. For, e.g., the {compromised, non-cheater} case you’d use log(p_compromised[j]) + log1m(p_cheater[i]).

The really nice thing is that at the end of the day, you’ll have the p_compromised parameters that are directly interpretable as the probabilities that each question is compromised, and the p_cheater parameters that are directly interpretable as the probabilities that each subject cheated.


Thank you so much. These are very helpful. I will keep working on this, and post it here if I can figure this out.

Ok, I think it now makes sense. After going back and reading your messages and the Stan manual one more time, I believe I now have a better idea of what is happening when people talk about marginalizing over discrete parameters.

Below is my summary (and I hope I got the concept right).

P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j) =

P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 0,T_i = 0) \times P(C_j=0) \times P(T_i = 0) +
P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 0,T_i = 1) \times P(C_j=0) \times P(T_i = 1) +
P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 1,T_i = 0) \times P(C_j=1) \times P(T_i = 0) +
P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 1,T_i = 1) \times P(C_j=1) \times P(T_i = 1)

Instead of working with probabilities, we work with the log of these probabilities for numerical stability.

So,

The log of the first term,
log(P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 0,T_i = 0) \times P(C_j=0) \times P(T_i = 0) ), translates to

normal_lpdf(RT[i,j] | beta[j]-tau_t[i], 1/alpha[j]) + log1m(C[j]) + log1m(T[i]) 

The log of the second term,
log(P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 0,T_i = 1) \times P(C_j=0) \times P(T_i = 1) ), translates to

normal_lpdf(RT[i,j] | beta[j]-tau_t[i], 1/alpha[j]) + log1m(C[j]) + log(T[i]) 

The log of the third term,
log(P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 1,T_i = 0) \times P(C_j=1) \times P(T_i = 0) ), translates to

normal_lpdf(RT[i,j] | beta[j]-tau_t[i], 1/alpha[j]) + log(C[j]) + log1m(T[i]) 

The log of the fourth term,

log(P(Y_{ij} | \tau_{ti},\tau_{ci}, \beta_j,\alpha_j,C_j = 1,T_i = 1) \times P(C_j=1) \times P(T_i = 1) ), translates to

normal_lpdf(RT[i,j] | beta[j]-tau_c[i], 1/alpha[j]) + log(C[j]) + log(T[i]) 

Then, the log_sum_exp() function exponentiates each term back to the probability scale, sums these probabilities, and takes the log of the sum.
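
In other words, for the four log terms above,

log_sum_exp(lp_1, lp_2, lp_3, lp_4) = \log \left( e^{lp_1} + e^{lp_2} + e^{lp_3} + e^{lp_4} \right)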

It all suddenly makes perfect sense now.

Below is the most recent version of the code, which seems to be working now. I tested it with a small simulated data set; all the parameters converge nicely, and I get the probability of compromise for each item and the probability of preknowledge for each examinee. Whether or not these probabilities have any predictive value is a separate issue and depends on a lot of factors. I will probably run some extensive simulations to see how well it works under different conditions.

Thank you @jsocolar and @maxbiostat again. I really appreciate your patience and help. This is super neat!

data{
  int <lower=1> I;          // number of examinees          
  int <lower=1> J;          // number of items
  real RT[I,J];             // Response time matrix
}

parameters {
  vector[J] beta;                        // vector for time intensity parameters
  vector <lower=0> [J]  alpha;           // vector for time discrimination parameters
  vector[I] tau_t;                       // vector for latent speed for honest responses
  vector[I] tau_c;                       // vector for latent speed for fraudulent responses
  real<lower=0> sigma_taut;              // sd for tau_t
  real<lower=0> sigma_tauc;              // sd for tau_c
  vector<lower=0, upper=1>[I] T;        // vector for probability of examinee preknowledge
  vector<lower=0, upper=1>[J] C;        // vector for probability of item compromise
   
}

model{
  
  sigma_taut ~ exponential(1);
  sigma_tauc ~ exponential(1);
  
  tau_t    ~ normal(0,sigma_taut);
  tau_c    ~ normal(0,sigma_tauc);
  
  beta     ~ normal(4,1);                // numbers are estimated from observed data
  alpha    ~ inv_gamma(100,128);         // numbers are estimated from observed data
  
  T  ~ beta(1,1);
  C  ~ beta(1,1);

  for (i in 1:I) {
    for(j in 1:J) {
      
      real p_t = beta[j]-tau_t[i];
      real p_c = beta[j]-tau_c[i];
  
      real lp1 = log1m(C[j]) + log1m(T[i]) + normal_lpdf(RT[i,j] | p_t, 1/alpha[j]);  // T = 0, C=0
      real lp2 = log1m(C[j]) + log(T[i])   + normal_lpdf(RT[i,j] | p_t, 1/alpha[j]);  // T = 1, C=0
      real lp3 = log(C[j])   + log1m(T[i]) + normal_lpdf(RT[i,j] | p_t, 1/alpha[j]);  // T = 0, C=1
      real lp4 = log(C[j])   + log(T[i])   + normal_lpdf(RT[i,j] | p_c, 1/alpha[j]);  // T = 1, C=1 
      
      target += log_sum_exp([lp1, lp2, lp3, lp4]);
      
    }
  }
}


It seems the problem is partially resolved. Here is another issue I am facing as I play with more simulated datasets. While I am not certain, I think the issue is closely related to what is being discussed here.

https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html

Although the model is not a typical mixture model, it has similar characteristics. Chains are going east and west for most P(C) and P(T) parameters, yielding bimodal posterior distributions.

Below is an example.

[Trace plots omitted: chains settle in different modes for these parameters, producing bimodal posteriors.]

I will work on finding a solution. Any idea/suggestion is appreciated.

I don’t think you should be looking at mixture models in the context of the pathology in those plots. Possibly you got to that idea through the phrase “the chains are not mixing”? Regardless, what you’re observing means that the posterior is multimodal and the chains are exploring different modes. Sometimes multi-modality is inherent, but often it indicates that the model is under-identified.

Sorry, ignore me. I see now your model IS a mixture model.

Thank you. I think I found the trick. I made a few changes in the code. First, I now draw the item and person parameters from multivariate normal distributions instead of drawing them independently. I also defined the person parameters as an ordered vector, which forces \tau_c to be larger than \tau_t for every person. This resolved the direction issue. Now everything works as expected: chains are exploring the same mode, and the posteriors look fine.

Below is the new version of the code to fit this model.

data{
  int <lower=1> I;              // number of examinees          
  int <lower=1> J;              // number of items
  real RT[I,J];                 // matrix of log response times
}

parameters {
  real mu_beta;                 // mean for time intensity parameters
  real<lower=0> sigma_beta;     // sd for time intensity parameters
  
  real mu_alpha;                // mean for log of time discrimination parameters
  real<lower=0> sigma_alpha;    // sd for time discrimination parameters
  
  real<lower=0> sigma_taut;     // sd for tau_t
  real<lower=0> sigma_tauc;     // sd for tau_c
  
  corr_matrix[2] omega_P;       // 2 x 2 correlation matrix for person parameters
  corr_matrix[2] omega_I;       // 2 x 2 correlation matrix for item parameters
  
  vector<lower=0,upper=1>[J] pC; // vector of length J for the probability of item compromise status
  
  vector<lower=0,upper=1>[I] pH; // vector of length I for the probability of examinee item preknowledge
  
  ordered[2] person[I];          // an array with length I for person specific latent parameters
                                 // Each array has two elements
                                 // first element is tau_t
                                 // second element is tau_c
                                 // ordered vector makes sure that tau_c > tau_t for every person
                                 // to make sure chains are exploring the same mode and 
                                 // do not go east and west leading multi-modal posteriors
  
  vector[2] item[J];             // an array with length J for item specific parameters
                                 // each array has two elements
                                 // first element is log of alpha parameter
                                 // second element is beta
}

transformed parameters{
  
  vector[2] mu_P;                        // vector for mean vector of person parameters 
  vector[2] mu_I;                        // vector for mean vector of item parameters
  
  vector[2] scale_P;                     // vector of standard deviations for person parameters
  vector[2] scale_I;                     // vector of standard deviations for item parameters
  
  cov_matrix[2] Sigma_P;                 // covariance matrix for person parameters
  cov_matrix[2] Sigma_I;                 // covariance matrix for item parameters
  
  mu_P[1] = 0;
  mu_P[2] = 0;
  
  scale_P[1] = sigma_taut;               
  scale_P[2] = sigma_tauc;
  
  Sigma_P = quad_form_diag(omega_P, scale_P); 
  
  mu_I[1] = mu_alpha;
  mu_I[2] = mu_beta;
  
  scale_I[1] = sigma_alpha;               
  scale_I[2] = sigma_beta;
  
  Sigma_I = quad_form_diag(omega_I, scale_I); 
  
}

model{
  
  sigma_taut  ~ exponential(1);
  sigma_tauc  ~ exponential(1);
  sigma_beta  ~ exponential(1);
  sigma_alpha ~ exponential(1);
  
  mu_beta      ~ normal(4,1);
  mu_alpha     ~ lognormal(0,0.5);   // note that item[j,1] is the log of alpha, so exp(item[j,1]) is used below
  
  pC ~ beta(1,1);
  pH ~ beta(1,1);
  
  omega_P   ~ lkj_corr(1);
  omega_I   ~ lkj_corr(1);
  
  person  ~ multi_normal(mu_P,Sigma_P);
  
  item    ~ multi_normal(mu_I,Sigma_I);
  
  
  for (i in 1:I) {
    for(j in 1:J) {
      
      // item[j,1] is the log of parameter alpha for the jth item,
      //   which is why we use exp(item[j,1]) below
      // item[j,2] is parameter beta for the jth item
      
      // person[i,1] is parameter tau_t for the ith person
      // person[i,2] is parameter tau_c for the ith person
      
      real p_t = item[j,2] - person[i,1];   // expected log response time for a non-cheating response
      real p_c = item[j,2] - person[i,2];   // expected log response time for a cheating response
      
      // log of probability densities for each combination of two discrete parameters
      // (C,T) = {(0,0),(0,1),(1,0),(1,1)}
      
      real lprt1 = log1m(pC[j]) + log1m(pH[i]) + normal_lpdf(RT[i,j] | p_t, 1/exp(item[j,1]));  // T = 0, C=0
      real lprt2 = log1m(pC[j]) + log(pH[i])   + normal_lpdf(RT[i,j] | p_t, 1/exp(item[j,1]));  // T = 1, C=0
      real lprt3 = log(pC[j])   + log1m(pH[i]) + normal_lpdf(RT[i,j] | p_t, 1/exp(item[j,1]));  // T = 0, C=1
      real lprt4 = log(pC[j])   + log(pH[i])   + normal_lpdf(RT[i,j] | p_c, 1/exp(item[j,1]));  // T = 1, C=1 
      
      target += log_sum_exp([lprt1, lprt2, lprt3, lprt4]);
      
    }
  }
  
}