Non-convergence issue with a polytomous IRT model

Hi! I am a baby Stan modeler, and I have been trying to fit a polytomous IRT model (specifically, the nominal response model) using simulated response data.

The nominal response model (NRM; Bock, 1972) models the probability that an examinee j with latent ability \theta_j selects category k of item i as

P(u_{ij}=k \mid \theta_j) = \frac{\exp(a_{ik}\theta_j + c_{ik})}{\sum_{h=1}^{m_i}\exp(a_{ih}\theta_j + c_{ih})}.

In the NRM, each category k of an item i has its own slope parameter a_{ik} and intercept parameter c_{ik}, so each item has as many slope and intercept parameters as it has categories. For example, if item i has 4 categories, it has a vector of 4 slope parameters and a vector of 4 intercept parameters.
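Note that the category probabilities are invariant to adding a constant to every slope or every intercept of an item, since the constant cancels between numerator and denominator:

\frac{\exp((a_{ik}+d)\theta_j + (c_{ik}+e))}{\sum_{h=1}^{m_i}\exp((a_{ih}+d)\theta_j + (c_{ih}+e))} = \frac{\exp(a_{ik}\theta_j + c_{ik})}{\sum_{h=1}^{m_i}\exp(a_{ih}\theta_j + c_{ih})},

which is why an identification constraint on the parameters of each item (such as a sum-to-zero constraint) is needed.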

So I implemented the NRM in Stan with the simulated response data (attached as Fatigue_NRM_sim_resp_20.csv, 39.2 KB) as follows:


resp <- resp + 1  # shift 0-based responses to 1..K for categorical_logit
N <- nrow(resp)
T <- ncol(resp)
data_nrm <- list(n_student = N, n_item = T, response = resp, K = 5)

nrm <- "
  data{
    int<upper=5> K; //  number of categories
    int <lower=0> n_student; //  number of individuals
    int <lower=0> n_item; //  number of items
    int<lower=1,upper=K> response[n_student,n_item]; //array of responses
  }

  parameters {
    vector[K] zeta[n_item]; // intercept
    vector[K] lambda[n_item]; // slope
    vector[n_student] theta; // latent trait
  }

  transformed parameters {
    vector[K] zetan[n_item]; // centered intercept
    vector[K] lambdan[n_item]; // centered slope

    for (k in 1:n_item) {
      for (l in 1:K) {
        zetan[k,l] = zeta[k,l] - mean(zeta[k]);
        lambdan[k,l] = lambda[k,l] - mean(lambda[k]);
      }
    }

  }

  model{
    theta ~ normal(0,1);
    for (i in 1:n_item){
      zeta[i] ~ normal(0,4);
      lambda[i] ~ normal(0,4);
    }
    for (i in 1:n_student){
      for (j in 1:n_item){
        response[i,j] ~ categorical_logit(zetan[j] + lambdan[j]*theta[i]);
      }
    }
}
"

My questions are about convergence, because the estimation reached convergence only very occasionally. Once the chains did converge, though, all the parameters seemed to be recovered well.

  1. I tried increasing the number of iterations, but that alone did not really help attain convergence. Why would increasing the number of iterations not help convergence in this case?

  2. I tried priors with smaller variance, such as zeta[i] ~ normal(0, 1.5) and lambda[i] ~ normal(0, 1.5). This seemed to help convergence at first, but multiple attempts with exactly the same code and priors did not always converge. Why would convergence be attained one time but not another with the exact same code?

  3. When the chains did not converge, the problematic parameters were the slope and theta parameters; the intercept parameters almost always converged. I checked the posterior densities for the slope and theta parameters and found that many of them were bimodal, which made me believe the bimodal posterior densities were the reason for non-convergence. What could I fix to correct these bimodal densities? [Attached image: bimodal posterior density plots]

Thank you so much!

Sincerely,

Sue

Is this sum-to-zero constraint necessary?

transformed parameters {
   vector[K] zetan[n_item]; // centered intercept
   vector[K] lambdan[n_item]; // centered slope

   for (k in 1:n_item) {
     for (l in 1:K) {
       zetan[k,l] = zeta[k,l] - mean(zeta[k]);
       lambdan[k,l] = lambda[k,l] - mean(lambda[k]);
     }
   }
}

Maybe this is a problem with identifiability?
One way to check would be to see whether the term
zetan[j] + lambdan[j]*theta[i] converges. You could also check whether different chains converge to different typical sets, or whether you observe multimodality within a chain.

One solution could be to fix the first theta to be larger than zero.

I was thinking along the lines of keeping theta positive to avoid the equivalent solution where both lambda and theta change sign. I also included bbbales2’s suggestion of not centering lambda, though I left in the centering of zeta. I didn’t really have a reason for that; it was late and I got sloppy. In any case, the following model did converge on the posted data with Rhat at 1.00 for all parameters. There were a handful (5-ish) of transitions that reached the maximum treedepth, so it is not completely without problems. I do not know if forcing theta to be positive is acceptable for this purpose; I just wanted to check how it would affect convergence.

data{
  int<upper=5> K; //  number of categories
  int <lower=0> n_student; //  number of individuals
  int <lower=0> n_item; //  number of items
  int<lower=1,upper=K> response[n_student,n_item]; //array of responses
}

parameters {
  vector[K] zeta[n_item]; // intercept
  vector[K] lambda[n_item]; // slope
  vector<lower=0>[n_student] theta; // latent trait
}

transformed parameters {
  vector[K] zetan[n_item]; // centered intercept
  //vector[K] lambdan[n_item]; // centered slope

  for (k in 1:n_item) {
    for (l in 1:K) {
      zetan[k,l] = zeta[k,l] - mean(zeta[k]);
      //lambdan[k,l] = lambda[k,l] - mean(lambda[k]);
    }
  }
}

model{
  theta ~ gamma(1, 0.8); //normal(0,1);
  for (i in 1: n_item){
    zeta[i] ~ normal(0,4);
    lambda[i] ~ normal(0,4);
  }
  for (i in 1:n_student){
    for (j in 1:n_item){
      response[i,j] ~ categorical_logit(zetan[j]+ lambda[j] * theta[i]); //lambdan[j]*
    }
  }
}

Just to be clear, I did not propose to constrain all theta’s above zero, but only the first.

If you exceed the max treedepth, you can just set it to a higher value.
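With rstan that goes through the control argument. A sketch (the exact fitting call was not shown in this thread, so the values are illustrative):

```r
# Raise max_treedepth from its default of 10
fit <- stan(model_code = nrm, data = data_nrm,
            chains = 4, iter = 2000,
            control = list(max_treedepth = 12))
```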

Hi!

One of the typical constraints for the NRM, for model identification purposes, is to have the parameters of each item sum to zero. That is why I included the transformed parameters section in the code. I quickly tried a model without the transformed parameters section, like below, but it didn’t converge.

nrm <- "
  data{
    int<upper=5> K; //  number of categories
    int <lower=0> n_student; //  number of individuals
    int <lower=0> n_item; //  number of items
    int<lower=1,upper=K> response[n_student,n_item]; //array of responses
  }

  parameters {
    vector[K] zeta[n_item]; // intercept
    vector[K] lambda[n_item]; // slope
    vector[n_student] theta; // latent trait
  }
  model {
    theta ~ normal(0,1);
    for (i in 1:n_item){
      zeta[i] ~ normal(0,4);
      lambda[i] ~ normal(0,4);
    }

    for (i in 1:n_student){
      for (j in 1:n_item){
        response[i,j] ~ categorical_logit(zeta[j]+lambda[j]*theta[i]);
      }
    }
}
"

Do you think the presence of transformed parameters might have affected convergence?

Thank you!

Sue

This seems very similar to an issue from another post: the indeterminacy of the latent factor theta. At any given iteration the factor can point in a positive or a negative direction, which here would mean “high fatigue” versus “low fatigue”.

https://discourse.mc-stan.org/t/non-convergence-of-latent-variable-model/12450/15

The solution I suggested was to estimate the factor loadings without constraint, and to adjust the direction of the factor as a function of one item, in the generated quantities block.

Here I am “fixing” the factor loading for item 1, category 1, to be negative. So if an iteration gives a positive value to this factor loading, the code multiplies all the factor loadings and the latent factor by -1. That way lambdan_swt and theta_swt go in the direction I wanted.

I ran your code with this new block, and it converged with 5000 iterations.

generated quantities{
  vector[K] lambdan_swt[n_item]; // sign-adjusted lambda
  vector[n_student] theta_swt; // sign-adjusted latent trait

  lambdan_swt = lambdan;
  theta_swt = theta;

  if(lambdan[1,1] > 0){
    for (i in 1:K){
      for (j in 1:n_item){
        lambdan_swt[j,i] = -1*lambdan[j,i];
      }
    }
    theta_swt = -1*theta;
  }
}

Notice that I did this as a function of the centered factor loadings. Since I am not familiar with this model, I don’t know if it would work with the uncentered ones, or which ones are the “right” ones to use.


What @Mauricio_Garnier-Villarre says sounds like the right thing.

I was worried about the sum-to-zero constraint, but I think I was wrong to be here. You only need N - 1 free parameters to define a length-N vector with a sum-to-zero constraint, but if what you have works, don’t worry about this and just roll with what you have. There was a thread where people talked quite a bit about these sum-to-zero approaches, but I don’t think anything was absolutely conclusive: Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain.
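A hard sum-to-zero version with K - 1 free parameters per item could be sketched like this (an illustrative fragment I have not tested on this model, shown for the intercepts only; the names are made up):

```stan
parameters {
  vector[K-1] zeta_free[n_item]; // K-1 free intercepts per item
}
transformed parameters {
  vector[K] zetan[n_item]; // intercepts with an exact sum-to-zero constraint
  for (i in 1:n_item)
    zetan[i] = append_row(zeta_free[i], -sum(zeta_free[i]));
}
```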


Thank you so much! I will try constraining the first theta to be above zero. Now I understand why setting a direction is needed, thanks to @Mauricio_Garnier-Villarre’s comment.

But actually, I have not yet learned how to constrain a particular parameter (e.g., the first theta). Would it be possible to point me to how I can achieve that?

Thank you so much!

Sincerely,

Sue

Thank you so much for your comment! I tried your code and it actually converged (which has been pretty rare to see). I am curious how constraining theta to be positive might have contributed to convergence. Maybe it helped because it is more informative than N(0,1)? Do you have any theory on this?

First, I want to caution you that I am only a casual user of Stan and am just learning about it, so my thinking could well be wrong.
What I noticed about your model is that both theta and lambdan can be either positive or negative, but only their product is used in calculating the log probability.

response[i,j] ~ categorical_logit(zetan[j]+lambdan[j]*theta[i]);

Imagine that a certain value x resulting from lambdan[j]*theta[i] produces a high log probability. If x is positive, then there are two ways to get that result, lambdan and theta can both be positive or they can both be negative. Similarly, if x is negative there are also two equivalent pairs of lambdan and theta. One has to be positive and the other negative but it does not matter which. As the chains explore parameter values they have no grounds on which to pick one solution over another and they struggle to converge. This bimodality was shown in the graph in your first post. A crude way to fix that is to force one of the parameters to be positive or negative, which is what I tried. It is probably not a valid model of your problem but it does converge.

A better answer was provided by @Mauricio_Garnier-Villarre. I do not entirely understand it, but I modified it a bit, moving the calculations into transformed parameters rather than generated quantities, and tried fitting the data. The parameters theta_swt and lambdan (equivalent to his lambdan_swt) did indeed converge with 1000 total iterations. theta and lambda did not converge, so warnings were thrown about that.
Here is my version of his suggestion.

data{
  int<upper=5> K; //  number of categories
  int <lower=0> n_student; //  number of individuals
  int <lower=0> n_item; //  number of items
  int<lower=1,upper=K> response[n_student,n_item]; //array of responses
}

parameters {
  vector[K] zeta[n_item]; // intercept
  vector[K] lambda[n_item]; // slope
  vector[n_student] theta; // latent trait
}

transformed parameters {
  vector[K] zetan[n_item]; // centered intercept
  vector[K] lambdan[n_item]; // centered slope
  //vector[K] lambdan_swt[n_item]; // sign adjusted lambda
  vector[n_student] theta_swt; // latent trait
  for (k in 1:n_item) {
    for (l in 1:K) {
      zetan[k,l] = zeta[k,l] - mean(zeta[k]);
      lambdan[k,l] = lambda[k,l] - mean(lambda[k]);
    }
  }

//lambdan_swt = lambdan;
theta_swt = theta;

if(lambdan[1,1] > 0){
  for (i in 1:K){
    for (j in 1:n_item){
      lambdan[j,i] = -1*lambdan[j,i]; }}
  theta_swt = -1*theta;
}
}

model{
  theta ~ normal(0,1);
  for (i in 1: n_item){
    zeta[i] ~ normal(0,4);
    lambda[i] ~ normal(0,4);
  }
  for (i in 1:n_student){
    for (j in 1:n_item){
      response[i,j] ~ categorical_logit(zetan[j]+ lambdan[j] * theta_swt[i]); //lambdan[j]*
    }
  }
}

You would need to specify

  • a real for the first theta with the constraint and
  • a vector for the remaining thetas without constraints

separately in the parameters block. Then you can concatenate them in the transformed parameters block.
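A minimal sketch of that (the names are illustrative, not from any existing code in this thread):

```stan
parameters {
  real<lower=0> theta_first;        // first theta, constrained positive
  vector[n_student-1] theta_rest;   // remaining thetas, unconstrained
}
transformed parameters {
  // concatenate into the full latent trait vector
  vector[n_student] theta = append_row(theta_first, theta_rest);
}
```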

In principle, this method should achieve the same as @Mauricio_Garnier-Villarre’s approach with fewer computations. But in another thread about a related problem, @Mauricio_Garnier-Villarre reports that his method is more robust.

In the estimation part you should use the unconstrained parameters. Let the model run “as usual”, and sign-adjust after the fact.

With this approach you do not check the unconstrained parameters for convergence, as they will not find a single solution; they are finding two solutions, a positively and a negatively signed one. You would keep track of, and check convergence for, the intercepts, the sign-adjusted slopes, and the sign-adjusted thetas. All of these converged in my example.

You need to set a direction for the meaning of theta, and this actually is not done by setting one specific theta value to be positive or negative, because factor indeterminacy means that a subject’s theta value gets its “meaning” from the theta values of the other subjects. For example, let’s say subject 1 gets theta = -0.4 and subject 2 gets theta = -0.6; for theta this is the same as if subject 1 gets +0.4 and subject 2 gets +0.6. What is relevant between these two subjects is the distance between them, more than the sign.
With my code I am setting the direction of theta as a function of the model parameters, by sign-adjusting lambda and theta so that the first slope of the first item is negative (you could choose to set it as positive instead). For comparison, in CFA a common method is to fix a factor loading equal to 1, which sets the point of reference for the factor.

For example, in the first plot you showed, it’s clear the model “converged” on the absolute value of the slope being around |1.5|. This symmetric bimodal distribution is a clear indicator of the need to set a meaningful direction for theta as a function of the parameters.

This comes from my experience and from the source code of blavaan. Ed Merkle has comments in his code noting that setting the direction of the factor by fixing a factor loading to be positive does not work in Stan the way it works in JAGS. I tried the same, with the same outcome: the model only converged to the correct simulated values when using the sign-adjust method.


Yeah, this is weird/surprising to me. Do you know if somebody tried this with simulated data, where the factor loadings are known?

I have tested with simulated data. I could only find a converged solution that replicated the simulated values with the sign-switch method.
