Max Treedepth Exceeded = Lack of Convergence?

In ordinal regression a global intercept is nonidentified with the cutpoints by construction. Removing alpha removes this problem entirely. You could also anchor alpha and the cutpoints with sufficiently informative priors, just like alpha is typically identified in logistic IRT models.

I see, so strengthening the prior on alpha from N(0, 10) to something more informative would make it identifiable? And this would also require changing the prior on cutpoints from N(0, 5)?

The fundamental difficulty is that alpha is just not a meaningful quantity in the model as it is written.

Let’s think generatively and see how the predictions are made.
Each individual is assumed to have a latent theta.
Each item measures theta through the predictor alpha + beta*theta. The rule is that if the predictor falls between the (i-1)th and (i)th cutpoints then the prediction is y=i.
The distance between cutpoints indicates precision; if it is larger than one then the predictions are confident; if it is less than one then the predictions are uncertain.
The interpretation of alpha is that it gives the value of the predictor when theta=0, its minimum. But that tells you nothing if you don’t know where the cutpoints are. The predicted outcome depends only on the relative position of alpha among the cutpoints.

You could fix this by removing alpha and reinterpreting the first cutpoint as item difficulty. However, I suspect that trying to estimate cutpoint locations is more trouble than it’s worth. Evenly spaced cutpoints probably work just as well.

transformed data {
  ordered[k-1] c; // cutpoints 
  for (i in 1:k-1)
    c[i] = i;
}
...
model {
  y1 ~ ordered_logistic(alpha[1] + beta[1] * theta, c);
  y2 ~ ordered_logistic(alpha[2] + beta[2] * theta, c);
  ...
}

Thank you so much for the insight. This assumes, however, that the evenly spaced cutpoints are the same for all 4 manifest y’s, correct? Is this an okay assumption to make? And with this formulation, do the N(0, 10) prior on alpha and the N(0, 5) prior on c (now one parameter instead of 4) still make sense?
I would also need to use a loop in the model in order to bypass the 2.19 bug in the ordered logistic function, right?

N(0, 10) is reasonable for alpha. Cutpoints are assumed known so there’s no prior on c.

The exact position of cutpoints is irrelevant as long as the prior on alpha is vague.
The spacing between cutpoints (let’s call it s) does two things:

  1. The interpretation of beta changes: beta/s is the expected difference in y between theta=0 and theta=1 individuals.
  2. The probability assigned to a predicted y value in one of the interior categories is at most (exp(s/2)-1)/(exp(s/2)+1), about 0.24 for s=1 and 0.46 for s=2.

The first effect could (and probably should) be removed by using alpha+s*beta*theta instead of alpha+beta*theta as the predictor.
If you think s=1 is too strong an assumption, you can make s a parameter:

...
parameters {
  real<lower=0> s[j];
  ...
}
transformed parameters {
  vector[k-1] c1 = s[1] * c;
  vector[k-1] c2 = s[2] * c;
  ...
}

s needs an informative prior, maybe lognormal(0, 0.5).
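
Putting the pieces together, the likelihood statements would then look something like this (just a sketch of the idea, with the remaining items handled the same way):

model {
  // s scales both the cutpoints (via c1, c2) and the slope, so that beta
  // (rather than beta/s) is the expected difference in y between
  // theta=0 and theta=1 individuals
  y1 ~ ordered_logistic(alpha[1] + s[1] * beta[1] * theta, c1);
  y2 ~ ordered_logistic(alpha[2] + s[2] * beta[2] * theta, c2);
  ...
}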


I see. So would this model properly identify everything we need to know? Just want to make sure I understood everything.

data{

  int<lower=0> n; // number of data points
  int<lower=0> j; // number of item parameters
  int<lower=2> k; // number of response categories (so k-1 cutpoints)
  
  int y1[n]; 
  int y2[n]; 
  int y3[n]; 
  int y4[n]; 

}

parameters{

vector[j] alpha; // item difficulty parameters
real<lower=0> beta[j]; // item discrimination parameters
vector<lower=0, upper=1>[n] theta; // latent variables

real<lower=0> s[j]; // cutpoint spacing

}

transformed paramters{

vector[k-1] c1 = s[1] * c;
vector[k-1] c2 = s[2] * c;
vector[k-1] c3 = s[3] * c;
vector[k-1] c4 = s[4] * c;

}

model{

theta ~ normal(0.5, 0.25); // prior on latent variable
alpha ~ normal(0, 10); // prior on difficulty
beta ~ beta(2, 2); // prior on discrimination - still tinkering with this
s ~ lognormal(0, 0.5);

for(i in 1:n){
  y1[i] ~ ordered_logistic(alpha[1] + beta[1] * theta[i], c1);
  y2[i] ~ ordered_logistic(alpha[2] + beta[2] * theta[i], c2);
  y3[i] ~ ordered_logistic(alpha[3] + beta[3] * theta[i], c3);
  y4[i] ~ ordered_logistic(alpha[4] + beta[4] * theta[i], c4);
}

}

Looks good to me, although the definition of c is missing and "transformed parameters" is misspelled. A more serious problem is the prior on beta. With s near 1 and theta constrained between 0 and 1, beta approximates the dynamic range of y. Surely this is greater than 1, otherwise all individuals would have the same y regardless of their theta. A prior like beta ~ normal(0, k) is more appropriate.
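
Concretely, that would just replace the beta prior line in the model block; since beta is declared with <lower=0>, such a prior acts as a half-normal (a sketch):

beta ~ normal(0, k); // half-normal on the positive discriminations, scale set by the number of categories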

How would I define c in the parameters block? As a vector?

And I want to ensure a positive relationship between beta and theta - at first I had used a Gamma prior. But two of the four items (which I will rediscretize) were disproportionately skewing the estimation of theta, so I thought I would try and constrain beta with a Beta prior. But what you’re saying makes sense because y is 1, 2, 3, or 4. So perhaps beta ~ Gamma(3, 2)?

c is not a parameter, it’s the vector [1,2,3,...]. It can be defined in the transformed data block.

Gamma prior for beta sounds reasonable.

Do I need to pass the model an argument for c then? Or in transformed data I can just write vector[k-1] c?

transformed data {
  vector[k-1] c;
  for (i in 1:k-1)
    c[i] = i;
}

Ordinal regression is based on having flexible cut points – trying to fix the cut points to a uniform grid defines a very different model with very different results. As I write in my case study, the latent logistic density maps to probabilities by integrating between the cut points, and uniform distances between cut points lead to a very non-uniform distribution of probabilities. Because there is no way for the model to configure itself to change this mapping, it will enforce very strong, and counterintuitive, constraints on the category probabilities regardless of the IRT structure; with unit spacing, for example, no value of the latent effect can ever put more than about 25% of the probability on an interior category.

In order to achieve reasonable results you need flexibility in the mapping from latent effects to probabilities, which you can achieve either by having a very flexible latent density (which is hard to implement) or by letting the cut points be parameters. Parametric cut points are the approach typically taken.

The nonidentifiability between alpha and the cut points isn’t that different from the nonidentifiabilities inherent to logistic IRT models. The item difficulties and discriminations, alpha + beta * theta in your model, define only relative performance, but the binary outcomes need to be defined relative to some absolute performance. In order to convert the relative performances to absolute performances you need to anchor the relative performance somehow, say by defining one task/subject combination as a default or by imposing priors that locate the difficulties.

Another way of thinking about the logistic IRT that helps the transition to the ordinal model is to separate the discrimination term, beta * theta, from the difficulties. beta * theta becomes the latent affinity and alpha becomes a single cutpoint separating successes from failures. The latent affinity and cutpoint are nonidentified and so you have to anchor at least one of them to ensure a reasonable model.

To generalize to ordinal responses we consider multiple cut points instead of just the one. Leaving an alpha in the latent affinity, however, is like having one too many cut points. If you increase alpha by some amount and shift all of the cut points up by that same amount then you get exactly the same likelihood function, and the range of reasonable values will be limited only by the priors.

Most people remove alpha entirely and hope that the data are enough to inform the cut points. I show in my case study that this can be optimistic, so it’s best to remove alpha and impose a principled prior on the cut points. Because one’s domain expertise is usually on the category probabilities and not the cut points themselves, the induced Dirichlet prior I demonstrate is a useful way to go. Assigning normal priors to the cut points directly leads to some weird joint prior behavior once you consider the ordering constraint. With careful prior checking it can lead to something reasonable, but it’s tricky, and in the ordinal regression case it’s easier to stick with an induced Dirichlet prior.

To summarize, in the ordinal case the cut points generalize the item difficulty, so having both cut points and an item difficulty overparameterizes the model, which leads to degenerate posteriors. You can include both an item difficulty and cut points if you impose lots of priors, but a more principled way forward is to use your information about the difficulty to inform domain expertise on the category probabilities (for example, challenging tasks might require concentrating probability on lower categories instead of having uniform probabilities), and then convert that domain expertise on category probabilities into domain expertise on cut points using the induced Dirichlet prior construction.


I see, so you recommend ditching alpha again and replacing the priors on cutpoints with something more informative, i.e. the (induced) Dirichlet distribution. I’m not very familiar with this distribution and I see it in your case study as well as in section 23.1 in the Stan reference manual.

Is the induced_dirichlet() command that you mention in your case study programmed into Stan? Or does it require taking the function that you define to replicate the Jacobian and refurbishing it to fit this model?

Given the original model you wrote, yes. If you have multiple groups and you want to model heterogeneity with something like alpha + delta_alpha, then you would still want to drop the global alpha and put reasonable priors on the varying intercepts delta_alpha.
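
For concreteness, that structure could look something like the sketch below (hypothetical: the group count J and group index g are not part of the model above, and only the first item is shown):

data {
  int<lower=1> J; // hypothetical number of groups
  int<lower=1, upper=J> g[n]; // hypothetical group membership of each individual
  ...
}
parameters {
  vector[J] delta_alpha; // varying intercepts, no global alpha
  ...
}
model {
  delta_alpha ~ normal(0, 1); // reasonably tight prior anchors the location
  for (i in 1:n)
    y1[i] ~ ordered_logistic(delta_alpha[g[i]] + beta[1] * theta[i], c1);
  ...
}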

It is not currently in Stan – you’d have to recreate it or use the functions in my case study.
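
Roughly, the user-defined prior in the case study looks like the sketch below (reconstructed here, so check it against the case study before relying on it; note that the argument alpha is the Dirichlet concentration vector, not the item intercept discussed above):

functions {
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);

    // ordinal probabilities induced by the cutpoints, anchored at phi
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];

    // Jacobian correction for the change of variables from cutpoints to probabilities
    for (k in 1:K) J[k, 1] = 1;
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = -rho;
      J[k - 1, k] = rho;
    }

    return dirichlet_lpdf(p | alpha) + log_determinant(J);
  }
}

With the cutpoints declared as ordered[k-1] parameters (one set per item, rather than fixed in transformed data), the prior would then be applied as, for example, c1 ~ induced_dirichlet(rep_vector(1, k), 0);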