Hierarchical ordinal logistic regression

I have data consisting of a number of ordinal questions on a survey, all of which measure the same basic concept. I would like to build a hierarchical, ordinal logistic regression model that borrows strength across the different survey questions. I noticed that Stan has an ordered_logistic distribution, which might make this a lot easier than writing out the likelihood by hand. However, there are a couple of complications:

  1. Not all survey questions have the same responses, and sometimes not even the same number of responses. My plan is to allow the “intercepts” (cutpoints) to be completely question-specific, with a different number of cutpoints for each question. However, I’m not sure how to specify the data structure for these parameters in Stan. I’d like to have some sort of “ragged” list of ordered vectors; is anything like this possible? If not, is there a work-around you could suggest?

  2. My data is in a collapsed format, with one row for every response option for each question, and a column that is the “count” of responses. Is there a way to get ordered_logistic to accept data of this format, or would I have to expand the dataset to have one row for each respondent, for each question?

Thanks,
Jon

Stan doesn’t support ragged anything, so (1) is difficult. One possibility would be to declare

parameters {
  ordered[HUGE_NUM] cutpoints;
  ...
}

where HUGE_NUM is the total number of cutpoints across all the questions. Then include a question-specific intercept in the linear predictor and break off the number of cutpoints each question needs with segment() or : indexing.

For (2), you can do something like

model {
  int pos = 1;
  for (j in 1:J) {
    int ncut = categories[j] - 1;                    // K response options -> K - 1 cutpoints
    vector[ncut] c = segment(cutpoints, pos, ncut);  // equivalently cutpoints[pos:(pos + ncut - 1)]
    real eta = ???;                                  // linear predictor, including the question-specific intercept
    target += count[j] * ordered_logistic_lpmf(y[j] | eta, c);
    pos += ncut;
  }
}
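
For what it’s worth, one way the collapsed data itself might be declared is sketched below (all names here are placeholders, and the sketch above would still need to map each collapsed row to its question’s block of cutpoints):

data {
  int<lower=1> J;                            // number of survey questions
  array[J] int<lower=2> categories;          // number of response options per question
  int<lower=1> N;                            // rows in the collapsed data
  array[N] int<lower=1, upper=J> question;   // which question each row belongs to
  array[N] int<lower=1> y;                   // response option observed in that row
  array[N] int<lower=0> count;               // number of respondents giving that response
}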

Thanks for the response. But for (1), you suggest putting all the cutpoints together in a single ordered vector. Wouldn’t that restrict the cutpoints for the first question to all be less than those of the second question, those of the second to be less than those of the third, and so on?

Yeah, which is why you need to include a free intercept in eta to offset that.
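
For example, extending the earlier parameters sketch (alpha is just a placeholder name for the question-specific intercepts):

parameters {
  ordered[HUGE_NUM] cutpoints;
  vector[J] alpha;   // free, question-specific intercepts
  ...
}

The linear predictor for question j would then include alpha[j], so each question’s segment of cutpoints only has to be correct up to a shift.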

Oh I see. I suppose that could work. Are there any problems with identifiability here though? And what kind of prior would I put on the cutpoints?

As an alternative, would it be okay to store the cutpoints in a rectangular data structure, but only use the first categories[j] elements of row j? I don’t think I have any questions on the survey with more than 5 or 6 response options. Would it be a bad idea to have a bunch of parameters floating around that don’t enter at all into the likelihood?

Yes, unless they have a proper prior attached to them. Otherwise you get an improper posterior and lots of divergent transitions.

The reason why my suggestion “works” to get around the ragged array problem is that in a model with $$K$$ response categories, having $$K - 1$$ cutpoints plus a free intercept creates a ridge in the likelihood: you can shift all of the cutpoints by a constant and absorb that shift into the intercept while still getting the same probability of observing the data. But that may create sampling problems for Stan, since only the prior prevents the posterior from having a similar ridge. @Bob_Carpenter has argued for putting priors on the differences between adjacent cutpoints, which seems like it would work well in this case. Also, I think you need to declare your intercepts as ordered too.
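
As a rough sketch of what a prior on the gaps could look like, assuming a single ordered[n_cut] vector c declared in the parameters block (the names and the normal(0, 2) / exponential(1) choices here are just placeholders):

model {
  // the ordered declaration already forces positive gaps; this just makes the prior proper.
  // the map from c to (c[1], gaps) is linear with unit determinant, so no Jacobian term is needed.
  c[1] ~ normal(0, 2);
  tail(c, n_cut - 1) - head(c, n_cut - 1) ~ exponential(1);
}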

Well I would certainly put a prior on those “floating” parameters. It would be pretty easy to loop through and place N(0,1) on all of them.

for (k in 1:K) {
  // unused trailing elements of question k's row (max_cat = number of columns)
  cutpoints[k][(categories[k] + 1):max_cat] ~ normal(0, 1);
}

or something like that (sorry if my Stan syntax is slightly off, I always have to play around with it to make sure I’m indexing arrays correctly).

Regardless of which method I use, it’s still not clear to me what kind of prior I should put on an ordered parameter vector. I would have this question even in a non-hierarchical setting. The only examples I have seen don’t specify any type of prior for the ordered parameters. What default is being applied?

I don’t think ignoring the prior would work with your suggestion, due to the identifiability concerns I raised earlier. Suppose the cutpoints should be around (-1, -.5, .5, 1) for the first question, and (-1.5, 0, 0.5) for the second one (I assume they’re on the log-odds scale?). So the model estimates the first 4 elements of cutpoints to be something around (-1, -.5, .5, 1), but then how does it know whether the next 3 should be (1.5, 3, 3.5) with an intercept of -3, or (11.5, 13, 13.5) with an intercept of -13? I would think this would lead to instability in the model without a prior that constrains the parameters appropriately. And that prior seems like it would be hard to specify when all the cutpoints are in one vector.

Unless I’m missing something about your suggestion?

@bgoodri - I would just like to close the loop on this.

I now realize that there is no “default” prior for ordered variables. Or I should say that the default is the same as it would be for any other numeric variable, apart from the constraints placed on it so that it is ordered (please confirm my understanding).

However, I think my identifiability concerns are still valid. Are they valid, or am I off base here?

They might be; I haven’t actually tried it. Here is a better idea: declare the huge unordered vector in the parameters block, but no intercepts. Then, in the model block, create the cutpoints for each question the same way Stan does internally,

$$c_1 = x_1, \qquad c_k = c_{k-1} + \exp(x_k) \;\; \text{for } k \ge 2,$$

which is referred to as the “ordered inverse transform”, even though it is more natural to think about it simply as the transformation from unconstrained to ordered. Just make sure to account for the Jacobian.

Thanks, this is helpful. However, could you clarify your last sentence about the Jacobian? My plan was to do the following:

// Code above defines the "unordered" vector x
model {
  int pos = 1;
  // k indexes the outcome
  for (k in 1:K) {
    int J_k = categories[k];
    int ncut = J_k - 1;
    vector[ncut] x_k = x[pos:(pos + ncut - 1)];
    vector[ncut] cutpoints;
    cutpoints[1] = x_k[1];
    for (j in 2:ncut) {
      cutpoints[j] = cutpoints[j-1] + exp(x_k[j]);
    }
    

    vector[J_k] eta = ???; // I'll figure this out later....

    target += count[k] * ordered_logistic_lpmf(y[k] | eta, cutpoints);
    pos += ncut; // only ncut elements of x are consumed for this question
  }
}

Questions:

  1. I don’t think this can actually be how ordered_logistic_lpmf works. As above, y and eta are vectors of length J_k, and cutpoints is a vector of length J_k - 1. Perhaps that is what ordered_logistic_lpmf expects. But for any given outcome k, we have counts for all J_k levels of that outcome. So how do we apply the weighting in this way?
  2. Per your last message, where would the Jacobian come in? My sense is it’s not necessary if using ordered_logistic_lpmf?

A related solution, instead of using the “ordered inverse transform”, was suggested by my colleague @mariel: instead of modeling the “cutoffs” as an ordered vector, I could model the differences between adjacent levels, with those differences constrained to be positive. This seems like a more natural approach to me, and it seems to be doing effectively the same thing as the transformation approach. Do you agree?

Sorry, I now realize I need another level of looping, so you can ignore question #1 above. I’m still curious about question #2, and also what you think of @mariel’s suggestion.

The Jacobian is necessary for the same reason it is applied internally when you declare an ordered vector in the parameters block. The ordered vector you create is just an intermediate from Stan’s perspective, and the posterior kernel is defined on the big unordered vector you declare. But adding those Jacobian terms is pretty easy.
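
For the transform in your code above ($$c_1 = x_1$$ and $$c_j = c_{j-1} + e^{x_j}$$), the Jacobian is lower triangular with diagonal $$(1, e^{x_2}, \ldots, e^{x_{ncut}})$$, so, reusing the names from that code, the adjustment inside the loop would amount to something like

target += sum(x_k[2:ncut]);   // log |det Jacobian| of x_k -> cutpoints; the first element contributes 0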

If you declare a big positive vector and interpret each element as the difference between adjacent cutpoints, then the map from those differences to the cutpoints is a cumulative sum, whose Jacobian is a triangular matrix of ones with determinant 1, so including its log determinant would just add zero to target.
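
A minimal sketch of that parameterization for a single question with n_cut cutpoints (names are placeholders; for the ragged case, the first cutpoints and the gaps would each live in one big vector and be sliced per question as before):

parameters {
  real c1;                            // location of the first cutpoint
  vector<lower=0>[n_cut - 1] gaps;    // positive differences between adjacent cutpoints
}
transformed parameters {
  // cumulative sum of (c1, gaps) is increasing by construction; the map is linear
  // with determinant 1, so no extra Jacobian adjustment is needed for it
  vector[n_cut] cutpoints = cumulative_sum(append_row(c1, gaps));
}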