Ordered logit: three challenges

Hello,

I am developing a model that estimates the level of public concern about climate change in different countries, based on Likert-scale (i.e., ordinal) survey responses. Each survey question has its own number of answer options. So, the data might look like this:

country   question   option   resp. who picked option   total no. of respondents
Belgium   A          1        345                       1000
Belgium   A          2        278                       1000
etc.

Model components: each country has a latent concern score X. For each survey question, an ordered logit model connects the latent X to the answer option picked by respondents from that country. So, we need to estimate a set of thresholds for each question.
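
(In other words, for respondents from country c answering question q, the intended model is the standard ordered logit: P(response <= k) = inv_logit(cut[q][k] - X[c]), where cut[q] is question q's vector of thresholds.)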

I am facing three challenges in writing up this model in Stan.

1. priors for the thresholds

Every question will have its own number of answer options and therefore thresholds. It makes sense to me that the thresholds within a question would not be independent. Previous iterations (where I binarized all the survey responses) show that weakly informative priors are needed for convergence. So, here’s how I would go about setting the (p-1) thresholds needed for a question with p answer options:

step 1. Set the lower and upper thresholds to 0 and 1. Then break the space between 0 and 1 into (p-2) parts. This can be done by drawing from a flat Dirichlet distribution with p-2 categories.

step 2. Draw a location parameter from a weakly informative prior and add it to all of the thresholds.

step 3. Draw a scale parameter from a weakly informative prior and multiply all of the thresholds by it.
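
Concretely, here is a sketch of what I mean for a single question (treating the number of options p as data; the normal priors are just placeholders for whatever weakly informative priors turn out to work):

data {
  int<lower=3> p; //number of answer options for this question
}
parameters {
  simplex[p - 2] steps; //gaps between the p - 1 unscaled thresholds
  real location;        //shifts all thresholds
  real<lower=0> scale;  //stretches all thresholds
}
transformed parameters {
  ordered[p - 1] thres;
  thres[1] = 0; //step 1: anchor the lowest threshold at 0...
  for (k in 2:(p - 1))
    thres[k] = thres[k - 1] + steps[k - 1]; //...and cumulate; the last lands on 1
  thres = (thres + location) * scale; //step 2: add location; step 3: multiply by scale
}
model {
  location ~ normal(0, 2); //placeholder weakly informative prior
  scale ~ normal(0, 2);    //half-normal, via the <lower=0> bound
  //the flat Dirichlet prior on steps is implicit (uniform over the simplex)
}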

Does this make sense?

2. data structure

I have really struggled with finding a good way to store the parameters generated in step 1. The best thing I have been able to come up with is an intermediate step where we have one long vector containing the “step sizes” (i.e. the size of the space between thresholds) for all the questions:

data {
  int<lower=0> J;              //number of items
  int<lower=0> no_of_steps[J]; //number of steps for each question (p-2)

  int<lower=0> H;              //total number of steps (summed over all questions)
  int<lower=1> h_questions[H]; //index: question corresponding to each step
  int<lower=1> h_steps[H];     //index: which step within its question
}

parameters {
  vector<lower=0, upper=1>[H] thres_steps; //each question's segment should form a simplex (see below)
}

So, we want to obtain a thres_steps vector which, if we put it next to the indexing vectors h_questions and h_steps, looks like this:

thres_steps   h_questions   h_steps
.2            A             1
.4            A             2
.1            A             3
.3            A             4
.3            B             1
etc.

This means that for question A, before adding scale and location, the first threshold is 0, the second threshold is .2, the third is .6 (= .2 + .4), the fourth is .7 (= .2 + .4 + .1), and the fifth is 1 (the step sizes sum to 1 by construction, so the last step, .3, lands exactly on 1). We could then declare the prior using segmentation:

model {

  int pos;
  pos = 1;
  for(j in 1:J){ //for each question
    segment(thres_steps, pos, no_of_steps[j]) ~ dirichlet(rep_vector(1, no_of_steps[j]));
    pos = pos + no_of_steps[j];
  }

}

Then, in the transformed parameters block, we can turn this into a thresholds vector where the step sizes are all added up, with its own indexing vectors.
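
(One wrinkle I should flag: for the dirichlet statement to be valid, each question's segment of thres_steps has to actually be a simplex, i.e. positive and summing to one, and the declaration above doesn't enforce that on its own. A possible workaround I've sketched below, assuming a flat Dirichlet is really what I want: declare positive raw steps and normalize each question's segment, since normalized gamma(1, 1) draws are flat-Dirichlet distributed.)

parameters {
  vector<lower=0>[H] steps_raw; //unnormalized step sizes
}
transformed parameters {
  vector[H] thres_steps; //per-question segments, each normalized to sum to 1
  {
    int pos;
    pos = 1;
    for (j in 1:J) {
      vector[no_of_steps[j]] s;
      s = segment(steps_raw, pos, no_of_steps[j]);
      thres_steps[pos:(pos + no_of_steps[j] - 1)] = s / sum(s);
      pos = pos + no_of_steps[j];
    }
  }
}
model {
  steps_raw ~ gamma(1, 1); //normalizing gamma(1, 1) draws yields a flat Dirichlet
}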

This is rather messy. In fact, I find it complicated enough that I wonder whether my choice of priors is overcomplicating things, and whether adding an ordinal part to the model is really worth it (as opposed to binarizing all survey responses).

3. Ordered logit distribution with multiple trials

If there is a term for this, I don’t know it. My data’s distribution is to the ordered logistic what the binomial is to the Bernoulli. So I don’t think I can use the canned ordered_logistic distribution. I could present the data respondent by respondent, making the outcome ordered logistic, but that is not ideal because (1) the survey data is weighted and (2) the dataset would be prohibitively large. What would be the most efficient way to declare my likelihood function?

Thank you,

Clara

Sorry for the long delay. I’m only just getting back to the forums to try to mop up some unanswered questions. These long multi-part ones are the ones least likely to get answered because they’re so involved.

You can check out our advice on priors, which includes advice for ordinal logit; it directly addresses this point.

Whether the ordinal part is worth it is going to depend on the data, but it’s usually better to fit with all the information you have if the data are sufficient to fit the model reasonably well (which will vary by type of model and noisiness of the data).

The messiness is because we don’t have ragged arrays in Stan yet.

Yes, the long vector with index arrays is the best thing to do for now. It’s what we recommend in the manual chapter on ragged data structures.

Instead of segment(a, i, n), you might find a[i : i + n - 1] more readable.

If you have N copies of the same observation y, then you can do this:

for (n in 1:N) y ~ foo(theta);

or you can more efficiently do this

target += N * foo_lpdf(y | theta);

(or foo_lpmf if y is discrete, as I think it is here). This is also addressed in the manual under efficiency and sufficient statistics.
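
For your aggregated counts, here’s a sketch of how that plays out (the function and its name are mine for illustration, not built in): compute the ordered-logit category probabilities on the log scale and weight each one by the count of respondents who picked that option. Because the counts only enter through a dot product, non-integer survey weights work too.

functions {
  //Sketch: log likelihood of aggregated ordinal counts under an ordered-logit
  //model with linear predictor eta and cutpoints cut. counts[k] holds the
  //(possibly weighted) number of respondents picking option k.
  real ordered_logistic_agg_lpdf(vector counts, real eta, vector cut) {
    int p = num_elements(cut) + 1; //number of answer options
    vector[p] log_prob;
    log_prob[1] = log1m_inv_logit(eta - cut[1]);
    for (k in 2:(p - 1))
      log_prob[k] = log_diff_exp(log_inv_logit(eta - cut[k - 1]),
                                 log_inv_logit(eta - cut[k]));
    log_prob[p] = log_inv_logit(eta - cut[p - 1]);
    return dot_product(counts, log_prob);
  }
}

Each country-question pair would then contribute something like counts ~ ordered_logistic_agg(X[c], cut_segment) in the model block.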

Thanks for your response, Bob, and apologies for the delay on my side. I will work on implementing your advice and will let you know how it goes.

Bob, I finally got around to implementing your advice. I got stuck, however, at implementing the transformation of the step size vector into the threshold vector.

This is what I would like to do in the transformed parameters block:

  real thres[H + J]; //question thresholds (each question has one more threshold than steps)
  int<lower=1> thres_start[J]; //index of starting point for question's thresholds

  //index counters
  int<lower=1> pos_steps;
  int<lower=1> pos_thres;

  //reconstruct thresholds from step sizes
  pos_steps = 1;
  pos_thres = 1;
  for(j in 1:J){ //for each question

    thres[pos_thres] = 0; //set first threshold to 0
    for(i in 0:(no_of_thres[j] - 2)){ //set each next threshold...
      //...to previous thresh. plus step (only no_of_thres[j] - 1 steps exist)
      thres[pos_thres + i + 1] = thres[pos_thres + i] + thres_steps[pos_steps + i];
    }

    thres_start[j] = pos_thres; //store starting point of thresholds for question
    pos_thres = pos_thres + no_of_thres[j]; //move index of threshold vector
    pos_steps = pos_steps + no_of_thres[j] - 1; //move index of steps vector

  }

The problem I run into is that declaring integers in the transformed parameters block does not work. What’s the right way to approach this?

(Hope this question will be useful in general for others who want to use indexing in the transformed parameters block)

If you need integer variables, you can declare them as local variables in the transformed parameters or model blocks. Just put them inside braces in the transformed parameters block and they can be used anywhere within that braced scope:

{
  int N = ...; // local to braced scope

  ...
}

Thanks a bunch, that solved the problem for the indexing variables.

How should I deal with the thres_start variable in the block above? It’s supposed to be an array of indices that keeps track, for every question, of where its series of threshold parameters starts in the array thres. I am using that variable later on in the transformed parameters block, like so:

thres[ thres_start[ j ] ]

where j is a question index. In the transformed parameters block, I also want to declare a similar array of indices, psi_start, to be used in the model block. Both must be of type integer.

Because I know beforehand how many questions and thresholds there will be, I suppose I could enter these arrays as data. Is that the best solution?

Probably. Or you could compute them in the transformed data block. Either way, they get computed once and are reused for every log density and gradient evaluation.
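
For example, here’s a sketch of computing thres_start in transformed data (assuming no_of_thres is declared as data):

transformed data {
  int thres_start[J]; //first index of each question's thresholds within thres
  {
    int pos;
    pos = 1;
    for (j in 1:J) {
      thres_start[j] = pos;
      pos = pos + no_of_thres[j];
    }
  }
}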

If you want to declare integers in the transformed parameters block, they have to be local variables; Stan doesn’t let you have integer parameters or transformed parameters (generated quantities are OK).

For posterity: a follow-up thread on this same model can be found here