Hello,
I am developing a model where I try to estimate the amount of public concern about climate change in different countries, based on Likert-scale (i.e., ordinal) survey responses. Each survey question has a different number of answer options. So, the data might look like this:
country | question | option | respondents who picked option | total respondents
Belgium | A | 1 | 345 | 1000
Belgium | A | 2 | 278 | 1000
etc.
Model components: each country has a latent concern score X. For each survey question, an ordered logit model connects the latent X to the answer option picked by respondents from that country. So, we need to estimate a set of thresholds for each question.
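To fix notation: for a question with $p$ options and thresholds $\tau_1 < \dots < \tau_{p-1}$, the probability that a respondent from country $c$ (with latent concern $X_c$) picks option $k$ is the usual ordered logit,

$$P(y = k \mid X_c) = \mathrm{logit}^{-1}(\tau_k - X_c) - \mathrm{logit}^{-1}(\tau_{k-1} - X_c),$$

with the conventions $\tau_0 = -\infty$ and $\tau_p = +\infty$.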
I am facing three challenges in writing up this model in Stan.
1. Priors for the thresholds
Every question has its own number of answer options and therefore its own set of thresholds. It makes sense to me that the thresholds within a question would not be independent. Previous iterations (where I binarized all the survey responses) showed that weakly informative priors are needed for convergence. So, here is how I would go about setting the (p-1) thresholds needed for a question with p answer options:
step 1. Set the lower and upper thresholds to 0 and 1. Then break the space between 0 and 1 into (p-2) parts. This can be done by drawing from a flat Dirichlet distribution with p-2 categories.
step 2. Draw a location parameter from a weakly informative prior and add it to all of the thresholds.
step 3. Draw a scale parameter from a weakly informative prior and multiply all of the thresholds by it (see the sketch below).
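For a single question, a minimal sketch of what I have in mind (the prior scales are placeholders, and I have folded steps 2 and 3 into one affine transform, which parameterizes the same family):

```stan
data {
  int<lower=3> p;                 // number of answer options for this question
}
parameters {
  simplex[p - 2] steps;           // step sizes between the base thresholds (step 1)
  real location;                  // location shift (step 2)
  real<lower=0> scale;            // scale factor (step 3)
}
transformed parameters {
  ordered[p - 1] thresholds;      // thresholds passed to the ordered logit
  {
    vector[p - 1] base;           // base thresholds on [0, 1]
    base[1] = 0;
    base[2:(p - 1)] = cumulative_sum(steps);  // ends at 1 by the simplex constraint
    thresholds = location + scale * base;
  }
}
model {
  steps ~ dirichlet(rep_vector(1, p - 2));    // flat Dirichlet (step 1)
  location ~ normal(0, 2);                    // weakly informative, placeholder
  scale ~ normal(0, 2);                       // half-normal via the <lower=0> constraint
}
```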
Does this make sense?
2. Data structure
I have really struggled to find a good way to store the parameters generated in step 1. The best I have come up with is an intermediate step where one long vector contains the "step sizes" (i.e., the gaps between thresholds) for all the questions:
```stan
data {
  int<lower=0> J;                         // number of items
  int<lower=0> no_of_steps[J];            // number of steps for each question (p-2)
  int<lower=0> H;                         // total number of steps (summed over all questions)
  int<lower=1, upper=J> h_questions[H];   // index: question corresponding to each step
  int<lower=1> h_steps[H];                // index: which step within the question
}
parameters {
  // declared as a vector so segment() can be paired with dirichlet() below;
  // note the bounds do not enforce that each question's steps sum to 1
  vector<lower=0, upper=1>[H] thres_steps;
}
```
So, we want to obtain a thres_steps vector which, if we put it next to the indexing vectors h_questions and h_steps, looks like this:
thres_steps | h_questions | h_steps
.2 | A | 1
.4 | A | 2
.1 | A | 3
.3 | A | 4
etc.
This means that for question A, before applying the location and scale, the first threshold is 0, the second is .2, the third is .6 (= .2 + .4), the fourth is .7 (= .2 + .4 + .1), and the fifth is 1 (= .2 + .4 + .1 + .3). We could then declare the prior using segmentation:
```stan
model {
  int pos = 1;
  for (j in 1:J) {   // one flat Dirichlet prior per question
    segment(thres_steps, pos, no_of_steps[j]) ~ dirichlet(rep_vector(1, no_of_steps[j]));
    pos += no_of_steps[j];
  }
}
```
Then, in the transformed parameters block, we can accumulate the step sizes into a thresholds vector, with its own indexing vectors.
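Something like the following, where `location` and `scale` are assumed to be declared in the parameters block as `vector[J] location;` and `vector<lower=0>[J] scale;` (my placeholder names), and where I track offsets directly rather than building separate indexing vectors:

```stan
transformed parameters {
  // each question has no_of_steps[j] + 1 = (p - 1) thresholds, so H + J in total
  vector[H + J] thresholds;
  {
    int pos = 1;      // read position in thres_steps
    int tpos = 1;     // write position in thresholds
    for (j in 1:J) {
      int n = no_of_steps[j];
      vector[n + 1] base;                 // base thresholds on [0, 1]
      base[1] = 0;
      base[2:(n + 1)] = cumulative_sum(segment(thres_steps, pos, n));
      thresholds[tpos:(tpos + n)] = location[j] + scale[j] * base;
      pos += n;
      tpos += n + 1;
    }
  }
}
```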
This is rather messy. In fact, I find it complicated enough that I wonder whether my choice of priors is overcomplicating things, and whether adding an ordinal component to the model is really worth it (as opposed to binarizing all the survey responses).
3. Ordered logit distribution with multiple trials
If there is a term for this, I don't know it. The distribution of my data is to the ordered logistic one as the binomial is to the Bernoulli, so I don't think I can use the canned ordered_logistic distribution. I could present the data respondent by respondent, making the outcome ordered logistic, but that is not ideal because (1) the survey data is weighted and (2) the data size would be prohibitive. What would be the most efficient way to declare my likelihood? I am picturing something like the sketch below.
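In other words, a multinomial whose cell probabilities come from the ordered logit, setting aside the weighting issue for the moment. A minimal sketch for a single country-question cell (`counts`, `thres`, and `X_c` are my placeholder names, and I am ignoring priors and identification here):

```stan
data {
  int<lower=2> p;            // number of answer options for this question
  int<lower=0> counts[p];    // respondents picking each option (one country-question cell)
}
parameters {
  real X_c;                  // latent concern score for this country
  ordered[p - 1] thres;      // thresholds for this question
}
model {
  vector[p] theta;           // ordered-logit cell probabilities
  theta[1] = inv_logit(thres[1] - X_c);
  for (k in 2:(p - 1))
    theta[k] = inv_logit(thres[k] - X_c) - inv_logit(thres[k - 1] - X_c);
  theta[p] = 1 - inv_logit(thres[p - 1] - X_c);
  counts ~ multinomial(theta);   // the "binomial is to Bernoulli" aggregation
}
```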
Thank you,
Clara