I am trying to run an ordered logistic regression model, using quite special data.
In my data, I am measuring one outcome per respondent, which can be coded as 1, 2 and 3 (i.e., the survey responses “no”, “I am not sure” and “yes”).
Then, I have two different input variables, with which the respondents were treated. However, quite a big subset of my participants did not receive only one treatment with the two input variables but several treatments at the same time (i.e., 974 respondents had one unique treatment, 637 respondents had two treatments, 268 respondents had three treatments,…).
As my aim is, to estimate the effect of my two input variables on the probability of giving a particular response, I thought about the following model specification:
Actually, each of the treatments we observe (i.e., a combination of the two input variables) would lead to a specific response, i.e., either yes, I am not sure or no. Or to put it differently, each of the treatments is associated with a particular probability of being in one of the three categories.
However, we only observe one outcome per panelist and not one outcome per treatment (which we would be actually interested in).
So, if we have a respondent who received two treatments but gave only one response, the response we eventually measure is actually already a combination of the probabilities of being in one of the categories per treatment.
If the respondent received two treatments, I would assume that the probability of being in the yes category is increased, wheras the probability of being in the no category is decreased (this is actually an assumption that can, of course, be discussed - however, I should be able to support this assumption with theory).
So, I am estimating a probability (theta) of being in one of the categories for each treatment. In a next step, I combine those distinct probabilities to an overall probability (THETA) per respondent, which is a combination of the distinct treatment probabilities per respondent.
For example, a respondent has probabilities theta of being in the “no”, “i am not sure” and “yes” category of (0.6, 0.2, 0.2) for treatment 1 and (0.1, 0.5, 0.4) for treatment 2.
The combined probabilities THETA would then be calculated as follows:
no-category: 0.6 * 0.1 = 0.06
yes-category: 1 - ((1 - 0.2) * (1 - 0.4)) = 0.52
not sure-category: 1- 0.06 - 0.52 = 0.42
This is the code I am using at the moment:
data {
int<lower=2> K;
int<lower=0> N;
int<lower=1> L;
int<lower=1,upper=K> CrR[L];
vector[N] var1;
vector[N] var2;
int<lower=1,upper=L> ll[N];
int<lower=1,upper=N> indices1[L];
int<lower=1,upper=N> indices2[L];
}
parameters {
real b1;
real b2;
ordered[K-1] c;
}
model {
vector[N] theta[K];
vector[K] THETA;
for (n in 1:N) {
theta[1,n] = 1 - (inv_logit(var1[n]*b1 +var2[n]*b2 - c[1]));
for(k in 2:K-1)
theta[k,n] = (inv_logit(var1[n]*b1 + var2[n]*b2 - c[k-1])) - (inv_logit(var1[n]*b1 + var2[n]*b2 - c[k]));
theta[K,n] = (inv_logit(var1[n]*b1 + var2[n]*b2 - c[K-1]));
}
for (l in 1:L) {
THETA[1] = theta[1,indices1[l]];
THETA[3] = theta[3,indices1[l]];
for (m in 1:(indices2[l]-indices1[l])) {
if (m > 0) {
THETA[1] = THETA[1] * theta[1,(indices1[l]+m)];
THETA[3] = 1 - ((1-THETA[3]) * (1-theta[3,(indices1[l]+m)]));
}
}
THETA[2] = 1 - THETA[1] - THETA[3];
CrR[l] ~ categorical(THETA);
}
}
Please note that I only implemented the model code for three categories at the moment. I would still need to generalize the code to a flexible number of categories.
K… number of categories
N… overall number of observations (one observation for each treatment)
L… overall number of respondents
CrR… measured responses
var1… input variable 1
var2… input variable 2
ll… respondent ID
indices1… indices variable for each respondent, stating which treatment is the first treatment belonging to respondent ll
indices2… indices variable for each respondent, stating which treatment is the last treatment belonging to respondent ll
So, I first tried running the model with those respondents only, who received only one treatment, such that I can directly compare it to the normal ordered logistic model.
And actually, I get the same estimates as in the normal ordered logistic model.
So, what is your overall oppinion about my modelling idea? Is my way of calculating the overall probabilities per respondent proabably not a valid approach, i.e., influencing final results to much?
I am happy about any comments and feedback you can give me!
Thank you in advance,
Christina