I want to use Stan to fit a logistic regression that accounts for imperfect measurement of the outcome. The paper describing the Bayesian approach (McInturff 2004) uses discrete latent parameters in BUGS, so to marginalize them out I turned to an earlier paper describing an EM algorithm for estimating this kind of regression (Magder & Hughes, 1997). That paper gives the conditional probabilities, but my model does not seem to be correct: I can't get reasonable regression estimates even when I impose high sensitivity and specificity.

I think my error could be where I sum up the target density, but frankly I am lost. Can anyone spot my mistake here?

best wishes, felix

data{
  int n;
  int outcome[n]; // outcome measured with error
  int gender[n];  // a predictor
}
parameters {
  real<lower=0,upper=1> sensitivity;
  real<lower=0,upper=1> specificity;
  real slope;
  real intercept;
}
transformed parameters {
  real linear_predictor[n];
  real pie[n];
  real likelihood[n];
  for (i in 1:n){
    linear_predictor[i] = slope*gender[i] + intercept;
    pie[i] = exp(linear_predictor[i])/(1+exp(linear_predictor[i])); // probability
    // marginalization --> different probabilities for true disease state given observed outcome and linear predictor
    if (outcome[i] == 0){
      likelihood[i] = log(pie[i]*sensitivity/(pie[i]*sensitivity+pie[i]*(1-specificity)));
    } else if (outcome[i] == 1){
      likelihood[i] = log(pie[i]*(1-sensitivity)/(pie[i]*(1-sensitivity)+pie[i]*specificity));
    }
  }
}
model {
  sensitivity ~ beta(200,2);
  specificity ~ beta(200,2);
  intercept ~ normal(0,1);
  slope ~ normal(0,1);
  target += log_sum_exp(likelihood);
}

Thanks for looking into this. Unfortunately, pasting in your code still does not give reasonable parameters.

What I am calculating there (according to Magder & Hughes) is the probability of the true disease state given the observed test result and the predictors (I should not have named that variable likelihood). I think this information is needed to weight the likelihood contributions.

What I ultimately want to model is the relation between the predictors and the true disease state, and I feel that part is still missing. Should I calculate the likelihood for both states (true disease yes/no) and then sum it up, weighted by those probabilities?

I can’t readily follow the math in your likelihood definition. But I do have some remarks that may help.

The expected probability p of observing outcome == 1 is p = sens * p_true + (1 - spec) * (1 - p_true). From there it should be straightforward to work out the log-likelihood. Maybe your definition is right, but I don't understand how you end up with the specificity multiplied by the true prevalence in your likelihood…
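For example, the per-observation terms could look like this in Stan (just a sketch, untested; p_obs and log_lik are names I am making up, and pie[i] is your inverse-logit probability of true disease):

  p_obs[i] = sensitivity * pie[i] + (1 - specificity) * (1 - pie[i]); // probability of observing outcome == 1
  log_lik[i] = bernoulli_lpmf(outcome[i] | p_obs[i]);                 // log-likelihood of the observed test result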

General tip: define the inverse logit transformation as 1 / (1 + exp(-(linear predictor))). That's fewer terms. Not sure whether it makes a difference for the autodiff cost, though.
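In your code that line would just be something like this (Stan also has a built-in inv_logit that does the same thing):

  pie[i] = 1 / (1 + exp(-linear_predictor[i]));
  // or: pie[i] = inv_logit(linear_predictor[i]);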

Also, if you work out the likelihood based on the expected probability I defined above, you have already marginalised out all the possible latent states and don't need to do the log_sum_exp thing. I don't think you needed that in the first place…
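So with p_obs defined as in the snippet above, the model block could reduce to something like this (again only a sketch of what I mean, not tested):

model {
  sensitivity ~ beta(200, 2);
  specificity ~ beta(200, 2);
  intercept ~ normal(0, 1);
  slope ~ normal(0, 1);
  for (i in 1:n)
    target += bernoulli_lpmf(outcome[i] | p_obs[i]); // or equivalently: outcome[i] ~ bernoulli(p_obs[i]);
}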

Good catch; I didn’t look closely enough at the model section. Since likelihood is a vector filled with log-likelihoods (consider renaming it to log_lik for clarity), you don’t need to exponentiate, sum, and take the log again to get the total log-likelihood; just sum: target += sum(likelihood)