# Logistic regression with imperfect measurement of the outcome

Hi,

i want to use Stan to fit a logistic regression which accounts for imperfect measurement of the outcome. The paper describing the Bayesian approach (McInturff 2004) uses discrete latent parameters in bugs. In order to perform marginalization i turned to an earlier paper describing an EM algorithm to estimate this kind of regression (Magder & Hughes, 1997). This paper gives the conditional probabilities, but my model seems to be not correct as i can’t get reasonable regression estimates even when i impose high sensitivity and specificity.

I think my error could be where I sum up the target density, but frankly i am lost. Anyone who spots my misdoing here?

best wishes, felix

``````
data{
int n;
int outcome[n]; //outcome measured with error
int gender[n]; //a predictor
}

parameters {
real<lower=0,upper=1> sensitivity;
real<lower=0,upper=1> specificity;

real slope;
real intercept;
}

transformed parameters {
real linear_predictor[n];
real pie[n];
real likelihood[n];

for (i in 1:n){
linear_predictor[i] = slope*gender[i]+intercept;
pie[i] = exp(linear_predictor[i])/(1+exp(linear_predictor[i])); //probability

//marginalization --> different probabilities for true disease state given observed outcome and linear predictor
if (outcome[i] == 0){
likelihood[i] = log(pie[i]*sensitivity/(pie[i]*sensitivity+pie[i]*(1-specificity)));
} else if (outcome[i] == 1){
likelihood[i] = log(pie[i]*(1-sensitivity)/(pie[i]*(1-sensitivity)+pie[i]*specificity));
}
}
}

model {
sensitivity ~ beta(200,2);
specificity ~ beta(200,2);

intercept ~ normal(0,1);
slope ~ normal(0,1);

target += log_sum_exp(likelihood);
}
``````
1 Like

Here’s the model description from McInturff 2004:

So, I think you have the two likelihoods flipped. Plus, possibly the algebra works out the same, but copying directly from McInturff I get:

``````        if (outcome[i] == 1){
likelihood[i] = log( pie[i]*sensitivity + (1-pie[i]) * (1-specificity) );
} else if (outcome[i] == 0){
likelihood[i] = log( pie[i]*(1-sensitivity) + (1-pie[i])*specificity );
}
``````

Thanks for looking into this. Pasting your code unfortunately still does not give reasonable parameters.

What I am calculating there (according to Magder / Hughes) is the probability of true disease state given the observed test result and the predictors (i should have not named that variable likelihood). I think this information is needed to weight the likelihood contributions.

I ultimately want to model is the relation between the predictors and the true disease state and i feel that part is still missing? Should I calculate the likelihood now for both states (true diseases yes/no) and then sum it up weighted by the probabilities?

I can’t readily follow the math in your likelihood definition. But I do have some remarks that may help.

The expected probability p of observing outcome ==1 is: p = sens x p_true + (1-spec) x (1-p_true). From there it should be straightforward to work out the log-likelihood. Maybe your definition is right, but I don’t understand how you end up with a multiplication of the specificity and true prevalence in your likelihood…

General tip: define the inverse logit transformation as 1 / (1+exp(-(linear predictor))). Thats fewer terms. Not sure if it makes a difference for the autodiff cost though.

1 Like

Also, if you work out the likelihood based on the expected probability that I defined above, you have already marginalised out all the possible latent states and dot’t need to do the log_sum_exp thing. I don’t think you needed that in the firste place…

Wouldn’t be better to use `inv_logit()`?

Good catch; I didn’t look closely enough at the model section. Since `likelihood` is a vector filled with log-likeilhoods (might consider renaming it to `log_lik` for clarity), you don’t need to exponentiate then sum then log to get the sum log likelihood, just sum: `target += sum(likelihood)`

Thanks so much, i think i got it now!

``````   data{
int n;
int outcome[n];
int gender[n];
}

parameters {
real<lower=0,upper=1> sensitivity;
real<lower=0,upper=1> specificity;

real slope;
real intercept;
}

transformed parameters {
real linear_predictor[n];
real pie[n];
real loglik[n];

for (i in 1:n){
linear_predictor[i] = slope*gender[i]+intercept;
pie[i] = 1/(1+exp(-(linear_predictor[i])));

if (outcome[i] == 1){
loglik[i] = log( pie[i]*sensitivity + (1-pie[i]) * (1-specificity) );
} else if (outcome[i] == 0){
loglik[i] = log( pie[i]*(1-sensitivity) + (1-pie[i])*specificity );
}
}
}

model {
sensitivity ~ beta(200,2);
specificity ~ beta(200,2);

intercept ~ normal(0,1);
slope ~ normal(0,1);

target += sum(loglik);
}
``````
1 Like

Be sure to double check by simulating some fake data and ensuring the simulation parameters are recovered reasonably.

``````pie[i] = inv_logit(linear_predictor[i]);