Logistic regression with imperfect measurement of the outcome

Hi,

i want to use Stan to fit a logistic regression which accounts for imperfect measurement of the outcome. The paper describing the Bayesian approach (McInturff 2004) uses discrete latent parameters in bugs. In order to perform marginalization i turned to an earlier paper describing an EM algorithm to estimate this kind of regression (Magder & Hughes, 1997). This paper gives the conditional probabilities, but my model seems to be not correct as i can’t get reasonable regression estimates even when i impose high sensitivity and specificity.

I think my error could be where I sum up the target density, but frankly i am lost. Anyone who spots my misdoing here?

best wishes, felix


    data{
      int n;
      int outcome[n]; //outcome measured with error
      int gender[n]; //a predictor
    }
      
    parameters {
      real<lower=0,upper=1> sensitivity;
      real<lower=0,upper=1> specificity;
    
      real slope;
      real intercept;
    }
    
    transformed parameters {
      real linear_predictor[n];
      real pie[n];
      real likelihood[n];
      
      for (i in 1:n){
        linear_predictor[i] = slope*gender[i]+intercept;
         pie[i] = exp(linear_predictor[i])/(1+exp(linear_predictor[i])); //probability 
      
        //marginalization --> different probabilities for true disease state given observed outcome and linear predictor
        if (outcome[i] == 0){          
          likelihood[i] = log(pie[i]*sensitivity/(pie[i]*sensitivity+pie[i]*(1-specificity)));
        } else if (outcome[i] == 1){
          likelihood[i] = log(pie[i]*(1-sensitivity)/(pie[i]*(1-sensitivity)+pie[i]*specificity));
        }
      }
    }         

model {
      sensitivity ~ beta(200,2); 
      specificity ~ beta(200,2);
      
      intercept ~ normal(0,1);
      slope ~ normal(0,1);
      
      target += log_sum_exp(likelihood);
    }

Here’s the model description from McInturff 2004:

So, I think you have the two likelihoods flipped. Plus, possibly the algebra works out the same, but copying directly from McInturff I get:

        if (outcome[i] == 1){          
          likelihood[i] = log( pie[i]*sensitivity + (1-pie[i]) * (1-specificity) );
        } else if (outcome[i] == 0){
          likelihood[i] = log( pie[i]*(1-sensitivity) + (1-pie[i])*specificity );
        }

Thanks for looking into this. Pasting your code unfortunately still does not give reasonable parameters.

What I am calculating there (according to Magder / Hughes) is the probability of true disease state given the observed test result and the predictors (i should have not named that variable likelihood). I think this information is needed to weight the likelihood contributions.

I ultimately want to model is the relation between the predictors and the true disease state and i feel that part is still missing? Should I calculate the likelihood now for both states (true diseases yes/no) and then sum it up weighted by the probabilities?

I can’t readily follow the math in your likelihood definition. But I do have some remarks that may help.

The expected probability p of observing outcome ==1 is: p = sens x p_true + (1-spec) x (1-p_true). From there it should be straightforward to work out the log-likelihood. Maybe your definition is right, but I don’t understand how you end up with a multiplication of the specificity and true prevalence in your likelihood…

General tip: define the inverse logit transformation as 1 / (1+exp(-(linear predictor))). Thats fewer terms. Not sure if it makes a difference for the autodiff cost though.

1 Like

Also, if you work out the likelihood based on the expected probability that I defined above, you have already marginalised out all the possible latent states and dot’t need to do the log_sum_exp thing. I don’t think you needed that in the firste place…

Wouldn’t be better to use inv_logit()?

Good catch; I didn’t look closely enough at the model section. Since likelihood is a vector filled with log-likeilhoods (might consider renaming it to log_lik for clarity), you don’t need to exponentiate then sum then log to get the sum log likelihood, just sum: target += sum(likelihood)

Thanks so much, i think i got it now!

   data{
     int n;
     int outcome[n];
     int gender[n];
   }
     
   parameters {
     real<lower=0,upper=1> sensitivity;
     real<lower=0,upper=1> specificity;
   
     real slope;
     real intercept;
   }
   
   transformed parameters {
     real linear_predictor[n];
     real pie[n];
     real loglik[n];

     
     for (i in 1:n){
       linear_predictor[i] = slope*gender[i]+intercept;
       pie[i] = 1/(1+exp(-(linear_predictor[i])));
     
       if (outcome[i] == 1){          
         loglik[i] = log( pie[i]*sensitivity + (1-pie[i]) * (1-specificity) );
       } else if (outcome[i] == 0){
         loglik[i] = log( pie[i]*(1-sensitivity) + (1-pie[i])*specificity );
       }
}
}

    
   
   model {
     sensitivity ~ beta(200,2);
     specificity ~ beta(200,2);
     
     intercept ~ normal(0,1);
     slope ~ normal(0,1);
     
    target += sum(loglik);
   }

Be sure to double check by simulating some fake data and ensuring the simulation parameters are recovered reasonably.

Already on it :D

1 Like

Sure yes, if that’s an existing function in Stan, that’s probably even more stable and efficient

Yes, one could write

pie[i] = inv_logit(linear_predictor[i]);

I don’t think it’s vectorized, so the loop over elements will have to stay.

1 Like