Context of Problem
I’m trying to specify a logistic regression model that incorporates some amount of unknown classification error in the sample data. Diagnostic uncertainty is an obvious problem, but the model adjustments and corrections I’ve seen are designed for specific contexts. My hope is that a logistic regression approach may generalize across applications and can be extended to ordered logit models as well.
In the area of neuropsychology, we’re often looking at methods and measures that may improve the detection of cognitive disorders, but most cognitive disorders either lack a definitive diagnostic method or have one that is impractical/impossible in clinical settings. Thus, these studies have to rely on clinical diagnoses that make classification errors at some (un)known rate, with the sensitivity and specificity of testing methods being among the most widely used clinical summaries of those methods. In cases where a model/measure may classify true diagnostic classes more accurately than the clinical diagnostic standard, its class predictions are penalized for not matching the error-in-variable observed classifications.
Actual Modeling Problem
I have approached the problem as a mixture model: when the observed classification is a control (0), there is a mixture of true negatives (at a rate = specificity) and false negatives (at a rate = 1 − sensitivity), and similarly when the observation is classified as a case (1), this is a mixture of true positives (at a rate = sensitivity) and false positives (at a rate = 1 − specificity). When the sensitivity and specificity of a diagnostic method are known (in this case, the sensitivity and specificity of clinical diagnoses for particular diagnoses), this model works very well, but I want to generalize it to cases where the sensitivity and specificity of the initial observations are not known.
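Spelled out, with \(p_j = \mathrm{logit}^{-1}(\alpha + x_j \beta)\) denoting the model’s probability that observation \(j\) is a true case, the mixture above gives the observed-data likelihood contributions

$$
\Pr(\text{obs}_j = 1) = \text{sens} \cdot p_j + (1 - \text{spec})(1 - p_j),
\qquad
\Pr(\text{obs}_j = 0) = \text{spec} \cdot (1 - p_j) + (1 - \text{sens}) \, p_j,
$$

which is exactly what the Stan code accumulates on the log scale via `log_sum_exp`.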
The approaches I’ve taken so far to extend the model, however, have resulted in non-identifiability, with \hat{R} values as high as 1.75 (but no divergent transitions, max treedepth warnings, or E-BFMI issues). The model also seems to be sensitive to starting values: sometimes it will return posterior estimates that are very close to the simulated values (still with bad mixing), and other times it will return nearly equivocal sensitivity and specificity estimates (both ~0.50). Using variational estimation, the model runs very quickly (<3–4 seconds) but likewise varies in whether the results are close or not.
Reproducible Example
Here’s the function I’m using to generate the data for the logistic regression:
logDatGen <- function(n, p, sn, sp) {
  # n = sample size, p = num. of predictors, sn = sensitivity, sp = specificity
  x <- rnorm(n * p, mean = 0, sd = 1)             # generate random predictors
  x <- matrix(x, nrow = n, ncol = p)              # convert to predictors by person
  c <- rnorm(p, mean = 1, sd = 0.25)              # generate coefficients
  s <- sample.int(n = p, size = round(p / 4))     # randomly select ~1/4 of coefficients to be negative
  c[s] <- c[s] * -1                               # convert those coefficients to be negative
  odds <- rowSums(sweep(x, 2, c, FUN = "*")) + 1  # multiply predictors by coefficients, add intercept
  pr <- 1 / (1 + exp(-odds))                      # convert linear component to probability scale
  y <- rbinom(n = n, size = 1, prob = pr)         # generate true data
  obs <- integer(n)                               # initiate observed data
  for (i in 1:n) {
    obs[i] <- ifelse(y[i] == 1,
                     rbinom(n = 1, size = 1, prob = sn),
                     rbinom(n = 1, size = 1, prob = 1 - sp))
  } # introduce error in classification as a function of sens/spec
  return(data.frame("Predictors" = x,
                    "True" = y,
                    "Error" = obs))
}
Assuming that I haven’t messed up the data generation code, the result here should be a matrix of predictors (x) corresponding to both a set of true observations (y) and a set of observations with error (obs), where some positives (1) are false positives (at a rate = 1 − specificity) and others are true positives (at a rate = sensitivity).
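As a quick sanity check on that claim (my own snippet, replicating just the misclassification step rather than calling `logDatGen`), the observed labels should recover the generating sensitivity and specificity empirically:

```r
set.seed(1)
n  <- 1e5
sn <- 0.80  # generating sensitivity
sp <- 0.90  # generating specificity
y   <- rbinom(n, 1, 0.4)             # true classes
obs <- ifelse(y == 1,
              rbinom(n, 1, sn),      # true cases observed as 1 at rate = sn
              rbinom(n, 1, 1 - sp))  # true controls observed as 1 at rate = 1 - sp
mean(obs[y == 1])      # empirical sensitivity, ~0.80
1 - mean(obs[y == 0])  # empirical specificity, ~0.90
```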
Here’s the cmdstanr code for the model for when sensitivity and specificity are known:
data {
  int<lower=1> n;                      // number of observations
  int<lower=0> p;                      // number of predictors
  array[n] int<lower=0, upper=1> obs;  // class observations
  matrix[n, p] x;                      // predictor matrix
  real<lower=0, upper=1> sens;         // known sensitivity
  real<lower=0, upper=1> spec;         // known specificity
}
parameters {
  real alpha;      // intercept parameter
  vector[p] beta;  // coefficient parameter(s)
}
model {
  // simple priors for intercept and coefficient(s)
  target += normal_lpdf(alpha | 0, 1);
  target += normal_lpdf(beta | 0, 1);
  // logistic likelihood
  for (j in 1:n) {
    // conditional mixture: if class = 1, weight by true-positive versus false-positive rate
    if (obs[j] == 1) {
      target += log_sum_exp(log(sens) + bernoulli_logit_lpmf(1 | alpha + x[j, ] * beta),
                            log1m(spec) + bernoulli_logit_lpmf(0 | alpha + x[j, ] * beta));
    }
    // conditional mixture: if class = 0, weight by true-negative versus false-negative rate
    if (obs[j] == 0) {
      target += log_sum_exp(log(spec) + bernoulli_logit_lpmf(0 | alpha + x[j, ] * beta),
                            log1m(sens) + bernoulli_logit_lpmf(1 | alpha + x[j, ] * beta));
    }
  }
}
Finally, here’s the current iteration of trying to make the model work when sensitivity and specificity are not known and need to be estimated along with the other parameters:
data {
  int<lower=1> n;                      // number of observations
  int<lower=1> p;                      // number of predictors
  array[n] int<lower=0, upper=1> obs;  // class observations
  matrix[n, p] x;                      // predictor matrix
  real<lower=0> sens_a;                // alpha of sensitivity prior
  real<lower=0> sens_b;                // beta of sensitivity prior
  real<lower=0> spec_a;                // alpha of specificity prior
  real<lower=0> spec_b;                // beta of specificity prior
  real<lower=0> prev_a;                // alpha of prevalence prior
  real<lower=0> prev_b;                // beta of prevalence prior
}
parameters {
  real alpha;                   // intercept parameter
  vector[p] beta;               // coefficient parameter(s)
  real<lower=0, upper=1> pr;    // estimated prevalence
  real<lower=0, upper=1> sens;  // estimated sensitivity
  real<lower=0, upper=1> spec;  // estimated specificity
}
model {
  // simple priors for intercept and coefficient(s)
  target += normal_lpdf(alpha | 0, 1);
  target += normal_lpdf(beta | 0, 1);
  // prior for prevalence
  target += beta_lpdf(pr | prev_a, prev_b);
  // priors for sensitivity and specificity
  target += beta_lpdf(sens | sens_a, sens_b);
  target += beta_lpdf(spec | spec_a, spec_b);
  // apparent prevalence implied by true prevalence and the error rates
  real p_sample = pr * sens + (1 - pr) * (1 - spec);
  // logistic likelihood
  for (j in 1:n) {
    // conditional mixture: if class = 1, weight by true-positive versus false-positive rate
    if (obs[j] == 1) {
      target += log_sum_exp(log(sens) + bernoulli_logit_lpmf(1 | alpha + x[j, ] * beta),
                            log1m(spec) + bernoulli_logit_lpmf(0 | alpha + x[j, ] * beta));
    }
    // conditional mixture: if class = 0, weight by true-negative versus false-negative rate
    if (obs[j] == 0) {
      target += log_sum_exp(log(spec) + bernoulli_logit_lpmf(0 | alpha + x[j, ] * beta),
                            log1m(sens) + bernoulli_logit_lpmf(1 | alpha + x[j, ] * beta));
    }
  }
}
The model will work (i.e., good parameter recovery, convergence of chains, and clean HMC diagnostics) if the parameters of the beta distributions are sufficiently informative, but the posteriors for sensitivity and specificity are then highly influenced by those priors, which I find problematic for the intended purposes.
I have tried to create an “infrastructure” for eliciting reasonable beta priors with the following function in R:
beta_prior <- function(lb = 0.25, ub = 0.75, ci = 0.95) {
  out <- optim(c(2, 2),
               function(v) abs((pbeta(ub, v[1], v[2]) - pbeta(lb, v[1], v[2])) - ci))
  return(out$par)
}
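For what it’s worth, the elicited shapes can at least be checked against the requested coverage (a made-up elicitation for illustration, using `beta_prior()` as defined above):

```r
# e.g., ask for ~95% prior mass between 0.50 and 0.90
ab <- beta_prior(lb = 0.5, ub = 0.9, ci = 0.95)
# verify the returned shape parameters actually give that coverage
pbeta(0.9, ab[1], ab[2]) - pbeta(0.5, ab[1], ab[2])  # ~0.95 if optim converged
```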
While specifying a lower and upper bound of an arbitrary confidence interval may help to elicit somewhat reasonable priors from experts, it’s still an abusable practice that I’d prefer to avoid if possible.
Suspicions
Based on my understanding of the model’s behavior, the current specification is too flexible, with priors that do not provide sufficient information to identify the model. I’ve tried a few things to impose more constraints on the model, but they haven’t worked so far: specifying sensitivity and specificity as logits as done here (http://www.stat.columbia.edu/~gelman/research/unpublished/specificity.pdf), and also using the \lambda and \phi parameterization discussed in the User’s Guide (24.2 Reparameterizations – Stan User’s Guide).
My hunch is that there’s extra information within the inherent interrelationships of sensitivity, specificity, and prevalence that could be used to separate those parameters and coax the model into mixing better (and avoiding the cases where the outputs of sensitivity and specificity are both ~0.50). Unfortunately, this model exceeds my mathematical knowledge and is constructed purely theoretically based on my understanding of the problem.
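One concrete symmetry that may explain the ~0.50 mode (my conjecture, sketched numerically rather than derived for the full posterior): swapping (sens, spec) for (1 − spec, 1 − sens) while negating the linear predictor leaves the observed-data mixture probabilities unchanged, so the unconstrained likelihood has a mirrored mode.

```r
# P(obs = 1 | eta) under the mixture, with plogis(eta) the true-case probability
p_obs <- function(eta, sens, spec) {
  sens * plogis(eta) + (1 - spec) * (1 - plogis(eta))
}

eta <- seq(-3, 3, by = 0.5)
a <- p_obs( eta, sens = 0.85, spec = 0.90)
b <- p_obs(-eta, sens = 1 - 0.90, spec = 1 - 0.85)  # swapped and sign-flipped
all.equal(a, b)  # TRUE: the two parameterizations fit the data identically
```

If that is the culprit, a constraint like sens + spec > 1 (e.g., declaring both parameters with `lower = 0.5`, as is common in the misclassification literature) would rule out the mirrored mode without requiring informative beta priors.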
Based on these observations, here are my questions:

1. Are there reasonable/simple constraints that could be added to the model to identify it?

2. Are there better hyperpriors that could encourage better mixing of chains?

3. Is there another way of approaching the problem that might be more efficient/robust when generalizing from known to unknown diagnostic error? Other approaches to the problem (that I’ve seen) have treated the matter as a latent class model, with sensitivity and specificity computed on the back end. Perhaps that’s an easier/more appropriate general solution (e.g., using predictors + error-in-variable diagnoses to inform class membership?). I’ve avoided this because clinical diagnoses are themselves important endpoints (e.g., for treatment options, access to services, etc.), so I prefer predicting those outcomes more accurately over selecting the number of latent classes to extract and then validating that the classes are meaningful.