Logistic Regression without a Gold Standard/Error in Observed Classification

Context of Problem

I’m trying to specify a logistic regression model that incorporates some amount of unknown classification error in the sample data. Diagnostic uncertainty is an obvious problem, but the model adjustments and corrections I’ve seen are mostly designed for specific contexts. My hope is that a logistic regression would be more general in its applications and could also be extended to ordered logit models.

In neuropsychology, we’re often looking at methods and measures that may improve the detection of cognitive disorders, but most cognitive disorders either lack a definitive diagnostic method or have one that is impractical or impossible to apply in clinical settings. These studies therefore have to rely on clinical diagnoses that make classification errors at some (un)known rate, with sensitivity and specificity being among the most widely used clinical summaries of testing methods. In cases where a model/measure is actually more accurate with respect to the true diagnostic classes than the clinical diagnostic standard, its class predictions get penalized for not matching the error-laden observed classifications.

Actual Modeling Problem

I have approached the problem as a mixture model: when the observed classification is a control (0), there is a mixture of true negatives (at a rate = specificity) and false negatives (at a rate = 1 - sensitivity), and similarly when the observation is classified as a case (1), this is a mixture of true positives (at a rate = sensitivity) and false positives (at a rate = 1 - specificity). When the sensitivity and specificity of a diagnostic method are known (in this case, the sensitivity and specificity of clinical diagnoses to particular diagnoses), this model works very well, but I want to generalize it to cases where the sensitivity and specificity of the initial observations are not known.
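Written out, the likelihood I have in mind for an observed classification is

P(\text{obs}_i = 1 \mid x_i) = \text{sens} \cdot \pi_i + (1 - \text{spec}) \cdot (1 - \pi_i)

P(\text{obs}_i = 0 \mid x_i) = \text{spec} \cdot (1 - \pi_i) + (1 - \text{sens}) \cdot \pi_i

where \pi_i = \text{logit}^{-1}(\alpha + x_i \beta) is the probability that person i is a true case.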

The approaches I’ve taken so far to extend the model, however, have resulted in non-identifiability, with \hat{R} values as high as 1.75 (but no divergent transitions, max treedepth warnings, or E-BFMI issues). The model also seems sensitive to starting values: sometimes it returns posterior estimates that are very close to the simulated values (still with bad mixing), and other times it returns essentially chance-level sensitivity and specificity estimates (both ~0.50). Using variational estimation, the model runs very quickly (under 3-4 seconds) but shows the same variability in whether or not the results are close.

Reproducible Example

Here’s the function I’m using to generate the data for the logistic regression:

logDatGen <- function(n, p, sn, sp) {
  # n = sample size, p = num. of predictors, sn = sensitivity, sp = specificity
  
  x <- rnorm(n * p, mean = 0, sd = 1)                 # generate random predictors
  x <- matrix(x, nrow = n, ncol = p)                  # convert to n-by-p predictor matrix
  
  c <- rnorm(p, mean = 1, sd = 0.25)                  # generate coefficients
  
  s <- sample.int(n = p, size = round(p / 4))         # randomly select ~1/4 of coefficients
  
  c[s] <- c[s] * -1                                   # flip the selected coefficients to negative
  
  log_odds <- rowSums(sweep(x, 2, c, FUN = "*")) + 1  # linear predictor (intercept fixed at 1)
  
  pr <- 1 / (1 + exp(-log_odds))                      # convert linear predictor to probability scale
  
  y <- rbinom(n = n, size = 1, prob = pr)             # generate true classes
  
  obs <- integer(n)                                   # initiate observed classes
  for (i in 1:n) {
    # introduce classification error as a function of sensitivity/specificity
    obs[i] <- ifelse(y[i] == 1,
                     rbinom(n = 1, size = 1, prob = sn),
                     rbinom(n = 1, size = 1, prob = 1 - sp))
  }
  
  return(data.frame("Predictors" = x,
                    "True" = y,
                    "Error" = obs))
}

Assuming I haven’t messed up the data generation code, the result should be a matrix of predictors (x) along with both the true classifications (y) and classifications observed with error (obs), where observed positives (1) are a mix of true positives (at a rate = sensitivity) and false positives (at a rate = 1 - specificity), and observed negatives are likewise a mix of true and false negatives.
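As a quick sanity check (a minimal sketch; the seed, sample size, and error rates below are arbitrary), the empirical sensitivity and specificity in the generated data should land near the values passed to the function:

set.seed(123)
dat <- logDatGen(n = 1000, p = 4, sn = 0.80, sp = 0.90)

mean(dat$Error[dat$True == 1])        # empirical sensitivity: P(obs = 1 | true = 1)
mean(1 - dat$Error[dat$True == 0])    # empirical specificity: P(obs = 0 | true = 0)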

Here’s the Stan code (run via cmdstanr) for the model when sensitivity and specificity are known:

data {
  int<lower=1> n;                          // number of observations
  int<lower=0> p;                          // number of predictors
  array[n] int<lower=0,upper=1> obs;       // class observations
  matrix[n, p] x;                          // predictor matrix
  real<lower=0, upper=1> sens;             // known sensitivity
  real<lower=0, upper=1> spec;             // known specificity
}

parameters {
  real alpha;                              // intercept parameter
  vector[p] beta;                          // coefficient parameter(s)
}

model {
  
  // simple priors for intercept and coefficient(s) 
  target += normal_lpdf(alpha | 0, 1);
  target += normal_lpdf(beta | 0, 1);

  // logistic likelihood
  for (j in 1:n) {

    // conditional mixture: if class = 1, weight by true positive versus false positive rate
    if (obs[j] == 1) {
      target += log_sum_exp(log(sens) +
                            bernoulli_logit_lpmf(1 | alpha + x[j, ]*beta),
                            log(1 - spec) +
                            bernoulli_logit_lpmf(0 | alpha + x[j, ]*beta));
    }

    // conditional mixture: if class = 0, weight by true negative versus false negative rate
    if (obs[j] == 0) {
      target += log_sum_exp(log(spec) +
                            bernoulli_logit_lpmf(0 | alpha + x[j, ]*beta),
                            log(1 - sens) +
                            bernoulli_logit_lpmf(1 | alpha + x[j, ]*beta));
    }
  }
}
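For reference, fitting this model with cmdstanr looks something like the following sketch (the file name known_error.stan and the simulation settings are arbitrary):

library(cmdstanr)

dat <- logDatGen(n = 1000, p = 4, sn = 0.80, sp = 0.90)

stan_data <- list(
  n    = nrow(dat),
  p    = 4,
  obs  = dat$Error,
  x    = as.matrix(dat[, grep("Predictors", names(dat))]),
  sens = 0.80,
  spec = 0.90
)

mod <- cmdstan_model("known_error.stan")   # the Stan program above
fit <- mod$sample(data = stan_data, chains = 4, parallel_chains = 4)
fit$summary(variables = c("alpha", "beta"))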

Finally, here’s the current iteration of trying to make the model work when sensitivity and specificity are not known and need to be estimated along with the other parameters:

data {
  int<lower=1> n;                          // number of observations
  int<lower=1> p;                          // number of predictors
  array[n] int<lower=0,upper=1> obs;       // class observations
  matrix[n, p] x;                          // predictor matrix
  real<lower=0> sens_a;                    // alpha of sensitivity prior
  real<lower=0> sens_b;                    // beta of sensitivity prior
  real<lower=0> spec_a;                    // alpha of specificity prior
  real<lower=0> spec_b;                    // beta of specificity prior
  real<lower=0> prev_a;                    // alpha of prevalence prior
  real<lower=0> prev_b;                    // beta of prevalence prior
}

parameters {
  real alpha;                              // intercept parameter
  vector[p] beta;                          // coefficient parameter(s)
  real<lower=0, upper=1> pr;               // estimated prevalence
  real<lower=0, upper=1> sens;             // estimated sensitivity
  real<lower=0, upper=1> spec;             // estimated specificity
}

model {

  // simple priors for intercept and coefficient(s)
  target += normal_lpdf(alpha | 0, 1);
  target += normal_lpdf(beta  | 0, 1);

  // priors for prevalence
  target += beta_lpdf(pr | prev_a, prev_b);

  // priors for sensitivity and specificity
  target += beta_lpdf(sens | sens_a, sens_b);
  target += beta_lpdf(spec | spec_a, spec_b);

  // prevalence estimation
  real p_sample = pr * sens + (1 - pr) * (1 - spec);

  // logistic likelihood
  for (j in 1:n) {

    // conditional mixture: if class = 1, weight by true positive versus false positive rate
    if (obs[j] == 1) {
      target += log_sum_exp(log(sens) +
                            bernoulli_logit_lpmf(1 | alpha + x[j, ]*beta),
                            log(1 - spec) +
                            bernoulli_logit_lpmf(0 | alpha + x[j, ]*beta));
    }

    // conditional mixture: if class = 0, weight by true negative versus false negative rate
    if (obs[j] == 0) {
      target += log_sum_exp(log(spec) +
                            bernoulli_logit_lpmf(0 | alpha + x[j, ]*beta),
                            log(1 - sens) +
                            bernoulli_logit_lpmf(1 | alpha + x[j, ]*beta));
    }
  }
}

The model will work (i.e., good parameter recovery, convergence of chains, and clean HMC diagnostics) if the parameters of the beta distributions are sufficiently informative, but the posteriors for sensitivity and specificity are then highly driven by those priors, which I find problematic for the intended purposes.

I have tried to create an “infrastructure” for eliciting reasonable beta priors with the following function in R:

beta_prior <- function(lb = 0.25, ub = 0.75, ci = 0.95) {
  # find beta shape parameters such that P(lb < theta < ub) is approximately ci
  out <- optim(c(2, 2),
               function(v) abs((pbeta(ub, v[1], v[2]) - pbeta(lb, v[1], v[2])) - ci))
  return(out$par)
}

While specifying a lower and upper bound of an arbitrary confidence interval may help to elicit somewhat reasonable priors from experts, it’s still an abusable practice that I’d prefer to avoid if possible.
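As a usage example (the interval below is arbitrary), asking an expert for a range they believe contains, say, sensitivity with ~95% probability returns a pair of shape parameters that approximately achieve that coverage (the solution is not unique):

shapes <- beta_prior(lb = 0.60, ub = 0.90, ci = 0.95)
shapes

# check the implied coverage of the elicited interval
pbeta(0.90, shapes[1], shapes[2]) - pbeta(0.60, shapes[1], shapes[2])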

Suspicions

Based on my understanding of the model’s behavior, the current specification is too flexible, with priors that don’t provide enough information to identify the model. I’ve tried a few things to impose more constraints, but they haven’t worked so far: specifying sensitivity and specificity on the logit scale as done here (http://www.stat.columbia.edu/~gelman/research/unpublished/specificity.pdf), and using the \lambda and \phi parameterization discussed in the User’s Guide (24.2 Reparameterizations | Stan User’s Guide).

My hunch is that there’s extra information in the inherent interrelationships among sensitivity, specificity, and prevalence that could be used to separate those parameters and coax the model into mixing better (and avoid the cases where sensitivity and specificity both come out around 0.50). Unfortunately, this exceeds my mathematical knowledge; the model is constructed purely from my theoretical understanding of the problem.
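For instance, the p_sample line in the model already encodes one such relationship, the expected rate of observed positives,

P(\text{obs} = 1) = \text{prev} \cdot \text{sens} + (1 - \text{prev}) \cdot (1 - \text{spec})

but I haven’t worked out how to use it to separate sensitivity from specificity.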

Based on these observations, here are my questions:

  1. Is there a reasonable/simple constraint(s) that can be added to the model that could identify it?

  2. Are there better hyperpriors that could encourage better mixing of chains?

  3. Is there another way of approaching the problem that might be more efficient/robust to generalization from known to unknown diagnostic error? Other approaches to the problem (that I’ve seen) have treated the matter as a latent class model with computation of sensitivity and specificity done on the back-end. Perhaps that’s an easier/more appropriate general solution (e.g., using predictors + error-in-variable diagnoses to inform class membership?). I’ve avoided this as clinical diagnoses are themselves important endpoints (e.g., for treatment options, access to services, etc.), so I like the idea of predicting those outcomes more accurately than dealing with the problem of selecting the number of latent classes to extract and then validating that the classes are meaningful.

One of the problems with these models is adversarial solutions. They can’t tell someone giving intentionally bad answers from someone giving good answers. You can constrain away the adversarial solutions by requiring sensitivity > 1 - specificity (i.e., a true positive rate higher than the false positive rate), but this isn’t always realistic. You can do that by declaring sens with <lower=1 - spec>.

The problem with latent class models is that they can run into problems with multimodality (many locally optimal points in the posterior). We talk about how to do inference in that case in the mixture chapter of the User’s Guide.

There’s a pretty big epidemiology literature around diagnostics with unknown gold standards (equivalently, measurement error in categories).

I’m not sure what you mean by the back end here, but that’s how most of these models work. They assume there’s a true latent value with noisy observations of it. There’s an example of Dawid and Skene’s model for this in the User’s Guide here:

https://mc-stan.org/docs/stan-users-guide/data-coding-and-diagnostic-accuracy-models.html

And there’s a whole (short) chapter on measurement error models more generally.

Then going back to your main questions, here are some code comments. First, this can be simplified:

    if (obs[j] == 1) {
      target += log_sum_exp(log(sens) +
                            bernoulli_logit_lpmf(1 | alpha + x[j, ]*beta),
                            log(1 - spec) +
                            bernoulli_logit_lpmf(0 | alpha + x[j, ]*beta));
    }
    if (obs[j] == 0) {
      target += log_sum_exp(log(spec) +
                            bernoulli_logit_lpmf(0 | alpha + x[j, ]*beta),
                            log(1 - sens) +
                            bernoulli_logit_lpmf(1 | alpha + x[j, ]*beta));
    }

to

    target += log_sum_exp(bernoulli_lpmf(obs[j] | sens) + bernoulli_logit_lpmf(1 | alpha + x[j] * beta),
                          bernoulli_lpmf(1 - obs[j] | spec) + bernoulli_logit_lpmf(0 | alpha + x[j] * beta));

It’d be even more efficient to use a matrix multiply and indexing rather than recomputing the row products:

 vector[n] lp = alpha + x * beta;
 for (j in 1:n)
   target += log_sum_exp(bernoulli_lpmf(obs[j] | sens) + bernoulli_logit_lpmf(1 | lp[j]),
                         bernoulli_lpmf(1 - obs[j] | spec) + bernoulli_logit_lpmf(0 | lp[j]));

As far as I can tell, you define p_sample but don’t use it, because the probability is based on a linear regression. Similarly, I don’t think the prevalence variable pr is doing anything here: there isn’t a single prevalence in the model, since it depends on the predictor row vector x[j]. But given that they’re all bounded in (0, 1), they shouldn’t cause problems for sampling. Other than that, the model looks OK.

I would recommend specifying the beta priors in terms of a mean and total concentration. We even have a built-in version of the beta distribution with that parameterization (beta_proportion).
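As a quick R illustration of the correspondence (the mean and concentration values are arbitrary): a beta with mean mu and total concentration kappa is the same as a beta with shapes mu * kappa and (1 - mu) * kappa, which is the parameterization beta_proportion uses.

mu    <- 0.80   # prior mean for, e.g., sensitivity
kappa <- 20     # total concentration (roughly a prior sample size)

shape1 <- mu * kappa
shape2 <- (1 - mu) * kappa

# quantiles of the implied beta(shape1, shape2) prior
qbeta(c(0.025, 0.5, 0.975), shape1, shape2)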

The priors on the intercepts are a bit tight at normal(0, 1). Are you really expecting those to be in the (-2, 2) range? We’d normally use something like normal(0, 5), or at least normal(0, 2), for a logistic regression. Of course, it depends on the scale of the predictors x[j], but those should probably be standardized anyway. Also, it’s more efficient to use the _lupdf form or the sampling-statement form; the _lpdf form computes a lot of normalizing constants you don’t need here.


Many thanks for your detailed reply, and the many resources you provided!

Seems like my thinking wasn’t too far off: some amount of prior information is needed to identify the model. I’d hoped I could find something that worked within a logistic regression, since it’s so commonly used compared to the various forms of latent class models. I find clinicians are less skeptical of “established” modeling methods, which is also why I’d been focused on making sensitivity and specificity central to the model structure.

This approach is what I ended up trying after posting the question, which is part of the reason the model I pasted in has the prevalence parameters (I hit my limit and posted in the midst of some other ideas that I came back to this morning). I tried to think about what information would be least problematic to treat as an informative prior, and I figured the prevalence of cases in the sample is a reasonable starting place, since most diagnostic methods aren’t so inaccurate that the observed prevalence would be wildly off.

I ended up estimating prevalence on the logit scale, with a prior along the lines of target += normal_lpdf(logit_pr | log_pr_mu, log_pr_sd); (logit_pr being the log-odds of prevalence), where log_pr_mu was defined in the transformed data block as real log_pr_mu = logit(mean(to_vector(obs))); and log_pr_sd was just passed to the model as data (with the plan to see how changes in that scale affected model performance).

I then built in the assumption that the predictors were at least better than chance at classifying cases with the following general structure:

  ...

parameters {
  ...

  // estimate sensitivity
  real<lower=0, upper=1> sens;

  // constraint for sum of sensitivity and specificity
  real<lower=1, upper=2> sens_plus_spec;
}

transformed parameters {
  ...

  // estimate specificity
  real<lower=0, upper=1> spec = sens_plus_spec - sens;
}

model {
  ...
  
  // vague prior on sensitivity
  target += beta_lpdf(sens | 2, 2);
}

This set-up actually did a pretty decent job in the few (far from rigorous) tests I ran. It converged and ran pretty quickly (~14 s for the default 1000 warmup + 1000 sampling iterations over 4 chains) with no other model issues. While the model was good at recovering sensitivity, it tended to overestimate specificity. During sampling, I also got many warnings that a proposed value for sensitivity was larger than 1 (usually something like 1.05; the largest I recall seeing was 1.07).

Since I had prevalence and some estimate of sensitivity and specificity, I also thought that perhaps changing the mixing probabilities from sensitivity and specificity to negative and positive predictive values made more sense. That addition, however, caused the estimates of sensitivity to be nearly 1 and specificity to be correspondingly nearly 0, but it at least got rid of the many warnings about impossible sensitivity estimates. And that’s when I decided to give up and hope that someone would answer my post here.

I had looked at this documentation page, but when I was searching it, I couldn’t find anything that helped me understand how to use prior information to define those values. For example, with the beta distribution I was using, I know some basic weakly informative shape terms, but I had also found one recommendation specific to sensitivity and specificity: sens ~ beta(TP + 1, p - TP + 1); where TP is the count of true positives identified by a test and p is the number of individuals who truly have the condition (i.e., sample size times prevalence). That helped me think about the beta prior’s shape terms.

So, with this other parameterization, would this then become something like sens ~ beta_proportion((TP + 1) / (n * pr + 2), n * pr); where I’ve just estimated the number of cases from the sample size and the prevalence estimate? If that’s right, then the prior for the mean is really just a matter of thinking about what TP is likely to be, or is there a better way of understanding the shape terms?
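To check my own reasoning in R (the TP, n, and pr values below are made up), beta_proportion(mu, kappa) is just beta(mu * kappa, (1 - mu) * kappa), so I can compare the implied shapes against the count-based beta(TP + 1, p - TP + 1) recommendation:

n  <- 500          # sample size
pr <- 0.30         # assumed prevalence
TP <- 120          # assumed count of true positives

p_cases <- n * pr  # expected number of true cases

# shapes implied by the proposed beta_proportion parameterization
mu    <- (TP + 1) / (p_cases + 2)
kappa <- p_cases
c(shape1 = mu * kappa, shape2 = (1 - mu) * kappa)

# shapes from the count-based recommendation
c(shape1 = TP + 1, shape2 = p_cases - TP + 1)

The two line up approximately; matching exactly would require the total concentration to be p + 2 rather than p, since beta(TP + 1, p - TP + 1) has mean (TP + 1) / (p + 2) and total concentration p + 2.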

I’d actually started the project from the Dawid-Skene model chapter before deciding it might be more intuitive to clinicians if I could present it in the trappings of a logistic regression. It seems like I need to go back to that model, though, as it’s ultimately better suited to the issue.

As for the back end, I just meant that sensitivity and specificity are usually computed from the latent class assignments after the model has been run (in Stan’s case, I guess, in the generated quantities block) instead of being parameters estimated alongside the rest of the model.

After doing some more work on this problem, I found a few helpful resources in the noisy raters material on the stan-dev GitHub. The Dawid-Skene model there seems largely the same as the one currently in the User’s Guide, but there is also a model from Raykar et al. (2010) with a Stan implementation that specifically incorporates logistic regression along with sensitivity and specificity estimates, as I was looking for.

It looks like Git material for the Raykar model was being worked on by @WardBrian. For convenience, the Stan model is below:

/**
 * Implementation of joint estimation of logistic regression and
 * annotator sensitivity and specificity.  This is just like the Dawid-Skene
 * model with the basic prevalence replaced with a logistic regression
 * based on predictors.
 *
 * Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C.,
 * Bogoni, L., & Moy, L.  (2010). Learning from crowds. Journal of
 * Machine Learning Research, 11(Apr), 1297-1322.
 *
 * TODO:  hierarchical priors for alpha, beta
 * TODO:  precompute counts for sum(y[ , r] == 1) and multiply
 *        in the N/R loops and the eltwise multiply and sum
 */
data {
  int<lower=0> N; // # items (reviews)
  int<lower=0> R; // # raters (heuristics)
  int<lower=0> D; // # of predictors for each item
  matrix[N, D] x; // [n, d] predictors d for item n
  array[N, R] int<lower=0, upper=1> y; // outcome: 1 success, 0 failure
}
parameters {
  vector[D] w; // logistic regression coeffs
  real w0; // intercept
  vector<lower=0, upper=1>[R] alpha; // sensitivity
  vector<lower=0, upper=1>[R] beta; // specificity
}
model {
  vector[N] logit_z_hat = w0 + x * w;
  vector[N] log_z_hat = log_inv_logit(logit_z_hat);
  vector[N] log1m_z_hat = log1m_inv_logit(logit_z_hat);
  
  vector[R] log_alpha = log(alpha);
  vector[R] log1m_alpha = log1m(alpha);
  vector[R] log_beta = log(beta);
  vector[R] log1m_beta = log1m(beta);
  
  // prior
  w ~ normal(0, 2);
  w0 ~ normal(0, 5);
  alpha ~ beta(10, 1);
  beta ~ beta(10, 1);
  
  // likelihood
  for (n in 1 : N) {
    real pos_sum = log_z_hat[n];
    real neg_sum = log1m_z_hat[n];
    for (r in 1 : R) {
      if (y[n, r] == 1) {
        pos_sum += log_alpha[r];
        neg_sum += log1m_beta[r];
      } else {
        pos_sum += log1m_alpha[r];
        neg_sum += log_beta[r];
      }
    }
    target += log_sum_exp(pos_sum, neg_sum);
  }
}
generated quantities {
  vector[N] Pr_z_eq_1;
  {
    vector[N] logit_z_hat = w0 + x * w;
    vector[N] log_z_hat = log_inv_logit(logit_z_hat);
    vector[N] log1m_z_hat = log1m_inv_logit(logit_z_hat);
    
    vector[R] log_alpha = log(alpha);
    vector[R] log1m_alpha = log1m(alpha);
    vector[R] log_beta = log(beta);
    vector[R] log1m_beta = log1m(beta);
    
    for (n in 1 : N) {
      real pos_sum = log_z_hat[n];
      real neg_sum = log1m_z_hat[n];
      for (r in 1 : R) {
        if (y[n, r] == 1) {
          pos_sum += log_alpha[r];
          neg_sum += log1m_beta[r];
        } else {
          pos_sum += log1m_alpha[r];
          neg_sum += log_beta[r];
        }
      }
      Pr_z_eq_1[n] = softmax([pos_sum, neg_sum]')[1];
    }
  }
}

New Efforts Since Original Post

I’ve reduced the model somewhat since I’m interested in just a single rater. I’m still running simulations to check how the model does, but so far things are looking positive. There were issues, however, in reliably getting the cmdstanr sampling command to produce reasonable results. I’ve found that running the model on the same data produces what boils down to one of two outcomes:

  1. Divergent transitions (<1-5% usually, sometimes >20%) but very reasonable posterior estimates

  2. No divergent transitions but very poor posterior estimates and \hat{R} \gg 1.05 (probably averaging around 1.50) for most parameters (sometimes one or two of the regression coefficients have acceptable values)

I haven’t found that supplying initial values helps at all. I also changed the priors for sensitivity and specificity, since beta(10, 1) strongly favors values close to 1 for both; while some methods I’ve seen do have sensitivity \approx specificity \approx 0.70, it’s more common that a method favors either high sensitivity (accepting reduced specificity) or vice versa. Based on some similar papers (e.g., Evans et al., 1996, for binary classification problems and Swartz et al., 2004, for the multinomial extension), I used the constraint that 1 < sensitivity + specificity < 2 that I mentioned before.

I’ve been working with a vague prior on sensitivity in this case, just beta(2, 2). I also tried to use a Dirichlet prior to represent belief about the rates of all 4 possible diagnostic outcomes in a 2x2 confusion matrix: true positive and true negative rates (sensitivity and specificity, respectively) and then false positive and false negative rates. For example, if someone wants a skeptical prior that the test has about 65% accuracy with true positives and true negatives being equally likely, then dirichlet(to_vector({1.85, 1.85, 1, 1})) gives about that. I’ve never used the Dirichlet before, so I may be misunderstanding what it does or how it’s applied in Stan; I was trying to wrap my head around what kind of values the distribution generates in R: colMeans(MCMCpack::rdirichlet(10000, c(1.85, 1.85, 1, 1))). My thought was that these priors could be substituted for the log(...) and log1m(...) terms in the Raykar et al. Stan code above, with the advantage of directly reflecting prior beliefs about specific types of diagnostic outcomes via a guaranteed simplex parameter. That, however, did not pan out, though I don’t know exactly why.
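For reference, here’s a slightly expanded version of that R check (it uses the MCMCpack package, as above, and the same concentration values, treating the four simplex components as the joint cell probabilities of the confusion matrix):

library(MCMCpack)   # for rdirichlet()

set.seed(1)
draws <- rdirichlet(10000, c(1.85, 1.85, 1, 1))
colnames(draws) <- c("TP", "TN", "FP", "FN")

colMeans(draws)                       # implied mean probability of each confusion-matrix cell
mean(draws[, "TP"] + draws[, "TN"])   # implied overall accuracy, roughly 0.65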

Still, I haven’t found a way to coax the model into reliably producing accurate results from cmdstanr’s $sample() method, but I’ve found that variational inference tends to work pretty well, though as sample size drops (e.g., below 250), the variational method fails entirely fairly often (something like 45% of the time).

Questions for What to Try Next

The model really seems to have the potential to do what I’m interested in, but I’m clearly missing some additional constraints (hard or soft) that would consistently identify it. That may just be the nature of the problem: it may inherently require stronger prior information plus sufficient data to correct errors in the prior belief.

One constraint that Swartz et al. discuss in the ordered multilogit case is the belief that the misdiagnosis probability is greatest for whatever diagnoses are “closest” to the true diagnosis and declines with greater distance. In the binary case this is obviously not directly relevant, but I think a similar, justifiable constraint could be that misdiagnosis of some conditions is less likely than others. For example, it’s relatively rare that a clinician/test will mistakenly classify someone as having dementia when they are cognitively normal, but it’s more likely that the classification of a specific dementia etiology (e.g., Alzheimer’s versus Lewy body disease) will be problematic. I’d think this prior information could be passed from a clinician via the Dirichlet distribution.

What I think could be an even better extension of this idea of varying misclassification probabilities is to include a weighting system (I’d guess some kind of beta regression?) that can adjust the expected diagnostic error rate for an individual. For example, if someone in a dataset is 50 years old, then it’s extraordinarily unlikely that they have dementia (they may have clinical scores that suggest impairment, but at that age that’s more likely due to some process other than dementia). Similarly, individuals 80 and above often have some degree of cognitive change, and it can be hard to identify appropriate normative references for individuals that old, introducing greater uncertainty about whether scores are low due to age, dementia, or poor normative comparisons.

I also noticed that the Raykar model is not currently in the Stan User’s Guide, while the Dawid-Skene example from the same GitHub section is. If I could help validate or stabilize the model to the point that it could be included alongside the other materials, I’m happy to contribute my current code and help out in whatever way I can, as I’m already working on the project (an initial simulation study and then a real-data example).

Just wanted to jump in to say that model was actually added by @Bob_Carpenter. I just did some formatting work in the examples repo, which has given me undue credit.


Ah, I see: thanks for the clarification! I’m sure I could have hunted that down somewhere in the Git history.

For anyone else who may be interested in this problem, I want to share what I think ultimately was the solution. Looking back over the material I was reading, I realized that the identifiability constraint for accuracy was actually placed on the classification errors, not on sensitivity and specificity per se.

To put this more directly, I can borrow from the Evans et al. (1996) notation of the problem:

p_{sick} = P(\eta = 1)
\theta = P(X = 0 | \eta = 1)
\phi = P(X = 1 | \eta = 0)

My constraint would be written as (1 - \theta) + (1 - \phi) > 1, while Evans et al. state it as \theta + \phi < 1; the two are algebraically equivalent, but the second puts the constraint directly on the error rates. With the constraint expressed as the errors summing to less than 1, I could use a Dirichlet prior to identify the model: declaring a simplex[3] variable means I can take two of its elements as \theta and \phi and be guaranteed that they sum to less than one. Putting a Dirichlet prior on that simplex fully resolved my estimation problems.
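A quick R illustration of why the construction works (the Dirichlet concentrations here are arbitrary): the first two elements of any draw from a 3-simplex sum to less than one, so \theta + \phi < 1, and therefore sensitivity + specificity > 1, holds automatically.

library(MCMCpack)   # for rdirichlet()

set.seed(1)
acc <- rdirichlet(10000, c(5, 5, 5))   # draws from a 3-dimensional simplex

theta <- acc[, 1]   # false negative rate, i.e., 1 - sensitivity
phi   <- acc[, 2]   # false positive rate, i.e., 1 - specificity

max(theta + phi)    # always < 1 by construction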

For the sake of time, I still used variational estimation, since those models run much faster (<1 second in my simulations versus 14-17 seconds for sampling). No errors were encountered or thrown during the 48,000 simulations I ran. The model shows very minimal bias or other estimation issues, and it’s overall more accurate than a standard logistic regression, so it seems like these changes worked as desired.

The model is pasted below. If anyone sees areas for improvement, then please still let me know. I also still want to extend the model to multinomial/ordered logistic models instead of just binary ones.

data {

  int<lower=0> N;                          // number of observations
  int<lower=0> P;                          // number of predictors
  array[N] int<lower=0, upper=1> y;        // class observations
  matrix[N, P] x;                          // predictor matrix

}

transformed data {

  vector[P] x_m;                           // predictor means
  matrix[N, P] x_c;                        // centered predictors

  // mean center predictors
  for ( p in 1 : P ) {
    x_m[p] = mean(x[, p]);
    x_c[, p] = x[, p] - x_m[p];
  }

  // perform QR decomposition on centered predictors
  matrix[N, P] b_x_r = qr_thin_Q(x_c) * sqrt(N - 1);
  matrix[P, P] b_x_a = inverse(qr_thin_R(x_c) / sqrt(N - 1));

}

parameters {

  vector[P] t_x;                           // QR decomposed coefficient parameter(s)
  real b_0_c;                              // intercept parameter
  vector<lower=1>[3] theta;                // Dirichlet concentration hyperprior for accuracy
  simplex[3] accuracy;                     // {1 - sens, 1 - spec, remainder}: errors sum to < 1

}

transformed parameters {

  // convert to sensitivity and specificity (assuming model is better than chance)
  real<lower=0, upper=1> sens = 1 - accuracy[1];
  real<lower=0, upper=1> spec = 1 - accuracy[2];

  // linear component of logistic regression
  vector[N] logit_z_hat = b_0_c + x_c * t_x;

}

model {

  // (hyper)priors
  theta ~ chi_square(5);
  accuracy ~ dirichlet(theta);
  b_0_c ~ std_normal();
  t_x   ~ std_normal();

  // likelihood
  for ( n in 1 : N ) {
    target += log_sum_exp(bernoulli_lpmf(    y[n] | sens) + bernoulli_logit_lpmf(1 | logit_z_hat[n]),
                          bernoulli_lpmf(1 - y[n] | spec) + bernoulli_logit_lpmf(0 | logit_z_hat[n]));
  }
}

generated quantities {

  // posterior predictions
  vector<lower=0, upper=1>[N] prob;        // probability of 1
  array[N] int<lower=0, upper=1> pop;      // realization of diagnostic class

  for ( n in 1 : N ) {
    prob[n] = softmax([bernoulli_lpmf(    y[n] | sens) + bernoulli_logit_lpmf(1 | logit_z_hat[n]),
                       bernoulli_lpmf(1 - y[n] | spec) + bernoulli_logit_lpmf(0 | logit_z_hat[n])]')[1];
  }

  // compute class predictions
  pop = bernoulli_rng(prob);

  // compute rescaled coefficients
  vector[P] b_x = b_x_a * t_x;
  real b_0 = b_0_c - dot_product(x_m, b_x);

}