Multiple experiments on same sample

Meir_Maor · August 6, 2020, 7:02pm

I’m interested in a binary phenomena. I have two tests for it, each I believe to have no false positive and a different false negative rate. I have experimental data, and I wish to estimate both the true prevalence of the phenomena and the true positive rate of each test type.
This sounds like it should be a trivial model, yet I’m struggling to model it in stan.
Without stan I can algebraically solve for one test but not combine the other or get confidence intervals.

Below one of several approaches which don’t work.

model = """
data {
  int<lower=0> Ntot;
  int<lower=0,upper=1> measured1[Ntot];
  int<lower=0,upper=1> measured2[Ntot];  
}
parameters {
    real<lower=0,upper=1> tpr1;
    real<lower=0,upper=1> tpr2;
    real<lower=0,upper=1> rate;
}
model {
    int<lower=0,upper=1> sick[Ntot];

    actual ~ bernoulli(rate);
    measured1 ~ bernoulli(tpr1)*actual;
    measured2 ~ bernoulli(tpr2)*actual;
}
"""

Meir_Maor · August 7, 2020, 9:20am

Here is a model which does work but it doesn’t force the full structure, it has independence in the samples while they are related.

data {
  int<lower=0> Ntot;
  int<lower=0> NPosPos;
  int<lower=0> NPosNeg;  
  int<lower=0> NNegPos; 
  int<lower=0> NNegNeg; 
}
parameters {
    real<lower=0,upper=1> tpr1;
    real<lower=0,upper=1> tpr2;
    real<lower=0,upper=1> rate;
}
model {
    NNegNeg ~ binomial(Ntot,rate * (1-tpr2) * (1-tpr1) + (1-rate));
    NPosNeg ~ binomial(Ntot,rate * (1-tpr2) * tpr1);
    NNegPos ~ binomial(Ntot,rate * (1-tpr1) * tpr2);
    NPosPos ~ binomial(Ntot, rate * tpr1 * tpr2);
}

jonah · August 15, 2020, 8:37pm

Hi @Meir_Maor, sorry nobody responded sooner! The first model you posted definitely doesn’t work, as you said, but I have a few questions about it that will hopefully help clarify your intent and maybe spark some suggestions:

This is declared at the top of the model block but I don’t see it defined or used anywhere. What was the intent here?

I don’t see actual declared or defined anywhere before it’s used. What is actual intended to represent here? Is this the true value of the binary phenomenon, which isn’t measured?

I’m a bit confused by this multiplication. What are you trying to multiply here? actual, as used on the previous line, is a 0/1 variable (although it’s never actually defined). And ~ bernoulli(...) is just syntactic sugar for target += bernoulli_lpmf(...) (i.e., adding the bernoulli log probability to the total that’s being accumulated in the model block). So it’s not clear to me what’s intended by the multiplication. Can you write out what you’re trying to do in math notation?

Meir_Maor · August 15, 2020, 8:50pm

The sick was just leftover I forgot to remove. sorry.
The multiplication was a supposed to create a dependency. We can only measure the effect if it’s actually there.
There is an actual phenomena, which appears in some of the samples and an unknown prevelance/rate. And we have two tests for it. Which will never detect it when it’s not there but each have some independent chance of detecting it I called these tpr1 and tpr2.
My second model does give numbers which fit the data nicely but i think it doesn’t model the fact the two measurements are related the actual phenomena either exists or doesn’t for each sample so we expect a strong correlation between the two measurements.

Perhaps a different failed attempt would be clearer:

data {
  int<lower=0> Ntot;
  int<lower=0> NPosPos;
  int<lower=0> NPosNeg;  
  int<lower=0> NNegPos; 
  int<lower=0> NNegNeg; 
}
parameters {
    real<lower=0.3,upper=1> tpr1;
    real<lower=0.3,upper=1> tpr2;
    real<lower=0,upper=1> rate;
    int<lower=0> actual;
}
model {
    actual ~ binomial(Ntot,rate);
    NPosNeg ~ binomial(actual,(1-tpr2) * tpr1);
    NNegPos ~ binomial(actual,1-tpr1) * tpr2);
    NPosPos ~ binomial(actual, tpr1 * tpr2);
}

Thanks for your interest in helping.

jonah · August 16, 2020, 10:38pm

Thanks for clarifying, I think I see what you mean. There may be other approaches, but what comes to mind first is that you can take your initial attempt (from your first post) and marginalize out the latent binary “truth” (sick / not sick). Here’s the idea in pseudo-code and assuming just one test (we’ll add the second test later):

for (i in 1:Ntot) {
  if (measured[n] == 0) {
   compute Pr([not sick] or [sick and not detected]) = (1-rate) + (rate * (1 - tpr1))
  }
  else {
   compute Pr(sick and detected) = rate * tpr1
  }
}

For the second test we just do the same thing but with tpr2. The only other complication is that Stan works on the log scale so we actually want to compute log probabilities.

So, taking the data and parameters blocks from your initial attempt, we can add the model block:

data {
  int<lower=0> Ntot;
  int<lower=0,upper=1> measured1[Ntot];
  int<lower=0,upper=1> measured2[Ntot];  
}
parameters {
    real<lower=0,upper=1> tpr1;
    real<lower=0,upper=1> tpr2;
    real<lower=0,upper=1> rate;
}
model {
  vector[Ntot] loglik;
  for (n in 1:Ntot) {
    if (measured1[n] == 0) {
      // log((1-rate) + (rate * (1 - tpr1)))
      // could just directly compute that, but more stable to do all of it using log-probabilities
      loglik[n] = log_sum_exp(log1m(rate), log(rate) + log1m(tpr1));
    } else {
      // log(rate * tpr1)
      loglik[n] = log(rate) + log(tpr1); 
    }

    // same thing for measured2 with tpr2
    // use 'loglik[n] +=' to just add to loglik computed for measured1
    if (measured2[n] == 0) {
      loglik[n] += log_sum_exp(log1m(rate), log(rate) + log1m(tpr2));
    } else {
      loglik[n] += log(rate) + log(tpr2); 
    }
  }
  target += loglik;

  // optionally add priors here for rate, tpr1, tpr2
}

To see this in action we can simulate some data and fit the model to the simulated data:

# pick some values for parameters
tpr1 <- 0.8
tpr2 <- 0.4
rate <- 0.6

Ntot <- 1000
actual <- rbinom(1, Ntot, rate)
measured1 <- rep(0, Ntot)
measured2 <- rep(0, Ntot)

# fill in true positives with probabilities from above
measured1[1:actual] <- sample(c(1, 0), size = actual, replace=TRUE, prob = c(tpr1, 1 - tpr1))
measured2[1:actual] <- sample(c(1, 0), size = actual, replace=TRUE, prob = c(tpr2, 1 - tpr2))

data_list <- list(Ntot = Ntot, measured1 = measured1, measured2 = measured2)

# I'm using cmdstanr but you can use any Stan interface
library(cmdstanr) 
mod <- cmdstan_model("marginalize.stan") # assume you put the program above in file "marginalize.stan"
fit <- mod$sample(data = data_list, chains = 4, parallel_chains = 4)
print(fit)

The results I get

 variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
     lp__ -1231.71 -1231.37 1.33 1.08 -1234.38 -1230.25 1.00     1043     1410
     tpr1     0.73     0.73 0.15 0.19     0.49     0.97 1.00      592      772
     tpr2     0.35     0.35 0.08 0.09     0.24     0.48 1.00      636      782
     rate     0.67     0.64 0.15 0.16     0.48     0.95 1.00      591      535

which is pretty good. The values used in the simulation were

tpr1 = 0.8
tpr2 = 0.4
rate = 0.6

so we got pretty close (we’re not going to get it exactly with just 1 simulated data set of 1000 observations. ideally we would repeat the simulation and fitting many times in order to really verify it).

Meir_Maor · August 18, 2020, 4:56pm

So the technique showed by jonah indeed solves the problem.
The the code specifically is not exactly what I intended, but following his lead I now have it solved.
For prosperity I will share what I’m using:

data {
  int<lower=0> Ntot;
  int<lower=0,upper=1> measured1[Ntot];
  int<lower=0,upper=1> measured2[Ntot];  
}
parameters {
    real<lower=0,upper=1> tpr1;
    real<lower=0,upper=1> tpr2;
    real<lower=0,upper=1> rate;
}
model {
  vector[Ntot] loglik;
  for (n in 1:Ntot) {
    if (measured1[n] == 0) {
      if (measured2[n] == 0) {
          // log((1-rate) + (rate * (1 - tpr1)))
          // could just directly compute that, but more stable to do all of it using log-probabilities
          loglik[n] = log_sum_exp(log1m(rate), log(rate) + log1m(tpr1) + log1m(tpr2));
      } else {
          loglik[n] = log(rate)+log(tpr2) + log1m(tpr1);          
      }
    } else {
        if (measured2[n] == 0) {
            loglik[n] = log(rate) + log(tpr1) + log1m(tpr2);
        } else {
            loglik[n] = log(rate) + log(tpr1) + log(tpr2);
        }
    }
  }
    
  target += loglik;
}

jonah · August 18, 2020, 5:32pm

Great, glad that helped and thanks for sharing the code that ended up working!

Topic		Replies	Views
Combining data from multiple sources to model the same parameter Modeling	1	1024	November 29, 2017
Model for binary outcome with misclassification Modeling	1	781	April 5, 2018
Two different models have the same log likllihood Modeling	2	118	June 18, 2024
Fitting a Bernoulli model conditionally to data Modeling fitting-issues , specification	0	89	May 1, 2024
A/B testing with beta-binomial models Modeling	4	2289	September 17, 2017

Multiple experiments on same sample

Related topics