Thanks for clarifying, I think I see what you mean. There may be other approaches, but what comes to mind first is that you can take your initial attempt (from your first post) and marginalize out the latent binary “truth” (sick / not sick). Here’s the idea in pseudo-code, assuming just one test (we’ll add the second test later):
for (n in 1:Ntot) {
  if (measured[n] == 0) {
    compute Pr([not sick] or [sick and not detected]) = (1 - rate) + rate * (1 - tpr1)
  } else {
    compute Pr(sick and detected) = rate * tpr1
  }
}
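Written out, that marginalization is just a sum over the two values of the latent “sick” indicator:

$$
\Pr(y_n = 0) = \underbrace{(1 - \mathrm{rate})}_{\text{not sick}} + \underbrace{\mathrm{rate}\,(1 - \mathrm{tpr_1})}_{\text{sick, not detected}},
\qquad
\Pr(y_n = 1) = \mathrm{rate} \cdot \mathrm{tpr_1}.
$$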
For the second test we just do the same thing but with tpr2. The only other complication is that Stan works on the log scale, so we actually want to compute log probabilities.
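As a quick numerical sanity check (in Python, purely for illustration), the log-scale expression agrees with computing the probability directly and then taking the log; `log_sum_exp` below is a hand-rolled stand-in for Stan's built-in of the same name:

```python
import math

def log_sum_exp(a, b):
    # numerically stable log(exp(a) + exp(b)), like Stan's log_sum_exp
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

rate, tpr1 = 0.6, 0.8

# direct: log((1 - rate) + rate * (1 - tpr1))
direct = math.log((1 - rate) + rate * (1 - tpr1))

# log scale, mirroring log_sum_exp(log1m(rate), log(rate) + log1m(tpr1))
stable = log_sum_exp(math.log1p(-rate), math.log(rate) + math.log1p(-tpr1))

print(direct, stable)  # both equal log(0.52) ≈ -0.6539
```

The log-scale version matters most when the probabilities involved get very small, where the direct sum can underflow.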
So, taking the data and parameters blocks from your initial attempt, we can add the model block:
data {
  int<lower=0> Ntot;
  array[Ntot] int<lower=0, upper=1> measured1;
  array[Ntot] int<lower=0, upper=1> measured2;
}
parameters {
  real<lower=0, upper=1> tpr1;
  real<lower=0, upper=1> tpr2;
  real<lower=0, upper=1> rate;
}
model {
  vector[Ntot] loglik;
  for (n in 1:Ntot) {
    if (measured1[n] == 0) {
      // log((1 - rate) + rate * (1 - tpr1))
      // could just directly compute that, but more stable to do all of it with log-probabilities
      loglik[n] = log_sum_exp(log1m(rate), log(rate) + log1m(tpr1));
    } else {
      // log(rate * tpr1)
      loglik[n] = log(rate) + log(tpr1);
    }
    // same thing for measured2 with tpr2;
    // '+=' adds to the loglik already computed for measured1
    if (measured2[n] == 0) {
      loglik[n] += log_sum_exp(log1m(rate), log(rate) + log1m(tpr2));
    } else {
      loglik[n] += log(rate) + log(tpr2);
    }
  }
  target += loglik; // adds sum(loglik) to the log density
  // optionally add priors here for rate, tpr1, tpr2
}
To see this in action we can simulate some data and fit the model to the simulated data:
# pick some values for parameters
tpr1 <- 0.8
tpr2 <- 0.4
rate <- 0.6
Ntot <- 1000
actual <- rbinom(1, Ntot, rate)
measured1 <- rep(0, Ntot)
measured2 <- rep(0, Ntot)
# the first 'actual' individuals are the truly sick ones; only they can test positive
measured1[1:actual] <- rbinom(actual, 1, tpr1)
measured2[1:actual] <- rbinom(actual, 1, tpr2)
data_list <- list(Ntot = Ntot, measured1 = measured1, measured2 = measured2)
# I'm using cmdstanr but you can use any Stan interface
library(cmdstanr)
mod <- cmdstan_model("marginalize.stan") # assume you put the program above in file "marginalize.stan"
fit <- mod$sample(data = data_list, chains = 4, parallel_chains = 4)
print(fit)
The results I get:
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -1231.71 -1231.37 1.33 1.08 -1234.38 -1230.25 1.00 1043 1410
tpr1 0.73 0.73 0.15 0.19 0.49 0.97 1.00 592 772
tpr2 0.35 0.35 0.08 0.09 0.24 0.48 1.00 636 782
rate 0.67 0.64 0.15 0.16 0.48 0.95 1.00 591 535
which is pretty good. The values used in the simulation were
tpr1 = 0.8
tpr2 = 0.4
rate = 0.6
so we got pretty close. (We’re not going to recover the values exactly from a single simulated data set of 1000 observations; ideally we would repeat the simulation and fitting many times in order to really verify the model.)
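As a cheap first check along those lines, the marginal positive fraction of each test should land near rate × tpr, since under the model Pr(positive) = rate · tpr. Here is the same data-generating process sketched in Python, purely for illustration (the fit above uses the R data):

```python
import random

random.seed(1)
tpr1, tpr2, rate, Ntot = 0.8, 0.4, 0.6, 1000

# same data-generating process as the R code: draw the latent truth,
# then a test can only come back positive for a truly sick individual
sick = [random.random() < rate for _ in range(Ntot)]
measured1 = [int(s and random.random() < tpr1) for s in sick]
measured2 = [int(s and random.random() < tpr2) for s in sick]

# marginally, Pr(positive) = rate * tpr, so these should be near
# 0.6 * 0.8 = 0.48 and 0.6 * 0.4 = 0.24
print(sum(measured1) / Ntot)
print(sum(measured2) / Ntot)
```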