I am trying to implement the method of Campbell et al. 2020 to estimate Covid-19 infection rates (IR) and infection fatality rates (IFR) from observed positive tests and Covid-attributable deaths, accounting for preferential testing of infected individuals. As warned in the paper, I am encountering identifiability issues estimating IR and IFR. The authors suggest using strongly informative priors on the IFR between-group variance under the assumption that fatality rates should be quite similar between jurisdictions/time periods, which I have implemented in Stan as follows:
data {
int<lower = 0> J; // number of groups
int<lower = 0> N; // total population (same for all groups)
int<lower = 0> T_all[J]; // total number of tests
int<lower = 0> T_pos[J]; // number of positive tests
int<lower = 0> D[J]; // number of covid deaths
}
parameters {
// infection rate
real<lower = 0> s_IR; // IR between-group std deviation
real mu_IR;
vector<offset = mu_IR, multiplier = s_IR>[J] logit_IR; // log-odds infection rate
// infection fatality rate
real<lower = 0> s_IFR; // IFR between-group std deviation
real mu_IFR;
vector<offset = mu_IFR, multiplier = s_IFR>[J] logit_IFR; // log-odds IFR
vector[J] log_w; // log of preferential testing ratio
}
transformed parameters{
vector<lower = 0, upper = 1>[J] IR_pref; // prob for preferential testing approx
vector<lower = 0, upper = 1>[J] IFR_pop; // marginalized over true infections
vector<lower = 0>[J] w = exp(log_w); // preferential testing ratio
// approximately an odds ratio for w close to 1
vector<lower = 0, upper = 1>[J] IR = inv_logit(logit_IR); // infection rate
vector<lower = 0, upper = 1>[J] IFR = inv_logit(logit_IFR); // infection fatality rate
for(j in 1:J){
IR_pref[j] = 1 - pow((1 - IR[j]), w[j]); // prob for preferential testing approximation
IFR_pop[j] = IR[j] * IFR[j]; // marginalize over true infections
}
}
model{
// likelihood
T_pos ~ binomial(T_all, IR_pref); // binomial approximation for preferential testing
D ~ binomial(N, IFR_pop); // deaths marginalizing over true infections
logit_IR ~ normal(mu_IR, s_IR);
logit_IFR ~ normal(mu_IFR, s_IFR);
mu_IR ~ normal(-2, 1); // < ~50% infection rate
mu_IFR ~ normal(-3, 1); // < ~30% fatality rate (still high)
s_IR ~ normal(0, 1);
s_IFR ~ normal(0, 0.01); // assume IFR similar between groups
log_w ~ normal(2, 1); // greater weight on preference for positive cases
}
generated quantities{
int<lower = 0> C[J];
C = binomial_rng(N, IR); // simulate true infection count
}
The number of positive tests under preferential testing should follow Fisher’s non-central hypergeometric distribution (I think), but I do not understand the probability math well enough to implement it as a custom distribution in Stan (although I would welcome help on that front). Instead I am using the binomial approximation suggested by the authors in my model while simulating test data from the FNC hypergeometric:
## simulate data
set.seed(12345)
J <- 5 # number of groups (time periods)
N <- 1e7 # total population (of NC)
mu_IR <- -2 # mean log-odds infection rate
mu_IFR <- -3 # mean log-odds fatality rate
s_IR <- 0.5 # log-odds infection rate std
s_IFR <- 0.01 # log-odds fatality rate std
logit_IR <- rnorm(J, mu_IR, s_IR)
logit_IFR <- rnorm(J, mu_IFR, s_IFR)
IR <- inv_logit(logit_IR) # infection rate
IFR <- inv_logit(logit_IFR) # fatality rate
N_pos <- rbinom(J, N, IR)
N_neg <- as.integer(N - N_pos)
D <- rbinom(J, N_pos, IFR)
T_all <- rbinom(J, N, runif(J, 0.01, 0.1)) # total tests, testing rate < 10%
w <- runif(J, 1, 10) # preferential testing odds ratio
# noncentral hypergeometric function apparently not vectorized
for(j in 1:J){
T_pos[j] <- BiasedUrn::rFNCHypergeo(1, N_pos[j], N_neg[j], T_all[j], w[j])
}
dat <- list(
J = J,
N = N,
T_all = T_all,
T_pos = T_pos,
D = D
)
## fit stan model
model <- stan_model(code)
fit <- sampling(model,
data = dat,
control = list(max_treedepth = 15),
warmup = 2000,
iter = 3000)
print(fit, digits = 3,
pars = c("s_IR", "s_IFR", "mu_IR", "mu_IFR", "IR", "IFR", "w", "C"),
probs = c(0.025, 0.5, 0.975))
pairs(fit, pars = c("IR[1]", "IFR[1]", "IR[5]", "IFR[5]", "lp__"))
I had to increase the max_treedepth to 15 but otherwise the model fit without divergent transitions and decent Rhats. However, the estimates for IR and IFR were pulled towards each other, with the IRs all too low and the IFRs all too high. The pairs plot shows very high correlation between IR, IFR, and w (preferential testing ratio).
Are there any tricks or tips for model specification that could help here or is this approach simply a victim of trying to estimate 3 parameters (IR, IFR, w) from 2 observed quantities (test positivity rate, mortality rate)? The authors reported model fits from Nimble using a custom sampler and also said they implemented it in Stan, but do not provide any code.