# Improving identifiability of two latent rates

I am trying to implement the method of Campbell et al. 2020 to estimate Covid-19 infection rates (IR) and infection fatality rates (IFR) from observed positive tests and Covid-attributable deaths, accounting for preferential testing of infected individuals. As warned in the paper, I am encountering identifiability issues estimating IR and IFR. The authors suggest using strongly informative priors on the IFR between-group variance under the assumption that fatality rates should be quite similar between jurisdictions/time periods, which I have implemented in Stan as follows:

``````data {
int<lower = 0> J;  // number of groups
int<lower = 0> N;  // total population (same for all groups)
int<lower = 0> T_all[J];  // total number of tests
int<lower = 0> T_pos[J];  // number of positive tests
int<lower = 0> D[J];  // number of covid deaths
}
parameters {
// infection rate
real<lower = 0> s_IR;  // IR between-group std deviation
real mu_IR;
vector<offset = mu_IR, multiplier = s_IR>[J] logit_IR;   // log-odds infection rate

// infection fatality rate
real<lower = 0> s_IFR;  // IFR between-group std deviation
real mu_IFR;
vector<offset = mu_IFR, multiplier = s_IFR>[J] logit_IFR;   // log-odds IFR

vector[J] log_w;      // log of preferential testing ratio
}
transformed parameters{
vector<lower = 0, upper = 1>[J] IR_pref;  // prob for preferential testing approx
vector<lower = 0, upper = 1>[J] IFR_pop;  // marginalized over true infections
vector<lower = 0>[J] w = exp(log_w);  // preferential testing ratio
// approximately an odds ratio for w close to 1
vector<lower = 0, upper = 1>[J] IR = inv_logit(logit_IR);  // infection rate
vector<lower = 0, upper = 1>[J] IFR = inv_logit(logit_IFR);  // infection fatality rate

for(j in 1:J){
IR_pref[j] = 1 - pow((1 - IR[j]), w[j]);  // prob for preferential testing approximation
IFR_pop[j] = IR[j] * IFR[j];  // marginalize over true infections
}
}
model{
// likelihood
T_pos ~ binomial(T_all, IR_pref);  // binomial approximation for preferential testing
D ~ binomial(N, IFR_pop);  // deaths marginalizing over true infections

logit_IR ~ normal(mu_IR, s_IR);
logit_IFR ~ normal(mu_IFR, s_IFR);
mu_IR ~ normal(-2, 1);  //  < ~50% infection rate
mu_IFR ~ normal(-3, 1);  // < ~30% fatality rate (still high)
s_IR ~ normal(0, 1);
s_IFR ~ normal(0, 0.01);  // assume IFR similar between groups
log_w ~ normal(2, 1);  // greater weight on preference for positive cases
}
generated quantities{
int<lower = 0> C[J];
C = binomial_rng(N, IR);  // simulate true infection count
}
``````

The number of positive tests under preferential testing should follow Fisher’s non-central hypergeometric distribution (I think), but I do not understand the probability math well enough to implement it as a custom distribution in Stan (although I would welcome help on that front). Instead I am using the binomial approximation suggested by the authors in my model while simulating test data from the FNC hypergeometric:

``````## simulate data
set.seed(12345)

J <- 5  # number of groups (time periods)
N <- 1e7  # total population (of NC)

mu_IR <- -2 # mean log-odds infection rate
mu_IFR <- -3  # mean log-odds fatality rate

s_IR <- 0.5  # log-odds infection rate std
s_IFR <- 0.01  # log-odds fatality rate std

logit_IR <- rnorm(J, mu_IR, s_IR)
logit_IFR <- rnorm(J, mu_IFR, s_IFR)

IR <- inv_logit(logit_IR)  # infection rate
IFR <- inv_logit(logit_IFR)   # fatality rate

N_pos <- rbinom(J, N, IR)
N_neg <- as.integer(N - N_pos)
D <- rbinom(J, N_pos, IFR)

T_all <- rbinom(J, N, runif(J, 0.01, 0.1))  # total tests, testing rate < 10%
w <- runif(J, 1, 10)  # preferential testing odds ratio

# noncentral hypergeometric function apparently not vectorized
for(j in 1:J){
T_pos[j] <- BiasedUrn::rFNCHypergeo(1, N_pos[j], N_neg[j], T_all[j], w[j])
}

dat <- list(
J = J,
N = N,
T_all = T_all,
T_pos = T_pos,
D = D
)

## fit stan model
model <- stan_model(code)
fit <- sampling(model,
data = dat,
control = list(max_treedepth = 15),
warmup = 2000,
iter = 3000)

print(fit, digits = 3,
pars = c("s_IR", "s_IFR", "mu_IR", "mu_IFR", "IR", "IFR", "w", "C"),
probs = c(0.025, 0.5, 0.975))

pairs(fit, pars = c("IR", "IFR", "IR", "IFR", "lp__"))
``````

I had to increase the max_treedepth to 15 but otherwise the model fit without divergent transitions and decent Rhats. However, the estimates for IR and IFR were pulled towards each other, with the IRs all too low and the IFRs all too high. The pairs plot shows very high correlation between IR, IFR, and w (preferential testing ratio).

Are there any tricks or tips for model specification that could help here or is this approach simply a victim of trying to estimate 3 parameters (IR, IFR, w) from 2 observed quantities (test positivity rate, mortality rate)? The authors reported model fits from Nimble using a custom sampler and also said they implemented it in Stan, but do not provide any code.

2 Likes

This looks very much like a problem that can’t be overcome without using more data. The data seems to be able to inform only the `IFR_prop` and not it’s decomposition and also only some linear combination of `IFR` and `w`. I think this is a fundamental problem, because the data directly inform `IR_pref` and `IFR_prop` but the decomposition into `IR`, `IFR` and `w` is not constrained in any way beyond the prior. I think it could make more sense to use `IR_pref`, `IFR_prop` and `IFR` as the actual parameters (the former two are directly informed by data, the last is constrained)… but not sure if that would help.

It could also be sensible to only estimate `IR_pref` and `IFR_prop` and then create the decomposition completely in `generated quantitites` (as it seems to be driven purely by the prior)…

Does that make sense?

That makes perfect sense, thank you. Sounds like my best bet is to find an additional data source that could directly inform one of these decomposed parameters.

1 Like