Improving identifiability of two latent rates

I am trying to implement the method of Campbell et al. 2020 to estimate Covid-19 infection rates (IR) and infection fatality rates (IFR) from observed positive tests and Covid-attributable deaths, accounting for preferential testing of infected individuals. As warned in the paper, I am encountering identifiability issues estimating IR and IFR. The authors suggest using strongly informative priors on the IFR between-group variance under the assumption that fatality rates should be quite similar between jurisdictions/time periods, which I have implemented in Stan as follows:

data {
  int<lower = 0> J;  // number of groups
  int<lower = 0> N;  // total population (same for all groups)
  int<lower = 0> T_all[J];  // total number of tests
  int<lower = 0> T_pos[J];  // number of positive tests
  int<lower = 0> D[J];  // number of covid deaths
parameters {
  // infection rate
  real<lower = 0> s_IR;  // IR between-group std deviation
  real mu_IR;
  vector<offset = mu_IR, multiplier = s_IR>[J] logit_IR;   // log-odds infection rate

  // infection fatality rate
  real<lower = 0> s_IFR;  // IFR between-group std deviation
  real mu_IFR;
  vector<offset = mu_IFR, multiplier = s_IFR>[J] logit_IFR;   // log-odds IFR
  vector[J] log_w;      // log of preferential testing ratio
transformed parameters{
  vector<lower = 0, upper = 1>[J] IR_pref;  // prob for preferential testing approx
  vector<lower = 0, upper = 1>[J] IFR_pop;  // marginalized over true infections
  vector<lower = 0>[J] w = exp(log_w);  // preferential testing ratio
                                                               // approximately an odds ratio for w close to 1
  vector<lower = 0, upper = 1>[J] IR = inv_logit(logit_IR);  // infection rate
  vector<lower = 0, upper = 1>[J] IFR = inv_logit(logit_IFR);  // infection fatality rate
  for(j in 1:J){
  IR_pref[j] = 1 - pow((1 - IR[j]), w[j]);  // prob for preferential testing approximation
  IFR_pop[j] = IR[j] * IFR[j];  // marginalize over true infections
  // likelihood
  T_pos ~ binomial(T_all, IR_pref);  // binomial approximation for preferential testing
  D ~ binomial(N, IFR_pop);  // deaths marginalizing over true infections
  logit_IR ~ normal(mu_IR, s_IR);  
  logit_IFR ~ normal(mu_IFR, s_IFR);
  mu_IR ~ normal(-2, 1);  //  < ~50% infection rate
  mu_IFR ~ normal(-3, 1);  // < ~30% fatality rate (still high)
  s_IR ~ normal(0, 1);
  s_IFR ~ normal(0, 0.01);  // assume IFR similar between groups
  log_w ~ normal(2, 1);  // greater weight on preference for positive cases
generated quantities{
  int<lower = 0> C[J];
  C = binomial_rng(N, IR);  // simulate true infection count

The number of positive tests under preferential testing should follow Fisher’s non-central hypergeometric distribution (I think), but I do not understand the probability math well enough to implement it as a custom distribution in Stan (although I would welcome help on that front). Instead I am using the binomial approximation suggested by the authors in my model while simulating test data from the FNC hypergeometric:

## simulate data

J <- 5  # number of groups (time periods)
N <- 1e7  # total population (of NC)

mu_IR <- -2 # mean log-odds infection rate
mu_IFR <- -3  # mean log-odds fatality rate

s_IR <- 0.5  # log-odds infection rate std
s_IFR <- 0.01  # log-odds fatality rate std

logit_IR <- rnorm(J, mu_IR, s_IR)
logit_IFR <- rnorm(J, mu_IFR, s_IFR)

IR <- inv_logit(logit_IR)  # infection rate
IFR <- inv_logit(logit_IFR)   # fatality rate

N_pos <- rbinom(J, N, IR)
N_neg <- as.integer(N - N_pos)
D <- rbinom(J, N_pos, IFR)

T_all <- rbinom(J, N, runif(J, 0.01, 0.1))  # total tests, testing rate < 10%
w <- runif(J, 1, 10)  # preferential testing odds ratio

# noncentral hypergeometric function apparently not vectorized
for(j in 1:J){
  T_pos[j] <- BiasedUrn::rFNCHypergeo(1, N_pos[j], N_neg[j], T_all[j], w[j])

dat <- list(
  J = J,
  N = N,
  T_all = T_all,
  T_pos = T_pos,
  D = D

## fit stan model
model <- stan_model(code)
fit <- sampling(model,
                        data = dat,
                        control = list(max_treedepth = 15),
                        warmup = 2000,
                        iter = 3000)

print(fit, digits = 3,
       pars = c("s_IR", "s_IFR", "mu_IR", "mu_IFR", "IR", "IFR", "w", "C"),
       probs = c(0.025, 0.5, 0.975))

pairs(fit, pars = c("IR[1]", "IFR[1]", "IR[5]", "IFR[5]", "lp__"))

I had to increase the max_treedepth to 15 but otherwise the model fit without divergent transitions and decent Rhats. However, the estimates for IR and IFR were pulled towards each other, with the IRs all too low and the IFRs all too high. The pairs plot shows very high correlation between IR, IFR, and w (preferential testing ratio).

Are there any tricks or tips for model specification that could help here or is this approach simply a victim of trying to estimate 3 parameters (IR, IFR, w) from 2 observed quantities (test positivity rate, mortality rate)? The authors reported model fits from Nimble using a custom sampler and also said they implemented it in Stan, but do not provide any code.


This looks very much like a problem that can’t be overcome without using more data. The data seems to be able to inform only the IFR_prop[5] and not it’s decomposition and also only some linear combination of IFR[5] and w[5]. I think this is a fundamental problem, because the data directly inform IR_pref and IFR_prop but the decomposition into IR, IFR and w is not constrained in any way beyond the prior. I think it could make more sense to use IR_pref, IFR_prop and IFR as the actual parameters (the former two are directly informed by data, the last is constrained)… but not sure if that would help.

It could also be sensible to only estimate IR_pref and IFR_prop and then create the decomposition completely in generated quantitites (as it seems to be driven purely by the prior)…

Does that make sense?

That makes perfect sense, thank you. Sounds like my best bet is to find an additional data source that could directly inform one of these decomposed parameters.

1 Like