Partially Identified Model of COVID-19 Spread for Measuring Virus Suppression Policies

Hi all -

I have a new model of a latent infectious process (COVID-19) that I am planning to use with an ongoing data collection project on international government responses to the coronavirus epidemic. Compared to SEIR/SIR models, this simpler approach uses observed cases/tests to model COVID-19 as a partially identified latent variable. I show in the paper that it is possible to rank-identify suppression policies targeted at the virus even if the true infection rate is unknown. Furthermore, by incorporating informative priors based on SEIR/SIR papers, it is possible to convert the estimates to "true" numbers. The model predicts that there were approximately 700,000 infected cases in the US yesterday versus 80,000 reported cases.

Teaser pic of model estimates versus observed case counts for the US:

I'd really appreciate feedback on the model, as we are collecting data now and the model will be used later. My hope is that it will also be useful for others engaged in modeling various aspects of this public health crisis.

Update: Now available on a preprint server:
Link: https://osf.io/preprints/socarxiv/jp4wk/

Best,

Bob


Pinging some people @jriou, @charlesm93, @bgoodri


This is a nice initiative! Below I give some feedback.

Overall impressions

In the current climate of increasingly complex models which are all flawed in some obvious way, it's refreshing to see someone take a step back, pick a particular (single!) goal and go for it with a model of appropriate complexity. Kudos!
I very much like the idea that while the (first) model cannot estimate the true size of the epidemic, it can at least help us estimate whether control measures are working and by how much.

I guess the conclusion that "these early containment strategies have not yet stopped domestic transmission of the virus" is consistent with the notion that early intervention AND massive testing are what's needed for a substantial reduction in transmission.
The disclaimer that the association is just an association at this point is very apt; there is another factor that's important in early-epidemic behaviour: stochasticity. With a few introductions, you can get massively different outcomes for the same system if you "run different seeds".

Technical comments

  • The use of the word "mutated" is technically incorrect. The virus mutates. It just does. Whether the mutations it accumulates lead to appreciable differences in infectivity and transmissibility is determined by a host of factors. Moreover, except for very peculiar situations, it is safe to assume that whatever differences in dynamics are observed between countries are caused by (i) stochastic factors owing to the timing of introduction and the presence/absence of large gatherings, (ii) population characteristics such as early/late institution of social distancing, age structure, etc., or both.

  • Is this "However, the number of tests is increasing in I_{ct} conditional on a country's willingness to test people. That is, regardless of how much a country wants to test people, as the outbreak grows the number of tests will increase though at very different rates." a reasonable assumption? Many countries simply lack the resources to test at any appreciable rate, right?

  • I didn't get how test imperfection is incorporated into the model. Is it "just" through the constraining of the intercept \alpha_3?

  • I'm not sure I get the figure (in the Estimation section) where most of the population is infected for a while and then the proportion drops down again. I don't suppose it is meant to be the same as the I(t) curve in a (normalised) SIR model, which shows the incidence. But then if it's not that, I don't understand what it is.

  • I know you want a simple model and I appreciate why that is. Depending on exactly what data you have at your disposal, however, you could include the reporting delay explicitly in the model relatively easily: https://onlinelibrary.wiley.com/doi/epdf/10.1002/sim.8303. Note I haven't thought in depth about this suggestion and it might very well be vacuous.

  • I'd put a vertical line at 1 in the random effects plot of the proportion tested versus infected.

  • When you model \log q_{\text{ct}} - \log I_{\text{ct}}, can you not derive the distribution directly instead of approximating it by a standard normal? I say this because if you change things to, say, a Uniform(0.1, 0.59), you don't get a standard normal as the "best fit" anymore.

Here's some R code for you to play with in case you'd like to model Z_{\text{ct}} = \log q_{\text{ct}} - \log I_{\text{ct}} explicitly:

# Density of Y = log(X) for X ~ Uniform(a, b): f_Y(y) = exp(y)/(b - a)
f_y <- function(y, a, b, log = FALSE){
  lc <- log(b - a)
  ans <- y - lc
  if(!log) ans <- exp(ans)
  return(ans)
}
#
# Variance of Y = log(X) for X ~ Uniform(a, b), via E[Y^2] - E[Y]^2
# using the closed-form integrals of log(x) and log(x)^2
get_Y_var <- function(a, b){
  EXsq <- ( b*(log(b)^2 - 2*log(b) + 2) - a*(log(a)^2 - 2*log(a) + 2) )/(b-a)
  EX <- ( b*(log(b) - 1) - a*(log(a) - 1) )/(b-a)
  ans <- EXsq - EX^2
  return(ans)
}
##
# Density of the ratio W = X1/X2 with X1, X2 i.i.d. Uniform(a, b), 0 < a < b;
# the support is (a/b, b/a) and the inner integral over x2 runs from
# L = max(a, a/w) to U = min(b, b/w)
analytic_W <- function(w, a, b){  ## assumes i.i.d.
  c0 <- all(w > a/b, w < b/a)  # indicator that w lies inside the support
  k <- (b-a)*(b-a)
  m <- max(a, a/w)
  M <- min(b, b/w)
  soln <- function(L, U) ((U*abs(U)) - (L*abs(L)))/(2*k)
  d0 <- soln(L = m, U = M)
  dens <- c0 * d0
  return(dens)
}
analytic_W <- Vectorize(analytic_W)
##
# Density of Z = log(W) by change of variables: f_Z(z) = f_W(exp(z)) * exp(z)
f_z <- function(z, a, b, log = FALSE){
  ans <- log(analytic_W(exp(z), a, b)) + z
  if(!log) ans <- exp(ans)
  return(ans)
}
f_z <- Vectorize(f_z)
###############

# Monte Carlo sanity check of f_y against simulated draws
M <- 1E6
a <- .01
b <- .5
Y <- log(runif(M, min = a, max = b))

hist(Y, probability = TRUE)
curve(f_y(x, a, b), min(Y), max(Y), add = TRUE, lwd = 2)
abline(v = c(log(a), log(b)), lwd = 2, lty = 2)
integrate(function(x) f_y(x, a, b), log(a), log(b))  # should be ~1

###

# Now the difference Z = Y1 - Y2 = log(X1/X2)
Y1 <- log(runif(M, min = a, max = b))
Y2 <- log(runif(M, min = a, max = b))

Z <- Y1 - Y2

# Exact density (red) versus the moment-matched Gaussian (green)
hist(Z, breaks = 50, probability = TRUE)
curve(f_z(x, a, b), min(Z), max(Z), col = 2, add = TRUE)
curve(dnorm(x, 0, sqrt(2*get_Y_var(a, b))), min(Z), max(Z), col = 3, add = TRUE)
abline(v = c(log(a)-log(b), log(b)-log(a)), lwd = 2, lty = 2)

# f_z should integrate to 1, and its second moment should match var(Z)
integrate(function(x) f_z(x, a, b), log(a)-log(b), log(b)-log(a))
integrate(function(x) x^2 * f_z(x, a, b), log(a)-log(b), log(b)-log(a))

2*get_Y_var(a, b)
var(Z)

typos, nitpicks, etc

  • ā€œrecieveā€ is misspelt
  • ā€œas the virus increasesā€ -> as the number of cases increases / as the epidemic progresses
  • Is the term ā€œBinomial proportionā€ technically sound? if X ~Binomial(N, p), I donā€™t feel confortable calling X a ā€œbinomial proportionā€ of N. But I might be very wrong.
  • " Stan, a Markov Chain Monte Carlo sampler" Stan is a language. Dynamic HMC is a sampler.

Oh wow thank you so much! I really appreciate you taking the time to read through and provide feedback. I am currently writing it up into a pre-print so the timing is excellent too.

The new draft is more explicit about what I think are the connections to SIR. The quantity is literally:

\frac{I}{S+R}

where S and R are exogenous to the model. As such, you can't use this model to forecast (and it's not meant to), given that S and R are changing over time (per the ODEs y'all are using). However, retrospectively, we know R (how many people have recovered/died up to time t), and S can be identified with informative priors (in theory). So the model is only useful looking backwards. Per the simulation, I just found polynomial coefficients that had a nice bell shape, as people are seeing in the media. But you can't identify the trend moving forwards, only backwards, so it's not necessary for the model to have that shape. One way of thinking about it is as a local linear approximation to the I(t) curve.
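To make that concrete, here's a quick illustration (parameters and initial state are made up for the plot, not estimates from the model) of what I/(S+R) looks like along a standard SIR path, using a simple Euler integration in base R:

# Illustrative only: basic SIR by forward Euler, tracking the
# retrospective quantity I/(S+R); beta, gamma and the initial
# state are made-up values, not estimates from the model
sir_ratio <- function(beta = 0.4, gamma = 0.1, days = 150, dt = 0.1){
  n <- days/dt
  S <- I <- R <- numeric(n)
  S[1] <- 0.999; I[1] <- 0.001; R[1] <- 0
  for (t in 1:(n - 1)) {
    new_inf <- beta*S[t]*I[t]*dt
    new_rec <- gamma*I[t]*dt
    S[t + 1] <- S[t] - new_inf
    I[t + 1] <- I[t] + new_inf - new_rec
    R[t + 1] <- R[t] + new_rec
  }
  data.frame(day = (1:n)*dt, I = I, ratio = I/(S + R))
}

out <- sir_ratio()
plot(out$day, out$ratio, type = "l", lwd = 2, xlab = "day", ylab = "I/(S + R)")
lines(out$day, out$I, lty = 2)  # I(t) itself, for comparison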

I have a DAG that explains what's going on with the models:

The infection rate is a confounder between any covariate, tests and cases because it influences both the number of tests (people will test more if they believe more people are sick) and the number of cases (we only observe cases conditional on tests). The test adjustment parameter permits the test/infection relationship to vary by country but not over time.

Test/infection can be very different based on a country's capacity; it's not the total number that's important, just the change in testing over time. So long as that is non-decreasing in the infection rate, it's fine.
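Here's a toy simulation of that structure (all numbers made up, not the actual model): two countries with an identical infection path report very different case counts purely because testing capacity differs, which is the confounding the model has to adjust for.

# Toy illustration of the DAG (made-up numbers, not the actual model):
# the latent infection rate drives both testing and observed cases, so
# raw case counts conflate infections with testing capacity
set.seed(1)
n_days <- 60
infect <- plogis(seq(-6, 0, length.out = n_days))  # latent infection rate
capacity <- c(low = 500, high = 5000)              # country-specific testing capacity
for (cc in names(capacity)) {
  tests <- rpois(n_days, capacity[cc]*infect)      # testing increases with infections
  cases <- rbinom(n_days, tests, infect)           # cases observed only among the tested
  cat(cc, "capacity, final-day cases:", cases[n_days],
      "; true infection rate:", round(infect[n_days], 2), "\n")
}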

You are correct that the only explicit correction for testing noise is the false positive rate. The intercept in the tests binomial count is unconstrained so presumably that can capture some of it (i.e. tests that have nothing to do with the infection rate).

So the model is trying to adjust for the confounder to allow people to identify X_{ct}. I'm trying to get it out soon, as more and more papers are popping up with regressions on the observed case counts.

I'll look into the reporting delay. I'm not sure how much that matters retrospectively, but I will give it some thought. The model is certainly conservative due to reporting delays, though it might be better to be conservative.
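Just to get a feel for the magnitude, here's a toy check (with a made-up epidemic curve and delay distribution) of how a reporting delay smears and right-shifts observed counts relative to true infections:

# Toy check (illustrative numbers, not data): convolve a made-up epidemic
# curve with a reporting-delay distribution averaging ~5 days
days <- 1:60
true_cases <- round(500*exp(-(days - 30)^2/100))  # made-up epidemic curve
delay_pmf <- dpois(0:14, lambda = 5)
delay_pmf <- delay_pmf/sum(delay_pmf)
reported <- stats::filter(true_cases, delay_pmf, method = "convolution", sides = 1)
plot(days, true_cases, type = "l", lwd = 2, ylab = "cases")
lines(days, reported, col = 2, lwd = 2)           # delayed (reported) series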

Your point on the log prior is well taken. I will revise that section. The standard normal is really just for convenience and is not the true distribution, as you point out. It makes the assumption that the testing/infection ratio could be quite high but is unlikely to be less than 10%. Thank you very much for the code. Judging by the plot, it looks like a double exponential. If that's the actual mathematical relationship between the quantities, that would be fantastic.


Not exactly. When b \approx a the distribution looks triangular. The maths on this is a bit tedious, but basically if you have Z = \log(W) where W = \frac{X_1}{X_2} and X_1, X_2 \sim \operatorname{Uniform}(a, b), the pdf of W is

f_W(w) = \chi_{> a/b}(w) \cdot \chi_{< b/a}(w) \frac{U |U|- L |L|}{2(b-a)^2},

with U = \min(b, b/w) and L = \max(a, a/w), and \chi_{\text{cond}}(x) = 1 if x satisfies cond and 0 otherwise. The pdf of Z is the pdf of W times the Jacobian adjustment \exp(z),

f_Z(z) = f_W(\exp(z))\exp(z).

If you don't want to implement all this in Stan, then in the code I give a way to derive the variance of Z in terms of a and b, which you can then plug into a Gaussian.
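For example, with the Uniform(0.1, 0.59) mentioned earlier, the plug-in Gaussian (using get_Y_var() and f_z() from the code above) would be:

# Moment-matched Gaussian stand-in for Z = log(X1) - log(X2),
# X1, X2 ~ Uniform(a, b); Var(Z) = 2*Var(log X) under independence
a <- 0.1; b <- 0.59
sd_Z <- sqrt(2*get_Y_var(a, b))
curve(dnorm(x, 0, sd_Z), log(a) - log(b), log(b) - log(a),
      xlab = "z", ylab = "density")
curve(f_z(x, a, b), add = TRUE, col = 2)  # exact density, for comparison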


Now that is pretty cool. I should have figured there was a distribution for a ratio of two probabilities. I will certainly look into this. The thing is that X_1 and X_2 aren't really uniform, as that would imply that 100% testing of the population is possible (though with current demand, who knows). I'll have to think through this some more.

Paper is now available on a preprint server: https://osf.io/preprints/socarxiv/jp4wk/.
