CoDatMo Liverpool & UNINOVE models (slow ODE implementation and Trapezoidal Solver)

Great, I’ll do that. I showed the LCT and Erlang distributions and the whole idea to an epidemiologist and she got all excited. We are exploring this using Liverpool’s model with Brazil’s data.

It makes sense. I’ll explore how the sub-states are populated using either a fixed beta or a varying weekly beta. Thanks!
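(For reference, the linear chain trick behind those sub-states: splitting a compartment into $k$ exponential sub-states, each with rate $k\gamma$, gives an Erlang rather than exponential dwell time,

$$
T = \sum_{j=1}^{k} T_j, \qquad T_j \sim \text{Exponential}(k\gamma) \;\Rightarrow\; T \sim \text{Erlang}(k,\, k\gamma), \qquad \mathbb{E}[T] = \frac{1}{\gamma},
$$

so the mean dwell time stays $1/\gamma$ but the distribution becomes peaked rather than memoryless.)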

So I’ve got another, slightly related question:

In principle, the Liverpool model could have been run daily from scratch. Was this done? Or were you able to exploit information from previous runs?

Maybe @s.maskell can tell me something about that? (Or tag someone that can).

Thanks!

@Funko_Unko: We’ve been running the model on each of 100 (ish) CCGs every day since mid-2020 (modulo server downtime etc.). If we do that by re-fitting the model every day, we waste a lot of computation. @alphillips has therefore written a little adjunct to Stan that uses importance sampling to enable the samples for one day to be used for another. This saves us some computation and enables us to run more fits on a fixed computational resource. We could put that somewhere (e.g. on the CODATMO GitHub pages) if others would find it useful.

4 Likes

That would be wonderful, I’d like to check it out.

A CCG is some node on a computing cluster?

How much time gets saved? Does this approach then still require the occasional complete refit?

@Funko_Unko: sure we can share the code. @alphillips: can you pop what you’ve got on a new folder on CODATMO?

Sorry, a CCG is a region of the UK as perceived by people who do healthcare: Clinical commissioning group - Wikipedia. We run each CCG separately and use lots of cores, so saying that a CCG is a core is, by chance, a relatively accurate description!

Currently, we can reuse a fit (via the importance-sampling update) while the parameterisation stays the same. We consider piecewise linear segments whose pieces are a week long, so we can use the cheap update 6 days out of 7. The importance sampling is near-instantaneous, so we get a 7x average speed-up. We could over-parameterise (i.e. include next week’s parameter in the fit for today), but we assumed (we’ve not checked) that the prior would be such that Neff would become terrible.
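(Concretely, the reweighting uses the standard importance-sampling identity: draws $\theta^{(i)}$ from yesterday’s posterior $p(\theta \mid y_{1:t})$ get weights proportional to the likelihood of the newly arrived data,

$$
w^{(i)} \;\propto\; \frac{p(\theta^{(i)} \mid y_{1:t+1})}{p(\theta^{(i)} \mid y_{1:t})} \;\propto\; p(y_{t+1} \mid \theta^{(i)}, y_{1:t}),
$$

which is near-free to evaluate compared to a full refit.)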

Our longer term plan is to get away from fitting ODEs and use SDEs with proper streaming algorithms (see here: StanCon 2020. Talk 5: Simon Maskell. Towards an Interface for Streaming Stan - YouTube), but that’s not quite ready to use in anger yet.

3 Likes

@Funko_Unko, R package here: GitHub - codatmo/stanIncrementalImportanceSampling (basic R package to perform importance sampling to update fitted Stan models with new data). It consists of a single function, importance_sampling, that takes as input the original stanfit object, the original data as a list, and the original data plus the new data as a list. It then outputs the importance weights for each iteration.

You need to predict a week into the future, for example, so that the parameters for those days are already in the fitted model and can be reweighted when the new data become available.
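A minimal usage sketch based on that description (the model file and data object names here are placeholders, not from the package docs):

library(rstan)
library(stanIncrementalImportanceSampling)

# Fit once on the original data (placeholder model/data names).
fit <- stan("model.stan", data = data_old)

# When new data arrive, reweight yesterday's draws instead of refitting:
# inputs are the original stanfit, the original data as a list, and the
# original plus new data as a list; output is one weight per iteration.
w <- importance_sampling(fit, data_old, data_old_plus_new)

# Example: a weighted posterior mean for a parameter "beta".
weighted.mean(extract(fit)$beta, w = w)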

2 Likes

That’s great, thanks both of you, I’ll have a look.

If there is interest, I’d like to hook up some data-generating-process COVID simulations to this effort. What is the current model in play that I can run on a laptop?

I think recovering some known generating parameters could shed some light on the work. I am particularly interested in the integration of social media data into the model.

Here’s a page on what I have done; I have a busted connection to the early version of the Liverpool model.

https://codatmo.github.io/dataGeneratingProcess1/

I am also interested in integrating social media data into the model. From my inspection, CoDatMo’s Liverpool NHS 111 calls models do the opposite: they predict calls from the infections in the model. See lines 209-221 here.

calls_111_lagged_daily_infections

I think we should do something different: use the daily Twitter symptom mentions to help predict infected (or maybe even deaths).

Glad to hear opinions and thoughts about this.

I have retreated to the simple SIR model (Model reproduction checklist); I find it useful just because I can return to the case study writeup (linked above) if I get confused.

I think an appropriate way to add tweets is to just use them to predict infected (this is with simulated data) alongside the ODE-based negative binomial contribution. The model and generated quantities blocks are below (running code is test.R at https://github.com/codatmo/dataGeneratingProcess1.git on the tweetSimpleSIR branch):


model {
  // priors
  beta ~ normal(2, 1);
  gamma ~ normal(0.4, 0.5);
  phi_inv ~ exponential(5);

  a_twitter ~ normal(0, 1);
  beta_twitter ~ normal(0, 0.5);
  sigma_twitter ~ normal(0, 1);

  if (compute_likelihood == 1) {
    for (i in 1:n_days) {
      if (run_SIR == 1 && run_twitter == 1) {
        real tweet_val = a_twitter + symptomaticTweets[i] * beta_twitter;
        cases[i] ~ neg_binomial_2(infected_daily_counts[i] + tweet_val, phi);
      }
      else if (run_SIR == 1) {
        cases[i] ~ neg_binomial_2(infected_daily_counts[i], phi);
      }
      else if (run_twitter == 1) {
        cases[i] ~ normal(a_twitter + symptomaticTweets[i] * beta_twitter,
                          sigma_twitter);
      }
    }
  }
}
generated quantities {
  real pred_cases[n_days];
  for (i in 1:n_days) {
    if (run_twitter == 1 && run_SIR == 1) {
      real tweet_val = a_twitter + symptomaticTweets[i] * beta_twitter;
      pred_cases[i] = neg_binomial_2_rng(infected_daily_counts[i] + tweet_val, phi);
    }
    else if (run_twitter == 1) {
      pred_cases[i] =
          normal_rng(a_twitter + symptomaticTweets[i] * beta_twitter, sigma_twitter);
    }
    else if (run_SIR == 1) {
      pred_cases[i] = neg_binomial_2_rng(infected_daily_counts[i], phi);
    }
  }
}

Am I doing this right? The relevant case is the `run_twitter == 1 && run_SIR == 1` if statements in the model and generated quantities.

Fits somewhat.

Thanks

Breck

@breckbaldwin, we need to add a latent lag also.

I don’t think it is interesting to use tweets to predict infected: we don’t have any real data or estimate related to infected besides the I compartments in the ODE system. We should use tweets to predict deaths D, for which we have quite reliable data (actually the only real-world data we condition the model on are daily_deaths and daily_tweets_mentions).

Maybe something like:

parameters {
  ...
  real<lower=0> lag_tweets_deaths;
  ...
}
...
model {
  ...
  lag_tweets_deaths ~ lognormal(log(15), log(10)); // I don't know which prior to use
  ...
  for (i in 1:n_days) {
    deaths[i] ~ neg_binomial_2(daily_tweets_mentions[i - lag_tweets_deaths], phi_deaths);
  }
}

That is my main idea; I don’t know how to code this thing up specifically, but that is what I was planning to do with the daily Twitter symptom mentions.

The sources of info we have are:

  1. tweets + date
  2. deaths + date
  3. no hospital admission info or testing info

So the simplest model is just lagged tweets predicting deaths. Obviously too simple, but I’ll try to code it up with simulation backing it.

1 Like

@storopoli Won’t work: we can’t index with a real, which is what lag_tweets_deaths is.
We could do a floor or round; somehow that seems wrong.
Kicking over to Slack for now.
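(For the record, one standard workaround rather than floor/round is to keep the lag continuous and convolve instead of indexing: discretize a lognormal lag kernel over days and mix the tweet series through it. A minimal Stan sketch, with the data block omitted like the other snippets here and all names assumed, not actual CoDatMo code:

parameters {
  real<lower=0> lag_mean;      // mean tweet-to-death lag in days
  real<lower=0> lag_sd;        // spread (lognormal sigma) of the lag
  real<lower=0> alpha_tweets;  // tweets-to-expected-deaths scale
  real<lower=0> phi_deaths;    // negative binomial overdispersion
}
model {
  vector[max_lag] w;           // max_lag is an int passed in data
  lag_mean ~ lognormal(log(15), 0.5);
  lag_sd ~ normal(0, 1);
  alpha_tweets ~ normal(0, 1);
  phi_deaths ~ exponential(1);
  // Discretize the lognormal lag kernel on days 1..max_lag, normalize.
  for (d in 1:max_lag)
    w[d] = exp(lognormal_lpdf(d | log(lag_mean), lag_sd));
  w /= sum(w);
  for (i in 1:n_days) {
    real mu = 1e-6;            // keeps the NB mean strictly positive
    for (d in 1:min(max_lag, i - 1))
      mu += alpha_tweets * w[d] * daily_tweets_mentions[i - d];
    deaths[i] ~ neg_binomial_2(mu, phi_deaths);
  }
}

The lag parameters stay continuous and HMC-friendly, at the cost of an extra loop over the kernel support.)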

You could just add an additional compartment to model this, couldn’t you?

I assume so, I’ll

I thought the compartment models needed to sum to a constant population? At least the ones I have looked at do. So the tweet state is orthogonal to the SIR model or more complex versions of the same.
Thanks

Is the goal of the model to predict deaths or to understand something about the dynamics? If tweets just modify the predicted number of deaths compared to the SIR expectation, then they might improve prediction, but they will also partially decouple the SIR model from the only data (deaths) that exist to inform it. In particular, this will not allow the tweets data to inform the likely numbers of infected individuals, and therefore the ODE parameters; quite the opposite.

I would have thought that something like the Liverpool approach of using the number of infected to predict calls was closer to a plausible data generating process. The idea being that both (lagged) deaths and tweets are informative about the number of infected individuals in the population, and the causality runs in the direction of {number of infections} → {deaths, tweets}.

Yes, but you can extend the model however you like.

You could introduce a dummy compartment ExposedTwitter with the same inflow as the original exposed compartment (the number of newly infected) but with an outflow to a Twitter compartment whose members tweet about their symptoms and which has its own outflow into nothing.

You would then have to model the flows ExposedTwitter → Twitter → nothing.
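Something like this, maybe; a sketch of the ODE right-hand side only, where all names and the extra rates rho_tweet (ExposedTwitter → Twitter) and delta_tweet (Twitter → nothing) are my placeholders, not CoDatMo code:

functions {
  // SEIR right-hand side with two dummy tweet states appended.
  // E_tw mirrors E's inflow but drains out of the system, so the
  // core S+E+I+R total is untouched.
  vector seir_tweets(real t, vector y, real beta, real sigma, real gamma,
                     real rho_tweet, real delta_tweet, real N) {
    vector[6] dydt;
    real S = y[1];
    real E = y[2];
    real I = y[3];
    real R = y[4];
    real E_tw = y[5];                                  // ExposedTwitter
    real Tw = y[6];                                    // Twitter
    real new_infections = beta * S * I / N;
    dydt[1] = -new_infections;                         // S
    dydt[2] = new_infections - sigma * E;              // E
    dydt[3] = sigma * E - gamma * I;                   // I
    dydt[4] = gamma * I;                               // R
    dydt[5] = new_infections - rho_tweet * E_tw;       // ExposedTwitter
    dydt[6] = rho_tweet * E_tw - delta_tweet * Tw;     // Twitter -> nothing
    return dydt;
  }
}

The expected daily tweet count could then be tied to the flow rho_tweet * E_tw (or to Tw itself) in the likelihood.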

The goal is to show “how social media monitoring with a pre-trained classifier can help epidemiological Bayesian models to better infer/predict”. We will demonstrate that with Brazilian tweets, COVID-19, and an SEIR-substate model.

Jacob, I agree, but I don’t know how right or wrong the infected I = I_1 + I_2 in the SEITD model is, since infections are underreported and we don’t randomly test people in Brazil. I would very much prefer to condition calls/symptoms on real data so that we can compare and understand what is going on.

Yes, that sounds like a great approach. I would definitely need some help on what that would look like in the ODE system. But, kinda pulling the handbrake: what are your opinions on using infected data generated by the SEITD ODE system? We don’t have good data to do any predictive check on the infected I.

1 Like

Hm, depends on what you want to use it for. Infer the proportion infected? Sure, but for this the cumulative number of deaths should be sufficient. I don’t think you gain a lot of (or any?) extra information on the IFR, which is essentially all you need for this.

Or what do you want to use the infected data for?
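(To spell that out: with deaths as the only reliable observable, the infection total is essentially pinned down by the IFR via

$$
\text{infections up to } t \;\approx\; \frac{\text{cumulative deaths at } t + \text{lag}}{\text{IFR}},
$$

so “infected data” generated by the model itself adds little beyond what deaths and the IFR already imply.)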

I think what you can do is compare the death predictions a few weeks into the future. This you should be able to use to compare different models. And hopefully a model with extra info (Twitter / Calls) performs better.
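(A standard way to score such held-out forecasts is the log predictive density of the realized deaths under each model,

$$
\text{score}(M) \;=\; \sum_{t \in \text{holdout}} \log p\big(\text{deaths}_t \mid y_{\text{train}}, M\big),
$$

higher being better; a model whose Twitter or calls stream is genuinely informative should win here.)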

@remoore , we have some new insights.

We went through the Brazilian National Health System (SUS) data (DATASUS) and found the data for hospitalizations due to Severe Acute Respiratory Illness (SARI). We only used data for 2020 and only the hospitalizations with a PCR-positive COVID test (~66% of the hospitalizations).

We took the difference in days between hospitalization and death or hospital discharge (what is called “alta” in Spanish and Portuguese). I only have the mean, which is 16.4 days (@andrelmfsantos, the PhD candidate who did this, will generate some figures and more summary stats).

Looking at SEITD vs SEEIITTD, we have the following posterior summaries (using the same priors as Liverpool and Cambridge) for dT (the mean time for which individuals are terminally ill):

r$> seitd$summary(variables = "dT")                                                                                                                 
# A tibble: 1 x 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 dT        16.2   16.2 0.691 0.689  15.0  17.3  1.00    3628.    2971.

r$> seeiittd$summary(variables = "dT")                                                                                                              
# A tibble: 1 x 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 dT        16.4   16.4 0.706 0.733  15.3  17.5  1.00    1037.    3074.

I know that the difference is small, but due to the huge amount of data a 0.2 difference in the mean estimate of dT is a big difference in inference. And the SEEIITTD estimate (16.4) is more aligned with the real-world data (which the model has not seen; it only uses daily deaths).

I have not yet analyzed the full posterior or the density plots of the hospitalization/discharge day differences in DATASUS.

3 Likes