Great, I’ll do that. I showed the LCT and Erlang distributions and the whole idea to an epidemiologist, and she got all excited. We are exploring this using Liverpool’s model with Brazil’s data.
It makes sense. I’ll explore how the sub-states are populated using either a fixed beta or a varying weekly beta. Thanks!
@Funko_Unko: We’ve been running the model on each of 100 (ish) CCGs every day since mid-2020 (modulo server downtime etc). If we do that by re-fitting the model every day, we waste a lot of computation. @alphillips has therefore written a little adjunct to Stan that uses importance sampling to enable the samples for one day to be used for another. This saves us some computation and enables us to run more fits on a fixed computational resource. We could put that somewhere (eg on the CODATMO github pages) if others would find it useful.
@Funko_Unko: sure we can share the code. @alphillips: can you pop what you’ve got on a new folder on CODATMO?
Sorry, a CCG is a region of the UK as perceived by people who do healthcare: Clinical commissioning group - Wikipedia. We run each CCG separately and use lots of cores, so saying that a CCG is a core is, by chance, a relatively accurate description!
Currently, we can refit while the parameterisation is the same. Since we consider piecewise linear segments and the pieces are a week long, we refit for 6 days out of 7. The importance sampling is near-instantaneous, so we get a 7x average speed-up. We could over-parameterise (ie include next week’s parameter in the fit for today) but we assumed (we’ve not checked) that the prior would be such that Neff would become terrible.
You need to predict a week into the future for example so that the parameters for those days are there in the fitted model to be reweighted with the new data when it’s available.
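The reweighting itself is standard importance sampling. A minimal sketch of the idea (this is not the actual adjunct @alphillips wrote; `new_deaths`, `mu`, `t_new`, and `phi` are placeholder names): treat yesterday’s posterior draws as the proposal and weight each draw by the likelihood of the newly arrived observation.

```stan
generated quantities {
  // Today's posterior satisfies
  //   p(theta | y_{1:t}) propto p(theta | y_{1:t-1}) * p(y_t | theta),
  // so with yesterday's draws as the proposal, the (unnormalised)
  // importance weight of each draw is exp(log_lik_new).
  real log_lik_new = neg_binomial_2_lpmf(new_deaths | mu[t_new], phi);
}
```

Normalising exp(log_lik_new) over the draws gives the importance weights; Pareto-smoothed importance sampling (eg via the loo package) provides a k-hat diagnostic that flags when the weights are too heavy-tailed and a full refit is needed.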
If there is interest I’d like to hook up some data generating process COVID simulations to this effort. What is the current model in play that I can run on a laptop?
I think recovering some known generating parameters could shed some light on the work. I am particularly interested in the integration of social media data into the model.
There is a page on what I have done; I have a broken connection to the early version of the Liverpool model.
I am also interested in integration of social media data into the model. From my inspection, CoDatMo’s Liverpool NHS 111 calls model does the opposite: it predicts calls from the infections in the model. See lines 209-221 here.
calls_111_lagged_daily_infections
I think we should do something different: use the daily Twitter symptom mentions to help predict infected (or maybe even deaths).
I have retreated to the simple SIR model (Model reproduction checklist); I find it useful just because I can return to the case study writeup (linked above) if I get confused.
I think an appropriate way to add tweets is to just use them to predict infected (this is with simulated data), alongside the ODE-based negative binomial contribution. The model and generated quantities blocks are below (running code is test.R at https://github.com/codatmo/dataGeneratingProcess1.git on the tweetSimpleSIR branch):
model {
  // priors
  beta ~ normal(2, 1);
  gamma ~ normal(0.4, 0.5);
  phi_inv ~ exponential(5);
  a_twitter ~ normal(0, 1);
  beta_twitter ~ normal(0, 0.5);
  sigma_twitter ~ normal(0, 1);
  if (compute_likelihood == 1) {
    for (i in 1:n_days) {
      if (run_SIR == 1 && run_twitter == 1) {
        // tweets shift the negative binomial mean alongside the ODE output
        real tweet_val = a_twitter + symptomaticTweets[i] * beta_twitter;
        cases[i] ~ neg_binomial_2(infected_daily_counts[i] + tweet_val, phi);
      } else if (run_SIR == 1) {
        cases[i] ~ neg_binomial_2(infected_daily_counts[i], phi);
      } else if (run_twitter == 1) {
        cases[i] ~ normal(a_twitter + symptomaticTweets[i] * beta_twitter,
                          sigma_twitter);
      }
    }
  }
}
generated quantities {
  real pred_cases[n_days];
  for (i in 1:n_days) {
    if (run_twitter == 1 && run_SIR == 1) {
      real tweet_val = a_twitter + symptomaticTweets[i] * beta_twitter;
      pred_cases[i] = neg_binomial_2_rng(infected_daily_counts[i] + tweet_val, phi);
    } else if (run_twitter == 1) {
      pred_cases[i] = normal_rng(a_twitter + symptomaticTweets[i] * beta_twitter,
                                 sigma_twitter);
    } else if (run_SIR == 1) {
      pred_cases[i] = neg_binomial_2_rng(infected_daily_counts[i], phi);
    }
  }
}
Am I doing this right? The relevant case is the `run_twitter == 1 && run_SIR == 1` branch in the model and generated quantities blocks.
I don’t think it is interesting to use tweets to predict infected: we don’t have any real data or estimate related to infected besides the I compartments in the ODE system. We should use tweets to predict deaths D, for which we have quite reliable data (actually the only real-world data we condition the model on are daily_deaths and daily_tweets_mentions).
Maybe something like:
parameters {
  ...
  real<lower=0> lag_tweets_deaths;
  ...
}
...
model {
  ...
  lag_tweets_deaths ~ lognormal(log(15), log(10)); // I don't know which prior to use
  ...
  for (i in 1:n_days) {
    deaths[i] ~ neg_binomial_2(daily_tweets_mentions[i - lag_tweets_deaths], phi_deaths);
  }
}
I think that is my main idea, I don’t know how to specifically code this thing up but that is what I was planning to do with daily twitter symptoms.
@storopoli That won’t work: you can’t index with a real, which is what lag_tweets_deaths is. We could floor or round it, but that somehow seems wrong.
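One way to keep the lag continuous without indexing by a real is to discretise a parametric delay distribution and convolve it with the tweet series. A hedged sketch, not the existing model: `daily_tweets`, `max_lag` (data), `alpha_tweets`, and the lognormal delay are all my assumptions, and evaluating the lognormal density at integer points is an unnormalised simplification.

```stan
parameters {
  real lag_mu;                 // log-scale mean of the tweet-to-death delay
  real<lower=0> lag_sigma;     // log-scale sd of the delay
  real<lower=0> alpha_tweets;  // scaling from tweet counts to expected deaths
}
model {
  for (i in 1:n_days) {
    real mu = 1e-5;  // small offset keeps the negative binomial mean positive
    for (j in 1:min(i - 1, max_lag)) {
      // weight of a j-day delay under a (discretised) lognormal
      real w = exp(lognormal_lpdf(j | lag_mu, lag_sigma));
      mu += alpha_tweets * w * daily_tweets[i - j];
    }
    deaths[i] ~ neg_binomial_2(mu, phi_deaths);
  }
}
```

This keeps lag_mu and lag_sigma continuous, so the sampler can still explore the delay, while the sum over j only ever uses integer indices.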
Kicking over to slack for now.
I thought the compartment models needed to sum to a constant population? At least the ones I have looked at do. So the Tweet state is orthogonal to the SIR model, or to more complex versions of the same.
thanks
Is the goal of the model to predict deaths or to understand something about the dynamics? If tweets just modify the predicted number of deaths compared to the SIR expectation, then they might improve prediction, but they will also partially decouple the SIR model from the only data (deaths) that exist to inform it. In particular, this will not allow the tweets data to inform the likely numbers of infected individuals, and therefore the ODE parameters; quite the opposite.
I would have thought that something more like the usage of the Liverpool approach of using the number of infected to predict calls was closer to a plausible data generating process. The idea being that both (lagged) deaths and tweets are informative about the number of infected individuals in the population, and the causality runs in the direction of {number of infections} → {deaths, tweets}.
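In that direction, the tweet likelihood would mirror the Liverpool calls likelihood. A sketch, reusing names from the code above (`rho_tweets`, `tweet_lag`, and `phi_tweets` are hypothetical):

```stan
model {
  // tweets as a noisy, lagged readout of infections,
  // in the spirit of calls_111_lagged_daily_infections
  for (i in (tweet_lag + 1):n_days) {
    symptomaticTweets[i] ~ neg_binomial_2(
        rho_tweets * infected_daily_counts[i - tweet_lag], phi_tweets);
  }
}
```

Because tweets sit on the left of the sampling statement, they now inform the latent infections (and hence the ODE parameters) rather than just adjusting the death predictions.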
Yes, but you can extend the model however you like.
You could introduce a dummy compartment ExposedTwitter with the same inflow as the original exposed compartment (the number of infected) but with an outflow to a Twitter compartment whose members tweet about their symptoms and which has its own outflow into nothing.
You would then have to model the flows ExposedTwitter → Twitter → Nothing.
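A hedged sketch of what those extra compartments could look like inside the ODE right-hand side (assuming a plain SEIR backbone; `beta_tw` and `gamma_tw` are hypothetical rates into and out of the tweeting state):

```stan
functions {
  vector seir_twitter(real t, vector y, real beta, real sigma, real gamma,
                      real beta_tw, real gamma_tw, real N) {
    vector[6] dydt;
    real S = y[1];
    real E = y[2];
    real I = y[3];
    real R = y[4];
    real E_tw = y[5];
    real Tw = y[6];
    real infections = beta * S * I / N;
    dydt[1] = -infections;              // S
    dydt[2] = infections - sigma * E;   // E
    dydt[3] = sigma * E - gamma * I;    // I
    dydt[4] = gamma * I;                // R
    // dummy branch: same inflow as E, own outflow to a tweeting state
    dydt[5] = infections - beta_tw * E_tw;     // ExposedTwitter
    dydt[6] = beta_tw * E_tw - gamma_tw * Tw;  // Twitter -> nothing
    return dydt;
  }
}
```

Because the branch is a copy of the inflow rather than a drain on S-E-I-R, the original population still sums to N, which answers the constant-population concern above.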
The goal is to show “how social media monitoring with a pre-trained classifier can help epidemiological Bayesian models to better infer/predict”. We will demonstrate that with Brazilian tweets, COVID-19, and an SEIR sub-state model.
Jacob, I agree, but I don’t know how right or wrong the infected I = I_1 + I_2 in the SEITD model is, since infected is underreported and we also don’t randomly test people in Brazil. I would very much prefer to condition calls/symptoms on real data so that we can compare and understand what is going on.
Yes, that sounds like a great approach. I would definitely need some help on what that would look like in the ODE system. But, kind of pulling the handbrake: what are your opinions on using infected data generated by the SEITD ODE system? We don’t have good data to do any predictive check on the infected I.
Hm, depends on what you want to use it for. Infer the proportion infected? Sure, but for this the cumulative number of deaths should be sufficient. I don’t think you gain a lot of (or any?) extra information on the IFR, which is essentially all you need for this.
Or what do you want to use the infected data for?
I think what you can do is compare the death predictions a few weeks into the future. This you should be able to use to compare different models. And hopefully a model with extra info (Twitter / Calls) performs better.
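One concrete way to score that comparison (a sketch; `n_holdout`, `holdout_deaths`, and a latent mean `mu` simulated forward past the fitting window are assumed names, not the existing model):

```stan
generated quantities {
  vector[n_holdout] log_lik_holdout;
  for (i in 1:n_holdout) {
    // pointwise log score for deaths the model was not fitted on;
    // mu must be propagated beyond n_days by the ODE solution
    log_lik_holdout[i] =
        neg_binomial_2_lpmf(holdout_deaths[i] | mu[n_days + i], phi_deaths);
  }
}
```

Summing log_lik_holdout over the held-out days gives an expected log predictive density per model, so the with-Twitter and without-Twitter variants can be ranked on the same future data.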
We went through the Brazilian National Health System (SUS) data (DATASUS) and found the data for hospitalizations due to Severe Acute Respiratory Illness (SARI). We only used data for 2020 and only the hospitalizations with a PCR-positive COVID test (~66% of the hospitalizations).
We took the difference in days between hospitalization and death or hospital discharge (what is called “alta” in Spanish and Portuguese). I only have the mean (@andrelmfsantos, a PhD candidate who did this, will generate some figures and more summary stats), which is 16.4 days.
Looking at SEITD vs SEEIITTD, we have the following posterior density (using the same priors as Liverpool and Cambridge) for dT (the mean time for which individuals are terminally ill):
I know that the difference is small, but due to the huge amount of data, a 0.2 difference in the mean estimate of dT is a big difference in inference. And the SEEIITTD estimate is more aligned with the real-world data (which the model has not seen; it only uses daily deaths).
I have not yet analyzed the full posterior or the density plots of the hospitalizations/discharges difference in DATASUS.