Hi there,
I’m new to Stan and to statistical modelling generally (I come from a programming and, to a lesser extent, machine learning background) so bear with me here. As an exercise, I thought I’d try – like everyone these days, I suppose – modelling the cumulative number of COVID-19 cases per country.
As a first, simple, version, I tried this: the number of new cases every day equals the number of total cases x the average number of interactions every person has x the probability of an interaction resulting in an infection. (Obviously there are a lot of assumptions there.) Here is my code:
data {
int<lower=1> N; // number of observations
int<lower=1> K; // number of countries
int<lower=0> y[N]; // observations
int s[K]; // number of observations per country
// making predictions
int<lower=1> N_test; // number of observations
int<lower=1> M_test; // number of desired predictions
int<lower=1> k_test; // index of country
int<lower=0> y_test[N_test]; // observations
}
parameters {
real<lower=0, upper=1> transmission_prob;
real<lower=0> interaction_rate;
}
model {
int pos;
int interactions;
int new_cases;
interaction_rate ~ normal(20, 20);
transmission_prob ~ normal(0.125, 0.05);
pos = 1;
new_cases = 0;
interactions = 0;
for (k in 1:K) { // countries
for (n in (pos+1):(s[k]+1)) { // days for country
interactions ~ poisson(y[n-1] * interaction_rate);
new_cases ~ binomial(interactions, transmission_prob);
y[n] ~ poisson(y[n-1] + new_cases);
pos = pos + s[k];
}
}
}
generated quantities {
int predictions[M_test]; // predictions
int predicted_interactions = 0;
int predicted_new_cases = 0;
int previous_value = y_test[N_test];
for (m in 1:M_test) {
predicted_interactions = poisson_rng(previous_value * interaction_rate);
predicted_new_cases = binomial_rng(predicted_interactions, transmission_prob);
predictions[m] = poisson_rng(previous_value + predicted_new_cases);
previous_value = predictions[m];
}
}
The observations for a country is just a list of integers – the cumulative number of cases in that country per day.
Now, when I run this I get:
Inference for Stan model: anon_model_39b4efac40a9cf503a76a2620468306f.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
transmission_prob 0.13 1.6e-3 0.05 0.03 0.09 0.12 0.16 0.23 1048 1.0
interaction_rate 2.0e-3 4.6e-5 1.9e-3 5.0e-5 5.7e-4 1.4e-3 2.7e-3 7.3e-3 1729 1.0
predictions[1] 1.8e4 3.19 134.83 1.7e4 1.8e4 1.8e4 1.8e4 1.8e4 1782 1.0
predictions[2] 1.8e4 4.11 185.51 1.7e4 1.8e4 1.8e4 1.8e4 1.8e4 2039 1.0
predictions[3] 1.8e4 5.12 231.79 1.7e4 1.8e4 1.8e4 1.8e4 1.8e4 2048 1.0
predictions[4] 1.8e4 6.03 268.64 1.7e4 1.8e4 1.8e4 1.8e4 1.8e4 1985 1.0
predictions[5] 1.8e4 6.71 302.2 1.7e4 1.7e4 1.8e4 1.8e4 1.8e4 2028 1.0
predictions[6] 1.8e4 7.51 336.27 1.7e4 1.7e4 1.8e4 1.8e4 1.8e4 2002 1.0
predictions[7] 1.8e4 8.47 364.17 1.7e4 1.7e4 1.8e4 1.8e4 1.8e4 1850 1.0
predictions[8] 1.8e4 9.35 385.18 1.7e4 1.7e4 1.8e4 1.8e4 1.8e4 1696 1.0
predictions[9] 1.8e4 9.81 408.86 1.7e4 1.7e4 1.8e4 1.8e4 1.9e4 1736 1.0
predictions[10] 1.8e4 10.33 435.93 1.7e4 1.7e4 1.8e4 1.8e4 1.9e4 1780 1.0
predicted_interactions 34.74 0.83 35.09 0.0 10.0 24.0 48.55 129.19 1784 1.0
predicted_new_cases 4.26 0.12 5.12 0.0 1.0 3.0 6.0 18.0 1806 1.0
previous_value 1.8e4 10.33 435.93 1.7e4 1.7e4 1.8e4 1.8e4 1.9e4 1780 1.0
lp__ 79.61 0.08 1.67 75.45 78.85 80.01 80.85 81.59 486 1.0
I know it’s a very simple model, but I figured it would at least be able to predict some non-linear growth – instead it’s just a flat line, so probably I’m doing something wrong. Are any of my assumptions unfounded? Are my choices in sampling distributions sensible? I’d really appreciate any advice here!