Probably I'm doing something wrong ... (beginner trying to predict epidemic growth)

Hi there,

I’m new to Stan and to statistical modelling generally (I come from a programming and, to a lesser extent, machine learning background) so bear with me here. As an exercise, I thought I’d try – like everyone these days, I suppose – modelling the cumulative number of COVID-19 cases per country.

As a first, simple version, I tried this: the number of new cases every day equals the total number of cases × the average number of interactions each person has × the probability of an interaction resulting in an infection. (Obviously there are a lot of assumptions there.) Here is my code:

data {
    int<lower=1> N; // number of observations
    int<lower=1> K; // number of countries
    int<lower=0> y[N]; // observations
    int s[K]; // number of observations per country

    // making predictions
    int<lower=1> N_test; // number of test observations
    int<lower=1> M_test; // number of desired predictions
    int<lower=1> k_test; // index of country
    int<lower=0> y_test[N_test]; // observations
}
parameters {
    real<lower=0, upper=1> transmission_prob;
    real<lower=0> interaction_rate;
}
model {
    int pos;
    int interactions;
    int new_cases;

    interaction_rate ~ normal(20, 20);
    transmission_prob ~ normal(0.125, 0.05);

    pos = 1;
    new_cases = 0;
    interactions = 0;

    for (k in 1:K) { // countries
        for (n in (pos+1):(s[k]+1)) { // days for country
            interactions ~ poisson(y[n-1] * interaction_rate);
            new_cases ~ binomial(interactions, transmission_prob);
            y[n] ~ poisson(y[n-1] + new_cases);
            pos = pos + s[k];
        }
    }
}
generated quantities {
    int predictions[M_test]; // predictions
    int predicted_interactions = 0;
    int predicted_new_cases = 0;
    int previous_value = y_test[N_test];
    for (m in 1:M_test) {
        predicted_interactions = poisson_rng(previous_value * interaction_rate);
        predicted_new_cases = binomial_rng(predicted_interactions, transmission_prob);
        predictions[m] = poisson_rng(previous_value + predicted_new_cases);
        previous_value = predictions[m];
    }
}

The observations for a country are just a list of integers – the cumulative number of cases in that country per day.

Now, when I run this I get:

Inference for Stan model: anon_model_39b4efac40a9cf503a76a2620468306f.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

                         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
transmission_prob        0.13  1.6e-3   0.05   0.03   0.09   0.12   0.16   0.23   1048    1.0
interaction_rate       2.0e-3  4.6e-5 1.9e-3 5.0e-5 5.7e-4 1.4e-3 2.7e-3 7.3e-3   1729    1.0
predictions[1]          1.8e4    3.19 134.83  1.7e4  1.8e4  1.8e4  1.8e4  1.8e4   1782    1.0
predictions[2]          1.8e4    4.11 185.51  1.7e4  1.8e4  1.8e4  1.8e4  1.8e4   2039    1.0
predictions[3]          1.8e4    5.12 231.79  1.7e4  1.8e4  1.8e4  1.8e4  1.8e4   2048    1.0
predictions[4]          1.8e4    6.03 268.64  1.7e4  1.8e4  1.8e4  1.8e4  1.8e4   1985    1.0
predictions[5]          1.8e4    6.71  302.2  1.7e4  1.7e4  1.8e4  1.8e4  1.8e4   2028    1.0
predictions[6]          1.8e4    7.51 336.27  1.7e4  1.7e4  1.8e4  1.8e4  1.8e4   2002    1.0
predictions[7]          1.8e4    8.47 364.17  1.7e4  1.7e4  1.8e4  1.8e4  1.8e4   1850    1.0
predictions[8]          1.8e4    9.35 385.18  1.7e4  1.7e4  1.8e4  1.8e4  1.8e4   1696    1.0
predictions[9]          1.8e4    9.81 408.86  1.7e4  1.7e4  1.8e4  1.8e4  1.9e4   1736    1.0
predictions[10]         1.8e4   10.33 435.93  1.7e4  1.7e4  1.8e4  1.8e4  1.9e4   1780    1.0
predicted_interactions  34.74    0.83  35.09    0.0   10.0   24.0  48.55 129.19   1784    1.0
predicted_new_cases      4.26    0.12   5.12    0.0    1.0    3.0    6.0   18.0   1806    1.0
previous_value          1.8e4   10.33 435.93  1.7e4  1.7e4  1.8e4  1.8e4  1.9e4   1780    1.0
lp__                    79.61    0.08   1.67  75.45  78.85  80.01  80.85  81.59    486    1.0

I know it’s a very simple model, but I figured it would at least be able to predict some non-linear growth – instead it’s just a flat line, so probably I’m doing something wrong. Are any of my assumptions unfounded? Are my choices in sampling distributions sensible? I’d really appreciate any advice here!

My quick advice is to forget about Poisson, as this can cause all sorts of problems. If you’re just modeling totals, I recommend treating these counts as continuous measurements. Also I’d keep parameters on unit scale, so instead of having a parameter near 20, let 20 be a multiplier. This won’t fix your problems – perhaps there’s some nonidentifiability in your model, and I recommend you fix it using the usual workflow of model simplification and fake-data simulation – but it could help once you have the model working.
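
For example, the unit-scale version could look something like this (just a sketch – the parameter names are made up):

parameters {
    real<lower=0> interaction_rate_unit; // kept near unit scale
}
transformed parameters {
    // the fixed multiplier 20 carries the scale, not the parameter
    real<lower=0> interaction_rate = 20 * interaction_rate_unit;
}
model {
    interaction_rate_unit ~ normal(1, 1); // prior on unit scale
}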

3 Likes

I am just a novice user, but I have a comment about the part of the model quoted above. It looks to me as if the intention is that a new value of new_cases is generated and then added to the previous y value in the argument of the poisson function. But my understanding of sampling statements of the form

n ~ binomial(N, theta)

is that they increment the value of the target log probability density; they do not actually change the value of n. That is, the above statement is very similar to the following (differing only by dropped constants):

target += binomial_lpmf(n | N, theta);

If that is correct, then new_cases is always zero and no growth would be seen.
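
To illustrate with a toy program (not the model above): in the following, n is data and is never modified by the sampling statement; both lines change only the accumulated log density.

data {
    int<lower=0> N;
    int<lower=0, upper=N> n;
}
parameters {
    real<lower=0, upper=1> theta;
}
model {
    n ~ binomial(N, theta);
    // equivalent up to an additive constant:
    // target += binomial_lpmf(n | N, theta);
}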

I hope I am not wasting people’s time with my own confusion.

3 Likes

I have noticed 3 things:

If s[K] has the number of observations per country, why is it not interacting with y[N] directly?

 int s[K];

The second thing is that you are trying to do something like an integer auto-regressive model plus new infections, so maybe you could use info from this post: Log-linear auto-regressive

        y[n] ~ poisson(y[n-1] + new_cases);
        pos = pos + s[k]; // s[k] not affecting y[n] directly!

And third, the new cases in your model are sampled using the whole information in your process up to the last time point. Why don’t you use the information in the difference between one time point and the previous one? Something like this:

new_cases ~ binomial(y[n] - y[n-1], transmission_prob);

On top of the recommendations above about not using Poisson directly (you could use poisson_log instead), you should check this post:

Modeling Poisson rate

In some parts of it, they talk about how to use poisson_glm models mixed with auto-regressions to simulate the dynamics. That could be a way of combining s[k] and y[n-1] in your model. You should also look up Poisson processes; your model looks like one.
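
For concreteness, a poisson_log autoregression on a single series might look roughly like this (my own sketch, not code from the linked posts):

data {
    int<lower=2> N;
    int<lower=0> y[N];
}
parameters {
    real alpha;
    real beta;
}
model {
    alpha ~ normal(0, 1);
    beta ~ normal(0, 1);
    for (n in 2:N)
        // poisson_log puts the rate on the log scale, so the
        // linear predictor does not need a positivity constraint
        y[n] ~ poisson_log(alpha + beta * log1p(y[n-1]));
}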

2 Likes

Thanks everyone for your replies! Answering in reverse order –

If s[K] has the number of observations per country, why is it not interacting with y[N] directly?

s[k] is the number of observations for country k. For instance, if I have a 12-day time series for the Netherlands, & if the Netherlands has index k = 9, then s[9] == 12.

And third, the new cases in your model are sampled using the whole information in your process up to the last time point. Why don’t you use the information in the difference between one time point and the previous one?

I considered modelling just the case Δ instead of the cumulative, but I figured it wouldn’t make much difference in the end. It’s not clear to me why one is better than the other. Thanks for the additional resources, I’ll have a look at them.

I am just a novice …

I don’t fully understand this, I guess because I don’t fully understand how Stan and Bayesian inference in general actually work. ;-)

Two questions aimed at increasing my understanding:

  1. When I write, say, y[n] ~ normal(theta, sigma) & train the model with N observations, my impression was that it will adjust theta & sigma such that they produce a distribution that best describes the observations (I may be using totally wrong terminology here). Is that so?
  2. Now that I look back at the statement y[n] ~ poisson(y[n-1] + new_cases), I realise I don’t exactly know what it means. (See below.) However, I don’t understand why this would cause new_cases to be zero. y[n] & y[n-1] would usually be different. Care to expand?

My quick advice is to forget about Poisson, as this can cause all sorts of problems.

Here, I think, my ignorance of statistics is showing. It seemed to me like just the kind of thing the Poisson distribution was made for (a given number of independent events occurring in a fixed time interval). But now I suspect it’s wrong because y[n-1] changes every time, such that there is no real fixed mean.

I recommend treating these counts as continuous measurements. Also I’d keep parameters on unit scale, so instead of having a parameter near 20, let 20 be a multiplier. This won’t fix your problems – perhaps there’s some nonidentifiability in your model, and I recommend you fix it using the usual workflow of model simplification and fake-data simulation – but it could help once you have the model working.

Thanks, I’ll try this. From what I gather, fake-data simulation simply means starting “at the other end” – with generating data that looks kind of like the real data, & then using the same method to create a model to describe it?

I really don’t get this

Ok, when we recommend not using the Poisson, it is because discrete random variables are really bad for Stan’s algorithms – Stan needs to take derivatives! So there are some tricks and alternatives.

When I write, say, y[n] ~ normal(theta, sigma) & train the model with N observations, my impression was that it will adjust theta & sigma such that they produce a distribution that best describes the observations (I may be using totally wrong terminology here). Is that so?

Mmm, yes, but in this case you have time-series data, so maybe fitting a static mean is not a good idea.

Now that I look back at the statement y[n] ~ poisson(y[n-1] + new_cases), I realise I don’t exactly know what it means. (See below.) However, I don’t understand why this would cause new_cases to be zero. y[n] & y[n-1] would usually be different. Care to expand?

That is because the principal form you have in your model is like a simple state-space model, in particular an exponential smoothing / moving-average model, and this kind of model always forecasts with the mean! If you want more forecasting power you can use something else, like a DLM (dynamic linear model).

Sorry I cannot help you more – I am still studying Poisson time-series models for count data, so I don’t have my ideas clear! But let me see what else I can do :) :)

1 Like

That seems like a reasonable rough statement of the process. I would caution that the fitting process does not look for the unique “best” values of the parameters; it maps out the entire region of the parameter space where most of the joint probability of the parameters is, given the data. The goal is to learn all of the values of theta and sigma that are reasonably plausible. (No guarantees on my terminology!)

It seemed to me, and I may be wrong, that you expect

new_cases ~ binomial(interactions, transmission_prob);

to generate values of new_cases to be used in the line that follows it. Instead, it increments the target quantity with the log probability of the values of interactions and transmission_prob, given that new_cases is zero. The zero value of new_cases is set once in the code and never changed, as far as I can see.

I am really struggling to make myself clear on that last point, so do not hesitate to ask me to try explaining it again.
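
One more thing that might help: the intended generative process can be written without the latent integers at all. A Poisson(λ) count thinned by a binomial with success probability p is itself Poisson(λp), so interactions and new_cases can be marginalized out. A sketch (my own, assuming the cumulative counts never decrease and y[n-1] > 0):

data {
    int<lower=2> N;
    int<lower=0> y[N]; // cumulative cases, assumed non-decreasing
}
parameters {
    real<lower=0> interaction_rate;
    real<lower=0, upper=1> transmission_prob;
}
model {
    interaction_rate ~ normal(20, 20);
    transmission_prob ~ normal(0.125, 0.05);
    for (n in 2:N)
        // daily new cases: Poisson(y[n-1] * rate) thinned with
        // probability prob is Poisson(y[n-1] * rate * prob)
        y[n] - y[n-1] ~ poisson(y[n-1] * interaction_rate * transmission_prob);
}

Note that interaction_rate and transmission_prob then enter the likelihood only through their product, so the data can pin down the product but not the two factors separately – which makes the nonidentifiability mentioned above explicit.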

2 Likes

It’s not because of this part, but because of new_cases = 0 which you have slightly above that part in the model block. In this case the statement y[n] ~ poisson(y[n-1] + new_cases) does not give any information about your parameters, because new_cases remains 0 and y[n] and y[n-1] are data.

You also have interactions = 0 which also causes interaction_rate to become very small (as you can see from the output, Stan thinks it’s somewhere between 5e-5 and 7.3e-3 which is very precise given the prior of normal(20, 20)).

Maybe you can explain your model a bit more in words or formulas, so we can better figure out how to set that up in Stan.

[edit: I started writing this reply before @FJCC posted his, so I missed that - I think his explanation is very clear, so feel free to ignore this :)]

3 Likes

Thanks @FJCC and @MauritsM (and everyone else), this has cleared up quite a bit of confusion for me.

I went back to basics and created a spreadsheet to quickly make a basic model for generating data, plotted the data to see if it looked kind of similar to real data, and then, once I had something that did, went back to Stan and implemented a very basic model:

r_zero ~ normal(0.5, 0.3); // mean ~2.5, sd ~1.5, both divided by 5
sigma ~ cauchy(0, 10);

for (k in ...) { // countries
    for (n in ...) { // days for country
        y[n] ~ normal(y[n-1] + y[n-1] * r_zero * 5, sigma);
    }
}

This predicted growth much more nicely (the last 10 values on the x-axis being predictions):

[plot omitted: cumulative cases per day, with the last 10 points predicted]

So now I have a good starting point, and also a better understanding of how Stan actually works. :-)
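
For later readers: the elided loop bounds above can be filled in with the same pos / s[k] bookkeeping as the first model. A sketch, assuming y is declared as a real-valued vector (per the earlier advice to treat the counts as continuous):

model {
    int pos = 1;
    r_zero ~ normal(0.5, 0.3);
    sigma ~ cauchy(0, 10);
    for (k in 1:K) { // countries
        // autoregress only within country k's segment of y
        for (n in (pos + 1):(pos + s[k] - 1))
            y[n] ~ normal(y[n-1] + y[n-1] * r_zero * 5, sigma);
        pos = pos + s[k];
    }
}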

5 Likes

Nice!

My only doubt now is: why do you multiply by five? Why don’t you fit the autoregressive parameter with Stan?

autoregressive: y[n] = a * y[n-1], where a is an autoregressive parameter
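
For example, fitting a directly on a single series (a minimal sketch with made-up names):

data {
    int<lower=2> N;
    vector[N] y;
}
parameters {
    real<lower=0> a;     // autoregressive parameter, estimated by Stan
    real<lower=0> sigma;
}
model {
    a ~ normal(1, 0.5);  // prior centered near no-change (a = 1)
    sigma ~ cauchy(0, 10);
    for (n in 2:N)
        y[n] ~ normal(a * y[n-1], sigma);
}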

Now your model has a more studied form; it looks like a dynamic model with a deterministic trend! Rob Hyndman talks a lot about those models; if you are interested you could check some additional info here:
Dynamic regression

It’s so good that you are getting involved with Stan – I hope you can fit your model :) :)

My only doubt now is: why do you multiply by five? Why don’t you fit the autoregressive parameter with Stan?

I did it based on the advice given above:

Also I’d keep parameters on unit scale, so instead of having a parameter near 20, let 20 be a multiplier.

But perhaps I misunderstood it?

Thank you for the additional resource.

1 Like

I am not sure, but well, if it works! :)

1 Like