Multilevel model using >16GB of RAM

Hey all, I’ve been messing with the coronavirus data, using Stan to fit sigmoid functions to the pandemic spread. Simple sigmoid curves fit the data well, as seen here: https://www.kaggle.com/peopletrees/coronavirus-stan-sigmoid-curves

I coded the sigmoid model in Stan as follows (good fit, no divergent transitions):

data {
    int<lower=1> N;  // number of rows
    vector<lower=0>[N] y; // outcome
    vector<lower=0>[N] t; // time in days
}
parameters {
    real<lower=0> alpha;
    real<lower=0> beta;
    real<lower=0> M;
    real<lower=0> sigma;
}
model {
    vector[N] mu;
    mu = M ./ (1 + exp(-beta * (t - alpha)));

    alpha ~ normal(100, 100); // number of days at which the expected # of counts is halfway to the maximum
    beta ~ normal(0.5, 5); // growth parameter
    M ~ normal(1000, 10000); // maximal number of cases

    y ~ normal(mu, sigma);
}

However, a multilevel model seems appropriate because so many countries have so few cases at this point (thankfully), so partial pooling would let those countries borrow strength from the rest. To that end, I tried rewriting this as a multilevel model as follows:

data {
    int<lower=1> N;  // number of rows
    int<lower=1> J; // number of locations
    int<lower=1,upper=J> location[N]; // the locations themselves
    vector<lower=0>[N] y; // outcome
    vector<lower=0>[N] t; // time in days
}
parameters {
    vector<lower=0>[J] alpha;
    vector<lower=0>[J] beta;
    vector<lower=0>[J] M;

    real<lower=0> M_mu; // average maximal number of cases
    real<lower=0> M_sigma;

    real<lower=0> alpha_mu; // average number of days at which the expected # of counts is halfway to the maximum
    real<lower=0> alpha_sigma;

    real<lower=0> beta_mu; // average growth parameter value
    real<lower=0> beta_sigma;

    real<lower=0> sigma;
}

model {
    vector[N] mu;
    for ( i in 1:N ) {
        mu = M[location[i]] ./ (1 + exp(-beta[location[i]] * (t - alpha[location[i]])));
    }

    M ~ normal(M_mu, M_sigma);
    alpha ~ normal(alpha_mu, alpha_sigma); // number of days at which the expected # of counts is halfway to the maximum
    beta ~ normal(beta_mu, beta_sigma); // growth parameter

    // hyperparameters
    M_mu ~ normal(1000, 10000); 
    alpha_mu ~ normal(100, 100); 
    beta_mu ~ normal(0.5, 5); 

    y ~ normal(mu, sigma);
}

However, this model crashes during sampling because it uses up >16GB of RAM. That surprises me, since there are only 17k rows in this dataset. Any ideas for troubleshooting this?

If the size of the data isn’t the problem, it could be an issue with the indexing. What is the value of J? You need to be able to allocate three vectors of length J in memory. You could also try re-running the model with a subset of the data (e.g. 1,700 rows instead) to see whether the model works fine. If it still fails, that would indicate an error in the syntax (or in how the data is being passed).

Also, this may be unrelated to your issue, but I don’t think this loop is doing what you want:

    vector[N] mu;
    for ( i in 1:N ) {
        mu = M[location[i]] ./ (1 + exp(-beta[location[i]] * (t - alpha[location[i]])));
    }

Because you don’t have an index on mu (or t), you’re effectively just overwriting the whole mu vector N times, and the observed data end up being fitted using only the last iteration. You could also use the inv_logit function for better resistance to over-/under-flow:

    vector[N] mu;
    for ( i in 1:N ) {
        mu[i] = M[location[i]] * inv_logit(beta[location[i]] * (t[i] - alpha[location[i]]));
    }
1 Like

N = ~17k and J = 254
Fixing the mu overwriting issue also fixes the memory issue, although I’m not sure why! (Maybe each loop iteration was building a whole length-N vector expression, so the autodiff graph was growing like N²?)
It’s still taking forever to run, but I’m trying additional speedups like vectorizing:
M[location] .* inv_logit(beta[location] .* (t - alpha[location]))

I think I can take it from here :) Thank you so much for your help!

Great to hear! A couple of pointers to add.

To speed things up you should have a look into using non-centered parameterisations. Hierarchical parameters can have some tricky posterior geometry to explore, which results in divergences. These can instead be parameterised using manipulations of the standard normal distribution. This will often also have the benefit of being faster, since Stan can explore a standard normal distribution very efficiently. More information (and examples) can be found in the reparameterization section of the Stan manual. However, these can also perform worse with large sample sizes, so your mileage may vary.
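For a single hierarchical parameter, the pattern looks roughly like this (a generic sketch rather than a drop-in piece of your model: theta, mu and tau are stand-in names for any of your per-location parameters and their hyperparameters, and the priors for mu and tau plus the likelihood are omitted):

data {
    int<lower=1> J; // number of groups
}
parameters {
    real mu;               // population mean
    real<lower=0> tau;     // population sd
    vector[J] theta_raw;   // unscaled, unshifted group-level deviations
}
transformed parameters {
    // equivalent to theta ~ normal(mu, tau), but the sampler only ever
    // explores independent standard normals, which is an easier geometry
    vector[J] theta = mu + tau * theta_raw;
}
model {
    theta_raw ~ std_normal();
    // ... priors for mu and tau, and a likelihood that uses theta, go here
}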

You could also look into the map_rect functionality for parallelising the model. This would likely provide the biggest speed-up, but requires re-writing the model a little. More info is in the Map-Reduce chapter of the manual.
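As a very rough illustration of what the map_rect structure could look like for your model (just a sketch, not something I’ve run: it assumes, for simplicity, that every location has exactly N / J rows so all shards are the same size, it keeps your centred parameterisation and priors, and the shard_lp name is mine; you also need to build with STAN_THREADS or MPI enabled for the shards to actually run in parallel):

functions {
    // log likelihood for one shard (= one location)
    vector shard_lp(vector phi, vector theta, real[] x_r, int[] x_i) {
        int n = x_i[1];
        real M = theta[1];
        real alpha = theta[2];
        real beta = theta[3];
        real sigma = phi[1];
        vector[n] t = to_vector(x_r[1:n]);
        vector[n] y = to_vector(x_r[(n + 1):(2 * n)]);
        vector[n] mu = M * inv_logit(beta * (t - alpha));
        return [normal_lpdf(y | mu, sigma)]';
    }
}
data {
    int<lower=1> N;  // number of rows
    int<lower=1> J; // number of locations
    int<lower=1,upper=J> location[N]; // the locations themselves
    vector<lower=0>[N] y; // outcome
    vector<lower=0>[N] t; // time in days
}
transformed data {
    // pack each location's rows into one shard;
    // assumes every location has exactly N / J rows
    int n_per = N / J;
    real x_r[J, 2 * n_per]; // per shard: t values, then y values
    int x_i[J, 1];          // per shard: number of rows
    {
        int pos[J] = rep_array(0, J);
        for (i in 1:N) {
            int j = location[i];
            pos[j] += 1;
            x_r[j, pos[j]] = t[i];
            x_r[j, n_per + pos[j]] = y[i];
        }
    }
    for (j in 1:J)
        x_i[j, 1] = n_per;
}
parameters {
    vector<lower=0>[J] alpha;
    vector<lower=0>[J] beta;
    vector<lower=0>[J] M;

    real<lower=0> M_mu;
    real<lower=0> M_sigma;
    real<lower=0> alpha_mu;
    real<lower=0> alpha_sigma;
    real<lower=0> beta_mu;
    real<lower=0> beta_sigma;

    real<lower=0> sigma;
}
model {
    // bundle the per-location parameters for map_rect
    vector[3] theta[J];
    for (j in 1:J)
        theta[j] = [M[j], alpha[j], beta[j]]';

    M ~ normal(M_mu, M_sigma);
    alpha ~ normal(alpha_mu, alpha_sigma);
    beta ~ normal(beta_mu, beta_sigma);

    M_mu ~ normal(1000, 10000);
    alpha_mu ~ normal(100, 100);
    beta_mu ~ normal(0.5, 5);

    // likelihood, evaluated shard-by-shard (in parallel when threading/MPI is enabled)
    target += sum(map_rect(shard_lp, [sigma]', theta, x_r, x_i));
}

In practice you’d have to deal with unequal group sizes, e.g. by padding each shard and passing the true number of rows through x_i.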

1 Like

Thank you, this is great. I’ve tried de-centering everything as follows, but it’s still running very slowly. I’ll keep tinkering, including trying map_rect.

data {
    int<lower=1> N;  // number of rows
    int<lower=1> J; // number of locations
    int<lower=1,upper=J> location[N]; // the locations themselves
    vector<lower=0>[N] y; // outcome
    vector<lower=0>[N] t; // time in days
}
parameters {
    vector<lower=0>[J] alpha_raw;
    vector<lower=0>[J] beta_raw;
    vector<lower=0>[J] M_raw;

    real<lower=0> M_mu_raw; 
    real<lower=0> M_sigma_raw;

    real<lower=0> alpha_mu_raw;
    real<lower=0> alpha_sigma_raw;

    real<lower=0> beta_mu_raw; 
    real<lower=0> beta_sigma_raw;

    real<lower=0> sigma;
}

transformed parameters{
    real<lower=0> M_mu; // average maximal number of cases
    real<lower=0> alpha_mu; // average number of days at which the expected # of counts is halfway to the maximum
    real<lower=0> beta_mu; // average growth parameter value
    real<lower=0> M_sigma;
    real<lower=0> alpha_sigma;
    real<lower=0> beta_sigma; 
    vector<lower=0>[J] alpha;
    vector<lower=0>[J] beta;
    vector<lower=0>[J] M;
    
    M_mu = 1000 + M_mu_raw * 25000;
    alpha_mu = 100 + alpha_mu_raw * 50;
    beta_mu = 0.3 + beta_mu_raw * 2.5;
    M_sigma = M_sigma_raw * 25000;
    alpha_sigma = alpha_sigma_raw * 50;
    beta_sigma = beta_sigma_raw * 2.5;
    
    alpha = alpha_raw * alpha_sigma + alpha_mu;
    beta = beta_raw * beta_sigma + beta_mu;
    M = M_raw * M_sigma + M_mu;
}

model {
    vector[N] mu;
    // use inv_logit for better numerical stability, and vectorize for speed
    mu = M[location] .* inv_logit(beta[location] .* (t - alpha[location]));

    M_raw ~ std_normal();
    alpha_raw ~ std_normal();
    beta_raw ~ std_normal();

    // hyperparameters
    M_mu_raw ~ std_normal();
    alpha_mu_raw ~ std_normal();
    beta_mu_raw ~ std_normal();
    
    M_sigma_raw ~ std_normal();
    alpha_sigma_raw ~ std_normal();
    beta_sigma_raw ~ std_normal();

    y ~ normal(mu, sigma);
}

I was trying to do something similar a few days ago with a hierarchical Gompertz model implemented through brms - see thread here: Hierarchical Gompertz model

I gave up in the end, as no matter what I tried I could not quite make it work. I think there might be an identifiability problem with the equation - it has 3 fixed-effect parameters and one variable. But I don’t know much about identifiability, so it’s just a guess.

Hey jroon. I don’t think it’s an identifiability issue, as I can recover all the parameters (I’m sure constraining everything to be positive helps). When I fit the model to a subset of a thousand rows of the dataset, as suggested by @andrjohns, I get very sensible results that plot over the data just fine. The de-centered model I posted does fit (slowly) without divergent transitions – once the Kaggle markdown is finished I’ll edit this post to include a link.

1 Like

Oh yes, just a guess on my part really. Are you getting divergences now? Have you tried looking at pairs plots?

Also try to make the problem smaller - pick only 7 or 8 countries with high case counts - it will be easier to debug.

It runs and fits the data well when I subset to just a few countries (no divergences), but now it’s giving weird results for the full dataset! I’m going to try different prior values (I think the ones I chose were unrealistically large and wide) and will take your advice on selecting ~10 countries.

1 Like

FYI, the reason I mentioned non-identifiability is that I was seeing pairs plots similar to the ones described here for a sigmoid model:
https://www.martinmodrak.cz/2018/05/14/identifying-non-identifiability/

I suggest you examine such plots for your 10-country data set.

1 Like

The blog notes that certain parameter values bring out the pathology - perhaps in a multilevel model the chance of hitting a pathological value is higher, because you now have as many potentially pathological variables as you do countries? Perhaps @martinmodrak, who wrote the blog, has some thoughts here?

2 Likes

Sigmoids are challenging to fit. In this case I would guess the biggest problem is that the upper plateau parameter is almost not constrained by the data at all… (this is also a problem for any predictions from such a model). Finally, the data is messy and has all sorts of reporting biases and non-stationarity due to the evolving reactions of governments and the public, so a sigmoid might be a problematic model.
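(Roughly speaking: early in an outbreak, when beta * (t - alpha) is very negative, inv_logit(x) ≈ exp(x), so

mu ≈ M * exp(-beta * alpha) * exp(beta * t)

and the data only pin down the growth rate beta and the product M * exp(-beta * alpha), not M and alpha separately - they can drift along a ridge, which is essentially the pathology in the blog post linked above.)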

I’ve found the direct parametrization of the sigmoid problematic. I worked with @stemangiola on fitting sigmoids in a different context; you might want to check out the parametrization we used in a preprint: https://doi.org/10.1101/2020.03.16.993162, also discussed at Difficulties with logistic population growth model.

With epidemic modelling at this time, I would also ask you to be very careful about communicating the results. There is a lot of noise in the public communication channels that can easily drown out the voices of the best experts, so please make sure you are not adding to this noise.

Best of luck with your model

4 Likes

Thank you Martin :)

1 Like