Using Stan to estimate the frequency of severe and critical SARS-CoV-2 infection across age (and question about logit link alternatives)


I am finishing a line of work that has taken a big chunk of my life last year, and will be preprinting it in the next few weeks. The work is estimating the proportion of SARS-CoV-2 infections that produce severe and critical disease, across ages. For this I use seroprevalence data from multiple countries, and also did a LOT of data scavenging. I think it will be an impactful work.

I am doing the analyses with rstan, and the main analysis consists of a hierarchical binomial logistic regression over the count of the different outcomes and the estimated numbers of infections.

Besides sharing the joy of finishing this work, and of doing it in stan (of which this is my first usage), I am writing because there’s a detail of the fit that I think may be improved (although it is already good as it is). The detail is that the trend of the % of the different outcomes with respect to age seems to deviate somewhat from the logistic function. In particular, for the older ages the model seems to fall off and underestimate the real proportion of the outcomes (see Figure below, with log-transformed vertical axis).

So, I am wondering whether there is some alternative I could try to improve the fit. I tried probit regression but that only made the fit worse. I tried using just the exponential function as the link function, but that destroys the fit (as already discussed here Binomial regression with a log link).

Suggestions will be highly welcome. And if you’re into COVID research, stay tuned to this work coming out!

1 Like

Congrats on getting close to the end of a long road!

Without knowing much about the data, one thought that comes to mind is that there may be a dispersion component to the data. Have you tried any distributions that model/estimate a dispersion parameter? Perhaps there are some age-related effects.


I hadn’t thought much about that, and I am also not sure what you mean by dispersion component (I don’t have formal training in statistics). Do you mean a change in the variance with age? There’s the natural change that for younger ages there are fewer outcomes to count (hospitalizations, ICU admissions, deaths) which leads to higher variances, but that should already be considered in the binomial treatment of the data, I think.

Also, I barely made it to this point with my Stan and statistics knowledge, I don’t think I could experiment with introducing new components to the model in the time-frame I think for getting this published (unless there’s something major I have to correct).

If you have any pointers to something I could look at, that would be appreciated though, if it’s not to laborious I may give it a try. Also, I will make the tidy data and code available soon when submitting, so if you’re very interested in that question you should be able to check it yourself.

If you’re able to share your model code that would be helpful.

If you’re using a binomial model, you could extend that to a beta-binomial, where each count on each row has its own ( unobserved) probability of being a “1”, and those probabilities come from a beta distribution. A beta-binomial is a way to account for dispersion (variance in an observed variable).

This might help account for differences in variance by age.

Sure, see the model code below. At the beggining it defines a version of the binomial that can take a continuous N value. Then, the N is determined from the estimated value of prevalence, which is extracted from a gamma distribution for each point (use as input the shape and rate of the gamma), and multiplying by the size of the population. Then I think the rest is fairly standard and straightforward, but feel free to ask any questions. I’ll post the github link here soon too.

  real bin_lpmf(int x, real n, real theta){
    real ans  = lchoose(n, x) + x*log(theta) + (n-x)*log1m(theta);

data {
  int<lower=0> N;                       // number of observations
  int<lower=0> K;                       // number of locations
  int<lower=1, upper=K> location[N];    // location index vector
  vector[N] ageVec;                     // predictor
  vector<lower=0>[N] population;                    // population count
  vector<lower=0>[N] seroprevShape;              // Prevalence gamma shape
  vector<lower=0>[N] seroprevRate;           // Prevalence gamma rate
  int<lower=0> outcomes[N];                      // outcome count

parameters {
  // Countries outcome fit
  real ageSlope;                      // mean slope
  real<lower=0> ageSlopeSigma;        // sd of slope across locations
  vector[K] locationSlope;            // vector with slope of each location
  real intercept;                     // mean intercept
  real<lower=0> interceptSigma;       // sd of the intercept
  vector[K] locationIntercept;        // intercept of each location 
  vector<lower=0, upper=1>[N] prevalence_raw; 

transformed parameters{
  vector<lower=0, upper=1>[N] outcomeRate;  // variable to contain % outcome
  outcomeRate = inv_logit(locationIntercept[location] + locationSlope[location] .* ageVec);  // logistic regression model 

model {
  ageSlope ~ normal(2, 1); // normal(0, 0.1);
  ageSlopeSigma ~ exponential(1);
  locationSlope ~ normal(ageSlope, ageSlopeSigma);
  intercept ~ normal(-6, 2);
  interceptSigma ~ exponential(1);
  locationIntercept ~ normal(intercept, interceptSigma);
  prevalence_raw ~ gamma(seroprevShape, seroprevRate);
  for (n in 1:N) {
    outcomes[n] ~ bin(population[n] * prevalence_raw[n], outcomeRate[n]);