I am finishing a line of work that has taken a big chunk of my life last year, and will be preprinting it in the next few weeks. The work is estimating the proportion of SARS-CoV-2 infections that produce severe and critical disease, across ages. For this I use seroprevalence data from multiple countries, and also did a LOT of data scavenging. I think it will be an impactful work.
I am doing the analyses with rstan, and the main analysis consists of a hierarchical binomial logistic regression over the count of the different outcomes and the estimated numbers of infections.
Besides sharing the joy of finishing this work, and of doing it in stan (of which this is my first usage), I am writing because there’s a detail of the fit that I think may be improved (although it is already good as it is). The detail is that the trend of the % of the different outcomes with respect to age seems to deviate somewhat from the logistic function. In particular, for the older ages the model seems to fall off and underestimate the real proportion of the outcomes (see Figure below, with log-transformed vertical axis).
So, I am wondering whether there is some alternative I could try to improve the fit. I tried probit regression but that only made the fit worse. I tried using just the exponential function as the link function, but that destroys the fit (as already discussed here Binomial regression with a log link).
Suggestions will be highly welcome. And if you’re into COVID research, stay tuned to this work coming out!
Congrats on getting close to the end of a long road!
Without knowing much about the data, one thought that comes to mind is that there may be a dispersion component to the data. Have you tried any distributions that model/estimate a dispersion parameter? Perhaps there are some age-related effects.
I hadn’t thought much about that, and I am also not sure what you mean by dispersion component (I don’t have formal training in statistics). Do you mean a change in the variance with age? There’s the natural change that for younger ages there are fewer outcomes to count (hospitalizations, ICU admissions, deaths) which leads to higher variances, but that should already be considered in the binomial treatment of the data, I think.
Also, I barely made it to this point with my Stan and statistics knowledge, I don’t think I could experiment with introducing new components to the model in the time-frame I think for getting this published (unless there’s something major I have to correct).
If you have any pointers to something I could look at, that would be appreciated though, if it’s not to laborious I may give it a try. Also, I will make the tidy data and code available soon when submitting, so if you’re very interested in that question you should be able to check it yourself.
If you’re able to share your model code that would be helpful.
If you’re using a binomial model, you could extend that to a beta-binomial, where each count on each row has its own ( unobserved) probability of being a “1”, and those probabilities come from a beta distribution. A beta-binomial is a way to account for dispersion (variance in an observed variable).
This might help account for differences in variance by age.
Sure, see the model code below. At the beggining it defines a version of the binomial that can take a continuous N value. Then, the N is determined from the estimated value of prevalence, which is extracted from a gamma distribution for each point (use as input the shape and rate of the gamma), and multiplying by the size of the population. Then I think the rest is fairly standard and straightforward, but feel free to ask any questions. I’ll post the github link here soon too.
functions{
real bin_lpmf(int x, real n, real theta){
real ans = lchoose(n, x) + x*log(theta) + (n-x)*log1m(theta);
return(ans);
}
}
data {
int<lower=0> N; // number of observations
int<lower=0> K; // number of locations
int<lower=1, upper=K> location[N]; // location index vector
vector[N] ageVec; // predictor
vector<lower=0>[N] population; // population count
vector<lower=0>[N] seroprevShape; // Prevalence gamma shape
vector<lower=0>[N] seroprevRate; // Prevalence gamma rate
int<lower=0> outcomes[N]; // outcome count
}
parameters {
// Countries outcome fit
real ageSlope; // mean slope
real<lower=0> ageSlopeSigma; // sd of slope across locations
vector[K] locationSlope; // vector with slope of each location
real intercept; // mean intercept
real<lower=0> interceptSigma; // sd of the intercept
vector[K] locationIntercept; // intercept of each location
vector<lower=0, upper=1>[N] prevalence_raw;
}
transformed parameters{
vector<lower=0, upper=1>[N] outcomeRate; // variable to contain % outcome
outcomeRate = inv_logit(locationIntercept[location] + locationSlope[location] .* ageVec); // logistic regression model
}
model {
ageSlope ~ normal(2, 1); // normal(0, 0.1);
ageSlopeSigma ~ exponential(1);
locationSlope ~ normal(ageSlope, ageSlopeSigma);
intercept ~ normal(-6, 2);
interceptSigma ~ exponential(1);
locationIntercept ~ normal(intercept, interceptSigma);
prevalence_raw ~ gamma(seroprevShape, seroprevRate);
for (n in 1:N) {
outcomes[n] ~ bin(population[n] * prevalence_raw[n], outcomeRate[n]);
}
}