How to truncate a distribution by applying varying bounds?

Hi, I have some information about my data that I am not sure how to introduce into my model. The simplest approach would seem to be putting varying lower bounds on my parameters, but I have not found a way to do it, and the solutions I have read do not seem applicable.

I am working with COVID data, and each data row represents a country. For each country/row, I have an estimate of the proportion of people that had COVID, obtained from seroprevalence studies. I also have a lower bound on the number of people that got COVID in each location, coming from the number of reported cases (call it tested^i).

To introduce the estimated prevalence from the serosurveys, I use as input data the values \alpha^i and \beta^i that define a Gamma distribution over prevalence values for each location i, which I obtain from each serosurvey.

prevalence^i \sim Gamma(\alpha^{i}, \beta^{i})
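(For concreteness, one common way to obtain such parameters — just one possible choice — is moment matching: for a serosurvey with mean m^i and variance v^i,

\alpha^i = (m^i)^2 / v^i, \qquad \beta^i = m^i / v^i

since a Gamma(\alpha, \beta) has mean \alpha/\beta and variance \alpha/\beta^2.)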

But due to the large uncertainty in seroprevalence studies, the distribution over prevalence^i often puts substantial mass below tested^i.

This is all related to estimating the lethality of COVID-19 using the number of deaths for each location, deaths^i. A very simplified version of the model would look like this:

data {
  int<lower=0> N;                              // number of observations
  vector<lower=0>[N] seroprevShape;            // prevalence gamma shape
  vector<lower=0>[N] seroprevRate;             // prevalence gamma rate
  vector<lower=0>[N] deaths;                   // outcome count
  vector<lower=0>[N] nPop;                     // population of the country
}
parameters {
  // Countries outcome fit
  real<lower=0, upper=1> lethality;            // mean lethality
  real<lower=0, upper=1> lethalitySd;          // sd of the outcome likelihood
  vector<lower=0, upper=1>[N] prevalence_raw;  // prevalence per country
}
model {
  lethality ~ normal(0.5, 0.1);
  lethalitySd ~ normal(0.5, 0.1);
  prevalence_raw ~ gamma(seroprevShape, seroprevRate);
  deaths ~ normal(nPop .* prevalence_raw * lethality, lethalitySd);
}

Now, what I would like to do is introduce tested^i for each location as a lower bound on prevalence_raw^i. The method I have seen suggested elsewhere is to introduce a vector of lower bounds and add it to prevalence_raw. But it seems to me that this would completely change the estimated prevalences I obtained from the data.

Another thought I have is to maybe use a distribution other than Gamma (or reparameterize the Gamma), choosing its parameters so that, after adding the lower bounds tested^i, the shifted distribution still has the intended moments (i.e. the same expected value and variance as implied by the serosurveys).
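Something like this sketch is what I have in mind (keeping the Gamma family and just re-deriving its parameters for the shifted variable; seroprevMean, seroprevVar, and tested are hypothetical data names for the serosurvey mean/variance and the reported-case bound):

```stan
data {
  int<lower=0> N;
  vector<lower=0>[N] seroprevMean;   // serosurvey mean prevalence
  vector<lower=0>[N] seroprevVar;    // serosurvey prevalence variance
  vector<lower=0>[N] tested;         // reported-case lower bound
}
transformed data {
  // Moment matching for the shifted gamma: if excess ~ gamma(a, b) and
  // prevalence = tested + excess, then E[prevalence] = tested + a/b and
  // Var[prevalence] = a/b^2. Solving for a and b:
  vector[N] shiftedMean = seroprevMean - tested;        // must stay positive
  vector[N] adjRate  = shiftedMean ./ seroprevVar;
  vector[N] adjShape = square(shiftedMean) ./ seroprevVar;
}
parameters {
  vector<lower=0>[N] excess;         // prevalence above the reported-case bound
}
model {
  excess ~ gamma(adjShape, adjRate);
}
generated quantities {
  // Recovers the serosurvey mean and variance by construction
  vector[N] prevalence = tested + excess;
}
```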

I hope this is not too confusing. Any help or suggestions would be greatly appreciated.


Hi,
so the most obvious thing is that in recent Stan, you can do:

data {
  int<lower=0> N;
  vector[N] lb;
}

parameters {
  vector<lower=lb>[N] x;
}

to have varying lower bounds. (If you are using R, the latest rstan is unfortunately not on CRAN, so you need to either install it via the stan-dev r-packages repository or use cmdstanr.)

It would change the meaning of the model coefficients, but you can always store the “total prevalence = lower bound + estimated” value in generated quantities and use this for inference.
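For instance (just a minimal sketch; `shape` and `rate` stand in for whatever prior you put on the amount above the bound):

```stan
data {
  int<lower=0> N;
  vector<lower=0>[N] lb;              // per-location lower bounds (tested^i)
  vector<lower=0>[N] shape;           // placeholder prior shape
  vector<lower=0>[N] rate;            // placeholder prior rate
}
parameters {
  vector<lower=0>[N] x_excess;        // amount above the bound
}
model {
  x_excess ~ gamma(shape, rate);
}
generated quantities {
  vector[N] x_total = lb + x_excess;  // "total prevalence" used for inference
}
```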

This may be sensible. Also note that the number of reported cases is not technically a hard lower bound: one can imagine that an extensive testing program in a low-prevalence country, using tests with somewhat subpar specificity, could easily produce an overestimate. So I am not sure the “hard bound” approach is entirely useful. Ben Goodrich did some work on setting priors by specifying a set of quantiles, which might also be useful: GitHub - bgoodri/StanCon2020: Material for my proposed StanCon talk

Would that make sense?

Best of luck with your model!


Thanks! So, if I understand correctly, in the newest version of Stan (not yet available on CRAN) I can just give a vector of lower bounds. I’ll try that, thanks!

Then, I’m still not sure it would only change the meaning of the coefficients; I think it may change the model itself. Note that the parameters of the gamma are not model parameters, they are input data. So, if I input my gamma parameters (as data), implying an estimated prevalence level, adding a constant to each gamma would change the overall estimated prevalence I give as input, right?
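Concretely, restating the concern in symbols: if prevalence_raw^i \sim Gamma(\alpha^i, \beta^i) and I add the bound, then

E[tested^i + prevalence_raw^i] = tested^i + \alpha^i / \beta^i

which no longer matches the serosurvey estimate \alpha^i / \beta^i.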

Then, I agree that the number of cases is not technically a hard lower bound, but given the large variability in most age-stratified seroprevalence studies, and the relatively high reliability of COVID-19 tests, I think that in practice they are a reasonable lower bound.

Thanks!


BTW, I was just watching the video you linked, and it had the solution to another problem I had been thinking about for a couple of days! What serendipity!