Marginalizing over discrete parameter with gamma distribution

Hi, I am having some trouble figuring out how to marginalize a discrete parameter in my model. I have never done this before and I couldn’t figure out how to apply any of the explanations I found online.

The problem I am working on is estimating the proportion of COVID-19 infections that end up in the hospital (hospitalProp). In a simplified version of the problem, I have for several different locations the following data:

  • hospitalizations: Number of hospitalizations (integer)
  • population: The population at the location (integer)
  • seroprevShape, seroprevRate: The shape and rate parameters of a gamma distribution over real values between 0 and 1, that described the distribution over the proportion of people that have been infected in the population (which I extract from seroprevalence studies)

Then, my model consists in the following:

Seroprevalence[N] \sim gamma(shape[N], rate[N])
Infections[N] = Population[N] \times Seroprevalence[N]
Hospitalizations[N] \sim Binomial(Infections[N], hospitalProp)

I would try to implement it the following way, but I cannot because of the discrete parameter (infections).


data {
  int<lower=0> N;                      // number of observations
  int population[N];                    // population count
  vector[N] seroprevShape;      // Prevalence gamma shape
  vector[N] seroprevRate;         // Prevalence gamma rate
  int hospitalizations[N];            // outcome count
}

parameters {
  // Countries outcome fit
  real<lower=0, upper=1> hospitalProp;      // hospitalization proportion
  vector[N] prevalence;                                // underlying prevalence at each location
}

transformed parameters {
  int infections[N];                        // number of infections at each location
  infections = prevalence .* population;
}

model {
  hospitalizations ~ binomial(infections, hospitalProp);
} 

I have been looking into how to marginalize the number of infections out of the model, but I am at a loss. Any help or pointers on where to look will be greatly appreciated.

Solution:

The solution proposed by @maxbiostat solved the problem. Since I had to fix a couple of details of the code he shared, I’m leaving the final solution here in case someone uses it in the future:

The solution was to define a binomial probability function that takes as input a real valued N. This is achieved with the following code at the start of the script:

functions{
  real bin_lpmf(int x, real n, real theta){
    real ans  = lchoose(n, x) + x*log(theta) + (n-x)*log1m(theta);
    return(ans);
  }
}

Then the sampling in the model is done with

model {
  hospitalizations ~ binomial(infections, hospitalProp);
} 

The difference with @maxbiostat code is that lpdf is substituted with lpmf because it’s probability distribution over discrete values, and that log1p was substituted for log1m which is the correct form of the density function.

Hi @dherrera1911,

Given that infections is not a random variable could you simply just not declare it in the transformed parameters block i.e.

data {
  int<lower=0> N;                      // number of observations
  int population[N];                    // population count
  vector[N] seroprevShape;      // Prevalence gamma shape
  vector[N] seroprevRate;         // Prevalence gamma rate
  int hospitalizations[N];            // outcome count
}

parameters {
  // Countries outcome fit
  real<lower=0, upper=1> hospitalProp;      // hospitalization proportion
  vector[N] prevalence;                                // underlying prevalence at each location
}

model {
  hospitalizations ~ binomial(prevalence .* population;, hospitalProp);
} 

and then just calculate infections = prevalance * population for each draw from prevalance and population?

Best,
Jeffrey

2 Likes

Hi @jeffreypullin,

That stops throwing the message of integer parameters, but I run into the errors I wanted to avoid with that declaration (admittedly, probably a bad hack in the first place). I am running with the problem that I cannot multiply the integer population with the vector prevalence. If I define population as a vector though, I get the error that binomial doesn’t allow me to use the result of the multiplication as the first parameter, because it is a real and it only takes integers.

Maybe I’ll have to reformulate the model around this. If you have any ideas of how to fix the issue, that will be super welcome.

Thanks

Daniel

An easy fix would be to implement your own binomial, that takes a real as the trial parameter.

1 Like

Given that prevalence is continuous, infections*prevalence will never be integer. It seems to me that the straightforward solution is to model infections as

infections ~ binomial(population, prevalence);
hospitalizations ~ binomial(infections, hospitalProp);

Now you have a latent integer infections parameter that you need to marginalize over. But the marginalization is trivial, because binomial-of-binomial marginalizes to binomial:

hospitalizations ~ binomial(population, prevalence*hospitalProp);
2 Likes

Thanks, @jsocolar, that sounds like a reasonable work around, I’ll try that. Still, it feels like something is a bit off, although I’m not completely sure. I think this is because I was not thinking of infections as a random variable, because I have the input distribution over prevalence, given by the uncertainty in the estimations taken from the literature, but for each possible prevalence value there’s one number of infections that basically define that prevalence (as prevalence = infections/population)

Although I expect the added uncertainty of making infections a random variable to be small for places with large populations and high prevalence, the effect may not be that small for places with very small prevalences that amount to infections in the order of the few hundreds. Maybe that’s not wrong, but I would have to think about whether I’m counting the same uncertainty twice: once in the uncertainty of the prevalence, and once in the uncertainty over the infections | prevalence.

Thanks!

1 Like

Thanks, @maxbiostat, I’ll give that a try, that does sounds like a nice fix, although I’ll have to see how “easy” it is given my beginner Stan skills.

functions{
  real bin_lpmf(int x, real n, real theta){
    real ans  = lchoose(n, x) + x*log(theta) + (n-x)*log1m(theta);
    return(ans)
  }
}
model{
  ...
  hospitalizations ~ bin(prevalence .* population, hospitalProp);
}

Edit : incorporated corrections by @dherrera1911 below.

2 Likes

There are probably more sophisticated solutions, but the fact is that seroprevalence is technically not supposed to be a continuous percentage value, but a fraction R/P, and because population P is the denominator it would make your estimate of infections be an integer.

An easy workaround would be to just truncate infections and make it integer, since the decimals appear because of your seroprevalence approximation on a continuous scale. There are round, floor etc functions in Stan, though assigning a real value to an integer may no be as straightforward as it should be.

3 Likes

When you say that you have data on prevalence, do you mean that you have direct population-level measurements of the prevalence? Or do you mean that you have been able to use a sample to do inference on the per-individual probability of seropositivity? These are subtly distinct, but the subtle distinction is hard to see because we need a modestly large dataset to reliably estimate the per-individual probability of seropositivity, and then we need to assume that the sample is drawn from a population that is much larger than the sample in order that correct inference about the population doesn’t need to account for the fact that the sample itself is a substantial fraction of the population. These two requirements, taken together, place us in a regime where binomial(n,p) is effectively equal to n*p. But if you are concerned about small-population regimes, then it’s harder to sweep the subtlety under the rug. A couple of points here, though.

  • Technically, if the population is not huge compared to the sample, then to get correct inference on the population you should treat the sample as fixed and then use statistics only to infer the probable infection status of individuals not present in the sample.
  • If we have immunoassays for a random sample of the population, we might attempt to infer the “population seropositivity rate” using, e.g., infections \sim binomial(n,p). In this case, p is the per-individual probability of infection. The generative model isn’t “take n*p infections and distribute them among the sample”; it is “do n independent Bernoulli trials with probability p and sum the result”. For large populations, the true population-level number of infections asymptotically is just N*p. But for small populations, the true population-level number of infections (neglecting what is known about the sample; see previous bullet) is binomial(N, p). Because p fundamentally is the per-individual probability of infection, not the population-level fraction of positives.
  • My intuition is that whatever model is being used to estimate the “population-level” infection rate that you use is very likely to actually give inference on the per-individual infection probability. Again, this distinction breaks down under the assumption of “large-enough” population sizes, which we generally assume as a matter of course when doing applied statistics. But if you are worried about what happens when the population sizes are small, then I suspect you’ll find that binomial sampling captures the generative process properly, whereas n*p does not.
1 Like

Thanks for those clarifying thoughts, I think I mostly understand. Yes, these are quite small serosurveys compared to total population sample. However, in some cases they are not very large, so they are not a precise estimate of p (particularly since p is small in many of them), and that’s the main source of uncertainty that gives rise to that gamma distribution. I would plug in directly the serosurvey raw data (n=number of tests, positives= positive tests) as positives \sim binomial(n,p), and somehow manage to get an integer number of total infections from that, but I can’t do that because additional corrections for test characteristics are required to get the estimates and uncertainty, and from some sources I could only get the estimate + CI, and not the raw numbers.

I do think though that the method you propose is likely good for my purposes, partly because I don’t think I’ll be able to do anything better, so I guess we’ll just have to assume away those concerns.

1 Like

I think you’re fine here. Since the populations are “large enough” binomial(n,p) = n*p will be a pretty good approximation. Moreover, inasmuch as it’s a poor approximation, I think the term that is more apt for your purposes is the binomial term rather than n*p, because again I think that inference on p is better thought of as the per-individual infection probability.

With that said, given that the approximation is likely to be very good it is worth keeping in mind that @maxbiostat’s approach might well yield a more efficient implementation, since it keeps the number of trials in the binomial distribution smaller.

1 Like

Thanks, that almost worked (I’m fighting initialization failures now, but I don’t think it has to do with your solution). I think it should be lpmf instead of lpdf, and I also had some trouble with vectors and so I put a for loop in the model line, applying it element-wise.

1 Like

You’re right. I usually implement this for the case where everything is a real. In your case, hospitalizations is an integer, so _lpmf is the right way to go. Good catch.

2 Likes

Just for final update, it’s now running. I have to solve some convergence issues, but it’s going good, thanks!

Thanks. That’s what originally the use of an integer intermediate step was supposed to do, but I couldn’t figure out how to in the end “assign a real value to an integer” (I tried just introducing floor and round after seeing your comment, but that alone doesn’t work). Both @maxbiostat and @jsocolar workarounds seem to work though.

Also, it should be log1m instead of log1p (that one took me a while to find). Just leaving it here for completeness in case someone else uses this in the future.

1 Like