Marginalizing a double binomial-Poisson hierarchical distribution

I want to fit a model involving a binomial-Poisson hierarchical distribution with two sets of conditionally independent binomial counts:
y_i \sim \text{Poisson}(\lambda_i),
z_{1, i} \sim \text{binomial}(y_i, \kappa),
z_{2, i} \sim \text{binomial}(y_i, \kappa),
where z_{1, i} and z_{2,i} are conditionally independent given y_i, y_i is unobserved, and the \lambda_i come from a regression with a log link (and Gaussian overdispersion).

If I had only a single z_i, it would be easy to marginalize: z_i \sim \text{Poisson}(\kappa \lambda_i).
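
For reference, this follows from summing out the latent count:

P(Z_i = z_i) = \sum_{k = z_i}^\infty \binom{k}{z_i}\kappa^{z_i}(1 - \kappa)^{k - z_i}\,\frac{e^{-\lambda_i}\lambda_i^{k}}{k!} = \frac{(\kappa\lambda_i)^{z_i}e^{-\lambda_i}}{z_i!}\sum_{m = 0}^\infty \frac{[(1 - \kappa)\lambda_i]^{m}}{m!} = \frac{(\kappa\lambda_i)^{z_i}e^{-\kappa\lambda_i}}{z_i!}.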

Approaching marginalizing with two sets of counts in the same way, and with a bit of help from SageMath, I found that the joint marginal density of (z_{1,i}, z_{2, i}) is
P(Z_{1, i} = z_{1, i}, Z_{2, i} = z_{2, i}) = \kappa^{z_{1,i} + z_{2, i}}e^{-\lambda_i}(1 - \kappa)^{-(z_{1, i} + z_{2, i})}a^{z_{1, i}}\binom{z_{1, i}}{z_{2, i}}\frac{1}{z_{1, i}!} {_1F_1}(z_{1, i} + 1; z_{1, i} - z_{2, i} + 1; a),
where a = (1 - \kappa)^2 \lambda_i, I assumed without loss of generality that z_{1, i} \geq z_{2, i}, and _1F_1() is a generalized hypergeometric function.
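
For anyone who wants to check the algebra: pulling the factors that don't involve k out of the sum over the latent count gives

P(Z_{1, i} = z_{1, i}, Z_{2, i} = z_{2, i}) = \kappa^{z_{1,i} + z_{2,i}}(1 - \kappa)^{-(z_{1,i} + z_{2,i})}e^{-\lambda_i}\sum_{k = z_{1,i}}^\infty \binom{k}{z_{1,i}}\binom{k}{z_{2,i}}\frac{a^k}{k!},

and substituting k = z_{1,i} + n turns the remaining sum into

a^{z_{1,i}}\binom{z_{1,i}}{z_{2,i}}\frac{1}{z_{1,i}!}\sum_{n = 0}^\infty \frac{(z_{1,i} + 1)_n}{(z_{1,i} - z_{2,i} + 1)_n}\frac{a^n}{n!},

which is exactly the {}_1F_1(z_{1,i} + 1; z_{1,i} - z_{2,i} + 1; a) series.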

Can I implement this easily in Stan? There might be some obvious simplification that I haven’t recognized. If not, I haven’t found a generalized hypergeometric function in the documentation, but it doesn’t look too difficult to write a Stan function based on genhypergeo() in the R package hypergeo (although I can imagine there is potential for numerical problems).

I’d be quite happy to have y_i as a latent variable instead, but I couldn’t immediately see how to do this in Stan: it can’t be a parameter because it’s discrete. I don’t need to be able to identify \kappa and \lambda_i separately, although the form of the joint marginal density suggests that this will be possible.


I know that @maxbiostat and @GuidoAMoreira have been thinking some about marginalizations of this sort.

I had a go at implementing this, with partial success (Stan code and simulated data set attached). My full model is actually multivariate, with 6 categories in each observation (indexed by j below), two conditionally independent counts for each category in each observation, and an offset a_i. Each category has its own detection probability \kappa_j. The linear predictor \eta involves an intercept vector \beta_0, an explanatory variable x with coefficient vector \beta_1 and observation-level random effects \varepsilon:

z_{1,ij} \sim \text{binomial}(y_{ij}, \kappa_j),
z_{2,ij} \sim \text{binomial}(y_{ij}, \kappa_j),
y_{ij} \sim \text{Poisson}(a_i \lambda_{ij}),
\lambda_{ij} = \exp(\eta_{ij}),
\eta_{i} = \beta_0 + \beta_1 x_i + \varepsilon_i,
\varepsilon_i \sim N(0, \Sigma).
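
For concreteness, here is a stripped-down sketch of how these pieces fit together (hypothetical names, not the code in the attached file, and priors omitted; it assumes a user-defined log marginal probability function lbivariatebinPois(kappa, z1, z2, lambda) in the functions block, versions of which are discussed below):

data {
  int<lower=1> N;        // observations
  int<lower=1> J;        // categories
  int<lower=0> z1[N, J]; // first set of counts
  int<lower=0> z2[N, J]; // second set of counts
  vector[N] x;           // explanatory variable
  vector<lower=0>[N] a;  // offsets
}
parameters {
  vector[J] beta0;
  vector[J] beta1;
  vector<lower=0, upper=1>[J] kappa;
  vector[J] epsilon[N];  // observation-level random effects
  cov_matrix[J] Sigma;
}
model {
  for (i in 1:N) {
    vector[J] eta = beta0 + beta1 * x[i] + epsilon[i];
    epsilon[i] ~ multi_normal(rep_vector(0, J), Sigma);
    for (j in 1:J) {
      target += lbivariatebinPois(kappa[j], z1[i, j], z2[i, j],
                                  a[i] * exp(eta[j]));
    }
  }
}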

I evaluated the hypergeometric function _1F_1 (actually a confluent hypergeometric function of the first kind, I think) as above, using functions based on genhypergeo_series() in the R package hypergeo.

This works OK if I leave out the observation-level random effects \varepsilon in the linear predictor. I still sometimes get divergent transitions after warmup, when the hypergeometric function evaluation fails, but I can get pretty good estimates of known parameters from simulated data. The \kappa appear to be identifiable, as expected from looking at the formula for the joint marginal density in the original post.

However, when I include the \varepsilon, sampling fails to initialize, because evaluation of the hypergeometric almost always fails (even when the true value of \Sigma is not far from the zero matrix). Typical case during initialization: \kappa = 0.16, z_1 = 1, z_2 = 0, a \lambda = 7816.13. It’s obviously going to fail because a bivariate count of (1, 0) is extremely unlikely with this value of a\lambda, and then there will be numerical problems with evaluating the hypergeometric.

Any suggestions? For example:

- Should I be trying to initialize differently (not something I usually do)? The main problem could just be diffuse initial conditions combined with the right skew of the Poisson-lognormal, which means we often have to evaluate the density at implausibly large \lambda during initialization.
- Am I doing something wrong in the way the random effects are coded? They seem to work OK if I have only a single count for each category in each observation, so that I just have Poisson densities to evaluate.
- Is there a different numerical approach to evaluating the hypergeometric function, or some way to make it fail more gracefully? I think this is always going to be difficult, because in examples like the initialization case above we're asking it to estimate a very small probability. This can happen even when \lambda isn't implausibly large, if the counts are large but fairly different and the proposed value of \kappa makes them unlikely.
- Is there a fundamental difficulty with having both \varepsilon and \kappa in a model of this kind? They don't appear to be doing the same thing, so I can't see why this should be the case.
bivariatebinomialPoisson.stan (4.4 KB)
testdata.txt (1.9 KB)


Is it possible to evaluate the probabilities on the log scale? If so, this should be much more numerically stable when the probabilities are small.

There is definitely scope for an approach using log probabilities (although not in the form of a hypergeometric function, I think). The log marginal probability is

\log \sum_{k = z_{1, i}}^\infty \exp \left[\log P(Z_{1,i} = z_{1,i} | Y_i = k) + \log P(Z_{2,i} = z_{2,i} | Y_i = k) + \log P(Y_i = k ) \right]

(assuming z_{1,i} \geq z_{2,i}). I could directly approximate that using log_sum_exp, up to a sufficiently large upper limit on the summation. A naive implementation of this is definitely much more stable, but very slow (hours rather than a couple of minutes, for a small data set), and I haven’t implemented a way to decide what upper limit is needed for sufficient accuracy:

  /** Direct log marginal probability for the bivariate binomial-Poisson
   * distribution, approximated by summing a fixed number of terms of the
   * infinite series on the log scale. */
  real lbivariatebinPoisdirect(real kappa, int z1, int z2, real lambda) {
    int mink = max(z1, z2);    // smallest k with nonzero probability
    vector[100] contributions; // 100 terms: a rough guess
    for (i in 1:100) {
      int k = mink + i - 1;
      contributions[i] = binomial_lpmf(z1 | k, kappa)
                         + binomial_lpmf(z2 | k, kappa)
                         + poisson_lpmf(k | lambda);
    }
    return log_sum_exp(contributions);
  }

Those two things could go together: I’m currently just summing over 100 values of k, but in many cases that’s probably more than is needed (and in others may not be enough).
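
For example, an adaptive version of the direct sum might look something like this (an untested sketch; the stopping rule relies on the ratio of successive terms being decreasing over the summation range, so once the terms start falling and the newest one is negligible relative to the running total, the tail can safely be dropped):

  real lbivariatebinPoisadaptive(real kappa, int z1, int z2, real lambda,
                                 real logtol, int maxterms) {
    int mink = max(z1, z2);  // smallest k with nonzero probability
    vector[maxterms] lterms; // log of each term, filled as we go
    real runningtotal = negative_infinity();
    for (m in 1:maxterms) {
      int k = mink + m - 1;
      lterms[m] = binomial_lpmf(z1 | k, kappa) + binomial_lpmf(z2 | k, kappa)
                  + poisson_lpmf(k | lambda);
      runningtotal = log_sum_exp(runningtotal, lterms[m]); // for the test only
      // stop once the terms are decreasing and the newest contributes less
      // than exp(logtol) of the total accumulated so far
      if (m > 1 && lterms[m] < lterms[m - 1]
          && lterms[m] - runningtotal < logtol) {
        return log_sum_exp(lterms[1:m]); // one final, accurate log_sum_exp
      }
    }
    print("warning: lbivariatebinPoisadaptive(): not converged.");
    return not_a_number();
  }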

I guess it’s not going to be possible to use my existing Taylor series method (which is fast) and then switch to the direct method above when it fails (I’d expect gradient problems).

Otherwise, this paper looks good, although they don’t recommend a single method for all cases:


It turns out to be easy to work on the log scale for the hypergeometric function too. The following approximates the log of the hypergeometric (just by taking the log of each term in the algorithm from the hypergeo R package and then applying log_sum_exp) and is much more stable than the original version. It's then easy to add the logs of the other factors in the expression for the joint marginal probability from the original post. I guess a sensible choice of stopping criterion is
\frac{|S_{K + 1} - S_{K}|}{|S_K|} < \text{tol},
where S_K denotes the log of the Kth partial sum of the {}_1F_1 series and \text{tol} is some small number.

It seems to be much faster in easy cases than summing a fixed number of terms in the expression for the log marginal probability, although it can get slow in difficult cases. Are there any obvious ways to speed it up? It's going to get called a lot in my code, and for large values of z it may need many iterations. I can't see a way to avoid evaluating each partial sum to implement the stopping rule.

  /** Log of the confluent hypergeometric function
   * _1F_1(U; L; z) = \sum_{n = 0}^\infty [(U)_n / (L)_n] z^n / n!,
   * where (x)_n denotes the rising factorial. Evaluated by partial sums.
   * Arguments:
   * U, L, z: as in the formula above
   * tol: tolerance (suggest 1e-6)
   * maxiter: maximum number of iterations (suggest 10000)
   * If there are convergence problems, try increasing tol or maxiter. */
  real log1F1(int U, int L, real z, real tol, int maxiter) {
    real logfacold;     // log of the current term in the series
    real logfacnew;
    real plogsumexpold; // log of the previous partial sum
    real plogsumexpnew; // log of the current partial sum
    real logz;
    real stopcrit;
    int Uinc;
    int Linc;

    if (z < 0) {
      return not_a_number();
    }

    Uinc = U;
    Linc = L;
    logz = log(z);
    logfacold = 0.0;
    plogsumexpold = 0.0;
    for (n in 1:maxiter) {
      logfacnew = log(Uinc) - log(Linc) + logz - log(n) + logfacold;
      plogsumexpnew = log_sum_exp(plogsumexpold, logfacnew);
      if (n > 2) { // the first partial sum is zero, so can't divide by it
        // stopping criterion: change small relative to the current value
        stopcrit = fabs(plogsumexpnew - plogsumexpold) / fabs(plogsumexpold);
        if (stopcrit < tol) {
          return plogsumexpold;
        }
      }
      Uinc = Uinc + 1;
      Linc = Linc + 1;
      plogsumexpold = plogsumexpnew;
      logfacold = logfacnew;
    }
    print("warning: log1F1(): series not converged. Try increasing tol or maxiter.");
    return not_a_number();
  }
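
To spell out how the other factors get added on the log scale, here is a small wrapper (my own sketch, not carefully tested; it assumes z1 >= z2, as in the original post, so swap the counts first if necessary):

  real lbivariatebinPois1F1(real kappa, int z1, int z2, real lambda,
                            real tol, int maxiter) {
    real a = (1 - kappa)^2 * lambda; // argument of the hypergeometric
    return (z1 + z2) * (log(kappa) - log1m(kappa)) - lambda
           + z1 * log(a) + lchoose(z1, z2) - lgamma(z1 + 1)
           + log1F1(z1 + 1, z1 - z2 + 1, a, tol, maxiter);
  }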

There are some other interesting approaches available. In particular, this Laplace approximation approach (the scalar case, section 4.1) is very fast and easy to implement, but when I tried, it didn’t seem to do well for large z_1, z_2, \lambda.


I’m sorry it took me this long to get to this. I’m glad you’re making steady progress. In Adaptive truncation of infinite sums: applications to Statistics, @GuidoAMoreira and I look into some strategies for summing infinite series with accuracy guarantees.

A few comments on the latest implementation: whilst you’re right that you can’t avoid checking the stopping condition at every step, you don’t need to call log_sum_exp at every step. Doing so leads to a fairly rapid deterioration of numerical quality. It’s much safer to accumulate the log terms in a vector and then apply log_sum_exp only once. See this repo for some implementations. We don’t have much in the way of documentation yet, but @GuidoAMoreira and I would be happy to help.
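
Schematically, the pattern I mean is something like the following (a rough, untested sketch, not code from our repo; the stopping rule is just one option):

  real log1F1buffered(int U, int L, real z, real logtol, int maxiter) {
    vector[maxiter + 1] lterms; // log of each term of the series
    real logfac = 0.0;          // log of the n = 0 term, which is 1
    if (z < 0) {
      return not_a_number();
    }
    if (z == 0) {
      return 0.0; // the series is identically 1
    }
    lterms[1] = logfac;
    for (n in 1:maxiter) {
      // term recurrence: (U)_n / (U)_{n - 1} = U + n - 1, and similarly for L
      logfac += log(U + n - 1) - log(L + n - 1) + log(z) - log(n);
      lterms[n + 1] = logfac;
      // the terms eventually decrease; once the newest one is falling and
      // negligible relative to the largest so far, truncate the series and
      // apply log_sum_exp exactly once
      if (n > 1 && logfac < lterms[n] && logfac - max(lterms[1:n]) < logtol) {
        return log_sum_exp(lterms[1:(n + 1)]);
      }
    }
    print("warning: log1F1buffered(): series not converged.");
    return not_a_number();
  }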

I unfortunately do not have the time to implement the 1F1 (or the original marginalisation) myself, but feel free to PM me and I can try and help out. This is an interesting application.

Thank you. That looks like a great paper: I’ll have a proper read later, but I can already see lots of things in it that I want to know more about. But I found a better solution for my case, in Johnson, Kotz and Balakrishnan (1997), Discrete Multivariate Distributions, p. 114. They call this distribution a bivariate compound Poisson. They work with probability generating functions to get an exact expression for the marginal joint probability of (z_1, z_2) involving a finite sum. You need to differentiate the generating function for Y, but that’s easy for Y \sim \text{Poisson}(\lambda). And because the number of terms in the sum depends only on the observed data, computation is not going to slow down radically when, for example, \lambda gets large (which was a problem with my partial sum implementations, where large \lambda usually meant lots of terms needed to reach a given tolerance).
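
Concretely, writing G_Y for the probability generating function of Y, the expression implemented below is

P(Z_1 = z_1, Z_2 = z_2) = \sum_{i = 0}^{\min(z_1, z_2)} \frac{p_{10}^{z_1 - i}\, p_{01}^{z_2 - i}\, p_{11}^{i}}{(z_1 - i)!\,(z_2 - i)!\,i!}\, G_Y^{(z_1 + z_2 - i)}(p_{00}),

with p_{00} = (1 - \kappa)^2, p_{10} = p_{01} = \kappa(1 - \kappa) and p_{11} = \kappa^2. For Y \sim \text{Poisson}(\lambda), G_Y(s) = e^{\lambda(s - 1)}, so G_Y^{(k)}(p_{00}) = \lambda^k e^{-\lambda(1 - p_{00})}, which is all the loop has to evaluate.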

    /** Log marginal probability for a bivariate binomial-Poisson hierarchical
     * distribution. Johnson, Kotz and Balakrishnan (1997), Discrete
     * Multivariate Distributions, p. 114, eq. 36.90, with N ~ Poisson(lambda),
     * p00 = (1 - kappa)^2, p10 = p01 = kappa * (1 - kappa), p11 = kappa^2.
     * They call this a bivariate compound Poisson distribution.
     * Arguments:
     * kappa: success probability in the binomials
     * z1, z2: the two observed counts
     * lambda: Poisson parameter
     * Value: log marginal probability */
  real lbivariatebinPois(real kappa, int z1, int z2, real lambda) {
    int zmin = min(z1, z2);
    vector[zmin + 1] lcontrib; // one term for each i in 0:zmin
    real p00 = (1 - kappa)^2;
    real logkappa = log(kappa);
    real log1minuskappa = log(1 - kappa);
    real loglambda = log(lambda);
    real lambdatimes1minusp00 = lambda * (1 - p00);
    for (i in 0:zmin) {
      // log of p10^(z1 - i) * p01^(z2 - i), with p10 = p01 = kappa * (1 - kappa)
      real logp10p01 = (z1 + z2 - 2 * i) * (logkappa + log1minuskappa);
      // log of p11^i, with p11 = kappa^2
      real logp11i = 2 * i * logkappa;
      // log of the (z1 + z2 - i)th derivative of the Poisson pgf at p00
      real logGNz1z2minusi = (z1 + z2 - i) * loglambda - lambdatimes1minusp00;
      lcontrib[i + 1] = -lgamma(z1 - i + 1) - lgamma(z2 - i + 1) - lgamma(i + 1)
                        + logGNz1z2minusi + logp10p01 + logp11i;
    }
    return log_sum_exp(lcontrib);
  }

It seems to be working well.
