Poisson binomial model -- advice needed on modelling approach

I am trying to model some data that is Poisson-binomially distributed. The Poisson-binomial distribution was discussed here, and that thread contains a great Stan implementation written by Bob Carpenter (thanks!), which I am using.

As I’ve written the model, I am getting very poor coverage on simulated data. I am unclear whether this is because 1.) what I am trying to do is impossible or 2.) the way I’m trying to do it is wrong. So I’ll describe the problem I’m trying to solve in case any of you have better ideas about how to model it. The problem I’m working on involves historical financial data (19th c.), so the data is limited in particular ways and certain gaps cannot be filled in.

The problem

I have a sample of stocks (index i), and each stock falls into a group (index g). There is a binary outcome associated with each stock that I want to model. Let’s call this outcome y_{ig}^* \in \{0,1\}.

For each stock I observe some characteristics that I think should predict the outcome: x_{ig}. In fact, let’s stipulate that

y_{ig}^* \sim Bernoulli(p_{ig}), \quad p_{ig} = g(x_{ig} \beta)

where g(\cdot) is an inverse link function such as the logistic or probit (I try both). So far, so straightforward.

Here comes the twist: I mostly don’t observe y_{ig}^*; instead, what I observe is y_g = \sum_{i \in g} y_{ig}^*. In words, I don’t observe the binary outcome itself, but rather the sum of these outcomes within each group. I say mostly because for a subset of the data (~30%) I do observe the outcome y_{ig}^*, but only for stocks where y_{ig}^* = 0.

Thus what I am observing is the sum of independent but not identically distributed Bernoulli trials, which is exactly the definition of the Poisson-binomial distribution. So I have

y_g \sim PB(N_g, g(x_{ig} \beta))

where N_g is the number of observations in group g. What I have been doing so far is treating observations I know the outcome for as groups of size 1, so for those observations the model collapses to a standard logit/probit.
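
For concreteness, here is a minimal R sketch of the data-generating process I have in mind (the group sizes, covariate distribution, and coefficient values are placeholders, not the ones in the attached script):

# Minimal simulation sketch of the data-generating process (placeholder values)
set.seed(123)
G <- 50                                    # number of groups
gsize <- sample(5:40, G, replace = TRUE)   # group sizes N_g
N <- sum(gsize)                            # total number of stocks
K <- 2                                     # number of covariates

X <- matrix(rnorm(N * K), N, K)            # stock-level covariates x_ig
mu <- 0.5                                  # true intercept
delta <- c(1, -0.5)                        # true coefficients

p <- plogis(as.vector(mu + X %*% delta))   # logit link: p_ig
y_star <- rbinom(N, 1, p)                  # latent stock-level outcomes y*_ig

group <- rep(seq_len(G), gsize)            # stocks stored contiguously by group
y <- as.vector(tapply(y_star, group, sum)) # observed group totals y_g
# (stocks whose outcome is individually observed would be entered as groups of size 1)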

Stan code

So the model in Stan looks like:

//
// This Stan program implements a Poisson-binomial
// regression where the probability vector is
// parameterized via a logit link. The objective is to
// recover the logit coefficients.
// The poisson_binomial_lpmf function is taken from the
// Stan forums, see: https://discourse.mc-stan.org/t/poisson-binomial-distribution-any-existing-stan-implementation/

functions {

  /**
   * Return the Poisson-binomial log probability mass for the specified
   * count y and vector of probabilities theta.  The result is the log
   * probability of y successes in N = size(theta) independent
   * trials with probabilities of success theta[1], ..., theta[N].
   *
   * See:  https://en.wikipedia.org/wiki/Poisson_binomial_distribution
   *
   * @param y number of successes
   * @param theta vector of probabilities
   * @return Poisson-binomial log probability mass
   */
  real poisson_binomial_lpmf(int y, vector theta) {
    int N = rows(theta);
    matrix[N + 1, N + 1] alpha;
    vector[N] log_theta = log(theta);
    vector[N] log1m_theta = log1m(theta);

    if (y < 0 || y > N)
      reject("poisson binomial variate y out of range, y = ", y,
             " N = ", N);
    for (n in 1:N)
      if (theta[n] < 0 || theta[n] > 1)
        reject("poisson binomial parameter out of range,",
               " theta[", n, "] =", theta[n]);

    if (N == 0)
      return y == 0 ? 0 : negative_infinity();

    // dynamic programming with cells
    // alpha[n + 1, tot + 1] = log prob of tot successes in first n trials
    alpha[1, 1] = 0;
    for (n in 1:N) {
      // tot = 0
      alpha[n + 1, 1] = alpha[n, 1] + log1m_theta[n];

      // 0 < tot < n
      for (tot in 1:min(y, n - 1))
        alpha[n + 1, tot + 1]
            = log_sum_exp(alpha[n, tot] + log_theta[n],
                          alpha[n, tot  + 1] + log1m_theta[n]);

      // tot = n
      if (n > y) continue;
      alpha[n + 1, n + 1] = alpha[n, n] + log_theta[n];
    }
    return alpha[N + 1, y + 1];
  }

}

data {
  int<lower=0> N; // N obs
  int<lower=0> K; // N covariates
  int<lower=0> G; // number of groups
  int<lower=0> y[G]; // observed success counts per group
  matrix[N, K] X; // covariates
  int<lower=1> gsize[G]; // vector of group sizes
}

// The parameters accepted by the model. We
// estimate an intercept and the covariates delta
parameters {
  real mu;
  vector[K] delta;
}

// Make the probability param
transformed parameters {
  vector[N] p; // probability
  p = inv_logit(mu + X * delta); 
}

// The model to be estimated. We model the output
// 'y' to be poisson-binomially distributed with 
// individual probabilities p that are a function of
// the parameters delta and mu
model {
  int pos;
  pos = 1;
  mu ~ std_normal();
  delta ~ normal(0,5);
  
  for (g in 1:G) {
    target += poisson_binomial_lpmf(y[g] | segment(p, pos, gsize[g]));
    pos = pos + gsize[g];
  }
}
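
On the R side, the data list just mirrors the data block; a minimal sketch, reusing the object names from the simulation sketch above (the rstan call is only an example):

# Data list matching the data block of the Stan program
stan_data <- list(N = N, K = K, G = G, y = y, X = X, gsize = gsize)

# e.g., with rstan:
# fit <- rstan::stan("poisson_binomial_logit.stan", data = stan_data,
#                    chains = 4, iter = 2000)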

The structure of the problem is similar to some ecological inference problems, with the advantage of knowing some individual covariate information. I’m not the first to see the link between the Poisson-binomial distribution and classical ecological regression problems like relating voting totals to individual characteristics.[^1] In the Rosenman paper the Poisson-binomial is approximated with a normal distribution; others have tried to approximate it with a binomial distribution. Perhaps this is because the problem is otherwise intractable? I tried a normal approximation and it went atrociously (not included), but I do include a binomial approximation, which generally works worse than the main model.
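
For reference, the usual moment-matched normal approximation replaces the Poisson-binomial with a normal whose mean and variance equal the exact first two moments of the group sum:

y_g \approx N\Big(\sum_{i \in g} p_{ig}, \ \sum_{i \in g} p_{ig}(1 - p_{ig})\Big)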

Problems

Anyway, what I am finding is that the estimated coefficients are consistently shrunk too far towards 0 (the prior), and coverage is poor. The posterior parameter estimates typically appear to be ‘moving in the right direction’ but don’t quite get there. Estimates are rarely terrible, but they’re certainly not great. See for instance this output from the logit model as an example (generated from the attached R script):

As you can see, the posteriors are all a bit off. In simulation, a 95% interval traps the true parameter only ~10% of the time, which is obviously not good.

More promisingly, the estimated probabilities are not atrocious. These are the median estimates of p from the model above:

For these 1000 p parameters, about half place the true value inside the central 95% of the posterior distribution. You can see a common pattern here: when p_{ig} is close to 0 or 1, the model shrinks its prediction towards 0.5, which yields a sort of flipped-sigmoid-curve look.

On the one hand, this is not great. On the other hand, it is much better than the “mean approximation” model I have included as an alternative. For the “mean approximation” model I just average the p_{ig} terms in each group and then estimate a binomial model Y_g \sim Bin(N_g, mean(p_{ig})). This approach is consistently less accurate.
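
In R terms, the per-group likelihood contribution under this approximation is just a binomial mass evaluated at the within-group average probability. A minimal sketch (the function name is mine, not from the attached code):

# Binomial "mean approximation" to one group's Poisson-binomial likelihood:
# average the stock-level probabilities, then treat the group total as binomial.
loglik_group_mean_approx <- function(y_g, p_group) {
  dbinom(y_g, size = length(p_group), prob = mean(p_group), log = TRUE)
}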

Advice

I’m at a dead-end in terms of ideas about how to improve the model. The probit version of the model is the correct generative model for the simulated data, but the Stan code is not recovering the parameters. I’ve tried fiddling with the priors, running the models longer, etc., and haven’t noticed a difference. Perhaps I need to be doing something more with the subsample of the data I observe more fully? I’m not sure how it could be used more aggressively.

Any advice about how to either 1.) improve the approach I’m currently pursuing or 2.) rethink the problem would be much appreciated. It’s unclear to me whether what I am trying to do is feasible: I cannot seem to find discussion of a GLM-style approach to the Poisson-binomial like the one I am taking here.

Obviously I would like to estimate all the parameters perfectly, but if that is impossible then I care more about getting p_{ig} well approximated so that I can make smart guesses about y_{ig}^*.

I attach 3 Stan programs (logit link function, probit link function, and the mean approximation) and an R script that generates fake data and runs the models. There is also a wrapper function in the R script that can be run to look at coverage, but it takes a while to run.

Code

discourse_example_poi_bin.R (6.0 KB)
poisson_binomial_logit.stan (2.8 KB)
poisson_binomial_mean_approx.stan (990 Bytes)
poisson_binomial_probit.stan (2.8 KB)

[^1]: Rosenman, “Some New Results for Poisson Binomial Models”, arXiv, gives a good overview.


Hi,
this is a great, well-researched, but hard-to-answer question :-)

Some incomplete thoughts:

Are the model diagnostics (divergences, Rhat, ESS) OK? Poor coverage like this almost certainly means that either the computation fails (which should be signalled by the diagnostics) or your Stan code does not match the simulator, i.e. there is a bug in at least one of the two :-)

A strategy I would try to debug the issue:

  • Alter your simulation code to simulate exactly from one of your approximate models. Do you get good recovery/coverage in this case? If yes, the regression part of the model is probably OK; if there are issues, there might be a bug in the regression part.
  • Try a model that only uses the Poisson-binomial in some very simple way, simulate data according to this simple model, and check whether you recover parameters and get good coverage. If not, the Poisson-binomial implementation might be wrong (a quick sanity check of the lpmf itself is sketched right after this list).
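
One concrete way to check the lpmf itself: compute the exact PMF by brute-force convolution in R and compare it with the Stan function, e.g. after exposing it via rstan::expose_stan_functions (the file name below is just the one attached above, and the comparison lines are only a sketch):

# Exact Poisson-binomial PMF by iterated convolution (reference implementation)
pb_pmf <- function(theta) {
  pmf <- 1
  for (p in theta) pmf <- c(pmf * (1 - p), 0) + c(0, pmf * p)
  pmf  # pmf[y + 1] = P(Y = y) for y = 0, ..., length(theta)
}

set.seed(1)
theta <- runif(8)

# rstan::expose_stan_functions("poisson_binomial_logit.stan")
# stan_pmf <- sapply(0:8, function(y) exp(poisson_binomial_lpmf(y, theta)))
# max(abs(stan_pmf - pb_pmf(theta)))  # should be ~0 up to floating point error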

You can be somewhat more rigorous about “recovery” and “coverage” and use SBC, for which I recently helped develop a package - check out the vignettes at Articles • SBC if you are interested.

So as long as the exact Poisson-binomial is tractable, I would focus there (and try to find the bug). I’ll note that an additional approximation approach that might make sense is the saddlepoint approximation. I have an intro on my blog with Stan code: Approximate Densities for Sums of Variables: Negative Binomials and Saddlepoint, but that focuses on sums of negative binomials (and the approximation is not very useful there). There is also a worked-out solution in R for sums of binomial variables: Liu & Quertermous: Approximating the Sum of Independent Non-Identical Binomial Random Variables, where I would expect a naive approximation to work worse than the saddlepoint (as witnessed by the fact that the pure binomial approximation does not work well for you). The saddlepoint is, however, computationally costly, and if the number of variables you sum is small, the exact solution may still be faster in the end.

To summarise: I wouldn’t be too eager to try the saddlepoint, but I wanted to share it as an additional option that could improve performance.

Best of luck with your model!

While this isn’t very helpful at the moment, we do have the poisson_binomial distribution implemented, but not yet exposed. The relevant GitHub issue for exposing the distribution is here: Add poisson_binomial lpmf/lcdf/lccdf/rng · Issue #618 · stan-dev/stanc3 · GitHub

@Simon I think you were taking the lead on this one, is that right?
