Compound Sampling-Binomial?


Hi,

I’m curious if there are any solutions to the following problem.

I’ve tried to simplify the problem I’m actually facing as much as possible, so I’ll try to illustrate it using a ball-and-urn analogy:

Problem:

There are I different pools of uniquely coloured balls. The balls are sampled from each pool into a single urn such that the probability of x_i balls being sampled from the i^{th} pool is Poisson distributed with rate \lambda. Following this sampling for all I pools, there are now X balls in urn one.
K balls are now sampled with replacement from urn one into a second urn. I now have a vector of counts, k_i for each of the I unique colours.

(I include 0 counts, so no need to condition on k > 0).

Hence, the probability of observing any given colour i exactly k times in the second urn is

p(k) = \sum_{x=0}^{X} p(k | x) p(x)

I think you would classify this as a compound Poisson-Binomial process (?) where,

x_i \sim Pois(\lambda) and k_i \sim Binom(K, \frac{x_i}{X}) , where X = \sum_{i=1}^{I}x_i
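To make the setup concrete, here’s a small R sketch of how this process can be simulated (the values of I, \lambda and K below are arbitrary placeholders, not my real data):

    # Illustrative simulation of the ball-and-urn process described above.
    set.seed(1)

    I      <- 50    # number of uniquely coloured pools
    lambda <- 10    # Poisson rate for draws from each pool
    K      <- 200   # number of balls sampled with replacement into urn two

    x <- rpois(I, lambda)   # x_i: balls of colour i in urn one
    X <- sum(x)             # total number of balls in urn one

    # K draws with replacement from urn one; k_i is the count of colour i in urn two.
    draws <- sample(seq_len(I), size = K, replace = TRUE, prob = x / X)
    k     <- tabulate(draws, nbins = I)

    # Marginally, k_i | x ~ Binomial(K, x_i / X); jointly the k_i are multinomial:
    # k <- as.vector(rmultinom(1, size = K, prob = x / X))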

I want to provide a vector of k_i values to Stan and estimate the rate parameter \lambda.

This is my attempt at a custom lpmf function, written to return the log probability for p(k):

functions{
  real samp_K_Pois_lpmf(int[] ki, real lam, int K, int X1, real X2){
    // Need two Xs - one for indexing (int) and one for prob (real).

    vector[num_elements(ki)] lprob;
    vector[X1 + 1] log_px;    // log p(x)
    real log_pkx;             // log p(ki | x)
    vector[X1 + 1] log_pkxxs; // log p(ki | x) + log p(x)
    real bin_px;              // (x / X) - probability of success in Binom.

    // Pre-calculate the log p(x)s to speed up.
    for(x in 0:X1){
      // Shift to x + 1 for 1-based indexing.
      log_px[x + 1] = poisson_lpmf(x | lam);
    }

    for(i in 1:num_elements(ki)){
      for(x in 0:X1){
        bin_px = x / X2;
        log_pkx = binomial_lpmf(ki[i] | K, bin_px);
        log_pkxxs[x + 1] = log_px[x + 1] + log_pkx;
      }
      lprob[i] = log_sum_exp(log_pkxxs);
    }

    return sum(lprob);
  }
}
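For reference, here’s a rough R sketch of the same marginal calculation that I could use to sanity-check the Stan function against simulated data (the function name, the log_sum_exp helper and the use of base R dpois/dbinom are just my own illustrative choices):

    # Illustrative R version of the marginal log-probability computed above:
    # log p(k_i) = log sum_{x=0}^{X} exp( log p(k_i | x) + log p(x) ).
    log_sum_exp <- function(v) { m <- max(v); m + log(sum(exp(v - m))) }

    samp_K_pois_lpmf_R <- function(ki, lam, K, X) {
      xs     <- 0:X
      log_px <- dpois(xs, lambda = lam, log = TRUE)                # log p(x)
      lprob  <- vapply(ki, function(k) {
        log_pkx <- dbinom(k, size = K, prob = xs / X, log = TRUE)  # log p(k | x)
        log_sum_exp(log_pkx + log_px)
      }, numeric(1))
      sum(lprob)
    }

    # e.g. samp_K_pois_lpmf_R(ki = k, lam = lambda, K = K, X = X), using the
    # simulated values from the sketch above, to compare against the Stan output.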

Questions:

  • Assuming I’ve not made any simple maths errors - is this the right way to go about coding up this problem in Stan?
  • The actual problem I’m tackling has a lot of data points (the vector of k_i s) - even with the log p(x) values pre-calculated to save time, I suspect this will become very slow very quickly.
  • If I have managed to summarise the problem in terms of p.m.f.s correctly, and it is going to be very slow, are there any creative ways to avoid this double for-loop? Or to re-frame the problem entirely for efficiency’s sake?

I’m relatively new to both modelling and Stan, so don’t hesitate to say if I’ve gone about this entirely the wrong way or to point out any obvious mistakes!

Thanks in advance for any help.


One thing you can do is to “compress” the counts, as shown here.

That said, I’m not so sure about this model. Is there a motivation for modelling the probabilities as normalised Poisson counts?
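As a rough sketch of what I mean by compressing (assuming, as in your description, that an observation’s log-likelihood depends only on its count; the toy vector k below is purely illustrative):

    # Illustrative compression: observations with the same count contribute the
    # same log-likelihood, so tabulate the unique counts and weight by frequency.
    k <- c(0, 0, 0, 1, 2, 2, 5)             # toy vector of observed counts

    tab      <- table(k)
    k_unique <- as.integer(names(tab))      # unique observed counts
    freq     <- as.integer(tab)             # frequency of each unique count

    # Total log-likelihood = sum over unique counts of freq * lpmf(unique count);
    # lpmf_one() stands in for the per-observation log-likelihood (e.g. your
    # custom Stan lpmf evaluated for a single count).
    total_loglik <- function(lpmf_one, ...) {
      sum(freq * vapply(k_unique, lpmf_one, numeric(1), ...))
    }

Inside Stan the equivalent would be to pass the unique counts and their frequencies as data and weight each per-count lpmf by its frequency.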


I’m a bit confused here. Before digging into the code, I want to make sure that I have the concepts/math right.

In your specification, \lambda isn’t indexed by i. Is this intentional (all processes share the same \lambda)? So you’re trying to infer \lambda based on the sampling variation in the relative abundance of each type in urn 1?

Or do you mean to index \lambda_i by i? Unless I’ve missed something, this won’t work, because:

there’s no constraint on the total number of balls in urn 1. If I use large lambdas, I can tune the relative abundances of the x_i with arbitrary precision, and I can always get the same relative abundances with equal or better precision by multiplying all of my lambdas by a constant greater than 1. Or do you have repeated measures of the k's under resampling of the x's?
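To make the scaling argument concrete, here’s a rough numerical illustration (the lambdas below are arbitrary): multiplying all of the lambdas by a common factor leaves the expected relative abundances unchanged and only shrinks their sampling variability.

    # Rough illustration: relative abundances are unchanged in expectation when
    # all lambdas are scaled by a common factor; they only become less noisy.
    set.seed(1)
    lambdas <- c(2, 5, 10, 20)

    rel_abund <- function(scale, n_rep = 10000) {
      x <- matrix(rpois(n_rep * length(lambdas), scale * lambdas),
                  ncol = length(lambdas), byrow = TRUE)
      sweep(x, 1, rowSums(x), "/")              # x_i / X for each replicate
    }

    round(colMeans(rel_abund(1)), 3)            # expected proportions
    round(colMeans(rel_abund(10)), 3)           # same proportions...
    round(apply(rel_abund(1), 2, sd), 3)        # ...but
    round(apply(rel_abund(10), 2, sd), 3)       # much smaller spread with scale = 10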

Finally, the binomial draw k_i \sim \text{Binom}(K, x_i/X) does not enforce the constraint that the sum of the k_i be equal to some pre-specified K, but your verbal description of the problem suggests that you do want this constraint.


Thank you both for your replies.

@maxbiostat - thanks for suggesting compressing the counts. That’s very smart - the real data will consist of many repeated counts, so this will speed things up markedly.

@jsocolar - thanks for sticking with me on the conceptual stuff. This has already made me think again about how I want the model to behave. I’ve had a play around with simulating the process today and a few things have cropped up, so I have tried to focus on just the most important ones here.

You were absolutely correct when saying I’m “trying to infer ‘some population-wide parameter’ based on the sampling variation in the relative abundance of each type in urn 1”. I think I’m just struggling to formulate this appropriately as a model.

First, I think I’ve shot myself in the foot by trying to simplify the problem to a Poisson distribution, where a single parameter controls both the mean and the variance, so I’ll try to reframe the problem using some hypothetical p.m.f. D parameterised with mean \mu and variance v.

  1. The first urn is filled by drawing balls from I uniquely coloured pools, where all draws follow the distribution D(\mu, v). I begin by assuming I know \mu, but not v (neither indexed by i).
  2. I observe the total number of balls in urn 1, X, such that \sum_{i=1}^{I}x_i = X.
  3. However, I cannot observe the vector of x_i s.
  4. Balls are sampled from urn 1 K times with replacement to urn 2, so our final observable vector is of I counts, k_i, where \sum_{i=1}^{I}k_i = K.
  • Is this now a) possible in theory, and, if so, b) possible by subbing the hypothetical p.m.f. in for the Poisson in my Stan function?
  • You already hinted at one of the next steps, which would be using multiple samples of size K from urn 1. Am I right in thinking this is an easier problem because the variation between the repeated samples of size K for a given x_i gives you more information about the variance of x_i in urn 1?

My suggestion would be to first assume that you can observe the x_i and explore the problem of estimating v based on the observed x_i. The full problem will be at least this hard, so gaining some intuition here could be useful. How much data do you need (either in terms of I or in terms of the number of repeated experiments) to get a good estimate of v? How strongly is your estimate of v controlled by the sampling variation of the normalized x_i's versus by the disagreement between \mu * I (which is fixed) and X?

I say this because it might be the case that this sub-problem already exposes pathologies or unrealistic data requirements, thereby saving you a lot of work.
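As a rough sketch of that sub-problem (using a negative binomial purely as a stand-in for your hypothetical D(\mu, v) - that choice, and all the numbers below, are mine):

    # Rough sketch: how well can v be recovered when the x_i are observed directly?
    set.seed(1)
    I    <- 200
    mu   <- 10
    v    <- 30                                # target variance (> mu, i.e. overdispersed)
    size <- mu^2 / (v - mu)                   # negative binomial: var = mu + mu^2 / size

    x <- rnbinom(I, mu = mu, size = size)     # "observed" x_i

    # Crude method-of-moments estimate of v; repeat over many simulated datasets
    # (and over different I) to see how much data gives a stable estimate.
    c(true_v = v, estimated_v = var(x))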


Awesome, will do - thanks for your help. This has helped unearth some holes in my intuitions. Definitely a case of going back to the toy models, and I’ll hopefully open a new thread when I have some more Stan-specific issues down the line.

Hi,

As a follow-up to this, is there a clever way to simulate compressed counts in the generated quantities block? I now have a custom likelihood function that I need to randomly generate values from to get the posterior predictive distribution (as per 27.1 Simulating from the posterior predictive distribution | Stan User’s Guide). However, if I try to simulate uncompressed data the memory will blow up, as I’ll have to simulate ~10^6 individual values for each draw from the posterior…

Don’t know if this will work, but what I’d do is figure out the range that, say, 99% of the counts are likely to fall in, and then treat the sequence c_1, \ldots, c_K as realisations of a multinomial with probabilities \alpha_1, \ldots, \alpha_K, where \alpha_i = \textrm{Pr}(X = c_i). This is not perfect, because it “censors” the counts; it might happen that the last category is noticeably more probable than you’d expect. That’d be a sign that you truncated too early and c_K is much too small. On the other hand, without further information, it’ll be hard to know whether your approximate posterior predictive really respects the tail of the true predictive. Others might have better ideas. @jsocolar?
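A minimal sketch of what I mean, with a Poisson pmf standing in as a placeholder for whatever count distribution a given posterior draw implies (the function name and numbers below are purely illustrative):

    # Illustrative simulation of *compressed* counts: rather than drawing ~10^6
    # individual values, draw how many observations land on each count 0, ..., c_K
    # directly from a multinomial, lumping the tail mass into the last category.
    simulate_compressed <- function(n_obs, c_K, count_prob) {
      alpha <- count_prob(0:c_K)                       # Pr(X = c) for c = 0, ..., c_K
      alpha[c_K + 1] <- 1 - sum(alpha[-(c_K + 1)])     # remaining tail mass -> last bin
      as.vector(rmultinom(1, size = n_obs, prob = alpha))
    }

    # Toy usage for one posterior draw, with a Poisson placeholder for Pr(X = c):
    freqs <- simulate_compressed(n_obs = 1e6, c_K = 30,
                                 count_prob = function(c) dpois(c, lambda = 3))

If the last bin keeps ending up with far more mass than expected, that’s exactly the early-truncation warning sign above.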


My main suggestion would be to consider whether you really need this simulation performed at every draw from the posterior. If you do not (maybe you only need 50 or 100 draws), then instead of doing this in generated quantities you can implement this simulation separately in R/python/whatever using the model output.

Note that if a margin of the posterior predictive distribution is approximately normal, then by the time you simulate 100 times the MCMC standard error for the marginal mean will be only a tenth as large as the standard deviation of the posterior margin itself (the Monte Carlo standard error of the mean scales roughly as the posterior standard deviation divided by \sqrt{n}, so a tenth for n = 100 roughly independent draws).

Note also that if your goal is posterior predictive checking, and you’d like to implement @maxbiostat’s strategy above (which looks like a good approach to me), then you can choose the coverage percentage (the X% of counts you aim to cover) to be large enough that failing to resemble the true data within that range lets you unambiguously conclude that the PPC fails.

Finally, I have trouble imagining what you would do with a posterior summary of 10^6 quantities, other than to compute some much lower-dimensional summary or function of those data for use downstream. If that’s the case, then you can just compute that function and monitor the output rather than saving and storing all of the simulated counts themselves.
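For example (a sketch with made-up summary statistics; sim_draws is a hypothetical list of simulated count vectors, one per posterior draw):

    # Illustrative: keep a few summaries per posterior draw instead of storing
    # ~10^6 simulated counts for every draw.
    summarise_draw <- function(sim_counts) {
      c(prop_zero  = mean(sim_counts == 0),
        mean_count = mean(sim_counts),
        max_count  = max(sim_counts))
    }

    # summaries <- t(sapply(sim_draws, summarise_draw))   # one row per posterior draw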


Hi both,

Thanks again for your quick replies.

Re: not simulating in generated quantities - I’ve just read Using Leave-one-out cross-validation for large data • loo, which also suggests doing everything in R and manually sub-sampling from the posterior draws, so it looks like this will get around a lot of the problems of trying to compress the counts in the generated quantities block.

The multinomial approach looks like a good idea and would speed up simulating for each posterior draw (even if I performed this purely in R? Am I right in thinking that I just need the simulated counts and the log-probability for a given posterior draw to perform posterior predictive checking in loo?).

I’ve already been setting an appropriate (but generous) c_K when fitting the model for speed’s sake, so I can carry this over.

(The 10^6 quantities are cell lineages - many become extinct, although there are on the order of 10^7 individual cells in the data - but lots of lineages share the same count, so the compression helps massively!)

Thanks again both.

As a follow-up, @jsocolar, I think you might have preemptively identified a problem. Because the custom R function that is passed to loo_subsample must return the log-likelihood for a single observation, I have to ‘un-compress’ the counts after calculating the LL for each unique count.

I think this means that trying to perform loo with loo_subsample becomes prohibitively slow (and that’s running on simulated data an order of magnitude smaller than the real data).

At the moment I have a custom likelihood function that I use within Stan for fitting the model. This can handle the large numbers because it works with the compressed counts. You suggested that I might instead work with some summary statistic of the data - would that be a remedy to this problem when trying to perform posterior predictive checks, or something more general?

I don’t completely understand your workflow, but I guess you are saying:
You have a dataset with 10^6 observations, and you want to perform loo on it. However, you don’t have enough memory to store the entire N x 10^6 pointwise log-likelihood matrix (N being the number of posterior draws). Is this the crux of the problem?

If so, then I think the simplest solution would be to do the subsetting in generated quantities itself. If you want to redo the computation on multiple subsets, then you could fit the model and instead of computing the pointwise log-likelihood in generated quantities, compute it for an arbitrary subset of points in R or python using the fitted model object. Finally, you might consider your options for running this job in an environment with more RAM. Storing 1000 * 10^6 numbers in RAM should consume on the order of 8GB.

Do I understand correctly that all observations with the same count have the same LL (i.e. there are no covariates?). If so, then you can just get the LL for one observation of each count, and reuse it N times, where N is the number of observations with that count.
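In R that reuse is just an indexing operation - a minimal sketch, assuming you already have a pointwise log-likelihood matrix computed only for the unique counts (all names below are illustrative):

    # Illustrative expansion of a compressed pointwise log-likelihood matrix:
    # ll_unique has one column per unique count (rows are posterior draws);
    # expand to one column per observation by indexing, without recomputation.
    expand_loglik <- function(ll_unique, k_unique, k_obs) {
      ll_unique[, match(k_obs, k_unique), drop = FALSE]   # draws x n_obs
    }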

I think I have this working now, so I’ll outline what I ended up doing (I used a combination of ideas suggested by you both, @jsocolar and @maxbiostat):

  • I created pointwise log-likelihoods for each unique count in the generated quantities block whilst fitting the model - these only needed to be calculated for each unique count, so this was not memory intensive.
  • (In R) I chose a subset of the iterations on which to perform loo. I expanded the pointwise LLs by their observed frequencies in the original data, whilst making sure that the order of the expanded LLs matched that of the original (uncompressed) count data.
  • I realised that I can’t do LOO-PIT with discrete data (after staring at terrible LOO-PIT plots for some time… I eventually found this useful thread: Worst LOO-PIT plots ever. PP Checks great)
  • I still simulated new data from the posterior draws in my sub-sample of iterations by calculating the vector of probabilities \alpha_{1}, ..., \alpha_{K} up to some ‘generous’ upper count, c_{K}, and using these as probabilities in a multinomial distribution.
  • I used these simulated values for graphical posterior predictive checks, as per Graphical posterior predictive checks using the bayesplot package • bayesplot (a rough sketch of this last step is below).
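A rough sketch of how that last step looks on the compressed scale (the observed and simulated frequencies below are toy stand-ins generated with a Poisson placeholder, not my real model outputs):

    # Illustrative PPC on the compressed scale: observed frequency of each count
    # value versus the spread of simulated frequencies across retained draws.
    library(bayesplot)
    set.seed(1)

    c_K <- 30
    # Toy stand-ins: observed frequencies of counts 0, ..., c_K, plus simulated
    # frequencies from 100 retained posterior draws (one row per draw), e.g.
    # produced by the multinomial simulation described earlier in the thread.
    obs_freq  <- as.vector(rmultinom(1, size = 1e6, prob = dpois(0:c_K, 3)))
    sim_freqs <- t(rmultinom(100, size = 1e6, prob = dpois(0:c_K, 3)))

    ppc_intervals(y = obs_freq, yrep = sim_freqs, x = 0:c_K)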

Just to clarify, your model has no covariates that influence the expected counts?

That is correct, yes.