Hi,
I’m curious if there are any solutions to the following problem.
I’ve tried to simplify the problem i’m actually facing as much as possible so i’ll try and illustrate using a ball and urn analogy:
Problem:
There are I different pools of uniquely coloured balls. The balls are sampled from each pool into a single urn such that the probability of x_i balls being sampled from the i^{th} pool is Poisson distributed with rate \lambda. Following this sampling for all I pools, there are now X balls in urn one.
K balls are now sampled with replacement from urn one into a second urn. I now have a vector of counts, k_i for each of the I unique colours.
(I include 0 counts, so no need to condition on k > 0).
Hence, the probability of observing any given i colour k times in the second urn is
p(k) = \sum_{x}^{X} p(k | x) p(x)
I think you would classify this as a compound Poisson-Binomial process (?) where,
x_i \sim Pois(\lambda) and k_i \sim Binom(K, \frac{x_i}{X}) , where X = \sum_{i=1}^{I}x_i
I want to provide a vector of k_i s to STAN and estimate the rate parameter \lambda.
This is the attempt of a custom lpmf function I have written to return the log_prob for p(k):
functions{
real samp_K_Pois_lpmf(int[] ki, real[] lam, int K, int X1, real X2){
// Need two Xs - one for indexing (int) and one for prob (real).
vector[num_elements(x)] lprob;
vector[X1 + 1] log_px; // log_p(x)
real log_pkx; // log_p(ki | x)
vector[X1+ 1] log_pkxxs; // log_p(ki | x) + log_p(x)
real bin_px; // (x/X) - probability of success in Binom.
// Pre-calculate the log_p(x)s to speed up.
for(x in 0:X1){
// Adjust to n + 1 for indexing.
log_px[x + 1] = poisson_lpmf(x | lam);
}
for(i in 1:num_elements(ki)){
for(x in 0:X1){
bin_px = (x/X2);
log_pkx = binomial_lpmf(k[i] | K, bin_pn);
log_pkxxs[x+1] = log_px[x+1] + log_pkx;
}
lprob[i] = log_sum_exp(log_pkxxs);
}
return sum(lprob);
}
}
Questions:
- Assuming i’ve not made any simple maths erorrs - is this the right way to go about coding up this problem in STAN?
- The actual problem i’m tackling has a lot of data points (the vector k_i) - even when pre-allocating the log_p(x)s to save time, I think this is very quickly going to become very slow?
- If I have managed to summarise the problem in terms of p.m.fs correctly, and it is going to be very slow, are there any creative ways to get around doing this double for-loop? Or re-framing the problem entirely for efficiency’s sake?
I’m relatively new to both modelling and STAN so don’t hesitate to say if i’ve gone about this in entirely the wrong way/point out any obvious mistakes!
Thanks in advance for any help.
/