Dirichlet-Gamma-Poisson mixture distribution, or something like that?

I’m struggling with a problem that isn’t directly related to Stan, though it influences whether Stan is an appropriate sampler for the model. The actual project is about ungulate population dynamics in relation to hunting, but there are a lot of background and details required to describe the full model. Instead, I’ll describe below a simpler problem that mirrors my issue, hoping that one of the stats wizards that dwell here might have a potion of insight.

Assume there is a worm of length L that has been split into N segments, each of length x_i , but only M segments are observed. As a starting point, let’s for simplicity assume that L=1. Each segment i produces K_i eggs at a rate r_i. The rates vary both randomly as well as in relation to the size of the segment. The way I’m thinking of a model for this problem is
{\bf{x}} \sim {\rm{Dirichlet}}\left( {\bf{y}} \right) ,
where {\bf{x}} = {x_1},{x_2},...,{x_N} are the N segment lengths and the distribution is symmetric such that all elements of the vector {\bf{y}} , of length N, contains the same value a. Further, the number of eggs for each segment can be modeled as
{K_i} \sim {\rm{Gamma–Poisson}}\left( {{\nu _i},\alpha } \right),
where \alpha is the shape parameter of the associated gamma distribution and {\nu _i} is the expected number of eggs for segment i, which is assumed to vary with its length, however not necessarily proportionally. The way I’ve modeled this in previous models is to define
{\nu _i} = \mu {x_i}\exp \left( {\phi \log \left( {{x_i}{m^{ - 1}}} \right)} \right),
where \mu is the average rate for a segment of average length m, which here would be 1/N, and \phi models the effect of segment length. It is not essential that the function has exactly this form, but neither the simplifying assumption that rate is proportional to segment length (\phi=0) nor that it’s independent of length (\phi=-1) is sufficient; shorter segments compensate partly but not fully.

I’m seeking a formula for the probability of total number of eggs produced by the unobserved segments, \hat K = \sum\limits_{i = M + 1}^N {{K_i}} , conditional on the observed data ( {{\bf{x}}_{obs}} = {x_1},{x_2},...,{x_M} and {{\bf{K}}_{obs}} = {K_1},{K_2},...,{K_M}) and model parameters (\phi, a, \mu, \alpha). Theoretically, I could explicitly estimate in an MCMC the total number of segments N (confined to the range [M+1,\inf]) along with their unobserved lengths {{\bf{x}}_{unobs}} = {x_{M + 1}},{x_{M + 2}},...,{x_N} and produced eggs {{\bf{K}}_{unobs}} = {K_{M + 1}},{K_{M + 2}},...,{K_N}. That, however, would require RJMCMC to jump between models of varying and unknown parameter dimensions ({{\bf{x}}_{unobs}} has an unknown number of elements), which is typically computationally challenging. I think this would particularly be an issue with especially using Stan’s HMC algorithm (which so far works great for everything else in the model, so I rather not move to a different sampler) since there are no gradients to evaluate in the dimension jump. The full problem entails several thousands of “worms” (with on average around 10 observed segments covering 25% of the length, yet observed coverage varies from 0% to 100%) that ultimately would be modelled in some hierarchical model. Thus, there are many dimensions to explore.

Does anyone see a way to integrate/sum out N, {{\bf{x}}_{unobs}}, {{\bf{K}}_{unobs}} and arrive at a statement for P\left( {\hat K|a,\alpha ,\mu ,\phi ,{{\bf{K}}_{obs}},{{\bf{x}}_{obs}}} \right)? Preferably, I’d like to both be able to evaluate the log-likelihood of this (for me elusive) distribution in Stan and sample \hat K from it in generated quantities. Grateful for any insight. Attaching a figure to illustrate my problem.

2 Likes

I think you should be able to get an exact solution or a reasonable approximation based on details of the additional assumptions you are willing to make.

If I understand you correctly, what you call Gamma-Poisson is an alternative name for the negative binomial distribution (noting this, because most of the resources I will point to discuss this as neg. binomial).

Let’s attack some subproblems:

If {\bf{x}} \sim {\rm{Dirichlet}}\left( {\bf{y}} \right) then
(x_1, ... , x_M, \sum_{i=M+1}^N x_i) \sim \rm{Dirichlet}(y_1, ... y_M, \sum_{i=M+1}^N y_i)
(via Dirichlet distribution - Wikipedia)

Warning: In the next few paragraphs, I will follow convention for negative binomial distribution and use \mu, \phi to mean something different than what it means in your post

In Stan, the parametrization behind neg_binomial_2(mu, phi), which I am more familiar with, corresponds (if I am correct) directly to your parametrization such that \mu_i = \nu_i, \phi = \alpha. If I am messing up, the arguments below still roughly hold, but the correspondence is different, please check this for yourself.

I’ve spent some time playing with the distribution of sum of negative binomial (NB) variables. The sum is exactly NB distributed as long as the Fano factor (mean/variance) of all elements is constant. I.e. as long as \frac{\phi_i}{\phi_i + \mu_i} = f for all i, the sum distribution is NB(\bar \mu, \bar \phi) with

\bar \mu = \sum \mu_i \\ \bar \phi = \frac{ \left(\sum \mu_i \right)^2 }{\sum \frac{\mu_i^2}{\phi_i}}

Proof outline: In this case, the underlying Gamma distributions have the same rate and thus their sum is also Gamma - distributed.

If the Fano factor varies, the NB distribution with those parameters is still (usually) a good approximation. I tried improving over it with saddlepoint aprroximation, but it doesn’t help much. See my blog on the topic for more details.

Importantly, the distribution of the sum very quickly looses any information about the individual components. The take home is that even if we know \hat K, we have learned very little about the constituent \nu_i.

You could probably parametrize the sum as \hat K \sim NB(\sum_{i=M+1}^N \nu_i, \alpha q) where q would encompass all the effects of the division. But this is just a guess…

Warning: Now I get back to using \mu, \phi the same way you use it in your post. Sorry for any confusion

So the remaining issue is to build a link between \sum \bf{x_{unobs}} = \sum_{i=M+1}^N x_i and \sum_{i=M+1}^N \nu_i. I would expect that you would be able to approximate this relationship - we are almost certainly unable to get anything sensible about the individual constituents, but maybe you could collapse the difference between \sum_{i=M+1}^N \nu_i and \mu \sum_{i=M+1}^N x_i into a new parameter to be fit per worm? But this is just a guess and I am too lazy today to do the math. Hope this helps anyway.

Best of luck with your model!

2 Likes

Thank you very much Marin for your reply. I’m ruminating over your suggestions. One question as I continue to do so: was it clear from my description of the problem that N is unknown?

1 Like

I believe (though I might be wrong) that the outline I gave requires you to only know the total length of the missing segments (\sum \bf{x_{unobs}}). Knowing \hat K for at least some observations would likely help a lot in identifying the additional parameters the approximations require. The main idea (that, once again, might be wrong) is that the distribution of \hat K is very likely to be quite similar between an “even” distribution of the remaining length into segments of “typical” length and other possible segmentations. One could thus likely derive the distribution for such an “even” segmentation and introduce a suitable additional parameter to account for the differences from this ideal situation.

But I have to say that this is currently a hunch/guess on my side and it is possible I am overlooking some structure in the problem.

Does that make sense?

Thank you very much for your pointers. So far, I’ve managed to convince myself that, yes, at least as long as the shape parameter of the negative binomial (which is the same distribution as the Gamma-Poisson) is the same for all samples, the sum of the samples are also a negative binomial. Attaching some R code and a figure in case it’s helpful for anyone else. It has a slightly different model for size dependence to remove m from the equation: {\nu _i} = \mu x_i^\phi , where \phi=1 means a linear dependence on segment length. (sorry, using my original notation here)

library(tidyverse)
library(cowplot)
as=c(100,.01)# shape parameter of Dirichlet distribution
Ns=c(10,1000) #number of segments
propobs=.5 #prop of unobserved segments
meanr=10 #mu in my original post
sizeeffect=0.5 # phi in my original post
alpha=4 #shape parameter of Gamma-Poisson/NegBin
reps=100000 #number of replicates

pan=0 #for plotting

for (a in as){
for (N in Ns){
M=round(N*propobs) #number of unobserved segments
pan =pan+1
Khat=Khatsum=array(dim = reps)
for (repi in 1:reps){
#simulate segments by Dirichlet distribution
x=rgamma(n=N,shape = a,rate=1)
x=x/sum(x);
###

nu=x^sizeeffect*meanr #mean rate for each segment

Kobs=rnbinom(M, mu = nu[1:M], size = alpha)
Kunobs=rnbinom(N-M, mu = nu[M+1:N], size = alpha)

Khat[repi]=sum(Kunobs)
Khatsum[repi]=rnbinom(1, mu = sum(nu[(M+1):N]), size = sum(nu[(M+1):N])^2/sum(nu[(M+1):N]^2/alpha))

}
method=c(rep("Individual",reps),rep("Sum",reps))
KH=c(Khat,Khatsum)
df<- as.tibble(cbind(method,KH))
df$method=as.factor(df$method)
df$KH=as.integer(df$KH)

p[[pan]] <- ggplot(df, aes(x=KH, color=method)) +
geom_histogram(fill="white", alpha=0.5, position="identity",binwidth=1) +
ggtitle(paste("Dirichlet shape=",a, ", N-M=",N-M))
print(pan)
}
}

plot_grid(p[[1]],p[[2]],p[[3]],p[[4]], labels = "AUTO")


I think the next step would be to assume that N-M is known and figure out what the conditional distributions of (using my original notation) \sum {{\nu _i}} and \sum {\nu _i^2} (or maybe even better a continuous approximation that permits e.g. 10.2 segments so I don’t have to model this in Stan as a mixture distribution with unknown number of components.) Grateful for any suggestions about how to pursue that. Either way, thanks for pointing me in the right direction.

Also, for clarity, \hat K is never known for my problem of inference. It is a quantity I’m aiming to predict. I however also plan to split up my worm segments into training and validation sets to see how well different models are at predicting the latter based on the former. For that purpose, I would need a distribution for which the log-density can be evaluated for observed \hat K. Sorry about any confusion.

I realized there is an important component missing in the model: you don’t seem to have a generative process to actually choose N. I guess it is somewhat implied - you probably assume that the segments you observed are a random draw from all the segments (Is that true? Couldn’t probability of observing a segment depend on segment length?) so the number of missing segments is constrained by having similar distribution of lenghts as the ones you observed. I think making this explicit is important as it will let you sample from this process directly. Sorry for not noticing this earlier.

If, for example, the breaking points are assumed independent (precisely: memoryless), one could treat the break locations as a Poisson point process. This would imply an exponential distribution of segments length. Precisely: exponential distribution would be the case for an infinitely long worm, for a finitely-long worm the length of the last segment is constrained by the remaining length and I am currently not sure how to sensibly handle that. But if you have a lot of segments per worm, than the difference may be quite small. The distribution of observed segment lenghts is probably something you can check in your data. For the Poisson process you would get that number of breaks in any subsegment is Poisson-distributed. This would imply that N - 1 and N - M - 1 are Poisson distributed. Would that make sense?

If yes, then this IMHO mostly solves the issue of sampling \hat K: you fit the exponential distribution to observed segment lengths. You can then sample from this distribution to generate segment lengths dividing the unobserved part of the worm. All the other parameters should be well informed by the observed segments.

If you can’t assume memorylessnes (exponential distribution of segment length) the sampling process can generalize to other distributions for segment length (e.g. gamma), but the distributions for the number of segments will likely cease to be nice - I tried to search for some properties of “Gamma renewal process” but the literature seems focused on other problems than the expected number of breaks in an interval (this may be a signal that it is trivial or that it is unknown :-).

So maybe, if we focus on the distribution of segment lengths instead, we can sidestep the issue of computing the density of \hat K altogether? This would be very welcome, because even if we could assume the Poisson process for the breaks, marginalizing N - M out would still likely be tricky.

Just to clarify - I am quite sure, that it is the Fano factor of the NB distribution that needs to stay constant, which implies the rate parameter of the Gamma needs to be the same. So maybe we are just calling the parameters a different names? Or is there an actual misunderstanding? (It is also possible it is me who is wrong)

Makes complete sense - I used different notation only because I copied some math from elsewhere and only later realized you already use the same symbols and was too lazy to edit the math… Let’s continue with your original notation.

Thanks, Martin, for your suggestion. Your point about a process for generating N is well taken. I indeed have assumed that the probability of observing a segment is independent of its length. For the real problem, it’s debatable if that’s true, but it’s an assumption I am willing to live with, at least for now. Thinking about my problem in terms of a Poisson point process is maybe OK, but I can’t strictly assume that’s a good approximation. The data of “segments” indicate that some “worms” are more heterogeneous in their segment lengths than others, but maybe it’s not that important for the end result. I could probably explore the effect of deviations from through simulations if you think there is a way forward based on the assumption of the Poisson processes.

I do however need an expression to evaluate \hat K, at least to solve my larger problem (of which the focal issue is a part) in the neat way that I’d like. Sorry about being vague about this. Again, integrating out the number of missing segments—lets denote that Q=N-M — isn’t strictly necessary; I could very well include Q as an additional parameter in the analysis if needed—and even better if I then can approximate this as well as \hat K by a continuous variable.

I started looking at the absolute simplest case that Q=2, in which case the Dirichlet distribution is a Beta distribution, to see if I could work out the distribution of \bar \nu (\bar \mu in Martin’s first post) as needed to define a Negative Binomial from a sum of Negative Binomials. From the general formula to work out a distribution F\left( Y \right) of a transformed variable Y = h\left( X \right),

F\left( Y \right) = G\left( X \right)\left| {{{d{h^{ - 1}}\left( Y \right)} \over {dY}}} \right|,

I was hoping I could define

\eqalign{ & G\left( x \right) = {\rm{Beta}}\left( {x|a,a} \right) \cr & h\left( x \right) = \sum\limits_{i = 1}^2 {{\nu _i} = } \mu \left( {{x^\phi } + {{\left( {1 - x} \right)}^\phi }} \right) \cr}

and then work out the determinant of the inverse transformation {h^{ - 1}}, but I quickly realize there’s no way to solve the inverse explicitly. I think I’ve confused myself into a corner here…

I also tried another approach, starting from your assumption that unobserved segment lengths are exponentially distributed. If Q is large, \sum\nolimits_{i = 1}^Q {{v_i}} and \sum\nolimits_{i = 1}^Q {v_i^2} would be the first and second raw moment, respectively, multiplied by Q. These assumptions might be a starting point to something more refined or, best case, a good enough approximation. I ended up with

G\left( x \right) = \lambda \exp \left( { - \lambda x} \right),

where \lambda is the rate parameter of the exponential distribution,

{h^{ - 1}}\left( Y \right) = {Y^{1/\phi }},

\left| {{{d{h^{ - 1}}\left( Y \right)} \over {dY}}} \right| = {1 \over \phi }{Y^{1/\phi - 1}}

and

\sum\nolimits_{i = 1}^Q {{v_i}} \approx Q \mu {\lambda \over \phi }\int\limits_0^\infty {{Y^{1/\phi }}\exp \left( { - \lambda {Y^{1/\phi }}} \right)dY}
\sum\nolimits_{i = 1}^Q {v_i^2} \approx Q \mu {\lambda \over \phi }\int\limits_0^\infty {{Y^{1/\phi + 1}}\exp \left( { - \lambda {Y^{1/\phi }}} \right)dY} ,

but I don’t know how to solve the integrals for anything but special cases of \phi.

Thoughts?

Edit: I think maybe I can solve those integrals now that I look at it again. Either way, happy for any thoughts on the general approach.

I fear vagueness could cost us a lot of wasted effort here. What do you mean by "evalute \hat K"? Because being able to draw from distribution of \hat K given all the other parameters is IMHO much easier than evaluating the log-mass of an observed \hat K value given the other parameters. Do you really need both? In other words: do you need to use \hat K as a source of information? Or is it simply a derived characteristic you would like to compute after extracting information from the K_i? Or do you just want to do cross-validation against a held-out part of the dataset? Or is there something else you need?

I don’t think I follow what you are trying to accomplish here… My reasoning was really very simple (but maybe I didn’t explain it very well): since we end up building the neg.binomial by matching the mean and variance of the sum, we might as well just work with mean and co-variance all the way.

So assuming \phi = 0 for simplicity, we get

\rm{E}(\hat K) = \rm{E}(\sum \nu_i) = \rm{E}(\sum \mu x_i) = \mu \\ \rm{Var}(\hat K) = \sum \rm{Var}(K_i) + 2 \sum_{1 < i < j < Q} \rm{Cov}(K_i, K_j) = \\ = \sum \mu x_i + \sum \frac{\mu^2 x_i^2}{\alpha} + 2 \sum_{1 < i < j < Q} \rm{Cov}(K_i, K_j) \\

The covariance terms are tricky and I don’t see any immediate way to compute them but if Q is large, the covariance terms will be small-ish. We could also probably approximate \rm{Cov}(K_i, K_j) \approx \rm{Cov}(\nu_i, \nu_j) = \rm{Cov}(\mu x_i, \mu x_j) = \mu^2 \rm{Cov}(x_i, x_j) = \mu^2 \frac{-a^2}{(Qa)^2(Qa + 1)}. I hope the generalization to other values of \phi doesn’t present any additional big challenges.

Assuming \hat K is roughly NB, the mean and variance are then enough to fully describe the distribution. Obviously a simulation study should be run to check the approximation is OK-ish.

But I note once again, that this is only needed if we need to evalute the mass of specific, observed \hat K given the parameters.

I honestly also didn’t completely understand the logic of this one, but I think I can follow the rough directions. (Note: I am not so good at math, despite possibly looking confident). I think trying to find the moments is sensible. Anyway, I can recommend Wolfram alpha (or Wolfram Cloud for more advanced uses) for solving integrals. Even the free versions can definitely solve integrals I am completely unable to crack :-). I can however see how you could (if Q is large-ish) assume that Q is roughly gamma distributed and treat it as additional parameter in the model…

But the point of assuming the exponential distribution was that you can simulate \hat K under this assumption very easily and you can easily do cross-validation against a held-out set of K_i values. However, I don’t think it lets you compute the mass for an observed \hat K in any easy way.

EDIT: I’ve been mixing density and mass in the post, so tried to fix that.

Hope I am making sense - maybe my proposals don’t work and I am overseeing some basic problem, so please treat my suggestions with some skepticism.

wxMaxima is your friend also. For a nice frontend, see this. I don’t do any integration/differentiation by hand if I can avoid it. This is not ~just~ out of lazyness: it also ensures fewer errors propagate through calculations.

1 Like

Sorry about the slow response to this after all good input. There were a fair few things I needed to get straight before bothering anyone more with my problem. And again, also sorry about previous vagueness… I do seek an expression to evaluate the likelihood of \hat K.

This is where I’m at with this now. First, modeling segment length proportional to the total length L=1, I assume that the M observed segment proportion x_1,x_2,...,x_M of total length \check L = \sum\nolimits_{i = 1}^Q { x_i} and the Q=N-M unobserved segment proportions \hat x_1,\hat x_2,...,\hat x_Q of total length \hat L =1-\check L = \sum\nolimits_{i = 1}^Q {\hat x_i} are drawn randomly from the total population of segments and model {\bf{X}} = {\left[ {{x_1},{x_1},...,{x_M},\hat L} \right]^{\rm{T}}} as
\bf{X} \sim Dirichlet\left( {\bf{A}} \right),
where \bf{A} of length M+1 is defined as
{A_i} = \left\{ {\matrix{ {a \textrm{ if } i = 1,2,...,M} \cr {aQ {\textrm{ if }} i = M + 1}. \cr } } \right.

Here is a little code snipped to estimate Q and a, lazily assuming uniform priors on the support of the parameters:

data {
int<lower=0> M;
vector[M+1] xobsandxunobssum;
}
parameters {
real<lower=0> a;
real<lower=1> Q;
}
transformed parameters{
vector[M+1] avec;
for(i in 1:M){
avec[i]=a;
}
avec[M+1]=Q*a;
}
model{
xobsandxunobssum ~ dirichlet(avec);
}



This assumes that Q is continuous, which is, I think, OK for the focal problem.

Further, as in above post, I assume random variation in egg production rate r_i can be modelled by a gamma distribution such that

{r_i} \sim {\rm{Gamma}}\left( {{\alpha _i},\beta } \right),

where the shape parameter \alpha_i depends on \rm{E}\left( {{r_i}} \right) = {\nu _i} as {\alpha _i} = {\nu _i}\beta and the rate parameter \beta is assumed to be constant, adhering to the assumption that the Fano factor (not shape like I confused myself about above, thanks @martinmodrak ), 1/\beta , is constant. Under this assumption, the rate of egg production by all unobserved segments

\sum {{r_i}} \sim {\rm{Gamma}}\left( {\sum {{\alpha _i}} ,\beta } \right) = {\rm{Gamma}}\left( {\beta \bar \nu ,\beta } \right),

where

\bar \nu = \sum {{\nu _i}}

is the Gamma distribution’s mean. Consequently, the likelihood of \hat K eggs produced by the unobserved segments follows a negative binomial with mean \bar \nu and \rm{size}=\beta \bar \nu .

I model the effect of segment length on expected rate with the parameter \phi as

\nu _i = \mu {\left( {{x_i \over {\bar x}}} \right)^\phi },

where the inclusion of average segment length, \bar x, is included to aid the interpretation that \mu is the average rate of egg production for a segment of average length. Because \bf{X} is normalized to sum to 1, \bar x = 1/N, where N=M+Q, and

\nu _i = h\left( x_i\right)=\mu {\left( {{x_i N}} \right)^\phi }.

If Q is large, the distribution of the unobserved segments should tend to a Gamma distribution with mean {\hat L}/Q and shape parameter a, corresponding to the Gamma rate parameter b=aQ/\hat L. I haven’t bother trying to prove this, but it follows from how Dirichlet random numbers are generated: simulate (here) Q gamma random numbers with shape a and arbitrary rate and normalize them such that they sum to one. Under this parameterization, the probability distribution of \nu is

{\rm{F}}\left(\nu \right) = {\rm{Gamma}}\left( {{{\rm{h}}^{ - 1}}\left( \nu \right)} \right)\left| {{{\rm{h}}^{ - 1}}\left( \nu \right)} \right|,

where

\left| {{{\rm{h}}^{ - 1}}\left( \nu \right)} \right| = {{N{{\left( {N\nu } \right)}^{{\phi ^{ - 1}} - 1}}} \over {\mu \phi }}

is the determinant of the inverse of variable transformation.

With “some” help from Mathematica, I derived an expression for the first moment of the distribution \rm{F}\left(\nu \right), i.e. the mean:

m={{{b^a}\Gamma (a + \phi ){{\left( {{{{{\left( {{1 \over \mu }} \right)}^{1/\phi }}} \over N}} \right)}^a}{{\left( {{{b{{\left( {{1 \over \mu }} \right)}^{1/\phi }}} \over N}} \right)}^{ - a - \phi }}} \over {\Gamma (a)}}.

This is finite as long a \phi > -a, and under this condition, there is a first moment approximation to \bar\nu as \hat\nu=Qm.

Now, the important question is if it’s a good enough approximation. I’ve made some simulations, here just focusing on the case where all segments are considered unobserved, i.e. Q=N. I’ve simulated by explicitly simulating individual segments’ egg production (green), by using the sum of individual segments’ egg production (\bar\nu, blue), and by using the approximation (\hat\nu, red). What the below figure hopefully demonstrates is that when the Dirichlet shape parameter a is large, indicating that segments are similar in size, or that \phi \approx 1, indicating approximately a linear relationship between segment length and egg production rate, the approximation produces similar results as the exact simulations.

However, when \phi indicates distinct deviations from a linear relationship and a is low, indicating large differences in segment size, the approximation starts to fail. For high Q (here 10000), it’s just a matter of of underestimating the variability, but for low Q (here 10), there is also some noticeable bias. I sort of understand why that happens. When there are only a handful of segments simulated and a is low, the un-normalized Gamma samples will in most cases not include samples from the tail, leading to normalized \bf{x} that are more similar than the expectation, which when phi<1 leads to a higher egg production rate (it is highest when all segments are of equal length). I think the approximation may be good enough for my problem, but I’d feel a lot more confident about it if I could account for the sampling variability here. I had to dig into dusty, decade-old corners of unused math to get to this point, so very grateful for any ideas about how to further refine the approximation.

2 Likes

No worries, just take this at your own pace (and I will take it on my own pace).

Just to understand the problem better - why exactly do you need it? Earlier you mentioned you wanted to do cross validation - is that why you need it? (because if it is just for cross validation, I think we can easily do cross validation without this density). Or is there something else? (because then maybe we can rephrase the problem to make the math easier).

You seem to have made a lot of progress - I am still trying to understand exactly what you did, will get back to this later.

There are at least two purposes. Firstly, I would like to evaluate the likelihood of \hat K in a validation data-set. Doing e.g. PSIS LOO CV is informative at the level of individual segments, but I also want to evaluate predictive performance at a higher level. I very rarely have all segments, but I could split up the segments into training and validation sets and aim to predict the total number of eggs produced by the latter based on parameterization based on the former without taking into account the information of individual segment lengths in the validation data. It’s an imperfect proxy for evaluating performance at the level of all unobserved segments.

Secondly, \hat K would feed into larger, temporal population models where “eggs” affects the dynamics. As such, there would be other information informing (a large number of) \hat K, but the model developed here would provide the information about \hat K from the direction of the data for each “worm.”

1 Like

OK, so I hope I understand roughly what you did, although I have a few missing pieces. I see that getting the moments of F(\nu) was actually non-trivial, so good job there. Do I get it right that a in the expression for m is the parameter of the underlying Dirichlet and b is the rate parameter of the Gamma? Or a and b are the shape and rate parameters of the underlying gamma distribution? In any case, the expression for F(\nu) = ... has only one parameter for the Gamma, so I am not sure I follow exactly…

Anyway, I think I understand, how you get to the mean of the approximating distribution. Do you take the final distribution to be neg. binomial? If so, I am not sure how you get the size (or variance) of that distribution…

A hunch (I am not sure if it will be feasible/will help): make an attempt to work more precisely also with the variance and covariance. This is important for two reasons:

1. The assumption of constant Fano factor (or Gamma rate) can be seriously limiting - it means that as your mean gets larger the sd/mean ratio shrinks so the distribution will be increasingly precise on the original scale (while for the parametrization in neg_binomial_2_lpmf you actually get roughly constant sd/mean, because variance is allowed to grow with the square of the mean)
2. The negative correlations across the individual components induced by the Dirichlet distribution could be important for the small Q case.

But before we get into the weeds a useful thing that should be easy to implement would be to try if you can approximate the empirical distribution well. I.e. if you take the simulated values (where you explicitly simulate all segments egg production), compute their mean and variance and then plot the negative binomial distribution with matching mean and variance, do you get a good approximation? Because all we are doing up to this point is a way to get to this mean and variance. If neg. binomial is not a good approximation for any parameters, then we are in trouble.

If that works, then we just need to find a way to find the parameters of the neg. binomial analytically. And since the Poisson part of the noise is independent, we can safely just find a Gamma approximation to \sum \nu_i and then add the Poisson noise back in (This is what you are already doing, right?).

Now - if I get it correctly, please check me twice - this is symbolically:

K_i \sim \rm{NB_2}(\nu_i, \tau) \Leftrightarrow \\ K_i \sim \rm{Poisson}(r_i), r_i \sim \rm{Gamma}(\tau, \frac{\tau}{\nu_i}) \\ \hat K = \sum K_i \Leftrightarrow \\ \hat K \sim Poisson(\sum r_i)

I introduced \tau for the shape parameter, as the \alpha letter seems to be used at other places.

As for 1), since the rate parameter of the gamma is allowed to vary, we no longer have \sum r_i \sim \rm{Gamma}( \rm{something}), but we will approximate it this way by matching moments.

You have already worked out how to get m = E(r_i) (i.e. combining the Gamma distribution with the h transform), but we would still need to work out the variance - is it possible?

Now for 2), the covariance. I am building slightly on top of one of my previous posts, but to adjust it to the situation at hand, we have:

\rm{Var}(\sum r_i) = \sum \rm{Var}(r_i) + 2 \sum_{1 < i < j < Q} \rm{Cov}(r_i, r_j) = \\ = Q (\rm{Var}(r_1)) + Q(Q - 1) \rm{Cov}(r_1, r_2) \\

(since we assume that r_i are interchangeable). And we could approximate the covariance via the covariance of the Dirichlet. So having a as the shared Dirichlet parameter across all the segments, we would have

\rm{Cov}(r_1, r_2) \approx \rm{Cov}(h(x_1), h(x_2)) \approx \frac{h(x_1)}{x_1} \frac{h(x_2)}{x_2} \rm{Cov}(x_1, x_2) \approx \\ \rm{E} \left( \frac{h(x_1)}{x_1} \right) ^2 \frac{-a^2}{(Na)^2(Na + 1)}

Where \rm{E} \left( \frac{h(x_1)}{x_1} \right) should be hopefully possible to find analytically, or one could just put another approximation, maybe

\rm{E} \left( \frac{h(x_1)}{x_1} \right) \approx \frac{h(E(x_1))}{E(x_1)}

would be enough…

Obviously, it is possible I am wrong. But if you haven’t tried something like this already, I would give it a try.

Best of luck and hopefully I am not putting you on a wrong track.