I’m struggling with a problem that isn’t directly related to Stan, though it influences whether Stan is an appropriate sampler for the model. The actual project is about ungulate population dynamics in relation to hunting, but there are a lot of background and details required to describe the full model. Instead, I’ll describe below a simpler problem that mirrors my issue, hoping that one of the stats wizards that dwell here might have a potion of insight.

Assume there is a worm of length L that has been split into N segments, each of length x_i , but only M segments are observed. As a starting point, let’s for simplicity assume that L=1. Each segment i produces K_i eggs at a rate r_i. The rates vary both randomly as well as in relation to the size of the segment. The way I’m thinking of a model for this problem is

{\bf{x}} \sim {\rm{Dirichlet}}\left( {\bf{y}} \right) ,

where {\bf{x}} = {x_1},{x_2},...,{x_N} are the N segment lengths and the distribution is symmetric such that all elements of the vector {\bf{y}} , of length N, contains the same value a. Further, the number of eggs for each segment can be modeled as

{K_i} \sim {\rm{Gamma–Poisson}}\left( {{\nu _i},\alpha } \right),

where \alpha is the shape parameter of the associated gamma distribution and {\nu _i} is the expected number of eggs for segment i, which is assumed to vary with its length, however not necessarily proportionally. The way I’ve modeled this in previous models is to define

{\nu _i} = \mu {x_i}\exp \left( {\phi \log \left( {{x_i}{m^{ - 1}}} \right)} \right),

where \mu is the average rate for a segment of average length m, which here would be 1/N, and \phi models the effect of segment length. It is not essential that the function has exactly this form, but neither the simplifying assumption that rate is proportional to segment length (\phi=0) nor that it’s independent of length (\phi=-1) is sufficient; shorter segments compensate partly but not fully.

I’m seeking a formula for the probability of total number of eggs produced by the unobserved segments, \hat K = \sum\limits_{i = M + 1}^N {{K_i}} , conditional on the observed data ( {{\bf{x}}_{obs}} = {x_1},{x_2},...,{x_M} and {{\bf{K}}_{obs}} = {K_1},{K_2},...,{K_M}) and model parameters (\phi, a, \mu, \alpha). Theoretically, I could explicitly estimate in an MCMC the total number of segments N (confined to the range [M+1,\inf]) along with their unobserved lengths {{\bf{x}}_{unobs}} = {x_{M + 1}},{x_{M + 2}},...,{x_N} and produced eggs {{\bf{K}}_{unobs}} = {K_{M + 1}},{K_{M + 2}},...,{K_N}. That, however, would require RJMCMC to jump between models of varying and unknown parameter dimensions ({{\bf{x}}_{unobs}} has an unknown number of elements), which is typically computationally challenging. I think this would particularly be an issue with especially using Stan’s HMC algorithm (which so far works great for everything else in the model, so I rather not move to a different sampler) since there are no gradients to evaluate in the dimension jump. The full problem entails several thousands of “worms” (with on average around 10 observed segments covering 25% of the length, yet observed coverage varies from 0% to 100%) that ultimately would be modelled in some hierarchical model. Thus, there are many dimensions to explore.

Does anyone see a way to integrate/sum out N, {{\bf{x}}_{unobs}}, {{\bf{K}}_{unobs}} and arrive at a statement for P\left( {\hat K|a,\alpha ,\mu ,\phi ,{{\bf{K}}_{obs}},{{\bf{x}}_{obs}}} \right)? Preferably, I’d like to both be able to evaluate the log-likelihood of this (for me elusive) distribution in Stan and sample \hat K from it in `generated quantities`

. Grateful for any insight. Attaching a figure to illustrate my problem.