I’m interested in using Stan for hierarchical models where there is a discrete stochastic number of events, and each event has a continuous value associated with it. But, I’m running into problems with using Stan for predictive simulations because the number of observations is stochastic.
Example
For whatever reason, we’re studying caterpillar eggs. Caterpillar i lays n_i eggs, and we observe the number of eggs laid by caterpillars i=1, ..., N. If a caterpillar lays any eggs, we also observe the sizes of each egg s_{i, 1}, ..., s_{i, n_i}.
We could model the egg counts and sizes jointly:
n_i \sim \text{Poisson}(\lambda) for i=1, ..., N
s_{i, j} \mid n_i > 0 \sim f(\theta) for j=1, ..., n_i
where f(\theta) is the egg size distribution.
The number of egg size observations is the total number of eggs \sum_{i=1}^N n_i.
Problem
Simulations from the predictive distribution from this model will have varying numbers of eggs, because the number of eggs \sum_{i=1}^N n_i is stochastic. A ragged array would be a useful data structure for this simulation, but as far as I know we can’t build variably-sized ragged arrays on the fly in the generated quantities block.
Tentative solutions
-
Don’t use Stan - instead use a language like R or python. Doable, but maybe more error-prone.
-
If using Stan, the best workaround I’ve come up with so far is to simulate binned values from the predictive distribution, to ensure that the dimensionality of the simulated data is constant. For example, with N caterpillars and B bins, we can simulate an interval-censored response in a N \times B matrix. The approach here would be to treat the binned sizes as a multinomial random variable, where the number of eggs n_i is the sample size and cell probabilities are computed from the CDF of the egg size distribution.
However, I am curious whether there are other solutions out there that I’m missing.
Here’s an R script/Stan model where f is an exponential distribution if it’s helpful to have a concrete example:
library(cmdstanr)
# simulate data
N <- 20
n <- rpois(N, lambda = exp(rnorm(1)))
list_of_s <- sapply(n, rexp, rate = exp(rnorm(1)))
s <- unlist(list_of_s)
stan_d <- list(N = N,
n = n,
s = s)
# fit model
mod <- cmdstan_model("eggs.stan")
fit <- mod$sample(
data = stan_d,
parallel_chains = 4,
)
Here’s the Stan program
data {
int<lower=1> N;
int<lower = 0> n[N];
vector[sum(n)] s;
}
parameters {
real<lower = 0> lambda;
real<lower = 0> rate;
}
model {
lambda ~ lognormal(0, 1);
rate ~ lognormal(0, 1);
n ~ poisson(lambda);
s ~ exponential(rate);
}
Are there strategies for predictive simulation in Stan from these kinds of models that I’m missing (beyond binning the data)? I’m tempted to do the simulations in R, but thought I’d check.