# Predictive simulation with stochastic data sizes

I’m interested in using Stan for hierarchical models where there is a discrete, stochastic number of events, and each event has a continuous value associated with it. But I’m running into problems using Stan for predictive simulation, because the number of observations is itself stochastic.

## Example

For whatever reason, we’re studying caterpillar eggs. Caterpillar i lays n_i eggs, and we observe the number of eggs laid by caterpillars i=1, ..., N. If a caterpillar lays any eggs, we also observe the size of each egg: s_{i, 1}, ..., s_{i, n_i}.

We could model the egg counts and sizes jointly:

n_i \sim \text{Poisson}(\lambda) for i=1, ..., N

s_{i, j} \mid n_i > 0 \sim f(\theta) for j=1, ..., n_i

where f(\theta) is the egg size distribution.

The number of egg size observations is the total number of eggs \sum_{i=1}^N n_i.

## Problem

Simulations from the predictive distribution from this model will have varying numbers of eggs, because the number of eggs \sum_{i=1}^N n_i is stochastic. A ragged array would be a useful data structure for this simulation, but as far as I know we can’t build variably-sized ragged arrays on the fly in the generated quantities block.

## Tentative solutions

1. Don’t use Stan for the simulation; instead use a language like R or Python. Doable, but maybe more error-prone, since the model and the simulation code then live in two places.

2. If using Stan, the best workaround I’ve come up with so far is to simulate binned values from the predictive distribution, to ensure that the dimensionality of the simulated data is constant. For example, with N caterpillars and B bins, we can simulate an interval-censored response in an N \times B matrix. The approach here would be to treat the binned sizes as a multinomial random variable, where the number of eggs n_i is the sample size and the cell probabilities are computed from the CDF of the egg size distribution.
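The multinomial construction in (2) can be sketched outside Stan as well. Here is a minimal Python/NumPy illustration; the bin edges, B = 5, and the parameter values are all made up for the sake of a concrete example, not taken from a real fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values for one posterior draw (not from a real fit)
lam, rate = 2.0, 1.5
N, B = 20, 5

# Bin edges for egg sizes; the last bin is right-open
edges = np.array([0.0, 0.5, 1.0, 1.5, 2.0, np.inf])

# Cell probabilities from the exponential CDF: P(edges[b] < s <= edges[b + 1])
cdf = 1.0 - np.exp(-rate * edges)  # CDF is 0 at the left edge, 1 at infinity
p = np.diff(cdf)

# Egg counts are Poisson; binned sizes are multinomial with sample size n_i
n_rep = rng.poisson(lam, size=N)
counts = np.vstack([rng.multinomial(n_i, p) for n_i in n_rep])  # N x B matrix
```

Each row of `counts` sums to the corresponding `n_rep[i]`, so the simulated data always fit in a fixed N \times B matrix no matter how many eggs are drawn.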

However, I am curious whether there are other solutions out there that I’m missing.

Here’s an R script/Stan model where f is an exponential distribution if it’s helpful to have a concrete example:

```r
library(cmdstanr)

# simulate data
N <- 20
n <- rpois(N, lambda = exp(rnorm(1)))
list_of_s <- lapply(n, rexp, rate = exp(rnorm(1)))
s <- unlist(list_of_s)

stan_d <- list(N = N,
               n = n,
               s = s)

# fit model
mod <- cmdstan_model("eggs.stan")
fit <- mod$sample(
  data = stan_d,
  parallel_chains = 4
)
```


Here’s the Stan program:

```stan
data {
  int<lower=1> N;
  int<lower=0> n[N];
  vector[sum(n)] s;
}

parameters {
  real<lower=0> lambda;
  real<lower=0> rate;
}

model {
  lambda ~ lognormal(0, 1);
  rate ~ lognormal(0, 1);
  n ~ poisson(lambda);
  s ~ exponential(rate);
}
```


Are there strategies for predictive simulation in Stan from these kinds of models that I’m missing (beyond binning the data)? I’m tempted to do the simulations in R, but thought I’d check.
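For what it’s worth, option 1 is not much code once the model is fit: given posterior draws of \lambda and the rate, the ragged predictive datasets can be built directly. A Python sketch, with made-up lognormal draws standing in for the posterior (in practice you’d extract the draws from the fitted model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up posterior draws of (lambda, rate); in practice, extract
# these from the fitted model object
lam_draws = rng.lognormal(0.0, 1.0, size=100)
rate_draws = rng.lognormal(0.0, 1.0, size=100)
N = 20

# One ragged predictive dataset per posterior draw: a list of egg-size
# arrays of varying lengths, exactly the structure Stan can't return
reps = []
for lam, rate in zip(lam_draws, rate_draws):
    n_rep = rng.poisson(lam, size=N)
    s_rep = [rng.exponential(scale=1.0 / rate, size=k) for k in n_rep]
    reps.append((n_rep, s_rep))
```

The same few lines translate directly to R with `rpois` and `rexp`.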


It’s not exactly elegant, but you could initialize the vectors that store replications of s at a size larger than anything one could reasonably expect from your model.

```stan
data {
  int<lower=1> N;
  int<lower=0> n[N];
  vector[sum(n)] s;
  int<lower=0> N_rep;
}

parameters {
  real<lower=0> lambda;
  real<lower=0> rate;
}

model {
  lambda ~ lognormal(0, 1);
  rate ~ lognormal(0, 1);
  n ~ poisson(lambda);
  s ~ exponential(rate);
}

generated quantities {
  int<lower=0> n_rep[N_rep];
  vector[50] s_rep[N_rep];

  for (i in 1:N_rep) {
    // Draw n
    n_rep[i] = poisson_rng(lambda);
    for (j in 1:n_rep[i]) {
      // Draw s conditional on n
      s_rep[i][j] = exponential_rng(rate);
    }
  }
}
```


In the above, the s_rep vectors have an arbitrary size of 50. I ran this model with some fake data where \lambda = 2, so we’d essentially never expect 50 eggs to be drawn from that Poisson distribution (if a draw ever did exceed the declared size, the indexing would fail at runtime, so choose the size generously). Any slots in those vectors that aren’t used are left as NaN, which you can simply discard after running the model.
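Discarding the padding afterwards is a one-liner once you have the draws. A Python sketch, with fake padded output standing in for an extracted s_rep matrix from one posterior iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake padded output mimicking the shape of s_rep for one posterior
# iteration: N_rep rows, each padded to length 50 with NaN
N_rep, max_size = 4, 50
n_rep = np.array([2, 0, 3, 1])
s_rep = np.full((N_rep, max_size), np.nan)
for i, k in enumerate(n_rep):
    s_rep[i, :k] = rng.exponential(size=k)

# Drop the unused NaN slots to recover the ragged structure
ragged = [row[~np.isnan(row)] for row in s_rep]
```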


That’s correct. In my view, the main idea with variably sized arrays is to declare the array as large as could plausibly be required, and then fill it partially based on sampling from the Poisson.
