I’m working on modeling a system where individuals within subpopulation i at time t can be in one of three states, y^t_{ij} \in \{1, 2, 3\}. A sensible model for transitions on the individual level is via an ordered logistic model. The individuals themselves, however, are not observed. The data is the proportion of individuals within a particular subpopulation that are in state 2 at a given time, so if there are N_i individuals, we observe a quantity z^t_i = | \{ j\in \{1, \dots, N_i \} : y^t_{ij} = 2\} | / N_i.

Assuming the parameters for the ordered logistic are \beta, c_1, and c_2, then the probability that a given individual is in state 2 is given by \theta = \mathrm{logit}^{-1}(X_i\beta - c_1) - \mathrm{logit}^{-1}(X_i\beta - c_2), where X_i contains information about the individuals in subpopulation i.

The catch here is that N_i is not known with certainty nor do we have exact counts, but we do have prior information that allows us to bound N_i. If we did have exact counts, say k_i^t was observed, then we could model k_i^t | N_i, \theta \sim \mathrm{Binomial}(N_i, \theta) and N_i | \lambda \sim \mathrm{Poisson}(\lambda) and we could marginalize out N_i in a fairly straightforward manner.
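For reference, the marginalization in the exact-count case can be written out as below. This is a sketch for a single observation k_i^t with an untruncated Poisson prior on N_i; if the prior is bounded, the sum becomes finite and the closed form no longer holds exactly.

```latex
\begin{aligned}
p(k_i^t \mid \lambda, \theta)
  &= \sum_{N_i = k_i^t}^{\infty}
     \binom{N_i}{k_i^t}\, \theta^{k_i^t} (1-\theta)^{N_i - k_i^t}
     \cdot \frac{e^{-\lambda} \lambda^{N_i}}{N_i!} \\
  &= \frac{e^{-\lambda} (\lambda\theta)^{k_i^t}}{k_i^t!}
     \sum_{m=0}^{\infty} \frac{\bigl(\lambda(1-\theta)\bigr)^{m}}{m!}
     \qquad (m = N_i - k_i^t) \\
  &= \frac{e^{-\lambda\theta} (\lambda\theta)^{k_i^t}}{k_i^t!},
\end{aligned}
\qquad\text{i.e.}\qquad
k_i^t \mid \lambda, \theta \sim \mathrm{Poisson}(\lambda\theta).
```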

The problem I am running into is that, since my observed data is a *proportion* and not a count, I can’t figure out how to marginalize out N_i and sample this in Stan. In particular, we have

Am I at a point where this is intractable? If I assume knowledge of N_i, then I can fit the model fine, but I’d love to have some uncertainty on it.
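To make the question concrete, here is one possible way the marginalization over a bounded N_i could look for a single observed proportion, written in plain Python rather than Stan. It is only a sketch under assumptions that are not stated above: the proportion z is recorded exactly (not rounded), the prior on N_i is a Poisson truncated to [n_lo, n_hi], and every name here (`log_marginal_proportion`, `n_lo`, `n_hi`, `lam`, `tol`) is hypothetical. For each candidate N it checks whether z * N is an integer count k and, if so, adds the binomial term weighted by the prior.

```python
import math


def log_binom_pmf(k, n, theta):
    """Log of Binomial(k | n, theta), assuming 0 < theta < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log1p(-theta))


def log_poisson_pmf(n, lam):
    """Log of Poisson(n | lam)."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_marginal_proportion(z, theta, lam, n_lo, n_hi, tol=1e-9):
    """Log p(z | theta, lam), summing N over n_lo..n_hi (hypothetical bounds).

    Assumes z is an exact proportion k / N, so only values of N for which
    z * N is (numerically) an integer contribute to the sum.
    """
    ns = range(n_lo, n_hi + 1)
    # Normalizing constant of the truncated Poisson prior on N.
    log_norm = log_sum_exp([log_poisson_pmf(n, lam) for n in ns])
    terms = []
    for n in ns:
        k = round(z * n)
        if abs(z * n - k) > tol:
            continue  # this N cannot produce the observed proportion
        log_prior = log_poisson_pmf(n, lam) - log_norm
        terms.append(log_prior + log_binom_pmf(k, n, theta))
    return log_sum_exp(terms) if terms else -math.inf
```

Since Stan does not sample discrete parameters, the same finite sum would have to be accumulated on the log scale inside the model rather than sampled. If the recorded proportions are rounded, the integer check above would need to be replaced by a set of counts compatible with the rounding.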

I’d appreciate any insight on approaches to problems like this. Another thought I had was to model the data directly via a beta regression centered on the state 2 probability derived from the ordered logistic above, but I feel like the binomial model with uncertainty on the sample size is a more faithful representation of my system.