# Marginalizing out unknown binomial sample size when observations are proportions

I’m working on modeling a system where each individual j within subpopulation i at time t can be in one of three states, y^t_{ij} \in \{1, 2, 3\}. A sensible individual-level model for the transitions is an ordered logistic model. The individuals themselves, however, are not observed. The data are the proportion of individuals within a particular subpopulation that are in state 2 at a given time, so if there are N_i individuals, we observe the quantity z^t_i = | \{ j\in \{1, \dots, N_i \} : y^t_{ij} = 2\} | / N_i.

Assuming the parameters for the ordered logistic are \beta, c_1, and c_2, then the probability that a given individual is in state two is given by \theta = \mathrm{logit}^{-1}(X_i\beta - c_1) - \mathrm{logit}^{-1}(X_i\beta - c_2), where X_i contains information about the individuals in subpopulation i.
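As a concrete check of the formula above, here is a minimal Python sketch of the three state probabilities. The function names and the example values are illustrative assumptions, not part of the model in the question; `eta` stands in for X_i\beta.

```python
import math


def inv_logit(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ordered_logistic_probs(eta: float, c1: float, c2: float) -> list[float]:
    """Probabilities of states 1, 2, 3 for linear predictor eta and cutpoints c1 < c2."""
    if not c1 < c2:
        raise ValueError("cutpoints must satisfy c1 < c2")
    p1 = 1.0 - inv_logit(eta - c1)
    p2 = inv_logit(eta - c1) - inv_logit(eta - c2)  # this is theta in the text
    p3 = inv_logit(eta - c2)
    return [p1, p2, p3]


# Hypothetical values, for illustration only.
probs = ordered_logistic_probs(eta=0.0, c1=-1.0, c2=1.0)
theta = probs[1]
```

The three probabilities sum to one by construction, and theta is positive whenever c_1 < c_2.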

The catch here is that N_i is not known with certainty nor do we have exact counts, but we do have prior information that allows us to bound N_i. If we did have exact counts, say k_i^t was observed, then we could model k_i^t | N_i, \theta \sim \mathrm{Binomial}(N_i, \theta) and N_i | \lambda \sim \mathrm{Poisson}(\lambda) and we could marginalize out N_i in a fairly straightforward manner.
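For the exact-count case, the sum over N_i has a closed form: thinning a Poisson(\lambda) total with probability \theta gives k \sim \mathrm{Poisson}(\lambda\theta). The Python sketch below checks this numerically by summing the joint over N_i up to a truncation point. The function names, the truncation point `n_max`, and the example values are assumptions for illustration.

```python
import math


def poisson_logpmf(n: int, lam: float) -> float:
    """log Poisson(n | lam)."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def binomial_logpmf(k: int, n: int, theta: float) -> float:
    """log Binomial(k | n, theta) for 0 < theta < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(theta) + (n - k) * math.log1p(-theta)


def marginal_count_logpmf(k: int, theta: float, lam: float, n_max: int) -> float:
    """log sum_{n=k}^{n_max} Binomial(k | n, theta) * Poisson(n | lam), via log-sum-exp."""
    if n_max < k:
        raise ValueError("n_max must be at least k")
    terms = [
        binomial_logpmf(k, n, theta) + poisson_logpmf(n, lam)
        for n in range(k, n_max + 1)
    ]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Hypothetical values: the truncated sum should match Poisson(k | lam * theta).
numeric = marginal_count_logpmf(k=4, theta=0.3, lam=20.0, n_max=200)
closed_form = poisson_logpmf(4, 0.3 * 20.0)
```

With a hard upper bound on N_i the Poisson prior is truncated and the closed form no longer holds exactly, but the finite sum can still be computed in the same way.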

The problem I am running into is that, since my observed data is a proportion and not a count, I can’t figure out how to marginalize out N_i and sample this in Stan. In particular, we have

p(z_{i}^t, N_i | \theta, \lambda) = p(z_{i}^t | N_i, \theta) p(N_i |\lambda) = \binom{N_i}{z_{i}^t N_i} \theta^{z_i^t N_i} ( 1 - \theta)^{N_i - z_i^t N_i} \frac{\lambda^{N_i} e^{-\lambda}}{N_i!}

Am I at a point where this is intractable? If I assume knowledge of N_i, then I can fit the model fine, but I’d love to have some uncertainty on it.

I’d appreciate any insight on approaches to problems like this – another thought I had was to model the data directly via a beta regression centered on the state 2 probability derived from the ordered logistic above, but I feel like the binomial model with uncertainty on the sample size is a more faithful representation of my system.

Is the proportion observed exactly? That is, can we rule out all potential values of N that don’t yield integers when multiplied by the observed true proportion?

I don’t believe so - I’ve reached out to the data collector to get more details, but I’m operating under the assumption we don’t have that much precision.

If the N_i can be assumed to be reasonably big, you can approximate by treating N_i as a continuous unknown parameter (with a suitable prior). Stan's binomial distribution requires N to be an integer, but you can easily reimplement a version of the binomial LPMF where N can be continuous. The log binomial coefficient (lchoose in Stan) generalizes to reals just fine.
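To make the idea concrete, here is a minimal sketch in Python (rather than Stan, so it can be run on its own) of a binomial log kernel in which N is real-valued and the count is taken to be k = zN. The function names are hypothetical. The gamma-function form of the log binomial coefficient plays the role of lchoose.

```python
import math


def lchoose_real(n: float, k: float) -> float:
    """Log binomial coefficient extended to real arguments via the gamma function."""
    return math.lgamma(n + 1.0) - math.lgamma(k + 1.0) - math.lgamma(n - k + 1.0)


def binomial_real_log_kernel(z: float, n: float, theta: float) -> float:
    """Binomial log-pmf formula evaluated at a real-valued n and implied count k = z * n.

    Assumes 0 <= z <= 1, n > 0 and 0 < theta < 1.
    """
    k = z * n
    return lchoose_real(n, k) + k * math.log(theta) + (n - k) * math.log1p(-theta)


# For integer n and integer z * n this agrees with the ordinary binomial pmf.
at_integer = binomial_real_log_kernel(z=0.3, n=10.0, theta=0.3)
# It also varies smoothly between integers (hypothetical non-integer n).
between = binomial_real_log_kernel(z=0.3, n=10.5, theta=0.3)
```

In a Stan program the same expression would go in a user-defined function and be added to the target, with N_i declared as a bounded real parameter.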

If N is big and the true proportions are not extreme, you can even approximate with a normal distribution due to the central limit theorem.
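Under the normal approximation the observed proportion has mean \theta and standard deviation \sqrt{\theta(1-\theta)/N}. The sketch below (function names and numbers are illustrative assumptions) writes that density down and compares it with the exact binomial pmf.

```python
import math


def normal_logpdf(x: float, mu: float, sigma: float) -> float:
    """log Normal(x | mu, sigma)."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def proportion_normal_logpdf(z: float, n: float, theta: float) -> float:
    """CLT approximation to the density of the proportion z = k / n."""
    sigma = math.sqrt(theta * (1.0 - theta) / n)
    return normal_logpdf(z, theta, sigma)


def binomial_logpmf(k: int, n: int, theta: float) -> float:
    """Exact log Binomial(k | n, theta), used here only for comparison."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(theta) + (n - k) * math.log1p(-theta)


# Hypothetical values: n = 1000 individuals, 300 of them in state 2.
n, k, theta = 1000, 300, 0.3
exact = binomial_logpmf(k, n, theta)
# The pmf of the count is roughly the density of the proportion divided by n.
approx = proportion_normal_logpdf(k / n, n, theta) - math.log(n)
```

Note that the pmf of the count and the density of the proportion differ by a factor of N. That factor is constant when N is known, but it may matter when N is itself a parameter, so it is worth deciding which of the two quantities the likelihood is meant to represent.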

If N can be small, but can be bounded from above, you can always explicitly marginalize over some range of integers, following the "Latent Discrete Parameters" chapter of the Stan User’s Guide. Also see "Statistical Models for Repeated Categorical Ratings: The R package rater" (arXiv:2010.09335) for a nice intro to marginalization.
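A rough Python sketch of that explicit marginalization for a single observed proportion follows. It makes several assumptions that are not from the thread: the proportion is treated as known only to within `half_width`, the prior on N is a Poisson truncated to [n_min, n_max], and all names are hypothetical. In Stan the same sum would be accumulated with log_sum_exp.

```python
import math


def log_sum_exp(xs: list[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def poisson_logpmf(n: int, lam: float) -> float:
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def binomial_logpmf(k: int, n: int, theta: float) -> float:
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(theta) + (n - k) * math.log1p(-theta)


def marginal_proportion_loglik(z: float, theta: float, lam: float,
                               n_min: int, n_max: int, half_width: float) -> float:
    """log p(z | theta, lam), summing over N in [n_min, n_max] and over every
    count k whose proportion k / N lies within half_width of the reported z."""
    sizes = range(n_min, n_max + 1)
    log_prior = [poisson_logpmf(n, lam) for n in sizes]
    log_norm = log_sum_exp(log_prior)  # normalizer of the truncated prior
    terms = []
    for n, lp in zip(sizes, log_prior):
        # Small epsilon guards against floating-point error at the boundaries.
        k_lo = max(0, math.ceil((z - half_width) * n - 1e-9))
        k_hi = min(n, math.floor((z + half_width) * n + 1e-9))
        for k in range(k_lo, k_hi + 1):
            terms.append(lp - log_norm + binomial_logpmf(k, n, theta))
    if not terms:
        return float("-inf")  # no (N, k) pair is consistent with z
    return log_sum_exp(terms)


# Hypothetical example: z reported to one decimal place, N between 5 and 40.
loglik = marginal_proportion_loglik(z=0.3, theta=0.3, lam=12.0,
                                    n_min=5, n_max=40, half_width=0.05)
```

Setting `half_width` to zero recovers the case where the proportion is observed exactly, so that only values of N for which zN is an integer contribute.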

Does that make sense?

I think the beta regression you mention is also likely to be fruitful. Conceptually, for both the normal and Beta approximations, having a prior on N_i directly translates into having a prior on the variability of the outcome (standard deviation for the normal approximation or precision for the Beta approximation). I’d expect you could obtain/find some quite straightforward analytical expression for the relationship, but I didn’t really dig any deeper into this.
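One candidate for that relationship comes from matching variances. This is a sketch under a moment-matching assumption, not something worked out in the thread. A Beta distribution with mean \theta and precision \phi has variance \theta(1-\theta)/(1+\phi), while a binomial proportion has variance \theta(1-\theta)/N, so equating the two gives \phi = N - 1. The function names below are hypothetical.

```python
def beta_variance(mu: float, phi: float) -> float:
    """Variance of a Beta distribution with mean mu and precision phi = a + b."""
    return mu * (1.0 - mu) / (1.0 + phi)


def binomial_proportion_variance(theta: float, n: float) -> float:
    """Variance of k / n when k ~ Binomial(n, theta)."""
    return theta * (1.0 - theta) / n


def matched_beta_precision(n: float) -> float:
    """Precision phi that makes the Beta variance equal the binomial-proportion variance."""
    return n - 1.0


def matched_beta_shapes(theta: float, n: float) -> tuple[float, float]:
    """Shape parameters (a, b) of the variance-matched Beta; requires n > 1."""
    phi = matched_beta_precision(n)
    return theta * phi, (1.0 - theta) * phi


# Hypothetical values: theta = 0.3, N = 50.
a, b = matched_beta_shapes(0.3, 50.0)
```

Under this matching, a prior on N_i induces a prior on \phi simply by shifting it by one, which fits the observation that the two priors carry the same information.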

Best of luck with your model!

I believe the continuous approximation is more or less just what I need in this case – thank you for the reply! Marking this as solved.
