Marginalizing out unknown binomial sample size when observations are proportions

amas0 · August 14, 2023, 10:20pm

I’m working on modeling a system where individuals within subpopulation i at time t can be in one of three states, y^t_{ij} \in \{1, 2, 3\}. A sensible model for transitions on the individual level is an ordered logistic model. The individuals themselves, however, are not observed. The data is the proportion of individuals within a particular subpopulation that are in state 2 at a given time, so if there are N_i individuals, we observe a quantity z^t_i = | \{ j\in \{1, \dots, N_i \} : y^t_{ij} = 2\} | / N_i.

Assuming the parameters for the ordered logistic are \beta, c_1, and c_2, then the probability that a given individual is in state two is given by \theta = \mathrm{logit}^{-1}(X_i\beta - c_1) - \mathrm{logit}^{-1}(X_i\beta - c_2), where X_i contains information about the individuals in subpopulation i.
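A minimal Python sketch of the state probabilities implied by the formula above, assuming a scalar linear predictor eta = X_i\beta and cutpoints c_1 < c_2. The function names `inv_logit` and `state_probs` are illustrative, not from the thread.

```python
import math

def inv_logit(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def state_probs(eta, c1, c2):
    """Return (P(y=1), P(y=2), P(y=3)) for linear predictor eta, cutpoints c1 < c2."""
    if not c1 < c2:
        raise ValueError("cutpoints must satisfy c1 < c2")
    p1 = 1.0 - inv_logit(eta - c1)
    # theta in the post: probability of being in state 2
    p2 = inv_logit(eta - c1) - inv_logit(eta - c2)
    p3 = inv_logit(eta - c2)
    return p1, p2, p3
```

The ordering constraint c_1 < c_2 is what keeps the state 2 probability positive.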

The catch here is that N_i is not known with certainty nor do we have exact counts, but we do have prior information that allows us to bound N_i. If we did have exact counts, say k_i^t was observed, then we could model k_i^t | N_i, \theta \sim \mathrm{Binomial}(N_i, \theta) and N_i | \lambda \sim \mathrm{Poisson}(\lambda) and we could marginalize out N_i in a fairly straightforward manner.
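For reference, the marginalization in the exact-count case can be written out as follows. This is a sketch for a single observation k_i^t, assuming the Poisson prior has no truncation; with bounds on N_i, or with several time points sharing one N_i, the sum no longer collapses this cleanly.

```latex
\begin{aligned}
p(k_i^t \mid \theta, \lambda)
  &= \sum_{N_i = k_i^t}^{\infty} \binom{N_i}{k_i^t} \theta^{k_i^t} (1-\theta)^{N_i - k_i^t}
     \frac{\lambda^{N_i} e^{-\lambda}}{N_i!} \\
  &= \frac{(\lambda\theta)^{k_i^t} e^{-\lambda}}{k_i^t!}
     \sum_{m=0}^{\infty} \frac{\bigl(\lambda(1-\theta)\bigr)^{m}}{m!} \\
  &= \frac{(\lambda\theta)^{k_i^t} e^{-\lambda\theta}}{k_i^t!}
\end{aligned}
```

That is, k_i^t | \theta, \lambda \sim \mathrm{Poisson}(\lambda\theta), using the substitution m = N_i - k_i^t in the second line.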

The problem I am running into is that, since my observed data is a proportion and not a count, I can’t figure out how to marginalize out N_i and sample this in Stan. In particular, we have

p(z_{i}^t, N_i | \theta, \lambda) = p(z_{i}^t | N_i, \theta) p(N_i |\lambda) = \binom{N_i}{z_{i}^t N_i} \theta^{z_i^t N_i} ( 1 - \theta)^{N_i - z_i^t N_i} \frac{\lambda^{N_i} e^{-\lambda}}{N_i!}

Am I at a point where this is intractable? If I assume knowledge of N_i, then I can fit the model fine, but I’d love to have some uncertainty on it.

I’d appreciate any insight on approaches to problems like this – another thought I had was to model the data directly via a beta regression centered on the state 2 probability derived from the ordered logistic above, but I feel like the binomial model with uncertainty on the sample size is a more faithful representation of my system.

jsocolar · August 14, 2023, 11:21pm

Is the proportion observed exactly? That is, can we rule out all potential values of N that don’t yield integers when multiplied by the observed true proportion?

amas0 · August 15, 2023, 12:18am

I don’t believe so - I’ve reached out to the data collector to get more details, but I’m operating under the assumption we don’t have that much precision.

martinmodrak · August 15, 2023, 12:01pm

If the N_i can be assumed to be reasonably big, you can approximate and treat N_i as a continuous unknown parameter (with a suitable prior). Stan’s built-in binomial requires N to be an integer, but you can easily reimplement a version of the binomial LPMF where N can be continuous. The log binomial coefficient (lchoose in Stan) generalizes to reals just fine.
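A minimal Python sketch of that idea, with the log binomial coefficient written via log-gamma so that N need not be an integer. The names `lchoose_real` and `binomial_lpmf_real` are hypothetical helpers, and the function only evaluates the binomial pmf at k = zN; it does not try to turn that into a normalized density for z.

```python
import math

def lchoose_real(n, k):
    """Log binomial coefficient extended to real n >= k >= 0 via log-gamma."""
    return math.lgamma(n + 1.0) - math.lgamma(k + 1.0) - math.lgamma(n - k + 1.0)

def binomial_lpmf_real(z, n, theta):
    """Binomial log-pmf evaluated at k = z * n, with n allowed to be non-integer."""
    if not (0.0 <= z <= 1.0 and n > 0.0 and 0.0 < theta < 1.0):
        raise ValueError("need 0 <= z <= 1, n > 0, 0 < theta < 1")
    k = z * n  # implied (possibly non-integer) count
    return lchoose_real(n, k) + k * math.log(theta) + (n - k) * math.log1p(-theta)
```

At integer N and integer zN this agrees with the ordinary binomial log-pmf, so the same expression translated to a Stan function would reduce to the usual likelihood in that case.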

If N is big and the true proportions are not extreme, you can even approximate with a normal distribution due to the central limit theorem.
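A minimal sketch of the normal approximation, assuming the observed proportion z is treated as Normal(\theta, \sqrt{\theta(1-\theta)/N}). The function name `normal_approx_lpdf` is illustrative.

```python
import math

def normal_approx_lpdf(z, n, theta):
    """Log density of z under Normal(theta, sqrt(theta * (1 - theta) / n))."""
    if not (n > 0.0 and 0.0 < theta < 1.0):
        raise ValueError("need n > 0 and 0 < theta < 1")
    sd = math.sqrt(theta * (1.0 - theta) / n)  # CLT standard deviation of a proportion
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((z - theta) / sd) ** 2
```

Here N enters only through the standard deviation, so it can be a real-valued parameter without any special handling.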

If N can be small, but can be bounded from above, you can always explicitly marginalize over some range of integers, following the “Latent Discrete Parameters” chapter of the Stan User’s Guide. Also see “Statistical Models for Repeated Categorical Ratings: The R package rater” (arXiv:2010.09335) for a nice intro to marginalization.
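A minimal Python sketch of explicit marginalization over a bounded range of N, written for a single observed proportion. Two things here are assumptions not stated in the thread: the prior on N is a Poisson(\lambda) truncated to [n_lo, n_hi], and the unobserved count is taken to be z·N rounded to the nearest integer. All function names are hypothetical.

```python
import math

def log_sum_exp(xs):
    """Stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def binomial_lpmf(k, n, theta):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log1p(-theta))

def poisson_lpmf(n, lam):
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def marginal_loglik(z, theta, lam, n_lo, n_hi):
    """log p(z | theta, lam) with N summed out over n_lo..n_hi."""
    ns = range(n_lo, n_hi + 1)
    log_prior = [poisson_lpmf(n, lam) for n in ns]
    log_norm = log_sum_exp(log_prior)  # normalizer of the truncated prior
    terms = []
    for n, lp in zip(ns, log_prior):
        k = math.floor(z * n + 0.5)  # ASSUMPTION: nearest integer count
        terms.append(lp - log_norm + binomial_lpmf(k, n, theta))
    return log_sum_exp(terms)
```

If several time points share the same N_i, the per-time-point binomial terms would be summed inside the loop for each candidate N before the final log-sum-exp.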

Does that make sense?

The beta regression you mention is also likely to be fruitful. Conceptually, for both the normal and Beta approximations, having a prior on N_i directly translates into having a prior on the variability of the outcome (standard deviation for the normal approximation or precision for the Beta approximation). I’d expect you could obtain/find some quite straightforward analytical expression for the relationship, but I didn’t really dig any deeper into this.
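One candidate for that relationship, sketched under the assumption that the approximations are matched on the binomial-proportion variance \theta(1-\theta)/N: the normal standard deviation is \sqrt{\theta(1-\theta)/N}, and a Beta with mean \mu and precision \kappa (the Beta(\mu\kappa, (1-\mu)\kappa) parameterization) has variance \mu(1-\mu)/(\kappa+1), which gives \kappa = N - 1. The helper names below are illustrative.

```python
import math

def normal_sd_from_n(theta, n):
    """Standard deviation of a binomial proportion with n trials."""
    return math.sqrt(theta * (1.0 - theta) / n)

def beta_precision_from_n(n):
    """kappa that matches the Beta variance to theta * (1 - theta) / n; needs n > 1."""
    if n <= 1.0:
        raise ValueError("variance matching needs n > 1")
    return n - 1.0

def beta_variance(mu, kappa):
    """Variance of Beta(mu * kappa, (1 - mu) * kappa)."""
    a, b = mu * kappa, (1.0 - mu) * kappa
    return a * b / ((a + b) ** 2 * (a + b + 1.0))
```

Under this matching, a prior on N_i induces a prior on \kappa simply by shifting by one.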

Best of luck with your model!

amas0 · August 15, 2023, 9:59pm

I believe the continuous approximation is more or less just what I need in this case – thank you for the reply! Marking this as solved.
