Treating a missing count data as a parameter and putting truncated binomial prior for it

TerrenceT · October 19, 2023, 4:52am

Dear Stan users and developers,

I am a new Stan user transitioning from JAGS. In my model, I want to treat a few missing count data points as parameters and put truncated binomial priors on them (some example code is shown below). The “p” parameter in the truncated binomial will be shared in some other places in the model system, and the “N” and truncated boundary “upper” are known data. It seems that Stan does not support this kind of integer parameter specification (i.e., Y_miss) since in the “parameter” section, only “real” type is supported (the code below will result in the error “No distribution ‘binomial’ was found with the correct signature.”).

Does anyone have similar experience or have any suggestions for this issue when using Stan? Thanks very much for any discussions. (In JAGS, this kind of prior can be specified and this model can work.)

data {
 int<lower=1> N;
 int<lower=1, upper=N> upper;
// other data are left out
}
parameter {
 real Y_miss;
 // other parameters are left out
}
model {
  Y_miss ~ binomial(N, p) T[0, upper]; //I want to put truncated binomial prior for this Y_miss parameter
// likelihood function depending on Y_miss (left out)
}

Sincerely,
Terrence

simonbrauer · October 19, 2023, 1:24pm

Welcome! Could you say a bit more about the other aspects of your model, or share the full code? It’s possible that your specific model might have some simplified form for the missing data that others will be able to identify.

At the very least, you could create a model for Y and use that to infer the probabilities of each value of Y_miss. You could then marginalize over those probabilities to account for the uncertainty in the true value. This would be pretty easy and efficient if the rest of your model is straight-forward and upper is modest. It would get inefficient if, for example, upper = 10000.

You should be able to find a handful of topics here on marginalizing unobserved discrete variables. The Stan User Guide’s discussion of finite mixture models is also helpful. 5 Finite Mixtures | Stan User’s Guide

TerrenceT · October 19, 2023, 6:22pm

Thanks very much for your reply! I will go to check that Guide later.

Let me provide some more details about the example data and model code. I have many observations of multinomial counts (3 categories), e.g., row 1: (10, 10, 10), row 2: (12, 8, 10), …, row S: (9, 9, 12). My primary target is to estimate the 3 probability parameters corresponding to the 3 multinomial categories. The first S multivariate observations are fully observed and can be fed into the common multinomial likelihood (with N=30, p=p[1:3]). Due to some special reasons, I also collect 5 extra observations with (NA, NA, NA), with informative missingness. The informative missingness means, I know the missing value for category 1 should be smaller than or equal to 10, and the missing value for category 2 is affirmed to be smaller than or equal to 8. Due to this informative missingness, the missing observations should contribute in the model. The strategy I planned for this is, treating the missing category 1 value as a parameter called Y_miss and putting prior, Truncated Binomial(N=30, p=p[1])T[0,10] on it. And then, conditional on the category 1 value (Y_miss), the category 2 count’s corresponding random variable will follow Binomial(N-Y_miss, p=p[2]/(p[2]+p[3])). For also incorporating the information that “category 2 value should be <=8”, we can add the cumulative version of the binomial distribution function above into the likelihood function to help enhance the estimation of p[1:3] (see the sample code below).

data {
 int S;
 int<lower=0> Y[S,3];
}
parameters {
 real<lower=0, upper=1> p[3]; 
 real Y_miss[5]; //the type here is problematic, but I do not know how to deal with this given the request below
}
model {
  // some priors can be put for p[1:3], e.g., Dirichlet

  for (s in 1:S){ //for those S rows of complete data
    target += multinomial_lupmf(Y[s,]|to_vector(p));
  }
  for (i in 1:5){ //for those 5 rows of informative NA's
    Y_miss[i] ~ binomial(30, p[1]) T[0, 10]; //I want to put truncated binomial prior for this Y_miss parameter
    target += binomial_lcdf(8 | 30-Y_miss[i], p[2]/(p[2]+p[3])) //this is my planned likelihood function to handle this informative missing problem, not 100% sure this is appropriate
}

I hope the explanation and the example code clarifies my inquiry better. This model looks a little overparameterized by involving those missing value parameters, but they are nuisance and ancillary ones.

Thanks for any comments on the Stan implementation of this idea or any critiques on my model thoughts for this problem.

Sincerely,
Terrence

LucC · October 26, 2023, 8:02am

Hi Terrence, I would interpret the observation model you describe as “censoring” rather than truncation. The “informative missingness” tells you that the true number of counts in some bin is below some value (i.e., left-censored data). This is for instance analogous to survival analysis, in which right-censoring occurs (we know persons’ or items’ survival up to point in time, after which we stopped observing them). Truncation is a different observation process, i.e., events below/under some threshold do not appear in the data at all.

So, the likelihood for these informatively missing counts in your model should be a cumulative probability of that count being under the thresholds you mention. As explained by @simonbrauer, integrating over all possible counts by means of the HMC itself is not possible in Stan as integer parameters are not allowed. So integration by marginalising over all possible counts is the way to go. Now, I’m not sure that using the binomial distribution here is correct, as counts in any of two of the three categories are correlated with one another. The binomial only applies when if missingness is informative in only one of categories. So, you need a cumulative probability function for a multinomial distribution. I don’t what this is from the top of my head, but imagine it’s manageable. Maybe someone else here or the interweb itself knows.

[EDIT: typo correction]

LucC · October 26, 2023, 8:05am

A quick search for multinomial cumulative distribution function yielded this 1981 (pay walled from the train so haven’t read):

https://www.jstor.org/stable/2240628

simonbrauer · October 26, 2023, 1:36pm

The multinomial CDF came up over in the Julia forums a while back. There are a handful of methods available but some are way over my head. Frey wrote some R code for one approach that I think could be re-written in Stan–though it might not be efficient enough to be practically useful in this case. Lebrun describes a more efficient method, as well as a nice summary of other methods and their limitations.

LucC · October 26, 2023, 1:48pm

With at most a \times b = 80 possible combinations of censoring (a <8 counts in one category and b<10 counts in another), a brute force integration by marginalising may not even be so bad, @simonbrauer?

simonbrauer · October 26, 2023, 6:58pm

That does seem likely. Given the current model has no additional predictors of the outcome, the probabilities will be the same for all unobserved cases. So @TerrenceT could calculate those probabilities once rather than 5 times.

Topic		Replies	Views
Truncated model for neg_binomial_2 Modeling	20	1698	June 9, 2017
Missing data in categorical data models Modeling rstan	7	1311	August 12, 2023
Assigning bernoulli prior to missing entries in covariate Modeling	2	1056	August 15, 2019
Model Poll data (Categorical Likelihood with Dirichlet Prior) Modeling rstan	1	325	April 21, 2023
Missing parameters and priors Modeling	25	810	June 27, 2020

Treating a missing count data as a parameter and putting truncated binomial prior for it

Related topics