Impute partially missing discrete outcome

anthony_462 · May 19, 2023, 4:25pm

Data
100 individuals report the number of events (e.g., number of people they had contacts with during the day) for each of four categories (e.g., age group of the contacts).
Let n_{tot}(i) be the total number of events of category i reported by the 100 individuals and n_{tot} the total number over the four categories.

Unfortunately, for a certain number n_{missing} of the n_{tot} events, the category is missing.
Let n_{obs}(i) be the number of observed events in category i and n_{obs} the total number of observed events.
We thus have n_{tot} = n_{obs} + n_{missing}.

For the n_{missing} events where the category is missing, we nevertheless have good knowledge of the probability that such an event belongs to each of the four categories, which we denote p(i) (i=1,\cdots,4).
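
Under these definitions, and assuming the missing-category events fall into the categories independently with probabilities p(i), the unknown split can be sketched as follows. The symbol n_{missing}(i), the number of missing-category events that truly belong to category i, is notation introduced only for this illustration:

```latex
\begin{align*}
\bigl(n_{missing}(1), \dots, n_{missing}(4)\bigr)
  &\sim \mathrm{Multinomial}\bigl(n_{missing},\ (p(1), \dots, p(4))\bigr), \\
n_{tot}(i) &= n_{obs}(i) + n_{missing}(i), \qquad i = 1, \dots, 4 .
\end{align*}
```

Summing the second line over i recovers n_{tot} = n_{obs} + n_{missing}.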

Aim
The aim is to build a model to estimate the number of observed events for each category.

Model
My idea was to impute the distribution of the n_{missing} events over the four categories using a multinomial distribution in the ‘transformed data’ block and to run the model with multiple chains (e.g., 10). However, it seems that Stan generates the same draw for all the chains. Do I then need to run n independent models and combine the results?

Another idea would be to treat the missing counts as parameters, but this is not possible because the outcome is a discrete variable.

You can find below the R code and the Stan model.

R code

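#Count data
#Four categories: count data
library(cmdstanr)

N_ind = 100 #number of individuals
N_cat = 4
prob = c(0.1,0.1,0.2,0.6) #p(i): category probabilities of the missing events
n_ind = matrix(NA,nrow=N_ind,ncol=N_cat) #events per individual and category
for(i in 1:N_ind){
  n_ind[i,] = rpois(N_cat,c(1,5,2,2))
}
n_tot_i = apply(n_ind,2,sum) #total events per category
n_tot=sum(n_tot_i)
n_missing = 200
n_missing_i = as.numeric(rmultinom(1,n_missing,prob=prob)) #events whose category is lost
n_obs_i =  n_tot_i-n_missing_i #observed events per category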

mod <- cmdstan_model("stan/test_multinomial.stan")

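data_list=list(N_cat=N_cat,
               n_ind=N_ind, #name must match `int n_ind` in the Stan data block
               n1_cat=n_obs_i, #observed counts; the Stan model adds the imputed missing counts
               n2=n_missing,
               prob=prob,
               inference=1)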
fit <- mod$sample(adapt_delta=0.99,
                    data = data_list,
                    chains = 10, 
                    parallel_chains = 4,
                    iter_warmup = 500,
                    iter_sampling = 300,
                    refresh = 200)
fit$diagnostic_summary()
fit$summary()

Stan

functions {

}

// load data objects
data {
  int N_cat;
  int n_ind;
  
  array[N_cat] int n1_cat;
  int n2;
  array[N_cat] real prob;
  
  int inference; //if 0: prior predictive check, if 1: inference
}

transformed data {
  array[N_cat] int n_cat;
  array[N_cat] int n2_cat = multinomial_rng(to_vector(prob),n2);
  print(n2_cat);
  for(i in 1:N_cat){
    n_cat[i] = n1_cat[i] + n2_cat[i];
  }
}

parameters {
  array[N_cat] real <lower=0> mu_ind;
}

transformed parameters {
  array[N_cat] real mu;
  for(i in 1:N_cat){
    mu[i] = mu_ind[i] * n_ind;
  }
}

model {
  //priors
  for(i in 1:N_cat){
    mu_ind[i] ~ gamma(2.5,1);
  }

  // likelihood
  if(inference==1){//Poisson
    target += poisson_lpmf(n_cat | mu);
  }
}

generated quantities {

}
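
For the option of running independent fits and combining the results, the imputation step could be moved into R, roughly as in the minimal sketch below. It assumes a modified Stan program that reads the completed counts as data (under the hypothetical name n_cat) instead of drawing them in transformed data. The values of n_obs_i_example and M and the seed are illustrative only and are not taken from the simulation above.

```r
# Sketch: draw M imputed data sets in R, one per independent fit.
set.seed(1)                               # arbitrary seed, for reproducibility only

N_cat     <- 4
prob      <- c(0.1, 0.1, 0.2, 0.6)        # p(i), as in the post
n_missing <- 200
n_obs_i_example <- c(80, 480, 160, 80)    # hypothetical observed counts (assumption)
M         <- 10                           # number of imputations (assumption)

# Each column is one multinomial split of the n_missing events (N_cat x M matrix)
n_imp <- rmultinom(M, size = n_missing, prob = prob)

# Completed counts per category, one vector per imputation
n_cat_list <- lapply(seq_len(M), function(m) n_obs_i_example + n_imp[, m])

# Each element would then be passed to its own fit, for example
#   mod$sample(data = list(N_cat = N_cat, n_ind = 100, n_cat = n_cat_list[[m]]), chains = 1)
# where `n_cat` as a data input is hypothetical and requires editing the Stan program.
```

The posterior draws from the M fits would then be pooled, so that the uncertainty about the missing categories carries through to the estimates.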

anthony_462 · May 22, 2023, 9:29am

I spotted a few small mistakes in my post, but I couldn’t find a way to edit it. Am I no longer able to edit my post, or did I miss something?
