Data
100 individuals report the number of events (e.g., number of people they had contacts with during the day) for each of four categories (e.g., age group of the contacts).
Let n_{tot}(i) be the total number of events of category i reported by the 100 individuals and n_{tot} the total number over the four categories.
Unfortunately, for a certain number $n_{missing} of the n_{tot}, the category is missing.
Let’s write n_{obs}(i) the number of observed events in category i and n_{obs} the total number of observed events.
We thus have n_{tot} = n_{obs} + n_{missing}.
For the n_{missing} events where the category is missing, we nevertheless have a good knowledge of their probabilities to belong to each of the four categories, which we denote p(i) (i=1,\cdots,4).
Aim
The aim is to build a model to estimate the number of observed events for each category.
Model
My idea was to impute the distribution over the four categories of the n_{missing} events using a multinomial distribution in the ‘transformed parameter’ block and run the model for multiple chains (e.g., 10). However, it seems that Stan generate the same samples for all the different chains. Do I then need to run independent models n times and combine the results?
Another idea would be to treat the missing events as parameter, but it is not possible as the outcome is a discrete variable.
You can find below the R code and the Stan model.
R code
#Count data
#Four categories: count data
N_ind = 100 #number of individuals
N_cat = 4
prob = c(0.1,0.1,0.2,0.6)
n_ind = matrix(NA,nrow=n_ind,ncol=N_cat)
for(i in 1:100){
n_ind[i,] = rpois(4,c(1,5,2,2))
}
n_tot_i = apply(n_ind,2,sum)
n_tot=sum(n_tot_i)
n_missing = 200
n_missing_i = rmultinom(1,n_missing,prob=prob) %>% as.numeric()
n_obs_i = n_tot_i-n_missing_i
mod <- cmdstan_model("stan/test_multinomial.stan")
data_list=list(N_cat=N_cat,
N_ind=N_ind,
n1_cat=n_tot_i,
n2=n_missing,
prob=prob,
inference=1)
fit <- mod$sample(adapt_delta=0.99,
data = data_list,
chains = 10,
parallel_chains = 4,
iter_warmup = 500,
iter_sampling = 300,
refresh = 200)
fit$diagnostic_summary()
fit$summary()
count_tot
Stan
functions {
}
// load data objects
data {
int N_cat;
int n_ind;
array[N_cat] int n1_cat;
int n2;
array[N_cat] real prob;
int inference; //if 0: prior predictive check, if 1: inference
}
transformed data {
array[N_cat] int n_cat;
array[N_cat] int n2_cat = multinomial_rng(to_vector(prob),n2);
print(n2_cat);
for(i in 1:N_cat){
n_cat[i] = n1_cat[i] + n2_cat[i];
}
}
parameters {
array[N_cat] real <lower=0> mu_ind;
}
transformed parameters {
array[N_cat] real mu;
for(i in 1:N_cat){
mu[i] = mu_ind[i] * n_ind;
}
}
model {
//priors
for(i in 1:N_cat){
mu_ind[i] ~ gamma(2.5,1);
}
// likelihood
if(inference==1){//Poisson
target += poisson_lpmf(n_cat | mu);
}
}
generated quantities {
}