Modeling many zeros and low values in a multinomial outcome

Hello,

I am trying to model a multinomially-distributed variable (N = 2,486) with 5 response categories (first_pl, first_sg, second, third_pl, and third_sg).

I have fit the following model with no divergences, all R-hat equal to or less than 1.01, and adequate ESS:

mod_pro_multinom <-  brm(formula = pronoun_count_CHI | trials(total_pro_by_bin_CHI) ~ 1,
                         mupronouncountCHIfirstsg ~ collection_id + child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_first_sg + (groupminded | corpus_id : target_child_name) + (child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_first_sg | corpus_id),
                         mupronouncountCHIsecond ~  collection_id + child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_second + (groupminded | corpus_id : target_child_name) + (child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_second | corpus_id),
                         mupronouncountCHIthirdpl ~ collection_id + child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_third_pl + (groupminded | corpus_id : target_child_name) + (child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_third_pl | corpus_id),
                         mupronouncountCHIthirdsg ~ collection_id + child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_third_sg + (groupminded | corpus_id : target_child_name) + (child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z_first_pl + groupminded * prop_pers_num_MOT_z_third_sg | corpus_id)),
      data = master_pro_wider,
      prior = c(prior(normal(0, 1.25), class = b, dpar = "mupronouncountCHIfirstsg"),
                prior(normal(0, 1.25), class = b, dpar = "mupronouncountCHIsecond"),
                prior(normal(0, 1.25), class = b, dpar = "mupronouncountCHIfirstpl"),
                prior(normal(0, 1.25), class = b, dpar = "mupronouncountCHIthirdsg"),
                prior(student_t(10, 0, 1.25), class = sd, dpar = "mupronouncountCHIfirstsg"),
                prior(student_t(10, 0, 1.25), class = sd, dpar = "mupronouncountCHIsecond"),
                prior(student_t(10, 0, 1.25), class = sd, dpar = "mupronouncountCHIfirstpl"),
                prior(student_t(10, 0, 1.25), class = sd, dpar = "mupronouncountCHIthirdsg"),
                prior(lkj(2), class = cor)),
      family = multinomial(refcat = "pronoun_count_CHI_first_pl"),
      chains = 7,
      init = "0",
      warmup = 1500,
      iter = 3500,
      control = list(adapt_delta = .90),
      cores = 7)

My issue has to do with the ability of the model to capture high rates of zeros and very low values in the data. Posterior predictive checks look as follows:

My initial solution to this problem was to fiddle around with the fixed and random effects, in particular with the hope that expanding the model would better capture the data. However, this has so far not proven effective, with the above posterior checks robust to my attempted changes.

My current line of thought is to include an overdispersion parameter in the above model, i.e., phi (Define Custom Response Distributions with brms). To initially test out this idea on a simpler model, I fitted the following binomial model (i.e., which lacks phi) to the “first_pl” outcome to see if I could replicate the same pattern for “first_pl” as found in the multinomial model, above. With minor modifications to the model formula relative to the multinomial formula, above, the binomial model was specified as

dum_moddumm_beta <- brm(formula = pronoun_count_CHI|trials(total_pro_by_bin_CHI) ~ collection_id + child_age_bin_z + target_child_sex + groupminded * prop_pers_num_MOT_z + (groupminded * prop_pers_num_MOT_z | corpus_id : target_child_name) + (child_age_bin_z + target_child_sex | corpus_id),
                   data = master_pro %>% filter(pers_num == "first_pl") %>% mutate(child_age_bin_z = (child_age_bin - mean(child_age_bin))/(2*sd(child_age_bin)),
                                                                                   prop_pers_num_MOT_z = (prop_pers_num_MOT - mean(prop_pers_num_MOT))/(2*sd(prop_pers_num_MOT))),
                   prior = c(prior(normal(0, 1.25), class = b),
                             prior(student_t(10, 0, 1.25), class = sd),
                             prior(gamma(2, 0.02), class = phi),
                             prior(lkj(2), class = cor)),
                   family = binomial(),
                   init = "0",
                   warmup = 300,
                   iter = 500,
                   control = list(adapt_delta = .83),
                   cores = 7,
                   chains = 7)

This model contained no divergences and had appropriate r-hats, though there were low ESS warnings. Moreover, as hoped, I was able to replicate the above posterior predictive graphs for “first_pl” using the binomial model,

Importantly, the inclusion of the phi parameter, by switching the formula above to “beta_binomial()”, seemed to go some way towards solving the issue. Posterior predictive checks for “first_pl”:

Likewise for the “third_pl” outcome,

Under the beta_binomial model, the dens and ecdf plots look adequate to me, although I am a bit concerned about the ppc_stat_2d plots. Nonetheless, the latter still look better than those of the multinomial outcome, above. Therefore, overall, it seems like the strategy of modeling overdispersion in my data seems to be on the right track. Given this, I am curious about two things. First, does this actually seem like an appropriate way to go about modeling the data? That is, is there some other tool or trick that I could use, instead of or in addition to modeling overdispersion, to capture the dataset using the multinomial model?

Second, assuming modeling overdispersion is the way to go, there is currently not native implementation of phi in the multinomial model for brms. Thus, how might I go about modeling overdispersion in a multinomial outcome? In the post I linked to, above, BĂĽrkner implements a custom family function for overdispersion in the binomial model. Is this feasible for a multinomial?

Thank you.

OS: Mac12.6
RStudio: 4.2.1
brms: 2.17.0

Update: With hopefully clearer terminology, I think that what I am asking about is how to implement a dirichlet-multinomial model via brms via custom_family. The same question is asked in Dirichlet-multinomial custom family in brms.

1 Like