Hi all,
I am working with frequency data where the response variable consists of 500 observations for the affinity of the population of a town to one of two, larger population clusters. There are 500 different towns. Values at or closer to zero represent greater affinity to the larger “A” population. Values at or closer to one represent greater affinity to the larger “B” population. The explanatory variables consist of four two-level categorical variables. These variables represent administrative borders where a “0” value means that the given town was on the A side of the administrative border. A value of “1” means that the given town was on the B side of the administrative border. I am interested in whether which side of the border a town falls on can explain the population affinity of that town.
I have fit a zero-one beta regression model to the data. Initial model runs using default brms priors had very long computation times and a lot of divergent transitions. I “solved” this by setting a normal prior to the condition-one inflation parameter. Given that the data mostly consists of counts of 0s and 1s, I specified an informed prior on the mu parameter of the “non-zero” and “none-one” frequencies. This was a normal prior with a mean and standard deviation equal to the mean and standard deviation of the actual “non-zero” and “non-one” frequencies.
My questions are these: (1) is this an appropriate statistical model for data such as these? (2) if it is appropriate, are the priors correctly specified (I am not a statistician but I have the feeling that the priors I used are inappropriate), and (3) how can I evaluate model performance given that LOO isn’t appropriate (as far as I know) for models with a zero-inflation component.
The below R code generates data and fits a model that replicates my actual data and model. I am not in a position to share the actual data.
Thank you in advance!
- Operating System: Windows 10 Enterprise LTSC, version 21H2, OS Build 19044.3324
- brms Version: 2.20.1
library(brms)
# 1.0 Create dummy data where the response variable is both zero-inflated and one-conditional inflated----
one_inflation<- rbinom(n = 225, size = 1, prob = 0.97)
zero_inflation<- rbinom(n = 225, size = 1, prob = 0.03)
frequencies<- runif(50, 0.01, 0.99)
data<- rbind(c(one_inflation, zero_inflation, frequencies))
data_1<- as.data.frame(t(data))
names(data_1)[1]<- "pop_frequency"
data_1$border_1<- rbinom(n = 500, size = 1, prob = 0.80)
data_1$border_2<- rbinom(n = 500, size = 1, prob = 0.70)
data_1$border_3<- rbinom(n = 500, size = 1, prob = 0.50)
data_1$border_4<- rbinom(n = 500, size = 1, prob = 0.60)
# 2.0 Run model----
model_formulae<- bf(pop_frequency ~ border_1 + border_2 + border_3 + border_4,
phi ~ border_1 + border_2 + border_3 + border_4,
zoi ~ border_1 + border_2 + border_3 + border_4,
coi ~ border_1 + border_2 + border_3 + border_4)
model_1<- brm(formula = model_formulae,
data = data_1,
chains = 4, cores = 8, iter = 11000, warmup = 1000,
family = zero_one_inflated_beta(),
# Set a normal prior on the conditional one-inflation parameter as the beta prior
# leads to non-convergence and very long computation times----
prior = c(prior(normal(0, 1), class = "b", dpar = "coi"),
prior(normal(0.4828837,0.2706979), class = "b", dpar = "mu")),
control=list(max_treedepth=15, adapt_delta = 0.99),
sample_prior = "yes")
# Model checks----
mcmc_plot(model_1, type = "acf")
pp_check(model_1)
plot(model_1)
prior_summary(model_1)```