- Operating System: macOS Mojave
- brms Version: 2.8.0
Hello,
I am just starting using Stan and brms and appreciate if someone could help me with some issues. Basically, I have an overdispersed count dataset and would like to fit a zero-inflated negative binomial model with random-effects and many covariates as follows:
m1 <- brm(bf(CASES_IND_total ~offset(FOLLOW_UP_IND) + LOCALIZACAO + BORRIFACAO_14_15_16 + DISTRIBUICAO_MOSQUITEIROS + USO_MOSQUITEIRO_NOITE_ANTERIOR + NR_MOSQUITEIROS_CASA + IDADE + SEXO + HORARIO_DORMIR + HORAIO_ACORDAR + NR_MORADORES + PRESENCA_PAREDE + PRESENCA_FORRO + LOCAL_BANHO + FREQUENTA_SITIO_FORA_CIDADE + WEALTH_INDEX + (1|DOM)),
data=CENSO_complete,
family = zero_inflated_negbinomial(),
prior = set_prior("lasso(1)"),
chains = 1,
cores = getOption("mc.cores",3L))
summary(m1)
Family: zero_inflated_negbinomial
Links: mu = log; shape = identity; zi = identity
Formula: CASES_IND_total ~ offset(FOLLOW_UP_IND) + LOCALIZACAO + BORRIFACAO_14_15_16 + DISTRIBUICAO_MOSQUITEIROS + USO_MOSQUITEIRO_NOITE_ANTERIOR + NR_MOSQUITEIROS_CASA + IDADE + SEXO + HORARIO_DORMIR + HORAIO_ACORDAR + NR_MORADORES + PRESENCA_PAREDE + PRESENCA_FORRO + PRESENCA_.JANELAS_FECHAM + PRESENCA_.TELA_.JANELAS + PRESENCA_TELA_PORTAS + LOCAL_BANHO + FREQUENTA_SITIO_FORA_CIDADE + WEALTH_INDEX + (1 | DOM)
Data: CENSO_complete (Number of observations: 8431)
Samples: 1 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 1000
Group-Level Effects:
~DOM (Number of levels: 2221)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 0.67 0.03 0.61 0.74 336 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept -3.26 0.39 -3.96 -2.43 600 1.00
LOCALIZACAO1 0.45 0.05 0.34 0.55 894 1.00
BORRIFACAO_14_15_161 0.19 0.08 0.02 0.36 799 1.00
BORRIFACAO_14_15_162 -0.02 0.06 -0.14 0.09 912 1.00
DISTRIBUICAO_MOSQUITEIROS1 0.12 0.06 0.00 0.24 1035 1.00
DISTRIBUICAO_MOSQUITEIROS2 -0.06 0.06 -0.19 0.06 958 1.00
USO_MOSQUITEIRO_NOITE_ANTERIOR1 0.07 0.05 -0.02 0.17 1334 1.00
NR_MOSQUITEIROS_CASA1 0.01 0.05 -0.09 0.11 959 1.00
IDADE2 0.23 0.08 0.07 0.39 805 1.00
IDADE3 0.43 0.08 0.27 0.58 665 1.00
IDADE4 0.33 0.09 0.14 0.51 769 1.00
IDADE5 -0.20 0.11 -0.43 0.02 949 1.00
SEXO1 -0.23 0.04 -0.31 -0.16 1756 1.00
HORARIO_DORMIR2 0.02 0.04 -0.08 0.10 1255 1.00
HORARIO_DORMIR3 -0.04 0.06 -0.16 0.08 1677 1.00
HORAIO_ACORDAR2 0.01 0.06 -0.11 0.12 1212 1.00
HORAIO_ACORDAR3 0.00 0.06 -0.13 0.13 1078 1.00
NR_MORADORES2 0.10 0.05 -0.00 0.20 1149 1.00
PRESENCA_PAREDE1 -0.43 0.34 -1.16 0.08 545 1.00
PRESENCA_FORRO1 -0.07 0.06 -0.20 0.05 1098 1.00
PRESENCA_.JANELAS_FECHAM1 -0.04 0.17 -0.41 0.29 1249 1.00
PRESENCA_.TELA_.JANELAS1 -0.03 0.17 -0.38 0.31 1019 1.00
PRESENCA_TELA_PORTAS1 0.04 0.21 -0.38 0.55 878 1.00
LOCAL_BANHO1 -0.15 0.06 -0.26 -0.03 929 1.00
FREQUENTA_SITIO_FORA_CIDADE1 0.05 0.05 -0.04 0.14 1179 1.00
WEALTH_INDEX2 -0.02 0.05 -0.12 0.08 1125 1.00
WEALTH_INDEX3 -0.03 0.07 -0.17 0.10 1117 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
shape 1.44 0.28 1.02 2.04 456 1.00
zi 0.12 0.05 0.03 0.22 489 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
I am using LASSO to performs both variable selection and regularisation because I could not find a function to perform variable selection, such as Bayesian Model Averaging.
Question 1: How do people usually perform variable selection with brms?
After, I would like to check the distribution of random-effects. I assume that the random-effects are normally distributed with mean 0 and standard deviation given by the following distribution:
plot(m1, pars = "^sd_")
Question 2: How to access the distribution of random effects? Is it possible to shrink the random-effects predictors toward their overall mean in brm? (Example: parameter lambda in [https://rdrr.io/cran/gamlss/man/random.html])
Next, I checked the ppc_loo_pit_qq
plot and had, I assume, I very bad result:
m1.loo <- loo(m1, save_psis = TRUE, cores = 3)
psis1 <- m1.loo$psis_object
lw <- weights(psis1)
y <- CENSO_complete[,"CASES_IND_total"]
yrep <- posterior_predict(m1)
ppc_loo_pit_qq(y, yrep, lw = lw, compare = "uniform")
Here I had the warning:
Warning message:
Found 44 observations with a pareto_k > 0.7 in model 'm1'. With this many problematic observations, it may be more appropriate to use 'kfold' with argument 'K = 10' to perform 10-fold cross-validation rather than LOO.
On the other hand, by checking the graphical posterior predictive (ppc_rootogram
), I had a reasonable result, I assume:
ppc_rootogram(y, yrep[1:100,],prob = .95)
Question 3: Could someone help me with the interpretation of both graphs? It seems to me that they lead to different conclusions. Does brms performs Worm plots?
Next, I tried to performed a 10-fold cross validation as suggested on the above warning. I had the following error:
(kfold1 <- kfold(m1,K = 10))
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
10-fold cross-validation
Error in data.frame(Estimate = x$elpd_kfold, SE = x$se_elpd_kfold, row.names = "elpd_kfold") :
'row.names' should specify one of the variables
Question 4: Any ideia what is wrong here?
I was using another package (gamlss) where I could satisfactorily fit the same zero-inflated negative binomial with random effects to my dataset. However, the package does not support LASSO. Worm and QQ plots bellow:
Thanks in advance!
PS: My dataset has 8431 observations.