Zero-inflated negative binomial model with random-effects

  • Operating System: macOS Mojave
  • brms Version: 2.8.0

Hello,

I am just starting using Stan and brms and appreciate if someone could help me with some issues. Basically, I have an overdispersed count dataset and would like to fit a zero-inflated negative binomial model with random-effects and many covariates as follows:

m1 <- brm(bf(CASES_IND_total ~offset(FOLLOW_UP_IND) + LOCALIZACAO + BORRIFACAO_14_15_16 + DISTRIBUICAO_MOSQUITEIROS + USO_MOSQUITEIRO_NOITE_ANTERIOR + NR_MOSQUITEIROS_CASA + IDADE + SEXO + HORARIO_DORMIR + HORAIO_ACORDAR + NR_MORADORES + PRESENCA_PAREDE + PRESENCA_FORRO + LOCAL_BANHO + FREQUENTA_SITIO_FORA_CIDADE + WEALTH_INDEX + (1|DOM)),
                            data=CENSO_complete, 
                            family = zero_inflated_negbinomial(),
                            prior = set_prior("lasso(1)"),
                            chains = 1,
                            cores = getOption("mc.cores",3L))

summary(m1)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = identity 
Formula: CASES_IND_total ~ offset(FOLLOW_UP_IND) + LOCALIZACAO + BORRIFACAO_14_15_16 + DISTRIBUICAO_MOSQUITEIROS + USO_MOSQUITEIRO_NOITE_ANTERIOR + NR_MOSQUITEIROS_CASA + IDADE + SEXO + HORARIO_DORMIR + HORAIO_ACORDAR + NR_MORADORES + PRESENCA_PAREDE + PRESENCA_FORRO + PRESENCA_.JANELAS_FECHAM + PRESENCA_.TELA_.JANELAS + PRESENCA_TELA_PORTAS + LOCAL_BANHO + FREQUENTA_SITIO_FORA_CIDADE + WEALTH_INDEX + (1 | DOM) 
   Data: CENSO_complete (Number of observations: 8431) 
Samples: 1 chains, each with iter = 2000; warmup = 1000; thin = 1;
     total post-warmup samples = 1000

Group-Level Effects: 
~DOM (Number of levels: 2221) 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     0.67      0.03     0.61     0.74        336 1.00

Population-Level Effects: 
                            Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept                          -3.26      0.39    -3.96    -2.43        600 1.00
LOCALIZACAO1                        0.45      0.05     0.34     0.55        894 1.00
BORRIFACAO_14_15_161                0.19      0.08     0.02     0.36        799 1.00
BORRIFACAO_14_15_162               -0.02      0.06    -0.14     0.09        912 1.00
DISTRIBUICAO_MOSQUITEIROS1          0.12      0.06     0.00     0.24       1035 1.00
DISTRIBUICAO_MOSQUITEIROS2         -0.06      0.06    -0.19     0.06        958 1.00
USO_MOSQUITEIRO_NOITE_ANTERIOR1     0.07      0.05    -0.02     0.17       1334 1.00
NR_MOSQUITEIROS_CASA1               0.01      0.05    -0.09     0.11        959 1.00
IDADE2                              0.23      0.08     0.07     0.39        805 1.00
IDADE3                              0.43      0.08     0.27     0.58        665 1.00
IDADE4                              0.33      0.09     0.14     0.51        769 1.00
IDADE5                             -0.20      0.11    -0.43     0.02        949 1.00
SEXO1                              -0.23      0.04    -0.31    -0.16       1756 1.00
HORARIO_DORMIR2                     0.02      0.04    -0.08     0.10       1255 1.00
HORARIO_DORMIR3                    -0.04      0.06    -0.16     0.08       1677 1.00
HORAIO_ACORDAR2                     0.01      0.06    -0.11     0.12       1212 1.00
HORAIO_ACORDAR3                     0.00      0.06    -0.13     0.13       1078 1.00
NR_MORADORES2                       0.10      0.05    -0.00     0.20       1149 1.00
PRESENCA_PAREDE1                   -0.43      0.34    -1.16     0.08        545 1.00
PRESENCA_FORRO1                    -0.07      0.06    -0.20     0.05       1098 1.00
PRESENCA_.JANELAS_FECHAM1          -0.04      0.17    -0.41     0.29       1249 1.00
PRESENCA_.TELA_.JANELAS1           -0.03      0.17    -0.38     0.31       1019 1.00
PRESENCA_TELA_PORTAS1               0.04      0.21    -0.38     0.55        878 1.00
LOCAL_BANHO1                       -0.15      0.06    -0.26    -0.03        929 1.00
FREQUENTA_SITIO_FORA_CIDADE1        0.05      0.05    -0.04     0.14       1179 1.00
WEALTH_INDEX2                      -0.02      0.05    -0.12     0.08       1125 1.00
WEALTH_INDEX3                      -0.03      0.07    -0.17     0.10       1117 1.00

Family Specific Parameters: 
  Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
shape     1.44      0.28     1.02     2.04        456 1.00
zi        0.12      0.05     0.03     0.22        489 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

I am using LASSO to performs both variable selection and regularisation because I could not find a function to perform variable selection, such as Bayesian Model Averaging.

Question 1: How do people usually perform variable selection with brms?

After, I would like to check the distribution of random-effects. I assume that the random-effects are normally distributed with mean 0 and standard deviation given by the following distribution:

plot(m1, pars = "^sd_")

Rplot

Question 2: How to access the distribution of random effects? Is it possible to shrink the random-effects predictors toward their overall mean in brm? (Example: parameter lambda in [https://rdrr.io/cran/gamlss/man/random.html])

Next, I checked the ppc_loo_pit_qq plot and had, I assume, I very bad result:

m1.loo <- loo(m1, save_psis = TRUE, cores = 3)
psis1  <- m1.loo$psis_object
lw     <- weights(psis1)

y     <- CENSO_complete[,"CASES_IND_total"]
yrep  <- posterior_predict(m1)

ppc_loo_pit_qq(y, yrep, lw = lw, compare = "uniform")

Rplot01

Here I had the warning:

Warning message:
Found 44 observations with a pareto_k > 0.7 in model 'm1'. With this many problematic observations, it may be more appropriate to use 'kfold' with argument 'K = 10' to perform 10-fold cross-validation rather than LOO.

On the other hand, by checking the graphical posterior predictive (ppc_rootogram), I had a reasonable result, I assume:

ppc_rootogram(y, yrep[1:100,],prob = .95)

Rplot02

Question 3: Could someone help me with the interpretation of both graphs? It seems to me that they lead to different conclusions. Does brms performs Worm plots?

Next, I tried to performed a 10-fold cross validation as suggested on the above warning. I had the following error:

(kfold1 <- kfold(m1,K = 10))
Fitting model 1 out of 10
Fitting model 2 out of 10
Fitting model 3 out of 10
Fitting model 4 out of 10
Fitting model 5 out of 10
Fitting model 6 out of 10
Fitting model 7 out of 10
Fitting model 8 out of 10
Fitting model 9 out of 10
Fitting model 10 out of 10
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling
Start sampling

 10-fold cross-validation

Error in data.frame(Estimate = x$elpd_kfold, SE = x$se_elpd_kfold, row.names = "elpd_kfold") : 
  'row.names' should specify one of the variables

Question 4: Any ideia what is wrong here?

I was using another package (gamlss) where I could satisfactorily fit the same zero-inflated negative binomial with random effects to my dataset. However, the package does not support LASSO. Worm and QQ plots bellow:

Thanks in advance!
PS: My dataset has 8431 observations.

Q1: The LASSO prior provides not good enough shrinkage and I would generally recommend not using it. Instead, the horseshoe prior usually has better shrinkage behavior. Our recommended procedure for variable selection would be to use the horseshoe prior and then perform projective predictions via the projpred package. However, the latter is not yet available for such complex models such as yours but we are currently working on generalizing projpred.

Q2: ranef or coef

Q3: Because of the 44 observations with khat > .7 I don’t think the LOO-PIT plot can be trusted. Perhaps @jonah has more insides on that.

Q4: Could you provide a minimal reproducible example for me?

Thank you for your reply, Paul! Horseshoe prior seems to be better.

I am sharing the data here:

https://drive.google.com/drive/folders/1_1MEN0h-UN6uYpMNU8kE0LzKSitWX8sb?usp=sharing

One more question: I am checking the random-effects distribution and it seems to be not normal. My guess this is also contributing for a bad diagnostic plot. Any thoughts about it?

Rplot03

The random effects distribution fulfills the purpose of a prior (it is a prior in fact) and it is often useful and reasonable to assume a normal distribution to apply shrinkage even if the true distribution might look somewhat different. I remember @richard_mcelreath having some interesting thoughts about this in his book “statistical rethinking”.

Thanks also for providing a reproducible example!

Unfortunately, I cannot reproduce the error you see. Do you have the latest release versio of brms 2.9.0 or higher? One alternative explanation is that you had loaded rstanarm additionally to brms in the same session, which may cause problems with kfold (that are hopefully fixed soonish).