Hi everyone,
First time posting. I’m struggling with understanding a hurdle_gamma
model with non-independent observations with brms
and was hoping you could point me in the right direction.
What I think the estimates mean
If I simulate a zero-inflated gamma distributed variable with five observations per participant, I think I can recover the parameter values:
library(tidyverse)
library(brms)
set.seed(42)
N <- 1000
J <- 200
id <- 1:J
id_shapes <- rep(rnorm(J, 2, 0.1), each = N/J)
id_zero <- rep(rnorm(J, 0.8, 0.1), each = N/J)
id_zero <- if_else(id_zero > 1, 1, id_zero)
id_scales <- rep(rnorm(J, 2, 0.01), each = N/J)
dat <-
tibble(
non_zero = rbinom(N, 1, id_zero),
g_vals = rgamma(n = N, shape = id_shapes, scale = id_scales),
id = rep(1:J, each = 5),
y = non_zero * g_vals
)
m <-
brm(
bf(
y ~ 1 + (1|id),
hu ~ 1 + (1|id)
),
data = dat,
seed = 42,
cores = 4,
iter = 4000,
warmup = 2000,
control = list(adapt_delta = 0.95),
family = hurdle_gamma()
)
summary(m)
Family: hurdle_gamma
Links: mu = log; shape = identity; hu = logit
Formula: y ~ 1 + (1 | id)
hu ~ 1 + (1 | id)
Data: dat (Number of observations: 1000)
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup samples = 8000
Group-Level Effects:
~id (Number of levels: 200)
Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept) 0.07 0.05 0.00 0.17 1.00
sd(hu_Intercept) 0.57 0.19 0.14 0.91 1.00
Bulk_ESS Tail_ESS
sd(Intercept) 1735 3596
sd(hu_Intercept) 1245 1017
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 1.41 0.03 1.36 1.46 1.00 13210
hu_Intercept -1.48 0.11 -1.70 -1.28 1.00 4359
Tail_ESS
Intercept 5671
hu_Intercept 5182
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 2.03 0.10 1.84 2.23 1.00 11763 5616
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
If I understand correctly, the Intercept uses the log link and is the mean of non-zero values. So the model retrieves that mean well: exp(1.41)
= 4.10; the raw mean of non-zero values is 4.10.
hu_intercept
is the odds of the variable being zero. Transformed back to probability, plogis(-1.48)
= 0.19, which is super close to what I simulated.
My data and model
Now, my data dat.txt (53.3 KB) look similar (created with dput
function). I have reason to believe the variable follows a zero-inflated gamma distribution. Participants reported how many hours they listened to music, receiving a zero when they didn’t listen to music.
Here the comparison between the simulated variable and my real variable:
In contrast to the simulated data, the number of observations per participant isn’t constant. Some have five observations, others only one.
Now, when I fit the model, the estimates are really off.
m2 <-
brm(
bf(
y ~ 1 + (1 | id),
hu ~ 1 + (1 | id)
),
data = my_dat,
family = hurdle_gamma(),
warmup = 2000,
iter = 4000,
seed = 42,
chains = 4,
cores = 4
)
summary(m2)
Family: hurdle_gamma
Links: mu = log; shape = identity; hu = logit
Formula: y ~ 1 + (1 | id)
hu ~ 1 + (1 | id)
Data: my_dat (Number of observations: 1000)
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup samples = 8000
Group-Level Effects:
~id (Number of levels: 189)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.76 0.04 0.68 0.85 1.00 1313 2909
sd(hu_Intercept) 3.99 0.51 3.11 5.09 1.00 2806 4104
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.91 0.06 0.79 1.03 1.00 997 1275
hu_Intercept -3.52 0.46 -4.51 -2.71 1.00 2870 4227
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 6.01 0.33 5.38 6.68 1.00 8270 6442
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The intercept is quite far off from the raw mean: exp(0.91)
= 2.48, the raw mean of non-zero values is 3.19. hu_intercept
: plogis(-3.52)
= 0.0288 is extremely small, proportion of zeros in the data is 0.20.
When I inspect loo(m2)
, there are quite a handful of problematic observations (32) with a high Pareto k value.
Model diagnostics:
Can someone give me a hint what’s going on here? When I fit the model ignoring the nested nature of the data, the model estimates are well-aligned with the raw data. Is the varying number of observations per participant problematic?
- Operating System: Windows 10
-
brms
version: 2.13.5 -
rstan
version: 2.21.2