Estimates in hurdle gamma model

Hi everyone,

First time posting. I’m struggling with understanding a hurdle_gamma model with non-independent observations with brms and was hoping you could point me in the right direction.

What I think the estimates mean

If I simulate a zero-inflated gamma distributed variable with five observations per participant, I think I can recover the parameter values:

library(tidyverse)
library(brms)

set.seed(42)
N <- 1000

J  <-  200
id <- 1:J
id_shapes <- rep(rnorm(J, 2, 0.1), each = N/J)
id_zero <- rep(rnorm(J, 0.8, 0.1), each = N/J)
id_zero <- if_else(id_zero > 1, 1, id_zero)
id_scales <- rep(rnorm(J, 2, 0.01), each = N/J)

dat <- 
  tibble(
    non_zero = rbinom(N, 1, id_zero),
    g_vals = rgamma(n = N, shape = id_shapes, scale = id_scales),
    id = rep(1:J, each = 5),
    y = non_zero * g_vals
  )

m <- 
  brm(
    bf(
      y ~ 1 + (1|id),
      hu ~ 1 + (1|id)
    ),
    data = dat,
    seed = 42,
    cores = 4,
    iter = 4000,
    warmup = 2000,
    control = list(adapt_delta = 0.95),
    family = hurdle_gamma()
  )

summary(m)
 Family: hurdle_gamma 
  Links: mu = log; shape = identity; hu = logit 
Formula: y ~ 1 + (1 | id) 
         hu ~ 1 + (1 | id)
   Data: dat (Number of observations: 1000) 
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup samples = 8000

Group-Level Effects: 
~id (Number of levels: 200) 
                 Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept)        0.07      0.05     0.00     0.17 1.00
sd(hu_Intercept)     0.57      0.19     0.14     0.91 1.00
                 Bulk_ESS Tail_ESS
sd(Intercept)        1735     3596
sd(hu_Intercept)     1245     1017

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept        1.41      0.03     1.36     1.46 1.00    13210
hu_Intercept    -1.48      0.11    -1.70    -1.28 1.00     4359
             Tail_ESS
Intercept        5671
hu_Intercept     5182

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     2.03      0.10     1.84     2.23 1.00    11763     5616

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

If I understand correctly, the Intercept uses the log link and is the mean of non-zero values. So the model retrieves that mean well: exp(1.41) = 4.10; the raw mean of non-zero values is 4.10.

hu_intercept is the odds of the variable being zero. Transformed back to probability, plogis(-1.48) = 0.19, which is super close to what I simulated.

My data and model

Now, my data dat.txt (53.3 KB) look similar (created with dput function). I have reason to believe the variable follows a zero-inflated gamma distribution. Participants reported how many hours they listened to music, receiving a zero when they didn’t listen to music.

Here the comparison between the simulated variable and my real variable:
dat
my_dat

In contrast to the simulated data, the number of observations per participant isn’t constant. Some have five observations, others only one.

Now, when I fit the model, the estimates are really off.

m2 <- 
  brm(
    bf(
      y ~ 1 + (1 | id),
      hu ~ 1 + (1 | id)
    ),
    data = my_dat,
    family = hurdle_gamma(),
    warmup = 2000,
    iter = 4000,
    seed = 42,
    chains = 4,
    cores = 4
  )

summary(m2)
 Family: hurdle_gamma 
  Links: mu = log; shape = identity; hu = logit 
Formula: y ~ 1 + (1 | id) 
         hu ~ 1 + (1 | id)
   Data: my_dat (Number of observations: 1000) 
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup samples = 8000

Group-Level Effects: 
~id (Number of levels: 189) 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)        0.76      0.04     0.68     0.85 1.00     1313     2909
sd(hu_Intercept)     3.99      0.51     3.11     5.09 1.00     2806     4104

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        0.91      0.06     0.79     1.03 1.00      997     1275
hu_Intercept    -3.52      0.46    -4.51    -2.71 1.00     2870     4227

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     6.01      0.33     5.38     6.68 1.00     8270     6442

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The intercept is quite far off from the raw mean: exp(0.91) = 2.48, the raw mean of non-zero values is 3.19. hu_intercept: plogis(-3.52) = 0.0288 is extremely small, proportion of zeros in the data is 0.20.

When I inspect loo(m2), there are quite a handful of problematic observations (32) with a high Pareto k value.

Model diagnostics:

diagnostics

Can someone give me a hint what’s going on here? When I fit the model ignoring the nested nature of the data, the model estimates are well-aligned with the raw data. Is the varying number of observations per participant problematic?


  • Operating System: Windows 10
  • brms version: 2.13.5
  • rstan version: 2.21.2
2 Likes

Hi, sorry for taking long to respond. Your question is relevant and well written.

This is tricky to check well - since your model also contains the varying intercepts per id, there can be some interaction (I honestly can’t think through this completely clearly to definitely judge if your approach should recover the values or should not). What is always OK is to use predictions as those include all the model parameters. You’ll notice in the pp checks you sent. that predicted means look OK. You might also want to look at predicted means only for the non-zero values (e.g. by grouping). You could also use the stat PP check to see the proportion of predicted zeroes.

The LOO-PIT IMHO shows that the data is somewhat more concentrated in the “middle” range the model can accomodate while observations near the “bottom” are less common. So I would try to see if there are some additional subgroups in your data or anything else that would hint at a source of this problem. I can easily imagine this being an artifact of the study design - as you can see in the data, participants almost always report neat rounded values which could conflict with the completely continuous nature of the gamma distribution. (so e.g. when the fitted mean of the gamma is 0.5, there are very few options for the participants to report a lower value, so those lower values could be underreperesnted). This could probably be handled but it is tricky, e.g. see chapter on rounding in the Stan user guid (and I don’t think currently supported by brms).

Hope I am making some sense and best of luck with your model!

2 Likes