Custom gaussian hurdle family not quite working in brms

I’m trying to use the custom_family() function in brms to build a Gaussian hurdle family model, following the suggestion here, and following the vignette on custom families. I feel like I’m really close, but there are so few examples of custom_family out in the wild that it’s hard to know what’s going wrong.

Here’s some reproducible data (values distributed normally around 20ish, with a bunch of 0s), along with semi-working custom_family code:

library(tidyverse)
library(brms)


# Make example data
set.seed(1234)
example_data <- tibble(outcome = rnorm(n = 1000, mean = 15, sd = 3)) %>%
  mutate(x = rnorm(n = 1000, mean = 5, sd = 1)) %>% 
  mutate(outcome = outcome * rbinom(1000, 1, 0.9) * (0.25 * x))
head(example_data)
#> # A tibble: 6 x 2
#>   outcome     x
#>     <dbl> <dbl>
#> 1    10.8  3.79
#> 2    21.0  5.30
#> 3    15.8  3.46
#> 4    11.2  5.64
#> 5    23.2  5.70
#> 6     0    3.09

# Normal around 20ish, but lots of 0s
ggplot(example_data, aes(x = outcome)) + 
  geom_histogram(binwidth = 1, boundary = 0, color = "white")

# Create a custom family that is logit if y = 0, normal/gaussian if not
hurdle_gaussian <- 
  custom_family("hurdle_gaussian", 
                dpars = c("mu", "sigma", "hu"),
                links = c("identity", "log", "logit"),
                lb = c(NA, NA, 0),
                type = "real")

stan_funs <- "
  real hurdle_gaussian_lpdf(real y, real mu, real sigma, real hu) { 
    if (y == 0) { 
      return bernoulli_logit_lpmf(1 | hu); 
    } else { 
      return bernoulli_logit_lpmf(0 | hu) +  
             normal_lpdf(y | mu, sigma); 
    } 
  }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

model_fit <- brm(bf(outcome ~ x, hu ~ 1), data = example_data, 
                 family = hurdle_gaussian, stanvars = stanvars,
                 chains = 2, init = 1, seed = 1234)
#> Compiling Stan program...

# There's a summary!
summary(model_fit)
#>  Family: hurdle_gaussian 
#>   Links: mu = identity; sigma = identity; hu = logit 
#> Formula: outcome ~ x 
#>          hu ~ 1
#>    Data: example_data (Number of observations: 1000) 
#> Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 2000
#> 
#> Population-Level Effects: 
#>              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept       -1.05      0.66    -2.34     0.23 1.00     2283     1629
#> hu_Intercept    -2.04      0.10    -2.24    -1.85 1.00     2153     1385
#> x                3.95      0.13     3.70     4.21 1.00     2242     1687
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     3.87      0.09     3.68     4.06 1.00     2167     1592
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

# Adaopted from posterior_predict_hurdle_lognormal at
# https://github.com/paul-buerkner/brms/blob/master/R/posterior_predict.R#L736
posterior_predict_hurdle_gaussian <- function(i, prep, ...) {
  theta <- prep$dpars$hu[, 1]
  mu <- prep$dpars$mu[, 1]
  sigma <- prep$dpars$sigma
  ndraws <- prep$nsamples
  
  hu <- runif(ndraws, 0, 1)
  ifelse(hu < theta, 0, rnorm(ndraws, mu, sigma))
}

# It picks up on the two processes! But the distribution is shifted really far to the left
pp_check(model_fit)
#> Using 10 posterior samples for ppc type 'dens_overlay' by default.

The model is fitting (yay!) and making posterior predictions, but the predictions are off.

I think one main issue is that I’m using normal_lpdf() instead of the more model-focused normal_id_glm_lpdf(). The custom Stan code might need to look more like this?

stan_funs <- "
  real hurdle_gaussian_lpdf(real y, matrix x, real alpha, vector beta, real sigma, real hu) { 
    if (y == 0) { 
      return bernoulli_logit_lpmf(1 | hu); 
    } else { 
      return bernoulli_logit_lpmf(0 | hu) +  
             normal_id_glm_lpdf(y | alpha, beta, sigma); 
    } 
  }
"

I’m not sure at all though.

I’d appreciate any help or insight anyone has about this!


  • Operating System: macOS Big Sur
  • brms Version: 2.14.4
3 Likes

I can relate, getting custom_families to work is tricky. I cannot say if your family does what you want, but I did spot a reason why the predictions don’t work as expected.
In the posterior_predict_hurdle_gaussian method, replace [ ,1] with [, i]:

posterior_predict_hurdle_gaussian <- function(i, prep, ...) {
 theta <- prep$dpars$hu[, i]
 mu <- prep$dpars$mu[, i]
 sigma <- prep$dpars$sigma
 ndraws <- prep$nsamples
 
 hu <- runif(ndraws, 0, 1)
 ifelse(hu < theta, 0, rnorm(ndraws, mu, sigma))
}

with this, pp_check looks good for me.

3 Likes

Ah ha! Good catch! That definitely fixes pp_check!

1 Like

Hi,

To use the custom family “hurdle_gaussian” that you wrote above, does someone only need the functions you included above? I tried implementing the above custom family but I get the error “hurdle_gaussian function not found” . Is there more code to include to get the custom family working?

Thanks!