Custom gaussian hurdle family not quite working in brms

I’m trying to use the custom_family() function in brms to build a Gaussian hurdle family model, following the suggestion here, and following the vignette on custom families. I feel like I’m really close, but there are so few examples of custom_family out in the wild that it’s hard to know what’s going wrong.

Here’s some reproducible data (values distributed normally around 20ish, with a bunch of 0s), along with semi-working custom_family code:

library(tidyverse)
library(brms)


# Make example data
set.seed(1234)
example_data <- tibble(outcome = rnorm(n = 1000, mean = 15, sd = 3)) %>%
  mutate(x = rnorm(n = 1000, mean = 5, sd = 1)) %>% 
  mutate(outcome = outcome * rbinom(1000, 1, 0.9) * (0.25 * x))
head(example_data)
#> # A tibble: 6 x 2
#>   outcome     x
#>     <dbl> <dbl>
#> 1    10.8  3.79
#> 2    21.0  5.30
#> 3    15.8  3.46
#> 4    11.2  5.64
#> 5    23.2  5.70
#> 6     0    3.09

# Normal around 20ish, but lots of 0s
ggplot(example_data, aes(x = outcome)) + 
  geom_histogram(binwidth = 1, boundary = 0, color = "white")

# Create a custom family that is logit if y = 0, normal/gaussian if not
hurdle_gaussian <- 
  custom_family("hurdle_gaussian", 
                dpars = c("mu", "sigma", "hu"),
                links = c("identity", "log", "logit"),
                lb = c(NA, NA, 0),
                type = "real")

stan_funs <- "
  real hurdle_gaussian_lpdf(real y, real mu, real sigma, real hu) { 
    if (y == 0) { 
      return bernoulli_logit_lpmf(1 | hu); 
    } else { 
      return bernoulli_logit_lpmf(0 | hu) +  
             normal_lpdf(y | mu, sigma); 
    } 
  }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

model_fit <- brm(bf(outcome ~ x, hu ~ 1), data = example_data, 
                 family = hurdle_gaussian, stanvars = stanvars,
                 chains = 2, init = 1, seed = 1234)
#> Compiling Stan program...

# There's a summary!
summary(model_fit)
#>  Family: hurdle_gaussian 
#>   Links: mu = identity; sigma = identity; hu = logit 
#> Formula: outcome ~ x 
#>          hu ~ 1
#>    Data: example_data (Number of observations: 1000) 
#> Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 2000
#> 
#> Population-Level Effects: 
#>              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept       -1.05      0.66    -2.34     0.23 1.00     2283     1629
#> hu_Intercept    -2.04      0.10    -2.24    -1.85 1.00     2153     1385
#> x                3.95      0.13     3.70     4.21 1.00     2242     1687
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     3.87      0.09     3.68     4.06 1.00     2167     1592
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

# Adaopted from posterior_predict_hurdle_lognormal at
# https://github.com/paul-buerkner/brms/blob/master/R/posterior_predict.R#L736
posterior_predict_hurdle_gaussian <- function(i, prep, ...) {
  theta <- prep$dpars$hu[, 1]
  mu <- prep$dpars$mu[, 1]
  sigma <- prep$dpars$sigma
  ndraws <- prep$nsamples
  
  hu <- runif(ndraws, 0, 1)
  ifelse(hu < theta, 0, rnorm(ndraws, mu, sigma))
}

# It picks up on the two processes! But the distribution is shifted really far to the left
pp_check(model_fit)
#> Using 10 posterior samples for ppc type 'dens_overlay' by default.

The model is fitting (yay!) and making posterior predictions, but the predictions are off.

I think one main issue is that I’m using normal_lpdf() instead of the more model-focused normal_id_glm_lpdf(). The custom Stan code might need to look more like this?

stan_funs <- "
  real hurdle_gaussian_lpdf(real y, matrix x, real alpha, vector beta, real sigma, real hu) { 
    if (y == 0) { 
      return bernoulli_logit_lpmf(1 | hu); 
    } else { 
      return bernoulli_logit_lpmf(0 | hu) +  
             normal_id_glm_lpdf(y | alpha, beta, sigma); 
    } 
  }
"

I’m not sure at all though.

I’d appreciate any help or insight anyone has about this!


  • Operating System: macOS Big Sur
  • brms Version: 2.14.4
3 Likes

I can relate, getting custom_families to work is tricky. I cannot say if your family does what you want, but I did spot a reason why the predictions don’t work as expected.
In the posterior_predict_hurdle_gaussian method, replace [ ,1] with [, i]:

posterior_predict_hurdle_gaussian <- function(i, prep, ...) {
 theta <- prep$dpars$hu[, i]
 mu <- prep$dpars$mu[, i]
 sigma <- prep$dpars$sigma
 ndraws <- prep$nsamples
 
 hu <- runif(ndraws, 0, 1)
 ifelse(hu < theta, 0, rnorm(ndraws, mu, sigma))
}

with this, pp_check looks good for me.

3 Likes

Ah ha! Good catch! That definitely fixes pp_check!

1 Like

Hi,

To use the custom family “hurdle_gaussian” that you wrote above, does someone only need the functions you included above? I tried implementing the above custom family but I get the error “hurdle_gaussian function not found” . Is there more code to include to get the custom family working?

Thanks!

@BrittaL @andrewheiss @paul.buerkner

Has anything changed re: the hurdle components in these custom families? I’ve run into some issues with other hurdle models, but they crop up in this simpler example (which previously worked for me).

Basically, the non-zero density always looks OK – the problem seems specific to underestimating the number of zeroes.

library(data.table)
library(brms)
library(ggplot2)

# Make example data
set.seed(1234)

example_data <- data.table("outcome" = rnorm(1000, 15, 3))

example_data[,x := rnorm(1000, 5, 1)]
example_data[,outcome := outcome*rbinom(1000,1,0.9) * (0.25*x)]

# Normal around 20ish, but lots of 0s
ggplot(example_data, aes(x = outcome)) + 
  geom_histogram(binwidth = 1, boundary = 0, color = "white")

prop_zero <- function(x) mean(x == 0)
zeroes_in_data <- prop_zero(example_data$outcome) # ~ 11% zeros
logit_scaled(zeroes_in_data) # On logit scale, around -2

# Create a custom family that is logit if y = 0, normal/gaussian if not
hurdle_gaussian <- 
  custom_family("hurdle_gaussian", 
                dpars = c("mu", "sigma", "hu"),
                links = c("identity", "log", "logit"),
                lb = c(NA, 0, NA),
                type = "real")

stan_funs <- "
  real hurdle_gaussian_lpdf(real y, real mu, real sigma, real hu) { 
    if (y == 0) { 
      return bernoulli_logit_lpmf(1 | hu); 
    } else { 
      return bernoulli_logit_lpmf(0 | hu) +  
             normal_lpdf(y | mu, sigma); 
    } 
  }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

model_fit <- brm(bf(outcome ~ x, 
                    hu ~ 1), 
                 data = example_data, 
                 family = hurdle_gaussian, 
                 stanvars = stanvars,
                 chains = 4,
                 iter = 1000,
                 warmup = 500,
                 cores = 4,
                 backend = "cmdstanr")

summary(model_fit)

# Family: hurdle_gaussian 
# Links: mu = identity; sigma = identity; hu = logit 
# Formula: outcome ~ x 
# hu ~ 1
# Data: example_data (Number of observations: 1000) 
# Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
# total post-warmup draws = 2000
# 
# Population-Level Effects: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept       -1.03      0.65    -2.28     0.26 1.00     2093     1527
**# hu_Intercept    -6.58      1.32    -9.84    -4.67 1.00     1618      773**
# x                3.94      0.13     3.69     4.20 1.00     2123     1395
# 
# Family Specific Parameters: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     3.87      0.10     3.68     4.07 1.00     1830     1245
# 
# Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).

# # hu_intercept parameter seems off, so it's not likely an error in the posterior predictive functions

posterior_predict_hurdle_gaussian <- function(i, prep, ...) {
  mu <- brms::get_dpar(prep, "mu", i = i)
  sigma <- brms::get_dpar(prep, "sigma", i = i)
  theta <- brms::get_dpar(prep, "hu", i = i)
  
  hu <- runif(prep$ndraws, 0, 1)
  ifelse(hu < theta, 0, rnorm(prep$ndraws, mu,sigma))
  
}

pp_check(model_fit)

Thanks for any assistance!

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.5     brms_2.16.7       Rcpp_1.0.7        data.table_1.14.2

loaded via a namespace (and not attached):
  [1] nlme_3.1-152         matrixStats_0.61.0   xts_0.12.1           lubridate_1.7.10    
  [5] threejs_0.3.3        httr_1.4.2           rstan_2.21.3         tensorA_0.36.2      
  [9] tools_4.1.0          backports_1.4.1      utf8_1.2.2           R6_2.5.1            
 [13] DT_0.21              mgcv_1.8-35          DBI_1.1.1            projpred_2.0.2      
 [17] colorspace_2.0-3     withr_2.5.0          tidyselect_1.1.2     gridExtra_2.3       
 [21] prettyunits_1.1.1    processx_3.5.2       Brobdingnag_1.2-7    curl_4.3.2          
 [25] compiler_4.1.0       textshaping_0.3.5    cli_3.0.1            shinyjs_2.1.0       
 [29] labeling_0.4.2       colourpicker_1.1.1   posterior_1.2.1      scales_1.1.1        
 [33] dygraphs_1.1.1.6     checkmate_2.0.0      mvtnorm_1.1-3        ggridges_0.5.3      
 [37] callr_3.7.0          systemfonts_1.0.2    stringr_1.4.0        digest_0.6.29       
 [41] StanHeaders_2.21.0-7 minqa_1.2.4          base64enc_0.1-3      pkgconfig_2.0.3     
 [45] htmltools_0.5.2      lme4_1.1-28          fastmap_1.1.0        htmlwidgets_1.5.4   
 [49] rlang_1.0.2          rstudioapi_0.13      shiny_1.7.1          farver_2.1.0        
 [53] generics_0.1.2       zoo_1.8-9            jsonlite_1.7.2       crosstalk_1.2.0     
 [57] gtools_3.9.2         dplyr_1.0.8          distributional_0.3.0 inline_0.3.19       
 [61] magrittr_2.0.1       proverbs_0.1.0       loo_2.5.0            bayesplot_1.9.0.9000
 [65] Matrix_1.3-3         munsell_0.5.0        fansi_1.0.2          abind_1.4-5         
 [69] lifecycle_1.0.1      stringi_1.7.6        MASS_7.3-54          pkgbuild_1.3.1      
 [73] plyr_1.8.6           grid_4.1.0           parallel_4.1.0       promises_1.2.0.1    
 [77] crayon_1.5.0         miniUI_0.1.1.1       lattice_0.20-44      splines_4.1.0       
 [81] knitr_1.37           ps_1.6.0             pillar_1.7.0         igraph_1.2.11       
 [85] boot_1.3-28          markdown_1.1         shinystan_2.6.0      reshape2_1.4.4      
 [89] codetools_0.2-18     stats4_4.1.0         rstantools_2.1.1     glue_1.4.2          
 [93] RcppParallel_5.1.5   nloptr_2.0.0         vctrs_0.3.8          httpuv_1.6.5        
 [97] gtable_0.3.0         purrr_0.3.4          assertthat_0.2.1     xfun_0.30           
[101] mime_0.12            xtable_1.8-4         coda_0.19-4          later_1.3.0         
[105] ragg_1.1.3           diffobj_0.3.5        tibble_3.1.6         shinythemes_1.2.0   
[109] gamm4_0.2-6          cmdstanr_0.4.0       ellipsis_0.3.2       bridgesampling_1.1-2

FWIW, solved by changing from bernoulli_logit_lpmf to bernoulli_lpmf

hurdle_gaussian <- 
  custom_family("hurdle_gaussian", 
                dpars = c("mu", "sigma", "hu"),
                links = c("identity", "log", "logit"),
                lb = c(NA, 0, NA),
                type = "real")

stan_funs <- "
  real hurdle_gaussian_lpdf(real y, real mu, real sigma, real hu) { 
    if (y == 0) { 
      return bernoulli_lpmf(1 | hu); 
    } else { 
      return bernoulli_lpmf(0 | hu) +  
             normal_lpdf(y | mu, sigma); 
    } 
  }
stanvars <- stanvar(scode = stan_funs, block = "functions")

model_fit <- brm(bf(outcome ~ x, 
                    hu ~ 1), 
                 data = example_data, 
                 family = hurdle_gaussian, 
                 stanvars = stanvars,
                 chains = 4,
                 iter = 1000,
                 warmup = 500,
                 cores = 4,
                 backend = "cmdstanr")

summary(model_fit)

# Family: hurdle_gaussian 
# Links: mu = identity; sigma = identity; hu = logit 
# Formula: outcome ~ x 
# hu ~ 1
# Data: example_data (Number of observations: 1000) 
# Draws: 4 chains, each with iter = 500; warmup = 0; thin = 1;
# total post-warmup draws = 2000
# 
# Population-Level Effects: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept       -1.02      0.68    -2.35     0.28 1.00     2141     1283
# hu_Intercept    -2.05      0.10    -2.25    -1.85 1.00     1904     1568
# x                3.94      0.13     3.69     4.20 1.00     2163     1369
# 
# Family Specific Parameters: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     3.87      0.09     3.69     4.04 1.00     2399     1516
# 
# Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).

posterior_predict_hurdle_gaussian <- function(i, prep, ...) {
  mu <- brms::get_dpar(prep, "mu", i = i)
  sigma <- brms::get_dpar(prep, "sigma", i = i)
  theta <- brms::get_dpar(prep, "hu", i = i)
  
  hu <- runif(prep$ndraws, 0, 1)
  ifelse(hu < theta, 0, rnorm(prep$ndraws, mu,sigma))
  
}

pp_check(model_fit)