I’m trying to use the custom_family()
function in brms to build a Gaussian hurdle family model, following the suggestion here, and following the vignette on custom families. I feel like I’m really close, but there are so few examples of custom_family
out in the wild that it’s hard to know what’s going wrong.
Here’s some reproducible data (values distributed normally around 20ish, with a bunch of 0s), along with semi-working custom_family
code:
library(tidyverse)
library(brms)
# Make example data
set.seed(1234)
example_data <- tibble(outcome = rnorm(n = 1000, mean = 15, sd = 3)) %>%
mutate(x = rnorm(n = 1000, mean = 5, sd = 1)) %>%
mutate(outcome = outcome * rbinom(1000, 1, 0.9) * (0.25 * x))
head(example_data)
#> # A tibble: 6 x 2
#> outcome x
#> <dbl> <dbl>
#> 1 10.8 3.79
#> 2 21.0 5.30
#> 3 15.8 3.46
#> 4 11.2 5.64
#> 5 23.2 5.70
#> 6 0 3.09
# Normal around 20ish, but lots of 0s
ggplot(example_data, aes(x = outcome)) +
geom_histogram(binwidth = 1, boundary = 0, color = "white")
# Create a custom family that is logit if y = 0, normal/gaussian if not
hurdle_gaussian <-
custom_family("hurdle_gaussian",
dpars = c("mu", "sigma", "hu"),
links = c("identity", "log", "logit"),
lb = c(NA, NA, 0),
type = "real")
stan_funs <- "
real hurdle_gaussian_lpdf(real y, real mu, real sigma, real hu) {
if (y == 0) {
return bernoulli_logit_lpmf(1 | hu);
} else {
return bernoulli_logit_lpmf(0 | hu) +
normal_lpdf(y | mu, sigma);
}
}
"
stanvars <- stanvar(scode = stan_funs, block = "functions")
model_fit <- brm(bf(outcome ~ x, hu ~ 1), data = example_data,
family = hurdle_gaussian, stanvars = stanvars,
chains = 2, init = 1, seed = 1234)
#> Compiling Stan program...
# There's a summary!
summary(model_fit)
#> Family: hurdle_gaussian
#> Links: mu = identity; sigma = identity; hu = logit
#> Formula: outcome ~ x
#> hu ~ 1
#> Data: example_data (Number of observations: 1000)
#> Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup samples = 2000
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept -1.05 0.66 -2.34 0.23 1.00 2283 1629
#> hu_Intercept -2.04 0.10 -2.24 -1.85 1.00 2153 1385
#> x 3.95 0.13 3.70 4.21 1.00 2242 1687
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma 3.87 0.09 3.68 4.06 1.00 2167 1592
#>
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
# Adaopted from posterior_predict_hurdle_lognormal at
# https://github.com/paul-buerkner/brms/blob/master/R/posterior_predict.R#L736
posterior_predict_hurdle_gaussian <- function(i, prep, ...) {
theta <- prep$dpars$hu[, 1]
mu <- prep$dpars$mu[, 1]
sigma <- prep$dpars$sigma
ndraws <- prep$nsamples
hu <- runif(ndraws, 0, 1)
ifelse(hu < theta, 0, rnorm(ndraws, mu, sigma))
}
# It picks up on the two processes! But the distribution is shifted really far to the left
pp_check(model_fit)
#> Using 10 posterior samples for ppc type 'dens_overlay' by default.
The model is fitting (yay!) and making posterior predictions, but the predictions are off.
I think one main issue is that I’m using normal_lpdf()
instead of the more model-focused normal_id_glm_lpdf()
. The custom Stan code might need to look more like this?
stan_funs <- "
real hurdle_gaussian_lpdf(real y, matrix x, real alpha, vector beta, real sigma, real hu) {
if (y == 0) {
return bernoulli_logit_lpmf(1 | hu);
} else {
return bernoulli_logit_lpmf(0 | hu) +
normal_id_glm_lpdf(y | alpha, beta, sigma);
}
}
"
I’m not sure at all though.
I’d appreciate any help or insight anyone has about this!
- Operating System: macOS Big Sur
- brms Version: 2.14.4