# Regularizing priors to help brms fit a multilevel negative binomial hurdle model with identity link

Hi all,

I am busy with a validation study of a field method used in ecology and have three variables: “identity”, “observed_count” and “proxy_count”. Identity is individual identity (n = 38), observed_count is the true value of the variable of interest, and proxy_count is an index of observed_count. I want to know how reliably I can expect to predict observed_count in future, when only proxy_count is available.

The counts are zero-inflated and there is a weak positive correlation between the observed and proxy counts (as I expected).

Using brms, I fitted a negative binomial hurdle model. Important: I expect the relationship between the response and predictor to be linear (especially for some of the other “proxy_counts” I have, which represent the “observed_count” better than the current example). Therefore, I specify an identity link, not a log link. This model fits reasonably well, though it overpredicts zeros.

```r
# Define weakly informative priors
model_prior <- c(
  set_prior("normal(0, 1)", class = "Intercept"),              # Prior for mu_Intercept
  set_prior("beta(2, 2)", class = "Intercept", dpar = "hu", lb = 0, ub = 1), # Prior for hu_Intercept
  set_prior("normal(0, 1)", class = "b"),                      # Prior for regression coefficients
  set_prior("normal(0, 1)", class = "b", dpar = "hu"),         # Prior for hu regression coefficients
  set_prior("lognormal(1, 1)", class = "shape"))               # Prior for the NB shape parameter

mod_nb <- brm(bf(observed_count ~ proxy_count, hu ~ proxy_count),
              family = hurdle_negbinomial(link = "identity"),
              data = dat,
              prior = model_prior, sample_prior = TRUE)

pred <- posterior_predict(mod_nb)
bayesplot::ppc_dens_overlay(y = log1p(dat$observed_count), yrep = log1p(pred[1:100, ]))
```

Given that I want to estimate the predictive performance for new individuals, I’ve been thinking about ‘out-of-sample’ predictive performance - e.g., estimating leave-one-out predictive performance with `loo_predictive_metric`.
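As a sketch of what that could look like for the current model (assuming `mod_nb` from above and that the `loo` package is installed; `loo_predictive_metric()` takes a matrix of predictions, the observed outcomes, and the pointwise log-likelihood):

```r
# Approximate leave-one-out MAE for mod_nb (a sketch, not tested on these data)
loo::loo_predictive_metric(
  x = posterior_epred(mod_nb),     # expected predictions, draws x observations
  y = dat$observed_count,
  log_lik = log_lik(mod_nb),
  metric = "mae")
```

Note that this is leave-one-*observation*-out; since the goal is prediction for new individuals, something like `kfold(model, K = 10, group = "id")`, which leaves out whole individuals per fold, may be closer to the target quantity.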

But before I get to this, I ideally need to include a random intercept and slope for every individual, to account for the structure in the data (multiple observations per individual; scatterplots show individuals tend to have different slopes). For example:

```r
mixmod_nb <- brm(bf(observed_count ~ proxy_count + (1 + proxy_count | id),
                    hu ~ proxy_count + (1 + proxy_count | id)),
                 family = hurdle_negbinomial(link = "identity"),
                 data = dat,
                 prior = model_prior, sample_prior = TRUE)
```

I can fit this model when specifying a log link function, but it doesn’t fit with an identity link, which does not constrain the predictions to be positive (a zero-inflated model behaves the same). I wondered whether appropriate regularizing priors would allow this model to run? That is where I get stuck - I don’t (yet) have a good handle on setting proper priors (in general and with brms). I’d welcome suggestions for appropriate regularizing priors if you think this is a possible solution. Any general comments on the approach (am I even on the right track here? I sometimes have doubts) are also appreciated.
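For reference, one starting point I considered (purely a sketch - I’m not sure these are sensible, and the exact parameter classes should be checked with `get_prior()` on the full formula) tightens the group-level standard deviations and correlations:

```r
# Possible regularizing priors for the multilevel model (a sketch;
# class names assumed - verify with get_prior() for this formula/family)
mix_prior <- c(
  model_prior,
  set_prior("normal(0, 0.5)", class = "sd"),              # group-level SDs (mu part)
  set_prior("normal(0, 0.5)", class = "sd", dpar = "hu"), # group-level SDs (hurdle part)
  set_prior("lkj(2)", class = "cor"))                     # mildly regularize correlations
```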

I include a csv file with the data (2471 observations) used here, and the full code below, in case anyone is interested in having a look:
input.csv (26.6 KB)

```r
library(brms)
library(tidyverse)
library(modelr)

hist(dat$observed_count, breaks = 20)

ggplot(data = dat, aes(proxy_count, observed_count)) +
  geom_point(alpha = 0.2) + theme_bw()

# Fit hurdle_negbinomial brm model
# Define weakly informative priors
model_prior_informative <- c(
  set_prior("normal(0, 1)", class = "Intercept"),              # Prior for mu_Intercept
  set_prior("beta(2, 2)", class = "Intercept", dpar = "hu", lb = 0, ub = 1), # Prior for hu_Intercept
  set_prior("normal(0, 1)", class = "b"),                      # Prior for regression coefficients
  set_prior("normal(0, 1)", class = "b", dpar = "hu"),         # Prior for hu regression coefficients
  set_prior("lognormal(1, 1)", class = "shape"))               # Prior for the NB shape parameter

mod_nb <- brm(bf(observed_count ~ proxy_count, hu ~ proxy_count),
              family = hurdle_negbinomial(link = "identity"),
              data = dat,
              prior = model_prior_informative,
              sample_prior = TRUE)

mod_nb

pp_check(mod_nb)

# To highlight the zero process, we log the predicted values.
# Look again - how well do the posterior draws fit the actual data?
pred <- posterior_predict(mod_nb)
bayesplot::ppc_dens_overlay(y = log1p(dat$observed_count), yrep = log1p(pred[1:100, ]))

# Posterior predictions
conditional_effects(mod_nb, method = "posterior_epred", prob = 0.9)
conditional_effects(mod_nb, method = "posterior_epred", prob = 0.9, dpar = "hu")

dat %>%
  modelr::data_grid(proxy_count = seq_range(proxy_count, n = 20)) %>%
  tidybayes::add_predicted_draws(mod_nb) %>%   # adds the .prediction column
  ggplot(aes(x = proxy_count, y = observed_count)) +
  ggdist::stat_lineribbon(aes(y = .prediction), .width = c(.90, .80, .50), alpha = 0.25) +
  geom_point(data = dat, alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1)  # 1:1 line
```

An identity link is almost certain to cause problems unless all the model predictions are far from zero. There are, however, at least two ways to achieve something close to a linear relationship between a predictor and the observed counts:

1. Use the log of the predictor to predict the log of the observed count.
2. Use the `softplus` link with the untransformed predictor - this assumes the relationship is linear far from zero, but constrains the predictions to be positive.
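For option 2, a sketch of what that could look like with the multilevel model from the question (assuming your brms version supports the `softplus` link for this family):

```r
# Softplus keeps mu positive while staying ~linear away from zero
mixmod_sp <- brm(
  bf(observed_count ~ proxy_count + (1 + proxy_count | id),
     hu ~ proxy_count + (1 + proxy_count | id)),
  family = hurdle_negbinomial(link = "softplus"),
  data = dat)
```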

Best of luck with your model.
