Hi all,
I am busy with a validation study of a field method used in ecology and have three variables: “identity”, “observed_count” and “proxy_count”. Identity is individual identity (n = 38), observed_count is the true value of a variable of interest, and proxy_count is an index of the observed_count. I want to know how reliably I can expect to predict the observed_count in future, when only proxy_count is available.
The counts are zero-inflated and there is a weak positive correlation between the observed and proxy counts (as I expected).
Using brms, I fitted a negative binomial hurdle model. Important: I expect the relationship between the response and predictor to be linear (especially in some of the other “proxy_counts” I have, which are better representations of the “observed_count” than the current example). Therefore, I specify an identity link, not a log link. This model fits reasonably well though it overpredicts zeros.
# Define weakly informative priors
model_prior <- c(
  set_prior("normal(0, 1)", class = "Intercept"),                 # prior for mu_Intercept
  set_prior("beta(2, 2)", class = "Intercept", dpar = "hu",
            lb = 0, ub = 1),                                      # prior for hu_Intercept
  set_prior("normal(0, 1)", class = "b"),                         # priors for regression coefficients
  set_prior("normal(0, 1)", class = "b", dpar = "hu"),
  set_prior("lognormal(1, 1)", class = "shape"))

mod_nb <- brm(bf(observed_count ~ proxy_count, hu ~ proxy_count),
              data = dat,
              family = hurdle_negbinomial(link = "identity", link_shape = "log",
                                          link_hu = "logit"),
              prior = model_prior, sample_prior = TRUE)
pred <- posterior_predict(mod_nb)
bayesplot::ppc_dens_overlay(y = log1p(dat$observed_count), yrep = log1p(pred[1:100,]))
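Since the main misfit is the excess of zeros, a more targeted check than the density overlay is to compare the observed proportion of zeros against each posterior predictive draw. A minimal sketch, reusing `pred` from above (`prop_zero` is a small helper defined here, not part of bayesplot):

```r
# Compare the observed proportion of zeros with the proportion of zeros
# in each posterior predictive draw; if the model overpredicts zeros,
# the observed value falls in the left tail of the yrep histogram.
prop_zero <- function(x) mean(x == 0)
bayesplot::ppc_stat(y = dat$observed_count, yrep = pred, stat = "prop_zero")
```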
Given that I want to estimate the predictive performance for new individuals, I’ve been thinking about ‘out-of-sample’ predictive performance - e.g., estimating leave-one-out predictive performance with loo_predictive_metric().
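One caveat worth flagging: ordinary LOO leaves out single observations, so with repeated measures per individual it estimates performance for new observations of already-seen individuals, not for entirely new individuals. If new individuals are the target, grouped cross-validation that holds out whole individuals may be closer to the goal. A sketch using brms's `kfold()`, assuming the grouping column is named `id` as in the mixed-model formula below:

```r
# 10-fold cross-validation where folds are formed at the individual
# level, so each fold predicts individuals that were entirely held
# out during fitting.
kf <- kfold(mod_nb, K = 10, group = "id")
kf
```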
But before I get to this, I need to ideally include a random intercept and slope for every individual, to account for the structure in the data (multiple observations per individual; scatterplots show individuals tend to have different slopes). For example:
mixmod_nb <- brm(bf(observed_count ~ proxy_count + (1 + proxy_count | id),
                    hu ~ proxy_count + (1 + proxy_count | id)),
                 data = dat,
                 family = hurdle_negbinomial(link = "identity", link_shape = "log",
                                             link_hu = "logit"),
                 prior = model_prior, sample_prior = TRUE)
I can fit this model when specifying a log link function, but it fails with an identity link, which does not constrain the mean to be positive (a zero-inflated model behaves the same). I wondered whether appropriate regularizing priors would allow this model to run? That is where I get stuck - I don’t (yet) have a good handle on setting proper priors (in general and in brms). I’d welcome suggestions for appropriate regularizing priors if you think this is a possible solution. Any general comments on the approach (am I even on the right track here? I sometimes have doubts) are also appreciated.
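For what it's worth, one standard regularizing setup for the group-level terms (illustrative values; a sketch rather than a known fix for the identity-link failure) is tight priors on the group-level standard deviations plus an LKJ prior on the intercept-slope correlations:

```r
# Shrink individual intercepts/slopes toward the population-level
# effects; tighter sd priors give stronger shrinkage. These would be
# added on top of the fixed-effect priors already in model_prior.
ranef_prior <- c(
  set_prior("exponential(2)", class = "sd"),                # count-part sd's
  set_prior("exponential(2)", class = "sd", dpar = "hu"),   # hurdle-part sd's
  set_prior("lkj(2)", class = "cor"))                       # correlation matrices
```

Even so, with an identity link the linear predictor can wander negative during warmup regardless of the priors, so it may also help to pass `init = 0` (or a small init range) to `brm()`.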
I include a csv file with the data (2471 observations) used here, and code below, if anyone is interested in having a look:
input.csv (26.6 KB)
library(brms)
library(tidyverse)
library(modelr)
dat = read.csv("input.csv")
hist(dat$observed_count, breaks = 20)
ggplot(data = dat, aes(proxy_count, observed_count)) +
  geom_point(alpha = 0.2) +
  theme_bw()
# Fit hurdle_negbinomial brm model
# Define weakly informative priors
model_prior_informative <- c(
  set_prior("normal(0, 1)", class = "Intercept"),                 # prior for mu_Intercept
  set_prior("beta(2, 2)", class = "Intercept", dpar = "hu",
            lb = 0, ub = 1),                                      # prior for hu_Intercept
  set_prior("normal(0, 1)", class = "b"),                         # priors for regression coefficients
  set_prior("normal(0, 1)", class = "b", dpar = "hu"),
  set_prior("lognormal(1, 1)", class = "shape"))

mod_nb <- brm(bf(observed_count ~ proxy_count, hu ~ proxy_count),
              data = dat,
              family = hurdle_negbinomial(link = "identity", link_shape = "log",
                                          link_hu = "logit"),
              prior = model_prior_informative,
              sample_prior = TRUE)
mod_nb
pp_check(mod_nb)
# To highlight the zero process, we log the predicted values.
# Look again - how well do the posterior draws fit the actual data?
pred <- posterior_predict(mod_nb)
bayesplot::ppc_dens_overlay(y = log1p(dat$observed_count), yrep = log1p(pred[1:100,]))
# posterior predictions
conditional_effects(mod_nb, method = "posterior_epred", prob = 0.9)
conditional_effects(mod_nb, method = "posterior_epred", prob = 0.9, dpar = "hu")
dat %>%
  modelr::data_grid(proxy_count = seq_range(proxy_count, n = 20)) %>%
  tidybayes::add_predicted_draws(mod_nb) %>%
  ggplot(aes(x = proxy_count, y = observed_count)) +
  ggdist::stat_lineribbon(aes(y = .prediction), .width = c(.90, .80, .50), alpha = 0.25) +
  geom_point(data = dat, alpha = 0.2) +
  geom_segment(aes(x = 0, y = 0, xend = 14, yend = 14),
               color = "red", linewidth = 1)  # 1:1 line