Comparing Gaussian, gamma, and lognormal versions of regression model

I am modeling the effect of a fixed factor group, a positive numeric covariate base, and a random factor site on a positive numeric (i.e. not a count) outcome. Here is the deidentified data

exemplarData.RData (3.5 KB)

The outcome is very skewed.

I saw online that gamma and lognormal regression are quite good for modeling positively skewed numeric data. In this excellent notebook Aki Vehtari demonstrates that when the outcome is a count, you can compare the performance of gaussian models to betabinomial via loo-cv. I was wondering if the same applies to Gaussian, gamma and lognormal regression: if you can compare these models with different famillies and link functions via loo-cv.

Here is the code. Note that gamma regression requires all values to be above zero so I simple added 0.000001 to the zeroes. Not sure if this is kosher (let me know).

library(brms)

load(file = "exemplarData.RData") # exemplarData

# gamma needs values above 0

exemplarData %<>%
  mutate(study = case_when(study == 0 ~ 0.000000000001,
                           TRUE ~ study))

################################
########### Gaussian ###########
################################


# run model
brm(formula = study ~ group + base + (1|site),
    data = exemplarData,
    family = gaussian,
    save_pars = save_pars(all=TRUE),
    seed = 1234,
    refresh = 0) -> fit_gauss

# add criterion to the object so we can compare models using cross-validation later
fit_gauss <- add_criterion(x = fit_gauss,
                           criterion="loo",
                           save_psis = TRUE,
                           moment_match = T,
                           reloo = T)

# loo, with refitting.
loo(fit_gauss,
    reloo = TRUE,
    save_psis = TRUE) -> loo_gauss

##############################
########### Gamma ############
##############################

# run model
brm(formula = study ~ group + base + (1|site),
    data = exemplarData, 
    family = Gamma(link = "log"),
    save_pars = save_pars(all=TRUE),
    seed = 1234,
    refresh = 0) -> fit_gamma

# add criterion
fit_gamma <- add_criterion(x = fit_gamma,
                           criterion="loo",
                           save_psis = TRUE,
                           moment_match = T,
                           reloo = T)

# loo with refitting
loo(fit_gamma,
    reloo = TRUE,
    save_psis = TRUE) -> loo_gamma

##############################
########### lognormal ########
##############################

# run model
brm(formula = study ~ group + base + (1|site),
    data = exemplarData,
    family = lognormal(),
    save_pars = save_pars(all=TRUE),
    seed = 1234,
    refresh = 0) -> fit_lognormal

# add criterion 
fit_lognormal <- add_criterion(x = fit_lognormal,
                               criterion="loo",
                               save_psis = TRUE,
                               moment_match = T,
                               reloo = T)

# loo with refitting. 
loo(fit_lognormal,
    reloo = TRUE,
    save_psis = TRUE) -> loo_lognormal

# now compare the three models
loo_compare(loo_gauss, loo_gamma, loo_lognormal)

# output
#            elpd_diff   se_diff
# fit_gamma        0.0       0.0 
# fit_gauss      -98.4      39.1 
# fit_lognormal -232.5      13.2

The output suggests the gamma is the superior model. Which brings me back to my question: can one use loo-cv to compare Gaussian to gamma to lognormal regression?

Help much appreciated as always.

Generally, as long as you are modeling y directly, not a transformation (e.g., \sqrt{y} or \log y) yes you can directly compare predictive densities (Cross-validation FAQ).

brms may automatically check this for you but I’m not familiar enough with the package to say for certain.

1 Like

Thank you @js592. Good point. I think this is where my concerns lie: i.e. that the identity link function in the Gaussian would invalidate comparison of it with the Gamma and lognormal, which have log link functions? And yes brms does warn you when models are incompatible (e.g in the example posted on Professor Vehtari’s notebook brms warns - incorrectly in that specific case - that the yhat of the gaussian and binomial models are incompatible), I just don’t know if it warns you in every situation.

Hopefully someone with more experience with brms can chime in. In the meantime, I would just simulate some simple data e.g., from a lognormal/gamma dist, fit intercept only models in brms, extract the posterior draws and manually compute the ELPD via e.g., dlnorm() and loo::elpd() then compare these numbers to brms.

1 Like

The problem discussed in @avehtari 's notebook has nothing to do with link functions or differences between continuous response families. It is entirely about the fact that probabilities and probability densities cannot be compared directly, a problem that arises only when comparing models with a discrete component to models with a continuous component. If you want to compare models of different continuous families (that all lack discrete components–so no hurdle models), this issue doesn’t exist. Confidently trust the loo-cv output, provided it passes the relevant diagnostics.

5 Likes