Lognormal model predicts extremely high and unlikely values in brms

Hello all,

I used brms to test if there are differences in microbiome diversity (Shannon index) between spiders collected in natural and urban environments. Since the two environments have unequal variances, I also enable sigma to vary among the two.

I ran the following lognormal model with the default priors for now :

form3 <- brmsformula(Shannon ~ 0 + sample_env + (1 | location),
                     sigma ~ 0 + sample_env)

fit3 <- brm(formula = form3,
            family = lognormal(),
            #prior = priors3,
            iter = 1500,
            warmup = 500,
            thin = 4,
            chains = 4,
            seed = 123,
            control = list(adapt_delta = 0.99),
            save_pars = save_pars(all = TRUE),
            sample_prior = TRUE,
            data = data)

which returns the following summary :

Family: lognormal 
  Links: mu = identity; sigma = log
Formula: Shannon ~ 0 + sample_env + (1 | location)
         sigma ~ 0 + sample_env
   Data: data (Number of observations: 14)
  Draws: 4 chains, each with iter = 1500; warmup = 500; thin = 4;
         total post-warmup draws = 1000

Group-Level Effects:
~location (Number of levels: 4)
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.27      1.17     0.07     4.06 1.00      713      751

Population-Level Effects:
                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sample_envdesert          -1.79      1.86    -5.27     1.95 1.00      930      865
sample_envurban           -0.41      1.18    -2.90     2.06 1.01      857      778
sigma_sample_envdesert     1.12      0.33     0.54     1.83 1.00      913      881
sigma_sample_envurban     -0.22      0.34    -0.78     0.54 1.01      729      780

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

When I plot the conditional effects, I see that the model predicts extremely large values, for instance, 329910.8 (i.e. exp(12.70658)) when my raw data does not go higher than 1.77. However, it does so only for a handful of simulated posterior values. I found that 72 posterior observations had values (back transformed with exp()) higher than 5, which is roughly 3.6% of the whole posterior values.

I also assume that it may explain why the mean diversity for each sampling location is quite high, which seems unlikely since the raw data is low as I have stated above (between 0 and 1.8).

I have attached the raw data, the fitted model object, and the data produced by conditional_effects().
data.Rdata (675 Bytes)
model.Rdata (1.8 MB)
effects_data.Rdata (727 Bytes)

I would highly appreciate if someone could provide some guidance in how to understand the problem. Would specifying some priors on the group means help?

Thank you very much

Maxime

Here is my session info

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)  
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.14.2 ggplot2_3.3.6     phyloseq_1.38.0   brms_2.17.0       Rcpp_1.0.9

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3       ellipsis_0.3.2         ggridges_0.5.3         markdown_1.1           XVector_0.34.0        
  [6] base64enc_0.1-3        farver_2.1.1           rstan_2.21.5           DT_0.23                fansi_1.0.3
 [11] mvtnorm_1.1-3          bridgesampling_1.1-2   codetools_0.2-18       splines_4.1.3          shinythemes_1.2.0
 [16] bayesplot_1.9.0        ade4_1.7-19            jsonlite_1.8.0         cluster_2.1.3          shiny_1.7.2
 [21] compiler_4.1.3         backports_1.4.1        Matrix_1.4-1           fastmap_1.1.0          cli_3.3.0
 [26] later_1.3.0            htmltools_0.5.3        prettyunits_1.1.1      tools_4.1.3            igraph_1.3.2
 [31] coda_0.19-4            gtable_0.3.0           glue_1.6.2             GenomeInfoDbData_1.2.7 reshape2_1.4.4        
 [36] dplyr_1.0.9            posterior_1.2.2        Biobase_2.54.0         vctrs_0.4.1            Biostrings_2.62.0
 [41] rhdf5filters_1.6.0     multtest_2.50.0        ape_5.6-2              nlme_3.1-158           iterators_1.0.14
 [46] crosstalk_1.2.0        tensorA_0.36.2         stringr_1.4.0          ps_1.7.1               mime_0.12
 [51] miniUI_0.1.1.1         lifecycle_1.0.1        renv_0.15.5            gtools_3.9.3           zlibbioc_1.40.0
 [56] MASS_7.3-58            zoo_1.8-10             scales_1.2.0           colourpicker_1.1.1     promises_1.2.0.1
 [61] Brobdingnag_1.2-7      parallel_4.1.3         biomformat_1.22.0      rhdf5_2.38.1           inline_0.3.19
 [66] shinystan_2.6.0        gridExtra_2.3          loo_2.5.1              StanHeaders_2.21.0-7   stringi_1.7.8
 [71] S4Vectors_0.32.4       dygraphs_1.1.1.6       foreach_1.5.2          checkmate_2.1.0        permute_0.9-7
 [76] BiocGenerics_0.40.0    pkgbuild_1.3.1         GenomeInfoDb_1.30.1    rlang_1.0.4            pkgconfig_2.0.3
 [81] matrixStats_0.62.0     bitops_1.0-7           distributional_0.3.0   lattice_0.20-45        purrr_0.3.4
 [86] Rhdf5lib_1.16.0        labeling_0.4.2         rstantools_2.2.0       htmlwidgets_1.5.4      processx_3.7.0
 [91] tidyselect_1.1.2       plyr_1.8.7             magrittr_2.0.3         R6_2.5.1               IRanges_2.28.0
 [96] generics_0.1.3         DBI_1.1.3              withr_2.5.0            pillar_1.8.0           mgcv_1.8-40
[101] xts_0.12.1             survival_3.3-1         abind_1.4-5            RCurl_1.98-1.7         tibble_3.1.7
[106] crayon_1.5.1           utf8_1.2.2             grid_4.1.3             callr_3.7.1            vegan_2.6-2
[111] threejs_0.3.3          digest_0.6.29          xtable_1.8-4           httpuv_1.6.5           RcppParallel_5.1.5
[116] stats4_4.1.3           munsell_0.5.0          shinyjs_2.1.0

Afternoon,

Thank you for all the details, code, and data! Can you print out what the default priors are in brms with the get_priors() command?

Hello @Ara_Winter,

Thank you for your answer. I use the function prior_summary and get this output :

prior_summary(fit)
                prior class             coef    group resp  dpar nlpar lb ub       source
               (flat)     b                                                       default
               (flat)     b sample_envdesert                                 (vectorized)
               (flat)     b  sample_envurban                                 (vectorized)
               (flat)     b                                sigma                  default
               (flat)     b sample_envdesert               sigma             (vectorized)
               (flat)     b  sample_envurban               sigma             (vectorized)
 student_t(3, 0, 2.5)    sd                                             0         default
 student_t(3, 0, 2.5)    sd                  location                   0    (vectorized)
 student_t(3, 0, 2.5)    sd        Intercept location                   0    (vectorized)

from which I assume are the default priors in brms.

Maxime

1 Like

Hi @Maxime, looking at the data I see that there is an extreme observation (sample_no = S1) which on the log-scale will be an outlier. To accommodate this observation, the conditional variation sigma will need to be really large, which will also imply very large observations (conditional normal distribution on log-scale) occasionally, which may be the source of the extreme predictions.

I tried this model:

brm(bf(log(Shannon) ~ 1 + sample_env, sigma ~ 1 + sample_env), data = spider_data)

And seemed to converge okay, but the left skew of this variable will make these difficult to model I think (at least with a symmetric distribution).

1 Like

Hi @AWoodward . Thank you very much for your answer. Just to be sure, the difference between the model you used and mine is that you log-transformed the response and thus assumed a Gaussian distribution for the model am I correct?

So from what I understand, this approach would be less sensible to the outliers? I can also investigate if the extreme value you refer to in sample S1 is a true outlier.

Thanks again :)

Maxime

It might be helpful to explore other response distributions in brms.
For example, using family=weibull provides more reasonable posterior predictive checks and conditional effects plots, than does lognormal (although there are still extreme values). I just used the default priors, but you could do better with sensible ones.
brm(Shannon ~ 0 + sample_env + (1|location), family=weibull, data=data, cores=4, control = list(adapt_delta = 0.99))


Ultimately you don’t have much data, and you have one value that is very very small compared to the others.
If you log transform first and use student instead of gaussian then you can also get more reasonable posterior predictive checks, but I am not sure if log transforming first is the best approach. I don’t know.

I don’t have enough experience with these types of response distributions to offer more, but you might look around at other response distributions, if that makes sense for whatever “Shannon index” is.

1 Like

@Maxime hi again, family = lognormal() and the log-transformation within the model formula should be equivalent, it’s just my usual habit to express it that way. I agree with @jd_c that a more flexible response distribution may be sensible here. Ultimately with such little data it would be difficult to make further progress on that. But I’d pay close attention to the observations and make sure that there are no technical errors in them. Wouldn’t extremely small values for Shannon index suggest practically no diversity (and probably therefore be quite uncertain)?

Hey @AWoodward. Yes indeed, low values suggest very low diversity. However, since the samples come from spiders, it may be biologically realistic since they don’t move much. I will dive into my analyses in a couple of weeks. Thanks again for the great insight!

Thank you @jd_c . I will try with your suggestions!