Lognormal model predicts extremely high and unlikely values in brms

Hello all,

I used brms to test if there are differences in microbiome diversity (Shannon index) between spiders collected in natural and urban environments. Since the two environments have unequal variances, I also enable sigma to vary among the two.

I ran the following lognormal model with the default priors for now :

form3 <- brmsformula(Shannon ~ 0 + sample_env + (1 | location),
                     sigma ~ 0 + sample_env)

fit3 <- brm(formula = form3,
            family = lognormal(),
            #prior = priors3,
            iter = 1500,
            warmup = 500,
            thin = 4,
            chains = 4,
            seed = 123,
            control = list(adapt_delta = 0.99),
            save_pars = save_pars(all = TRUE),
            sample_prior = TRUE,
            data = data)

which returns the following summary :

Family: lognormal 
  Links: mu = identity; sigma = log
Formula: Shannon ~ 0 + sample_env + (1 | location)
         sigma ~ 0 + sample_env
   Data: data (Number of observations: 14)
  Draws: 4 chains, each with iter = 1500; warmup = 500; thin = 4;
         total post-warmup draws = 1000

Group-Level Effects:
~location (Number of levels: 4)
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.27      1.17     0.07     4.06 1.00      713      751

Population-Level Effects:
                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sample_envdesert          -1.79      1.86    -5.27     1.95 1.00      930      865
sample_envurban           -0.41      1.18    -2.90     2.06 1.01      857      778
sigma_sample_envdesert     1.12      0.33     0.54     1.83 1.00      913      881
sigma_sample_envurban     -0.22      0.34    -0.78     0.54 1.01      729      780

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

When I plot the conditional effects, I see that the model predicts extremely large values, for instance, 329910.8 (i.e. exp(12.70658)) when my raw data does not go higher than 1.77. However, it does so only for a handful of simulated posterior values. I found that 72 posterior observations had values (back transformed with exp()) higher than 5, which is roughly 3.6% of the whole posterior values.

I also assume that it may explain why the mean diversity for each sampling location is quite high, which seems unlikely since the raw data is low as I have stated above (between 0 and 1.8).

I have attached the raw data, the fitted model object, and the data produced by conditional_effects().
data.Rdata (675 Bytes)
model.Rdata (1.8 MB)
effects_data.Rdata (727 Bytes)

I would highly appreciate if someone could provide some guidance in how to understand the problem. Would specifying some priors on the group means help?

Thank you very much

Maxime

Here is my session info

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)  
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.14.2 ggplot2_3.3.6     phyloseq_1.38.0   brms_2.17.0       Rcpp_1.0.9

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3       ellipsis_0.3.2         ggridges_0.5.3         markdown_1.1           XVector_0.34.0        
  [6] base64enc_0.1-3        farver_2.1.1           rstan_2.21.5           DT_0.23                fansi_1.0.3
 [11] mvtnorm_1.1-3          bridgesampling_1.1-2   codetools_0.2-18       splines_4.1.3          shinythemes_1.2.0
 [16] bayesplot_1.9.0        ade4_1.7-19            jsonlite_1.8.0         cluster_2.1.3          shiny_1.7.2
 [21] compiler_4.1.3         backports_1.4.1        Matrix_1.4-1           fastmap_1.1.0          cli_3.3.0
 [26] later_1.3.0            htmltools_0.5.3        prettyunits_1.1.1      tools_4.1.3            igraph_1.3.2
 [31] coda_0.19-4            gtable_0.3.0           glue_1.6.2             GenomeInfoDbData_1.2.7 reshape2_1.4.4        
 [36] dplyr_1.0.9            posterior_1.2.2        Biobase_2.54.0         vctrs_0.4.1            Biostrings_2.62.0
 [41] rhdf5filters_1.6.0     multtest_2.50.0        ape_5.6-2              nlme_3.1-158           iterators_1.0.14
 [46] crosstalk_1.2.0        tensorA_0.36.2         stringr_1.4.0          ps_1.7.1               mime_0.12
 [51] miniUI_0.1.1.1         lifecycle_1.0.1        renv_0.15.5            gtools_3.9.3           zlibbioc_1.40.0
 [56] MASS_7.3-58            zoo_1.8-10             scales_1.2.0           colourpicker_1.1.1     promises_1.2.0.1
 [61] Brobdingnag_1.2-7      parallel_4.1.3         biomformat_1.22.0      rhdf5_2.38.1           inline_0.3.19
 [66] shinystan_2.6.0        gridExtra_2.3          loo_2.5.1              StanHeaders_2.21.0-7   stringi_1.7.8
 [71] S4Vectors_0.32.4       dygraphs_1.1.1.6       foreach_1.5.2          checkmate_2.1.0        permute_0.9-7
 [76] BiocGenerics_0.40.0    pkgbuild_1.3.1         GenomeInfoDb_1.30.1    rlang_1.0.4            pkgconfig_2.0.3
 [81] matrixStats_0.62.0     bitops_1.0-7           distributional_0.3.0   lattice_0.20-45        purrr_0.3.4
 [86] Rhdf5lib_1.16.0        labeling_0.4.2         rstantools_2.2.0       htmlwidgets_1.5.4      processx_3.7.0
 [91] tidyselect_1.1.2       plyr_1.8.7             magrittr_2.0.3         R6_2.5.1               IRanges_2.28.0
 [96] generics_0.1.3         DBI_1.1.3              withr_2.5.0            pillar_1.8.0           mgcv_1.8-40
[101] xts_0.12.1             survival_3.3-1         abind_1.4-5            RCurl_1.98-1.7         tibble_3.1.7
[106] crayon_1.5.1           utf8_1.2.2             grid_4.1.3             callr_3.7.1            vegan_2.6-2
[111] threejs_0.3.3          digest_0.6.29          xtable_1.8-4           httpuv_1.6.5           RcppParallel_5.1.5
[116] stats4_4.1.3           munsell_0.5.0          shinyjs_2.1.0

Afternoon,

Thank you for all the details, code, and data! Can you print out what the default priors are in brms with the get_priors() command?

Hello @Ara_Winter,

Thank you for your answer. I use the function prior_summary and get this output :

prior_summary(fit)
                prior class             coef    group resp  dpar nlpar lb ub       source
               (flat)     b                                                       default
               (flat)     b sample_envdesert                                 (vectorized)
               (flat)     b  sample_envurban                                 (vectorized)
               (flat)     b                                sigma                  default
               (flat)     b sample_envdesert               sigma             (vectorized)
               (flat)     b  sample_envurban               sigma             (vectorized)
 student_t(3, 0, 2.5)    sd                                             0         default
 student_t(3, 0, 2.5)    sd                  location                   0    (vectorized)
 student_t(3, 0, 2.5)    sd        Intercept location                   0    (vectorized)

from which I assume are the default priors in brms.

Maxime

1 Like

Hi @Maxime, looking at the data I see that there is an extreme observation (sample_no = S1) which on the log-scale will be an outlier. To accommodate this observation, the conditional variation sigma will need to be really large, which will also imply very large observations (conditional normal distribution on log-scale) occasionally, which may be the source of the extreme predictions.

I tried this model:

brm(bf(log(Shannon) ~ 1 + sample_env, sigma ~ 1 + sample_env), data = spider_data)

And seemed to converge okay, but the left skew of this variable will make these difficult to model I think (at least with a symmetric distribution).

1 Like