ZINB model convergence issues when predicting shape parameter

  • Operating System: Windows 10, Intel Xeon core
  • brms Version: 2.10.0

Hello: this is a follow-up to my original post (Priors on predicted shape parameter(s) in mixed zinb model). There I received helpful feedback regarding the interpretation of the shape variance parameter in ZINB models. Here I’m looking for assistance in properly specifying the model.

I am attempting a zero-inflated negative binomial mixed-effect model, and I am also interested in predicting the shape (dispersion) parameter (so as to assess whether variance is different across years in my model). I realize variance parameters are tricky and can require a lot of data, but if there is a way, I would like to predict it.

I am predicting a count response: cars95, annual reproductive success
Predictors include:
Main effects:
age (discrete) - most central to my question
year (discrete) - most central to my question
mort (0/1) - sparse data, not my primary focus
bwev (0/1) - sparse data, not my primary focus
prev sire (0/1) - sparse data, not my primary focus
age x year interaction - most central to my question
Group-level effects:
age|id (where id is categorical for individual across years; this random intercept is necessary to account for multiple samples from the same individual across years)

Thus, the model I would like to fit is:

model {
cars95 ~ age*year + mort + bwev + prevsire + (age|id), shape ~ year

This model fails to converge with default priors:

                  prior     class      coef group resp  dpar nlpar bound
1                               b                                       
2                               b       age                             
3                               b  age:year                             
4                               b     bwev1                             
5                               b     mort1                             
6                               b prevsire1                             
7                               b      year                             
8  student_t(3, -2, 10) Intercept                                       
9   student_t(3, 0, 10) Intercept                      shape            
10 lkj_corr_cholesky(1)         L                                       
11                              L              id                       
12  student_t(3, 0, 10)        sd                                       
13  student_t(3, 0, 10)        sd                      shape            
14                             sd              id                       
15                             sd       age    id                       
16                             sd Intercept    id                       
17                             sd            year      shape            
18                             sd Intercept  year      shape            
19           beta(1, 1)        zi                                                                  

I’m thrown the warnings:

Warning messages:
1: There were 861 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 59139 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
3: Examine the pairs() plot to diagnose sampling problems
 
4: The largest R-hat is 4.13, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Note, however, if I do not predict shape, the model converges fine. There is a lot of uncertainty around the shape parameter when no priors are given, so here I tweaked the default prior on shape (from the default gamma(0.01, 0.01) to gamma(1,1)) though I admittedly did not follow a rigorous method for doing so):

Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = identity 
Formula: cars95 ~ age * year + mort + bwev + prevsire + (age | id) 
   Data: c95data (Number of observations: 220) 
Samples: 4 chains, each with iter = 15000; warmup = 5000; thin = 2;
         total post-warmup samples = 20000

Group-Level Effects: 
~id (Number of levels: 60) 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)          1.82      0.97     0.15     3.86 1.00     4798     9806
sd(age)                0.54      0.22     0.14     1.02 1.00     4365     6019
cor(Intercept,age)    -0.63      0.37    -0.96     0.52 1.00     6200     6898

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -4.04      1.77    -7.93    -0.98 1.00     7970    13616
age           1.44      0.61     0.33     2.72 1.00     7962    14286
year          0.10      0.23    -0.35     0.58 1.00    14321    16159
mort1         0.58      0.90    -1.21     2.36 1.00    16313    17632
bwev1        -0.97      0.89    -2.81     0.70 1.00    16957    16902
prevsire1     0.07      0.50    -0.91     1.05 1.00     8872    14221
age:year     -0.15      0.08    -0.30    -0.00 1.00     8792    14912

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     1.68      0.95     0.54     4.14 1.00     9684    14440
zi        0.12      0.09     0.01     0.34 1.00    14419    16585

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

I’ve conducted loo tests and a number of post predictive checks on the the converged model without a predicted shape. Many, but not all, check out. Problems that arise with the initial model are a higher p_loo count than the number of parameters in my model, and a somehwat poor-performing ppc_dens_overlay plot:

Loo output:

Computed from 20000 by 220 log-likelihood matrix

         Estimate   SE
elpd_loo   -172.6 18.5
p_loo        34.2  6.5
looic       345.1 37.1
------
Monte Carlo SE of elpd_loo is 0.4.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     197   89.5%   13        
 (0.5, 0.7]   (ok)        23   10.5%   121       
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Post-predictive checks:
PPC - bars:


PPC - ecdf overlay:

PPC- stat, mean

PPC - stat, mean and SD

PPC - ppc_check()

My questions for this post:

  1. Might including more informative priors for a model where I predict the shape variance parameter help convergence?
  2. If so, what priors should I assign - particularly for shape-specific parameters (the b, sd, and Intercept parameters associated with dpar=shape)
  3. Do you have any other suggestions/thoughts to improve model convergence beyond improved prior specification?
  4. Would you agree with my assessment that the model NOT predicting shape (for which I’ve included the model output and post-predictive checks above) ‘fits’ decently well - even considering the high p_loo value relative to my number of parameters?

I’ve included a reproducible example together with a link to data:

library(dplyr)
library(brms)
library(ggeffects)
library(GGally)
library(rstantools)
library(rstanarm)
library(loo)
library(bayesplot)

c95data <- read.csv("ARS95KAc.csv")
c95data$id <- as.factor(c95data$id)
c95data$year <- as.integer(c95data$year) #year (2-8) as integer NOT categorical
c95data$mort <- as.factor(c95data$mort)
c95data$prevsire <- as.factor(c95data$prevsire)
c95data$bwev <- as.factor(c95data$bwev)

#MODELS
#NOT predicting shape
mod1 <- brms(cars95 ~ age*year + mort + bwev + prevsire + (1|id) , 
     data=c95data,
     family=zero_inflated_negbinomial(), 
     prior= set_prior("gamma(1,1)"),
     iter=20000, warmup=5000, thin=2, chains=4,
     control = list(adapt_delta = 0.99), 
     cores=6)


##Predicting shape -- DOES NOT CONVERGE
mod2_nopri <- brms(bf(cars95 ~ age*year + mort + bwev + prevsire + (1|id) , 
     shape ~ year), 
     data=c95data,
     family=zero_inflated_negbinomial(), 
     iter=20000, warmup=5000, thin=2, chains=4,
     control = list(adapt_delta = 0.99), 
     cores=6)

#LOO
loo1 <- loo(mod1, reloo=T, cores=6)

##PPCs
#post predictive checks
yrep <- posterior_predict(mod1)
y <- c95data$cars95
bayesplot::ppc_stat(y, yrep[1:50, ], stat="mean")
brms::pp_check(mod1)
brms::pp_check(mod1, type="stat_2d", stat=c("mean", "sd"), nsamples=30)
brms::pp_check(mod1, type="ecdf_overlay", discrete=T, nsamples=30)
brms::pp_check(mod1, type="bars", nsamples=30)

DATA: https://drive.google.com/file/d/1T51Cp0m8qHtsf2fZcpqchMQJo6dIdQn1/view?usp=sharing

Thank you in advance to everyone/anyone who gives my struggles and questions some time! It is so appreciated.
El

Hello!
As a follow-up, I’d really like to boil my questions down to just 2:

  1. Any suggestions for getting a model wherein I predict the shape parameters to converge? I have tried fitting a shape-only model:
fitphi.nb <- brm(bf(cars95 ~  1 + (1|id), shape ~ year),  data=c95data,
                family=zero_inflated_negbinomial(),
                iter=9000, warmup=1000, chains=4,
                control = list(adapt_delta = 0.99), cores=6)

but am likewise struggling with fit. I’m not sure if the variance term is just too data-hungry for my data set, or if setting more informative priors might help.

  1. Does anyone have input on how my model without predicted shape (for which model output, loo output, and ppcs are shown) fits? Particularly considering that loo results seem to suggest poor-ish fit while ppcs and model output seem to suggest decent-ish fit.

@martinmodrak or @avehtari – do you have any suggestions? (I gather you’ve volunteered assistance on these posts in the past and have much expertise!)

Thank you!

Definitely. Current uniform priors are bad. Even weakly informative priors can have big difference as now diagnostics indicate bad non-identifiability.

Start with just some proper priors. Even better, simulate from the priors are plot histograms for location and shape.

Start with better priors.

Instead of mean stat, test e.g. 90% and 95% quantiles and max, see Regression and Other Stories: Roaches

I appreciate you included reproducible example, but I didn’t yet have time to run it.

1 Like

Thanks for the response @avehtari!

Can you give an example proper prior for any of the shape parameters?
And I presume for simulating priors, you mean a prior predictive check? I encountered that in some blog posts on this site, but am struggling:

                family=zero_inflated_negbinomial(), 
                iter=20000, warmup=5000, chains=4, sample_prior = "only", seed= 1234,
                control = list(adapt_delta = 0.99), cores=6)

 Error: Sampling from priors is not possible as some parameters have no proper priors. Error occured for class 'b'. 

I get the same error with a ‘shape-only’ model:

ppc_fitphi.nb <- brm(bf(cars95 ~ 1 + (1|id), shape ~ year),  data=c95data,
                family=zero_inflated_negbinomial(), 
                iter=20000, warmup=5000, chains=4, sample_prior = "only", seed= 1234,
                control = list(adapt_delta = 0.99), cores=6)

 Error: Sampling from priors is not possible as some parameters have no proper priors. Error occured for class 'b_shape'.

Finally, I’ve tried your quantile suggestion for the post-predictive checks on my converged model without predicting shape.
My plots at 90% and 95% are oddly sparse compared to your vignette, but predictions fall seem to fall in an okay place. Thoughts on the sparsity - is it okay?

90%:

ppc_stat(y=c95data$cars95, yrep=posterior_predict(fit3.11.nb), stat=function(y) quantile(y, probs=0.90))

95%:

ppc_stat(y=c95data$cars95, yrep=posterior_predict(fit3.11.nb), stat=function(y) quantile(y, probs=0.95))

99%:

ppc_stat(y=c95data$cars95, yrep=posterior_predict(fit3.11.nb), stat=function(y) quantile(y, probs=0.99))

max (apologies for the extended axis, I can’t seem to narrow it):

ppc_stat(y=c95data$cars95, yrep=posterior_predict(fit3.11.nb), stat=max, binwidth=0.15, xlim(0,50))

Continued thanks!

Proper prior means that it’s integral is finite. Integral of uniform over real values or over non-negative values is infinite. Start the experiments with any wide proper prior, e.g. normal(0, something), which is automatically half-normal for non-negative constraint variables. However now that you showed additional PPC plots, I think the shape of negbinomial is weakly identifiable.

As it says, you need to have proper priors for all parameters.

Sparsity is due to discreteness of the outcome. What is the maximum outcome value in the data? To me it looks like it is suspiciously small. 99% of y is around 7, but it’s not possible to the see max y value in the last plot. When I now re-checked previous PPC plot, it seems max of y is 7. Is there natural maximum for y?

It seems there is not enough data to learn much more, but experimenting with proper priors and prior predictive checks is good to thing to practice anyway.

1 Like

Thanks @avehtari

By “weakly identifiable”, do you mean that it is unlikely that I’ll be able to get a model to converge where I predict shape? What about the plots tells you this?

Yes, the max. observed response in my data is 8 - and the natural ceiling is likely less than 20. Is this problematic for the ZINB distribution? It seems that there are predicted values from my converged model that exceed the realized ceiling (sometimes by a lot), but it’s very rare so I assumed it was ‘okay’.

Okay - I will try setting some ‘wide’ normal priors. I suppose I’m afraid of being too informative as well as not informative enough - with little knowledge on how to gauge it either way.

Thanks for your continued feedback and suggestions! It is so appreciated.

Here is an example of putting some moderately (whatever that means!) tight priors on some of the parameters. Note that I am not claiming that these are correct, just that you can get the model to run without divergences or tree-depth problems with this set of priors.

Some comments:
I did not convert variables to factors. I do not think that is necessary (Stan not knowing anything about factors).

The loo results are not the best with p_loo being high and 9 data points with the pareto k above 0.7

I can talk myself into accepting priors of this type by arguing that all of your data are integers and both the distribution average (mu) and the shape have a log link. If a parameters moves by one and its coefficient is one, then the value of mu or shape will go up by a factor of e. Particularly with mu, since your data only go up to 8 and you are handling many of the zeros with zi, that is a huge shift. I am less convinced that a N(0,3) prior on the intercept of shape is called for. I didn’t try other priors.

c95dataNoFactors <- read.csv("c:/users/fjcc/Documents/R/Play/ARS95KAc.csv")
priorMy2 <- c(prior(normal(0, 1), class = "b"),
              prior(normal(0, 1), class = "b", coef = "year", dpar = "shape"),
              prior(normal(0,3), class = "Intercept", dpar = "shape"))

ShapeNoFact4 <- brm(bf(cars95 ~ age*year + mort + bwev + prevsire + (age|id),
                       shape ~ year),
                    control = list(adapt_delta = 0.99), 
                    prior = priorMy2,
                    data=c95dataNoFactors,
                    family=zero_inflated_negbinomial())
summary(ShapeNoFact4)
Family: zero_inflated_negbinomial 
Links: mu = log; shape = log; zi = identity 
Formula: cars95 ~ age * year + mort + bwev + prevsire + (age | id) 
shape ~ year
Data: c95dataNoFactors (Number of observations: 220) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
# 
Group-Level Effects: 
   ~id (Number of levels: 60) 
                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)          1.60      0.80     0.13     3.19   1.00      729      765
sd(age)                0.48      0.21     0.10     0.91   1.00      501      495
cor(Intercept,age)    -0.58      0.39    -0.95     0.57   1.00      801     1164
 
Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept          -3.15      1.41    -6.05    -0.64 1.00     1287     2299
shape_Intercept     5.32      3.48    -0.27    12.95 1.00     3634     2178
age                 1.04      0.45     0.19     1.95 1.00     1281     1780
year                0.00      0.21    -0.40     0.43 1.00     2296     2586
mort                0.26      0.64    -1.01     1.58 1.00     5607     3085
bwev               -0.54      0.63    -1.72     0.69 1.00     4154     3112
prevsire            0.19      0.40    -0.58     0.94 1.00     2090     2700
age:year           -0.10      0.06    -0.22     0.02 1.00     1229     1932
shape_year         -0.66      0.52    -1.69     0.41 1.00     3532     2076
 
Family Specific Parameters: 
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
zi     0.16      0.10     0.01     0.38 1.00     2857     2276

> loo(ShapeNoFact4)
 
Computed from 4000 by 220 log-likelihood matrix
 
           Estimate   SE
elpd_loo   -168.7   17.7
p_loo        32.1    5.6
looic       337.4   35.3
 ------
  Monte Carlo SE of elpd_loo is NA.
 
 Pareto k diagnostic values:
                         Count  Pct.    Min. n_eff
 (-Inf, 0.5]   (good)     188   85.5%   422       
 (0.5, 0.7]   (ok)        23   10.5%   180       
 (0.7, 1]     (bad)        8    3.6%   17        
 (1, Inf)     (very bad)   1    0.5%   3

Weak idenitifiability may cause convergence problems, but even if the sampling is perfect, it means that you are not learning from the data and the posterior for shape parameter is equal to prior (and if your prior is improper, convergence is not possible).

Quite small data.

Whether tails going to unfeasible values is ok or not depends on the goals of your inference. Since you have said that you are interested in the shape parameter, then you need to care what happens in the tail. To learn the shape parameter of NB, you would need to observations from the tail of NB. In your fit the tail of NB goes beyond the maximum observed value and thus there is no information about the shape. You could try beta-binomial with upper limit instead. Then you would zero posterior predictive mass for unfeasible values. It’s possible that changing to more realistic observation model, doesn’t change much what you learn about the coefficients.

It gets easier after you have some understanding how even arbitrarily chosen priors change the prior predictive distribution.

1 Like