Can non-convergence of a model be taken as proof of its inferiority compared to another converged model?

On this forum I have often read people who know a lot about modeling say that a large number of divergent transitions means the model may be mis-specified. I am wondering how solid that rule is as grounds for judging one model to be inferior to another.

I recently added a cubic polynomial interaction term to a model (four chains, 1000 post-warmup iterations each). The Rhats for all coefficients ranged from 1.46 to 1.60 and the bulk ESS values were 7-8. There was also a warning about 1000 divergent transitions after warmup.

The previous model, containing only linear and quadratic interaction terms, on the other hand performed marvelously: all Rhats 1.00, all ESS over 2000, great posterior predictive checks, etc.

Usually when comparing models I use k-fold or LOO cross-validation and compare via the loo_compare() function in brms. But with such poor diagnostics I can’t really k-fold or LOO the cubic model.
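For reference, the comparison workflow I mean is roughly the following sketch (fit_quadratic and fit_cubic are placeholder names for the two brmsfit objects):

# PSIS-LOO for each fit, then rank them
loo_quadratic <- loo(fit_quadratic)
loo_cubic     <- loo(fit_cubic)
loo_compare(loo_quadratic, loo_cubic)

# or k-fold CV when the PSIS-LOO diagnostics (Pareto k) are poor
kfold_quadratic <- kfold(fit_quadratic, K = 10)
kfold_cubic     <- kfold(fit_cubic, K = 10)
loo_compare(kfold_quadratic, kfold_cubic)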

Is it sufficient, legitimate, and most importantly defensible to say “the cubic model failed to converge and hence was judged to be inferior to the quadratic”?

Alternatively, should I try to bandage up the cubic model to the point where it can be legitimately compared to the quadratic using CV (e.g. by increasing adapt_delta, as in the sketch below)?
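A minimal sketch of what I have in mind, assuming fit_cubic is the existing brmsfit (the adapt_delta and iter values are only illustrative):

# Refit with more careful sampler settings; update() can reuse the compiled
# model when only the sampling arguments change
fit_cubic_strict <- update(
  fit_cubic,
  control = list(adapt_delta = 0.99, max_treedepth = 12),
  iter = 4000,
  warmup = 1000,
  cores = 4
)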


This is often the root cause, but not always. Sometimes you can write a perfectly good model for the data and Stan just struggles. This is the case with centered vs. non-centered parameterizations of hierarchical varying effects—both models are in some sense the same, so it’s not the model’s fault. It’s just that Stan won’t be able to fit the centered parameterization because of the challenging geometry.
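For concreteness, the two parameterizations of a varying effect can be written as follows (this is the standard formulation, not anything specific to this thread). Centered:

\theta_j \sim \textrm{normal}(\mu, \tau).

Non-centered:

\theta_j = \mu + \tau \, \tilde{\theta}_j, \qquad \tilde{\theta}_j \sim \textrm{normal}(0, 1).

Both define the same distribution for \theta_j, but the second presents much friendlier geometry to the sampler when \tau is small.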

This is the kind of thing that can easily lead to non-identifiability of the likelihood, depending on how the inputs to the polynomial look in practice. That can then produce a large number of divergences if the predictors are nearly collinear.
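As a rough illustration (not the poster's data), with a predictor on [0, 1] the raw polynomial terms are already highly correlated:

# Sketch: correlations among raw polynomial terms for an evenly spaced predictor on [0, 1]
x <- seq(0, 1, length.out = 100)
round(cor(cbind(x, x^2, x^3)), 2)
# pairwise correlations come out around 0.92-0.99

# Orthogonal polynomials, e.g. poly(x, 3), are one standard way to reduce this collinearity.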

I think that might be a hard sell to a reviewer. The problem is that you don’t know if it’s the way you coded the model or that the model itself doesn’t work. If you can share your model, we may be able to help debug what’s going wrong.


Thanks so much @Bob Carpenter, this is why I love this site. What aspects of the model would you need? How about just the summary output as a start?

Just FYI, it is a longitudinal hierarchical model examining the effects of amphetamine use at the start of treatment on subsequent amphetamine use in the following year of treatment. outcome is the number of days of amphetamine use in the previous 28 days; this is measured at the start of treatment and then at least once (but sometimes more often) during treatment. set is the number of days on which it was possible to have used amphetamine at each measurement point (always 28). yearsFromStart is a numeric variable indicating the time each measurement was taken, in years since the start of treatment, with all values between 0 and 1. ats_baseline is the number of days of amphetamine use in the 28 days prior to starting treatment.

I have been working through the basic approach laid out by Aki Vehtari here and have recently posted related questions, including here.

 Family: zero_inflated_beta_binomial 
  Links: mu = logit; phi = identity; zi = identity 
Formula: outcome | trials(set) ~ ats_baseline * yearsFromStart + ats_baseline * I(yearsFromStart^2) + ats_baseline * I(yearsFromStart^3) + (1 | encID) 
   Data: workingDF (Number of observations: 2309) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~encID (Number of levels: 695) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.68      0.16     1.45     1.98 1.54        7       25

Regression Coefficients:
                               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                         -3.76      1.94    -5.20    -0.36 1.59        7       11
ats_baseline                       0.13      0.41    -0.62     0.41 1.60        7       11
yearsFromStart                     2.39      1.32     0.58     4.65 1.51        7       13
IyearsFromStartE2                 -3.07      3.30    -9.14     1.26 1.46        8       18
IyearsFromStartE3                  1.75      1.64    -1.62     5.29 1.57     1080     1414
ats_baseline:yearsFromStart       -1.16      0.64    -1.86    -0.08 1.59        7       11
ats_baseline:IyearsFromStartE2     2.72      0.68     1.87     4.03 1.44        8       23
ats_baseline:IyearsFromStartE3    -1.27      1.00    -2.61     0.35 1.59        7       11

Further Distributional Parameters:
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi     3.19      1.02     1.54     4.51 1.58        7       12
zi      0.18      0.19     0.03     0.51 1.56        7       14

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning messages:
1: Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors. 
2: There were 1000 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 

Sorry—I didn’t see this follow-up.

For what?

Formula: 
outcome | trials(set) ~ ats_baseline * yearsFromStart 
                        + ats_baseline * I(yearsFromStart^2)
                        + ats_baseline * I(yearsFromStart^3)
                        + (1 | encID) 

The problem is that you need a lot of data to fit more and more complicated models reliably. In this case, the cubic model is strictly more expressive than the model with only the linear and quadratic terms: if you set the cubic coefficient to zero, it reduces to the simpler model. But you need a lot of data to be sure that coefficient goes to zero, so in practice the more complicated model can perform worse. That’s probably because the prior on the cubic effect isn’t concentrating enough around zero.

Shouldn’t that speed up sampling? đŸ„ (Someone said there were zero emojis used on our site and this seemed like a good excuse to change that.)

No worries @Bob Carpenter. I’ll take any help I can get. "For what?": enough information for you to advise me on how to improve this model. I just wondered if you’d need loo-ecdf or any of those diagnostic plots. Doesn’t sound like you do.

What do you mean by "if you set the coefficient to zero"? Do you mean set the prior on the cubic term to zero? I realised I probably should have included the model itself:

fit_zibetabinomial_cubic_ats <- brm(formula = outcome | trials(set) ~ ats_baseline*yearsFromStart + ats_baseline*I(yearsFromStart^2) + ats_baseline*I(yearsFromStart^3) + (1 | encID),
                                data = workingDF,
                                family = zero_inflated_beta_binomial(),
                                prior = c(prior(normal(0, 3),
                                                class = Intercept),
                                          prior(normal(0, 4),
                                                class = b),
                                          prior(cauchy(0, 3),
                                                class = sd)),
                                save_pars = save_pars(all=TRUE),
                                seed = 1234,
                                refresh = 0,
                                cores = 4)

Pretty generic prior settings, I realise. I could go more specific for different coefficients, as in the sketch below, but haven’t had to up to this stage (i.e. through all the iterations of Gaussian, Binomial, Beta-Binomial, Zero-Inflated Beta-Binomial, and Zero-Inflated Beta-Binomial with a quadratic term).
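For example, something like the following (a sketch only; the coefficient names should be checked against get_prior() for the actual model, and the scale of 1 on the cubic terms is just a guess):

# Coefficient-specific priors in brms via set_prior(); names copied from the summary above
prior_cubic <- c(
  set_prior("normal(0, 3)", class = "Intercept"),
  set_prior("normal(0, 4)", class = "b"),
  set_prior("normal(0, 1)", class = "b", coef = "IyearsFromStartE3"),
  set_prior("normal(0, 1)", class = "b", coef = "ats_baseline:IyearsFromStartE3"),
  set_prior("cauchy(0, 3)", class = "sd")
)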

Not much I can do about the amount of data I have. Can you suggest a better prior for the cubic, assuming that’s what you mean by setting coefficients to zero?

p.s. Thanks for your help
p.p.s. More puns please, and emojis

I generally pay more attention to posterior predictive checks, as those will show you where the model is failing to capture aspects of your data.
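In brms that’s pp_check(); a minimal sketch, with fit as a placeholder for the fitted model:

# Density overlay of replicated vs. observed outcomes
pp_check(fit, ndraws = 100)

# For a count outcome, bar plots are often easier to read
pp_check(fit, type = "bars", ndraws = 100)

# Check the proportion of zeros, which matters for a zero-inflated model
prop_zero <- function(y) mean(y == 0)
pp_check(fit, type = "stat", stat = "prop_zero")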

I mean that if you have a model that looks like

y_n \sim \textrm{normal}(\alpha + \beta_1 x_n + \beta_2 x_n^2 + \beta_3 x_n^3, \sigma),

then you can set \beta_3 = 0 and get the model

y_n \sim \textrm{normal}(\alpha + \beta_1 x_n + \beta_2 x_n^2, \sigma).

The problem is that with Bayesian inference, you’re unlikely to infer a value of \beta_3 near zero unless you have a lot of data and/or a prior that heavily concentrates around zero.

Not really. This is the first problem that Andrew Gelman gave to me and Matt Hoffman when we started working for him in 2011. We found it too hard, so we built Stan and NUTS instead :-). Andrew calls this problem the “unfolding flower”: finding a prior that lets you fit coefficients to non-zero values when you have enough data. It’s the kind of thing for which frequentists use the lasso (L1-regularized regression), because it allows exact zero values. In Bayesian stats, putting positive probability (not just density) on an exactly zero value requires a spike-and-slab prior (a mixture of a point mass at zero and a normal, say). You can code those in Stan for a few variables with marginalization, but the marginalization is exponential in the number of variables that are zero-inflated.
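Written out, a spike-and-slab prior on, say, the cubic coefficient is a two-component mixture (\pi and \tau here are just notation for this sketch):

\beta_3 \sim \pi \, \delta_0 + (1 - \pi) \, \textrm{normal}(0, \tau),

where \delta_0 is a point mass at zero, so \beta_3 = 0 gets probability \pi rather than merely positive density.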

I would avoid those Cauchy priors—they’re consistent with values of 10K, etc., that are unlikely to show up. Usually not a problem if you have a lot of data.
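As a sketch of a lighter-tailed alternative for the group-level sd (this particular choice isn’t prescribed in the thread; brms restricts sd priors to positive values automatically):

# Half-normal or exponential priors keep the group-level sd away from implausibly huge values
prior(normal(0, 2), class = sd)
# or
prior(exponential(1), class = sd)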

In general, beta-binomial and other “overdispersed” models are hard to fit, and zero-inflated ones can be even harder. The reason is that there are two ways to explain a larger value (i.e., give it higher density or mass in this case): a higher mean in the beta-binomial or more dispersion (lower values of \alpha and \beta in the beta-binomial). Whenever that happens, there are immediate identification problems, as there will be a lot of correlation between the scales of \alpha and \beta and the error scale. I don’t know what brms uses, but we’ve found it helpful to parameterize the beta and beta-binomial in terms of the mean \alpha / (\alpha + \beta) and the total count \alpha + \beta. It cuts out the prior correlation between \alpha and \beta.
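Written out (using \kappa just as a label for the total count in this sketch):

\mu = \frac{\alpha}{\alpha + \beta}, \qquad \kappa = \alpha + \beta, \qquad \textrm{so} \qquad \alpha = \mu \kappa, \quad \beta = (1 - \mu) \kappa,

which lets you put priors directly on the mean \mu and the concentration \kappa instead of on \alpha and \beta.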

OK, a lot of what you said went way over my head. I don’t think I have the wherewithal to fit the reparameterised beta-binomial in the way you describe. Perhaps a few years ago, when I was more immersed in manually coding Stan rather than using shells like brms, but it seems those skills decay if you don’t exercise them regularly.

But in a way what you say gets back to the thrust of my original post. Using model comparison techniques like LOO-CV gives you a way of comparing models that is somewhat objective and sequential, and where you know when to stop: when the next model doesn’t predict anything any better, you stay with the simpler model. But that process is predicated on being able to compare models, which you can’t do if you can’t fit the model in the first place. It’s a hard limit on model refinement. And it occurred to me reading your latest post that yet another hard limit is one’s own knowledge: I don’t know how to fit the reparameterised model you mentioned. I might have to start practising some radical acceptance of my own limitations, and of the fact that they may limit what I can do. Anyway, thank you for your answers. They’re very much appreciated.

Happy to unpack a bit if you can be more specific.

You have to be careful here, because models only make sense when coupled with data. That is, if you collect ten times as much data, you can typically fit a bigger, more refined model that will outperform the best model you could fit with 1/10th of the data.

Thanks for replying @Bob Carpenter. Sounds like another hard limit I hadn’t thought of: I can’t change the amount of data I have. It’s not small (~2000 observations spread across ~600 id numbers), but a lot of those observations are zeros, and in any case I can’t collect more.

I will try to elaborate on what I don’t understand a bit later in the day when my brain is working.