ADVI to avoid divergent transitions?


#1

Is there any circumstance in which using (full-rank) ADVI can help fit models that NUTS is having divergent-transition issues with? My math is nowhere near good enough to figure it out on my own, and I can’t remember any paper on ADVI discussing that question.

I work a lot with hierarchical GLMs, but right now most of my time working with Stan is spent on re-parameterizing models to avoid divergent transitions, and I would really like to cut down on that. Or maybe I should just suck it up and become better/faster at parameterizing models =)


#2

No. If NUTS is struggling then ADVI will do even worse.


#3

I recall someone was working on running NUTS and ADVI against a couple hundred (?) models and data sets from BUGS and other sources. Did that ever get published anywhere? arXiv?


#4

This paper, https://arxiv.org/abs/1603.00788, is the only one I know of that does something like that (though on only 10 models), but IIRC they never discuss models that are problematic to fit.

It’s a shame that ADVI doesn’t help, but that is at least consistent with my experiments. Divergent transitions are my nemeses…


#5

A lot of people seem to do model tests really slowly; if it’s really sucking up your time, you might want to post that as a question. I’m sure the rest of the Stan team would have good suggestions, especially for hierarchical GLMs.


#6

Just remember that if you were using another algorithm then you’d very likely be suffering from similar problems, only without the diagnostic. Divergences don’t hurt people, pathological models hurt people.


#7

Yeah, I know, which is why I don’t feel comfortable with moving away from Stan. But ideally, I would like to specify my hierarchical models in a centered way, and then Stan (or some other library) could automagically transform and use whatever parameterization is most efficient for the data I have, though I do realise that may not be possible.

But is there any ongoing work/plan/ideas in this area?

More specifically, my problems usually look like this: I want to fit some regression and I have a bunch of hierarchical covariates. For every covariate I test which parameterization works best, or, if I sequentially add covariates, I get a bunch of models and for each I have to make sure its parameterization works. Usually I go through centered -> non-centered (Section 26.6 in the manual) -> hard sum-to-zero (Section 8.7 in the manual) to mitigate divergent transitions. Changing adapt_delta very seldom helps. And this is what takes time…
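For reference, the equivalence that the non-centered trick exploits can be sketched outside Stan entirely. This toy pure-Python example (my own illustration, not code from this thread) shows that drawing theta ~ Normal(mu, tau) directly (centered) and drawing theta_raw ~ Normal(0, 1) and setting theta = mu + tau * theta_raw (non-centered) target the same distribution; the sampler just sees very different geometry in the two cases, which is why non-centering helps when tau is weakly informed.

```python
# Sketch of the non-centered reparameterization (pure Python, not Stan code):
# theta ~ Normal(mu, tau) is distributionally the same as
# theta = mu + tau * theta_raw with theta_raw ~ Normal(0, 1).
# The sampler then explores theta_raw, whose geometry does not depend on tau.
import random
import statistics

random.seed(1)
mu, tau = 2.0, 0.5
n = 200_000

# Centered: sample theta directly from Normal(mu, tau).
centered = [random.gauss(mu, tau) for _ in range(n)]

# Non-centered: sample a standard normal, then shift and scale.
noncentered = [mu + tau * random.gauss(0.0, 1.0) for _ in range(n)]

# Both parameterizations produce draws with the same mean and spread.
print(round(statistics.mean(centered), 2), round(statistics.mean(noncentered), 2))
print(round(statistics.stdev(centered), 2), round(statistics.stdev(noncentered), 2))
```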

Is this a good workflow, or are there more efficient ways to work?
And what do I do when I still have divergent transitions after going through this process? I guess the answer is very model specific, but in general I have interpreted it to mean that the model cannot be fit with the current data, and that I must look at changing which covariates I’m trying to fit or which priors I’m using.

(Sorry if this got long and off-topic; as you probably can tell, I’m not a “real” statistician but rather come from a design/product development background.)


#8

jonsjoberg http://discourse.mc-stan.org/u/jonsjoberg
May 9

> Yeah, I know, which is why I don’t feel comfortable with moving away from Stan. But ideally, I would like to specify my hierarchical models in a centered way, and then Stan (or some other library) could automagically transform and use whatever parameterization is most efficient for the data I have, but I do realise that it may not be possible.

It’s possible, we just aren’t there yet. Know anybody who wants to contribute? :)

> But is there any ongoing work/plan/ideas in this area?

> More specifically my problems are usually: I want to fit some regression and I have a bunch of hierarchical covariates. Now for every covariate I test which parameterization works best, or if I sequentially add covariates I get a bunch of models, for each of which I have to make sure the parameterization works. Usually I go through centered -> non-centered (Section 26.6 in the manual) -> hard sum-to-zero (Section 8.7 in the manual) to mitigate divergent transitions.

You shouldn’t need to test all of these. Use non-centered unless you have plenty of observations in all groups; even then, non-centered works fine.

Think about identifiability first; it comes up in specific contexts. If you think the issue will come up, just code for it to begin with.

Also, priors in these models are critical. You won’t know the implications of your priors without simulation in a complex hierarchical model. Check that simulation from the weak prior yields reasonable values for the parameters. Seriously, check!
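That prior check can be sketched with plain Python (a hypothetical toy regression of my own, not any model from this thread): draw parameters from the priors, simulate outcomes, and verify they land in a plausible range before touching the real data.

```python
# A minimal prior-predictive check (hypothetical model, pure-Python sketch):
# draw parameters from the priors, simulate data from them, and inspect
# whether the simulated outcomes are plausible before fitting anything.
import random

random.seed(7)

def prior_predictive_draw(x):
    # Hypothetical weak priors for a simple regression.
    alpha = random.gauss(0.0, 1.0)
    beta = random.gauss(0.0, 1.0)
    sigma = abs(random.gauss(0.0, 1.0))  # half-normal prior on the scale
    return [random.gauss(alpha + beta * xi, sigma) for xi in x]

x = [0.1 * i for i in range(10)]
sims = [prior_predictive_draw(x) for _ in range(1000)]

# Eyeball the range of simulated outcomes: if your outcome is, say, a
# reaction time in seconds, prior draws in the millions signal a bad prior.
lo = min(min(s) for s in sims)
hi = max(max(s) for s in sims)
print(lo, hi)
```

The same idea carries over to Stan directly: comment out (or condition away) the likelihood and sample, then look at the draws of the parameters and of simulated data.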

> Changing adapt_delta very seldom helps. And this is what takes time…

You should be able to do relatively short runs to check this stuff: a few dozen iterations at most to see where the stepsize ends up, and if that looks good you can do a longer run to check for divergences.

> Is this a good workflow, or are there more efficient ways to work? And what do I do when I still have divergent transitions after going through this process?

If you are using the gamma CDF or incomplete gamma functions, or the beta binomial, a few math library calculations weren’t/aren’t as good as they should be. I have some improvements in that should make the beta binomial better, and a branch that makes gamma models easier to fit. It’s mostly numerical inaccuracy that messes with adaptation.

> I guess the answer to that is very model specific, but in general I have interpreted that as: with the current data it is not possible to fit the model, and I must look at changing what covariates I’m trying to fit or what priors I’m using.

Sometimes it also means we have a problem to fix, so don’t be afraid to ask questions on the list, or just make a reproducible example and file an issue. Stan should be able to fit hierarchical GLMs.

Sometimes it also means your model doesn’t fit your data in a really bad way. Check that too.

> (Sorry if this got long and off-topic, and as you probably can tell I’m not a “real” statistician, but rather come from a design/product development background)

In the age of machine learning and data science I think you’re doing fine.


#9

Got any specific tips on how to think about identifiability? I know I should think about it, but I’m not sure how to properly diagnose whether a model is identifiable.

The way I’ve been checking priors in this context is by running the model without conditioning on the data; is that the right way of doing it? And what does it mean if I get divergent transitions when doing that?


#10

I don’t think I have an answer to this that’s not workshop-length. In general if you have x=f(a,b) and you only have information about x without any independent information about a and b it’s easy to get in trouble. Then you generally want to model x itself and make a or b transformed parameters, but that’s not always straightforward.
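As a toy illustration of that trap (my own example with f(a, b) = a + b, not from the thread): if the likelihood only sees the sum, then parameter pairs with the same sum are indistinguishable, and the posterior has a ridge along a + b = const that only the priors can control.

```python
# Toy non-identifiability demo: the likelihood depends on a and b only
# through x = a + b, so (a, b) = (1, 2) and (0, 3) give identical values.
import math
import random

def log_lik(a, b, data, sigma=1.0):
    # Gaussian log-likelihood whose mean depends only on the sum a + b.
    mu = a + b
    return sum(
        -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        for y in data
    )

random.seed(3)
data = [random.gauss(3.0, 1.0) for _ in range(50)]

# Two very different (a, b) pairs with the same sum: the data cannot
# tell them apart, no matter how many observations you collect.
print(log_lik(1.0, 2.0, data) == log_lik(0.0, 3.0, data))
```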

Running the model without data is a good way to figure out the priors, but only if you can check that outputs (parameters and simulated data) fall into reasonable ranges, with reasonable distributions for what your system might produce. Sometimes you have a specific system that will have parameters with interpretable meaning (reproductive rate, tensile strength, something like that) and you can easily check, otherwise if you’re making a generic model for a certain type of data it’s harder.


#11

Thanks for all the input. I find questions like these the hardest to wrap my head around, partly because for most other issues there is a lot of good information on how to solve or work around them. So I guess that’s an indication that it’s not easy to summarise into something general (or maybe it’s just me having these issues).


#12

If you’re using relatively straightforward hierarchical GLMs, you can use the rstanarm package, which has excellent parameterizations for most linear-model applications. Some of its defaults are quite sophisticated and often faster than what I can come up with, plus you can use the easy modeling syntax from the R package lme4.

In particular, rstanarm can handle any model lme4 can, which includes all random-intercept/random-slope models, plus you can fit some more exotic models if you figure out the lme4 syntax.


#13

Thanks for the tip, I didn’t even think about rstanarm; it seems to generate very efficient models for many of my problems =)