Recently a number of both subtle and not-so-subtle issues were reported (partly by me) for multiple probability distributions:
https://github.com/stan-dev/math/issues/1495, https://github.com/stan-dev/math/issues/1496, https://github.com/stan-dev/math/issues/1498, https://github.com/stan-dev/math/issues/1531
This involves even the neg_binomial_2
, which I thought is very “basic” and is presumably used in thousands of deployed models. Now most of the issues I’ve seen would rarely completely invalidate a model, but I can see some of them causing divergences here and there (many of the issues could manifest as large discontinuities in density for extreme parameters).
This might indicate we need more/better unit tests for distributions. As an example, when fixing the initial issue I found in neg_binomial_2
, I discovered 4 additional issues and wrote or overhauled 7 tests to achieve some confidence that all code paths work (see https://github.com/stan-dev/math/pull/1497 for more details).
We rarely test the distribution for anything but “normal” values of parameters - extremes (large, close to 0, …) are tested at most for not throwing anything or not returning nonsense, but not for accuracy. Many tests actually involve only one parameter combination. Some distributions (e.g. inv. gamma) have no tests for derivatives.
Some relatively simple ways to strengthen the tests are IMHO:
- Iterate over a grid of parameter values spanning several order of magnitudes for all extant tests
- Compute values for both the result and the derivatives in Mathematica and test against those
- Whenever a discrete choice is made (as when delegating from Neg. binomial to Poisson), test continuity at the boundary
- Test derivatives from autodiff against derivatives computed with complex step.
I’ve tried to build a few tests for some distributions that are less “obvious”, and all of them pass, which is nice (see https://github.com/stan-dev/math/pull/1575). So maybe neg_binomial_2
was an outlier with unexpectedly tricky implementation while everything else is mostly OK?
I also understand that adding the tests might be a big effort and not necessarily worth it, hence this is very much up to discussion.