Possibly interesting (since I have heard that Stan will warn on improper posteriors, but perhaps that only applies when there is infinite mass extending to infinity).
parameters {
real<lower=0,upper=1> a;
}
model {
a ~ beta(1,0.25);
a ~ beta(1,0.25);
}
Inference for Stan model: beta.
4 chains, each with iter=10000; warmup=1000; thin=1;
post-warmup draws per chain=9000, total post-warmup draws=36000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
a 1.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 13836 1
lp__ 17.17 0.02 1.11 14.13 16.82 17.47 17.87 18.38 5062 1
Samples were drawn using NUTS(diag_e) at Wed Oct 23 17:55:28 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Warning messages:
1: There were 1133 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems
3: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
which would tell me something is amiss. That's precisely why @betanalpha is so stringent about divergences [and other diagnostics].
Edit. Can confirm that on my system I also get a "clean" run if I increase the maximum tree depth. If I decrease the step size by setting adapt_delta = 0.99 I get
Warning messages:
1: There were 3762 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 15. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems
3: The largest R-hat is 1.21, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
The "cleanness" of a run is pretty random, as can be seen from this treedepth graph where one chain has decided that even treedepth=20 isn't quite enough.
The upper corner shows the corresponding step size. Adaptation has no idea what step size the model needs.
Let's take a look at the log(1-a) versus lp__ graph.
Theoretically, all points should fall on a single smooth curve. The right side of the graph looks like we'd expect it to. On the left, the line breaks apart due to numerical errors. Here the floating point representation of a is inaccurate.
Then there's the weird vertical group hanging under the main line. I would've expected that when the sampler runs against the upper bound, the constraining transform overflows to a = 1 and the target becomes infinite. That would get flagged as a divergent transition. However, the transform is specifically made to avoid that and instead truncates it at a = 1 - 1e-15. So the sampler is free to move the unconstrained value beyond that point. There the constrained value and the likelihood are constant. The Jacobian determinant is still well-behaved.
tl;dr the posterior the sampler actually sees is proper, because of floating point problems.
I'm not sure what the question is here. The model specified in that Stan program is ill-posed because of the double application of the Beta density. Using the conjugacy of the Beta density one can show that multiplying the same density, with parameters \alpha and \beta, twice results in something proportional to another Beta density but with parameters \alpha' = 2 \alpha - 1 and \beta' = 2 \beta - 1. For \alpha = 1 and \beta = 1/4 this gives \alpha' = 1 and \beta' = -1/2 which results in an ill-defined density because both parameters have to be positive for the density to be integrable.
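Spelling that product out explicitly:
\text{beta}(a \mid \alpha, \beta)^2 \propto \left[ a^{\alpha - 1} (1 - a)^{\beta - 1} \right]^2 = a^{(2\alpha - 1) - 1} (1 - a)^{(2\beta - 1) - 1},
so \alpha' = 2\alpha - 1 and \beta' = 2\beta - 1. With \alpha = 1 and \beta = 1/4 the exponent on (1 - a) is -3/2, which is not integrable at a = 1.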
You can confirm this by explicitly transforming the specified density to the unconstrained space where Stan is actually sampling. When you are worried about sampling problems and you have constrained parameters then always work out what's going on in the unconstrained space. And explore not the constrained samples but rather the unconstrained samples in the file specified with the diagnostic_file argument.
Here the constraining map is
a = \frac{1}{1+e^{-x}}
and the push-forward density function satisfies
\pi(x) \propto \frac{e^x}{\sqrt{1 + e^{x}}}.
This diverges as x \rightarrow \infty.
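For reference, the change of variables behind that expression, writing the doubled beta kernel as (1 - a)^{-3/2} and using da/dx = a(1 - a), is
\pi(x) \propto (1 - a)^{-3/2} \, a (1 - a) = \frac{a}{\sqrt{1 - a}} = \frac{e^x}{\sqrt{1 + e^x}} \sim e^{x/2} \quad \text{as } x \rightarrow \infty.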
The resulting Markov chains will then end up in a race to infinity, hence the max_treedepth warnings. You'll also see this if you look at the values of logit(a) versus iteration. The chains might also move at different speeds, which is why you might sometimes see \hat{R} warnings.
The fact that the target density is ill-posed means that the identity function is no longer square integrable and you can't trust the Markov chain Monte Carlo estimators. A corollary is that the variance is also ill-defined and the metric adaptation will yield chaotic results.
If you suspect that a target distribution is improper then the best way to confirm is to look at the samples, and to do so on the unconstrained space.
I think the OP was acutely aware the "posterior" in question is improper. The point is that the diagnostics appear to fail to flag bad behaviour. Of course they don't have to, but it'd be nice to get some divergences or other, stronger, warnings. I take tree depth exceedances as a sign of tough geometry, not necessarily pathological. I think @nhuurre's answer clarifies what's going on here.
I'm not sure what additional/stronger warnings you would expect.
The treedepth indicates very long trajectories, which could be due to small stepsizes or an ill-posed target density, and following up by examining the unconstrained samples here one can confirm that it is the latter. Indeed the treedepth limit was introduced in the first place to guard against trajectories moving towards infinity and never triggering the trajectory termination criterion.
For additional evidence, the quantiles for a are all pushed up against 1, which indicates problems. The bulk ESS warning likewise suggests that the mean is ill-defined, which is further consistent with an ill-posed target. One can examine the inverse metrics learned in each chain's adaptation period as well; as I noted earlier, inconsistent values add additional evidence for ill-defined moments.
The major indication of an ill-posed target density is the sampler trying to reach infinity. The secondary indication is mean and variance estimator behavior consistent with functions no longer being integrable.
Whether or not you see divergences as the sampler pushes towards infinity depends on the target density. Sampling from a uniform density, for example, will never diverge because the gradients vanish everywhere. When constraints are involved you have to consider potential overflow/underflow in the constraining map and the boundaries of the target density.
In particular, because the sampling happens on the unconstrained scale there is nothing keeping a at "the smallest floating point number below 1" as was suggested earlier. Firstly the bounds are inclusive, not exclusive. Secondly the logistic transformation responsible for mapping back to the constrained space can easily overflow to 1 (or the closest floating point equivalent).
Here the beta densities are well-defined at a = 1 so there isn't any problem with the latter, and the logistic function (and its gradient) in Stan is explicitly written to be robust to the x \rightarrow \infty limit. Consequently divergences won't arise as the sampler tries to reach infinity and you are left with the other diagnostics.
The point is not what I would expect, but rather that sometimes diagnostics fail. This is a moderately interesting instance of that. You don't have to agree; you may very well think it's silly.
Good point. But I don't think you can expect practitioners to look at samples in the unconstrained space. Hence, the absence of divergences or other stronger warnings could lead someone less experienced to think things were fine, or at least that increasing tree depth would solve the problem [doing that will randomly "solve" the problem].
Disagree. If I do:
parameters {
real<lower=0,upper=1> a;
}
model {
a ~ beta(9999, 1);
}
I also get all quantiles pushing against 1.
Except you don't always get that warning, as the OP found and I confirmed in different versions.
So, to sum up, as this is the last I'll say on this matter: both the OP and I thought this was a weird little example of a target that's clearly ill-posed and yet might escape detection by diagnostics. And I think this is something that could be emphasized more. I may very well be misrepresenting here, but I think this was kind of @anon75146577's point here as well: "diagnostics can and in fact often do fail". Please do let me know if I misunderstood.
Yes, it's an interesting example. Under best practices we should be using proper priors, but it's also good to be able to run software under edge cases and see what happens. No diagnostics will catch every problem, so it can be useful for us to develop a menagerie of difficult examples.
I definitely agree with that point; it's something I stress whenever I teach. See, for example, the discussion in Robust Statistical Workflow with RStan.
One of the challenges with a project this big is that many people will have different opinions, and because many parts of the project are essentially independent (for example the diagnostics reported in RStan were chosen entirely by the RStan developers) it's nigh impossible to get consistency in a diagnostic workflow. I mean there are even prominent members of the community that have trashed divergences, and if you can't agree on those then what can you agree on?
Anyways, I dug into this example a little bit more deeply to understand what's going on. On one hand it's interesting, on the other the conclusion might not be satisfying to you.
One thing to keep in mind is that there is a hierarchy of tasks involved in a computational fit, with higher levels depending on lower levels to work.
The lowest level is floating point arithmetic: if we can't evaluate mathematical functions accurately then we can't trust anything going forwards. Sometimes breakdowns of floating point arithmetic manifest in higher-level diagnostic failures, but not always. Sometimes the breakdowns manifest as behavior that looks entirely normal to everything that follows.
Similarly if the symplectic integrator in HMC breaks down then we can't trust that the numerical trajectories explore as much as they should. If we can't diagnose that breakdown directly, however, the resulting samples might not be obviously wrong.
Lastly there are function-specific problems. Even if the Markov chains are exploring as they should, MCMC estimators will be ill-defined if the functions whose expectations we're taking aren't square integrable (and even higher moments are needed if we estimate the variance in order to estimate the MCMC standard error).
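For context, the usual MCMC central limit theorem behind that statement reports estimates as
\hat{f} \approx \mathbb{E}_\pi[f] \pm \sqrt{\text{Var}_\pi[f] / \text{ESS}},
which already requires \mathbb{E}_\pi[f^2] < \infty, and estimating \text{Var}_\pi[f] itself in order to report that standard error requires higher moments still.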
So with that in mind what might we expect here?
We know that the target density is improper so we might expect that the Markov chains will diffuse towards positive infinity. If we run multiple chains and they diffuse at different rates then this might manifest in large $\hat{R}$s.
Even more, our Hamiltonian trajectories should coherently zoom straight towards positive infinity. These trajectories should trigger the max_treedepth warnings.
At the same time the identity function is no longer integrable so the mean estimators shouldn't be well-behaved. Bulk or tail ESS or k-hat diagnostics might trigger if the samplers explore sufficiently far.
As you note you sometimes see these effects, but not consistently. I appreciate the fear this incites when you know that something is going wrong. Why aren't we getting the behavior that we want? Because the floating point is failing in a weird way.
Remember that if the floating point arithmetic fails then we have no guarantees that everything else will behave in the way that it should, including all of the higher-order diagnostics. To track down what's going on we can run with the diagnostic_file option and investigate the unconstrained states and the corresponding gradients.
First let's look at the unconstrained values of logit(a):
As logit(a) gets bigger, and a gets really close to 1, the gradient starts to waver a bit around -0.5 (directing the trajectories towards positive infinity) until, after the wall we saw above, it jumps up to 1 (directing the trajectories back towards zero).
If you work out the gradient analytically then you'll see that it should be -0.5 all the way to infinity, which would keep pushing the numerical trajectories and the resulting Markov chains out there. The immediate suspect is floating point arithmetic.
To confirm that suspicion you have to go a bit deeper into the target density function and how the autodiff is computed. On the unconstrained scale the gradient decomposes into the gradient of the log constrained density plus the gradient of the log Jacobian determinant. Their expected analytic values? -1.5 and 1, respectively. Add them up and you get -0.5; keep only the latter and you get 1, like we see for logit(a) above 37 or so. Ultimately, right around where logit(a) hits 37, the logistic function that maps it back to the constrained space overflows to 1, and the log1m in the log target density underflows to minus infinity. In the beta_lpdf implementation this results in only the gradient of the log Jacobian determinant being returned.
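Writing the two pieces of that decomposition out explicitly (my arithmetic, using the same sign convention as the quoted numbers, i.e. gradients of the negative log target), the doubled beta kernel contributes \tfrac{3}{2} \log(1 - a) and the Jacobian contributes -\log a - \log(1 - a), so
\frac{d}{dx}\left[ \tfrac{3}{2} \log(1 - a) \right] = -\tfrac{3}{2} a \rightarrow -\tfrac{3}{2}, \qquad \frac{d}{dx}\left[ -\log a - \log(1 - a) \right] = 2a - 1 \rightarrow 1,
which sum to -\tfrac{1}{2}; drop the first term and only the Jacobian limit of 1 remains.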
So, in this case the way the floating point arithmetic fails is equivalent to a working evaluation of a log density function that is well-posed, with the density rapidly falling to zero right after logit(a) = 37 and pushing the sampler back towards 0. That's why you don't see any of the other diagnostics failing: from what they see, everything is consistent with a working fit!
This is a pretty insidious failure of the floating point evaluation of the constraining transform here, but to be fair it's being pushed to its limits. The real problem is that the failure is quite exceptional and there's no real way to diagnose it other than to somehow run the floating point at higher precision, which isn't currently feasible with the Stan Math Library.
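For anyone who wants to poke at that threshold directly, here is a throwaway probe (mine, not from the thread; run it with algorithm=fixed_param, and the exact crossover may vary by system) that checks where the naive logistic formula rounds to exactly 1 and log1m underflows. Stan's own constraining transform has its own guard, as discussed further down, so this only probes the raw formula.
generated quantities {
  // probe the naive inverse logit, 1 / (1 + exp(-x)), near the quoted
  // threshold of logit(a) = 37; once it rounds to exactly 1, log1m
  // returns negative infinity
  for (i in 30:40) {
    real x = i;
    real a_naive = 1 / (1 + exp(-x));
    print("x = ", x, "  1 - a = ", 1 - a_naive, "  log1m(a) = ", log1m(a_naive));
  }
}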
That said I would consider the empirical quantiles all concentrating to the same value, with the standard deviation estimated to be zero, a worrying sign, and worth investigation on the unconstrained scale. The beta(9999, 1) suggestion with which you countered actually has nontrivial standard deviation, but honestly it's extreme enough that I would also have investigated it. In this case the fit only needs to reach out to logit(a) = 20 or so, for which the floating point math is fine. I added four more 9s to get beta(99999999, 1) and logit(a) only reached 26, where the floating point was still sufficiently stable.
Is looking at the unconstrained space and gradients advanced? Yes. But so are the intricacies of floating point arithmetic which underpins everything we do.
@betanalpha do you know if the contents of the diagnostic file are documented publicly anywhere? I didn't find it in the CmdStan manual (only that the argument diagnostic_file exists). I can open an issue for that somewhere and help contribute doc, but if it's used in multiple interfaces (we have it in RStan but also now CmdStanR/Py) perhaps it should go in the Stan users guide somewhere. Or I'm fine with all of them pointing to CmdStan if it's documented there.
Good call: the diagnostic_file has always kind of been a hidden feature. I thought it was documented at one point long ago but that might have been lost as the docs were restructured. One challenge is that in CmdStan, RStan, and PyStan the diagnostic file gets dumped to a local CSV with the same structure as the CmdStan CSV. For CmdStan that makes documenting the structure straightforward: it just replaces the constrained values with unconstrained values and adds momenta and gradient components with the p_ and g_ prefixes. For RStan and PyStan you'd have to first introduce the CmdStan structure or change some I/O plumbing in those interfaces to read the values into the fit object so that they can be accessed directly by the user.
That's what I thought too but if you look at the relevant code you'll see the transformation does check and (attempt to) correct for overflow. If you plot the constraining transform you get this:
The jumps are twice as big as they need to be, which I guess implies some round-to-even fudging in my CPU.
Given that @betanalpha didn't know about this, I assume that code is what it looks like: a poorly thought-out kludge that isn't needed and doesn't really fix whatever it was intended to fix.
The beta(1, -0.5) density and its derivative are infinite at a = 1. That's not well-defined at all. A model with an explicit inv_logit instead of a parameter constraint gets thousands of divergent transitions. And, it turns out, the step size adaptation is more stable, leading to a faster run with no max_treedepth problems.
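Presumably something along these lines; this is my reconstruction of that variant, with the log Jacobian added by hand so the target matches the constrained version:
parameters {
  real x;
}
model {
  real a = inv_logit(x);
  // log Jacobian of the logistic map, added manually; log1m(a) becomes
  // negative infinity once inv_logit(x) rounds to 1, which the sampler
  // reports as divergent transitions
  target += log(a) + log1m(a);
  a ~ beta(1, 0.25);
  a ~ beta(1, 0.25);
}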
Take for example the nonidentifiable pair of correlated parameters.
parameters {
real x;
real y;
}
model {
x + y ~ std_normal();
}
This is improper but gives no warnings other than tree depth exceedance and vanishing ESS. A tree depth warning typically signals geometry that looks nonidentifiable and/or improper.
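To spell out the impropriety: the target only constrains the sum, so
\pi(x, y) \propto \exp\!\left( -\tfrac{1}{2} (x + y)^2 \right),
which is constant along the x - y direction and therefore integrates to infinity.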
Stan has a rare error that plainly tells you the posterior is improper:
Exception initializing step size.
Posterior is improper. Please check your model.
"Rare" meaning I don't think it's ever possible to get this error unless the posterior is completely flat. Even a small bump can allow step size initialization to succeed.
parameters {
real x;
}
model {
target += log_sum_exp(0, -square(x));
}
This model fails in a surprising way. The step size shoots up (obviously) and the "race to infinity" goes on until the inverse metric blows up. Without a metric the energy is always undefined and therefore looks conserved even when the integrator sends the sample point to NaN-land. Thus there are no divergences. The u-turn criterion thinks NaN-land is a u-turn and stops all trajectories nicely at treedepth 1. The new proposed point is never accepted because it has accept_prob = NaN. The sampler can no longer move but sees no problems.
Effective sample size and r-hat are of course atrocious once the sampling finishes. If the sampling finishes. When run with default settings the metric fails during the penultimate metric adaptation window, which means there's a full window with a frozen sampler. At the end of the last window, when the metric is updated to (presumably?) all zeros, the sampler somehow throws itself into an infinite loop.
Just wanted to say that I am overwhelmed by the insights and energy that @nhuurre and @betanalpha have put into this, but I have to say this thread is starting to look a bit like a contest of proving the other side wrong. I find this unfortunate and kindly ask you to try to shift the tone. I hope the ultimate goal of everybody here is to understand what is happening, learn and/or educate others, and possibly find some way to improve Stan along the way. If you agree, please try to make sure this is explicitly visible from your posts.
Thanks and sorry for being preachy! You can obviously disagree with my assessment.
Probably due to the fused multiply-add at the end of the function. I don't think 1 has an exact double precision representation so the multiply will skew the values a bit.
I was not aware of this code and I agree that it is questionable, although I recommend not trying to assign intent to the developer who originally wrote this code. My guess is that it was meant to fix Bernoulli density evaluations in logistic regression models back when the priors were hyper diffuse and overflow happened often.
It's worth creating an issue on stan-dev:math to get feedback on removing it.
Yes, at the boundary of the constrained space but not for any a away from that boundary. Stan doesn't fit using the density on the constrained space; it fits on the unconstrained space using the log density. Mapping to the unconstrained space pushes the infinite gradient all the way to positive infinity and brings with it a log Jacobian determinant which actually cancels part of the infinity to leave a finite gradient. Again one can work out the exact gradient of the unconstrained log density here and show that it should be equal to -0.5 asymptotically.
Yes, with improper densities being a subset of non-identifiable densities.
This error comes from the initialization code that tries to find a step size that results in a reasonable change of Hamiltonian energy after one step of the leapfrog integrator. If the target density is constant then this step size can be made arbitrarily large, so the initialization code throws an exception when the initial step size is adapted beyond 1e7.
As with the other diagnostics this is a sufficient but not necessary condition for pathology. If we see it we know that the target density is improper (the special kind of improper with constant density) but there's no guarantee that we will see it.
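For concreteness, the kind of completely flat target I would expect to trip that check is simply (a sketch; an empty model block adds nothing to the target, so the density is constant):
parameters {
  real x;
}
model {
  // nothing added to target: the density is constant everywhere, so no
  // finite step size produces a reasonable change in Hamiltonian energy
}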
Question: Should we throw an error when we round inverse logit to 1 in the inverse transforms back to a constrained space? (This is technically rounding, not overflow; floating point overflow typically produces exceptions or infinite values.) Now that everything's in the test framework, it's easy to change this edge-case behavior.
Suggestion: If there's a real model someone cares about running into arithmetic precision issues near 1, the thing to do is flip 0 and 1. You only have around 1e-14 precision near 1, whereas you have over 1e-300 precision near zero. That's why we have functions like log1p. So you'll find beta(1, 1e5) much better behaved than beta(1e5, 1). There's more trouble to be had when that low number goes below one.
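In other words, something like this orientation (a sketch) keeps the mass near 0, where the floating point has far more headroom:
parameters {
  real<lower=0, upper=1> a;
}
model {
  // mass piles up near 0 rather than near 1, so the precision problems
  // near 1 never come into play
  a ~ beta(1, 1e5);  // prefer this over beta(1e5, 1)
}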
I hope you don't take the following the same way!
The CmdStan docs have not been restructured yet. I think someone may have opened an issue about this, so I'd check before opening another one.
I used that example in the user's guide chapter on problematic posteriors! What you get is overflow in the values.
I think you also get near-overflow parameter estimates of 10e300.
Thanks for explaining that. I'm wondering if we should also somehow put in diagnostics for frozen samplers. MCMC is doing the right thing in terms of the stationary distribution and balance here, so it's not a bug other than that it feels like a bug to a user when Stan hangs. So we should probably monitor hangs and just stop the process automatically. I'm not entirely sure what thresholds of time or iterations that would require, though, as sometimes this will be OK and it'll only get stuck for a few iterations. Once it's been stuck for dozens, it's probably too late for error messages.
It's an even power of 2, so it has an exact representation because the base and exponent are both integers.
Which code is this? Is it in inv_logit or in the bernoulli? What about bernoulli_logit?
+1 for that. I think we should throw exceptions rather than round when things go bad at boundaries. A lot of our functions and distributions don't have well-thought-out edge conditions (not trying to offend anyone here; I wrote a lot of this stuff myself!). Nor do our functions behave consistently at boundaries at every level of autodiff yet. The scalar ones mostly do, but not the matrix functions.