# Simple Example of an Improper Posterior Without Warnings from Stan

Possibly interesting (since I have heard that Stan will warn on improper posteriors, but perhaps that only applies when there is infinite mass extending to infinity).

parameters {
real<lower=0,upper=1> a;
}
model {
a ~ beta(1,0.25);
a ~ beta(1,0.25);
}

Inference for Stan model: beta.
4 chains, each with iter=10000; warmup=1000; thin=1;
post-warmup draws per chain=9000, total post-warmup draws=36000.

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
a     1.00    0.00 0.00  1.00  1.00  1.00  1.00  1.00 13836    1
lp__ 17.17    0.02 1.11 14.13 16.82 17.47 17.87 18.38  5062    1

Samples were drawn using NUTS(diag_e) at Wed Oct 23 17:55:28 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

3 Likes

I ran

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = 4)

model_string <- '
parameters {
real<lower=0,upper=1> a;
}
model {
a ~ beta(1,0.25);
a ~ beta(1,0.25);
}
'
comp <- stan_model(model_code = model_string)

chain <- sampling(comp)

chain



and got

Warning messages:
1: There were 1133 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems

3: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess


which would tell me something is amiss. That’s precisely why @betanalpha is so stringent about divergences [and other diagnostics].

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.04

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
[1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C               LC_TIME=pt_BR.UTF-8
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2         rstudioapi_0.10    magrittr_1.5       tidyselect_0.2.5   munsell_0.5.0
[6] colorspace_1.4-1   R6_2.4.0           rlang_0.4.0        dplyr_0.8.3        tools_3.5.2
[11] parallel_3.5.2     pkgbuild_1.0.5     grid_3.5.2         gtable_0.3.0       loo_2.1.0
[16] cli_1.1.0          withr_2.1.2        matrixStats_0.55.0 lazyeval_0.2.2     assertthat_0.2.1
[21] tibble_2.1.3       crayon_1.3.4       processx_3.4.1     gridExtra_2.3      purrr_0.3.2
[26] callr_3.3.2        codetools_0.2-16   ps_1.3.0           inline_0.3.15      glue_1.3.1
[31] compiler_3.5.2     pillar_1.4.2       prettyunits_1.0.2  scales_1.0.0       stats4_3.5.2
[36] pkgconfig_2.0.3

1 Like

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1

library(rstan)
options(mc.cores = parallel::detectCores(),max.print=10000)
rstan_options(auto_write = TRUE)
data <- list()
fit <- stan(
file = “beta.stan”,
data = data,
chains = 4,
warmup = 1000,
iter = 10000
)
print(fit)

Do you get any divergences?

None, but I had 2 transitions that exceeded the maximum tree depth, and after increasing to 15 I had a clean run.

Now that’s interesting. @bbbales2 @betanalpha ?

Edit. Can confirm that in my system I also get a “clean” run if I increase the maximum tree depth. If decrease the step size by making adapt_delta = 0.99 I get

Warning messages:
1: There were 3762 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 15. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems

3: The largest R-hat is 1.21, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

1 Like

The “cleanness” of a run is pretty random, as can been seen from this treedepth graph where one chain has decided that even treedepth=20 isn’t quite enough.

The upper corner shows the corresponding step size. Adaptation has no idea what step size the model needs.

Let’s take a look at log(1-a) versus lp__ graph.

Theoretically, all points should fall on a single smooth curve. The right side of the graph looks like we’d expect it to. On the left, the line breaks apart due to numerical errors. Here the floating point representation of a is inaccurate.
Then there’s the weird vertical group hanging under the main line. I would’ve expected that when the sampler runs against the upper bound the constraining transform overflows to a = 1 and target becomes infinite. That would get flagged as a divergent transition. However, the transform is specifically made to avoid that and instead truncates it at a = 1 - 1e-15. So the sampler is free to move the unconstrained value beyond that point. There the constrained value and the likelihood are constant. The Jacobian determinant is still well-behaved.

tl;dr the posterior is proper because of floating point problems.

6 Likes

I’m not sure what the question is here. The model specified in that Stan program is ill-posed because of the double application of the Beta density. Using the conjugacy of the Beta density one can show that multiplying the same density, with parameters \alpha and \beta, twice results in something proportional to another Beta density but with parameters \alpha' = 2 \alpha - 1 and \beta' = 2 \beta - 1. For \alpha = 1 and \beta = 1/4 this gives \alpha' = 1 and \beta' = -1/2 which results in an ill-defined density because both parameters have to be positive for the density to be integrable.

You can confirm this by explicitly transforming the specified density to the unconstrained space where Stan is actually sampling. When you are worried about sampling problems and you have constrained parameters then always work out what’s going on in the unconstrained space. And explore not the constrained samples but rather the unconstrained samples in the file specified with the diagnostic_file argument.

Here the constraining map is

a = \frac{1}{1+e^{-x}}

and the push forward density function would be

\pi(x) = \frac{e^x}{\sqrt{1 + e^{x}}}.

This diverges as x \rightarrow \infty.

The resulting Markov chains will then end up in a race to infinity, hence the max_treedepth warnings. You’ll also see this if you look at the values of logit(a) verses iteration. The chains might also move at different speeds, which is why you might sometimes see \hat{R} warnings.

The fact that the target density is ill-posed means that the identify function is no longer square integrable and you can’t trust the Markov chain Monte Carlo estimators. A corollary is that the variance is also ill-defined and the metric adaptation will yield chaotic results.

If you suspect that a target distribution is improper then the best way to confirm is to look at the samples, and to do so on the unconstrained space.

1 Like

I think the OP was acutely aware the “posterior” in question is improper. The point is that the diagnostics appear to fail to flag bad behaviour. Of course they don’t have to, but it’d be nice to get some divergences or other, stronger, warnings. I take tree depth exceedances as a sign of tough geometry, not necessarily pathological. I think @nhuurre’s answer clarifies what’s going on here.

2 Likes

I’m not sure what additional/stronger warnings you would expect.

The treedepth indicates very long trajectories which could be due to small stepsizes or an ill-posed target density, and following up by examining the unconstrained samples here one can confirm that it is the former. Indeed the treedepth limit was introduced in the first place to guard against trajectories moving towards infinity and never triggering the trajectory termination criterion.

For additional evidence the quantiles for a are all pushed up against 1 which indicates problems. The bulk ESS warning, however, suggests that the mean is ill-defined, which further supports with an ill-posed target. Once can examine the inverse metrics learned in each chain’s adaptation period as well – as I noted earlier inconsistent values add additional evidence for ill-defined moments.

The major indication of an ill-posed target density is the sampler trying to reach infinity. The secondary indication is ill-defined mean and variance behavior consistent due to functions no longer being integrable.

Whether or not you see divergences as the sampler pushes towards infinity depends on the target density. Sampling from a uniform density, for example, will never diverge because the gradients vanish everywhere. When constraints are involved you have to consider potential overflow/underflow the constraining map and the boundaries of the target density.

In particular because the sampling happens on the unconstrained scale there is nothing keeping a to "smallest floating point number below 1 as was suggested earlier. Firstly the bounds are inclusive, not exclusive. Secondly the logistic transformation responsible for mapping back to the constrained space can easily overflow to 1 (or the closest floating point equivalent).

Here the beta densities are well-defined at a = 1 so there isn’t any problem with the latter, and the logistics function (and its gradient) in Stan is explicitly written to be robust to the $x \rightarrow \infty limit. Consequently divergences won’t arise as the sampler tries to reach infinity and you are left with the other diagnostics. The point is not what I would expect, but rather that sometimes diagnostics fail. This is a moderately interesting instance of that. You don’t have to agree; you may very well think it’s silly. Good point. But I don’t think you can expect practitioners to look at samples in the unconstrained space. Hence, the absence of divergences or other stronger warnings could lead someone less experienced to think things were fine, or at least that increasing tree depth would solve the problem [doing that will randomly “solve” the problem]. Disagree. If I do: parameters { real<lower=0,upper=1> a; } model { a ~ beta(9999, 1); }  I also get all quantiles pushing against 1. Except you don’t always get that warning, as the OP found and I confirmed in different versions. So, to sum up as this is the last I’ll say on this matter: both OP and myself thought this was a weird little example of a target that’s clearly ill-posed and yet might escape detection by diagnostics. And I think this is something that could be emphasized more. I may very well be misrepresenting here, but I think this was kind of @Daniel_Simpson’s point here as well – “diagnostics can and in fact often do fail”. Please do let me know if I misunderstood. 2 Likes Max: Yes, it’s an interesting example. Under best practices we should be using proper priors, but it’s also good to be able to run software under edge cases and see what happens. No diagnostics will catch every problem, so it can be useful for us to develop a menagerie of difficult examples. 1 Like I definitely agree with that point – it’s something I stress whenever I teach. See, for example, the discussion in https://betanalpha.github.io/assets/case_studies/rstan_workflow.html. One of the challenges with a project this big is that many people will have different opinions, and because many parts of the project are essentially independent (for example the diagnostics reported in RStan were chosen entirely by the RStan developers) it’s nigh impossible to get consistency in a diagnostic workflow. I mean there are even prominent members of the community that have trashed divergences, and if you can’t agree on those when what can you agree on? Anyways, I dug into this example a little bit more deeply to understand what’s going on. On one hand it’s interesting, on the other the conclusion might not be satisfying to you. One thing to keep in mind is that there is a hierarchy of tasks involved with a computation fit, with higher levels depending on lower levels to work. The lowest level is floating point arithmetic – if we can’t evaluate mathematical functions accurately then we can’t trust anything going forwards. Sometimes breakdowns of floating point arithmetic manifest in higher-level diagnostic failures, but not always. Sometimes the breakdowns manifest a problem that looks entirely normal to the everything that follows. Similarly if the symplectic integrator in HMC breaks down then we can’t trust that the numerical trajectories explore as much as they should. If we can’t diagnose that breakdown directly, however, the resulting samples might not be obviously wrong. Lastly there are function-specific problems. Even if the Markov chains are exploring as they should, MCMC estimators will be ill-defined if the functions whose expectations we’re taking aren’t square integrable (technically cubed integrable if we estimate the variance to estimate the MCMC standard error). So with that in mind what might we expect here? We know that the target density is improper so we might expect that the Markov chains will diffuse towards positive infinity. If we run multiple chains and they diffuse at different rates then this might manifest in large$\hat{R}\$s.

Even more our Hamiltonian trajectories should coherently zoom straight towards that positive infinity. These trajectories should trigger the max_treedepth warnings.

At the same time the identify function is no longer integrable so the mean estimators shouldn’t be well-behaved. Bulk or tail ESS or k-hat diagnostics might trigger if the samplers explore sufficiently far.

As you note you sometimes see these effects, but not consistently. I appreciate the fear that this incites knowing that something is going wrong. Why aren’t we getting the behavior that we want? Because the floating point is failing in a weird way.

Remember that if the floating point arithmetic fails then we have no guarantees that everything else will behave in the way that it should, including all of the higher-order diagnostics. To track down what’s going on we can run with the diagnostic_file option and investigate the unconstrained states and the corresponding gradients.

First let’s look at the unconstrained values of logit(a),

Per our expectations these values should have been pushing towards positive infinity, but it looks like they’re hitting a wall around 37 or so.

Let’s see what’s happening to the gradient there.

As logit(a) gets bigger, and a gets really close to 1the gradient starts to waver a bit around -0.5 (directly the trajectories towards positive infinity) until after the wall we saw above where it jumps up to 1 (directing the trajectories back towards zero).

If you work out the gradient analytically then you’ll see that it should be -0.5 all the way to infinity, which would keep pushing the numerical trajectories and the resulting Markov chains out there. The immediate suspect is floating point arithmetic.

To confirm that suspicion you have to go a bit deeper into the target density function and how the autodiff is computed. On the unconstrained scale the gradient decomposes into the gradient of the log constrained density plus the log Jacobian determinant. Their expected analytic values? -1.5 and 1, respectively. Add them up and you get -0.5, keep only the latter and you get…1, like we see for logit(a) above 37 or so. Ultimately right around logit(a) hits 37 the logistic function that maps it back to the constrained space overflows to 1, and the log1m in the log target density underflows to minus infinity. In the beta_lpdf implementation this results in only the gradient of the log Jacobian determinant being returned.

So, in this case the way floating point arithmetic typically fails is equivalent to the working evaluation of a log density function that is well-posed, with the density rapidly falling to zero right after logit(a) = 37 and pushing the sampler back towards 0. That’s why you don’t see any of the other diagnostics failing – from what they see everything is consistent with a working fit!

This is a pretty insidious failure of the floating point evaluation of the constraining transform here, but to be fair it’s being pushed to its limits. The real problem is that the failure is quite exceptional and there’s no real way to diagnose it other than to somehow run the floating point at higher precision, which isn’t currently feasible with the Stan Math Library.

That said I would consider the empirical quantiles all concentrating to the same value with the standard deviation estimated to be zero a worrying sign, and worth investigation on the unconstrained scale. The beta(9999, 1) suggestion with which you countered actually has nontrivial standard deviation, but honestly its extreme enough that I would also have investigated it. In this case the fit only needs to reach out to logit(a) = 20 or so for which the floating point math is fine. I added four more 9s to get beta(99999999, 1) and logit(a) only reached 26 where the floating point was still sufficiently stable.

Is looking at the unconstrained space and gradients advanced? Yes. But so are the intricacies of floating point arithmetic which underpins everything we do.

6 Likes

@betanalpha do you know if the contents of the diagnostic file are documented publicly anywhere? I didn’t find it in the CmdStan manual (only that the argument diagnostic_file exists). I can open an issue for that somewhere and help contribute doc, but if it’s used in multiple interfaces (we have it in RStan but also now CmdStanR/Py) perhaps it should go in the Stan users guide somewhere. Or I’m fine with all of them pointing to CmdStan if it’s documented there.

4 Likes

Good call – the diagnostic_file has always kind of been a hidden feature. I thought it was documented at one point long ago but that might have been lost as the docs were restructured. One challenge is that in CmdStan, RStan, and PyStan the diagnostic file gets dumped to a local CSV with the same structure as the CmdStan CSV. For CmdStan that makes documenting the structure straightforward – it just replaces the constrained values with unconstrained values and adds momenta and gradient components with the p_ and g_ prefixes. For RStan and PyStan you’d have to first introduce the CmdStan structure or change some I/O plumbing in those interfaces to read in the values into the fit object so that they can be accessible directly by the user.

3 Likes

That’s what I thought too but if you look at the relevant code you’ll see the transformation does check and (attempt to) correct for overflow. If you plot the constraining transform you get this:

The jumps are twice as big they need to be which I guess implies some round-to-even fudging in my CPU.

Given that @betanalpha didn’t know about this, I assume that code is what it looks like: a poorly thought-out kludge that isn’t needed and doesn’t really fix whatever it was intended to fix.

The beta(1, -0.5) density and it’s derivative are infinite at a = 1. That’s not well-defined at all. A model with an explicit inv_logit instead of a parameter constraint gets thousands of divergent transitions. And, it turns out, the step size adaptation is more stable leading to a faster run with no max_treedepth problems.

Take for example the nonidentifiable pair of correlated parameters.

parameters {
real x;
real y;
} model {
x + y ~ std_normal();
}


This is improper but gives no warnings other than tree depth exceedance and vanishing ESS. Tree depth warning typically signals geometry that looks nonidenfiable and/or improper.

Stan has a rare error that plainly tells you the posterior is improper:

Exception initializing step size.


“Rare” meaning I don’t think it’s ever possible to get this error unless the posterior is completely flat. Even a small bump can allow step size initialization to succeed.

parameters {
real x;
} model {
target += log_sum_exp(0, -square(x));
}


This model fails in a surprising way. The step size shoots up–obviously–and the “race to infinity” goes on until the inverse metric blows up. Without a metric the energy is always undefined and therefore looks conserved even when the integrator sends the sample point to NaNland. Thus there are no divergencies. The u-turn criterion thinks NaNland is a u-turn and stops all trajectories nicely at treedepth 1. The new proposed point is never accepted because it has accept_prob = NaN. The sampler can no longer move but sees no problems.
Effective sample size and r-hat are of course atrocious once the sampling finishes. If the sampling finishes. When run with default settings the metric fails during the penultimate metric adaptation window which means there’s a full window with a frozen sampler. At the end of last window, when the metric is updated to (presumably?) all-zeros the sampler somehow throws itself into an infinite loop.

5 Likes

Just wanted to say that I am overwhelmed by the insights and energy given into this by @nhuurre and @betanalpha but I have to say this thread starts to look a bit like a contest of proving the other side wrong. I find this unfortunate and kindly ask you to try to shift the tone. I hope the ultimate goal of everybody here is to understand what is happening, learn and/or educate others and possibly find some way to improve Stan along the way. If you agree, please try to make sure this is explicitly visible from your posts.

Thanks and sorry for being preachy! You can obviously disagree with my assessment.

6 Likes

Probably due to the fused multiply-add at the end of the function. I don’t think 1 has an exact double precision representation so the multiply will skew the values a bit.

I was not aware of this code and I agree that it is questionable, although I recommend not trying to assign intent to the developer who originally wrote this code. My guess is that it was meant to fix Bernoulli density evaluations in logistic regression models back when the priors were hyper diffuse and overflow happened often.

It’s worth creating an issue on stan-dev:math to get feedback on removing it.

Yes, at the boundary of the constrained space but not for any a away from that boundary. Stan doesn’t fit using the density on the constrained space, it fits on the unconstrained space using the log density. Mapping to the unconstrained space pushes the infinite gradient all they way to positive infinity and brings with it a log Jacobian determinant which actually cancels part of the infinity to leave a finite gradient. Again one can work out the exact gradient of the unconstrained density here and show that it should be equal to -0.5 asymptotically.

Yes, with improper densities being a subset of non-identifiability densities.

This warning comes from the initialization code that tries to find a step size that results in a reasonable change of Hamiltonian energy after one step of the leapfrog integrator. If the target density is constant then this step size can be made arbitrarily large so the initialization code throw an exception when the initial step size is adapted beyond 1e7.

As with the other diagnostics this is a necessary but not sufficient condition for pathology. If we see it we know that the target density is improper (the special kind of improper with constant density) but there’s no guarantee that we will see if it.

4 Likes

Question: Should we throw an error message when we round inverse logit to 1 in the inverse transforms back to a constrained space? (This is technically rounding, not overflow—floating point overflow typically produces exceptions or infinite values.) Now that everything’s in the test framework, it’s easy to change this edge-case behavior.

Suggestion: If there’s a real model someone cares about running into arithmetic precision issues near 1, the thing to do is flip 0 and 1. You only have around 1e-14 precision near 1, whereas you have over 1e-300 precision near zero. That’s why we have functions like log1p. So you’ll find beta(1, 1e5) much better behaved than beta(1e5, 1). There’s more trouble to be had when that low number goes below one.

4 Likes

I hope you don’t take the following the same way!

The CmdStan docs have not been restructured yet. I think someone may have opened an issue about this, so I’d check before opening another one.

I used that example in the user’s guide chapter on problematic posteriors! What you get is overflow in the values

I think you also get near overflow parameter estimates of 10e300.

Thanks for explaining that. I’m wondering if we should also somehow put in diagnostics for frozen samplers. MCMC is doing the right thing in terms of the stationary distribution and balance here, so it’s not a bug other than that it feels like a bug to a user when Stan hangs. So we should probably monitor hangs and just stop the process automatically. I’m not entirely sure what thresholds of time or iterations that would require, though, as sometimes this will be OK and it’ll only get stuck for a few iterations. Once it’s been stuck for dozens, it’s probably too late for error messages.

It’s an even power of 2, so it has an exact representation because the base and exponent are both integers.

Which code is this? Is it inv_logit or in the bernoulli? What about bernoulli_logit?

+1 for that. I think we should throw exceptions rather than round when things go bad at boundaries. A lot of our functions and distributions don’t have well thought out edge conditions (not trying to offend anyone here—I wrote a lot of this stuff myself!). Nor do our functions behave consisently at boundaries at every level of autodiff yet. The scalar ones mostly do, but not the matrix functions.

4 Likes