Why doesn't sampling fail when optimizing does?


As far as I understand, finding the posterior mode via optimization usually doesn’t work for hierarchical models since the objective function is unbounded due to the contribution from the prior, e.g.:

# parameters
effects <- c(0.0, 0.0)
effects_sd <- 0.0

# prior lpdf
effects_prior_lpdf <- sum(dnorm(effects, 0.0, effects_sd, log = TRUE))

> print(effects_prior_lpdf)
[1] Inf

so the maximum value of the objective function could be found at infinity via setting the all the hierarchical effects and scale to 0. In this case the solution effectively ignores any contribution to the objective function from the likelihood - which doesn’t sound particularly useful.

What I am struggling to get my head round is, why does sampling work fine? Why doesn’t the sampler eventually propose some value sufficiently close this unbounded mode and get stuck?

Any help would be awesome - thanks!


The posterior mode is a consequence of an arbitrary parameterization and is not “typical” of the posterior distribution itself. Algorithms like Markov chain Monte Carlo try to quantify an entire probability distribution and hence rarely spend any time around the uncharacteristic mode.

For a much more detailed explanation see https://betanalpha.github.io/assets/case_studies/probabilistic_computation.html.


Thanks for this reply, @betanalpha. The case study is super helpful!

1 Like

A slightly more technical answer is that the sampler will stay in the typical set, which is the set of elements with log density within epsilon of the entropy (expected log density). The typical set usually doesn’t contain the mode in even moderately high dimensions.

I find it useful to work through simple examples by simulation, which I do in this case study: https://mc-stan.org/users/documentation/case-studies/curse-dims.html

1 Like