Hello!

I was wondering what the frequentist interpretation of the MLE is when the likelihood is multimodal. How are the sampling distributions affected, and how should they be interpreted?

Hey, it may help to unpack a more precise version of your question. It’s super useful to include a model and/or some code that demonstrates an effect, and then to ask the question relative to that effect. E.g. “Here is some multimodal data. I’ve calculated some sampling distributions like so… What I don’t understand is how to interpret this thing X that I see. How might a frequentist deal with this?”

IMHO, MLE just means “choose parameters to maximise a likelihood function”. Meanwhile, multimodal or multi-peaked may hint at a distribution best described as a mixture of simpler ones, or at other model structure that should be taken into account beyond a simple distribution. Is there a reason you’d expect a specifically frequentist interpretation? Do you mean to ask what frequentist methods are available for fitting multimodal distributions?

Well, my question is more philosophical than practical. It’s specifically about the idea of “choose parameters to maximise a likelihood function”. In the Bayesian framework you can always find a better (more flexible) model. However, on the frequentist side this seems to be limited, and I can’t visualise what information from the likelihood function is actually “lost”.

Multimodality is a problem in both approaches!

In classical inference, you can’t be sure that you have reached the global maximum, and in Bayesian inference, MCMC methods have trouble sampling the posterior distribution.
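To make the global-maximum worry concrete, here is a minimal sketch. The Cauchy location model, the three data values, the step size, and the function names are all assumptions chosen for illustration, not something from this thread. A plain gradient ascent on the log-likelihood ends up in a different mode depending on where it starts:

```python
import math

def loglik(theta, ys):
    # Cauchy location log-likelihood, up to an additive constant
    return -sum(math.log(1.0 + (y - theta) ** 2) for y in ys)

def grad_loglik(theta, ys):
    # Derivative of loglik with respect to theta
    return sum(2.0 * (y - theta) / (1.0 + (y - theta) ** 2) for y in ys)

def local_ascent(theta0, ys, step=0.1, iters=2000):
    # Fixed-step gradient ascent: it only ever finds the mode "uphill" of theta0
    theta = theta0
    for _ in range(iters):
        theta += step * grad_loglik(theta, ys)
    return theta

# Hypothetical data: one observation far away from the other two
ys = [-5.0, 4.5, 5.5]

lo = local_ascent(-6.0, ys)  # started on the left
hi = local_ascent(6.0, ys)   # started on the right

print("mode found from the left: ", lo, "log-likelihood", loglik(lo, ys))
print("mode found from the right:", hi, "log-likelihood", loglik(hi, ys))
```

Both runs stop at a point where the gradient vanishes, but only the right-hand one has the higher likelihood. An optimiser started on the left gives no warning that a better mode exists.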

Maybe @betanalpha has something interesting to say about the information loss in multimodal distributions, because in this case the distribution’s geometry makes it hard.

I believe there are some fundamental misconceptions about Bayesian and frequentist modeling at play here.

In frequentist modeling one specifies an observational model, \pi(y; \theta), and introduces estimators, which are functions from the observational space to the parameter space, \hat{\theta}: Y \rightarrow \Theta, and a loss function L(\hat{\theta}, \theta) that quantifies how useful an estimator is if \theta identifies the true data generating process. A frequentist analysis then *calibrates* the estimator by computing the worst-case expected loss. At least a frequentist analysis *tries* to perform such a calibration; in practice this is often too computationally demanding for nontrivial observational models, estimators, or loss functions.
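As a toy illustration of what such a calibration involves, here is a sketch under stated assumptions: a Normal(\theta, 1) observational model with 10 observations, the sample mean as the estimator, and squared-error loss. None of these choices come from the thread, and the function names are made up. The expected loss is estimated by simulating data at each \theta, and the maximum over a small grid of \theta values stands in for the worst case, which strictly should be taken over all of \Theta:

```python
import random

def sample_mean(ys):
    # The estimator: a function from observations to a parameter value
    return sum(ys) / len(ys)

def squared_error(estimate, theta):
    # The loss function L(theta_hat, theta)
    return (estimate - theta) ** 2

def expected_loss(estimator, loss, theta, n_obs=10, reps=4000, seed=0):
    # Monte Carlo estimate of the expected loss when theta is the truth
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ys = [rng.gauss(theta, 1.0) for _ in range(n_obs)]
        total += loss(estimator(ys), theta)
    return total / reps

def worst_case_risk(estimator, loss, thetas, **kwargs):
    # Largest expected loss over a finite grid of candidate truths
    return max(expected_loss(estimator, loss, t, **kwargs) for t in thetas)

thetas = [-10.0, -1.0, 0.0, 1.0, 10.0]
risk = worst_case_risk(sample_mean, squared_error, thetas)
print("estimated worst-case expected loss:", risk)
```

For this toy model the exact expected loss is 1/10 at every \theta, so the simulation is not needed. The point is the shape of the computation: one simulation study per candidate truth, which quickly becomes expensive for realistic models.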

Evaluating the observational model at an observed measurement, \tilde{y}, yields the likelihood function, \pi(\tilde{y}; \theta). The parameter values that maximize the likelihood function define the *maximum likelihood estimator*. Under very specific conditions the maximum likelihood estimator can be approximately calibrated: it is unbiased, intervals around the maximum likelihood estimate have nice coverage properties, etc.
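In symbols, using the notation above and writing \hat{\theta}_{\mathrm{ML}} for the maximum likelihood estimator (a label introduced here only for convenience), this reads:

```latex
\hat{\theta}_{\mathrm{ML}}(\tilde{y})
  = \operatorname*{argmax}_{\theta \in \Theta} \; \pi(\tilde{y}; \theta)
```

Like any other estimator it is a function from Y to \Theta, so its calibration is a statement about how this argmax behaves across the possible observations y, not only at the one \tilde{y} that was observed.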

One necessary condition for the maximum likelihood estimator to be (approximately) calibrated is that the likelihood function concentrates in a single neighborhood. In other words, seeing multiple modes indicates that any calibration is invalid. You can still compute a maximum likelihood estimate, or at least try to; it just won’t have any expected behavior.

In a Bayesian analysis the observational model is complemented with a prior model to give a joint distribution over the data and parameter space. When that joint distribution is conditioned on the observed data we get a posterior distribution. We then quantify inference as expectation values with respect to that posterior distribution.
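Written out, these three steps look roughly as follows. The notation is an assumption made for this sketch: \pi(y \mid \theta) is the observational model \pi(y; \theta) read as a conditional density, \pi(\theta) is the prior model, and f is any function of the parameters whose expectation is of interest.

```latex
\begin{align*}
\pi(y, \theta) &= \pi(y \mid \theta) \, \pi(\theta) \\
\pi(\theta \mid \tilde{y}) &=
  \frac{\pi(\tilde{y} \mid \theta) \, \pi(\theta)}
       {\int \pi(\tilde{y} \mid \theta') \, \pi(\theta') \, \mathrm{d}\theta'} \\
\mathbb{E}[f \mid \tilde{y}] &= \int f(\theta) \, \pi(\theta \mid \tilde{y}) \, \mathrm{d}\theta
\end{align*}
```

The last integral is what MCMC estimates in practice, and it is the step that becomes unreliable when the posterior has several well-separated modes.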

In general a posterior distribution has no calibration: we have no idea how the posterior distribution, or posterior expectation values, will behave a priori unless we do the calibration ourselves.

Multimodality doesn’t prevent us from trying to calibrate our Bayesian model in theory, but in practice it can prevent us from implementing the calibration because we can’t estimate expectation values accurately.

For much more see https://betanalpha.github.io/assets/case_studies/modeling_and_inference.html.

Thank you @betanalpha.

What would be the approach when the mass is not concentrated in a single neighborhood? What does classical inference rely on in this case (can you still derive confidence intervals)?

Thank you!

There are only scattered, usually not very useful, results in the literature. You can construct intervals, but you won’t be able to associate them with any particular coverage guarantees.

This is why frequentist methods are so limited in practice: frequentist calibration relies on relatively simple assumptions to allow for demanding calculations to be done as analytically as possible. Instead of allowing people to build models relevant to their problems, they force people to accept models that are much too simple, which makes the calibration irrelevant anyway.
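Without a guarantee, the only way to learn what coverage such an interval has is to check it by simulation, one model at a time. Here is a sketch of such a check. Everything specific in it is an assumption chosen for illustration: a Cauchy location model with two observations (its likelihood has two equally high modes whenever the observations are more than 2 apart), a coin flip to choose between tied modes, and a nominal 95% interval built from the curvature of the log-likelihood at the chosen mode.

```python
import math
import random

def cauchy_mle_two_obs(y1, y2, rng):
    # Closed-form maximiser of the Cauchy location likelihood for two observations
    mid = 0.5 * (y1 + y2)
    half_gap = 0.5 * abs(y1 - y2)
    if half_gap <= 1.0:
        return mid  # single mode at the midpoint
    # Two modes of equal height; pick one at random (an arbitrary tie-break)
    offset = math.sqrt(half_gap ** 2 - 1.0)
    return mid + rng.choice((-1.0, 1.0)) * offset

def observed_information(theta, ys):
    # Negative second derivative of the Cauchy log-likelihood at theta
    return sum((2.0 - 2.0 * (y - theta) ** 2) / (1.0 + (y - theta) ** 2) ** 2
               for y in ys)

def wald_coverage(true_theta=0.0, reps=20000, z=1.96, seed=1):
    # Fraction of simulated data sets whose interval contains true_theta
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Inverse-CDF draws from a Cauchy centred at true_theta
        ys = [true_theta + math.tan(math.pi * (rng.random() - 0.5))
              for _ in range(2)]
        theta_hat = cauchy_mle_two_obs(ys[0], ys[1], rng)
        info = observed_information(theta_hat, ys)
        # Zero curvature gives an interval of unbounded width
        half_width = z / math.sqrt(info) if info > 0.0 else math.inf
        hits += abs(theta_hat - true_theta) <= half_width
    return hits / reps

coverage = wald_coverage()
print("empirical coverage of the nominal 95% interval:", coverage)
```

Whatever number this prints applies only to this model, this sample size, and this tie-breaking rule. Nothing in the construction ties it to the nominal 95%, which is the sense in which the interval comes without a coverage guarantee.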

This paper (which I found out about via @vianeylb) might help answer those questions from a theoretical side https://arxiv.org/pdf/1807.04431.pdf