From what I understand, it’s meant to give saner estimates when the likelihood surface has several local maxima, which is common once several parameters are unknown. MML approaches seem to be used to avoid the problems joint ML runs into when multiple maxima exist. The gist is: maximize over one subset of the parameters, fix those, then maximize over the remaining parameters, treat those as fixed, rinse and repeat. It pretty literally winds up looking a lot like a block Gibbs sampler, except the goal is maximizing rather than sampling.
Anyway, you can imagine a really bumpy 2D density where x is one parameter, y is another, and z is the likelihood (or posterior). Fix x, maximize in y. Fix y, maximize in x. Fix x, maximize in y, and so on until it settles. The end result is a maximum of the MARGINAL distributions rather than the joint maximum: for each parameter it finds the “overall” best value, across all the small local maxima, across the other parameters’ values.
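To make the alternation concrete, here’s a minimal sketch of that fix-one-maximize-the-other loop on a made-up bumpy surface (the `loglik` function below is purely illustrative, not any particular model’s likelihood; each 1D maximization is done by brute-force grid search for simplicity):

```python
import numpy as np

def loglik(x, y):
    # Hypothetical bumpy 2D log-likelihood: one broad peak near
    # (1, -0.5) plus small ripples that create local maxima.
    return -(x - 1.0) ** 2 - (y + 0.5) ** 2 + 0.1 * np.cos(5 * x) * np.cos(5 * y)

grid = np.linspace(-3, 3, 601)  # shared search grid for both parameters

x, y = -2.5, 2.5  # deliberately bad starting point
for _ in range(20):
    # Fix y, maximize the surface in x.
    x = grid[np.argmax(loglik(grid, y))]
    # Fix x, maximize the surface in y.
    y = grid[np.argmax(loglik(x, grid))]

print(x, y)  # ends up near the broad peak around (1, -0.5)
```

The structure is exactly the block-Gibbs-like alternation described above, just with an argmax where a sampler would draw a value.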
That’s how I understand it, anyway. It’s pretty much necessary for complicated models: anything with latent variables, random effects, mixtures, etc. Joint maximum likelihood will just get stuck in local maxima; MML won’t as easily.
Edit: This quick answer depicts it well: https://stats.stackexchange.com/a/133299