Is stan useful for MML? What is the state of GMO?


Is stan useful for empirical Bayes/maximum marginal likelihood?
It seems that the gmo package package should provide this functionality. Is it still being developed and will the corresponding write-up be published anytime soon?

In case there is, wouldn’t it be more efficient to perform maximization and sampling in parallel within stan instead of the current sequential implementation? Is something like this on the roadmap?


gmo specifically is on hold due to limited resources and priorization, but fast approximative marginalization in general is on the roadmap. Do you have specific modeling problems in mind or are you in general interested in this kind of approaches?


Thanks for the info! I was asking because I would like to perform a hypothesis test (shudder) for one coefficient in a fairly complex hierarchical model.


What exactly is max marginal likelihood? Is that just where for funnel-like problems you take the maximum of the marginal density of the variance term instead of taking the maximum of the joint density? I know the joint maximum at the funnel is all the way at the bottom of the neck, which is why maximum likelihood doesn’t give sensible estimates for hierarchical models.


You maximize the marginal likelihood :p

Specifically for the case where you have a multilevel model with a big vector of random effects x and a vector of parameters theta, you marginalise out x and find an estimate for theta by maximizing the marginal posterior p(theta | y).

You then treat that as fixed and compute the posterior p(x | y ,theta) and either use that for inference or compute its maximum.


From what I understand, it’s meant to give saner estimates for when the likelihood surface has several local maxima in particular. There may be many solutions with some local maxima, when several parameters are unknown. It seems to me that MML approaches are used to avoid the problem of joint ML when multiple maxima exist. The gist is to maximize across some set of dimensions, then fix those, and maximize across the other dimensions, treat those as fixed, rinse and repeat. Pretty literally winds up looking a lot like a block gibbs sampler, where the goal is maximizing rather than sampling.

Anyway, you can imagine a really bumpy 2D density where x is one param, y is another, and z is the likelihood (or posterior). Fix x, maximize in y. Fix y, maximize in x. Fix x, maximize in y, etc. The end result is to find the maximum at the MARGINAL distributions, rather than the joint maximum. It finds the “overall” maximum for each parameter value, across all the small local maxima, across all other parameters’ values.

That’s how I understand it, anyway. It’s necessary for complicated models, including anything with latent variables, random effects, mixtures, etc. Joint maximum likelihood will just get stuck; MML won’t as easily.

Edit: This quick answer depicts it well:


In something like lme4, it’s used because the maximum likelihood estimate doesn’t exist (the likelihood is unbounded).

It’s just one step. You start with p(alpha, phi), marginalize out to get p(phi) where usually phi are the hierarchical parameters and alpha the lower level, then find the MML estimate phi* = argmax p(phi), then plug that back in to get p(alpha, phi*). This is the “empirical Bayes” bit. At this point, you can either sample alpha or optimize it given a fixed phi*. At this point, we lose the uncertainty on phi by fixing it, but you can now optimize alpha and then you can lay down a Laplace approximation to underestimate uncertainty to some unknown degree.