Tutorial on Monte Carlo EM and variants for MML and MMAP


I was having trouble understanding what was going on in lme4, NONMEM, Monolix, etc., and so I did what any grad student would do and wrote up a tutorial.

Please let me know if you have questions; especially if you have corrections.

It’s not really a Stan tutorial per se because Stan doesn’t yet implement any of these algorithms. So I’m not sure what I’ll do with it, if anything.

mcem.pdf (202.3 KB)

P.S. This is just the knitr tutfte::tufte_handout format, which produces pdfs—the tufte_html produces HTML, but I couldn’t include it here in Discourse.


It isn’t just that the maximum of the complete data likelihood may not be finite when you try to maximize it with respect to both u and \theta. Even if those MLEs exist, the regulatory conditions for MLEs having known properties (such as consistency, asymptotic normality, etc.) require that the number of things to estimate remains fixed as N\uparrow \infty.

In simple cases where the u values are conditionally independent given some known grouping structure, we may be able to integrate them out by calling integrate_1d for each group.


Thanks for the clarification. Is this absolute or could there be a sublinear growth in parameters? I think that’s what winds up happening in non-parametric models like Dirichlet process mixtures.

I should really get a proper math stats education. I’m making progress in that at least I know what these things mean so I can understand the answers.

We should definitely try this.

I’m wondering if it would be more efficient for full Bayes, too, because it’s precisely those hierarchical parameters that mix poorly when we also sample the lower-level parameters.


You may be able to salvage consistency if the ratio of the number of parameters to the number of observations converges to zero, but I don’t know what you would do with the infinite order Fisher information matrix. A citation on the general issue is


Re Generalized Expectation Maximization on page 7:
I’m glad you mentioned this. It’s a seemingly simple generalization of EM, to choose any \theta^{(t+1)} for which Q(\theta^{(t+1)}| \theta^{(t)}) > Q(\theta^{(t)}| \theta^{(t)}). In my experience, replacing the full maximization of the M-step with a single hill-climbing step is quite computationally efficient.

Re distribute derivatives through integrals on page 8:
Technically you need the dominated convergence theorem and linearity to hold in order to interchange integration (expectation) and differentiation. Though, I agree that most people just do it with a subtle remark about it being only conditionally possible.

Re Gradient Based Marginal Optimization on page 9:
Is there a missing negative sign on the right hand side of the covariance matrix?

There’s some minor typos that I’ll call out, if you want.

Much like drezap’s comment in this thread What doc would help developers of the Math library?, your blog style posts are great. Thanks!


That’s what the GMO algorithm is doing.

Probably. I have to count convexity on my fingers.

Yikes. I don’t even know what the dominated convergence theorem is. I thought linearity was enough. Does it help that we’re making additive approximations? I can add a footnote clarifying the conditions if you could suggest the right language.

Let me drop this into a repo and then I can give you a link.


My understanding is that after the approximation linearity is sufficient.