Tutorial on Monte Carlo EM and variants for MML and MMAP

I was having trouble understanding what was going on in lme4, NONMEM, Monolix, etc., and so I did what any grad student would do and wrote up a tutorial.

Please let me know if you have questions; especially if you have corrections.

It’s not really a Stan tutorial per se because Stan doesn’t yet implement any of these algorithms. So I’m not sure what I’ll do with it, if anything.

mcem.pdf (202.3 KB)

P.S. This is just the knitr tutfte::tufte_handout format, which produces pdfs—the tufte_html produces HTML, but I couldn’t include it here in Discourse.


It isn’t just that the maximum of the complete data likelihood may not be finite when you try to maximize it with respect to both u and \theta. Even if those MLEs exist, the regulatory conditions for MLEs having known properties (such as consistency, asymptotic normality, etc.) require that the number of things to estimate remains fixed as N\uparrow \infty.

In simple cases where the u values are conditionally independent given some known grouping structure, we may be able to integrate them out by calling integrate_1d for each group.

Thanks for the clarification. Is this absolute or could there be a sublinear growth in parameters? I think that’s what winds up happening in non-parametric models like Dirichlet process mixtures.

I should really get a proper math stats education. I’m making progress in that at least I know what these things mean so I can understand the answers.

We should definitely try this.

I’m wondering if it would be more efficient for full Bayes, too, because it’s precisely those hierarchical parameters that mix poorly when we also sample the lower-level parameters.

You may be able to salvage consistency if the ratio of the number of parameters to the number of observations converges to zero, but I don’t know what you would do with the infinite order Fisher information matrix. A citation on the general issue is

1 Like

Re Generalized Expectation Maximization on page 7:
I’m glad you mentioned this. It’s a seemingly simple generalization of EM, to choose any \theta^{(t+1)} for which Q(\theta^{(t+1)}| \theta^{(t)}) > Q(\theta^{(t)}| \theta^{(t)}). In my experience, replacing the full maximization of the M-step with a single hill-climbing step is quite computationally efficient.

Re distribute derivatives through integrals on page 8:
Technically you need the dominated convergence theorem and linearity to hold in order to interchange integration (expectation) and differentiation. Though, I agree that most people just do it with a subtle remark about it being only conditionally possible.

Re Gradient Based Marginal Optimization on page 9:
Is there a missing negative sign on the right hand side of the covariance matrix?

There’s some minor typos that I’ll call out, if you want.

Much like drezap’s comment in this thread What doc would help developers of the Math library?, your blog style posts are great. Thanks!

That’s what the GMO algorithm is doing.

Probably. I have to count convexity on my fingers.

Yikes. I don’t even know what the dominated convergence theorem is. I thought linearity was enough. Does it help that we’re making additive approximations? I can add a footnote clarifying the conditions if you could suggest the right language.

Let me drop this into a repo and then I can give you a link.

My understanding is that after the approximation linearity is sufficient.

3 small comments:
i) SAEM stands for stochastic approximation EM (not stochastic averaging EM) and is due to Delyon-Lavielle-Moulines https://projecteuclid.org/euclid.aos/1018031103
ii) footnote 17 on page 7 of your pdf: the Robbins-Monro conditions are given wrt to a sequence \epsilon_t, which should instead be \lambda_t in that context.
iii) actually it is fundamental for SAEM to achieve the (proved) convergence to a stationary point of the observed likelihood, to have the complete likelihood written as a member of the exponential family, with analytically available sufficient summary statistics. This is proved in Delyon et al. Your description does not make use of such remark.

Thanks, great article!

Abstract: s/a posterior/a posteriori/
Abstract: s/can be estimate/can be estimated/

The “uncertainty in missing data” could use a little expansion: I’m unclear how exactly the matrix is defined.

In the initial definition of the Monte Carlo estimator (p. 5), you’re missing a superscript on u. (I’m not sure how to use LaTeX math in this forum, help appreciated.)

i) I pulled all the citations to the end, including Delyon et al.

ii) I’ll fix that

iii) Thanks—I completely missed that in the wall of math in that paper. I barely understood anything other than the algorithm pseudocode.

Is there a simple reason the algorithm requires sufficient summary statistics? And I take it nobody’s generalized the proof? I thought it was usually pretty easy to generalize these algorithms, it just led to difficult computation. So I’d very much like to understand why the exponential family is required here.

Thanks. I’ll fix the typos and try to explain what I mean by uncertainty in missing data. I detail that elsewhere and need to provide links, but it may just help to repeat the definition.

P.S. LaTeX math in the forum (and other Discourse forums where it’s enabled) is just


which yields e^x.

I tried to implement one of the methods @Bob_Carpenter talks about with PyStan in this notebook (along with a silly Stochastic Gradient Descent implementation), but something is wrong. Can anyone point me in the right direction? The main purpose of the notebook is to show how to use Stan for algorithm development if you so desired, but I don’t want to publish it with a broken algorithm implementation :P

Repeating what I replied to @seantalts via email.

  1. you need to reduce step size in SGD for it to converge and even so, it’s a fussy algorithm.

  2. you need to optimize through the expectation not plug in the expectation (at least in most cases).

Thanks! What does it mean to optimize through the expectation in this case? You don’t mean just replacing sampling with optimizing there, right?

What you need to evaluate is this:

\theta^* = \mathrm{argmax}_{\theta} \ \mathbb{E}_{p(u \mid \theta)}\!\left[ \log p(y \mid u, \theta) \right]

which isn’t necessarily equivalent to this:

\theta^+ = \mathrm{argumax}_{\theta} \ \log p(y \mid \mathbb{E}_{p(u \mid \theta)}[u], \theta).

Look at the proper MCMC EM definition in 2.ii on page 6, where the \theta^{(t+1)} is set.

hmm, gotcha. Not sure how to write that in a Stan model or even using a Stan model, haha. Seems like we’d need to be able to autodiff through HMC or something.

You’d need a second Stan model that could do the summation—you just need to bring in M parallel samples and evaluate the sum as written.