Tutorial on Monte Carlo EM and variants for MML and MMAP

Bob_Carpenter · August 11, 2018, 2:32pm

I was having trouble understanding what was going on in lme4, NONMEM, Monolix, etc., and so I did what any grad student would do and wrote up a tutorial.

Please let me know if you have questions; especially if you have corrections.

It’s not really a Stan tutorial per se because Stan doesn’t yet implement any of these algorithms. So I’m not sure what I’ll do with it, if anything.

mcem.pdf (202.3 KB)

P.S. This is just the knitr tutfte::tufte_handout format, which produces pdfs—the tufte_html produces HTML, but I couldn’t include it here in Discourse.

bgoodri · August 11, 2018, 6:19pm

It isn’t just that the maximum of the complete data likelihood may not be finite when you try to maximize it with respect to both u and \theta. Even if those MLEs exist, the regulatory conditions for MLEs having known properties (such as consistency, asymptotic normality, etc.) require that the number of things to estimate remains fixed as N\uparrow \infty.

In simple cases where the u values are conditionally independent given some known grouping structure, we may be able to integrate them out by calling integrate_1d for each group.

Bob_Carpenter · August 12, 2018, 10:26am

Thanks for the clarification. Is this absolute or could there be a sublinear growth in parameters? I think that’s what winds up happening in non-parametric models like Dirichlet process mixtures.

I should really get a proper math stats education. I’m making progress in that at least I know what these things mean so I can understand the answers.

We should definitely try this.

I’m wondering if it would be more efficient for full Bayes, too, because it’s precisely those hierarchical parameters that mix poorly when we also sample the lower-level parameters.

bgoodri · August 12, 2018, 1:21pm

You may be able to salvage consistency if the ratio of the number of parameters to the number of observations converges to zero, but I don’t know what you would do with the infinite order Fisher information matrix. A citation on the general issue is
http://www.econ.brown.edu/Faculty/Tony_Lancaster/papers/IncidentalParameters1948.pdf

roualdes · August 12, 2018, 6:48pm

Re Generalized Expectation Maximization on page 7:
I’m glad you mentioned this. It’s a seemingly simple generalization of EM, to choose any \theta^{(t+1)} for which Q(\theta^{(t+1)}| \theta^{(t)}) > Q(\theta^{(t)}| \theta^{(t)}). In my experience, replacing the full maximization of the M-step with a single hill-climbing step is quite computationally efficient.

Re distribute derivatives through integrals on page 8:
Technically you need the dominated convergence theorem and linearity to hold in order to interchange integration (expectation) and differentiation. Though, I agree that most people just do it with a subtle remark about it being only conditionally possible.

Re Gradient Based Marginal Optimization on page 9:
Is there a missing negative sign on the right hand side of the covariance matrix?

There’s some minor typos that I’ll call out, if you want.

Much like drezap’s comment in this thread What doc would help developers of the Math library?, your blog style posts are great. Thanks!

Bob_Carpenter · August 14, 2018, 9:57am

That’s what the GMO algorithm is doing.

Probably. I have to count convexity on my fingers.

Yikes. I don’t even know what the dominated convergence theorem is. I thought linearity was enough. Does it help that we’re making additive approximations? I can add a footnote clarifying the conditions if you could suggest the right language.

Let me drop this into a repo and then I can give you a link.

roualdes · August 15, 2018, 4:30pm

My understanding is that after the approximation linearity is sufficient.

umbertopicchini · August 18, 2018, 3:00pm

3 small comments:
i) SAEM stands for stochastic approximation EM (not stochastic averaging EM) and is due to Delyon-Lavielle-Moulines https://projecteuclid.org/euclid.aos/1018031103
ii) footnote 17 on page 7 of your pdf: the Robbins-Monro conditions are given wrt to a sequence \epsilon_t, which should instead be \lambda_t in that context.
iii) actually it is fundamental for SAEM to achieve the (proved) convergence to a stationary point of the observed likelihood, to have the complete likelihood written as a member of the exponential family, with analytically available sufficient summary statistics. This is proved in Delyon et al. Your description does not make use of such remark.

Rif_A_Saurous · August 18, 2018, 3:45pm

Thanks, great article!

Abstract: s/a posterior/a posteriori/
Abstract: s/can be estimate/can be estimated/

The “uncertainty in missing data” could use a little expansion: I’m unclear how exactly the matrix is defined.

In the initial definition of the Monte Carlo estimator (p. 5), you’re missing a superscript on u. (I’m not sure how to use LaTeX math in this forum, help appreciated.)

Bob_Carpenter · August 20, 2018, 8:34am

i) I pulled all the citations to the end, including Delyon et al.

ii) I’ll fix that

iii) Thanks—I completely missed that in the wall of math in that paper. I barely understood anything other than the algorithm pseudocode.

Is there a simple reason the algorithm requires sufficient summary statistics? And I take it nobody’s generalized the proof? I thought it was usually pretty easy to generalize these algorithms, it just led to difficult computation. So I’d very much like to understand why the exponential family is required here.

Bob_Carpenter · August 20, 2018, 8:36am

Thanks. I’ll fix the typos and try to explain what I mean by uncertainty in missing data. I detail that elsewhere and need to provide links, but it may just help to repeat the definition.

P.S. LaTeX math in the forum (and other Discourse forums where it’s enabled) is just

$e^x$

which yields e^x.

seantalts · October 21, 2018, 3:46pm

I tried to implement one of the methods @Bob_Carpenter talks about with PyStan in this notebook (along with a silly Stochastic Gradient Descent implementation), but something is wrong. Can anyone point me in the right direction? The main purpose of the notebook is to show how to use Stan for algorithm development if you so desired, but I don’t want to publish it with a broken algorithm implementation :P

Bob_Carpenter · October 22, 2018, 1:01am

Repeating what I replied to @seantalts via email.

you need to reduce step size in SGD for it to converge and even so, it’s a fussy algorithm.
you need to optimize through the expectation not plug in the expectation (at least in most cases).

seantalts · October 22, 2018, 1:20am

Thanks! What does it mean to optimize through the expectation in this case? You don’t mean just replacing sampling with optimizing there, right?

Bob_Carpenter · October 22, 2018, 1:33am

What you need to evaluate is this:

\theta^* = \mathrm{argmax}_{\theta} \ \mathbb{E}_{p(u \mid \theta)}\!\left[ \log p(y \mid u, \theta) \right]

which isn’t necessarily equivalent to this:

\theta^+ = \mathrm{argumax}_{\theta} \ \log p(y \mid \mathbb{E}_{p(u \mid \theta)}[u], \theta).

Look at the proper MCMC EM definition in 2.ii on page 6, where the \theta^{(t+1)} is set.

seantalts · October 22, 2018, 1:40am

hmm, gotcha. Not sure how to write that in a Stan model or even using a Stan model, haha. Seems like we’d need to be able to autodiff through HMC or something.

Bob_Carpenter · October 22, 2018, 1:50am

You’d need a second Stan model that could do the summation—you just need to bring in M parallel samples and evaluate the sum as written.

Topic		Replies	Views
What should API look like for a marginal optimization algorithm? Developers	21	1591	March 10, 2022
Is stan useful for MML? What is the state of GMO? General marginal-likelihood	6	1305	February 6, 2018
Maximum Likelihood estimation for Random Effect model for Meta Analysis Modeling specification , hierarchical-model	25	481	December 4, 2024
Vague Algorithm Roadmap Developers	13	1026	November 22, 2018
Using rstan::optimizing to obtain MLE does not seem to run with large number of parameters Modeling	12	1763	July 19, 2018

Tutorial on Monte Carlo EM and variants for MML and MMAP

Related topics