# What should API look like for a marginal optimization algorithm?

I’d like to pick up topics such as EM and GMO. For an algorithm that distinguishes data y, missing data \psi, and the actual (optimization) variables \theta, what should the user-facing API look like if we build it into Stan? Take hierarchical models, for example, with \theta controlling the population model, \psi_i the latent parameter for individual i, and y_{ij} his observation at t = t_j. If we want to use Stan for the inner-loop sampling of \psi_i \sim p(\psi_i | y_{ij}, \theta), how should we allow the user to specify the statistical structure?
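To make the loop structure concrete, here is a minimal Monte Carlo EM sketch in plain Python (not a proposed Stan API) for a toy Gaussian hierarchy. Everything here is an illustrative assumption: the model, the sizes, and the closed-form Gaussian conditional in the E-step, which stands in for the Stan inner-loop sampler run with \theta pinned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchy: theta = population mean, psi_i ~ N(theta, 1),
# y_ij ~ N(psi_i, 1) for j = 1..J observations on each of I subjects.
true_theta, I, J = 2.0, 200, 5
psi_true = true_theta + rng.standard_normal(I)
y = psi_true[:, None] + rng.standard_normal((I, J))

theta = 0.0  # initial guess for the population parameter
for _ in range(50):
    # E-step (inner loop): sample psi_i ~ p(psi_i | y_i, theta).
    # Here the conditional is Gaussian in closed form; in general this
    # is where a sampler would run with theta held fixed.
    post_mean = (theta + J * y.mean(axis=1)) / (1 + J)
    post_sd = np.sqrt(1.0 / (1 + J))
    psi_draws = post_mean + post_sd * rng.standard_normal((100, I))
    # M-step (outer loop): maximize the expected complete-data log
    # likelihood over theta; for this toy model that is the mean of psi.
    theta = psi_draws.mean()
```

After the loop, theta should sit near the true population mean of 2, which is the kind of fixed point the outer maximization is after.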

I understand that a parallel topic is how to implement the algorithm, or even whether we should do such a thing at all (EM is not really Bayesian, mode vs. mean, MAP is not parameterization invariant, GMO hasn’t seen much field validation, etc.), but I’d rather not digress too much.

2 Likes

What is GMO? I’ve never heard of it. I’d like to take a look.

“Gradient-based marginal optimization”. There’s an inactive Stan repo GitHub - stan-dev/gmo: Inference on marginal distributions using gradient-based optimization and iirc also a Stan branch somewhere. Bob has written about it in Tutorial on Monte Carlo EM and variants for MML and MMAP. Maybe @andrewgelman and @avehtari have more materials on it?

1 Like

I regularly find Stan too tightly bound to HMC… I think it would be great to be able to specify the Stan model independently of the estimation approach. In a case like this I imagine specifying something like a list of parameters (ideally subsets of parameter structures) to either include or exclude in the various estimation steps.

A further step along these lines that I would find great, but might be dramatic, is to stop distinguishing between parameters and data inside the Stan model and instead have this fed in externally. The current approach is, I assume, much simpler from a dev perspective, but to me it doesn’t seem helpful for modelling. It would be great to just, e.g., perform operations on y, which in some cases contains missing data so we could request sampling, and in other cases is complete so no sampling is needed, without needing to change the model.

2 Likes

GMO has been used to mean different things, so it would be useful to define which GMO you mean.

Ah, here you define which GMO you mean. The development-phase experiments were discouraging and development stopped years ago. We couldn’t find an example where it would have worked with fewer log density evaluations than dynamic HMC in Stan. There are some theoretical arguments hinting that in general it’s not going to work.

But Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond by @charlesm93 et al. works if the latent conditional posterior is unimodal and well approximated by a multivariate normal. This is progressing.

Yeah, I brought up GMO as an example though it’s been inactive for a while. Another example algorithm would be EP, like your work on GitHub - gelman/ep-stan, which can also be applied to the above hierarchical model by marginalizing out \psi_i. I’m mostly curious how we can design/retrofit Stan syntax for such a model.

Obviously an alternative is to do it on top of Stan by calling the sampler from Python, R, etc.

Hi, attached is an old draft of a GMO paper. It looks like Dustin wrote much of this, and I think it’s also based on some earlier version of mine that I can’t find now. The idea is that it would have the good features of glmer etc but be more general. I still think it’s a potentially great idea. When we tried to implement it for some lmer/glmer-type problems, I recall the algorithm not being as fast as we would hope, because lmer/glmer gains computational efficiency by saving some intermediate computations between iterations. But I don’t remember the details. I’m also ccing Charles as he might have some thoughts on this.
Andrew

gmo.pdf (669 KB)

GMO’s idea is similar to Monte Carlo EM for general models, which Bob’s tutorial linked in this thread describes. The specifics of GMO are around the type of algorithms we want to use for the Monte Carlo step (namely, the proposal distribution we use to draw the local parameters that we marginalize over, and how to update that proposal over time). We also provide a Hessian using the Laplace approximation, and Sections 4–5 have a number of advanced cases that I still think are pretty neat.

> When we tried to implement it for some lmer/glmer-type problems, I recall the algorithm not being as fast as we would hope, because lmer/glmer gains computational efficiency by saving some intermediate computations between iterations.

When we iterated on GMO for hierarchical GLMs, the only competing methods for nonconjugate models used numerical integration, so it was easy to provide better estimates. For lmer/glmer-type models, however, lmer/glmer takes advantage of exact computations, where coordinate ascent becomes iteratively reweighted least squares. We did want GMO to reduce to the exact computation as a special case for lmer/glmer models (Sec. 3.1). However, it was hard to beat lmer/glmer in benchmark time because our code relied on RStan’s API, which wasn’t low-level enough for our use cases: we had to call gradients multiple times unnecessarily.

The idea remains quite promising, as it extends lmer/glmer to both nonconjugate models (via sampling from a proposal distribution) and large datasets where even IRLS is slow (via data subsampling). With RStan’s API today, the code slowdowns may no longer be an issue after some refactoring.

2 Likes

Hah, I remembered you being more pessimistic about it! :D
Thanks for chiming in yourself.

Hi all!

I’m very interested in a similar idea, which is (maybe?) related to this topic. I have been working on creating an approximation method (I’m not sure whether it already exists) using the EM algorithm and the Laplace approximation. Attached is a document where I briefly describe it.

My idea is to use AD to estimate the second derivatives for the Hessian matrix, but as I’m finishing my PhD, I have not had time to write complete code.
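For what it’s worth, here is a rough Python sketch of the Laplace half of such a scheme, on an assumed one-group Gaussian toy model (not taken from the attached draft), with a central finite difference standing in for AD on the second derivative:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 20
psi_true = 1.5
y = psi_true + rng.standard_normal(J)  # toy data: y_j ~ N(psi, 1)

def log_joint(psi, theta):
    """log p(y, psi | theta) for a toy model: psi ~ N(theta, 1), y_j ~ N(psi, 1)."""
    return -0.5 * np.sum((y - psi) ** 2) - 0.5 * (psi - theta) ** 2

def laplace_log_marginal(theta, eps=1e-4):
    # Inner step: conditional mode of psi (closed form here; Newton in general).
    psi_hat = (theta + J * y.mean()) / (1 + J)
    # Second derivative by central finite differences (a stand-in for AD).
    h = (log_joint(psi_hat + eps, theta) - 2 * log_joint(psi_hat, theta)
         + log_joint(psi_hat - eps, theta)) / eps ** 2
    # Laplace: log p(y|theta) ~= log p(y, psi_hat|theta) + (1/2) log(2 pi) - (1/2) log(-h)
    return log_joint(psi_hat, theta) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-h)

# Outer step: maximize the approximate marginal over a grid of theta.
grid = np.linspace(-1, 4, 1001)
theta_hat = grid[np.argmax([laplace_log_marginal(t) for t in grid])]
```

Since the toy model is Gaussian, the Laplace approximation is exact and theta_hat should land on the exact marginal maximum likelihood estimate, the sample mean of y.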

If someone wants to work on this idea, I will be available very soon to work on it!

PS: I’m tagging @Bob_Carpenter and @charlesm93 because I discussed this idea with them a while ago.

Laplace-EM_JCavieres.pdf (206.6 KB)

1 Like

EM and EP both require splitting the parameters. We don’t do that in Stan. @andrewgelman has been asking for something like being able to pin parameters, which would take us part of the way there.

This is exactly the generalization of @mgorinova’s SlicStan that I was pushing at ProbProg this year. @WardBrian and I have been talking about doing just this.

This is what BUGS does and what Turing.jl does and exactly what I’ve been suggesting. It also helps with modularity (as in coding up a hierarchical prior once), which was one of the main motivations for SlicStan.

But now you’re talking about parts of variables, which is hard. It’s only R that has this funny NA value, and it’s one of the things that drags down R performance everywhere. I don’t even know how we’d code those inputs in other interfaces like Python. I’m very open to suggestions if anyone has any ideas. Maybe JSON just lets you use heterogeneous typing so you could add NA?

Stan’s modeling language isn’t at all tied to HMC per se—it just implements a log density and gradient. So I think a better formulation is that it’s assuming black-box inference based only on log densities and gradients. That’s literally what a Stan model defines (on the unconstrained scale), along with ways to randomly generate the generated quantities given parameters. So we can run ADVI for VB, L-BFGS for penalized MLE, and HMC/NUTS for full Bayes (we could run Metropolis, too, but it’s a poor competitor to HMC, so we don’t bother; we could also run ensemble methods like the Goodman-Weare sampler used in emcee, but we also never found cases where that outperforms NUTS).
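To illustrate that black-box contract, here is a minimal Python sketch (not Stan code): a toy log density and gradient, which are the only two callables a compiled Stan model needs to expose, driven by plain gradient ascent for a MAP-style point estimate. The density here is a made-up standard bivariate normal.

```python
import numpy as np

# Toy unconstrained log density and its gradient: standard bivariate
# normal centered at (1, -2). These two callables play the role of what
# a compiled model exposes to any black-box algorithm.
mode = np.array([1.0, -2.0])

def logp(x):
    return -0.5 * np.sum((x - mode) ** 2)

def grad(x):
    return -(x - mode)

# Gradient ascent: an optimizer that only ever sees logp and grad,
# never the model's internal structure.
x = np.zeros(2)
for _ in range(200):
    x = x + 0.1 * grad(x)
```

The same pair of callables could just as well be handed to L-BFGS, ADVI, or NUTS, which is the point of the abstraction.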

The real drawback to our language is that we don’t restrict to graphical models like BUGS does. If you do that, then there’s all kinds of great stuff you can do, at the cost of making models harder to define. For instance, you can calculate a Markov blanket for Gibbs, you can do simulation-based calibration automatically, etc.

1 Like

What is “pin parameters”? EM & EP require defining a separate set of parameters that do not enter AD or get sampled. Is that what pinning is intended to do?

By “pin” or “clamp” or “fix”, I just mean externally fixing the value of a parameter (which would then not enter into autodiff or get sampled). @andrewgelman wants this so he can write a model with a varying scale parameter, but then plug in a value when first starting.
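One way to picture pinning from the interface side is a closure: fix the pinned values outside, so the sampler only ever sees the free parameters. A Python sketch with made-up names (this is not an actual Stan API), using a tiny random-walk Metropolis over the free parameter:

```python
import numpy as np

def log_joint(theta, psi):
    # Toy joint: theta is a population-level parameter, psi a local one,
    # with psi ~ N(theta, 1) and a N(0, 1) prior on theta.
    return -0.5 * (psi - theta) ** 2 - 0.5 * theta ** 2

def pin(logp, theta_fixed):
    """Return a log density over psi only, with theta externally fixed.

    From the sampler's point of view the pinned value is a captured
    constant: it never enters autodiff and never gets sampled."""
    return lambda psi: logp(theta_fixed, psi)

# Sample psi with theta pinned at 1.0 via random-walk Metropolis.
rng = np.random.default_rng(2)
target = pin(log_joint, 1.0)
psi, draws = 0.0, []
for _ in range(20000):
    prop = psi + rng.normal(0.0, 1.0)
    if np.log(rng.uniform()) < target(prop) - target(psi):
        psi = prop
    draws.append(psi)
```

With theta pinned at 1, the draws should center on the conditional mean of psi (which is 1 for this toy joint), while theta itself never moves.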

The bigger problem with EM is that the “missing data” are often discrete (as in the textbook mixture example), so they can’t be Stan parameters at all. The other problem is that we want to marginalize, not just pin values.

In popPK we often deal with real “missing data”, in that case individual-level parameters (I actually haven’t done any mixture models). If we use MCEM, these parameters are what get sampled/marginalized, while the population-level parameters are updated by EM iterations. So I’m thinking of pinning the population parameters during sampling. In this case the pinned parameters are never sampled, only updated by the maximization step.

Just a quick amendment to Bob’s remarks:

• EM is a special case of variational inference. EM and the (vaporware) GMO involve a partition of parameter space. There are versions of variational inference that involve partitions of parameter space.

• EP involves a partition of the posterior, typically a partition of data space. So, not quite the same thing as a partition of parameter space. The partition of data space in EP is related to the partition of data space in LOO or cross-validation more generally.

1 Like

@andrewgelman: Can you elaborate on how the point estimate produced by EM acts as an approximate posterior? Is there something that isn’t a special case of variational inference? Are MCMC and penalized MLE also special cases of variational inference?

EM is a special case of variational inference where the approximating distribution is restricted to be a delta function. We discuss this in BDA3; we even have an index entry, “EM, as special case of variational inference”.
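Spelling out that reading (my sketch of the standard argument, not a quote from BDA3; the delta-function restriction gives the MAP-EM / “hard EM” variant), start from the usual variational lower bound and restrict q to a point mass:

```latex
\log p(y \mid \theta)
  \;\ge\; \mathbb{E}_{q(\psi)}\!\left[\log p(y, \psi \mid \theta)\right]
        - \mathbb{E}_{q(\psi)}\!\left[\log q(\psi)\right]

% Take q(\psi) = \delta(\psi - \psi^*).  The entropy term is then a
% (degenerate) constant that does not depend on \psi^*, so coordinate
% ascent on the bound alternates
%   E-step:  \psi^* \leftarrow \arg\max_{\psi} \log p(\psi \mid y, \theta)
%   M-step:  \theta  \leftarrow \arg\max_{\theta} \log p(y, \psi^* \mid \theta)
```

The dropped entropy term is exactly where the “zero entropy” objection bites: the point mass makes the bound degenerate, which is why this is a poor variational approximation even though the alternating scheme is well defined.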

That’s a very poor variational approximation because it has zero entropy. And by this reasoning, any MLE or penalized MLE is also a variational approximation.

I thought people said this because there are E steps in both algorithms (I recall Matt once saying that either EP or VB was just E, E, E, E, instead of E, M, E, M, …).

Agreed, it’s a poor variational approximation. Just about any version of VB is better! But EM does exist, just as point estimation exists.

For EM I’m thinking about something like this

```stan
...
population parameters {
  vector[n] mu;         // population mean
  cov_matrix[n] sigma;  // mixed-effects covariance
}
parameters {
  vector[n] theta;      // subject parameter
}
...
model {
  target += multi_normal_lpdf(theta | mu, sigma);
  ...
}
```


so that mu & sigma would be maximized rather than sampled.