Some issues that may arise regarding transformations

In Stan, we have named parameters and we also have parameters on the underlying (typically unconstrained) scales. I have some questions/issues:

  1. When we do optimizing we will often get draws using Laplace approx on underlying scale. What will happen once we add this new feature in which linear transformations can depend on other parameters? In this case, the meaning of the underlying or unconstrained parameters changes in each iteration, hence if we draw on the underlying scale, we’d need to be able to convert back, using all the parameters in the model. All this info is there (it would seem to me to have the form of “generated quantities” in that it could be seen as a form of postprocessing) but I expect we’d need to put it in Stan to make this really work.

  2. Same issue for ADVI as well as future algs such as GMO and EP: We often want to work on the unconstrained space but then we have to return to reality at some point.

  3. Currently we do some of this in the interfaces (in rstan and rstanarm) but I’m thinking this is really fundamental and important enuf that it should be in Stan itself. Especially given that we’ve been talking about removing certain features in rstan such as the ability to access unconstrained parameters directly.

Who has been talking about that?

I thought Bob was talking about that. But I may have misunderstood!

Transformations can currently depend on other parameters. I can define an ordered sequence of elements thusly:

real a;
real<lower = a> b;
real<lower = b> c;

How so? Does the meaning of b change every iteration in the ordered definition above?

Correct. But they’re ordered, so you really only depend on the ones before the one you’re dealing with. That way the Jacobians all stack nicely :-)

Put what in Stan? What do you now for this situation?

We can start planning for EP and GMO when someone demonstrates they are worth implementing. I’m afraid of continuing the trend of ADVI and turning Stan into a package of relatively (to HMC) brittle approximations.

Some of what?

What I want to do is remove the back-door access to the vector of unconstrained parameters from within the Stan language. This has nothing to dow with RStan. @andrewgelman hounded me to add this capability, which I did through a back door over some of our other devs’ objections. As far as I know, he never used it. Is anyone else using it?

I thought about using it a couple of times but could never remember the name of the symbol for the vector of unconstrained parameters.

If there’s a good use case, we could include it, but I think most users will find it very confusing. I think Andrew wanted to do things like impose priors on everything, like a global normal(0, 10) or something, but I could be misremembering.

Interesting. OK, so I understand this is not a new issue. I’ll take your example and just change the parameter names slightly:

real a1;

real<lower=a1> a2;

For convenience I’ll use the term “a” for the two parameters a1, a2. The meanings of these never change. But the meanings of the parameters on the unconstrained scale can change. Call the unconstrained parameters “b”, so that b1=a1 and b2 = log(a2 - a1).

Now suppose we run Stan. Everything operates on the scale of b, so this is all fine; the transformations from “b” to “a” happen in each step of the algorithm so all is transparent for the user.

Here’s what confuses me: Suppose we run Stan-optimize. This again works on the unconstrained scale of b, so no problem. We get a point estimate of b, thus a point estimate of (a1, log(a2-a1)). That’s no problem. The challenge comes when we want to augment this with a Laplace approximation. We do that on the scale of b, so then we get 1000 draws of (b1, b2), which correspond to 1000 draws of (a1, log(a2-a1)). To get these back on the original scale (a1 and a2), Stan needs to make that transformation. Stan “knows” how to do it–it’s done at each step of HMC–but we need it for those 1000 draws, not for past steps. That’s why I’m saying it’s kinda like a generated quantities block.

Right now, rstan does this, I guess by using the preprogrammed access where it can convert back and forth between unbounded and original scale.

So maybe I’m saying that I think we should put Laplace approx in Stan? I’m not quite sure.

The case for GMO is clear. If let’s us replicated LME4 behaviour and is a standard and well understood algorithm for doing MLE-like things to hierarchical models. In some sense it is a safer, more stable, algorithm for a big part of our core users (those who do multilevel models) than BFGS, which will typically give the wrong answer.

To avoid possible confusion I’d like to remind that GMO refers to many different algorithms, and one of them is a standard and well understood and that is the one worth implementing.

Does lme4 use the MLE or does it use the unbiased estimates for scale parameters in a linear regression? I remember us not even lining up on the basics last time I checked.

Just to be clear, this is not a bug in Stan’s implementation of BFGS. L-BFGS (the limited-memory form we recommend over straight BFGS) doesn’t get the wrong answer, it’s just that there is no MLE for many hierarchical models.

I’d hardly call lme4’s answer the right answer under any Bayesian sense of correctness. It’s just well used and familiar.

It uses the version of GMO that we’re talking about iirc.

There is often a joint maximum it’s just not the thing anyone should compute.

I’m not really interested in this point. If there is an LBFGS implementation (which there should be if for no other reason than debugging and comparision resaons), there should be a maximum marginal likelihood implementation (which we are calling GMO for whatever reason)

Does that mean everything’s consistent if we get rid of L-BFGS and don’t support MLE going forward? I think we really need to focus if we’re going to continue to make progress. I personally don’t care at all about all these approximate algorithms, as that doesn’t seem like it’ll ever be our strength.

The problem with max marginal likelihood is that it doesn’t seem to be easy to implement performantly given a Stan model, partition of parameters, and an optimizer.

As soon as someone builds an MML estimator that works well enough that the people who want to use it are happy, we can talk about putting it in Stan! That probably won’t happen, though, because I don’t think anyone’s working on GMO or alternatives at the moment.

What are your criteria for including an estimation technique (MML) or algorithm (GMO)? Does ADVI pass muster? Does the current implementation of GMO?

I don’t really understand what you’re talking about here. Maximizing Laplace approximations to marginal likelihoods is a widely-used, well-understood estimation procedure. It works at least as well as maximization (in the sense that it won’t give a good estimate for every possible model, but if that’s our criterion we need to dump HMC as well).

If you get rid of L-BFGS it will be difficult for people to do comparisions with other methods, quick and dirty calculations, and a whole plethora of other things that are useful. (Already it’s inconvenient to use because of the insistence on treating transformations differently, but iirc that was going to become a flag). MML would extend that capability to hierarchical models and will work fairly well for a lot of our core-applications.

Parameter space splitting is also a key part of any possible way to automate non-centering, target adaptation better by improving the adaptation of the mass matrix (it really should reflect the hierarchical structure of the problem), and it would be key to any scalable data strategies (like laplace approximation).

Some of this stuff is in the research stream of what Stan does, but there are enough indications that algorithms that take into account the hierarchical structure outperform those that don’t in a lot of cases to make it worth looking seriously at.

At some point we have to decide if we want to freeze the Algorithms development in Stan. I personally don’t think they’re good enough on core applications to do that yet. I’m not sure if that’s what you’re suggesting, but if we don’t let tried-and-true, well understood algorithms in and we don’t let new algorithms in, we’ve frozen development and moved into “legacy” mode.

ADVI is not a great example. It was a “freshly baked” algorithm that was implemented poorly and pushed to the main package without any meaningful testing. A way to avoid making these mistakes again is to never allow a new algorithm into the package, but I don’t think that’s a good policy.

I wasn’t aware we had a current version of GMO. I thought @seantalts had stopped working on it.

2 Likes

I agree with Dan.

It might be better to not use term GMO as it refers to too many not so well defined things including an algorithm which didn’t work that well (but refers to also a wider framework which can include also Laplace approximation and MML)

I think to effectively go down this path we’d need some vision about what strengths to develop. The language is definitely a strength but it’s not going to grow if we don’t let people use various algorithms. Auto-tuned HMC is a strength but other packages are implementing it too so that won’t last. Our handling of output is ok but could use a lot of improvement. We’re effectively cross platform but that’s not going to get us much more adoption without having some more algorithms.

To me the strengths we have are the language, a solid HMC variant, and (on the table) being a great platform for high performance implementations of leading algorithms. Making the language improvements necessary to support algorithms that partition parameters is going to make the language a more broadly used platform, making the IO easier to deal with will get us more interface implementations without being a burden, and a cleaner (modular) interface for new algorithms would draw the people who want to implement their favorite new thing in a way that does burden us with gatekeeping. I haven’t said anything about MPI/GPU/threading because there are obvious wins there but I’d be surprised if we could do it better than a better funded project. Maybe if we just focus on HMC/RHMC and figure out how to partition the gradient calculation and use MPI/GPU without forcing users to do map_rect and such.

@Bob_Carpenter what do you think should get built out if we did cut down on algorithms?

OK, there are a bunch of issues here.

  • I don’t think it makes sense to talk about “MLE” in Stan because we don’t work with the likelihood. We could talk about posterior mode or simply optimization of the objective function.

  • GMO stands for gradient-based marginal optimization. lme4 does not use gradients so it’s not GMO. If we want to speak more generally, we can use the term “marginal optimization.” Actually, lme4 (in its glmer version) does not do marginal optimization; it does approximate marginal optimization: conditional on the hyperparameters it optimizes the approximate marginal posterior density (or marginal likelihood or marginal objective function) based on the Laplace approximation.

  • All the marginal optimization algorithms I know, have two steps: internal optimization and external optimization. Internal is conditional on the hyperparameters: there’s some optimization to find the conditional mode of the other parameters, followed with some distributional approximation. External is optimization over the hyperparameters. GMO uses gradients in the external optimization.

  • What does lme4 do? That’s complicated. lme4 uses an inefficient non-gradient-based optimizer for the external optimization. The reason that lme4 is fast is: (a) the external optimization is typically not so difficult, so even the inefficient external optimizer doesn’t require a lot of steps, and (b) I think lme4 saves certain matrix multiplications so that it runs fast after the first step.

  • I agree with Krzysztof and others that we want to be open to new algorithms. Lots of algorithms are gradient-based and can take advantages of the strengths of the Stan modeling language. Also, regarding Laplace, EP, VB, GMO, etc.: The point is that we’re already using these algorithms, in various forms, all the time, for problems where the current Nuts implementation is too slow. Examples include the medical exam problem, the birthday problem, and various others. I don’t think this need will go away.

  • I agree with Dan that partitioning of the parameter space is an important idea that comes up a lot.

  • How efficiently can/should Stan do marginal optimization, or approximate marginal optimization, for hierarchical regressions? I don’t know. It all depends on how hard it is to program. If it’s easy to program, I’d love to have it, as this would allow us to transition smoothly between Nuts and faster approximate algorithms.

  • Where does GMO fit in here? I’m not sure. It could make sense to first implement Laplace, with some special way to save matrix multiplications so as to efficiently compute hierarchical glms, and then move on to a more general GMO later.

I had no idea that you could access that, but it would have been hugely helpful to me. In a recent model I was working on, I basically gave up on letting Stan handle the parameter transformations since I needed to do so much troubleshooting on the unconstrained space. With the addition of the scale location transformations, I think being able to access unconstrained params would be even more important.

FWIW I’ve already found L-BFGS to be broken as soon as I need to handle a parameter transformation manually, since there is on way to tell stan to include a certain Jacobian correction in sampling, and not optimizing.

diagnostic_file
An optional character string providing the name of a file. If specified the diagnostics data for all parameters will be written to the file. If not provided, files are not created. When the folder specified is not writable, tempdir() is used. When there are multiple chains, an underscore and chain number are appended to the file name.

The output of diagnostic file includes the unconstrained parameters of each sample, the components of the gradient at that point, and the components of the momenta before they are resampled.

Also be default the optimizing methods don’t add Jacobians at all as they are designed to compute formal maximum likelihoods and not posterior density modes.

Right, but if I do the transformation manually, or have some other transformation that isn’t yet supported and do the Jacobian correction myself using target += Stan doesn’t know to exclude that in optimizing, no?

You could just pass that info as a switch in data…

1 Like