Any examples of Stan/brms being used explicitly for maximum likelihood model fitting?

I often point out to learners that they could use MCMC and other Bayesian samplers (with flat priors) to get a sample from the normalised likelihood, and hence optimise to get the MLE and either do percentile CIs or estimate the second derivatives. But has anyone ever actually done this? I don’t mean a quote-unquote Bayesian analysis with flat priors; I mean someone who explicitly intended to do likelihood-based analysis.
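For concreteness, here is the kind of minimal sketch I have in mind (rstan, with a made-up normal model that has no prior statements, so the sampler targets the normalised likelihood; the model and data are purely illustrative):

```r
library(rstan)

# Toy model with no prior statements: the implicit prior is flat on mu and
# flat on sigma > 0, so the posterior is the normalised likelihood.
stan_code <- "
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"

set.seed(1)
y <- rnorm(50, mean = 2, sd = 1.5)
standata <- list(N = length(y), y = y)

mod <- stan_model(model_code = stan_code)
fit <- sampling(mod, data = standata, chains = 4, iter = 2000, refresh = 0)

draws <- as.data.frame(fit)
quantile(draws$mu, c(0.025, 0.975))   # percentile interval for mu

# The same model through Stan's optimizer gives the MLE directly,
# since there is no prior to penalise it.
optimizing(mod, data = standata)$par
```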

Would be interested to know of any examples out there.


Maybe I am not as averse to flat priors as most hardcore Bayesians, so I don’t mind using them as a way of getting started on simple problems without having to think about what the best priors should be. I have used an MCMC-based, MLE-like approach when trying to replicate a differential gene expression method, but eventually I went beyond flat priors to set up a hierarchical GLM to account for random effects, and used zero-mean normal priors because they were actually a better choice and improved inference.
I think the reason it’s not done more often is that it becomes almost immediately obvious that flat priors are not the default or natural choice for some problems.

A flat prior gives numerically the same result, but the interpretation is different because of the underpinning ontology. Any model with a prior, even a flat one, has as its estimand the degree of belief in unknowns in the data-generating process. I mean likelihood-based inference, which has as its estimand the putative true unknown in some ante rem or possible-worlds ontology. I explain this distinction in a philosophical standpoint document on my personal website, though it has become rather bloated, repetitive and self-indulgent. I am planning to do a v2.0 rewrite in the week before Christmas.

The ctsem R package for discrete- and continuous-time hierarchical state space modelling uses a complicated Stan model in the background, and the default estimation is maximum likelihood. Two different optimizers are used: classic L-BFGS, and also a custom subsampling stochastic gradient approach, for added complexity :)
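Not ctsem’s actual code, but the generic version of the same idea in rstan, which exposes the same L-BFGS optimizer: with no priors in the model the optimum is the MLE, and the Hessian at the optimum gives curvature-based standard errors. Here `mod` and `standata` are placeholders for a compiled stanmodel and its data list.

```r
library(rstan)

# Sketch only: 'mod' is a compiled stanmodel with no priors and 'standata'
# its data list (placeholders, not ctsem internals).
fit_ml <- optimizing(mod, data = standata,
                     algorithm = "LBFGS",   # Stan's default optimizer
                     hessian = TRUE)

fit_ml$par                                  # ML point estimates

# Standard errors from the curvature at the optimum; note rstan computes
# the Hessian on the unconstrained scale, so constrained parameters need
# a further (delta-method style) transformation before these are interpretable.
sqrt(diag(solve(-fit_ml$hessian)))
```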


I don’t have experience with the technique myself, but data cloning is an apparently important technique for co-opting MCMC computation to yield frequentist inference, presented here:
https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1461-0248.2007.01047.x

I think that data cloning is just equivalent to multiplying the log likelihood term in a Stan model by a large constant.
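Roughly like this, I think (a sketch, not the dclone package’s actual implementation): pass the number of clones K in as data, keep whatever proper prior you like, and scale the log likelihood term.

```stan
data {
  int<lower=1> N;
  vector[N] y;
  int<lower=1> K;                  // number of clones
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);              // any proper prior; its influence vanishes as K grows
  sigma ~ normal(0, 5);
  target += K * normal_lpdf(y | mu, sigma);   // likelihood "cloned" K times
}
```

As I understand the paper, the posterior mean then converges to the MLE as K grows, and K times the posterior variance approximates its asymptotic variance.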

R functions for performing data cloning with Stan are provided here; I’m not clear on whether they achieve data cloning via multiplying the log likelihood by a constant or (equivalently) by actually fitting the model to multiple copies of the data simultaneously. stan.fit: Fit Stan models with cloned data in datacloning/dclone: Data Cloning and MCMC Tools for Maximum Likelihood Methods

This does not use Stan’s optimizer whatsoever, nor does it rely on flat priors, but it is an example of using Stan for maximum likelihood estimation.

I didn’t know that! Thanks, I’ll definitely cite it as an example.

Thanks, I looked at the R package documentation (such as it is), the article and the promotional website. It looks … eccentric. But there are examples of people giving it a go, like here for DSGE models in economics (economists will give anything a go). I don’t see why they have to include a prior and then try to obliterate it in the limit as the number of clones goes to infinity. Why not just work with the likelihood? But anyway, I guess it is a fringe example.


It’s not clear to me what you mean by “underpinning ontology”, but if by that you mean there is a fundamental difference between the two interpretations above, I’d disagree and say it may be a matter of the scope of choices you can make within each method.

Whether you choose flat priors explicitly by specifying them, or implicitly by choosing MLE (or even by not specifying priors and having the method default to flat, as Stan and other packages do), that is an inference assumption, and it is equivalent in all cases. It’s like using linear least squares versus MLE on a linear model with a Gaussian likelihood; although the choice of likelihood function is not an option in the former, you are still stuck with the consequences of its assumptions.
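As a toy illustration of that last point (with made-up data): least squares and a hand-coded Gaussian MLE land on essentially the same coefficients, so the likelihood assumption is there whether or not you wrote it down.

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

coef(lm(y ~ x))                       # least squares

negll <- function(par) {              # Gaussian negative log likelihood
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
optim(c(0, 0, 0), negll)$par[1:2]     # MLE of intercept and slope
```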

To me, the difference when using Bayesian inference is that it’s easier to relax the assumptions within the same framework, so if you don’t, you probably need to justify it (e.g. “why did you choose those priors?” vs. “why didn’t you go Bayesian?”).

Maybe there’s a formal philosophical distinction that can be made between the options (and maybe that’s what you tried to convey on your personal page), but I cannot really think of a concrete consequence of making that distinction, especially since few people choose between MLE and Bayesian inference on philosophical grounds.

I think the reason for including a prior is to generalize the technique to MCMC engines that require one. The position of the paper seems to be that lots of people are resorting to MCMC techniques because frequentist alternatives aren’t available, and the goal is to provide a plug-in frequentist alternative for anybody’s Bayesian analysis. Not a goal that I share, nor one that I am convinced is achieved by this technique, but that’s how I read the paper.

I think an additional claim is that there is some class of problems where MCMC sampling from the likelihood raised to a power is computationally more stable than applying generic optimizers to the likelihood itself. That claim surprises me on its face, but then I think about all of the problems people have with optimizers and I think “well, maybe”.

I can imagine it might have been justified in the days of BUGS, with conjugate priors and then cloning to get rid of them. That makes sense. I wouldn’t do it, but it makes sense.

The other problem you mention, of summarising a normalised likelihood sample rather than optimising it, is twofold, I think. One part is the likelihood-free / ABC argument, where you can still get a posterior sample even if you can’t evaluate the likelihood, although I think you will always need priors for that, at least as a philosophical matter. The other is perhaps the counter-argument: summarising a sample gives stable, sufficient statistics, while trying to find the maximum-density point involves density estimation, which gets hard in high-dimensional spaces as the distance between draws massively increases.
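A quick, purely illustrative way to see that second point: the average nearest-neighbour distance among standard normal draws grows with dimension, so the “highest-density point” becomes harder to pin down from a finite sample, while marginal quantile summaries remain trivial to compute.

```r
set.seed(1)
nearest_gap <- function(d, n = 1000) {
  x <- matrix(rnorm(n * d), nrow = n)            # n draws from N(0, I_d)
  dm <- as.matrix(dist(x))                       # pairwise distances
  mean(apply(dm, 1, function(r) min(r[r > 0])))  # mean nearest-neighbour distance
}
sapply(c(1, 2, 10, 50), nearest_gap)             # grows with dimension
```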

I agree there is no difference, but hardcore frequentists argue that there is, and appeal to ontology, though not in a very well-constructed way, in my experience. If we want to argue that there is no difference, then we need to meet them on the philosophical battlefield. Saying that it comes to the same number in the end will not wash. That’s part of what I’ve tried to put down as a standpoint for inference. I think we should be eclectic, but justify our choices in each case, not just pick what’s easy, and we certainly shouldn’t mix estimands in the same project (as many network meta-analyses do).

Anyway, I asked this question because I’m interested in bridging the gap in the other direction, showing people who are scared of Bayes that they can use sampler algorithms for likelihood-based inference. There may be some who are scared of likelihood too, of course… I can’t help them.

I agree up to the battlefield metaphor; most people aren’t hardcore frequentists or Bayesians and don’t care about any battle. I was at a philosophy of science conference this July and was accused of peddling “bayesian propaganda” because I said something along the lines of ‘Bayesian inference is the natural approach to formulating a hierarchical inference problem’. His argument was that, up to the priors, any inference problem you could set up in a Bayesian way you could set up in a frequentist framework, and it was because the former required a deeper understanding of inference that it ended up being more sophisticated, not anything intrinsic to Bayesianism.

I realize a lot of people will object, but I’m not sure I even have a problem with the definition that Bayesian statistics is simply statistics that uses Bayes’ rule and therefore priors. If that’s the case, though, the flat-prior assumption makes frequentist inference a particular case of Bayesian inference, and the choice between them would require understanding the assumptions of either approach. It kind of boils down to “choose whatever you like, but be prepared to justify the implications”.