Using rstan to calculate Bayes Factors

The Evidence of a model is the integral of the posterior likelihood over the space of parameters.
If the prior distributions are proper (they integrate to one), you can safely compare different models that describe EXACTLY THE SAME data. By taking the ratio of different Evidences you can calculate the Bayes Factors and answer the basic question is this model A significantly better than model B.

Calculating the Evidence is extremely difficult and the best way I found over the years is to use the “Thermodynamic Integration” method. In this method you run your model at different “Temperatures” simulating increasing amount of data, from 0 (only the prior) to 1 (prior+ likelihood) in the following way:

P(beta)= prior* likelihood^beta 0<=beta<=1
logP(beta)=log(prior)+beta* likelihood

The magic of thermodynamic integration is that if you calculate the Evidence is
Evidence= Integral(logLikeliood(beta), (beta=0, beta=1))
So, you basically run independent mcmc at different “temperatures beta” and approximate the Evidence integral.

Here is my questions
I did not find that the standard rtan package provides the facility of calculating the Evidence, Is there any that I miss?

I did not find an Rpackage that uses rstan to calculate the evidence, does somebody knows of any such package?

I am rising those questions because I think I found a way to “hack” rstan to do the Evidence calculations and I wonder if anybody has done that before.

1 Like

When it comes to the competing philosophies of quantitative inference, for whatever reason folks in the Stan community (myself included, though I also ironically had a long period advocating likelihood ratios before becoming Bayesian) tend to fall more in the “estimate parameters” camp than the “quantify evidence”, so it wouldn’t surprise me that you haven’t found much in the latter domain. I do know that a large % of Bayesians from the field of Psychology have been introduced to Bayes through Bayes Factors alone, so there’s quite a lot of (IMO alternative-naive) evidence folks there, if that helps your searching any. I haven’t seen anything like your idea of fitting many models, each with more data, as a step in evidence-quantification workflows before, but there are definitely packages for computing BFs the more standard ways and it wouldn’t surprise me if some are compatible with Stan in one of its forms (rstan/cmdstanr/etc).

The idea of computing lots of models sounds dauntingly expensive computationally; I wonder if the means by which the loo package achieves it’s leave-one-out approximations might offer a route to a faster alternative? @avehtari might have input.

Minor quibble: this is one definition of evidence, but not necessarily a consensus one. the likelihood ratio of the MLE, for example, is one alternative. I’m not saying I personally prefer LRs over BFs, but just making sure you’re aware that evidence isn’t something like “information” on which a consensus operationalization has yet emerged.


Check out the bridgesampling package.

Given the pitfalls and overall brittleness of Bayes factors, I understand the general reluctance to recommend and implement marginal likelihood-based methods in the Stan ecosystem. But I for one think people should have more tools, not fewer, at their disposal. It does mean there’s a need for proper teaching to make sure people understand when and how to apply certain techniques, but I really don’t like the whole “Bayes factors are bad; don’t use them” attitude. Yes, yes, M-open world and all that, but there are situations where a Bayes factor is fine. My two cents, anyway.


Totally agree. I maybe have a bit more of a negative view thanks to frustration that as the Psych Bayes community exploded over the last decade, none of the primary advocates/educators gave any time to admitting there even was multiple sub-camps and many were thence not exposed at all to estimation as an alternative to the in-that-field-modal evidence approach.


I have the feeling that when you have tons of data, and therefore the posteriors are very well approximated by a normal distribution both definitions, Maximum likelihoods Ratios and Bayes Factors coincide. My view is that Bayes factors is a generalization that allow to study situations where the Likelihood ratios make no sense, like having few observations.

I wonder if the problem is either the Bayes factors or really bad approximations to the Bayes Factors?

My understanding (disclosure I am a follower of Jaynes plus the belief that improper priors are an abomination that should have no place in the Universe) is that if you calculate your Evidence in the right way, there should be no problem.
What are good references for the brittleness of Bayes factors? I would like to look into that.
Thanks for your helpful responses.

1 Like

When I said BFs are brittle I meant to say they are as brittle as the marginal likelihoods that go into computing them, which are quite delicate to compute for models we care about. I come from statistical phylogenetics and computing Bayes factors is a pain in the neck for most real-world situations in that field.

My argument for not minding improper priors very much is the same as the one given by Robert in his The Bayesian Choice: they provide closure, both in the colloquial sense and in the mathematical sense.

The performances of the estimators derived from these generalized distributions are usually good enough to justify these distributions. Moreover, they often permit recovery of usual estimators like maximum likelihood estimators, thus guaranteeing a closure of the inferential field by presenting alternative approaches at the boundary of the Bayesian paradigm. (The Bayesian Choice, 2nd ed. pg. 27)

Now with respect to this

for a compelling and somewhat unorthodox critique, see Navarro’s blog-post-turned-paper: Others here might have other references. @betanalpha @jonah

1 Like

If you are considering writing custom code to calculate these quantities in Stan (and I admit I don’t know anything about the computations except that they are supposed to be subtle), it seems prudent to mention here that Stan’s ~ notation for specifying priors and likelihoods automatically and silently drops normalizing constants, so some care is required to ensure that you are getting the quantities that you need, generally by replacing ~ and target += *lupmf/*lupdf with target += *lpmf/*lpdf.


I want to avoid a long drawn out battle about Bayes factors so I’ll just make a few pertinent comments.

Inference and decision making are different

Bayes factors are inferences over models that are compatible with all other Bayesian inferences. For example the “regular” Bayesian inferences of a mixture model can be reframed as inferences of each component model weighted by Bayes factors, just without having to explicitly evaluate the Bayes factors. This mathematical consistently makes Bayes factors natural when a set of component models is natural.

Consequently when restricted to proper inferences over those models, such as model averaging, Bayes factors are fine. At the same time, however, those model inferences can often be implemented by fitting larger models that include the component models and avoid explicit Bayes factors entirely.

The danger with exact Bayes factors arises when they are used for model selection, which is not a well-defined inference; model selection is instead a decision problem. Now inferences like Bayes factors can be used to inform decision making processes, but there’s no guarantee on how useful those processes will be. For example choosing between two models based on a Bayes factor ratio threshold has no guarantee on the false positive or true positive rate, and indeed in many cases those rates can be terrible (especially if the prior model is too diffuse).

When using Bayes factors to inform model selection a full calibration needs to be worked through to see how accurate those selections are, and that is almost never done. The naive use of Bayes factor based model selection without this calibration has lead to lots of fragile results and reproducibility problems. Many of the common critiques of Bayes factors are based on the consequences of this fragility.

This is one reason why the naive replacement of p-values with Bayes factors (looks at psychology) doesn’t help anything.

Bayes factors and Bayes factor estimators are different

As has been mentioned Bayes factors cannot be evaluated exactly in practice and instead we must estimate them numerically. Unfortunately the marginal likelihood is not easy to estimate using our standard tools like Markov chain Monte Carlo.

Sampling-based computational methods like Monte Carlo, Markov chain Monte Carlo, importance sampling, and the like work best when the expectand (the function whose expectation value is being computed) is relatively uniform compared to the target distribution; if the expectand varies strongly then the sampling-based estimators will suffer from large error.

The two strategies for estimating the marginal likelihood direction are as a prior expectation value of the realized likelihood function,

E_{\text{prior}}[ \pi(\tilde{y} \mid \theta) ]

and a posterior expectation value of the inverse realized likelihood function,

E_{\text{posterior}} \Bigg[ \frac{1}{\pi(\tilde{y} \mid \theta)} \Bigg].

Unfortunately both of these functions vary strongly when we learn a lot from the data and the posterior distribution isn’t very close to the prior distribution. The variance of both expectants is large and in many cases actually infinite, which implies that the error on the sampling-based estimators is also infinite.

Thermodynamic methods try to introduce a sequence of intermediate distributions between the prior and the posterior distribution so that each neighboring distribution is very close, and the marginal likelihood between the two can be estimated more accurately. The sequence of marginal likelihood estimators can then be used to construct the prior/posterior marginal likelihood. The difficulty is in constructing a sufficiently nice sequence where the neighbors are sufficiently close together; there are many heuristics but all of them tend to be fragile, especially in high dimensions. For some more discussion see [1405.3489] Adiabatic Monte Carlo.

In practice we have to work with an estimated Bayes factor, and often one with questionable error quantification. Because of this error decisions based on the estimated Bayes factor can be significantly different than decisions based on the exact Bayes factor, often so much so that they have to be considered as different decision making processes entirely (at which point many of the nice theoretical properties of Bayes factors can no longer be relied on).

Relying on Bayes factor estimator-based decisions without any kind of calibration then becomes all the more fragile.


Hi Michael Betancourt.

I fully agree with your comments and I see where the reluctance towards a misuse of Bayes factors comes from.
Regarding thermodynamic integration, I have only experience with moderate sized systems, with less than a hundred parameters, and using advanced techniques like combining parallel tempering and Rieamann manifold mcmc I was able to estimate the Evidence of complex models involving one dimensional PDEs and markovian kinetics.
What I am trying to do is to hack rstan to run independent chains at different temperatures to estimate the Evidence of alternative models. It is slow as hell, but, I think that if you have a robust estimation of the Evidence you are in firm ground, for the exact formulation you made of your problem.

The hack I used is the following:
(this is my real code for an embarrassingly simple model)

  functions {
  // ... function declarations and definitions ...
data {
   real<lower=0, upper=1> beta; // for Evidence
   int<lower=0> N; // number of records
   // here I declare the data we use in the model
   vector[N] CrCl;
   vector[N] CLv;
   // here I declare some parameters that I have to come up with to describe the 
   // prior distribution of the fitted parameters. 
   // I provide a mean and a standard deviation for each one of the three parameters we
   // use: the variance of CL_v the CL_v for zero clearance and the slope
   real <lower=0> mean_log_sd_CLv;
   real <lower=0> sd_log_sd_CLv;
   real  mean_min_CLv;
   real <lower=0> sd_min_CLv;
   real <lower=0> mean_log_dCLv_dCrCl;
   real <lower=0> sd_log_sd_dCLv_dCrCl;

  // ... declarations ...
parameters {
   // I had to do a little trick here, for some reason rstan works better if 
   // we work with standarized parameters (that after we substract the mean and divide by
   // the standard deviation). So the algorithm will work on this parameters: 
   // the real ones are the transformed

   real sn_min_CLv;  //sn_ means standard to normal
   real sn_log_dCLv_dCrCl;
   real sn_log_sd_CLv;

   // ... declarations ...
transformed parameters {
   // ... declarations ... statements ...
   // here I transform the standarize parameters to the regular ones.
   real min_CLv=sn_min_CLv*sd_min_CLv+mean_min_CLv;
   real dCLv_dCrCl=exp(sn_log_dCLv_dCrCl*sd_log_sd_dCLv_dCrCl+mean_log_dCLv_dCrCl);
   real sd_CLv=exp(sn_log_sd_CLv*sd_log_sd_CLv+mean_log_sd_CLv);
   // here I calculate the fit for each individual point
      vector[N] CLvfit=dCLv_dCrCl*CrCl+min_CLv;

model {
    // in this block we set the distributions.
    // I define two variables: prior and loglikelihood
   real prior;
   real loglikelihood;
   // in the prior I sum the prior distribution of the three working parameters, all normal
   // distributions of the transformed parameters
   // the likelihood function asses how good is the fit of the CLv
   // this is the distribution that is sampled: it depends on the parameter beta
   // we have to run this model for different values of beta from 0 to 1
generated quantities {
   // unfortunately I have to calculate again the prior and loglikelihood so rstan records its
   // values. 
   real prior; 
   real loglikelihood; 
   // ... declarations ... statements ...

then I run the same model for different values of beta.