I want to avoid a long, drawn-out battle about Bayes factors, so I’ll just make a few pertinent comments.
Inference and decision making are different
Bayes factors are inferences over models that are compatible with all other Bayesian inferences. For example the “regular” Bayesian inferences of a mixture model can be reframed as inferences within each component model weighted by Bayes factors, just without having to evaluate the Bayes factors explicitly. This mathematical consistency makes Bayes factors natural when a set of component models is natural.
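To make that reframing concrete, here is a sketch of the decomposition for a mixture of component models M_k with prior weights w_k (the notation \pi_k(\tilde{y}) for the component marginal likelihoods is introduced here just for illustration):

E_{\text{posterior}}[ f(\theta) ] = \sum_{k} p(M_k \mid \tilde{y}) \, E_{\text{posterior}, k}[ f(\theta_k) ],

p(M_k \mid \tilde{y}) = \frac{ w_k \, \pi_k(\tilde{y}) }{ \sum_{j} w_j \, \pi_j(\tilde{y}) } = \frac{ w_k }{ \sum_{j} w_j \, \pi_j(\tilde{y}) / \pi_k(\tilde{y}) },

where \pi_k(\tilde{y}) = \int \pi_k(\tilde{y} \mid \theta_k) \, \pi_k(\theta_k) \, \mathrm{d} \theta_k is the marginal likelihood of component k and the ratios \pi_j(\tilde{y}) / \pi_k(\tilde{y}) are exactly the Bayes factors. In other words the posterior component weights are just the prior weights updated by Bayes factors, even though fitting the joint mixture never requires evaluating those ratios explicitly.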
Consequently, when restricted to proper inferences over those models, such as model averaging, Bayes factors are fine. At the same time, however, those model inferences can often be implemented by fitting a single larger model that contains the component models, avoiding explicit Bayes factors entirely.
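For example, here is a minimal numpy sketch (a toy point-null versus normal-prior comparison; the model, numbers, and variable names are all invented for illustration) showing that the posterior component weight from the joint mixture agrees with the prior weight updated by the Bayes factor, even though the mixture route never forms a Bayes factor:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy setup (all values hypothetical): y_i ~ normal(mu, sigma) with sigma known.
# Component model M0 fixes mu = 0; component model M1 puts mu ~ normal(0, tau).
sigma, tau = 1.0, 2.0
y = rng.normal(0.3, sigma, size=20)

def log_lik(mu_values):
    # Log likelihood of the full data set at each value of mu.
    return stats.norm.logpdf(y[:, None], mu_values, sigma).sum(axis=0)

mu_grid = np.linspace(-10.0, 10.0, 20001)
dmu = mu_grid[1] - mu_grid[0]
prior_w0, prior_w1 = 0.5, 0.5

# Route 1: explicit marginal likelihoods, a Bayes factor, and updated weights.
log_ml0 = log_lik(np.array([0.0]))[0]
ml1 = np.sum(np.exp(log_lik(mu_grid)) * stats.norm.pdf(mu_grid, 0.0, tau)) * dmu
log_bf10 = np.log(ml1) - log_ml0
post_w1_explicit = 1.0 / (1.0 + (prior_w0 / prior_w1) * np.exp(-log_bf10))

# Route 2: treat everything as one joint model over (component, mu); the
# posterior mass on component 1 falls directly out of the joint fit, with no
# Bayes factor ever evaluated.  (In practice this would be an MCMC fit of the
# larger mixture model; a grid suffices for this one-dimensional toy.)
unnorm0 = prior_w0 * np.exp(log_lik(np.array([0.0]))[0])
unnorm1 = prior_w1 * stats.norm.pdf(mu_grid, 0.0, tau) * np.exp(log_lik(mu_grid))
post_w1_mixture = np.sum(unnorm1) * dmu / (unnorm0 + np.sum(unnorm1) * dmu)

print(post_w1_explicit, post_w1_mixture)  # the two routes agree
```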
The danger with exact Bayes factors arises when they are used for model selection, which is not a well-defined inference; model selection is instead a decision problem. Now inferences like Bayes factors can be used to inform decision-making processes, but there’s no guarantee of how useful those processes will be. For example choosing between two models based on a Bayes factor threshold comes with no guarantee on the false positive or true positive rate, and indeed in many cases those rates can be terrible (especially if the prior model is too diffuse).
When using Bayes factors to inform model selection a full calibration needs to be worked through to see how accurate those selections actually are, and that is almost never done. The naive use of Bayes factor based model selection without this calibration has led to lots of fragile results and reproducibility problems. Many of the common critiques of Bayes factors are based on the consequences of this fragility.
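As a hedged illustration of what such a calibration can look like, here is a small simulation study for a toy point-null versus normal-prior comparison (everything here, from the threshold of 3 to the effect size, is invented for the example) where the Bayes factor is analytic; note how the diffuse priors gut the true positive rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy calibration study (all values hypothetical): M0 fixes mu = 0, M1 puts
# mu ~ normal(0, tau); y_i ~ normal(mu, sigma) with sigma known, so the Bayes
# factor depends on the data only through the sample mean and is analytic.
sigma, n, n_sims = 1.0, 50, 5000
mu_under_M1 = 0.3                      # representative effect size for this toy

def log_bf10(ybar, tau):
    se = sigma / np.sqrt(n)
    return (stats.norm.logpdf(ybar, 0.0, np.sqrt(se**2 + tau**2))
            - stats.norm.logpdf(ybar, 0.0, se))

threshold = np.log(3.0)                # "select M1 whenever BF_10 > 3"

for tau in (0.5, 5.0, 50.0):           # increasingly diffuse priors under M1
    ybar_M0 = rng.normal(0.0, sigma / np.sqrt(n), size=n_sims)
    ybar_M1 = rng.normal(mu_under_M1, sigma / np.sqrt(n), size=n_sims)
    false_positive = np.mean(log_bf10(ybar_M0, tau) > threshold)
    true_positive = np.mean(log_bf10(ybar_M1, tau) > threshold)
    print(f"tau = {tau:5.1f}  false positive rate = {false_positive:.3f}"
          f"  true positive rate = {true_positive:.3f}")
```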
This is one reason why the naive replacement of p-values with Bayes factors (looks at psychology) doesn’t help anything.
Bayes factors and Bayes factor estimators are different
As has been mentioned, Bayes factors cannot be evaluated exactly in practice and instead have to be estimated numerically. Unfortunately the marginal likelihood is not easy to estimate using our standard tools like Markov chain Monte Carlo.
Sampling-based computational methods like Monte Carlo, Markov chain Monte Carlo, importance sampling, and the like work best when the expectand (the function whose expectation value is being computed) is relatively uniform compared to the target distribution; if the expectand varies strongly then the sampling-based estimators will suffer from large error.
The two strategies for estimating the marginal likelihood directly are as a prior expectation value of the realized likelihood function,
E_{\text{prior}}[ \pi(\tilde{y} \mid \theta) ]
and a posterior expectation value of the inverse realized likelihood function,
E_{\text{posterior}} \Bigg[ \frac{1}{\pi(\tilde{y} \mid \theta)} \Bigg].
Unfortunately both of these expectands vary strongly when we learn a lot from the data and the posterior distribution isn’t very close to the prior distribution. The variance of both expectands is large, and in many cases actually infinite, in which case the error of the sampling-based estimators is also unbounded and the usual Monte Carlo error quantification breaks down entirely.
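As a hedged numerical illustration (a toy conjugate normal model in 50 dimensions, with all sizes and values invented for the example), here is what these two naive estimators do when the posterior has concentrated well away from the prior; both are far from the analytic answer despite 100,000 exact draws:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Toy conjugate example: theta ~ normal(0, tau) in D dimensions,
# y ~ normal(theta, 1), one observed vector y, so everything is analytic.
D, tau, S = 50, 3.0, 100_000
y = rng.normal(1.0, 1.0, size=D)

# Exact log marginal likelihood: marginally y ~ normal(0, sqrt(1 + tau^2)).
log_ml_exact = stats.norm.logpdf(y, 0.0, np.sqrt(1.0 + tau**2)).sum()

# Estimator 1: prior expectation of the likelihood (arithmetic mean).
theta_prior = rng.normal(0.0, tau, size=(S, D))
log_lik_prior = stats.norm.logpdf(y, theta_prior, 1.0).sum(axis=1)
log_ml_prior = np.logaddexp.reduce(log_lik_prior) - np.log(S)

# Estimator 2: posterior expectation of the inverse likelihood (harmonic mean).
post_var = 1.0 / (1.0 + 1.0 / tau**2)
post_mean = post_var * y
theta_post = rng.normal(post_mean, np.sqrt(post_var), size=(S, D))
log_lik_post = stats.norm.logpdf(y, theta_post, 1.0).sum(axis=1)
log_ml_harmonic = np.log(S) - np.logaddexp.reduce(-log_lik_post)

print(f"exact          {log_ml_exact:10.2f}")
print(f"prior mean     {log_ml_prior:10.2f}")
print(f"harmonic mean  {log_ml_harmonic:10.2f}")
```

The prior-sample estimator typically undershoots because almost no prior draws land where the likelihood is appreciable, while the harmonic-mean estimator typically overshoots because the posterior almost never visits the low-likelihood regions that dominate the inverse-likelihood expectation.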
Thermodynamic methods try to introduce a sequence of intermediate distributions between the prior and the posterior so that neighboring distributions are very close and the ratio of normalizing constants between each pair of neighbors can be estimated accurately. Those intermediate estimates can then be chained together to construct the full marginal likelihood. The difficulty is in constructing a sufficiently nice sequence where the neighbors really are close together; there are many heuristics but all of them tend to be fragile, especially in high dimensions. For some more discussion see [1405.3489] Adiabatic Monte Carlo.
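For intuition only, here is a sketch of thermodynamic integration on the same toy conjugate model as above, where the power posteriors \pi_t \propto \text{prior} \times \text{likelihood}^t happen to be normal and can be sampled exactly; in a realistic model each intermediate distribution would need its own MCMC run, which is where the fragility creeps in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Same toy conjugate model: theta ~ normal(0, tau) in D dimensions and
# y ~ normal(theta, 1), so each power posterior pi_t is normal with
# precision 1/tau^2 + t and mean t * y / (1/tau^2 + t).
D, tau, S = 50, 3.0, 2000
y = rng.normal(1.0, 1.0, size=D)
log_ml_exact = stats.norm.logpdf(y, 0.0, np.sqrt(1.0 + tau**2)).sum()

# Thermodynamic identity: log p(y) is the integral over t in [0, 1] of
# E_{pi_t}[ log likelihood ].
ts = np.linspace(0.0, 1.0, 51) ** 3        # cluster temperatures near the prior
expected_log_lik = []
for t in ts:
    prec = 1.0 / tau**2 + t
    theta = rng.normal(t * y / prec, np.sqrt(1.0 / prec), size=(S, D))
    expected_log_lik.append(stats.norm.logpdf(y, theta, 1.0).sum(axis=1).mean())

e = np.array(expected_log_lik)
log_ml_thermo = np.sum(np.diff(ts) * (e[1:] + e[:-1]) / 2.0)   # trapezoid rule

print(f"exact          {log_ml_exact:10.2f}")
print(f"thermodynamic  {log_ml_thermo:10.2f}")
```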
In practice we have to work with an estimated Bayes factor, and often one with questionable error quantification. Because of this error, decisions based on the estimated Bayes factor can be significantly different from decisions based on the exact Bayes factor, often so much so that they have to be considered as entirely different decision-making processes (at which point many of the nice theoretical properties of Bayes factors can no longer be relied on).
Relying on Bayes factor estimator-based decisions without any kind of calibration then becomes all the more fragile.