Summarize then back-transform vs back-transform then summarize?

Hi all,

When working with posterior draws from a model fitted on a transformed scale (e.g., log or log10), what’s the correct way to summarize predictions on the original scale?

Two common approaches:

  1. Summarize → back‑transform: Compute the median and 95% HDI on the transformed scale, then back‑transform the summary (e.g., using exp() or 10^).
  2. Back‑transform → summarize: Back‑transform all posterior draws first, then compute the median and 95% HDI on the original scale.

These can yield different results when the posterior is skewed. Which approach is generally recommended, especially for reporting estimates and uncertainty intervals? Why?
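
Here’s a minimal sketch of what I mean (simulated lognormal draws; hdi_of() is just a quick helper written for illustration, not a package function):

```r
# Simulated draws on the log scale, standing in for a posterior
set.seed(1)
draws_log <- rnorm(4000, mean = 0, sd = 1)

# Quick-and-dirty HDI: shortest interval containing `prob` of the draws
hdi_of <- function(x, prob = 0.95) {
  x <- sort(x)
  n <- length(x)
  k <- floor(prob * n)
  widths <- x[(k + 1):n] - x[seq_len(n - k)]
  i <- which.min(widths)
  c(lower = x[i], upper = x[i + k])
}

# Approach 1: summarize on the log scale, then back-transform the summary
exp(median(draws_log))
exp(hdi_of(draws_log))

# Approach 2: back-transform every draw, then summarize
median(exp(draws_log))
hdi_of(exp(draws_log))
```

The medians agree (exp() is monotone), but the two HDIs clearly do not.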

In particular, when using emmeans on Bayesian models (with type = "response" or regrid = "response" and epred = TRUE), it appears to always use the second approach. This differs from the frequentist behavior, where type = "response" and regrid = "response" yield different results. Is there a way to obtain the first approach in the Bayesian setting, aside from manually back-transforming the estimates?
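
(By “manually back-transforming” I mean something like the following hypothetical sketch, assuming a brms fit `fit` with a log link and a data frame `newdata`, and using quantile intervals for brevity:)

```r
library(brms)

# Draws of the linear predictor, on the link (log) scale: draws x rows
linpred <- posterior_linpred(fit, newdata = newdata)

# Approach 1 by hand: summarize on the link scale, then back-transform
med_link <- apply(linpred, 2, median)
ci_link  <- apply(linpred, 2, quantile, probs = c(0.025, 0.975))

data.frame(
  estimate = exp(med_link),
  lower    = exp(ci_link[1, ]),
  upper    = exp(ci_link[2, ])
)
```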

See also the emmeans transformation documentation:
https://rvlenth.github.io/emmeans/articles/transformations.html#regrid

Thanks in advance!

1 Like

I think you almost always want to default to doing the back-transformation as the final step. This also reflects how you model: you flexibly model a linear predictor and then back-transform it to the desired scale (e.g., positive values, probabilities). Paul Bürkner also writes about this in one of the draft chapters of the brms book.

It might be worth checking out the posterior and tidybayes R packages where you can really nicely work with the rvar datatype using fitted objects.
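
For example, something like this (a small sketch with simulated draws rather than a fitted object):

```r
library(posterior)

x <- rvar(rnorm(4000))  # a scalar rvar holding 4000 draws on the log scale

# Math on rvars is applied draw-wise, so this back-transforms every draw
y <- exp(x)

median(y)                               # summarize last
quantile(draws_of(y), c(0.025, 0.975))  # interval from the raw draws
```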

1 Like

Back‑transform → summarize is the appropriate approach in most cases, since summary statistics are generally not invariant under transformation (see, for instance, Wang et al. 2018).

Wang X, Yue YR, Faraway JJ (2018) Bayesian Regression Modeling with INLA. CRC Press, Boca Raton, FL

2 Likes

Chiming in since we have two answers saying opposite things.

If what you want is a summary of the back-transformed output, then you have to back-transform first and summarize last, as @MilaniC says. The universal rule of thumb in the analysis of MCMC draws is to summarize at the very end. Every computation that you can perform iteration-wise (i.e., draw-wise) properly preserves and propagates the posterior through the computation. This lets us propagate posterior uncertainty exactly (up to the MCMC error in the draws themselves) and is unique to the MCMC setting, which may be why frequentist software sometimes does something different.

Computations that are performed on summary statistics rather than iteration-wise do not achieve this in general. One notable exception is when the transformation is monotonic and the summary gives quantiles. Then the transforms of the summary will be identical to the summary of the transforms (up to variation in how the estimate is interpolated between two adjacent draws: the median of a back-transformed pair won’t be the same as the back-transformed median of the pair, but as long as there are enough draws that the MCMC error in the estimates is small, this won’t be a practical concern). For example, if you want to give the central 95% interval based on the 2.5% and 97.5% quantiles, it doesn’t matter whether you summarize or back-transform first. But if you want to give the 95% HDI, it does matter.
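
To make that concrete (reusing the hdi_of() helper sketched in the opening post):

```r
set.seed(2)
draws <- rnorm(4000)  # draws on the log scale

# Central 95% interval: commutes with the monotone back-transform
exp(quantile(draws, c(0.025, 0.975)))
quantile(exp(draws), c(0.025, 0.975))

# 95% HDI: does not commute
exp(hdi_of(draws))
hdi_of(exp(draws))
```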

2 Likes

What are some examples of when you’d want summaries of back-transformed output? I’d imagine the vast majority of use cases will want to carry out any computations with the posterior draws, as @jsocolar nicely described, and summarise at the very end. Again, I think it makes sense to think about running your model in reverse to get to posterior predictions, which applies the inverse link as a final step before plugging into the response distribution.
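
Something like this toy sketch (made-up numbers, assuming a Poisson likelihood with a log link):

```r
# The model "in reverse", draw-wise:
# linear predictor -> inverse link -> response distribution
eta   <- rnorm(4000, mean = 1, sd = 0.3)  # stand-in for linear-predictor draws
mu    <- exp(eta)                         # inverse link, applied per draw
y_rep <- rpois(length(mu), lambda = mu)   # posterior predictive draws

median(mu)     # summarize the expected response at the very end
median(y_rep)  # likewise for the predictive distribution
```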

I think there’s some kind of semantic confusion here. If you have some output on the link scale and want to understand what’s happening on the back-transformed data scale, you back-transform first and then summarize. Always summarize last, after doing whatever (back-)transformations you are interested in.

3 Likes

I’ll just focus on the posterior mean here, because the median and quantiles commute with monotonic transformations, as discussed above. You can treat the variance in a similar manner to the mean.

There’s a second way to look at this problem that I find quite helpful:

We usually define the mean of a random variable as \int x p(x) dx. This is not independent of the parametrization, so the result will change if we rescale the space, for instance with a log transform.

But we can also use the Fréchet mean: the Fréchet mean is the point that minimizes the expected squared distance to all other points: \text{Fréchet mean} = \text{argmin}_p \int d(p, x)^2 p(x) dx. Interestingly, this definition does not depend on the parametrization (i.e., a transformation), but instead on how we define the distance between two points, d(x, y). If you simply define the distance as d(x, y) = |x - y| as usual, then the Fréchet mean is exactly the same thing as your normal mean. But each transformation that you might do on your posterior corresponds to a choice of distance function: if you don’t transform, the normal mean gives you the Fréchet mean with distance d(x, y) = |x - y|; if you do a log transform, the mean is the Fréchet mean with distance d(x, y) = |\log(x) - \log(y)|, and so on.
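
A quick numeric check of this (simulated positive draws; frechet() is just an illustrative helper):

```r
set.seed(3)
x <- exp(rnorm(4000))  # positive, right-skewed draws

# Minimize the expected squared distance over candidate points p
frechet <- function(dist, draws) {
  optimize(function(p) mean(dist(p, draws)^2), interval = range(draws))$minimum
}

frechet(function(p, q) abs(p - q), x)            # ~ mean(x)
mean(x)

frechet(function(p, q) abs(log(p) - log(q)), x)  # ~ the geometric mean
exp(mean(log(x)))
```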

With that, we can rephrase your question as:
“What distance function makes sense when I compute the mean (or variance, etc.) of my posterior?”

Let’s look at two examples:

  1. The quantity of interest is the amount of CO2 some process releases. Then you might think of the distance between two emission values as “how much worse the bigger one is than the smaller one”. Here, a distance function of d(x, y) = |x - y| seems quite sensible to me. You are saying that the distance between a 10t and an 8t emission is 2t, so exactly the same as the distance between a 4t and a 2t emission. A log transform sounds rather strange to me. With it we would be saying that an emission difference between 1t and 10t is equal to an emission difference between 10t and 100t.
  2. The quantity of interest is gene expression fold-change in response to a treatment, so it measures how many times more transcripts of a gene a cell produced under some treatment. A 2-fold increase should be the same effect size as a 2-fold decrease (or a 0.5-fold increase). So we would want d(1, 2) = d(0.5, 1). This is clearly not the case on the natural scale with d(x, y) = |x - y|, but it is the case with a log transformation, so d(x, y) = |\log(x) - \log(y)|.

So, in practice, choose your summarization approach based on which distance metric makes sense for your scientific question: the original scale for additive effects, the log scale for multiplicative/proportional effects. While this choice is in principle independent of your modeling decisions, in practice you will often end up using similar transformations for both modeling and summarization, since both rest on the same underlying understanding of what makes a meaningful metric.

2 Likes

Although the other replies already cover the technical aspects quite comprehensively, I think there’s some nuance that should be addressed here. For instance, you state that you are working with posterior draws from a model “fitted on a transformed scale”.

It’s not clear to me what that means; maybe you need to be more precise about why you are doing it and what it means for your model/inference. Any transformation of the original parameters (however they are chosen) does not necessarily need to be back-transformed, and conversely, you can summarize any transformation of the parameters. In a way this implies, as @jsocolar mentioned, that the order is transform (“back”-, forward-, or otherwise) and then summarize. But it goes further than that: back-transforming (or, I would rather say, not transforming) is a choice you make depending on what you are trying to look at, not a matter of a right or wrong order (with care about what transforming a summary does to it, as others also mentioned).