Multiverse analysis - concatenating posteriors?


I gave a student presentation yesterday on what some of Andrew’s thoughts about forking paths, multiverse analysis, and type M/S errors mean for the work being done at our school. As a result, I have been thinking about how I can do more to integrate these ideas into my own work.

The multiverse paper was written from a p-value-focused mindset, and I’m wondering how best to adapt it to the estimation focus I prefer, which comes from a decision-analysis standpoint. Does anyone here think it makes sense to combine the posteriors from a multiverse analysis in the same way we combine results across imputed datasets in multiple imputation (that is, concatenate the posterior draws together)? My goal would be to contrast the results of a primary analysis against the results that would arise from other credible data-processing decisions. Normally in my field we approach this through sensitivity analysis and show each analysis separately, but the number of analyses here is obviously much larger.

Edit: Added the multiverse paper
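
For concreteness, the concatenation idea in the question can be sketched in a few lines of Python. This is only an illustration: the draws are simulated stand-ins rather than output from a real model, and `concatenate_posteriors` is a hypothetical helper name. Concatenating the same number of draws from each analysis amounts to an equally weighted mixture of the posteriors.

```python
import random
import statistics

def concatenate_posteriors(draws_per_analysis):
    """Pool draws from several analyses into one equally weighted mixture.

    Each inner list holds posterior draws of the same quantity from one
    analysis. Draw counts must match, otherwise analyses with more draws
    would implicitly get more weight.
    """
    n_draws = {len(d) for d in draws_per_analysis}
    if len(n_draws) != 1:
        raise ValueError("each analysis must contribute the same number of draws")
    return [x for draws in draws_per_analysis for x in draws]

# Simulated stand-ins for posterior draws under three data-processing choices
# (the means and sds are made up for illustration).
rng = random.Random(1)
universes = [
    [rng.gauss(mu, sd) for _ in range(4000)]
    for mu, sd in [(0.30, 0.10), (0.10, 0.12), (0.45, 0.08)]
]
pooled = concatenate_posteriors(universes)

within_sd = statistics.mean(statistics.stdev(d) for d in universes)
pooled_sd = statistics.stdev(pooled)
print(f"average within-analysis sd: {within_sd:.3f}")
print(f"pooled sd:                  {pooled_sd:.3f}")
```

The pooled spread is wider than the typical within-analysis spread because it also reflects how much the estimate moves between data-processing choices.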


Without knowing the specifics – not familiar with multiverse analysis – I would look at logarithmic pooling aka log-linear mixtures, but I have been doing research on it, so I’m as biased as they come. Happy to engage in further discussion if anybody feels this is in the right direction.


I have linked the article in question in an edit, sorry about that.

Thanks for sharing these links. I’m glad I posted, because I’m not sure I would have come across this otherwise. If I understand correctly, this would have the extra benefit of letting me assign a weight to each analysis, which seems attractive in cases where certain data-processing decisions are more credible than others. Is that right?


Yes, that seems about right. In fact, some authors suggest interpreting the weights as the (relative) reliabilities of the experts (distributions). Of course, my own research has been on how to assign these weights, so I think there is a lot of interesting stuff to come from the marriage of multiverse analysis and logarithmic pooling (LP). Interested in hearing from others, though.
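
To make the role of the weights concrete, here is a minimal sketch of logarithmic pooling under a strong simplifying assumption: each analysis's posterior is treated as approximately normal. In that case the pool has a closed form, with precision equal to the weighted sum of the component precisions. The function name `log_pool_normals` and all numbers are hypothetical.

```python
import math

def log_pool_normals(means, sds, weights):
    """Logarithmic pool of normal densities: prod_k N(mean_k, sd_k)^w_k, renormalized.

    For normal components the pool is again normal. Its precision is the
    weighted sum of the component precisions, and its mean is the
    corresponding precision-weighted average of the component means.
    """
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    precisions = [w / sd**2 for w, sd in zip(weights, sds)]
    pooled_precision = sum(precisions)
    pooled_mean = sum(p * m for p, m in zip(precisions, means)) / pooled_precision
    return pooled_mean, math.sqrt(1.0 / pooled_precision)

# Made-up posterior summaries from three data-processing choices.
means = [0.30, 0.10, 0.45]
sds = [0.10, 0.12, 0.08]

equal = log_pool_normals(means, sds, [1 / 3, 1 / 3, 1 / 3])
favor_primary = log_pool_normals(means, sds, [0.6, 0.2, 0.2])
print("equal weights:      mean=%.3f sd=%.3f" % equal)
print("favor analysis one: mean=%.3f sd=%.3f" % favor_primary)
```

Putting all the weight on one analysis returns that analysis's posterior unchanged, which is one way to read the weights as relative reliabilities.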


How about stacking, as in "Using Stacking to Average Bayesian Predictive Distributions"?


@avehtari, do you think logarithmic pooling (LP) could be used as well or do you think stacking is definitely the way to go? I’m curious because LP enjoys a few nice theoretical properties (e.g. it is the only “externally Bayesian” way of aggregating opinions/distributions) but stacking seems to have good empirical performance to go with some of the nice theoretical justifications you guys also give in the paper.


The link to logarithmic pooling asks for a login.

This link worked, and I read that paper.

I don’t know, but I see two differences:

  1. Stacking uses a linear mixture, while log-pooling uses a log-linear mixture.
  2. Stacking focuses on prediction, while the log-pooling above focuses on preserving information from the experts?
    I guess it would be possible to do stacking with a log-linear mixture, too, but I don’t see right away whether that would be sensible.
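
For reference, the two ways of combining distributions contrasted in the list above can be written side by side. The notation is an assumption made here for illustration: $p_k$ is the distribution from analysis $k$ and $w_k$ its weight. In stacking the $p_k$ are predictive distributions rather than parameter posteriors.

```latex
\pi_{\text{lin}}(\theta) = \sum_{k=1}^{K} w_k \, p_k(\theta),
\qquad
\pi_{\text{log}}(\theta) =
  \frac{\prod_{k=1}^{K} p_k(\theta)^{w_k}}
       {\int \prod_{k=1}^{K} p_k(\theta')^{w_k} \, d\theta'},
\qquad
w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1.
```

The linear pool can be multimodal when the $p_k$ disagree, whereas the log-linear pool concentrates where all components with positive weight place mass.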


Sorry about that, @avehtari, link is now fixed (still paywalled, though ._.).

I think your assessment is pretty much spot on. External Bayesianity, for instance, is all about preserving information across sources/experts/distributions.

That’s my feeling too (“dunno”). Question is whether it would be a worthwhile pursuit to look into, say, a comparison between linear and log-linear aggregation for combining posteriors – the multiverse analysis seems like a nice use case. First caveat I think would be defining an optimality criterion, since using prediction squared error, say, would bias things in favour of stacking (linear “pooling”).

Thanks for engaging in the discussion.


I have had a chance to read these papers now, and I am realizing I may be missing something. Both approaches seem to give you some sort of weighted posterior, but I’m wondering whether this makes sense in the context I was originally thinking of. Logarithmic pooling seems like a closer fit, in that you may have stronger evidence that one realization of a dataset is more credible than another, but I’m struggling conceptually with this weighting being based on predictive error. @avehtari, maybe you can help me understand what I’m missing.

Originally I was thinking it would make sense to treat each dataset as equally justifiable and then combine all the posteriors to get a single summary that captures the uncertainty in the estimate arising from data-processing decisions. The original paper just plotted the frequency of different p-values, and the only other workable option I had thought of was a forest plot of the estimates from each dataset. In the original paper I think they had something like 150+ unique datasets, so that option seems too busy to be informative.


If I understood the multiverse paper correctly, they start with a single dataset, and in your case the different datasets given to Stan would be transformations of that one dataset. I think this is similar to having a set of models with different variables, variable transformations, interactions, and non-linearities, as in the stacking paper, where we have different models for the arsenic well data (section 4.6). In that case stacking would be a sensible thing to do. Stacking is also a good choice because it avoids problems when some datasets happen to be very similar to each other (compare fig 2c in the stacking paper).

On the other hand, if you had independent datasets (as in meta-analysis or parallelized computing), then you might want to do something else.
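
The stacking suggestion above can be illustrated with a small Python sketch. Several things here are assumptions rather than anything from the thread or the paper: the pointwise predictive densities are computed from made-up numbers (in practice they would be leave-one-out predictive densities from each fitted analysis), every analysis is assumed to predict the same observations on the same scale, and the optimizer is a simple EM-style fixed-point update chosen because it needs no extra libraries, not because it is how stacking is usually implemented.

```python
import math

def normal_pdf(y, mu, sd):
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def stacking_weights(pointwise_density, n_iter=2000):
    """Weights maximizing sum_i log(sum_k w_k * p[i][k]) over the simplex.

    pointwise_density[i][k] stands for the leave-one-out predictive density
    of observation i under analysis k. The update is the EM step for mixture
    proportions with fixed components, started from uniform weights.
    """
    n = len(pointwise_density)
    k = len(pointwise_density[0])
    w = [1.0 / k] * k
    for _ in range(n_iter):
        new_w = [0.0] * k
        for row in pointwise_density:
            mix = sum(wj * pj for wj, pj in zip(w, row))
            for j in range(k):
                new_w[j] += w[j] * row[j] / mix / n
        w = new_w
    return w

# Made-up observations and two hypothetical analyses, A and C.
ys = [-0.5, 0.1, 0.3, -1.0, 0.8, 2.2, 1.9, 2.5]
dens_a = [normal_pdf(y, 0.0, 1.0) for y in ys]
dens_c = [normal_pdf(y, 2.0, 1.0) for y in ys]

# Two distinct analyses.
w2 = stacking_weights([[a, c] for a, c in zip(dens_a, dens_c)])

# Same, but with an exact duplicate of analysis A added.
w3 = stacking_weights([[a, a, c] for a, c in zip(dens_a, dens_c)])

print("A, C:      ", [round(w, 3) for w in w2])
print("A, A', C:  ", [round(w, 3) for w in w3])
```

In this toy setup the duplicated analysis shares the weight that analysis A had on its own, so adding near-copies does not pull the combined distribution toward them. With equal-weight concatenation, by contrast, each extra copy adds another full share of draws.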