How to think about prior effects on posterior distributions

The likelihoodists and frequentists have this approach of quantifying parameter uncertainty by treating the parameter as an estimator with a distribution obtained by bootstrap/jackknife resampling and successive refittings. The procedure is quite appealing because the interpretation is intuitive: the parameter estimate has an inherent variance which directly depends on the sample drawn. What has this equivalently “objective” quality in the Bayesian world?
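
For readers who have not seen the resampling procedure written out, here is a minimal sketch of the bootstrap idea described above, using only the Python standard library. The data values and the function name `bootstrap_se` are made up for illustration; the estimator here is just the sample mean, so the result can be compared against the textbook standard error.

```python
import random
import statistics

def bootstrap_se(sample, estimator, n_resamples=2000, seed=0):
    """Bootstrap standard error: re-apply the estimator to resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)

# Hypothetical data, invented for this example
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]

se_boot = bootstrap_se(data, statistics.mean)
se_formula = statistics.stdev(data) / len(data) ** 0.5  # s / sqrt(n)

print(f"bootstrap SE of the mean: {se_boot:.3f}")
print(f"textbook SE of the mean:  {se_formula:.3f}")
```

The two numbers should be close but not identical: the bootstrap spread depends only on the sample drawn and on the resampling, which is the "objective" quality the question is asking about.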

Lots of us use weakly informative or convenient priors (e.g. symmetric ones, Beta(1,1), etc.). We might eyeball graphs and then construct informative priors (which is cheating) or mess about with our priors as we fit the model to get better results (which is also cheating). However, all of this directly affects the variance of the estimated parameters. Given this, when presented with a Bayesian analysis and wishing to decide whether it’s any good, how do we go about deciding whether the posterior distributions of the parameters are reasonable, especially when we do not share the same convictions about the priors chosen? In the wild, how much time do Bayesian analyses actually dedicate to justifying priors or to sensitivity analysis of them?

In the following I shall assume that you mean cheating as in violating the Likelihood Principle (LP).

Depends on exactly what graphs you’re talking about. Prior calibration is not cheating.

Again, got to be a tad careful here. Whilst changing the prior after “seeing” the data is technically violating the LP, it is important to have in mind that prior and likelihood are inextricably linked, and often the prior can be seen as a regularisation of the likelihood. So in actual practice it is common and, in my not humble opinion, absolutely fine to adjust your prior if you can transparently show it doesn’t provide enough regularisation. The key word here is transparency. Bayesian analyses are in a way more open because one is forced to make the choice of prior explicit and justified. At least in theory.
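
One way to make the "prior as regularisation" reading concrete is a special case that is assumed here purely for illustration (it is not taken from the post): a linear model with Gaussian noise of known variance $\sigma^2$ and independent $\mathrm{Normal}(0, \tau^2)$ priors on the coefficients $\beta$. Under those assumptions the log posterior is the log likelihood plus a quadratic penalty, and the posterior mode coincides with a ridge-type estimate:

```latex
\begin{aligned}
\log p(\beta \mid y) &= \log p(y \mid \beta) + \log p(\beta) + \text{const} \\
&= -\frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert^2 - \frac{1}{2\tau^2}\,\lVert \beta \rVert^2 + \text{const}, \\
\hat\beta_{\text{MAP}} &= \arg\min_\beta \; \lVert y - X\beta \rVert^2 + \lambda\,\lVert \beta \rVert^2,
\qquad \lambda = \frac{\sigma^2}{\tau^2}.
\end{aligned}
```

A tighter prior (smaller $\tau$) means a larger penalty $\lambda$, so "the prior doesn't regularise enough" translates, in this toy case, into "$\tau$ is too large for the likelihood at hand".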

This is a HUGE question, wrapped up in a few lines. “Any good” is not sharp enough a definition for a good answer, but along general lines, I look at whether the prior predictives make sense (in terms of substantive knowledge), whether the posterior predictives are consistent with the observed data, and how the posterior differs from the prior for a given parameter. In summary, loads of stuff.
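
As a rough illustration of the first two checks, here is a sketch of prior and posterior predictive simulation for a toy beta-binomial model. The model, the Beta(2, 2) prior, the data (14 successes in 20 trials) and the helper name `predictive_draws` are all assumptions invented for this example; a real analysis would use the actual model and more informative summaries than a single tail probability.

```python
import random

def predictive_draws(a, b, n_trials, n_draws, rng):
    """Draw theta ~ Beta(a, b), then a success count ~ Binomial(n_trials, theta)."""
    draws = []
    for _ in range(n_draws):
        theta = rng.betavariate(a, b)
        draws.append(sum(rng.random() < theta for _ in range(n_trials)))
    return draws

rng = random.Random(1)
n_trials, observed = 20, 14   # hypothetical data
a, b = 2, 2                   # hypothetical weakly informative prior

# Prior predictive: which outcomes does the prior alone consider plausible?
prior_pred = predictive_draws(a, b, n_trials, 4000, rng)

# Posterior predictive: conjugate update, then simulate replicated data
post_pred = predictive_draws(a + observed, b + n_trials - observed, n_trials, 4000, rng)

# Fraction of replicated data sets at least as extreme as the observed count
ppp = sum(y >= observed for y in post_pred) / len(post_pred)

print("prior predictive range:", min(prior_pred), "to", max(prior_pred))
print(f"posterior predictive P(y_rep >= observed): {ppp:.2f}")
```

A tail probability very close to 0 or 1 would suggest that the fitted model struggles to reproduce the observed count; values in the middle say only that this particular check found nothing wrong.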

Mileage may vary. I have been known to be quite nitpicky with others’ priors. Just ask @Bob_Carpenter how I’ve been nagging him about his most recent paper. As you might guess, I’m even more careful with priors in my own work, and dedicate a sizeable portion of my setup to eliciting and checking priors.

The very use of the term “objective” is questionable to start with. @andrewgelman has two papers on this topic which are worth reading: one with Cosma Shalizi and one with Chris Hennig. For a substantially different take, see this paper by James Berger.

Hey thank you @maxbiostat for the thought provoking reply.

Yes.

For example a scatter plot of the data. I think it’s cheating because it’s akin to looking at the data.

What I’m specifically interested in is evaluating a parameter’s posterior distribution in order to decide whether it’s sensible. Implied therein is that the parameter has some theoretical or physical interpretation, and the hope is that the scientist can provide such steps as to arrive at the same posterior conclusion. But unlike in bootstrapping, we are not evaluating just how sensitive the model fitting is to resampling. In the Bayesian framework we are modifying a prior… what prior and where it came from seems to make all the difference. I understand that in simple examples one could easily give a prior, but in more complex examples intuiting a prior is difficult, and it seems to me that if I cannot agree on the prior with the scientist then I cannot interpret the posterior without some kind of accompanying sensitivity analysis and a conclusion such as “the prior doesn’t really matter since you get pretty much the same conclusion wherever you start”. But in that case, it’s hardly Bayesian, right? You might as well have done it in a way without resorting to priors in the first place.
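
To show what such an accompanying sensitivity analysis might look like in the simplest possible setting, here is a sketch for a binomial proportion with conjugate Beta priors. The four candidate priors, the data counts and all function names are hypothetical choices made for illustration only.

```python
def beta_posterior(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial data gives Beta(a + s, b + f)."""
    return a + successes, b + failures

def beta_mean(a, b):
    return a / (a + b)

def beta_sd(a, b):
    return (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5

# Hypothetical candidate priors that two analysts might disagree over
priors = {
    "flat Beta(1, 1)": (1, 1),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "sceptical Beta(2, 8)": (2, 8),
    "optimistic Beta(8, 2)": (8, 2),
}

def sensitivity(successes, failures):
    """Posterior mean and sd of the proportion under each candidate prior."""
    result = {}
    for name, (a, b) in priors.items():
        post_a, post_b = beta_posterior(a, b, successes, failures)
        result[name] = (beta_mean(post_a, post_b), beta_sd(post_a, post_b))
    return result

def spread(result):
    """Range of posterior means across priors: a crude sensitivity summary."""
    means = [mean for mean, _ in result.values()]
    return max(means) - min(means)

small = sensitivity(7, 3)      # 10 hypothetical trials
large = sensitivity(700, 300)  # 1000 hypothetical trials

for label, result in (("n = 10", small), ("n = 1000", large)):
    print(label, f"spread of posterior means = {spread(result):.3f}")
    for name, (mean, sd) in result.items():
        print(f"  {name:<24} mean = {mean:.3f}, sd = {sd:.3f}")
```

With 10 trials the posterior mean moves substantially with the prior, whereas with 1000 trials the candidates nearly agree. Which of those regimes a given analysis is in is exactly what the sensitivity analysis is meant to reveal.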

…exactly what I worry about.

My relevant background is in ML, where objective-function-driven optimisation problems and out-of-sample testing are par for the course. I think the trouble here is that starting out with a model, fitting it, doing PPC/LOO/etc, deciding it’s not right, so changing assumptions and so on, is at least multiple hypothesis testing and at worst curve fitting, and therefore feels like it ought to be corrected for somehow. I presume Bayesian authors probably don’t tend to say “we started with something simple and then just kept trying stuff until the PPCs looked pretty good” :-) With regard to “transparency”, in ML it’s simply a matter of publishing the algorithm, how you trained it and how it performed out-of-sample. Since models are typically not specific to a particular problem, no-one is worried that you smuggled anything into the model, although they may worry that you smuggled something into the training of the model, but if you did it’s usually plain to see. The same cannot be said about priors in an analysis-specific model.

Besides (and with direct parallels to ML), that the model fits and even that the model predicts well (or just fares well out of sample) doesn’t imply that its parameters are physically or theoretically significant. That’s the magical work of the prior in a “principled” Bayesian analysis (or so it seems to me). I mean, ultimately a model and its structure in a Bayesian analysis is part of the prior; it just doesn’t all get updated. If we were to set out a model, and priors reflecting our knowledge, and then rigorously substantiated that and published it, and then after that we went and collected data to feed the model and calculated the posteriors, it’s easy to see how that would be different; principled; scientific. And how it may give the posteriors theoretical and physical interpretation. But when it’s all considered at once, iteratively until it all makes sense together, what separates that from any other way of tackling a modelling problem by optimising a (PPC driven in this case) objective function?

These look right on the money. Thanks, I’ll have a read.

In lieu of a response, I’m gonna tag @betanalpha for him to share his thoughts. After that and after you’ve glanced at the papers I suggested, I can weigh in with some more ramblings if we decide it’s worth it.

Thanks max, I’ll post a summary of my understanding here as soon as I’ve read the papers.

I read the Gelman and Shalizi paper. Generally speaking I like its style and stance, and most of it wasn’t surprising to me (I majored in Analytical Philosophy). Some of the references are great… and it’ll definitely help me fill in holes in my knowledge.

I think that, in a nutshell, the paper says that prior and posterior are just regularisation devices; that misspecification and the plurality of a priori possibilities nuke all illusions of priors having anything to do with “prior knowledge”; and that really we proceed hypothetico-deductively, falsifying models using PPCs and then altering or expanding them to better correspond to the world until they are “robust enough”. Ultimately, all models (say the authors, about the social sciences) are false and will be falsified given enough effort. The paper also says that the main value of BDA is richness of expression and that PPCs are fundamentally non-Bayesian. The authors say that proofs of consistency for Bayesian models tend to imply the existence of a non-Bayesian model for which the same conclusion can be drawn.

In answer to my own question, the way then to evaluate a Bayesian analysis generally is to consider how it was expanded to get to its present stage, and what alternatives were tried and found lacking or excessively false. All this in the language of Mayo’s error probes, Cox’s graphical checks, and other existing methodologies rebranded for BDA. The real value and knowledge is actually extracted from breaking models and so that’s really the important thing to convey. Why should I be convinced by the posterior of some parameter in the model? Because I can’t think of a reason not to be convinced by it.

I think that as an answer to my question, it is robust and sensible and acceptable. It relegates Bayes to the vagueness of any other methodology and pragmatically implies that it ought be used when it fits best (much like anything else) but occupies no philosophical high ground.

Thanks for the nice summary @emiruz.

I think that probably goes a little too far (but just a little). I think it’s hard to argue that all priors have nothing to do with prior knowledge, but it’s true that it is hard to pin this down rigorously, at least to a degree that would satisfy an analytical philosopher ;). In some sense (and yes, this is a bit hand-wavy) we must be able to incorporate prior knowledge into choices of prior distributions (and for that matter every other aspect of the model) because otherwise how is it that we’re making any sort of informed choice at all? What else could we be relying on other than prior knowledge, even if that prior knowledge is hard to define? We certainly didn’t evolve as a species to have an innate ability to specify decently performing statistical models without relying on acquired knowledge, so we must be using knowledge we’ve obtained previously, even if that knowledge amounts to what would be a reasonable way to regularize (that does, after all, require a prior sense of what is too extreme).

But, having said all that, I basically agree with you.

I tend to agree with this more pragmatic view. It might just be the scientist in me talking, though.

Here’s a quote from @richard_mcelreath in that direction:

This is from Statistical Rethinking, 1st edition Chapter 2, page 20.

All that said, I do think we’d do well to read De Finetti, Lindley, Jaynes and others and consider the theoretical foundations deeply, even if we need to make some theoretically unjustified adjustments in order to be able to apply Bayesian methods in practice.

@maxbiostat Thanks for sharing that quote from Statistical Rethinking (there’s lots of good stuff to quote from that book). I must have read that when I went through the book but it’s been a while.

Agreed. Big fan.

Thanks @maxbiostat, @jonah for engaging!

Hi @emiruz Thanks for posting those thoughts and questions!

If we were to set out a model, and priors reflecting our knowledge, and then rigorously substantiated that and published it, and then after that we went and collected data to feed the model and calculated the posteriors, it’s easy to see how that would be different; principled; scientific. And how it may give the posteriors theoretical and physical interpretation. But when it’s all considered at once, iteratively until it all makes sense together, what separates that from any other way of tackling a modelling problem by optimising a (PPC driven in this case) objective function?

I think the challenge with your desire for probabilities to make “physical” sense is that there is no physical basis for them. Probability is not a physical thing out there in the world and much of our uncertainty is actually unrelated to sampling variability.

Laplace’s discussion of Bayes got straight to this point—he said, what if you have this coin and you think it may be asymmetrically shaped or weighted, what’s the probability of heads? Well first it is 0.5 because you don’t know which direction it might be weighted in, it could go either way. That value contradicts what you would call the physical probability and what Laplace called the “chances” of heads on a good toss. Once you start observing outcomes you’ll update that probability.
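
Laplace's coin can be written down in a few lines if one is willing to assume a Beta prior on the coin's bias (the Beta family and the function name `prob_next_heads` are assumptions made here for illustration; the uniform case, Beta(1, 1), gives what is usually called Laplace's rule of succession). Any prior that is symmetric about 0.5 gives a probability of heads of exactly 0.5 before any tosses are seen, even though nobody believes the coin is fair.

```python
from fractions import Fraction

def prob_next_heads(a, b, heads=0, tails=0):
    """Predictive P(next toss is heads) under a Beta(a, b) prior on the coin's bias.

    This equals the posterior mean of the bias after the observed tosses.
    """
    return Fraction(a + heads, a + b + heads + tails)

# Before any data: symmetric ignorance about the direction of the bias
before = prob_next_heads(1, 1)

# After 8 heads and 2 tails (hypothetical outcomes) the probability is updated
after = prob_next_heads(1, 1, heads=8, tails=2)

print("before any tosses:", before)   # 1/2
print("after 8 H, 2 T:   ", after)    # 3/4
```

The 1/2 here describes the state of information, not the coin's physical "chances", which is the distinction being drawn above.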

What this implies, and what Jeffreys, Keynes, Cox and Jaynes argued, is that probability is a logical relation between propositions. It’s about making good sense of information. I think the opening pages to Cox’s Algebra of Probable Inference are some of the best (most concise) on the topic.

When Jeffreys was setting out some initial rules for coming up with a theory of probability he argued: “any rule given must be applicable in practice” and thus “must not involve an impossible experiment.” When you start imagining that your data is a sample from an infinite population of possible samples and then compare it to that population of samples (which you’ve just conjured up in your mind or on your computer) you’re not really respecting the likelihood principle.

To examine whether a posterior distribution is reasonable, I think one good question to ask is: does the model incorporate all of the sources of uncertainty we have about this hypothesis or these parameters? If the observations are from a survey, for example, is our uncertainty about the data being incorporated into the inference? Does the model take any uncertain quantities or decisions and plug them in, without considering alternatives?

In practice, the bootstrap doesn’t always get what you’re looking for. For instance, I’m looking at measurement error models and a popular one — SIMEX — uses a jackknife procedure, and only in limited situations can it come up with proper uncertainty about repeated sampling.

@cmcd thanks for the reply! I think more of the approach I’m personally attracted to is available in non-parametric statistics and computational mechanics (e.g. Shalizi’s CSSR model), in which we are deriving something from the data that is not model based. My intuition says that 10 years from now that’s what statisticians will do, precisely because of the points we’ve touched on in this thread.

Sorry if I was vague, I wasn’t intending to make any point regarding the existence of probabilities; just that posteriors of theoretical or physical quantities can have a theoretical or physical interpretation, and that the posterior means something: namely, it represents how our belief should change as a result of the data.

I guess nothing ever does :-) In picking the bootstrap as an example, I was trying to highlight that (when it applies) it has a singular interpretation.

Firstly, you need to compare apples to apples. The jackknife and bootstrap are approximation techniques to try to understand the calibration of an estimator (how it behaves across an ensemble of data from the same data generating process). These approximations are accurate only in very limited circumstances, and in particular one has to verify the accuracy in order to have any faith in the “generic” results.
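
One way to "verify the accuracy" in a case where the data generating process is known is to simulate it repeatedly and count how often a bootstrap interval actually contains the true value. The sketch below does this for a percentile interval around the mean of a small, skewed (exponential) sample. Everything here is an assumption chosen for illustration: the exponential process, the sample size of 10, the nominal 90% level, and the function names.

```python
import random
import statistics

def percentile_interval(sample, rng, n_boot=500, alpha=0.10):
    """Bootstrap percentile interval for the mean at nominal level 1 - alpha."""
    n = len(sample)
    means = sorted(statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot))
    lo_idx = int(alpha / 2 * n_boot)
    hi_idx = n_boot - 1 - lo_idx
    return means[lo_idx], means[hi_idx]

def coverage(true_mean=1.0, n=10, n_sims=300, seed=0):
    """Fraction of simulated data sets whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.expovariate(1 / true_mean) for _ in range(n)]
        lo, hi = percentile_interval(sample, rng)
        hits += lo <= true_mean <= hi
    return hits / n_sims

cov = coverage()
print(f"nominal coverage 0.90, empirical coverage {cov:.2f}")
```

If the empirical coverage falls noticeably short of the nominal level, the "generic" interpretation of the bootstrap interval does not hold for that estimator and sample size, which is the kind of verification being asked for.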

Regardless of that initial comparison, when it comes to prior modeling I do think that there is plenty of confusion out there regarding the interpretation of a prior model and, in particular, its relationship to one’s domain expertise. All of the references espousing the prior as an exact realization of one’s domain expertise can be extremely frustrating given how implicit and qualitative our domain expertise seems to be when we actually reflect on it. It wasn’t until I read I.J. Good’s perspective on iterative domain expertise elicitation, and on principled prior models being self-consistent approximations of one’s domain expertise, that I was able to put everything into a self-consistent framework. See the discussion in https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#11_domain_expertise_consistency for why iteratively updating your prior isn’t “cheating” provided that you do it carefully.

I see @emiruz I probably misunderstood some of your points there. Thanks for clarifying and adding

@betanalpha what do you think about the prior fiddling ambitions in this example:

https://discourse.mc-stan.org/t/correlation-between-bias-and-true-parameter-values

Particularly:

I don’t see any problem with experimenting with prior models and their interaction with an observational model in a simulation study. If anything this can be interpreted as an investigation into which prior model properties are important for a given observational model, which then sets the priorities for a proper domain expertise elicitation.

Again, as Good realized – and I tried to capture in my workflow case study linked above – it is unrealistic to expect that we can write down a prior model that captures all of our domain expertise immediately. Domain expertise elicitation takes time and effort, and in order to build a useful prior model in finite time we have to identify for which parts of model configuration space we need to prioritize our elicitation. Prior modeling in practice does not fit into the “think really hard about your domain expertise before you consider the measurement” abstraction introduced in so many Bayesian textbooks!

This identification of critical parts of the prior model can be done with simulation studies to understand a particular modeling technique and it can be done by examining the behavior of the posterior distribution realized from observed data and preliminary models. There are lots of methods that might seem dubious but can be made robust when put into this framework that separates the investigation of where to invest time in domain expertise elicitation from the actual elicitation process itself.
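
A toy version of such a simulation study, under assumptions invented for this example (a binomial observational model, a true proportion of 0.9, two hypothetical Beta priors, and the helper name `simulate_bias`), might look like the sketch below. It asks how far the posterior mean sits from the truth, on average, for each prior and sample size, which indicates where careful elicitation would matter most.

```python
import random

def simulate_bias(theta_true, a, b, n, n_reps, rng):
    """Average (posterior mean - truth) over simulated binomial data sets."""
    total = 0.0
    for _ in range(n_reps):
        successes = sum(rng.random() < theta_true for _ in range(n))
        total += (a + successes) / (a + b + n)  # posterior mean under Beta(a, b)
    return total / n_reps - theta_true

rng = random.Random(42)
theta_true = 0.9  # hypothetical ground truth used to simulate data
candidate_priors = {"flat": (1, 1), "sceptical": (2, 8)}  # hypothetical priors

bias = {}
for name, (a, b) in candidate_priors.items():
    for n in (10, 100, 500):
        bias[(name, n)] = simulate_bias(theta_true, a, b, n, 200, rng)
        print(f"{name:<10} n = {n:<4} bias of posterior mean = {bias[(name, n)]:+.3f}")
```

In this setup the sceptical prior pulls estimates well away from the truth when data are scarce and matters much less as data accumulate. The point of the exercise is to learn which prior properties deserve elicitation effort, not to tune the prior to a particular observed data set.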

Thanks @betanalpha for clarifying. I think you make an interesting point but I struggle to believe that most folks (especially given the state of reproducibility in science) can iteratively elicit prior knowledge from observed data without at the same time overestimating the accuracy of their prior. I mean there’s a big difference between knowing something and thinking that you should have known something after you find it out. I will study your Bayesian workflow properly at some point next week; no doubt it’s nuanced and that’s important, so I look forward to understanding it in more detail. I’m also quite curious to see how much of it is specifically Bayesian. I think it’s important to separate a principled praxis from the claim that being Bayesian is in itself a principled praxis in some sufficient or necessary way.

I think it’s important to separate a principled praxis from the claim that being Bayesian is in itself a principled praxis in some sufficient or necessary way.

Yeah, agreed @emiruz that the first part of that statement is important but here’s a different perspective on the second claim, that Bayesian inference is not necessary for principled praxis.

I think for the sake of clarity it always helps to distinguish between rules for manipulating probabilities and rules or methods for assigning probabilities.

Bayes theorem is a valid rule for manipulating probabilities; it’s a simple result which we can reach from the sum and product rules. Everyone seems to agree on the validity of the sum and product rules, and thus Bayes theorem is uncontroversial. Hierarchical Bayesian models consist of the repeated application of these simple rules.
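
For completeness, the step from the sum and product rules to Bayes theorem can be written out in three lines. The notation is an assumption made here: $A$ and $B$ are propositions, $I$ is background information, and the $A_i$ are mutually exclusive and exhaustive alternatives.

```latex
\begin{aligned}
P(A, B \mid I) &= P(A \mid B, I)\,P(B \mid I) = P(B \mid A, I)\,P(A \mid I)
  && \text{(product rule, both orderings)} \\
P(A \mid B, I) &= \frac{P(B \mid A, I)\,P(A \mid I)}{P(B \mid I)}
  && \text{(divide by } P(B \mid I) > 0\text{)} \\
P(B \mid I) &= \sum_i P(B \mid A_i, I)\,P(A_i \mid I)
  && \text{(sum rule over the } A_i\text{)}
\end{aligned}
```

Nothing in these lines says how $P(A \mid I)$ should be assigned, which is the separation between manipulating and assigning probabilities.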

Assigning probabilities, as in @betanalpha’s post above, is not contained within Bayes theorem. But Bayesian analysis requires and encourages us to think about it very carefully (in the case study there, very carefully!)

That said, a principled praxis should respect the sum and product rules. There are a lot of methods and ‘estimators’ out there that do not respect those rules and I’m generally not comfortable using them. Obviously we all have to take some things as given (hierarchical modeling can only go on for so many levels, some problems are too big for MCMC) and respect our own finite capabilities, but we should still strive to follow the rules that we all agree upon; we should avoid violating those rules.

That would seem to put the core of Bayesian analysis in a unique position for most if not all scientific research—it is necessary but not sufficient.

When we must violate them, as occurs, we should recognize it as an attempt to “approximate” standard inference rather than claim some kind of relativistic equality of inferential paradigms that have foundations of unequal strength and versatility.

Thanks for your comments @cmcd, the perspective is appreciated!

Any probability theory based methodology can happily respect the rules and consequences of probability theory without becoming Bayesian. This includes using Bayes rule, which frequentists do but that doesn’t commit anyone to “Bayesian inference” as a probability assignment methodology. So I don’t think this part of the praxis is distinctly Bayesian.

That Bayesian analysis is not necessary for scientific research is clearly proven by the fact that most science doesn’t and didn’t use Bayesian analysis to get to where it is or to go to where it’s going :-)

okay, fair point at the end there—but you’re taking it out of context, and I shouldn’t have said scientific “research” when I meant inference.

I’d keep that statement in context, which I took to be more about whether or not various approaches to data analysis should be viewed as equivalently valid. Some approaches to inference are not correct or they’re not optimal, often they are not even able to ask/answer the right question; but of course using them doesn’t prevent us from learning anything from experience or from experimentation. It was the philosophical relativism that I was hearing that I don’t agree with, though I think I see that the difference in perspective may lie somewhere else.

Any probability theory based methodology can happily respect the rules and consequences of probability theory without becoming Bayesian.

It is common practice for people to come up with estimators and algorithms that are using other rules or guidelines (often unstated?). I believe two of the methods you’ve cited fall into that camp (the bootstrap and, from just a quick look at the paper, Shalizi’s CSSR model).