Numerical values for priors

I have a question that is perhaps somewhat philosophical. I use Stan mainly via brms, and my models are nonlinear algebraic equations or ODEs (more specifically, kinetic equations that attempt to predict the course of chemical or physical reactions, but that is not relevant for my question). When it comes to priors, the general guideline is to set them before seeing the data. I understand the principle behind this, but it is not a very practical guideline. I have to look at the data to see what values I should give to the priors, especially when using weakly regularizing priors; otherwise I may end up giving absurd numerical values. To be clear, this question is not about which priors to choose, but about what numerical values to give to the prior parameters. My current workflow is to do a quick and dirty frequentist analysis of the data to get a first impression of the parameter values and then use those as input for the Bayesian analysis, making sure that the priors will not dominate (usually I have enough data, so dominating priors are not very likely anyway). That strategy works fine as such, but I am then violating the guideline that I should set the priors before seeing the data. My general problem is that my modeling does benefit from looking at the data before applying the model. How should I deal with this? Am I taking this guideline of setting priors before seeing the data too strictly? Thanks for any advice on this basic question!

EDIT: @maxbiostat edited this post to remove clutter.

1 Like

Hey there!

You might want to check out this paper (The prior can often only be understood in the context of the likelihood).

Cheers,
Max

1 Like

Thanks Max. Yes, I knew that paper; it describes the problem well but does not really resolve it for me. More specifically, is it allowed to first do a frequentist analysis and treat the result as prior knowledge before entering the Bayesian world, or is that considered cheating in the sense of working towards an answer? That is where I feel insecure. I think it is impossible not to look at the data, if only because one may need to modify some measurement scales in order to avoid numerical problems. So the paper you refer to answers my question in a general sense: yes, you may look at the data, but not to the point of basing first guesses about numerical parameter values on the data, which is considered cheating (as that paper also mentions).

I guess it is quite normal to go back and forth on fitting your model and refining it. Of course you can “cheat” by setting the priors close to the posterior, but you are first and foremost cheating yourself with that I would say. Also, I think there is a lot of room to navigate between “don’t look at the data” and “cheating”. The most principled way to do this is probably via prior predictive checks.

I’m sure @maxbiostat has a more elaborate opinion on this (if he is willing and has time to weigh in).

Cheers,
Max

2 Likes

I wonder why you feel you have to fit the model using frequentist methods first in order to gauge parameter values. I mean, are there no previous experiments? If not, could you not reason about the domain of the parameters from first principles? Note that there is nothing wrong with simulating data from the joint distribution of the data conditional on the parameter under different parameter values to gauge what makes sense and what does not.

5 Likes

Thanks for your reply! It is not that I feel I have to; it is just convenient. There are not always previous experiments, but even if there are, I will have to look at the data to see the order of magnitude of my parameters and the scale on which the measurements were done. For instance, I am studying rate constants that indicate how fast a chemical reaction proceeds. It makes a big difference whether they are 0.000001 per second or 100 per second, and the only way to find out which order of magnitude applies is to look at the data (this is where, for instance, a least squares regression gives a quick and dirty answer). So I could also ask: what is the argument against doing it like this?

Many statisticians – myself included – think one should conform to the Likelihood Principle (LP) when conducting a Bayesian analysis. Devising priors to cover the MLE, say, with high probability after looking at the data, most likely implies violating the LP.

Others, like @andrewgelman, don’t seem to think the LP is such a big deal, and certainly not integral to a Bayesian analysis.

I don’t know where @betanalpha stands on this, but he has written extensively about using predictive analysis to gauge prior impact. I wonder if you could compute the pushforward distribution of the data under a particular prior in order to understand what that prior entails in terms of model configurations.

2 Likes

Relevant are this paper by Dan Simpson, Michael Betancourt, and myself: The prior can often only be understood in the context of the likelihood and this paper by Lauren Kennedy, Daniel Simpson, and myself: The experiment is just as important as the likelihood in understanding the prior.

1 Like

Would it be more precise to say that the prior, when combined with the model configuration, gives us the pushforward distribution of the data? I had the impression that the prior does not affect the model configuration directly, i.e., it is not part of the model configuration.

Multiple model configurations are simulated from the prior (1), each of which constructs its own simulation distribution (2). Combining the results of (2) over all model configurations gives us the pushforward distribution of the data.

1. $\tilde{\theta} \sim \pi_s(\theta)$
2. $\tilde{y} \sim \pi_s(y \mid \tilde{\theta})$

The following notation and explanations are taken from @betanalpha's writings.

Observational model: pi(y | theta)
Prior model: pi(theta)
Bayesian model: pi(y, theta) = pi(y | theta) * pi(theta)
Observational space: Y
Observation: y in Y
Model configuration space: Theta
Model configuration: theta in Theta

A model configuration is just one point theta. For example, if the observational model is N(y | mu, theta), then one model configuration might be mu = 0, theta = 1, or equivalently N(y | 0, 1).
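The two-step simulation described above can be sketched in a few lines of Python. This is only an illustration, not anyone's actual workflow: the priors on the location and scale are hypothetical choices, and the second parameter of N is treated as a standard deviation and named `sigma` in the code.

```python
import random

def simulate_prior_predictive(n_draws, seed=1):
    """Ancestral sampling: draw a model configuration, then data given it."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Step 1: one model configuration from the prior (hypothetical priors).
        mu = rng.gauss(0.0, 1.0)          # mu ~ Normal(0, 1)
        sigma = abs(rng.gauss(0.0, 1.0))  # sigma ~ half-Normal(0, 1)
        # Step 2: one observation from the observational model N(y | mu, sigma).
        y = rng.gauss(mu, sigma)
        draws.append((mu, sigma, y))
    return draws

draws = simulate_prior_predictive(1000)
ys = sorted(y for _, _, y in draws)
# Pooling y over all configurations approximates the prior predictive distribution.
print("central 90% of simulated y:", ys[50], ys[949])
```

Each tuple is one simulated configuration together with the observation it generated; discarding `mu` and `sigma` and keeping only `y` is what marginalizes over the prior.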

2 Likes

I am not a statistician but a food scientist who applies Bayesian modeling to predict changes in foods. So my question was not meant to provoke a statistical debate, but rather to get some practical advice. From the Gelman papers I learn that it is OK to also look at the experiments when deciding about priors. But I still struggle with how far I can go in that. Suppose a student comes to me with experimental results. I study the results and conclude that a nonlinear exponential decay model could be a candidate, c = c0 exp(-k t), with two parameters, c0 (initial concentration) and k (rate constant): I can only do that by looking at the data. Because the experiments are chemical measurements, I decide on a normal likelihood and a zero-bounded normal, lognormal, or exponential prior for the two parameters. What parameter values do I give these priors if I want to make them weakly informative (for noninformative priors this is easily solved, but I learned that I’d better not use those)? That depends on the scales of the measurements. Was time measured in seconds, days, months? I can only find out by looking at the data before setting the priors. So what I considered a practical solution is to run a quick least-squares regression to get an idea of the order of magnitude of the parameter values, and then increase the standard deviation of the prior to avoid dominating priors. I find that an easy way, but I do realize the danger of working towards a solution, comparable to p-hacking (I try to avoid that by making the priors much wider than the least-squares results suggest).
On the other hand, choosing prior parameter values that lead to crazy results because I did not look at the data beforehand forces me to reconsider the prior values, and again I can only do that by looking at the data to find more sensible values than in the first attempt. (The exponential decay example is trivial; in practice I have more complicated models.) So it is a very practical question: I seek advice on whether this is a feasible way of working. In any case, I would like to thank the respondents because it has already helped me a lot.

2 Likes

In the example you give, you should be able to come up with lots of prior information before collecting any data: you have some idea of how high the concentration will be at the beginning, some idea of how fast it will decay, etc. You may not know these numbers precisely, but you will have some rough ideas of their order of magnitude, and that you can encode into your prior before seeing any data. Then once you have fit your model you can see if its inferences make sense and fit the data (see chapter 6 of BDA); if there are problems, you can go back and consider other information external to your particular experiment. Maybe it will be easier if you think of this as “external information” rather than “prior information.”

3 Likes

We can talk about prior pushforward checks that look at the consequences of the prior model on different aspects of the model configuration space and prior predictive checks that look at the consequences of the prior model on different aspects of the observation space. Both of these can be implemented with simulations, in particular they can be implemented sequentially. For much more see Towards A Principled Bayesian Workflow.
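As a rough sketch of the difference between the two kinds of check, consider the exponential decay model c = c0 exp(-k t) discussed earlier in the thread. Every numerical value below (the lognormal priors, the noise scale, the time point) is a hypothetical placeholder, not a recommendation.

```python
import math
import random

rng = random.Random(2)
N = 2000

# Hypothetical prior for the rate constant k (per hour): LogNormal(log(0.1), 1).
k_draws = [rng.lognormvariate(math.log(0.1), 1.0) for _ in range(N)]

# Prior pushforward check: a function of the parameters only, here the half-life.
half_lives = sorted(math.log(2) / k for k in k_draws)

# Prior predictive check: simulated observations at t = 5 hours, with
# hypothetical c0 ~ LogNormal(log(10), 0.5) and Normal(0, 0.5) measurement noise.
y_sim = sorted(
    rng.lognormvariate(math.log(10), 0.5) * math.exp(-k * 5.0) + rng.gauss(0.0, 0.5)
    for k in k_draws
)

def central_interval(sorted_xs, mass=0.9):
    """Approximate central interval from an already sorted list of draws."""
    lo = int(len(sorted_xs) * (1 - mass) / 2)
    hi = len(sorted_xs) - 1 - lo
    return sorted_xs[lo], sorted_xs[hi]

print("half-life (h), central 90%:", central_interval(half_lives))
print("y at t = 5 h, central 90%:", central_interval(y_sim))
```

Both intervals are meant to be judged against what is chemically plausible (is a half-life of that size conceivable for this reaction?), not against the measurements themselves.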

Although post hoc modeling is common in many scientific fields, it is a disaster statistically because it almost always leads to overfitting and poor generalization.

Or you can consider the experimental design. As a very crude example you can bound all of the relevant time scales that could have been observed below the time since the student joined your lab and above the resolution of the experimental apparatus available. Benchtop techniques aren’t going to give femtosecond resolution. Already that conservatively bounds the time scales between O(year) and whatever the resolution of the equipment is. More careful consideration can provide even stronger bounds without much more work. Unless the data were emailed to you with zero provenance then you’ll always be able to come up with enough domain expertise to motivate reasonable priors, and with practice you’ll be able to build quite informative priors even for complex models.
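One way to turn such conservative bounds into prior parameters, sketched here under the assumption that a lognormal prior on the time scale is acceptable, is to place most of the prior mass between the two bounds. The helper name `lognormal_from_bounds`, the 98% mass, and the one-second resolution are all hypothetical choices for illustration.

```python
import math
from statistics import NormalDist

def lognormal_from_bounds(lower, upper, mass=0.98):
    """Return (mu, sigma) of a lognormal with `mass` probability in [lower, upper]."""
    z = NormalDist().inv_cdf(0.5 + mass / 2)  # standard normal quantile for the tails
    log_lo, log_hi = math.log(lower), math.log(upper)
    mu = (log_lo + log_hi) / 2                # midpoint on the log scale
    sigma = (log_hi - log_lo) / (2 * z)       # tails fall just outside the bounds
    return mu, sigma

# Hypothetical bounds: 1 second of instrument resolution up to 1 year, in seconds.
ONE_YEAR = 365 * 24 * 3600.0
mu, sigma = lognormal_from_bounds(1.0, ONE_YEAR)

# Verify on the log scale, where the prior is Normal(mu, sigma).
log_prior = NormalDist(mu, sigma)
inside = log_prior.cdf(math.log(ONE_YEAR)) - log_prior.cdf(math.log(1.0))
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, mass inside bounds = {inside:.3f}")
```

The bounds are soft: values outside them are still possible, just improbable, so the data can pull the posterior there if the domain reasoning turns out to be wrong.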

A prior model is meant to capture your domain expertise so that the posterior will identify the model configurations consistent with both your domain expertise and the information induced from the observed data. Tuning a prior model to the data is worse than appealing to “non-informative priors” (quotes because non-informative priors don’t exist; flat priors are very certain that infinity is a great value) because these empirical priors (often called empirical Bayesian priors) just amplify the influence of the data and encode zero domain expertise.

Prior elicitation is challenging, and in practice you may have to iterate through updated prior models if you find that your initial prior model doesn’t capture enough of the relevant domain expertise to give you the inferences that you want. That said, you have to be very careful when changing your prior model based on a posterior fit, exactly because you need to base the prior model on domain expertise and not the observed data: the observed data can tell you only to look at your domain expertise harder, not what your domain expertise is. For more see Towards A Principled Bayesian Workflow, especially the end of Section 4.2 and the beginning of Section 4.3.

5 Likes

Michael writes:

I agree. My only comment is that I would change “prior” to “model” because I’m concerned about model assumptions in general. Indeed, the model for data and measurement is typically much more important than the prior distribution for model parameters. Further discussion on this general point here: https://statmodeling.stat.columbia.edu/2020/08/22/david-spiegelhalter-wants-a-checklist-for-quality-control-of-statistical-models/ We should not strain at the gnat of the prior distribution while swallowing the camel of the data model.

3 Likes

By “not the observed data,” do we really mean not the data we plan to perform inference on, since all other data can be part of, indeed is, prior information?

2 Likes

Sometimes I like to use the term “external information” rather than “prior information” to emphasize that the data used in the so-called prior do not need to come before the data used in the so-called data model.

3 Likes

@Tiny
What parameter values do I give these priors if I want to make the priors weakly informative (for noninformative priors it is easily solved but I learned that I’d better not do that) ? That depends on the scales of the measurements. Was time measured in seconds, days, months? I can only find out by looking at the data before setting the priors.

It sounds like there are a couple questions mixed in here in your original question; one is about using frequentist models to get a sense of the scale of your data. I think that your intuition already tells you there’s something not right about this and others here are confirming that.

There is a second question in here I think @betanalpha just provided a lot of advice on, but it may be worth stating explicitly that you don’t need to know measurement scale to create meaningful prior distributions. You can begin by thinking about them on whatever scale is easiest to think about, then either re-scale the prior (e.g. convert your more intuitive understanding about minutes into months) or else change the scale of the data to match the measurement scale of your prior. The measurement scale has no meaning itself in terms of your domain expertise, except as an arbitrary reference point.
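For a prior that lives on the log scale, re-scaling is a one-line shift, as this small sketch illustrates. It assumes a hypothetical lognormal prior on a rate constant elicited per minute and a 30-day month; neither comes from the thread.

```python
import math
import random

MINUTES_PER_MONTH = 30 * 24 * 60  # assuming a 30-day month

# Hypothetical prior elicited per minute: k_min ~ LogNormal(mu_min, sigma).
mu_min, sigma = math.log(0.01), 1.0

# k_month = k_min * MINUTES_PER_MONTH, so only the log-scale location shifts;
# sigma (the uncertainty in orders of magnitude) is unchanged.
mu_month = mu_min + math.log(MINUTES_PER_MONTH)

# Check by simulation: converted per-minute draws should match the shifted prior.
rng = random.Random(3)
log_k_month = [
    math.log(rng.lognormvariate(mu_min, sigma) * MINUTES_PER_MONTH)
    for _ in range(20000)
]
mean_log = sum(log_k_month) / len(log_k_month)
print(f"analytic mu_month = {mu_month:.3f}, simulated mean of log k = {mean_log:.3f}")
```

The same domain knowledge is encoded either way; the unit only moves the location parameter.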

More generally also I think the principle to shoot for is not to personally remain blind to the data, though that could help. To add to what @maxbiostat mentioned about the likelihood principle, you can do that by staying principled about how you actually work. I think that just means that the prior faithfully encodes whatever state of knowledge you say it represents (obviously not including the new data). I recently heard Frank Harrell in a talk respond to the concern that clinical trials shouldn’t be studied with Bayesian methods because drug companies can cheat with priors, and his response was that this is really the easiest cheating to catch. Of course not everyone has the FDA looking over their work.

3 Likes

Thank you all for the feedback, it is highly appreciated! The remark about the measurement scale is an eye-opener for me. I also fully appreciate the remarks about using expert knowledge to set the prior, and of course I do already know a lot before seeing the data. I have also played with prior predictive modeling, but there I see some ambiguity too. For instance, in brms one can simulate prior predictive distributions by setting sample_prior = "only" (a nice blog example can be found at https://magesblog.com/post/2018-08-02-use-domain-knowledge-to-review-prior-predictive-distributions/ ). What confuses me is that the prior is then judged by how well it covers the data, which suggests that if the prior simulations do not cover the data at all, one should adjust the prior (which effectively means after seeing the data). The remark by Max_Mantei about going back and forth to refine the model also suggests that one can adjust the prior if it is not doing a good job. Of course this is what one should do in model criticism: if something is not right, go back and improve it. But then, too, the data have been seen. I understand very well the message that the prior should reflect expert/domain knowledge, and I am not working towards a desired solution. One of the strengths of the Bayesian approach, I find, is that one has to be explicit about the prior, so cheating is not really an issue, even without the FDA looking at you. But as a non-statistician I want to do my modeling in the best possible way, hence my question.

1 Like

Technically just the observed data on which inference will be performed, but I wouldn’t recommend trying to mine that “domain expertise” from other data. The problem with data is that it can be hard to interpret, if not outright misleading, without a model: you don’t actually learn from the data, you learn from the likelihood function, which is the data manifesting through the observational model. If you have other data then consider analyzing it sequentially (using the posteriors to motivate priors for the new analysis) or analyzing everything jointly, which is always ideal.
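In the simplest conjugate setting the sequential and joint routes give the same posterior, which a toy example can make concrete. The sketch below assumes a normal mean with known measurement noise; the data values, the starting prior, and the helper `update_normal_mean` are all made up for illustration.

```python
def update_normal_mean(prior_mean, prior_sd, data, noise_sd):
    """Conjugate update for a normal mean when the noise sd is known."""
    prior_prec = 1.0 / prior_sd**2
    post_prec = prior_prec + len(data) / noise_sd**2
    post_mean = (prior_prec * prior_mean + sum(data) / noise_sd**2) / post_prec
    return post_mean, post_prec ** -0.5

earlier = [4.8, 5.3, 5.1]        # hypothetical earlier experiment
current = [5.6, 5.2, 5.4, 5.0]   # hypothetical new experiment
NOISE_SD = 0.5                   # assumed known measurement noise

# Sequential: the posterior from the earlier data becomes the prior for the new data.
mean_1, sd_1 = update_normal_mean(0.0, 10.0, earlier, NOISE_SD)
seq_mean, seq_sd = update_normal_mean(mean_1, sd_1, current, NOISE_SD)

# Joint: analyze everything at once from the original prior.
joint_mean, joint_sd = update_normal_mean(0.0, 10.0, earlier + current, NOISE_SD)

print(f"sequential: mean = {seq_mean:.4f}, sd = {seq_sd:.4f}")
print(f"joint:      mean = {joint_mean:.4f}, sd = {joint_sd:.4f}")
```

For non-conjugate models the posterior from the first analysis usually has to be approximated before it can serve as a prior, which is one reason a joint analysis is often the cleaner option.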

1 Like

Prior predictive checks should not compare the prior predictive distribution to the observed data! Rather they should just be different ways of observing the consequences of the prior model from different perspectives, which one can then compare to implicit domain expertise. See for example Towards A Principled Bayesian Workflow, especially the third-to-last paragraph.

Ignore the data being plotted there and focus on Marcus’s discussion of seeing negative prior predictive simulations which clashes with domain expertise that the observations should be positive. That’s a much better example of a principled prior predictive check.
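A check of that kind needs no observed data at all. As a minimal sketch for the exponential decay model with a normal likelihood mentioned earlier in the thread, one can ask how often the prior predictive simulations produce negative concentrations. The priors, the time window, and the two noise scales are hypothetical.

```python
import math
import random

def fraction_negative(noise_sd, n_sims=5000, seed=4):
    """Share of prior predictive draws with a negative simulated concentration."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(n_sims):
        c0 = rng.lognormvariate(math.log(10.0), 0.5)  # hypothetical prior on c0
        k = rng.lognormvariate(math.log(0.1), 1.0)    # hypothetical prior on k (per hour)
        t = rng.uniform(0.0, 48.0)                    # hypothetical sampling times (hours)
        y = rng.gauss(c0 * math.exp(-k * t), noise_sd)  # normal observational model
        negatives += y < 0
    return negatives / n_sims

for noise_sd in (0.1, 2.0):
    print(f"noise_sd = {noise_sd}: fraction negative = {fraction_negative(noise_sd):.3f}")
```

A large share of negative simulated concentrations clashes with the knowledge that concentrations are positive, and that clash points at the noise scale or the choice of likelihood without any reference to the measurements.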

Unfortunately the data-overlay visualization of prior predictive distributions is common (arviz also does it), which understandably adds to the confusion.

2 Likes

Yeah I agree, and this has also been a problem in the bayesplot package. The bayesplot package doesn’t recommend that people overlay the data for prior predictive checks, but we haven’t made it super easy to exclude the data without a bit of hacking. I’ve been wanting to change that for a while and add functions for plotting predictive distributions without data. I finally got around to it a few months ago:

1 Like