P-value estimation by Monte Carlo sampling

Many fields expect researchers to publish p-values as a criterion for the significance of results.

A good way to compute the p-value of a generative model is to:

  1. compute the likelihood of the observation
  2. sample likelihood values from the generative process
  3. use a simple estimator comparing the sampled likelihoods to the observed one
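To make the three steps concrete, here is what I have in mind in Python rather than Stan (a toy normal model with fixed parameters; all the names are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model (my own stand-in, not a real fitted model):
# y_i ~ Normal(mu, sigma) with the parameters held fixed.
mu, sigma, n = 0.0, 1.0, 50
y_obs = rng.normal(mu, sigma, size=n)

def log_lik(y):
    # Gaussian log-likelihood of a dataset under the fixed parameters.
    return (-0.5 * len(y) * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum((y - mu) ** 2) / sigma**2)

# Step 1: likelihood of the observation.
obs_ll = log_lik(y_obs)

# Step 2: sample likelihood values from the generative process.
sim_ll = np.array([log_lik(rng.normal(mu, sigma, size=n)) for _ in range(4000)])

# Step 3: simple Monte Carlo estimator comparing the two.
p_value = np.mean(sim_ll <= obs_ll)
```

What I don't know is how to do the equivalent of step 2 inside Stan.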

I’m fairly new to Stan, but it seems to me that `target` could be used for step 1. I have no clue how to do step 2, though. I saw a useful link for step 3: 26.3 Posterior ``p-values’’ | Stan User’s Guide

Is there a code snippet I could use?

Also, shouldn’t there be a simple interface for this p-value computation, given how frequently it is needed?

Morning and welcome!

The short of it (me speaking as an individual, not for the whole community here): p-values don’t have a place in Bayesian workflows. That said, you will see some folks who have code external to Stan that can wrangle something like those out.

If you are looking for strategies to help bring around reviewers, co-workers, colleagues, and such to a better way (Bayesian models in Stan) of quantifying uncertainty there are folks who can help out here.


As @Ara_Winter says, you’d be doing something Stan wasn’t built for. It’ll probably be more effort-efficient to code your likelihood in R or Python and simulate from that. Using Stan for this would involve passing in the point estimates representing your null model’s parameters, using generated quantities to simulate data from the null and collapse it to a summary stat, computing the same summary stat on the real data, and emitting a binary 1/0 indicating whether the real summary stat is bigger than the simulated one. Cmdstan’s generate_quantities method would then produce a sequence of 1/0s that you collapse to a mean to define your p-value. All way easier outside of Stan, frankly.
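Outside of Stan, the whole procedure is a few lines. A sketch in Python (a toy normal null model and a mean summary stat, purely illustrative; the observed data are deliberately shifted so the check flags them):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical null model: y_i ~ Normal(0, 1). The "observed" data come from
# a slightly shifted distribution, so they should look unusual under the null.
n = 100
y_obs = rng.normal(0.4, 1.0, size=n)

def summary_stat(y):
    # Discrepancy collapsed to a single number (here: the sample mean).
    return y.mean()

obs_stat = summary_stat(y_obs)

# Simulate datasets from the null model and record the 1/0 comparisons.
indicators = np.array([
    summary_stat(rng.normal(0.0, 1.0, size=n)) >= obs_stat
    for _ in range(5000)
])

# The mean of the 1/0 sequence is the Monte Carlo p-value.
p_value = indicators.mean()
```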


Just to interject: the p-values discussed at the link in the OP, as well as in the OP’s outline of how to compute the p-value, are so-called “Bayesian p-values”. These arise in the context of posterior predictive checking, and refer to the quantile of the posterior predictive distribution, for some appropriate discrepancy measure, into which the observed discrepancy measure falls. These “p-values” are not significance tests of effects in the model; they are a strategy for summarizing the evidence for lack-of-fit as assessed by a particular discrepancy measure (i.e. a summary quantity used as a test statistic in a posterior predictive check).

The fact that these quantities get called “p-values” is definitely confusing.

@louisabraham In my field (ecology) when authors are writing for potentially frequentist audiences, I often see language like “we assessed effect significance using 95% credible intervals” or similar. I’m not saying I agree with this language, but if you have an audience that is fixated on null hypothesis significance testing you might get some mileage out of statements like that one.


Thank you everyone for your warm welcome! It’s unusual to see such a nice community!

First of all, I misunderstood the definition of the p-value from Wikipedia by trying to understand it from a Bayesian point of view:

the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct

I interpreted “test results at least as extreme as the results actually observed” as “observations with a likelihood less than that of the results actually observed”. Now I get that “results” is not a synonym for “observations”, since people compute p-values for coefficients. My variant of p-values can be framed within the “Bayesian p-values” described in the documentation, as pointed out by @jsocolar, since it is just the probability of getting a likelihood less than that of the observation. Note that one needs a generative model. Do you think there is a way to compute them? My question is basically how to adapt the code from 26.3 Posterior ``p-values’’ | Stan User’s Guide to analyze the target quantity. Also, do not hesitate to tell me if you think it is not a quantity of interest.

Second, I read more about classical p-values, for instance in the case of linear regression. I use these formulas as reference. My interpretation – tell me if this is not the appropriate place to discuss this – is the following:

  1. We first consider a MLE estimator on the parameters given an experiment.
  2. We generate a distribution on the observations across multiple experiments.
  3. We deduce a distribution on the MLE estimators given those observations.
  4. The p-values of the parameters follow naturally from the cdf of the estimators’ distributions.

There are a few things that surprise me: assumptions that I think are necessary but not clearly stated.

A. I don’t think the model actually tells how to generate new observations. In the case of linear regression, we add \epsilon to y (or to \hat y, since y - \hat y is in the kernel of the projection on \beta). However, I fail to see the meaning of it and just view it as a symbolic manipulation that abuses the = and ~ signs. Basically, the equations give us a likelihood on the observations but not a means to generate new data. An illustration of this is that there are other meaningful ways to generate new data, e.g. bootstrapping. Hence, I think the means of generating new samples should be added as an assumption of the model.

B. It is clear that the chosen generating process treats the parameters \beta and \sigma differently: \sigma is estimated only once (with the MSE of the MLE estimator) whereas \beta is estimated from the generated y. I feel like there is a hidden assumption here. To be perfectly “fair”, we should find a distribution for \sigma that is a fixed point of the generation-estimation process, but I guess it is complicated. So my conclusion is that saying \sigma is fixed when repeating the experiment is another assumption of the model.

From the previous points, there is actually an experiment that makes the distribution on \beta appear:

  1. treat X, y and \sigma as fixed inputs
  2. generate y' = y + \epsilon
  3. compute \beta with MLE (this can be done in Stan using L-BFGS if I’m not mistaken)
  4. sample from that distribution of \beta by repeating 2 and 3
  • Do you agree that this method would produce the correct distribution?
  • Do you think this approach can be generalized to other models? I think other models make some assumptions in the form of A and B.
  • Is the “composition” of Monte-Carlo sampling and MLE doable in Stan?
  • Are there more efficient ways to achieve that with Stan? I’m not too familiar with all the possibilities offered by Stan, and I don’t see any obvious way to apply MCMC or variational inference.
  • EDIT: I noticed that the distribution of \beta used for p-values coincides with the limit of the Bayesian posterior when the posterior factors out the distribution of X and we make the prior more diffuse (which is related to the Bernstein–von Mises theorem). So we could probably simply do a posterior estimation with a very diffuse prior. Is this always the case with classical p-values? Is there a way to achieve that in Stan?
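The procedure in steps 1–4 would look something like this in plain Python for ordinary linear regression (a toy setup of my own, not real data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative linear-regression setup (all names hypothetical).
n, true_beta, sigma = 200, 1.5, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
y = X @ np.array([0.0, true_beta]) + rng.normal(0.0, sigma, size=n)

def mle_beta(X, y):
    # Step 3: for Gaussian linear regression the MLE is ordinary least squares.
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_hat = mle_beta(X, y)

# Steps 1, 2, 4: hold X, y and sigma fixed; repeatedly perturb y and re-estimate.
draws = np.array([
    mle_beta(X, y + rng.normal(0.0, sigma, size=n))
    for _ in range(2000)
])
```

If I did this right, the spread of `draws` should match the classical sampling covariance \sigma^2 (X^T X)^{-1}, centered at \hat\beta, which is what the t-statistics are built from.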

Sorry for the long message!

This forum isn’t really the place to discuss frequentist p-values, but briefly: you’re on the right track, except that you need to compare an observed test statistic (not a parameter estimate) to the distribution of test statistics (not of MLE estimators) expected under the null hypothesis (not at the MLE).

Using the likelihood itself in the computation of a Bayesian p-value would be nonstandard. Usually we pick statistics (which are sometimes called discrepancy measures) that can be computed as functions of the data. But the likelihood is not just a function of the data; it is a function of the data and the parameters. Now it’s easy enough to implement the procedure where, for each draw from the posterior, we simulate a replicate dataset, compute the likelihood of those data, and thereby get a posterior predictive distribution for the likelihood. The problem, however, is that we don’t have a unique fixed value for the observed likelihood; instead, we have an entire posterior distribution. This is because, though the observed data are fixed, the likelihood depends on both the data and the parameters, and the parameters are estimated with posterior uncertainty. So there’s no unique value of the likelihood that we can compare to the posterior predictive distribution in order to obtain our Bayesian p-value.
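To make the contrast concrete, here is the standard version of the procedure with a data-only discrepancy measure, sketched in Python (the posterior draws are a conjugate-normal stand-in, not real Stan output):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup (hypothetical): model y_i ~ Normal(mu, 1), and suppose we already
# have posterior draws of mu from some fitted model.
n = 80
y_obs = rng.normal(0.0, 1.0, size=n)
posterior_mu = rng.normal(y_obs.mean(), 1.0 / np.sqrt(n), size=4000)  # stand-in

def discrepancy(y):
    # Data-only discrepancy: a single fixed number for the observed data.
    return np.abs(y - np.median(y)).mean()

obs_d = discrepancy(y_obs)

# One replicate dataset per posterior draw -> posterior predictive distribution
# of the discrepancy, compared against the single observed value.
rep_d = np.array([
    discrepancy(rng.normal(mu, 1.0, size=n)) for mu in posterior_mu
])
bayes_p = np.mean(rep_d >= obs_d)
```

The point is that `obs_d` is one fixed number, so the comparison defining `bayes_p` is well defined; swap `discrepancy` for the likelihood and there is no longer a unique observed value to put on the left-hand side.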

For completeness, note that although the likelihood is not a deterministic function of the data in Bayesian analysis, the likelihood at the MLE is a deterministic function of the data. Therefore, it’s possible (though I’m not sure it would be particularly useful) to conduct a posterior predictive check and obtain a Bayesian p-value for a discrepancy measure defined as the likelihood at the maximum-likelihood estimate. I say that this wouldn’t necessarily be useful because, in general, when doing posterior predictive checking it’s useful to select discrepancy measures such that, when checks fail, you get a clear indication of how to refine the model in order to improve it. To give a simple example, if you choose a measure of spatial autocorrelation as a discrepancy measure, and then find that your posterior predictive distribution consistently displays weaker autocorrelation than the observed data, you’ll know that you need to think about a better handling of space in your model.


Thanks for your answer! I apologize for discussing frequentist p-values; I think I understand them better now that I’ve tried to frame them in the Bayesian view.

Actually, I think I realized something: the frequentist view is equivalent to setting a constant improper prior. So I’m simply interested in the posterior over \beta given improper priors, right?
But that would just give confidence intervals, not p-values. The documentation makes the same hypothesis: 1.1 Linear regression | Stan User’s Guide

Oh, that’s maybe where I’m mistaken. I know what you are saying is the right definition but I thought one can also compute the p-value given the posterior on the parameters, at least in the linear regression case. The posterior follows a Gaussian law whereas the t-statistic follows a Student law, but I think that’s just because of the division with the sample variance in the formula of the t-statistic.

Is there a quantile that can be computed from the posterior probability of \beta that would be equal to the quantile of the t-statistic of the null hypothesis? I was thinking about looking at the quantile of 0 (or doubling it). Basically, obtaining a t-statistic as extreme as the observed one under the Student distribution centered at 0 should be the same as obtaining 0 under the posterior, right?

If that’s wrong, I’m curious to know whether there is a name for the quantity I described because it appears to be interesting and have an obvious meaning.

If that’s correct, are frequentist p-values just cdfs of parameter posteriors given a uniform / improper prior? In general, should I use a constant prior or a Jeffreys prior to match the frequentist view?

It is not true in general that Bayesian credible intervals and frequentist confidence intervals (and thus p-values) coincide under improper uniform priors. To see this, note that a flat prior over one parameter induces a not-flat prior over some other parameter in a reparameterized version of the model. For example, a flat prior on a variance term induces a not-flat prior on the standard deviation. So if Bayesian inference under flat priors is to correspond with frequentist inference, this can only be true for one particular parameterization of a model (technically, for one particular family of parameterizations that are linear rescalings of one another). Now it’s possible that for simple linear regression there is some known choice of parameterization that makes Bayesian and exact frequentist inference numerically equivalent, but I don’t happen to know what that choice is.
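The reparameterization point is easy to check numerically: draw the variance from a flat prior and look at the implied distribution of the standard deviation (a quick Monte Carlo sketch, not tied to any particular model):

```python
import numpy as np

rng = np.random.default_rng(4)

# Flat prior on the variance over (0, 10], then sigma = sqrt(v).
v = rng.uniform(0.0, 10.0, size=200_000)
sigma = np.sqrt(v)

# Change of variables: p(sigma) = p(v) * |dv/dsigma| = const * 2 * sigma,
# so the implied density on sigma grows linearly with sigma.
lo = np.mean((sigma > 0.5) & (sigma < 1.0))   # mass near sigma ~ 0.75
hi = np.mean((sigma > 2.5) & (sigma < 3.0))   # mass near sigma ~ 2.75
```

The mass near sigma ≈ 2.75 comes out several times the mass near sigma ≈ 0.75, which is exactly the linear tilt the Jacobian predicts.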

Note that even if there exists such a choice, the Bayesian approach might give nontrivially different answers from standard frequentist approaches, because the asymptotic approximations underlying the frequentist calculations are just that: approximations.

Finally, note that if you really want to use Bayesian machinery like Stan to perform a frequentist analysis, you can achieve this via the data-cloning approach outlined by Lele et al. here


I don’t know that this will add much, as it’s more a practical suggestion than a contribution to the theoretical conversation happening here; however, the R package bayestestR includes a variety of helper functions for posterior inference. The reason I recommend it for this particular comment is that it includes a test called the probability of direction. The authors of the package have a paper in Frontiers describing a stepwise approach to posterior inference (available here), and in it they claim that the probability of direction is directly proportional to the frequentist p-value (and the package includes a function for converting between p-values and the probability of direction). I can’t attest to the mathematical soundness/robustness of their claim, but I suspect that their discussion and the literature they cite might be useful for you, particularly as you think about the relationship between the posterior and frequentist p-values.
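For what it’s worth, the conversion the bayestestR authors describe (two-sided p = 2 · (1 − pd)) is easy to check by hand in the flat-prior normal-posterior case, without calling the package (a purely numerical illustration with made-up numbers):

```python
import math
import numpy as np

rng = np.random.default_rng(5)

# Made-up posterior for a coefficient: normal, as arises in linear regression
# with a flat prior and known residual standard deviation.
beta_mean, beta_se = 0.3, 0.15
draws = rng.normal(beta_mean, beta_se, size=1_000_000)

# Probability of direction: posterior mass on the dominant side of zero.
pd = max(np.mean(draws > 0), np.mean(draws < 0))

# Conversion described by the bayestestR authors: two-sided p = 2 * (1 - pd).
p_from_pd = 2 * (1 - pd)

# Frequentist two-sided p-value for the same estimate and standard error
# (2 * (1 - Phi(z)) written via erfc to avoid extra dependencies).
z = beta_mean / beta_se
p_freq = math.erfc(z / math.sqrt(2))
```

Under these assumptions the two p-values agree up to Monte Carlo error; with informative priors or other parameterizations they will diverge.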


It looks like the probability-of-direction test is precisely “the thing that should behave like a p-value as long as informative priors don’t cause the Bayesian and frequentist calculations to diverge from one another”. The challenge is that all priors are informative under some parameterizations of the model and “uninformative” (in the sense of being flat) under other parameterizations. Again, it’s quite possible that in simple linear regressions there is a well-known choice of prior such that the probability-of-direction numerically recapitulates the frequentist p-value; I just don’t personally know what it is (for example, would it involve a flat prior on the residual variance, on the residual standard deviation, or on something else?).