Thank you everyone for your warm welcome! It’s unusual to see such a nice community!
First of all, I misunderstood the definition of p-value from Wikipedia by trying to understand it from a Bayesian point of view:
> the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct
I interpreted “test results at least as extreme as the results actually observed” as “observations with a likelihood lower than that of the results actually observed”. Now I get that “results” is not a synonym for “observation”, since people compute p-values for coefficients. My variant of p-values fits the framework of “Bayesian p-values” described in the documentation, as pointed out by @jsocolar: it is just the probability of getting a likelihood lower than that of the observation. Note that one needs a generative model. Do you think there is a way to compute it? My question is basically how to adapt the code from 26.3 Posterior “p-values” | Stan User’s Guide to analyze this target quantity. Also, do not hesitate to tell me if you think it is not a quantity of interest.
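To make the target quantity concrete, here is a rough pure-Python sketch of that variant of a Bayesian p-value — a toy normal-mean model with known \sigma and a flat prior standing in for a real Stan fit (the data, sample sizes, and number of draws are all made up for illustration; in Stan the comparison would live in `generated quantities` as in the manual section above):

```python
import math, random

random.seed(1)

# Hypothetical observed data: 20 draws from an unknown normal mean.
y_obs = [random.gauss(0.3, 1.0) for _ in range(20)]
n = len(y_obs)

def loglik(y, mu, sigma=1.0):
    """Log-likelihood of data y under Normal(mu, sigma)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - mu)**2 / (2 * sigma**2) for yi in y)

# With a flat prior and known sigma = 1, the posterior for mu is
# Normal(ybar, 1/sqrt(n)).
ybar = sum(y_obs) / n

draws, count = 2000, 0
for _ in range(draws):
    mu = random.gauss(ybar, 1 / math.sqrt(n))          # posterior draw
    y_rep = [random.gauss(mu, 1.0) for _ in range(n)]  # replicated data
    # Test statistic T(y) = likelihood itself: count replicates whose
    # likelihood falls below that of the observed data.
    if loglik(y_rep, mu) < loglik(y_obs, mu):
        count += 1

p_bayes = count / draws  # posterior p-value for the likelihood statistic
```

A p_bayes far from 0.5 would flag the observed data as having an unusually low (or high) likelihood under the fitted model.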
Second, I read more about classical p-values, for instance in the case of linear regression. I use these formulas as a reference. My interpretation – tell me if this is not the appropriate place to discuss this – is the following:
- We first consider an MLE estimator of the parameters given one experiment.
- We posit a distribution on the observations across repeated experiments.
- We deduce a distribution on the MLE estimators given those observations.
- The p-values of the parameters follow naturally from the cdf of the estimators’ distributions.
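The steps above can be sketched end to end for simple linear regression in pure Python (hypothetical data; the normal cdf is used as a large-n stand-in for the t distribution, and all constants here are illustrative):

```python
import math, random

random.seed(0)

# Hypothetical experiment: y = 1.0 + 0.5*x + noise.
n = 200
x = [random.uniform(-2, 2) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

# Step 1: MLE/OLS estimates from this one experiment.
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
beta0 = ybar - beta1 * xbar

# Steps 2-3: across repeated experiments the estimator's distribution is
# (approximately) Normal(beta1_true, sigma^2 / sxx); sigma^2 is estimated
# from the residuals.
resid = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
s2 = sum(r ** 2 for r in resid) / (n - 2)
se_beta1 = math.sqrt(s2 / sxx)

# Step 4: two-sided p-value for H0: beta1 = 0 from the cdf of that
# distribution (normal approximation to the t with n-2 dof).
def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

t_stat = beta1 / se_beta1
p_value = 2 * (1 - norm_cdf(abs(t_stat)))
```

With a true slope of 0.5 and n = 200, the p-value comes out tiny, as expected.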
There are a few things that surprise me: assumptions that I think are necessary but not clearly stated.
A. I don’t think the model actually tells us how to generate new observations. In the case of linear regression, we add \epsilon to y (or to \hat y, since y - \hat y is in the kernel of the projection on \beta). However, I fail to see the meaning of this and just view it as a symbolic manipulation that abuses the = and ~ signs. Basically, the equations give us a likelihood on the observations but not a means to generate new data. An illustration of this is that there are other meaningful ways to generate new data, e.g. bootstrapping. Hence, I think the means of generating new samples should be added as an assumption of the model.
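To illustrate point A, here are two defensible generators for "new data" from the same fitted simple regression — the likelihood alone does not pick between them (pure-Python sketch on made-up data):

```python
import random

random.seed(2)

# Hypothetical data from y = 2*x + noise.
n = 50
x = [random.uniform(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 0.5) for xi in x]

# Fit by OLS (simple regression, closed form).
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]
sigma_hat = (sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) / (n - 2)) ** 0.5

# (a) Parametric generator: X held fixed, fresh noise around the fit.
y_param = [yh + random.gauss(0, sigma_hat) for yh in yhat]

# (b) Pairs bootstrap: resample whole (x, y) rows with replacement;
# no noise model is assumed at all.
idx = [random.randrange(n) for _ in range(n)]
x_boot = [x[i] for i in idx]
y_boot = [y[i] for i in idx]
```

Both produce a plausible "repeated experiment", yet they encode different assumptions (fixed design and normal errors vs. resampled design and empirical errors), which is exactly why the choice of generator deserves to be stated explicitly.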
B. It is clear that the chosen generating process treats the parameters \beta and \sigma differently: \sigma is estimated only once (from the mean squared error of the MLE fit), whereas \beta is re-estimated from each generated y. I feel like there is a hidden assumption here. To be perfectly “fair”, we should find a distribution for \sigma that is a fixed point of the generation–estimation process, but I guess that is complicated. So my conclusion is that saying \sigma is fixed when repeating the experiment is another assumption of the model.
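A small numerical illustration of point B (pure Python, made-up data): if we freeze \sigma at its one-shot estimate but nevertheless re-estimate it in every replicate, the re-estimates scatter visibly around the frozen value — so "\sigma is fixed across repetitions" really is an extra assumption rather than a consequence of the model:

```python
import math, random

random.seed(3)

# Hypothetical data from y = 0.5*x + noise.
n = 40
x = [random.uniform(-1, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1.0) for xi in x]

def fit(xs, ys):
    """Return (slope, residual sd) of a simple OLS fit with intercept."""
    m = len(xs)
    xb, yb = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - xb) ** 2 for xi in xs)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / sxx
    b0 = yb - b1 * xb
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys)) / (m - 2)
    return b1, math.sqrt(s2)

b1_hat, sigma_hat = fit(x, y)  # sigma estimated once, then frozen

# Generate replicates with the frozen sigma_hat, but re-estimate sigma
# from each replicate anyway to see how much it would have moved.
sigmas = []
for _ in range(500):
    y_rep = [b1_hat * xi + random.gauss(0, sigma_hat) for xi in x]
    sigmas.append(fit(x, y_rep)[1])

spread = max(sigmas) - min(sigmas)  # nonzero: sigma is not naturally fixed
```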
From the previous points, there is actually an experiment that makes the distribution on \beta appear:
1. treat X, y and \sigma as fixed inputs
2. generate y' = y + \epsilon
3. compute \beta with MLE (this can be done in Stan using L-BFGS if I’m not mistaken)
4. sample from that distribution of \beta by repeating steps 2 and 3
- Do you agree that this method would produce the correct distribution?
- Do you think this approach can be generalized to other models? I think other models make assumptions of the same kind as A and B.
- Is the “composition” of Monte-Carlo sampling and MLE doable in Stan?
- Are there more efficient ways to achieve that with Stan? I’m not too familiar with all the possibilities offered by Stan, and I don’t see any obvious way to apply MCMC or variational inference here.
- EDIT: I noticed that the distribution of \beta used for p-values coincides with the limit of the Bayesian posterior when the posterior factors out the distribution of X and we make the prior more diffuse (which is related to the Bernstein–von Mises theorem). So we could probably just do posterior estimation with a very diffuse prior. Is this always the case with classical p-values? Is there a way to achieve that in Stan?
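As a numerical check of the proposed experiment (steps 1–4 above), here is a pure-Python sketch for simple regression — the closed-form OLS slope stands in for the L-BFGS step, and the data are made up. Since y - \hat y is orthogonal to X, refitting on y' = y + \epsilon gives \hat\beta' = \hat\beta + (X'X)^{-1}X'\epsilon, so the Monte Carlo spread of the slopes should match the classical standard error \hat\sigma/\sqrt{S_{xx}}:

```python
import math, random

random.seed(4)

# Hypothetical data from y = 1.0 + 0.5*x + noise.
n = 100
x = [random.uniform(-1, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1.0) for xi in x]

def ols_slope(xs, ys):
    """Return (slope, sum of squared x-deviations)."""
    m = len(xs)
    xb, yb = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - xb) ** 2 for xi in xs)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / sxx
    return b1, sxx

# Step 1: fixed inputs X, y and the one-shot sigma estimate.
b1, sxx = ols_slope(x, y)
b0 = sum(y) / n - b1 * sum(x) / n
sigma_hat = math.sqrt(
    sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

# Steps 2-4: generate y' = y + eps, refit the slope, repeat.
slopes = []
for _ in range(2000):
    y_prime = [yi + random.gauss(0, sigma_hat) for yi in y]
    slopes.append(ols_slope(x, y_prime)[0])

mean_b = sum(slopes) / len(slopes)
sd_b = math.sqrt(
    sum((b - mean_b) ** 2 for b in slopes) / (len(slopes) - 1))
se_classical = sigma_hat / math.sqrt(sxx)  # textbook standard error
```

In this run the Monte Carlo standard deviation of the refitted slopes agrees with the classical standard error to within Monte Carlo noise, and the replicate slopes are centered at the original \hat\beta, which is what the orthogonality argument predicts.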
Sorry for the long message!