So I’ve seen classic examples where, say, we have a binomial likelihood (a coin flip) and we want something like a N(.5,1) as a prior. (If we’re going for conjugacy, we want a beta-binomial, but that’s not what we’re discussing.)

So - I’m very fortunate to have come across Gabry et al’s “A visualization in …”.

I’m OK with posterior predictive checks. I’ve made blatantly obvious mistakes in the past, e.g. in my “applied GPs in Stan” post. But I want to formalize the idea of a prior predictive check, especially when the likelihood is unknown. The paper gives me visualizations with no code, so it’s hard for me to formalize the idea.

Take the coin-flip example. I have a binomial random variable. What’s p? I don’t know. Is my best guess still N(.5,1), or should I estimate my prior from the data? What if the coin is bullshit, say a binomial with p=.0000000001, and I’ve guessed .5? I can run a simulation to show what my honest guesses do, but I want a more general answer that covers different likelihood functions. How much weight is the prior carrying?
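One way to make the coin-flip question concrete is to simulate from the prior predictive: draw p from the prior, then draw flips given p, with no observed data involved. The sketch below is a minimal illustration in Python with NumPy, not the code behind the paper. The function names, the two Beta priors, and the choice of 20 flips are assumptions made for illustration. It also uses the conjugate Beta-binomial update to show how much weight the prior carries: the posterior mean is (a + heads) / (a + b + n), so the prior acts like a + b pseudo-observations.

```python
import numpy as np

def prior_predictive_heads(a, b, n_flips, n_sims, rng):
    """Draw p ~ Beta(a, b), then heads ~ Binomial(n_flips, p); no data used."""
    p = rng.beta(a, b, size=n_sims)
    return rng.binomial(n_flips, p)

def posterior_mean(a, b, heads, n_flips):
    """Conjugate update: the posterior is Beta(a + heads, b + n_flips - heads)."""
    return (a + heads) / (a + b + n_flips)

rng = np.random.default_rng(0)
n_flips = 20

# Compare a flat prior with one concentrated near p = 0.5 (both hypothetical).
for a, b in [(1.0, 1.0), (50.0, 50.0)]:
    sims = prior_predictive_heads(a, b, n_flips, n_sims=10_000, rng=rng)
    lo, hi = np.percentile(sims, [5, 95])
    # Suppose the coin is badly biased and we observe 0 heads in 20 flips.
    post = posterior_mean(a, b, heads=0, n_flips=n_flips)
    print(f"Beta({a:g}, {b:g}): 90% prior predictive interval for heads "
          f"= [{lo:.0f}, {hi:.0f}], posterior mean after 0/20 heads = {post:.3f}")
```

With the concentrated prior, the posterior mean after 0 heads in 20 flips is 50/120, about 0.417, whereas the flat prior gives 1/22, about 0.045. If the prior predictive interval would never produce anything like 0 heads, that is a hint the prior may be carrying too much weight for a coin that could be heavily biased.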

Any papers/case studies/plain obvious examples I should look at?

FYI: posterior predictive checks, I’m all about them! I regularly reject models for unrealistic posterior predictive checks. The classical analogy is obvious extrapolation… same deal in machine learning and applied math…

For a prior predictive check, we specify only the groups and the number of observations, the priors, and the likelihood, with no data.
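That description can be sketched as a simulation in which only the group structure, the sample sizes, the priors, and the likelihood are fixed. The example below is a hypothetical hierarchical coin-flip model written for illustration; it is not taken from the paper or from rstanarm, and the prior scales `mu_sd` and `tau_scale` are assumed values.

```python
import numpy as np

def simulate_prior_predictive(n_groups, n_obs_per_group, n_sims, rng,
                              mu_sd=1.5, tau_scale=1.0):
    """Simulate grouped binomial counts from the priors alone (no observed data).

    Hypothetical model, on the log-odds scale:
        mu      ~ Normal(0, mu_sd)
        tau     ~ HalfNormal(tau_scale)
        theta_j ~ Normal(mu, tau)            for each group j
        y_j     ~ Binomial(n_obs_per_group, inv_logit(theta_j))
    """
    mu = rng.normal(0.0, mu_sd, size=(n_sims, 1))
    tau = np.abs(rng.normal(0.0, tau_scale, size=(n_sims, 1)))
    theta = rng.normal(mu, tau, size=(n_sims, n_groups))
    p = 1.0 / (1.0 + np.exp(-theta))
    return rng.binomial(n_obs_per_group, p)

rng = np.random.default_rng(1)
y_sim = simulate_prior_predictive(n_groups=8, n_obs_per_group=30,
                                  n_sims=4000, rng=rng)

# Summarize the simulated data: how often do the priors imply groups whose
# observed proportion is essentially 0 or 1?
prop = y_sim / 30
extreme = np.mean((prop < 0.01) | (prop > 0.99))
print("simulated counts shape:", y_sim.shape)
print("share of simulated group proportions at the extremes:", extreme)
```

Plotting histograms of `y_sim` (or of summaries such as the share of extreme proportions) and asking whether they look plausible for the problem at hand is the check itself; the numbers printed here depend entirely on the assumed prior scales.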

With packages like rstanarm, is there a way to easily simulate from the prior predictive, or do I need to dump the code out and recreate what’s done in Gabry’s Bayes-Vis paper?

I’m looking at posterior_vs_prior in rstanarm, but it doesn’t look like it generates observations from the prior predictive.

Am I missing something? Did I not dig enough into the code?

The prior reflects your uncertainty before seeing the data. N(.5,1) seems hard to justify, as theta should at least live within the (0,1) interval. Beta(a, b) is more natural in terms of conjugacy. After all, it is a trivial exponential family.

In decision theory you can ask for a non-informative prior, as these are also connected to minimax. In this case the non-informative prior gives you 1/(p(1-p)), which can be approximated by beta(epsilon, epsilon). It is a flat prior on the logit scale, log(p / (1−p)).

Again, this reminds me of the merit of a boundary-avoiding prior: an otherwise perfect beta(0, 0) has exactly the opposite effect. A boundary-avoiding prior in logit space can be a boundary-embracing prior in p space.

On the other hand, if I put a flat prior on p itself, it becomes a boundary-avoiding prior on the logit scale. This seems even more appropriate, as Stan samples only in the unconstrained space.
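The change of variables behind these claims can be checked numerically. If x = logit(p), the density of x is the density of p times the Jacobian p(1 - p), so a Beta(a, b) prior on p becomes p^a (1 - p)^b / B(a, b) on the logit scale. The sketch below is an illustration only; the function name and the evaluation points are assumptions.

```python
import numpy as np
from math import lgamma

def log_density_on_logit_scale(x, a, b):
    """Log density of x = logit(p) when p ~ Beta(a, b).

    Beta density times the Jacobian dp/dx = p(1 - p) gives
    p^a (1 - p)^b / B(a, b).
    """
    x = np.asarray(x, dtype=float)
    log_p = -np.logaddexp(0.0, -x)     # log(inv_logit(x))
    log_1mp = -np.logaddexp(0.0, x)    # log(1 - inv_logit(x))
    log_beta_fn = lgamma(a) + lgamma(b) - lgamma(a + b)
    return a * log_p + b * log_1mp - log_beta_fn

x = np.array([0.0, 5.0, 10.0])
for a in (1.0, 0.01):
    logd = log_density_on_logit_scale(x, a, a)
    # Report the drop in log density relative to the centre (x = 0).
    print(f"Beta({a:g}, {a:g}) on the logit scale, log density minus value at 0:",
          np.round(logd - logd[0], 3))
```

For Beta(1, 1), the flat prior on p, the log density on the logit scale falls off roughly linearly in |x| (it is the logistic density), which is the boundary-avoiding behaviour. For Beta(0.01, 0.01) it barely changes over the same range, consistent with a nearly flat prior on the logit scale.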

It is like optimization, where I can add epsilon*Identity to remedy degeneracy. Here a boundary-avoiding prior adds log-concavity.

Finally, beyond the benefit of smoother sampling, a boundary-avoiding prior is dangerous: if there is an actual prior-data conflict, I will miss it completely. I believe this is why prior predictive checking is being emphasized.