How can software such as JAGS or Stan estimate the posterior density by MCMC from a DAG?

Good morning.

I guess my question goes beyond the specifics of Stan and relates more to the core principle of using causal diagrams to model a phenomenon. How can software such as JAGS or Stan estimate the posterior density by MCMC from a DAG?

By this I mean that, for the Metropolis-Hastings algorithm to work, you need to know the analytical form of the target density (the posterior, in the Bayesian framework that interests us here) up to a constant. Indeed, if I want to sample from a target distribution p that is known only up to a constant cst, i.e. p = cst*p_tilde, Metropolis-Hastings lets me do so with a proposal density q, whose proposals I accept with a probability that depends on p_tilde and q. But you need to know p_tilde to be able to sample from p.
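To make the point about p_tilde concrete, here is a minimal random-walk Metropolis sketch in Python. The target (a standard normal with its normalizing constant dropped), the function names, and the step size are illustrative assumptions, not anything JAGS or Stan does internally. With a symmetric proposal q, the q terms cancel and only the ratio of p_tilde values remains, so cst is never needed.

```python
import math
import random

def log_p_tilde(x):
    # Unnormalized log density (assumed example): a standard normal
    # with the 1/sqrt(2*pi) constant deliberately left out.
    return -0.5 * x * x

def metropolis(log_target, x0, n_steps, step=2.0, seed=0):
    rng = random.Random(seed)
    x = x0
    lp = log_target(x)
    draws = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposal, so q cancels in the acceptance ratio.
        proposal = x + rng.gauss(0.0, step)
        lp_prop = log_target(proposal)
        # Accept with probability min(1, p_tilde(proposal) / p_tilde(x)).
        # 1 - random() lies in (0, 1], which keeps log() well defined.
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            x, lp = proposal, lp_prop
        draws.append(x)
    return draws

draws = metropolis(log_p_tilde, x0=0.0, n_steps=20000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)  # should be close to 0 and 1 respectively
```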

Now, when I use JAGS or Stan, I specify a DAG with distributions relating my parameters, I press enter, and MAGIC, I get a posterior distribution. This means that the software has calculated p_tilde at some point. But how?

For software that fits regression coefficients using MCMC, such as brms, we actually have an analytical expression for the posterior density of the GLM parameters, so in that case I think I understand how it works. The expressions of the standard distributions are coded (depending on whether you’re doing Poisson, logistic or normal regression), and once they’ve been multiplied together (GLM independence assumption) and with the prior densities, which are also standard, you get the joint density and hence p_tilde. As shown in the Molenberghs and Verbeke book, we have a known expression:

[Image of the expression, not reproduced]

Then I guess brms automatically translates the model into Stan on the back end, and then Metropolis-Hastings produces chains whose distribution is the posterior we wanted to estimate.

But if we draw a diagram whose vertices are variables, connected by arrows to which we assign deterministic or stochastic relationships, and we put priors on the hyperparameters, how can that work, since the software has no analytical form of the posterior density from which to obtain p_tilde?

Thank you.

You can read this directly off a directed graphical model specification. Each variable gets a density defined conditional on its parents in the graph, and the entire joint density just follows from the chain rule. For example, when I write \mu \sim \textrm{normal}(0, 3), I include a factor of \textrm{normal}(\mu \mid 0, 3) in the posterior density (though we work on the log scale, where \log \textrm{normal}(\mu \mid 0, 3) is a term in the log density). That gives you the joint density, which by Bayes’s rule is proportional to the posterior. Stan does this translation literally line by line. [Aside: Stan, unlike BUGS (or JAGS), is not restricted to directed graphical model specifications.] BUGS, on the other hand, just computes the conditional density of each parameter given everything else, which it can do by extracting the Markov blanket of each variable and conditioning (there’s not enough room here to explain, but that’s enough to find it in a web search).
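As a rough illustration of that line-by-line translation, here is a Python sketch (not Stan code, and not how Stan is implemented). The two-node model, the made-up data, and the names `normal_lpdf`, `log_joint` and `target` are assumptions chosen for the example. Each model statement adds one term to the log density, and the result differs from the known conjugate posterior only by a constant in \mu.

```python
import math

def normal_lpdf(x, mean, sd):
    # Log density of normal(mean, sd) evaluated at x.
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def log_joint(mu, y):
    # One term per statement in the (assumed) model.
    target = 0.0
    target += normal_lpdf(mu, 0.0, 3.0)        # mu ~ normal(0, 3)
    for y_i in y:
        target += normal_lpdf(y_i, mu, 1.0)    # y_i ~ normal(mu, 1)
    return target

y = [1.2, 0.4, 2.1]  # made-up data

# For this conjugate model the exact posterior is normal with
# precision 1/9 + n and mean sum(y) / precision.
post_prec = 1.0 / 9.0 + len(y)
post_mean = sum(y) / post_prec
post_sd = post_prec ** -0.5

# log joint minus log posterior should be the same number for every mu:
# that number is the log normalizing constant, which MCMC never needs.
diffs = [log_joint(mu, y) - normal_lpdf(mu, post_mean, post_sd)
         for mu in (-1.0, 0.0, 0.5, 2.0)]
print(diffs)
```

The model here is simple enough to have a closed-form posterior only so that the claim can be checked; the same accumulation works when no closed form exists.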

For HMC, we calculate \log p(\theta \mid y) up to an additive constant that doesn’t depend on \theta, where \theta are parameters and y data. Then we differentiate it and use the gradient to simulate fictional Hamiltonian dynamics for a proposal. [Aside: Stan doesn’t use Metropolis-Hastings. It uses the revised no-U-turn sampler, which replaces the MH accept/reject step with a multinomial draw over the trajectory (it’s literally the Barker proposal on two states, if you want to look up its relation to Metropolis).]
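To show how the log density and its gradient drive the simulated dynamics, here is a deliberately simplified Python sketch. It is plain HMC with a fixed number of leapfrog steps and a Metropolis accept step, not the no-U-turn sampler that Stan uses. The model, data, step size and function names are assumptions, and the gradient is written by hand where Stan would use automatic differentiation.

```python
import math
import random

Y = [1.2, 0.4, 2.1]  # made-up data

def log_p(mu):
    # Log posterior up to a constant: mu ~ normal(0, 3), y_i ~ normal(mu, 1).
    return -0.5 * (mu / 3.0) ** 2 - 0.5 * sum((y - mu) ** 2 for y in Y)

def grad_log_p(mu):
    # Hand-written derivative of log_p with respect to mu.
    return -mu / 9.0 + sum(y - mu for y in Y)

def leapfrog(mu, r, eps, n_leap):
    # Half step for momentum, alternating full steps, final half step.
    r += 0.5 * eps * grad_log_p(mu)
    for i in range(n_leap):
        mu += eps * r
        if i < n_leap - 1:
            r += eps * grad_log_p(mu)
    r += 0.5 * eps * grad_log_p(mu)
    return mu, r

def hmc(n_draws, eps=0.1, n_leap=20, seed=1):
    rng = random.Random(seed)
    mu = 0.0
    draws = []
    for _ in range(n_draws):
        r0 = rng.gauss(0.0, 1.0)  # fresh momentum each iteration
        mu_new, r_new = leapfrog(mu, r0, eps, n_leap)
        # Difference in (negative) Hamiltonian between end and start.
        log_accept = (log_p(mu_new) - 0.5 * r_new ** 2) - (log_p(mu) - 0.5 * r0 ** 2)
        if math.log(1.0 - rng.random()) < log_accept:
            mu = mu_new
        draws.append(mu)
    return draws

draws = hmc(5000)
post_mean = sum(Y) / (1.0 / 9.0 + len(Y))  # exact mean for this conjugate model
print(sum(draws) / len(draws), post_mean)
```

Note that only `log_p` and `grad_log_p` are needed: the normalizing constant never appears.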

It’s direct. Every circle representing a random variable gets translated into a factor in the density (a term in the log density), and together they give you what you need.


Thank you very much for this answer. I will look for further details on HMC, the no-U-turn sampler and the Barker proposal. About knowing the posterior density up to a constant from a directed graphical model specification, I agree that it is pretty direct with Bayes’s rule for simple graphs, but aren’t there any limitations to it? What guarantees that any DAG, even one as complex as this [image], will give a well-defined posterior density up to a constant?

Also, to what extent am I allowed to use deterministic relationships between vertices, such as \text{GRE} = \arctan(0.5 \log(\text{Needs camp}))^{3} + 2 \cdot \text{other variable}? Does it break Bayes’s rule for the vertices that follow?

Section 18.2 of this intro to DAGs gives the general form of the translation.
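For reference, the standard textbook factorization can be written as follows. The notation is assumed here rather than taken from the linked text: V is the set of stochastic nodes, \mathrm{pa}(v) the parents of v, \mathrm{ch}(v) its children, \theta the unobserved nodes and y the observed ones.

```latex
% Joint density: one factor per stochastic node, conditional on its parents.
p(\theta, y) = \prod_{v \in V} p\bigl(v \mid \mathrm{pa}(v)\bigr)

% Posterior: the joint with y fixed at the data, up to the constant p(y).
p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} \propto p(\theta, y)

% Full conditional of one node (Gibbs-style): only the factors that
% mention v survive, i.e. its own factor and those of its children.
p\bigl(v \mid V \setminus \{v\}\bigr) \propto
  p\bigl(v \mid \mathrm{pa}(v)\bigr)
  \prod_{c \in \mathrm{ch}(v)} p\bigl(c \mid \mathrm{pa}(c)\bigr)
```

Because the graph is acyclic, the nodes can be ordered so that parents come before children, and the product is then just the chain rule of probability. That holds however large the graph is, provided each conditional density is itself well defined.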

The short answer is that Stan is just an imperative programming language for defining a log density, with implicit change-of-variables corrections for constrained parameters. You can read the Stan system paper or the Stan Reference Manual for a spec.

Anywhere you want, as long as the result is a DAG. You have to be careful if you transform a parameter and then give the transformed variable a distribution: it won’t be right unless you account for the change of variables.
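Here is a small numerical sketch of that change-of-variables caveat, in Python. The setup is an assumed example: a positive parameter \sigma whose logarithm is given a normal(0, 1) distribution. The density on \sigma needs the Jacobian factor |d \log \sigma / d\sigma| = 1/\sigma; the function names and the crude midpoint integration are only for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def density_without_jacobian(sigma):
    # Naive: plug log(sigma) into the normal density and stop there.
    return normal_pdf(math.log(sigma), 0.0, 1.0)

def density_with_jacobian(sigma):
    # Correct density on sigma: multiply by |d log(sigma) / d sigma| = 1 / sigma.
    return normal_pdf(math.log(sigma), 0.0, 1.0) * abs(1.0 / sigma)

def integrate(f, lo, hi, n):
    # Midpoint rule; evaluation points never hit the endpoints, so log(0) is avoided.
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

mass_with = integrate(density_with_jacobian, 0.0, 200.0, 200_000)
mass_without = integrate(density_without_jacobian, 0.0, 200.0, 200_000)
print(mass_with)     # approximately 1: a proper density on sigma
print(mass_without)  # approximately exp(0.5), not 1
```

The total mass being off is not the real problem, since a sampler ignores constants. The two functions differ by the factor \sigma, which is not constant, so without the Jacobian the sampler would target a different distribution from the one intended.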

Thank you for your answers. I’ll work on understanding it!