Iterating between sampling a hierarchical parameter and a model parameter

I would like to sample from the following hierarchical model:
p(\beta, m | y) \propto k(\beta | m, y) \pi(m)
where p is the posterior density, k is a kernel (an unnormalized density), and \pi is the prior for m (e.g., a beta prior).

The problem is that the normalizing constant c(m, y) = \int k(\beta | m, y) d\beta, is unknown, so I cannot write the log joint posterior density. I can, however, sample from \pi(m) and, for each m, sample from p(\beta | m, y) via MCMC.

Is there an efficient way to do this in Stan? I have already written some code to sample from the unnormalized kernel, I only need a way to iterate between drawing a sample from \pi(m) and then drawing one MCMC sample from p(\beta | m, y), until a desired number of samples is obtained.

As far as I know, the normalizing constant is not needed to draw samples from a posterior.

See the section Up to a proportion calculations, page 28 of the following paper :
in which the authors point out that, for example, if the model is declared as y ~ normal(0, sigma), then the constant term is dropped. On the other hand, with target += normal_lpdf(y | 0, sigma), the constant term is kept.

To me, it is not clear what the equation p(β,m|y)∝k(β|m,y)π(m) means, and so I am not sure that dropping the term c(m,y)=∫k(β|m,y)dβ is appropriate in this context. :’-D

The only thing I can say for certain is that the term c(m,y)=∫k(β|m,y)dβ is required to compute exact log-likelihood values. :)

Note that for each m, y, if you can draw samples \beta_1, \cdots, \beta_I from a known proposal density q(\beta), then the term c(m,y)=∫k(β|m,y)dβ can be approximated by the importance-sampling average \frac{1}{I} \Sigma_{i=1}^{I} k(\beta_i|m,y)/q(\beta_i).
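That importance-sampling average can be sketched numerically. A minimal Python sketch, assuming a toy one-dimensional Gaussian kernel (so the true value of c is known to be sqrt(2*pi)) and a uniform proposal; all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy kernel k(beta) = exp(-beta^2 / 2); its true integral is sqrt(2 * pi).
def k(beta):
    return np.exp(-0.5 * beta**2)

# Importance sampling: draw beta_i from a known proposal q, then
# c = integral of k ~= (1/I) * sum_i k(beta_i) / q(beta_i).
I = 200_000
beta = rng.uniform(-10.0, 10.0, size=I)  # proposal: Uniform(-10, 10)
q = 1.0 / 20.0                           # its (constant) density
c_hat = np.mean(k(beta) / q)

print(c_hat)  # close to sqrt(2 * pi) ~ 2.5066
```

Note that plain MCMC draws from the kernel itself are not enough for this estimate; the proposal density q must be known exactly.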


Thanks for the response!

Regarding the example you supplied: the normalizing constant does not matter for normal models where the mean parameter is hierarchical, because the normalizing constant depends only on the variance.

You are correct that I dropped the normalizing constant; for clarity, I should not have done that. What I mean is basically:
p(\beta | m, y) \propto k(\beta | m, y) and c(m, y) = \int k(\beta | m, y) d\beta and
p(m | y) = \pi(m)

Thus, the joint posterior is
p(\beta, m | y) = \frac{1}{c(m, y)} k(\beta | m, y) \pi(m)

Since the normalizing constant c(m, y) is unavailable, the only way I can think to sample from the joint posterior is to:

  1. Sample m from \pi(m) directly
  2. Sample \beta from p(\beta | m, y) via MCMC.

Thus, I would need to iterate between sampling m and getting a post-burn-in sample of size 1 from the conditional posterior with unnormalized kernel k(\beta | m, y). At least, this is the only way I can think of to sample from the joint posterior when the normalizing constant is unknown.

I essentially want to write a for loop like the following:

  for ( i in 1:M ) {
      draw one sample from pi(m) directly;
      draw one sample from p(beta | m, y) via Stan;
  }

but I was hoping to write the loop either within Stan or in C++ for speed.
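Outside of Stan, that loop can be sketched in plain Python. Everything here is a hypothetical stand-in: a Beta(2, 2) prior for m, a toy Gaussian kernel for k(beta | m, y), and a short inner Metropolis chain in place of a Stan run:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the real model:
def sample_prior_m():
    return rng.beta(2.0, 2.0)            # pi(m): Beta(2, 2)

def log_kernel(beta, m):
    return -0.5 * (beta - m) ** 2        # log k(beta | m, y): N(m, 1) kernel

M = 5_000       # number of (m, beta) draws wanted
warmup = 20     # inner Metropolis steps, so beta forgets its starting point
draws = np.empty((M, 2))
for i in range(M):
    m = sample_prior_m()                 # 1. draw m from pi(m) directly
    beta = m                             # start the inner chain at m
    for _ in range(warmup + 1):          # 2. Metropolis targeting k(. | m, y)
        prop = beta + rng.normal(0.0, 1.0)
        if np.log(rng.uniform()) < log_kernel(prop, m) - log_kernel(beta, m):
            beta = prop
    draws[i] = m, beta

print(draws.mean(axis=0))  # both means should be near 0.5
```

In practice the inner sampler would be a short Stan run conditioned on the current m, but the control flow is the same.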

  1. Using Stan, for any m, y, we can sample \beta_1, \cdots, \beta_I from k(\beta|m,y) via MCMC. With a known proposal density, these samples also give a Monte Carlo (importance-sampling) approximation \hat{c}(m,y) of c(m,y).
  2. Write code for the function k(\beta|m,y)/\hat{c}(m,y) in place of the unknown analytic representation of k(\beta|m,y)/c(m,y).
  3. Again using Stan, with the code for k(\beta|m,y)/\hat{c}(m,y), we can draw the desired samples from p(\beta|m,y) via MCMC. (For each sampling from p(\beta|m,y), Stan also runs to calculate \hat{c}(m,y).)

It sounds tough. Is there any good way??

In R, for example, there is the function integrate(). Using such a function, I guess we can calculate c(m,y).
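The SciPy analogue is scipy.integrate.quad. A hedged sketch with a toy 1-D Gaussian kernel, for which c(m) = sqrt(2*pi) regardless of m:

```python
import numpy as np
from scipy.integrate import quad

# Toy 1-D kernel k(beta | m) = exp(-(beta - m)^2 / 2).
def kernel(beta, m):
    return np.exp(-0.5 * (beta - m) ** 2)

def c(m):
    val, _err = quad(kernel, -np.inf, np.inf, args=(m,))
    return val

print(c(0.3))  # sqrt(2 * pi) ~ 2.5066
```

This only scales to low-dimensional beta; quadrature in many dimensions is impractical.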


As @Jean_Billie noted, the dynamic Hamiltonian Monte Carlo sampler in Stan needs to evaluate only an unnormalized version of the target posterior density function. In fact, Stan programs don’t even specify the posterior density function but rather the joint density function

\pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta)

which then becomes proportional to the posterior density upon evaluating the observed data,

\pi(\tilde{y}, \theta) \propto \pi(\theta \mid \tilde{y}).

For more see An Introduction to Stan.

In your case all you need to do is write a Stan program that implements

\pi(y, \beta, m) = \pi(y \mid \beta) \, \pi(\beta \mid m) \, \pi(m)

in the model block and specifies values for y in the data or transformed data blocks.
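As a concrete toy instance of that factorization (the distribution choices here are illustrative, not from the original question: a normal likelihood, a normal conditional prior on \beta, and a Beta prior on m), the unnormalized log joint is just a sum of log densities:

```python
import numpy as np
from scipy import stats

y = np.array([0.1, -0.4, 0.7])  # toy observed data

# log pi(y, beta, m) = log pi(y | beta) + log pi(beta | m) + log pi(m)
def log_joint(beta, m):
    ll = stats.norm.logpdf(y, loc=beta, scale=1.0).sum()  # pi(y | beta)
    lp_beta = stats.norm.logpdf(beta, loc=m, scale=1.0)   # pi(beta | m)
    lp_m = stats.beta.logpdf(m, 2.0, 2.0)                 # pi(m)
    return ll + lp_beta + lp_m

print(log_joint(0.0, 0.5))
```

This sum is exactly what the model block of the corresponding Stan program accumulates, up to additive constants.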

For examples of hierarchical modeling in Stan see also Hierarchical Modeling.


Thanks for the response. Let me clarify my issue.

I only know the kernel of \pi(\beta | m) = \frac{1}{c(m)} k(\beta | m), where c(m) = \int k(\beta | m) d\beta

That is, c(m), the normalizing constant that makes the density integrate to 1, is unknown. The joint posterior density is

p(\beta , m | y) \propto L(\beta | y) \pi(\beta | m) \pi(m) = L(\beta | y) \frac{k(\beta | m)}{c(m)} \pi(m)

I am not sure how I can write the log joint posterior density when the function c(m) is unknown.


It would be useful to see what exactly your model is – is m one dimensional or many dimensional? What is k(\beta \mid m)? etc. If the unnormalised posterior depends on an unknown normalising constant then you have a doubly intractable target, and you need to do something else. If m is 1-D and k(\beta \mid m) is smooth/well behaved then it might be possible to use integrate_1d to obtain c(m).


Thanks a lot for that reference! m and \beta are both multi-dimensional. It sounds like I will have to go with one of the options in that review paper.

While it is a bit hard to say without knowing more about the model, I think this might not always be the case. I am, however, far from an expert on this, so please check my reasoning.

In some way, Stan is a multidimensional integrator and will - in principle, if not always in practice - happily integrate additional nuisance dimensions for you. To take a simple example, assume the following model:

p(\mu,\phi | y) \propto \int_0^\infty \frac{1}{y!} \, \lambda^y \, \exp(-\lambda) \, \frac{\left(\frac{\phi}{\mu}\right)^{\phi}}{\Gamma(\phi)} \, \lambda^{\phi - 1} \exp\left(-\frac{\phi}{\mu} \, \lambda\right) \mathrm{d}\lambda

This happens to be a model obtained by integrating out \lambda from:

y \sim \mathrm{Poisson}(\lambda) \\ \lambda \sim \mathrm{Gamma}\left(\phi, \frac{\phi}{\mu}\right)

In this case, it happens that we can solve the integral analytically and the model is equivalent to

y \sim \mathrm{NegBinom}_2(\mu, \phi)

however, if we weren’t able to solve the integral, we could write a Stan model with \lambda as a latent variable:

data {
   int<lower=0> y;
}
parameters {
   real<lower=0> lambda;
   real<lower=0> mu;
   real<lower=0> phi;
}
model {
   y ~ poisson(lambda);
   lambda ~ gamma(phi, phi / mu);
}

And this can be a well behaved model, although computationally more demanding.

Note that in the process we are:

  • Implicitly solving an integral
  • Avoiding the computation of some constants (both the Poisson and Gamma densities are now needed only up to constants).

In this particular case this doesn’t help us get rid of anything interesting, especially because the Gamma density has no intractable constant terms. But I think that if (and this is a big “if”) you can reformulate your model with suitable nuisance variables, Stan may compute c(m) implicitly on your behalf.
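The Poisson-Gamma/NegBinom_2 equivalence above is easy to check by simulation; a sketch in Python, using the fact that NegBinom_2(\mu, \phi) has mean \mu and variance \mu + \mu^2/\phi:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, phi, n = 4.0, 2.0, 400_000

# Marginalize lambda by simulation:
# lambda ~ Gamma(shape = phi, rate = phi / mu), then y | lambda ~ Poisson(lambda).
lam = rng.gamma(shape=phi, scale=mu / phi, size=n)  # NumPy uses scale = 1 / rate
y = rng.poisson(lam)

# NegBinom2(mu, phi) has mean mu and variance mu + mu^2 / phi.
print(y.mean())  # ~ 4.0
print(y.var())   # ~ 12.0
```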

So just a thought - not sure if that will help in your particular case.

If c(m) is unknown then you can’t write down the posterior density function. That said, the cases in which the normalization depends on parameters but can’t be worked out are somewhat rare, and, as others have said, it would help to see your application more precisely.

Note that all Stan needs is something proportional to the joint density \pi(\beta, m), which is sometimes easier to work out than the conditional \pi(\beta \mid m) and marginal \pi(m). In other words don’t condition if you don’t need to.

If you only have the conditional and the marginal then you have to consider the “doubly intractable” methods that @hhau mentioned. These methods in some sense try to find some joint \pi(\beta, m) compatible with the given unnormalized conditional and marginal density functions, but those joints often have awkward geometries that can be challenging to fit.
