Iterating between sampling a hierarchical parameter and a model parameter

I would like to sample from the following hierarchical model:
p(\beta, m | y) \propto k(\beta | m, y) \pi(m)
where p is the posterior density, k is a kernel (an unnormalized density), and \pi is the prior for m (e.g., a beta prior).

The problem is that the normalizing constant c(m, y) = \int k(\beta | m, y) \, d\beta is unknown, so I cannot write down the log joint posterior density. I can, however, sample from \pi(m) and, for each m, sample from p(\beta | m, y) via MCMC.

Is there an efficient way to do this in Stan? I have already written some code to sample from the unnormalized kernel; I only need a way to iterate between drawing a sample from \pi(m) and then drawing one MCMC sample from p(\beta | m, y), until a desired number of samples is obtained.

As far as I know, the constant term is not needed to draw samples from a posterior.

See the section "Up to a proportion calculations" on page 28 of the following paper: http://www.stat.columbia.edu/~gelman/research/unpublished/stan-resubmit-JSS1293.pdf
in which the authors explain that, for example, if the model is declared as `y ~ normal(0, sigma)` then the constant term is dropped. On the other hand, with `target += normal_lpdf(y | 0, sigma)`, the constant term is not dropped.
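For illustration, here is a minimal complete program showing both forms (the normal model is just a stand-in):

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0> sigma;
}
model {
  // sampling-statement form: additive constants such as -0.5 * log(2 * pi()) are dropped
  y ~ normal(0, sigma);
  // target-increment form keeps the constants; use this line instead to retain them:
  // target += normal_lpdf(y | 0, sigma);
}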

To me, it is not clear what the equation p(\beta, m | y) \propto k(\beta | m, y) \pi(m) means, so I am also not sure that dropping the term c(m, y) = \int k(\beta | m, y) \, d\beta is appropriate in this context. :'-D

The only thing I can say for certain is that the term c(m, y) = \int k(\beta | m, y) \, d\beta is required for precise log-likelihood values. :)

Note that, for each m, y, if you can draw samples \beta_1, \cdots, \beta_I from a proposal density q(\beta), then the term c(m, y) = \int k(\beta | m, y) \, d\beta can be approximated by the importance-sampling average \frac{1}{I} \sum_{i=1}^{I} k(\beta_i | m, y) / q(\beta_i).
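To spell out why this works (it is just standard importance sampling, for any proposal density q that covers the support of k):

c(m, y) = \int k(\beta | m, y) \, d\beta = \int \frac{k(\beta | m, y)}{q(\beta)} \, q(\beta) \, d\beta \approx \frac{1}{I} \sum_{i=1}^{I} \frac{k(\beta_i | m, y)}{q(\beta_i)}, \qquad \beta_i \sim q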


Thanks for the response!

Regarding the example you supplied, the normalizing constant does not matter for normal models where the mean parameter is hierarchical because the normalizing constant only depends on the variance.

You are correct regarding dropping the term for the normalizing constant; for clarity, I should not have done that. What I mean is basically:
p(\beta | m, y) \propto k(\beta | m, y), \qquad c(m, y) = \int k(\beta | m, y) \, d\beta, \qquad p(m | y) = \pi(m)

Thus, the joint posterior is
p(\beta, m | y) = \frac{1}{c(m, y)} k(\beta | m, y) \pi(m)

Since the normalizing constant c(m, y) is unavailable, the only way I can think to sample from the joint posterior is to:

  1. Sample m from \pi(m) directly
  2. Sample \beta from p(\beta | m, y) via MCMC.

Thus, I would need to iterate between sampling m and obtaining a post-warmup sample of size 1 from the unnormalized conditional posterior k(\beta | m, y). At least, this is the only way I can think of to sample from the joint posterior when the normalizing constant is unknown.

I essentially want to write a for loop like the following (sketched here in R with rstan; `kernel_model` stands for a compiled Stan program implementing the unnormalized kernel, `beta` is assumed to be a vector parameter, and `a0`, `b0`, and `beta_draws` are placeholders):

  for (i in 1:M) {
    m <- rbeta(1, a0, b0)  # draw m directly from its beta(a0, b0) prior
    fit <- rstan::sampling(kernel_model, data = list(m = m, y = y),
                           chains = 1, iter = 1001, warmup = 1000, refresh = 0)
    beta_draws[i, ] <- rstan::extract(fit, pars = "beta")[[1]][1, ]  # keep the single post-warmup draw
  }

but I was hoping to write the loop itself within Stan or in C++ for speed.

  1. For any m, y, draw \beta_1, \cdots, \beta_I from a proposal density q(\beta) (or reuse MCMC draws via an importance-sampling scheme). These give the approximation c(m, y) \approx \frac{1}{I} \sum_i k(\beta_i | m, y) / q(\beta_i).
  2. Write code for the function k(\beta | m, y) / \hat{c}(m, y), using this approximation in place of the unknown analytic representation of k(\beta | m, y) / c(m, y).
  3. Again using Stan, with the code for k(\beta | m, y) / \hat{c}(m, y), we can draw the desired samples from p(\beta | m, y) via MCMC. (For each new value of m, the approximation of c(m, y) has to be recomputed.)

It sounds tough. Is there any good way?

In R, for example, there is the function integrate(). I guess that, using such a function, we could calculate c(m, y).


As @Jean_Billie noted, the dynamic Hamiltonian Monte Carlo sampler in Stan needs to evaluate only an unnormalized version of the target posterior density function. In fact, Stan programs don't even specify the posterior density function but rather the joint density function

\pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta)

which then becomes proportional to the posterior density upon evaluating the observed data,

\pi(\tilde{y}, \theta) \propto \pi(\theta \mid \tilde{y}).

For more see An Introduction to Stan.

In your case all you need to do is write a Stan program that implements

\pi(y, \beta, m) = \pi(y \mid \beta) \, \pi(\beta \mid m) \, \pi(m)

in the model block and specifies values for y in the data or transformed data blocks.
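For instance, here is a minimal sketch of such a program (the component densities below are placeholders, not your model):

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real beta;
  real<lower=0, upper=1> m;
}
model {
  target += beta_lpdf(m | 2, 2);  // pi(m): a beta prior; beta_lpdf is used because
                                  // the variable name 'beta' shadows the beta distribution
  beta ~ normal(m, 1);            // pi(beta | m): placeholder density
  y ~ normal(beta, 1);            // pi(y | beta): placeholder likelihood
}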

For examples of hierarchical modeling in Stan see also Hierarchical Modeling.


Thanks for the response. Let me clarify my issue.

I only know the kernel of \pi(\beta | m) = \frac{1}{c(m)} k(\beta | m), where c(m) = \int k(\beta | m) d\beta

That is, the normalizing constant c(m), the function that makes the density integrate to 1, is unknown. The joint posterior density is

p(\beta , m | y) \propto L(\beta | y) \pi(\beta | m) \pi(m) = L(\beta | y) \frac{k(\beta | m)}{c(m)} \pi(m)

I am not sure how I can write the log joint posterior density when the function c(m) is unknown.


It would be useful to see what exactly your model is: is m one-dimensional or multi-dimensional? What is k(\beta \mid m)? Etc. If the unnormalised posterior depends on an unknown normalising constant then you have a doubly intractable target, and you need to do something else. If m is 1-D and k(\beta \mid m) is smooth/well behaved then it might be possible to use integrate_1d to obtain c(m).
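For the 1-D case, a minimal sketch of that route might look like this, with a made-up kernel standing in for the real k(\beta \mid m):

functions {
  // hypothetical unnormalized log-kernel log k(beta | m); replace with the real one
  real log_k(real beta, real m) {
    return -0.5 * square(beta - m) + log1p(square(beta));
  }
  // integrand for integrate_1d, evaluated on the natural (not log) scale
  real k_integrand(real x, real xc, array[] real theta,
                   array[] real x_r, array[] int x_i) {
    return exp(log_k(x, theta[1]));
  }
}
data {
  real y;  // placeholder for the actual data
}
transformed data {
  array[0] real x_r;
  array[0] int x_i;
}
parameters {
  real beta;
  real<lower=0, upper=1> m;
}
model {
  // numerically compute c(m) = integral of k(beta | m) over the real line
  real c_m = integrate_1d(k_integrand, negative_infinity(),
                          positive_infinity(), {m}, x_r, x_i);
  target += beta_lpdf(m | 2, 2);        // pi(m): placeholder beta prior
  target += log_k(beta, m) - log(c_m);  // normalized log pi(beta | m)
  y ~ normal(beta, 1);                  // placeholder likelihood pi(y | beta)
}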


Thanks a lot for that reference! m and \beta are both multi-dimensional. It sounds like I will have to go with one of the options in that review paper.

While it is a bit hard to say without knowing more about the model, I think this might not always be the case. I am, however, far from an expert on this, so please check my reasoning.

In some ways, Stan is a multidimensional integrator and will, in principle if not always in practice, happily integrate out additional nuisance dimensions for you. To take a simple example, assume the following model:

p(\mu, \phi | y) \propto \int_0^\infty \frac{1}{y!} \, \lambda^y \, e^{-\lambda} \; \frac{(\phi / \mu)^{\phi}}{\Gamma(\phi)} \, \lambda^{\phi - 1} e^{-\frac{\phi}{\mu} \lambda} \, \mathrm{d}\lambda

This happens to be a model obtained by integrating out \lambda from:

y \sim \mathrm{Poisson}(\lambda) \\ \lambda \sim \mathrm{Gamma}\!\left(\phi, \tfrac{\phi}{\mu}\right)

In this case, it happens that we can solve the integral analytically and the model is equivalent to

y \sim \mathrm{NegBinom}_2(\mu, \phi)
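which in Stan is simply (a complete toy program for a single observation):

data {
  int<lower=0> y;
}
parameters {
  real<lower=0> mu;
  real<lower=0> phi;
}
model {
  y ~ neg_binomial_2(mu, phi);  // lambda has been integrated out analytically
}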

however, if we weren't able to solve the integral, we could write a Stan model with \lambda as a latent variable:

data {
  int<lower=0> y;  // a single Poisson count observation
}
parameters {
  real<lower=0> lambda;
  real<lower=0> mu;
  real<lower=0> phi;
}
model {
  y ~ poisson(lambda);
  lambda ~ gamma(phi, phi / mu);  // shape phi, rate phi / mu
}

And this can be a well-behaved model, although a computationally more demanding one.

Note that in the process we are:

  • Implicitly solving an integral
  • Avoiding computation of some normalizing constants (both the Poisson and gamma densities are now needed only up to proportionality).

In this particular case this doesn't help us get rid of anything interesting, especially because the gamma density has no problematic constant terms. But I think that if (and this is a big "if") you can reformulate your model with suitable nuisance variables, Stan may compute c(m) implicitly on your behalf.

So just a thought - not sure if that will help in your particular case.

If c(m) is unknown then you can't write down a posterior density function. That said, the cases in which the normalization depends on parameters but can't be worked out are somewhat rare, and as others have said it would help to see more precisely what your application is.

Note that all Stan needs is something proportional to the joint density \pi(\beta, m), which is sometimes easier to work out than the conditional \pi(\beta \mid m) and marginal \pi(m). In other words, don't condition if you don't need to.

If you only have the conditional and the marginal then you have to consider the "doubly intractable" methods that @hhau mentioned. These methods in some sense try to find some joint \pi(\beta, m) compatible with the given unnormalized conditional and marginal density functions, but those joints often have awkward geometries that can be challenging to fit.
