Expressing data-averaged posterior

Hi, I was wondering how the two following procedures, which differ in the order of applying g and gathering samples, can be reflected in the mathematical expression of the data-averaged posterior.

1. One computation, with y samples gathered from S datasets

2. S computations, gathering the theta samples from each

Here S is the number of samples from the simulation prior, g is the posterior sampling method, f is the likelihood, and \pi is the prior.
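To make the two orderings concrete, here is a minimal runnable sketch in Python; the conjugate normal model and the helper names (draw_prior, draw_data, sample_posterior) are just my illustrative stand-ins for \pi, f, and g:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in model (my choice, not part of the question):
# prior theta ~ Normal(0, 1), likelihood y ~ Normal(theta, 1), so the exact
# posterior given y_1, ..., y_n is Normal(sum(y) / (n + 1), 1 / (n + 1)).
def draw_prior():
    return rng.normal(0.0, 1.0)                      # theta-tilde ~ pi

def draw_data(theta, n=1):
    return rng.normal(theta, 1.0, size=n)            # y-tilde ~ f(. | theta-tilde)

def sample_posterior(y, num_draws=100):              # stand-in for g
    y = np.atleast_1d(y)
    return rng.normal(y.sum() / (y.size + 1.0),
                      np.sqrt(1.0 / (y.size + 1.0)), size=num_draws)

# Procedure 1: gather y from S datasets, then run the computation once.
def procedure_1(S):
    ys = np.concatenate([draw_data(draw_prior()) for _ in range(S)])
    return sample_posterior(ys)

# Procedure 2: run the computation on each dataset, then gather theta samples.
def procedure_2(S):
    return np.concatenate(
        [sample_posterior(draw_data(draw_prior())) for _ in range(S)])
```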

What procedure does

\int_{\tilde{\theta} \in \Theta} \int_{\tilde{y} \in \mathcal{Y}} g(\theta \mid \tilde{y}, f, \pi) f(\tilde{y} \mid \tilde{\theta}) \pi(\tilde{\theta}) d \tilde{y}d \tilde{\theta}

represent and what would be the other procedure’s expression? Thanks!

I think I’m going to need some detail. Right now I don’t think the integral as written makes sense, since g is not a measurable function with respect to the measure implied by f and \pi. If you could back up a bit and explain what you are trying to accomplish, I think that piece of background could be helpful.

g is a generalization of the true posterior from equation 1 in the Talts paper below,
\pi(\theta) = \int \mathrm{d}\tilde{y} \, \mathrm{d}\tilde{\theta} \, \pi(\theta \mid \tilde{y}) \, \pi(\tilde{y} \mid \tilde{\theta}) \, \pi(\tilde{\theta})
where the true posterior is \frac{f(y \mid \theta) \pi(\theta)}{\int f(y \mid \theta) \pi(\theta) \, d\theta}. When you are using a different approximate computation, we could think of an operator g (or g_t converging toward some operator, because I am attempting to iterate SBC until it converges).

Right. I still don’t understand what these mean.

Again, I do think it would be helpful for you to give a bit more context to the question.

I think 1 is “draw a dataset S times, combine these datasets, and then draw posterior samples given the combined y”, and 2 is “draw a dataset S times, then draw some samples from the posterior that arises from fitting the model to just the s^{th} dataset, and then combine these posterior samples”.

If this is right, note that in the limit of large S, (1) is simply “draw a very large sample from the prior predictive distribution and then fit the model”. Unless I’m overlooking something, the true posterior from such a method should converge in the large S limit to a point mass.

Hierarchical models provide a useful way to think about the difference between (2) and (1). Think of the parameters \theta as random effects, which differ randomly from realization to realization. (1) says that we’re going to estimate a single set of parameters from all of the data. (2) says that we’re going to estimate the full model, with \theta_i sampled from a random effect of “SBC iteration” whose hyperparameters are given by the prior, but then we’ll concatenate the posteriors for \theta_i over all of the iterations i. I think the true posterior distribution for this thing should recapitulate the prior; certainly it’s very similar to the SBC procedure itself. But somebody more advanced should verify.
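Here is a rough sketch of that check, with a conjugate normal model chosen purely for illustration (prior Normal(0, 1), likelihood Normal(\theta, 1), so both posteriors are available in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)

S = 2000          # number of prior draws / datasets
L = 50            # posterior draws kept per fit

# Illustrative conjugate model: theta ~ Normal(0, 1), y ~ Normal(theta, 1).
theta_tilde = rng.normal(0.0, 1.0, size=S)       # one parameter per dataset
y = rng.normal(theta_tilde, 1.0)                 # one observation per dataset

# (1) Pool all S observations and fit once: the posterior standard deviation
# is 1 / sqrt(S + 1), which shrinks toward a point mass as S grows.
pooled_sd = np.sqrt(1.0 / (S + 1.0))

# (2) Fit each dataset separately and concatenate the posterior draws.
post_mean = y / 2.0                              # posterior mean given a single y_s
post_sd = np.sqrt(0.5)                           # posterior sd given a single y_s
concatenated = rng.normal(post_mean, post_sd, size=(L, S)).ravel()

# The concatenated draws should look like the prior Normal(0, 1).
print(pooled_sd)                                 # ~0.02 for S = 2000
print(concatenated.mean(), concatenated.std())   # ~0 and ~1
```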

Thanks, this is correct. 2 is similar to the original SBC procedure, where one rank is returned for each dataset. Another way to describe the difference is through the data-averaged posterior, where the averaging happens at the outcome level (gather y from each dataset, then run the computation once) versus the parameter level (run the computation on each dataset, then gather the parameters).

On top of my original two questions below, the operator corresponding to the computation should be denoted g, not g_j, right? The map is deterministic, since the prior is its input.

When studying ensembles of samples one has to consider independent and identically distributed samples, i.e. product distributions. Formally, the distribution of N exact samples from the distribution specified by the density function \pi(x) is specified by the product density function

\pi(x_1, \ldots, x_N) = \pi(x_1) \cdot \ldots \cdot \pi(x_N) = \prod_{n = 1}^{N} \pi(x_{n}),

where each of the component density functions is the same.

The SBC™ method proposed in the paper looks at L posterior samples for each observation simulated from the prior predictive distribution. This corresponds to the joint distribution

\pi(\theta_1, \ldots, \theta_{L}, y, \theta') = \left[ \prod_{l = 1}^{L} \pi(\theta_{l} \mid y) \right] \pi(y \mid \theta') \, \pi(\theta').

From this joint distribution the SBC™ method marginalizes out y and then pushes forward the resulting distribution \pi(\theta_1, \ldots, \theta_{L}, \theta') along a one-dimensional rank function; the pushforward rank distribution turns out to be uniform.
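A minimal sketch of that rank construction, using a conjugate normal model of my own choosing so that the posterior draws are exact:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative conjugate model (not from the paper): theta ~ Normal(0, 1),
# y ~ Normal(theta, 1), so pi(theta | y) = Normal(y / 2, sqrt(1 / 2)).
def sbc_rank(L=100):
    theta_prime = rng.normal(0.0, 1.0)                    # theta' ~ pi(theta)
    y = rng.normal(theta_prime, 1.0)                      # y ~ pi(y | theta')
    theta_l = rng.normal(y / 2.0, np.sqrt(0.5), size=L)   # theta_l ~ pi(theta | y)
    return int(np.sum(theta_l < theta_prime))             # rank in {0, ..., L}

# With exact posterior draws the ranks should be uniform over 0, ..., L.
ranks = [sbc_rank() for _ in range(2000)]
counts = np.bincount(ranks, minlength=101)
```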

Simulating K observations from the same prior draw, and then L posterior draws for each of the resulting posterior distributions, corresponds to

\pi(\theta_{1,1}, \ldots, \theta_{l,k}, \ldots, \theta_{L,K}, y_1, \ldots, y_{K}, \theta') = \left[ \prod_{k = 1}^{K} \left[ \prod_{l = 1}^{L} \pi(\theta_{l, k} \mid y_{k}) \right] \pi(y_{k} \mid \theta') \right] \pi(\theta'),

and so on.
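Continuing the same illustrative conjugate model, one simulation of this ensemble would share a single prior draw \theta' across the K observations, with each y_k getting its own L posterior draws:

```python
import numpy as np

rng = np.random.default_rng(3)

# Same illustrative conjugate model as above; K observations share one prior
# draw, and each theta_{l,k} is drawn from the posterior conditioned on y_k alone.
def sbc_ensemble(K=5, L=10):
    theta_prime = rng.normal(0.0, 1.0)                       # theta' ~ pi(theta)
    y = rng.normal(theta_prime, 1.0, size=K)                 # y_k ~ pi(y | theta')
    theta = rng.normal(y / 2.0, np.sqrt(0.5), size=(L, K))   # theta_{l,k} ~ pi(theta | y_k)
    return theta_prime, y, theta
```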

I assume the self-consistency equation below might not hold with ensembles of samples, which is what happened with the code example here.
\pi(\theta) = \int \mathrm{d}\tilde{y} \, \mathrm{d}\tilde{\theta} \, \pi(\theta \mid \tilde{y}) \, \pi(\tilde{y} \mid \tilde{\theta}) \, \pi(\tilde{\theta})

If so, may I ask whether this would affect the results of the eight schools example, where K = 8 (J in the code here), and the linear regression example, where K = 25 (N in the code here), from the Talts SBC paper?

If only K = L = 1 is allowed, this would limit the use of SBC greatly (especially the number of observations, K), unless I am misunderstanding. Would there be any workaround to this, i.e. could the ensemble of samples be rewritten cleverly to recover the LHS of the above equation, \pi(\theta)?

\pi\left(\theta_{1,1}, \ldots, \theta_{l, k}, \ldots, \theta_{L, K}, y_{1}, \ldots, y_{K}, \theta^{\prime}\right)=\left[\prod_{k=1}^{K}\left[\prod_{l=1}^{L} \pi\left(\theta_{l, k} \mid y_{k}\right)\right] \pi\left(y_{k} \mid \theta^{\prime}\right)\right] \pi(\theta^{\prime})
= \pi(\bar{\bar{\theta}}, \bar{y}, \theta^{\prime})?

Would the following be any help?

\log\left(\left[\prod_{k=1}^{K}\left[\prod_{l=1}^{L} \pi\left(\theta_{l, k} \mid y_{k}\right)\right] \pi\left(y_{k} \mid \theta^{\prime}\right)\right] \pi(\theta^{\prime})\right)
= K \log\left(\left[\prod_{l=1}^{L} \pi\left(\theta_{l} \mid y\right)\right] \pi\left(y \mid \theta^{\prime}\right)\right) + \log \pi(\theta^{\prime})
= K L \log \pi\left(\theta \mid y\right) + K \log \pi\left(y \mid \theta^{\prime}\right) + \log \pi(\theta^{\prime})
= \log \pi(\bar{\bar{\theta}} \mid y) + \log \pi\left(\bar{y} \mid \theta^{\prime}\right) + \log \pi(\theta^{\prime})

In my previous notation \theta and y are variables that represent the entire model configuration and observational spaces. Duplications of \theta correspond to replicating all of the parameters together, while duplications of y correspond to replicating the entire observation at once.

If the observation is composed of multiple components, i.e. if mathematically Y = Y_{1} \times \ldots \times Y_{N} and y = (y_{1}, \ldots, y_{N}), then all of those components have to be replicated for each replication of y. This means that there will be multiple indices – one to label each replication and one to label the components within each replication – and those indices aren’t usually denoted consistently. I used k and l to denote replication indices above but the SBC paper uses j (for the hierarchical model) and k (for the linear regression model) to denote component indices.
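For example, assuming the components are conditionally independent (which need not hold in general, e.g. for hierarchical models), the k-th replication of an N-component observation carries both indices:

y_{k} = (y_{k, 1}, \ldots, y_{k, N}), \qquad \pi(y_{k} \mid \theta') = \prod_{n = 1}^{N} \pi(y_{k, n} \mid \theta'),

so the replication index k and the component index n play different roles and shouldn’t be collapsed into one.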

One critical requirement for SBC™ is that the observational space has to match the observational space of interest. Not seeing diagnostic problems for posterior computation with N = 10 component observations doesn’t mean that the same computational method will also work for N = 100 component observations or vice versa.
