# Estimating the Correlation Coefficient Distribution from Separate Posterior Distributions

I have a general analysis question. In the analysis of our experiment we are interested in correlating the same parameter in our subjects across two different conditions. The output of our model is a posterior over the parameter for each subject in each of the two conditions, and what we have been doing so far is correlating the point estimates of these posteriors to learn one correlation coefficient. But what we want to do is find a distribution of correlation coefficients that give us a sense of the uncertainty of that correlation given that we are estimating it from a pair of posterior distributions that have estimation uncertainty themselves. Without building the correlation into our Bayesian model itself, is there an established and principled way to do this analysis?

Do you believe that your estimates for the value in each condition are fully independent? If not, you need a model for this dependency. If so, then the union of the posteriors samples each condition is a valid set of samples from the joint posterior. Given samples from a joint posterior, it is possible to obtain samples from any function for the posterior margins by computing that function iteration-wise over the samples. In this case, the function would be the function that yields the sample correlation coefficient. So you would compute the sample correlation coefficient iteration-wise over the posterior to get a posterior distribution for the sample correlation coefficient.

If you want a posterior distribution for the population correlation coefficient, then additional assumptions will be necessary, e.g. about the distribution of covariate values found in the population.

In this case, the correlation we are trying to investigate is between two instances of the same model parameter in different experimental domains. So these posteriors were found through two separate fittings with slightly different model assumptions.

If I’m understanding your suggestion correctly, we want to follow an algorithm like:

1. from the first distribution, sample a parameter value at random from the posterior for each subject
2. repeat this process for the second distribution
3. correlate the two subject sets to find a single correlation coefficient
4. do this, say, 10,000 times to get a correlaton coefficient distribution

If I have understood correctly, we actually tried this, and what we found was that the entire correlation coefficient distribution was substantially underestimated relative to the correlation coefficient found by running a correlation between the point estimates. We are pretty sure the correlation coefficient obtained from running the analysis on the point estimates is not inflated because we see this happen with parameters we know to be highly correlated.

We tried this alternative algorithm as well:

1. from the first distribution, sample ten parameter values at random from the posterior for each subject
2. take the mean of the ten samples
3. repeat steps 1 and 2 for the second distribution
4. correlate the two subject sets to find a single correlation coefficient
5. do this, say, 1,000 times to get a correlation coefficient distribution

And what we found here, as we took the mean of a larger and larger numbers of samples (1, 10, 100, 1000), was that, whether we correspondingly decreased the number of iterations or not, the distribution came closer and closer to the correlation coefficient over the point estimates.

We were puzzled by this, but speculated that if you sample from the posteriors to run iterative correlation analyses in this way, you are dealing with sampling noise around the parameters that will make any one of these correlations weaker than if you use the mean or geomean to collapse the samples to point estimates before running your correlation analysis.