TLDR: For some distributions, the uncertainty in the expected value can be calculated directly from the posterior distribution. I discuss a more general approach that can be used with the posterior predictive distribution, although I'm not 100% sure it would make sense in the general case.
This has some similarity to a thread I had previously started (Distribution of mean of posterior predictive distribution), but I didn't want to revive it (although the comments there were very helpful in shaping my thinking).
Putting some notation to this, suppose we have some posterior distribution p(\theta | I), with the posterior predictive given by

p(\tilde{x} | I) = \int p(\tilde{x} | \theta) \, p(\theta | I) \, d\theta

and the expected value is something like

E(\tilde{x} | I) = \int \tilde{x} \, p(\tilde{x} | I) \, d\tilde{x}
Now if instead I want to summarize the posterior distribution, I can calculate the expected value of the parameters,

E(\theta | I) = \int \theta \, p(\theta | I) \, d\theta
This is all pretty standard, I think.
What I was thinking recently was that if there is a closed-form expression for the expected value of my distribution as a function of the parameters, say g(\theta) = E(\tilde{x} | \theta), then I could instead write

E(\tilde{x} | I) = \int g(\theta) \, p(\theta | I) \, d\theta

and by the law of the unconscious statistician (Law of the unconscious statistician - Wikipedia), that integral is equal to E(g(\theta)).
So why does this matter?
For one, it means that we can calculate the expected value of the distribution from the posterior distribution alone, as an alternative to calculating the expected value of the posterior predictive. This is trivial for a normal distribution, since the expected value equals the location parameter, and so the expected value of the distribution equals the expected value of the location parameter. That's not the case for other distributions, so it may be more interesting there.
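To make that concrete, here is a minimal sketch for a lognormal distribution, where the mean is g(\mu, \sigma) = \exp(\mu + \sigma^2 / 2) rather than the location parameter. The posterior draws below are made-up stand-ins, not output from a fitted model.

# Hypothetical posterior draws for a lognormal model (made-up values, not a fit)
N_draws <- 2000
mu_draws <- rnorm(N_draws, 0.1, 0.02)
sigma_draws <- sqrt(1.0 / rgamma(N_draws, 12, 0.35))
# Apply the closed-form lognormal mean, g(mu, sigma) = exp(mu + sigma^2 / 2)
g_draws <- exp(mu_draws + sigma_draws^2 / 2)
mean(g_draws) # E(g(theta)): the expected value of the distribution
sd(g_draws)   # the uncertainty in that expected value, kept from the posterior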
The other interesting part is that we haven't marginalized out the uncertainty in the parameters. So, for instance, while we could use the expected value formula above to calculate the posterior mean of the standard deviation of the distribution, we also retain the ability to capture the uncertainty associated with the expected value. In the normal case from above, that uncertainty should equal the standard deviation of the posterior distribution of the location parameter.
The R code below provides an example of this.
# Constants
N_sim_posterior <- 2000 # number of posterior draws
mu_X <- 0.1             # center of the posterior for the location parameter
N_obs_X <- 240          # sample size behind the posterior
N_sim_X <- 100000       # simulated observations per posterior draw
sd_X <- 0.175           # sampling standard deviation of X
a_X <- 12               # shape for the gamma draw on the precision
b_X <- sd_X * 2         # rate for the gamma draw on the precision
# Generate posterior draws of the mean and standard deviation
X_posterior_mean <- rnorm(N_sim_posterior, mu_X, sd_X / sqrt(N_obs_X))
X_posterior_sd <- sqrt(1.0 / rgamma(N_sim_posterior, a_X, b_X)) # variance is inverse-gamma
# Simulation: for each posterior draw, simulate N_sim_X observations
# and record their sample mean and sd
X_sim_mean <- rep(NA, N_sim_posterior)
X_sim_sd <- rep(NA, N_sim_posterior)
for (i in 1:N_sim_posterior) {
  X_sim <- rnorm(N_sim_X, X_posterior_mean[i], X_posterior_sd[i])
  X_sim_mean[i] <- mean(X_sim)
  X_sim_sd[i] <- sd(X_sim)
}
# Calculate some summary statistics and do some plots
mean(X_sim_mean) # should closely match mean(X_posterior_mean)
mean(X_posterior_mean)
sd(X_sim_mean) # should closely match sd(X_posterior_mean), since N_sim_X is large
sd(X_posterior_mean)
plot(X_posterior_mean, X_sim_mean) # points should sit near the 45-degree line
While the above approach is interesting, it suffers because it isn’t operating at the level of the posterior predictive. For instance, if we subsequently calculate
mean(X_sim_sd)
mean(X_posterior_sd)
sd(rnorm(N_sim_posterior, X_posterior_mean, X_posterior_sd))
where that last expression is essentially the standard deviation of the posterior predictive, then you will find that it tends to be larger than either mean(X_sim_sd) or mean(X_posterior_sd). The problem is that those values don't take into account the posterior distribution of the location parameter.
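If I have the logic right, the gap is just the law of total variance for the posterior predictive:

\text{Var}(\tilde{x} | I) = E(\text{Var}(\tilde{x} | \theta)) + \text{Var}(E(\tilde{x} | \theta)) = E(\sigma^2) + \text{Var}(\mu)

and the first two quantities only get at the E(\sigma^2) term, not the variance of the location parameter.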
This is kind of where it ties in with that original thread.
The best solution I've found so far is in the R code below. The bulk of the code is the same, but the difference is in the loop: in each iteration, I sample with replacement from the posterior distribution N_sim_X times and then use those draws to simulate from the normal distribution. (If you instead sample N_sim_posterior times, effectively using all the posterior samples, then sd(X_sim_mean) gets very small.)
It’s kind of a frequentist approach in that I have some distribution, sample from it a bunch of times, and then calculate means and variances.
# Constants (same as before, except N_sim_posterior and N_sim_X)
N_sim_posterior <- 10000
mu_X <- 0.1
N_obs_X <- 240
N_sim_X <- N_obs_X # simulate as many observations as the original sample size
sd_X <- 0.175
a_X <- 12
b_X <- sd_X * 2
# Generate posterior mean and standard deviation
X_posterior_mean <- rnorm(N_sim_posterior, mu_X, sd_X / sqrt(N_obs_X))
X_posterior_sd <- sqrt(1.0 / rgamma(N_sim_posterior, a_X, b_X))
# Simulation: resample posterior draws within each iteration
X_sim_mean <- rep(NA, N_sim_posterior)
X_sim_sd <- rep(NA, N_sim_posterior)
for (i in 1:N_sim_posterior) {
  # Sample mean/sd pairs jointly, with replacement, so each simulated
  # observation uses a consistent posterior draw
  sample_ids <- sample(1:N_sim_posterior, N_sim_X, replace = TRUE)
  X_sim <- rnorm(N_sim_X, X_posterior_mean[sample_ids], X_posterior_sd[sample_ids])
  X_sim_mean[i] <- mean(X_sim)
  X_sim_sd[i] <- sd(X_sim)
}
# Calculate some summary statistics and do some plots
mean(X_sim_mean)
mean(X_posterior_mean)
sd(X_sim_mean)
sd(X_posterior_mean)
plot(X_posterior_mean, X_sim_mean)
# More summary stats
mean(X_sim_sd)
mean(X_posterior_sd)
sd(rnorm(N_sim_posterior, X_posterior_mean, X_posterior_sd))
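If the law of total variance reasoning above is right, we can also compute the posterior predictive standard deviation directly from the posterior draws, and it should be close to the last value above:

# Posterior predictive sd via the law of total variance:
# Var(x) = E(sigma^2) + Var(mu)
sqrt(mean(X_posterior_sd^2) + var(X_posterior_mean))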
Things line up more closely with what I would expect.
One interesting result comes from comparing the histograms of X_sim_sd and X_posterior_sd: the histogram of X_sim_sd is narrower and more normal-looking than the posterior. My best explanation is that each simulated sample mixes draws from many posterior parameter values, so the more extreme standard deviations are only partially represented in any one sample, and they tend to show up as fat tails instead.
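For anyone who wants to reproduce that comparison, it's just something like this (plotting choices are mine):

# Compare the spread of the simulated and posterior standard deviations
par(mfrow = c(2, 1))
hist(X_sim_sd, breaks = 50, main = "X_sim_sd")
hist(X_posterior_sd, breaks = 50, main = "X_posterior_sd")
par(mfrow = c(1, 1))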
So I guess I’m wondering:
- Am I making sense? Are my good results something that wouldn’t hold for other distributions?
- Is any part of this new and interesting?