Expected Value and Uncertainty

TLDR: For some distributions, the uncertainty in the expected value can be calculated straightforwardly from the posterior distribution. I discuss a more general approach that works with the posterior predictive distribution, although I’m not 100% sure it makes sense in the general case.

This has some similarity to a thread I had previously started (Distribution of mean of posterior predictive distribution), but I didn’t want to revive it (although comments there were very helpful in my thinking).

Putting some notation to this, suppose we have some posterior distribution p\left(\theta|I\right) with the posterior predictive given by

p\left(\widetilde{y}|I\right) = \int_{\theta}f\left(\widetilde{y}|\theta\right)p\left(\theta|I\right)d\theta

and the expected value is something like

E(\widetilde{y}) = \int \widetilde{y} p\left(\widetilde{y}|I\right)d\widetilde{y}

Now if instead, I want to summarize the posterior distribution, I can calculate the expected value of the parameters

E(\theta) = \int_{\theta} \theta p\left(\theta|I\right)d\theta

This is all pretty standard, I think.

What I was thinking recently was that if there is a closed-form expression g(\theta) for the expected value of my distribution, then I could instead write

\int_{\theta} g(\theta) p\left(\theta|I\right)d\theta

and by the law of the unconscious statistician (Law of the unconscious statistician - Wikipedia), that formula is equal to E(g(\theta)).
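As a quick sanity check of that identity, here is a made-up example where the closed form is known: if the posterior for an exponential rate \lambda is \text{Gamma}(a, b), then g(\lambda) = 1/\lambda is the expected value of the data distribution and E(1/\lambda) = b/(a-1) exactly, so a Monte Carlo average of g over posterior draws should land on the same number.

# Hypothetical posterior for an exponential rate: lambda ~ Gamma(a, b)
a_check <- 12
b_check <- 3
lambda_draws <- rgamma(100000, a_check, b_check)

# g(lambda) = 1/lambda is the expected value of the exponential distribution
mean(1 / lambda_draws)    # Monte Carlo estimate of E(g(lambda))
b_check / (a_check - 1)   # closed-form value, b / (a - 1)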

So why does this matter?

For one, it means that we can calculate the expected value of the distribution from the posterior distribution as an alternative to calculating the expected value of the posterior predictive. It’s kind of trivial for a normal distribution, since the expected value equals the location parameter. Thus, the expected value of the distribution equals the expected value of the location parameter. That’s not the case for other distributions, so it may be more interesting there.

The other interesting part is that we haven’t marginalized out the uncertainty in the parameters. So, for instance, while we could use the expected value formula above to calculate a point estimate (say, the mean of the standard deviation of the distribution), we also retain the ability to capture the uncertainty associated with the expected value. In the normal case from above, that uncertainty should be equal to the standard deviation of the posterior distribution of the location parameter.

The R code below provides an example of this.

# Constants
N_sim_posterior <- 2000
mu_X <- 0.1
N_obs_X <- 240
N_sim_X <- 100000
sd_X <- 0.175
a_X <- 12
b_X <- sd_X * 2

# Generate posterior mean and standard deviation
X_posterior_mean <- rnorm(N_sim_posterior, mu_X, sd_X / sqrt(N_obs_X))
X_posterior_sd <- sqrt(1.0 / rgamma(N_sim_posterior, a_X, b_X))

# Simulation
X_sim_mean <- rep(NA, N_sim_posterior)
X_sim_sd <- rep(NA, N_sim_posterior)
for (i in 1:N_sim_posterior) {
    X_sim <- rnorm(N_sim_X, X_posterior_mean[i], X_posterior_sd[i])
    X_sim_mean[i] <- mean(X_sim)
    X_sim_sd[i] <- sd(X_sim)
}

# Calculate some summary statistics and do some plots
mean(X_sim_mean)
mean(X_posterior_mean)
sd(X_sim_mean)
sd(X_posterior_mean)
plot(X_posterior_mean, X_sim_mean)

While the above approach is interesting, it suffers because it isn’t operating at the level of the posterior predictive. For instance, if we subsequently calculate

mean(X_sim_sd)
mean(X_posterior_sd)
sd(rnorm(N_sim_posterior, X_posterior_mean, X_posterior_sd))

where that last term is basically the standard deviation of the posterior predictive, then you will find that the last term tends to be larger than either of the other two values. The problem is that the approach isn’t taking into account the posterior distribution of the location parameter.
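One way to make that gap explicit is the law of total variance, which splits the posterior predictive variance into the average within-draw variance plus the variance of the location parameter across draws:

\text{Var}\left(\widetilde{y}|I\right) = E\left(\sigma^{2}|I\right) + \text{Var}\left(\mu|I\right)

A quick check with the draws from the code above should land close to the sd(rnorm(...)) value:

# Posterior predictive sd reconstructed from the two pieces of the total variance
sqrt(mean(X_posterior_sd^2) + var(X_posterior_mean))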

This is kind of where it ties in with that original thread.

The best solution I’ve found so far is in the R code below. The bulk of the code is the same, but the difference is in the loop. In each iteration, I sample with replacement from the posterior distribution N_sim_X times and then use those to simulate from the normal distribution (if you sample N_sim_posterior times, effectively using all the posterior samples, then sd(X_sim_mean) gets very small).

It’s kind of a frequentist approach in that I have some distribution, sample from it a bunch of times, and then calculate means and variances.

# Constants
N_sim_posterior <- 10000
mu_X <- 0.1
N_obs_X <- 240
N_sim_X <- N_obs_X
sd_X <- 0.175
a_X <- 12
b_X <- sd_X * 2

# Generate posterior mean and standard deviation
X_posterior_mean <- rnorm(N_sim_posterior, mu_X, sd_X / sqrt(N_obs_X))
X_posterior_sd <- sqrt(1.0 / rgamma(N_sim_posterior, a_X, b_X))

# Simulation
X_sim_mean <- rep(NA, N_sim_posterior)
X_sim_sd <- rep(NA, N_sim_posterior)
for (i in 1:N_sim_posterior) {
    sample_ids <- sample(1:N_sim_posterior, N_sim_X, replace = TRUE) # same indices for both so each y comes from one joint posterior draw
    X_sim <- rnorm(N_sim_X, X_posterior_mean[sample_ids], X_posterior_sd[sample_ids])
    X_sim_mean[i] <- mean(X_sim)
    X_sim_sd[i] <- sd(X_sim)
}

# Calculate some summary statistics and do some plots
mean(X_sim_mean)
mean(X_posterior_mean)
sd(X_sim_mean)
sd(X_posterior_mean)
plot(X_posterior_mean, X_sim_mean)

# More summary stats
mean(X_sim_sd)
mean(X_posterior_sd)
sd(rnorm(N_sim_posterior, X_posterior_mean, X_posterior_sd))

Things line up more closely with what I would expect.

One interesting result is when comparing the histograms of X_sim_sd and X_posterior_sd. The histogram of X_sim_sd is narrower and more normal-looking than the posterior. My best explanation is that when you simulate from the posterior predictive, only some of the samples come from the more extreme standard deviations, and those mostly show up as fatter tails rather than as extreme sample standard deviations.
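For reference, the comparison can be drawn with something like:

# Overlay the two sets of standard deviations on the same axes
hist(X_posterior_sd, breaks = 50, col = rgb(0, 0, 1, 0.4), freq = FALSE,
     main = "Posterior sd vs simulated sd", xlab = "standard deviation")
hist(X_sim_sd, breaks = 50, col = rgb(1, 0, 0, 0.4), freq = FALSE, add = TRUE)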

So I guess I’m wondering:

  • Am I making sense? Are my good results something that wouldn’t hold for other distributions?
  • Is any part of this new and interesting?

Don’t you want the conditional expectation, i.e.,

\displaystyle \mathbb{E}[g(\theta) | I] = \int_{\Theta} g(\theta) \, p(\theta \mid I) \, \textrm{d}\theta.

If you compute this in Stan, you’ll get the uncertainty for the conditional expectation.

Most people just take this to be the definition of expectation in MCMC, not knowing that you actually need to derive it (though the “unconscious statistician” thing is a goofy name).
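Concretely, with S posterior draws the estimate is just the average of g over the draws,

\displaystyle \mathbb{E}[g(\theta) \mid I] \approx \frac{1}{S} \sum_{s=1}^{S} g\left(\theta^{(s)}\right), \qquad \theta^{(s)} \sim p(\theta \mid I),

and the spread of the g(\theta^{(s)}) values gives the uncertainty.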

A posterior predictive value is going to be

\displaystyle p(\widetilde{y} \mid y) = \int_{\Theta} p(\widetilde{y} \mid \theta) \, p(\theta \mid y) \, \textrm{d}\theta.

You can just code this directly in Stan. There’s a chapter in the last part of the User’s Guide.

I don’t get this part, because g(\theta) here is a function, not a distribution. When we take g(\theta) = p(\tilde{y} \mid \theta), then we’re assuming you have an analytic function to compute it. Once you have that, you can just code it in Stan following the instructions in the last section of the User’s Guide.

Or maybe I’m misunderstanding something in the question?

Hi Bob. I appreciate the reply, but I’m not really asking how to do something in Stan. Actually, the thread I referenced above has a Stan program I wrote 4 years ago that simulates from the posterior predictive.

I’m not sure I would say that I’m setting g(\theta) = p(\tilde{y} \mid \theta) in this part.

If we let \theta_{i} be a single draw of \theta, then we can choose the function g such that g(\theta_{i}) = E(y|\theta_{i}), where y is assumed to be distributed according to f(y|\theta). So for the normal distribution this is simple, since E(y)=\mu: we just take g(\theta_{i}) = \theta_{i,\mu}, where \theta_{i,\mu} is the \mu parameter for each draw. If we were dealing with the beta distribution, then we might set g(\theta_{i}) = \theta_{i,\alpha} / (\theta_{i,\alpha} + \theta_{i,\beta}). Hopefully that makes more sense.

The above works for the normals, but I haven’t really checked if it works for other distributions (e.g. maybe it only works because the sum of normals is normal). I think of it as potentially a shortcut for calculating moments of the distribution.
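For the beta case, the per-draw calculation would look something like the sketch below (the posterior draws for \alpha and \beta here are just made-up stand-ins to show the shape of it):

# Hypothetical posterior draws of the beta parameters (stand-ins, not from a real fit)
alpha_draws <- rgamma(2000, 20, 2)
beta_draws <- rgamma(2000, 30, 2)

# Per-draw expected value g(theta_i) = alpha_i / (alpha_i + beta_i)
g_draws <- alpha_draws / (alpha_draws + beta_draws)
mean(g_draws)  # expected value of the distribution, averaged over the posterior
sd(g_draws)    # uncertainty in that expected value, with parameter uncertainty retained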

Regardless, the above section of my original post is really the thinking that led into the last section of that post. It’s kind of tangential looking back. That last bit of R code is the part I’m really interested in.

Basically, what I’m doing in that last section is making a number of draws equal to the number of observations and doing that a large number of times. The typical approach is to make a large number of draws once.
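To spell out that contrast with the variables from the code above:

# Typical approach: one large set of posterior predictive draws,
# one draw of y per posterior draw of (mean, sd)
y_rep_once <- rnorm(N_sim_posterior, X_posterior_mean, X_posterior_sd)

# Approach in the loop above: N_sim_X (= N_obs_X) draws per iteration,
# each iteration mixing over resampled posterior draws, repeated many times
ids <- sample(1:N_sim_posterior, N_sim_X, replace = TRUE)
y_rep_one_data_set <- rnorm(N_sim_X, X_posterior_mean[ids], X_posterior_sd[ids])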