Summing up predictions -- unexpected result

I’m seeing a strange issue when summing posterior predictive distributions. It looks like I don’t know what I’m doing, because the result is not what I expect.

Imagine I’m in the transport business and I would like to know the sum of the weight of my passengers. All I know are some attributes of my passengers, which I’m using to predict their weight. Here’s a simplified version of what I’m trying to do (using height as predictor):

library(rstanarm)

# data set
h = rnorm(500, 170, 10)               # height (cm)
w = h * 0.6 - 40 + rnorm(500, 0, 10)  # weight (kg)
# split into train and test sets, and center the predictor
train = data.frame(w = w[1:300],   h = h[1:300]   - mean(h))
test  = data.frame(w = w[301:500], h = h[301:500] - mean(h))

# The model
fit = stan_lm(w ~ h, data=train, prior=R2(location=0.4, what='median'), seed=4567)

# predict the test set
pred = posterior_predict(fit, test, seed=7890)

pred now contains the posterior predictive draws for the test set (one row per posterior draw, one column per passenger). As I’m interested in the total weight of all 200 passengers in the test set, I just sum their weights within each draw:

sum_w = rowSums(pred)
c(mean(sum_w), sd(sum_w))
# result: mean=12231, sd=173

The sd of the result is strange: 173kg is too high. Assuming that all 200 predictions are more or less independent normals, adding them should yield \mu = \sum_{i=1}^{200} \mu_i and \sigma = \sqrt{\sum_{i=1}^{200} \sigma_i^2} :

c(sum(colMeans(pred)), sqrt(sum(apply(pred, 2, sd)^2)))
# result: mean=12231, sd=134

Expected sd=134kg instead of 173kg!

Now here is where I’m puzzled: if I shuffle the columns of the prediction, I get what I would expect:

shuffled = pred
for (i in 1:200) {
  # permute the 4000 posterior draws within each column independently
  shuffled[, i] = shuffled[sample(4000), i]
}

sum_sh = rowSums(shuffled)
c(mean(sum_sh), sd(sum_sh))
# result: mean=12231, sd=132

Why? It seems as if the columns of the predictions (the draws for the individual passengers) are correlated somehow. Even if I predict all 200 test cases in 200 separate posterior_predict() calls with different seeds, the phenomenon persists.

I’m perplexed. Could someone tell me what I’m doing wrong, or where my assumptions go astray?

With my real data, the difference between the shuffled and unshuffled row sums is even more pronounced, because the data are skewed.


It looks like you’re expecting the marginal posterior distributions for each passenger’s weight to be independent normals. And this is indeed what you enforce by shuffling the elements within columns. But the marginal posterior distributions are not independent. The most straightforward dependence is that they all share the same intercept term. So on posterior iterations when the intercept is larger, the posterior predictions for each passenger all tend to be larger. This induces a positive covariance between the posterior distributions of the passengers, which breaks the formula that you’re using for the variance of the sum of independent normal variates.
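The shared-intercept effect can be seen in a standalone toy simulation (not using the model above; all numbers here are made up for illustration). Each "passenger" gets the same intercept draw plus independent observation noise, which induces a positive covariance between them:

```r
set.seed(1)
n_draws <- 4000
alpha <- rnorm(n_draws, 0, 5)    # posterior draws of a shared intercept

# two "passengers": same intercept draw, independent observation noise
p1 <- alpha + rnorm(n_draws, 0, 10)
p2 <- alpha + rnorm(n_draws, 0, 10)

cov(p1, p2)                # positive, roughly var(alpha) = 25
sd(p1 + p2)                # larger than the independence formula predicts
sqrt(sd(p1)^2 + sd(p2)^2)  # what independence would predict
```

Here Var(p1 + p2) = Var(p1) + Var(p2) + 2 Cov(p1, p2), and the 2 Cov term is exactly what the independence formula drops.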

Dear jsocolar, indeed, this was my intuition as well – without being able to formulate it as clearly as you did. Thank you for confirming my hunch :-)

So what is the proper way to find the total predicted weight of the passengers? Just shuffle (in order to de-correlate the normal variates) and then sum? Or would this be the wrong way to get a predictive distribution for the total weight of the passengers?

The posterior predictive distribution for the total weight is the one based on the true joint posterior (i.e. the one that includes the correlations), not the one that you recover from the marginal posteriors by assuming that they are independent.
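As a sanity check, the sd of the row sums always equals the square root of the sum of the full draw-level covariance matrix (diagonal variances plus off-diagonal covariances); the independence formula keeps only the diagonal. A minimal sketch with a toy matrix standing in for pred:

```r
set.seed(2)
common <- rnorm(4000, 0, 5)
# 10 correlated columns sharing a common component
toy <- sapply(1:10, function(i) common + rnorm(4000, 0, 10))

sd(rowSums(toy))            # correct sd of the sum
sqrt(sum(cov(toy)))         # identical: includes off-diagonal covariances
sqrt(sum(diag(cov(toy))))   # independence formula: too small here
```

So simply summing within each posterior draw, as in rowSums(pred), already gives the correct predictive distribution for the total.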

Makes sense, yes. Thanks for helping me think this through :-)