I’m seeing a strange issue when summing posterior predictive distributions. It looks like I don’t know what I’m doing, because the result is not what I expect.
Imagine I’m in the transport business and I would like to know the total weight of my passengers. All I know are some attributes of the passengers, which I’m using to predict their weight. Here’s a simplified version of what I’m trying to do (using height as the only predictor):
library(rstanarm)
# data set
set.seed(1234)
h = rnorm(500, 170, 10) # height (cm)
w = h*0.6 - 40 + rnorm(500, 0, 10) # weight (kg)
# split into train and test sets, and center predictor
train = data.frame(w=w[1:300], h=h[1:300] - mean(h))
test = data.frame(w=w[301:500], h=h[301:500] - mean(h))
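As a quick sanity check: with this data-generating process each passenger weighs about 170*0.6 - 40 = 62 kg on average, with an sd of roughly sqrt((10*0.6)^2 + 10^2) ≈ 11.7 kg, so 200 passengers should total somewhere around 12,400 kg.
c(mean(w), sd(w))   # should come out near 62 and 11.7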
# The model
fit = stan_lm(w ~ h, data=train, prior=R2(location=0.4, what='median'), seed=4567)
# predict the test set
pred = posterior_predict(fit, test, seed=7890)
pred now contains the predictions of the model for the test data set. As I’m interested in the total weight of all 200 passengers in the test set, I just sum their predicted weights:
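(For orientation: posterior_predict() returns one row per posterior draw and one column per row of the new data, so with rstanarm’s default of 4000 posterior draws pred should be a 4000 x 200 matrix.)
dim(pred)
# expected: 4000  200   (posterior draws in rows, test cases in columns)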
sum_w = rowSums(pred)
c(mean(sum_w), sd(sum_w))
# result: mean=12231, sd=173
The sd of the result is strange: 173kg seems too high. Assuming that the 200 predictions are independent and more or less normal, the sum of 200 normal random variables should have \mu = \sum_{i=1}^{200} \mu_i and \sigma = \sqrt{\sum_{i=1}^{200} \sigma_i^2} :
c(
sum(colMeans(pred)),
sqrt(sum(apply(pred, 2, sd)^2))
)
# mean=12231, sd=134
Expected sd=134kg instead of 173kg!
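(I realize the formula above treats the 200 predictions as uncorrelated; in general the variance of a sum is \sigma^2 = \sum_{i=1}^{200} \sigma_i^2 + 2 \sum_{i<j} \mathrm{Cov}(w_i, w_j), so any positive covariance between the predicted weights would push the sd of the sum above 134kg.)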
Now here is where I’m puzzled: if I shuffle each column of the predictions independently (permuting the rows within each column), I get what I would expect:
set.seed(6543)
shuffled = pred
for (i in 1:200) {
  # permute the rows within column i independently of the other columns
  shuffled[, i] = shuffled[sample.int(4000, 4000), i]
}
sum_sh = rowSums(shuffled)
c(mean(sum_sh), sd(sum_sh))
# mean=12231, sd=132
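An equivalent, more compact way to do the per-column shuffle (untested sketch; apply() over the columns should do the same thing as the loop above):
shuffled2 = apply(pred, 2, sample)   # permute each column independently
c(mean(rowSums(shuffled2)), sd(rowSums(shuffled2)))   # should again give an sd near 134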
Why? It seems as if the 200 predictions (the columns of pred) are somehow correlated with each other across the posterior draws. Even if I predict the 200 test cases in 200 separate posterior_predict() calls, each with a different seed, the phenomenon persists.
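To check this suspicion directly, I would look at something along these lines (sketch; the first line is just the full variance-of-sum identity applied to the prediction matrix, so it should reproduce the ~173kg figure, while the last line gives the average pairwise correlation between the columns):
sqrt(sum(cov(pred)))     # sd of the row sums, pairwise covariances included
C = cor(pred)
mean(C[upper.tri(C)])    # average pairwise correlation between the 200 predictions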
I’m perplexed. Could someone tell me what I’m doing wrong, or where my assumptions go astray?
With my real data, the difference between the shuffled and unshuffled row sums is even more pronounced, as the data is skewed.
Thanks!