Treatment effects via posterior prediction

These questions came out of conversations I had today with @mariel and @ignacio.

In the stan_lm vignette, @jonah and @bgoodri suggest investigating the average treatment effect using posterior_predict.

clouds_cf <- clouds
clouds_cf$seeding[] <- "yes"
y1_rep <- posterior_predict(post, newdata = clouds_cf)
clouds_cf$seeding[] <- "no"
y0_rep <- posterior_predict(post, newdata = clouds_cf)
qplot(x = c(y1_rep - y0_rep), geom = "histogram", xlab = "Estimated ATE")

From what I understand (please correct me if this is wrong), posterior_predict is not just calculating \hat y_i = X_i\hat\beta (the expected value of y); for each MCMC iteration it draws a new \tilde y_i from the posterior predictive distribution, which incorporates the residual variability into the prediction.
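To make that distinction concrete, here is a minimal sketch for a single hypothetical posterior draw (beta_s, sigma_s) under a Gaussian model; all numbers are made up, not from the clouds data:

```r
set.seed(1)
n <- 5
X <- cbind(1, rnorm(n))                 # toy design matrix
beta_s  <- c(2, 0.5)                    # one hypothetical posterior draw of beta
sigma_s <- 1.3                          # ... and of the residual sd
mu_s    <- drop(X %*% beta_s)           # the linear predictor: expected value, no residual noise
y_rep_s <- mu_s + rnorm(n, 0, sigma_s)  # a posterior predictive draw: adds residual noise
```

posterior_linpred corresponds to mu_s; posterior_predict corresponds to y_rep_s, repeated over all posterior draws.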

First off, is there a problem with the above calculation in that it assumes the draws y1_rep and y0_rep are independent? I would think you want to make this calculation in a way that accounts for the fact that these are the same people (really, the same observations) in each case. Another way to look at it: I would expect \tilde y_1 - \tilde y_0 to be smaller when you compare the two outcomes on the same person than when you compare two different people who happen to have the same covariates. I don’t think your suggested procedure accounts for this.

Second, I just want to point out that this is not the “standard” way of calculating an ATE, which is done by most software packages that I’ve come across. Most of the time, \hat y_i = X_i\hat\beta is used as the prediction, and the residual variability is ignored. When this is done, the issue I raise above is moot. I’m not saying that your approach is wrong, just pointing out that it is not the standard.

I’m curious about your thoughts to these issues. Thanks!


A third issue with the code in the vignette that was just pointed out by @mariel:

You calculate the posterior of y1_rep - y0_rep for all individual observations in the dataset, but then you plot all of the individual treatment effects (by vectorizing the resulting matrix). For the AVERAGE treatment effect, you should be averaging across observations (by taking rowMeans(y1_rep - y0_rep)). That completely changes the interpretation of what is being plotted. Do you agree?
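For concreteness, here is what that averaging looks like. The matrices below are random stand-ins with the same layout as posterior_predict output (one row per posterior draw, one column per observation); the sizes and values are arbitrary, not the clouds data:

```r
set.seed(1)
S <- 1000; n <- 24                          # toy number of posterior draws and observations
y1_rep <- matrix(rnorm(S * n, 5, 2), S, n)  # stand-in for posterior_predict under seeding = "yes"
y0_rep <- matrix(rnorm(S * n, 4, 2), S, n)  # stand-in for posterior_predict under seeding = "no"
tau_rep <- y1_rep - y0_rep                  # S x n matrix of individual treatment-effect draws
ate_rep <- rowMeans(tau_rep)                # one ATE draw per posterior iteration: length S
# sd(ate_rep) is roughly sd(tau_rep) / sqrt(n); plotting c(tau_rep) instead
# shows a much wider distribution than the posterior of the ATE
```

A histogram of ate_rep (rather than of the vectorized matrix) is then the posterior of the average treatment effect.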


Hmmm, but this is Bayes, so we get to incorporate our uncertainty about the residual error into the predictions. rstanarm does have posterior_linpred() for this conventional kind of prediction, but the help pages advise against using it for prediction. Maybe the phrase “average treatment effect” is a misnomer? Maybe “posterior-predictive treatment effect” is more apt.

It seems to me that you’re talking about two very different things in that paragraph, which correspond to points 2 and 3 that I raised in my first two posts.

  1. You’re right, we are able to incorporate the uncertainty about the residual into our predictions. However, this isn’t limited to Bayesian analysis; frequentists can do this as well. See, for example, the difference between “confidence intervals” and “prediction intervals”. Keep in mind that I don’t necessarily disagree with including the residual in the prediction (I still have to think about this more), but I do disagree with predicting y_1 and y_0 independently (unless @jonah or @bgoodri can convince me otherwise).

  2. What that code displays is the distribution of individual-specific treatment effects. What we are typically interested in (for causal inference) is the average treatment effect. So yes, if you call what is plotted there an “average treatment effect”, that’s a misnomer. But the average is the thing we’re actually interested in, so I’d say it’s a “miscalculation” of what we want.

After thinking about this for a couple of days, I have two gripes with the way treatment effects are calculated in the vignette above.

  1. My first gripe is pretty straightforward. You are not calculating and plotting the average treatment effect; you are plotting the distribution of individual treatment effects. Relative to the posterior for the average, that inflates the standard deviation by a factor of \sqrt{n} (the variance by a factor of n).
  2. The second gripe is more conceptual. The central issue is how correlated an observation is with its counterfactual. Until we get a time machine, that’s not something that we can really measure or know for sure. However, the standard in causal inference is to assume that they have a correlation of one. In other words, we assume the exact same residual for \tilde y_{i1} and \tilde y_{i0}. Thus, when you calculate \tilde y_{i1} - \tilde y_{i0}, the residuals cancel out, so you get the same thing as you would if you just did \hat y_{i1} - \hat y_{i0} (using the expected values). Whether that is appropriate or not could be debated, but it’s probably better than assuming the correlation is 0, which is what your code is doing. And this really isn’t a small difference - this is a big deal for estimating causal effects. In many cases the residual variance will be much larger than the variance of the impact of interest.
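A toy simulation makes the second gripe concrete (all numbers hypothetical): with a shared residual the individual effect draws collapse to the difference in expected values, while with independent residuals their spread is inflated to \sigma\sqrt{2}.

```r
set.seed(1)
S <- 5000
sigma <- 2                               # hypothetical residual sd
mu1 <- 1; mu0 <- 0                       # hypothetical expected potential outcomes
eps <- rnorm(S, 0, sigma)                # one shared residual per draw (correlation 1)
tau_shared <- (mu1 + eps) - (mu0 + eps)  # residuals cancel: every draw equals mu1 - mu0
tau_indep  <- (mu1 + rnorm(S, 0, sigma)) - (mu0 + rnorm(S, 0, sigma))
c(sd(tau_shared), sd(tau_indep))         # 0 versus roughly sigma * sqrt(2)
```

Here the residual sd (2) dwarfs the effect of interest (1), so the two assumptions give wildly different pictures of the individual effects.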

@shira, I’m wondering if you have any thoughts?


Just bumping this back up the list. I would very much like to hear a response from @jonah or @bgoodri, because I am now convinced that their approach for estimating treatment effects is flawed.

RE: @tjmahr’s point that “average treatment effect” may be a misnomer… the average treatment effect is a causal inference term - this is what we are trying to estimate. ATE = E[Y(1) - Y(0)]. You can’t blame the estimand if the estimator isn’t correct.

What we have in that vignette is wrong if you assume (a) that the observed outcomes differ from the potential outcome functions because of measurement error, and (b) that the measurement error is the same irrespective of which potential outcome is realized. Under those assumptions, which I think a lot of people are comfortable making, subtracting one from the other cancels out the measurement error, leaving the difference in the potential outcome functions. The posterior_predict function is basically assuming independent measurement errors on top of the two potential outcome functions, in which case the subtraction does not cancel them.

@jgellar and @bgoodri , I have posted a related question here: Help replicating example from Imbens and Rubin's Causal Inference book. I would love to hear your comments.

This approach to causal inference was first proposed by Donald Rubin (1978) and it’s explained quite well in the book Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction (Chapter 8) where they call it model-based imputation.

They do talk about the correlation between Y_i(0) and Y_i(1) and suggest that a conservative approach would have them perfectly correlated (p. 175):

\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_0 \\ \mu_1 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \sigma^2 \\ \sigma^2 & \sigma^2 \end{pmatrix} \right).

As you point out, that correlation is not identified without a time machine, since we never observe both potential outcomes for the same individual. But that’s not a big deal for a Bayesian: the extra uncertainty just ends up reflected in a wider posterior distribution for the average treatment effect, as you also point out.
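One way to see how the (unidentified) residual correlation \rho propagates into the spread of the individual effect draws, with everything else made up for illustration:

```r
set.seed(1)
S <- 10000
sigma <- 1; mu1 <- 1; mu0 <- 0           # hypothetical residual sd and expected outcomes
draw_tau <- function(rho) {              # individual effect draws at residual correlation rho
  e0 <- rnorm(S)
  e1 <- rho * e0 + sqrt(1 - rho^2) * rnorm(S)
  (mu1 + sigma * e1) - (mu0 + sigma * e0)
}
# spread is sigma * sqrt(2 - 2 * rho): zero at rho = 1, widest at rho = -1;
# the mean, mu1 - mu0, does not depend on rho
sapply(c(-1, 0, 1), function(r) sd(draw_tau(r)))
```

So any prior on \rho only widens or narrows the posterior; it doesn’t move the center.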

Also, about using rowMeans, I think you’re correct.

I totally missed this conversation, but I will contribute late. I struggled with this a lot while trying to decide whether I agreed with the claim in this paper that posterior predictive draws could sometimes be replaced with their expectations. (I ultimately chose to make the assumption of independent errors in my own research.)

Gripe #1 is a plot labelling issue, and I agree with @jgellar that you need to take the average at some point to call it an average treatment effect. I also agree that any definition of ATE (on the difference scale) that is not mathematically equivalent to \mathbb{E}\left[Y(1) - Y(0) \right] is plain wrong.

But isn’t gripe #2 a general gripe with causal inference? The relationship between residuals for Y_i(0) and Y_i(1) is unknown and unknowable. I’m not sure we should naturally assume the residuals have a correlation of 1; it’s just a conceptually simple way to explain how people have used parametric models for causal inference in the past.

Basically, I don’t see how you can get around making some kind of assumption here. At one end of the spectrum, we might make the strong and untestable assumption of completely independent errors. (Pearl’s non-parametric structural equation models with independent errors make the same independence assumption as the posterior_predict approach here, so I wouldn’t say there is a uniformly accepted “standard in causal inference.”) Or you can make a different strong and untestable assumption of perfectly correlated errors, in which case I think you should advertise that you are almost certainly being too conservative.
