 # Help replicating example from Imbens and Rubin's Causal Inference book

Hello,

I’m trying to replicate the examples in the textbook “Causal Inference for Statistics, Social and Biomedical Sciences” by Guido Imbens and Donald Rubin using Stan (obvs!). One of their examples is giving me trouble :( I provide the code to replicate what I did at the end of the post, after a short description of the problem for those not familiar with the methodology in the book.

### For those not familiar with the book:

Not surprisingly, the book advocates the use of the Rubin Causal Model (RCM) that uses the potential outcomes framework. For those who don’t know what this is, the RCM for the case of a binary treatment W_i \in \{0,1\} associates two outcomes to each individual i. One for the case the individual was treated, Y_i(1), the other for the case she wasn’t, Y_i(0). We of course only observe one of these two potential outcomes for each individual depending on her treatment assignment

Y_i = \begin{cases} Y_i(1) \quad \text{if } W_i = 1, \\ Y_i(0) \quad \text{if } W_i = 0. \end{cases}

The treatment effect for individual i is \tau_i :=Y_i(1) - Y_i(0), and the average treatment effect is \mathbb{E}[\tau_i]. The crux of this framework is that it clearly turns the problem of causal inference into a missing data problem. In fact, most causal inference methods can be mapped into different ways to impute the missing outcomes.

### Simple example succesfully replicated

Chapter 8 of the book is about how to impute the missing potential outcomes by modeling the joint distribution of the missing and observed data and then impute the missing outcomes from the posterior predictive distribution of the missing outcomes. Their examples use the famous Lalonde data set which consists of the outcomes of a labor training experiment. The outcome of interest is the earnings of those involved in the experimnent a year after it concluded.

Here’s the code for the simple case–which I can replicate–in which both potential outcomes are modelled as normal

\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_0 \\ \mu_1\end{pmatrix}, \begin{pmatrix} \sigma_0 & 0\\ 0 & \sigma_1\end{pmatrix}\right).

With normal priors on the mean and inverse gamma on the standard deviations, this case can be solved analytically and they do that in the book. Below is the Stan program for this model.

     data {
int<lower=1> N;             // number of observations
real<lower=0> Y_obs[N];     // observed outcome
int<lower=0, upper=1> W[N]; // treatment assignment indicator
}
parameters {
real mu_0;      // mean vector of potential ouctome of untreated
real mu_1;      // mean vector of potential ouctome of treated
real<lower=0> sigma_0;
real<lower=0> sigma_1;
}
model {
// priors
mu_0 ~ normal(0, 100);
mu_1 ~ normal(0, 100);
sigma_0 ~ cauchy(0, 10);
sigma_1 ~ cauchy(0, 10);
// likelihood
for (i in 1:N) {
if (W[i] == 0) {
Y_obs[i] ~ normal(mu_0, sigma_0);
}
else if (W[i] == 1) {
Y_obs[i] ~ normal(mu_1, sigma_1);
}
}
}
generated quantities {
vector[N] Y_0;
vector[N] Y_1;
vector[N] tau_individual;
real tau;
for (i in 1:N) {
if (W[i] == 1) {
Y_1[i] = Y_obs[i];
Y_0[i] = normal_rng(mu_0, sigma_0);  // impute midding outcome
}
if (W[i] == 0) {
Y_0[i] = Y_obs[i];
Y_1[i] = normal_rng(mu_1, sigma_1);  // impute missing outcome
}
// compute treatment effect for individual i
tau_individual[i] = Y_1[i] - Y_0[i];
}
tau = mean(tau_individual);
}


Here’s how I run it in R

# install.packages("devtools")
# devtools::install_github("jjchern/lalonde") # contains the dataset
library(lalonde)
library(data.table)

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

## the experiment data
nswDT = data.table(lalonde::nsw_dw)

## prepare list od data for Stan
data_stan = list(
N = nrow(nswDT),
Y_obs = nswDT$re78, ## earnings after experiment W = nswDT$treat)

## compile stan model
fit.stan = stan(file = "normal-model.stan", data = data_stan)
tau = extract(fit.stan, "tau")[]
mean(tau)
sd(tau)
qplot(tau)


Here’s the histogram of the treatment effects

Beautiful.

They also extend this model by parametrizing the mean using individual’s observables, \mu_0 = X\beta_0 and \mu_1 = X\beta_1. I can also replicate that. Let’s move now to the problematic example.

### The problem: Two stage model of potential outcomes

The earnings data contains a lot of zeroes as can be seen here

library(ggplot2)
ggplot(nswDT, aes(x=re74)) + geom_histogram() + facet_grid(~treat)


They propose to model the potential outcomes, Y_i(0) and Y_i(1), with a two stage model: a logit that generates the zeroes and a lognormal distribution for the outcomes if they are not zero.

First stage:

\Pr(Y_i(0) > 0 \mid X_i, W_i, \theta) = \frac{\exp(X_i \gamma_0)}{1 + \exp(X_i \gamma_0)}

Second stage:

\ln (Y_i(0) \mid Y_i(0) > 0, X_i, W_i, \theta) \sim \mathcal{N}(X_i \beta_0, \sigma_0)

and similarly for Y_i(1) with corresponding parameters (\gamma_1, \beta_1, \sigma_1).

Here’s how I wrote this in Stan,

   data {
int<lower=1> N;               /* number of observations */
int K;                        /* number of covariates */
real<lower=0> Y_obs[N];
int<lower=0, upper=1> W[N];   /* treatment assignment indicator*/
matrix[N, K] X;               /* matrix of covariates */
}
parameters {
vector[K] gamma_0;            /* params of Y_0 > 0 */
vector[K] gamma_1;            /* params of Y_1 > 0 */
vector[K] beta_0;             /* params of Y_0 mean */
vector[K] beta_1;             /* params of Y_1 mean */
real<lower=0> sigma_0;
real<lower=0> sigma_1;
}
model {
// priors
beta_0 ~ normal(0, 100);
beta_1 ~ normal(0, 100);
gamma_0 ~ normal(0, 100);
gamma_1 ~ normal(0, 100);
sigma_0 ~ cauchy(0, 10);
sigma_1 ~ cauchy(0, 10);
// outcomes model
for (i in 1:N) {
if (W[i] == 0) {            /* observe Y_0 */
// likelihood of the observed Y_0[i]
if (Y_obs[i] == 0) {
0 ~ bernoulli_logit(X[i] * gamma_0);
}
else if (Y_obs[i] > 0){
1 ~ bernoulli_logit(X[i] * gamma_0);
Y_obs[i] ~ lognormal(X[i] * beta_0, sigma_0);
}
}
else if (W[i] == 1){        /* observe Y_1 */
// likelihood of the observed Y_1[i]
if (Y_obs[i] == 0) {
0 ~ bernoulli_logit(X[i] * gamma_1);
}
else if (Y_obs[i] > 0) {
1 ~ bernoulli_logit(X[i] * gamma_1);
Y_obs[i] ~ lognormal(X[i] * beta_1, sigma_1);
}
}
}
}
generated quantities {
vector[N] Y_0;
vector[N] Y_1;
real Y0is0;
real Y1is0;
real tau;
vector[N] tau_individual;
// generate the "Science" matrix by drawing from
// the missing data distribution
for (i in 1:N) {
if (W[i] == 1) {
Y_1[i] = Y_obs[i];
// draw bernoulli_logit to see if Y_0 is zero
Y0is0 = bernoulli_logit_rng(X[i] * gamma_0);
if (Y0is0 == 0)
Y_0[i] = lognormal_rng(X[i] * beta_0, sigma_0);
else
Y_0[i] = 0;
}
if (W[i] == 0) {
Y_0[i] = Y_obs[i];
// draw bernoulli to see if Y_1 is zero
Y1is0 = bernoulli_logit_rng(X[i] * gamma_1);
if (Y1is0 == 0)
Y_1[i] = lognormal_rng(X[i] * beta_1, sigma_1);
else
Y_1[i] = 0;
}
// compute treatment effect for individual i
tau_individual[i] = Y_1[i] - Y_0[i];
}
tau = mean(tau_individual);
}


And here is how I run it in R

## prepare covariates
nswDT[, :=(re74is0 = as.integer(re74==0),
re75is0 = as.integer(re75==0))]

names.covariates = c("age", "education", "married", "nodegree",
"black", "re74", "re74is0", "re75", "re75is0")

data_stan.covariates = list(
N = nrow(nswDT),
K = 10,
Y_obs = nswDT$re78, W = nswDT$treat,
X = as.matrix(nswDT[, c(intercept = 1, .SD), .SDcols=names.covariates]))

## run Stan
fit.ts = stan('two-stage-model.stan', data=data_stan.covariates)

tau


I get that the average treatment effect is \tau = 0.114 (0.48), while in the book they report 1.57 (0.75), much more in line with the simpler model.

Is there something I’m doing wrong? Has anyone used Stan to work on causal inference problems using the potential outcomes framework and can share an example?

Thank you!

Edit:
The coefficients I get are almost identical to the ones reported in the book. Below are two tables with posterior means and standard deviations for all the model’s coefficients. The first one is from the book, the second one from my model.

The coefficients are almost identical with two major exceptions. First, the coefficients corresponding to the variables indicating zero earnings in the past years have flipped signs. After checking, I’m almost certain the authors defined those variables in the opposite way they intended. That would not change anything substantive about the treatment effects, only flip the sign on the coefficients of those variables.

Second, the constants on the first stage logit for Y_i(0) and Y_i(1) reported in the book are \gamma_0 = 2.54 (1.49) and \gamma_1 = 2.54+0.68 (2.49) = 3.22, whereas I get \gamma_0=3.11 (1.62) and \gamma_1 = 2.71.

Book coefficients: My coefficients (in the same order as in the book’s table):

4 Likes

I think it’s the same issue as discussed in this thread.

The typical treatment effect calculation assumes that the residual will be the same for Y_0 and Y_1. The way you set up the generated quantities allows for different residuals. With the normal distribution this will only have an effect on the uncertainty in Tau but with the more complex distribution it can have an effect on the expectation of Tau as well. I think in pseudo code you want to calculate the Tau as follows.

if (W[i] == 1){
Y_1[i] = Y_obs[i];
// work in expectations for precision
Y1is0 = inv_logit(X[i]) * gamma_1);
Resid = Y_obs[i] - Y1is0 * exp(X[i] * beta_1)
Y0is0 = inv_logit(X[i]) * gamma_0);
Y_0[i] = Y0is0 * exp(X[i] * beta_0) + Resid
}


If I am correct, that means that you can actually calculated tau directly as

Y1is0 = inv_logit(X[i]) * gamma_1)
Y0is0 = inv_logit(X[i]) * gamma_0)
tau_individual[i] = Y1is0 * exp(X[i] * beta_1) - Y0is0 * exp(X[i] * beta_0)


@stijn I tried your suggestion but get the same result (with thinner looking tails on the posterior of the treatment effect).

The book suggests to compute the average treatment effects as

1. For each draw d from the parameter posterior, \theta^{(d)} \sim p(\theta \mid \textbf{Y}^{obs}, \textbf{X}, W)
a) Draw complete sample of missing outcomes \textbf{Y}^{miss} from p(Y_i^{miss} \mid \textbf{Y}^{obs}, \textbf{X}, W, \theta^{(d)})
b) Compute the avg. treatment effect for this draw \hat{\tau}^{(d)}=\frac{1}{N}\sum_iY_i(1)-Y_i(0) by filling in the missing value for each i using the corresponding value in \textbf{Y}^{miss}

Then, the posterior average treatment effect is just the average \bar{\tau}=\frac{1}{N^d}\sum_d \hat{\tau}^{(d)} over all the parameter samples we’ve taken.

I have made an edit in the original post adding the coefficients reported in the book and the ones I obtain.

Did you make any progress on this issue? I’m working through Rubin’s examples as well, and I don’t quite know how to implement the model in Stan.

Just a note for the comment above and in the linked thread: the flexibility of Rubin’s approach makes it possible to estimate unit varying treatment effects, as explained in this Stan case study: Model-based Inference for Causal Effects in Completely Randomized Experiments
So you are correct to make the assumption of no correlation between treatment effects given your model and covariates.

Short update on my end, the github linked in the case study includes a version of the model you were trying to run:

Running the model yields the same quantile effects, but not the same finite sample treatment effects: its 1.79(0.86) instead of 1.57(0.74), which makes the results much closer to the model without covariates.

But the covariates are estimated similarly as by Imbens & Rubin.

1 Like