Simplified instrumental variables model

I want to do a simple instrumental variables model.

In the first stage, each of N units is encouraged (more or less strongly) to get the treatment. In the second stage, each unit gets a certain amount of treatment, determined by the amount of encouragement and by a confound. The outcome y is affected by the treatment and by the confound. I then want to estimate the effect of the treatment on the outcome.
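
Written out, the generative model I have in mind is (this is just the Stan program below expressed as equations):

encourage[n] ~ normal(0, 5)
treat[n]     ~ normal(encourage[n] + confound[n], 1)
y[n]         ~ normal(alpha + beta * treat[n] + gamma * confound[n], sigma)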

I generated data using this model:

data {
  int<lower=0> N;
  real sigma;
  real alpha;
  real beta;
  real gamma;
}
parameters {
  vector[N] encourage;
  vector[N] treat;
  vector[N] confound;
  vector[N] y;
}
model {
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    treat[n] ~ normal(encourage[n] + confound[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n] + gamma * confound[n], sigma);
  }
}

and these assumed values:

{
    "N": 100,
    "sigma": 1,
    "alpha": 2,
    "beta": 3,
    "gamma": 4
}

Using the data generated by sampling from that model, I made another model, moving some of the data values into the parameters block, in order to recover the assumed parameter values:

data {
  int<lower=0> N;
  real sigma;
  real alpha;
  real beta;
  real gamma;
}
parameters {
  vector[N] encourage;
  vector[N] treat;
  vector[N] confound;
  vector[N] y;
}
model {
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    treat[n] ~ normal(encourage[n] + confound[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n] + gamma * confound[n], sigma);
  }
}

That was successful, yielding estimates close to the true parameter values:

 Row │ variable       mean             eltype   
     │ Symbol         Float64          DataType 
─────┼──────────────────────────────────────────
   1 │ lp__             -85.6632       Float64
   2 │ accept_stat__      0.864591     Float64
   3 │ stepsize__         0.000888479  Float64
   4 │ treedepth__        9.664        Int64
   5 │ n_leapfrog__     951.83         Int64
   6 │ divergent__        0.092        Int64
   7 │ energy__         137.435        Float64
   8 │ sigma              0.918336     Float64
   9 │ alpha              2.44468      Float64
  10 │ beta               3.04862      Float64
  11 │ gamma              3.95177      Float64
...

But that parametrization required me to describe the confound variable explicitly in the model, which I haven’t previously seen in an instrumental variables model. Is that the correct thing to do?

I tried removing the confound variable from the model:

data {
  int<lower=0> N;
}
parameters {
  real sigma;
  real alpha;
  real beta;
  vector[N] encourage;
  vector[N] treat;
  vector[N] y;
}
model {
  sigma ~ lognormal(0,1);
  alpha ~ normal(0,5);
  beta ~ normal(0,5);
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    treat[n] ~ normal(encourage[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n], sigma);
  }
}

but it yielded estimates very far from the true values:

 Row │ variable       mean           eltype   
     │ Symbol         Float64        DataType 
─────┼────────────────────────────────────────
   1 │ lp__           -236.601       Float64
   2 │ accept_stat__     0.7093      Float64
   3 │ stepsize__        0.0112572   Float64
   4 │ treedepth__       8.799       Int64
   5 │ n_leapfrog__    564.953       Int64
   6 │ divergent__       0.122       Int64
   7 │ energy__        388.643       Float64
   8 │ sigma             2.59038     Float64
   9 │ alpha             0.173213    Float64
  10 │ beta             -2.61633     Float64
...

I saw that there is a blog post with an example instrumental variables model, but it seemed more complicated and I didn’t really understand it.

Questions:

  • Is it ok to model the confounder explicitly, like I did in the first estimation example? I think doing that is consistent with the “generative modeling” style that I’ve heard about, but I haven’t seen confounders modeled explicitly before, so (even though the estimates line up with the assumed values) I worry I’m doing something wrong.

  • Is there a simple way to model instrumental variables without explicitly modeling the confound parameter? Are there advantages to leaving it out?

If you observe the confounder you don’t need instrumental variables at all; IV is for the case where the confounder is unobserved. You generated the data with a confounder, but you should remove it for the IV estimation, because observing the confounder removes the impetus for IV in the first place.

You don’t show any standard errors with your second set of estimates, and with N=100 I wouldn’t be shocked to see IV produce significantly wider ones. IV estimators have very suspect finite-sample properties.
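
To your second question: the usual way to set up IV without an explicit confound parameter is to model the treatment and the outcome jointly, with correlated residuals. Whatever the unobserved confounder contributes to both equations ends up in that residual correlation, and the instrument (encouragement) is what identifies beta. A rough sketch of what that could look like for your setup is below; it’s untested, and the first-stage names (alpha_t, delta, tau, L) are mine, not from your model. I suspect the blog post you found is doing something along these lines.

data {
  int<lower=0> N;
  vector[N] encourage;   // instrument
  vector[N] treat;       // treatment
  vector[N] y;           // outcome
}
parameters {
  real alpha_t;                // first-stage intercept
  real delta;                  // effect of the instrument on treatment
  real alpha;                  // outcome intercept
  real beta;                   // treatment effect of interest
  vector<lower=0>[2] tau;      // residual scales for (treat, y)
  cholesky_factor_corr[2] L;   // residual correlation; the confounding lives here
}
model {
  alpha_t ~ normal(0, 5);
  delta ~ normal(0, 5);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  tau ~ lognormal(0, 1);
  L ~ lkj_corr_cholesky(2);
  {
    array[N] vector[2] mu;
    array[N] vector[2] z;
    for (n in 1:N) {
      mu[n] = [alpha_t + delta * encourage[n], alpha + beta * treat[n]]';
      z[n] = [treat[n], y[n]]';
    }
    z ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L));
  }
}

Note that this conditions only on encourage, treat, and y, which is exactly the situation IV is meant for.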

Hi @jack_monroe, thanks for replying. It looks like I made a mistake typing up my procedure above, but it’s too late to edit. I’ve corrected the information here and increased N to get more precision.

I generated the synthetic data with

{
    "N": 1000,
    "sigma": 1,
    "alpha": 2,
    "beta": 3,
    "gamma": 4
}

and

# confound_known.stan
data {
  int<lower=0> N;
  real sigma;
  real alpha;
  real beta;
  real gamma;
}
parameters {
  vector[N] encourage;
  vector[N] treat;
  vector[N] confound;
  vector[N] y;
}
model {
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    treat[n] ~ normal(encourage[n] + confound[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n] + gamma * confound[n], sigma);
  }
}

I estimated it with

# confound_unknown.stan
data {
  int<lower=0> N;
  vector[N] encourage;
  vector[N] treat;
  vector[N] y;
}
parameters {
  real sigma;
  real alpha;
  real beta;
  real gamma;
  vector[N] confound;
}
model {
  sigma ~ lognormal(0,1);
  alpha ~ normal(0,5);
  beta ~ normal(0,5);
  gamma ~ normal(0,5);
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    treat[n] ~ normal(encourage[n] + confound[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n] + gamma * confound[n], sigma);
  }
}

and

# noconfound.stan
data {
  int<lower=0> N;
}
parameters {
  real sigma;
  real alpha;
  real beta;
  vector[N] encourage;
  vector[N] treat;
  vector[N] y;
}
model {
  sigma ~ lognormal(0,1);
  alpha ~ normal(0,5);
  beta ~ normal(0,5);
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    treat[n] ~ normal(encourage[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n], sigma);
  }
}

The results are

> summarize_stan(indir("confound_known.csv"))
7×4 DataFrame
 Row │ variable       mean             std           eltype   
     │ Symbol         Float64          Float64       DataType 
─────┼────────────────────────────────────────────────────────
   1 │ lp__           -1491.65         35.803        Float64
   2 │ accept_stat__      0.870973      0.11108      Float64
   3 │ stepsize__         0.000240156   2.71186e-20  Float64
   4 │ treedepth__       10.0           0.0          Int64
   5 │ n_leapfrog__    1023.0           0.0          Int64
   6 │ divergent__        0.0           0.0          Int64
   7 │ energy__        3490.63         57.1504       Float64

> summarize_stan(indir("confound_unknown.csv"))
11×4 DataFrame
 Row │ variable       mean            std           eltype   
     │ Symbol         Float64         Float64       DataType 
─────┼───────────────────────────────────────────────────────
   1 │ lp__           -1337.63        84.8932       Float64
   2 │ accept_stat__      0.944365     0.0801436    Float64
   3 │ stepsize__         9.66971e-5   1.35593e-19  Float64
   4 │ treedepth__       10.0          0.0          Int64
   5 │ n_leapfrog__    1023.0          0.0          Int64
   6 │ divergent__        0.0          0.0          Int64
   7 │ energy__        1839.88        87.6988       Float64
   8 │ sigma              1.41037      0.128908     Float64
   9 │ alpha              2.01083      0.0832757    Float64
  10 │ beta               3.01167      0.0133289    Float64
  11 │ gamma              3.98868      0.0133275    Float64

> summarize_stan(indir("noconfound.csv"))
10×4 DataFrame
 Row │ variable       mean          std            eltype   
     │ Symbol         Float64       Float64        DataType 
─────┼──────────────────────────────────────────────────────
   1 │ lp__           -2037.15      321.771        Float64
   2 │ accept_stat__      0.556105    0.323591     Float64
   3 │ stepsize__         0.029138    4.85966e-17  Float64
   4 │ treedepth__        9.443       0.554128     Int64
   5 │ n_leapfrog__     749.08      264.582        Int64
   6 │ divergent__        0.0         0.0          Int64
   7 │ energy__        3539.61      323.302        Float64
   8 │ sigma              1.79112     0.563623     Float64
   9 │ alpha              0.716627    4.90802      Float64
  10 │ beta              -3.96536     2.64157      Float64

In confound_unknown.stan I am trying to simulate a situation where I don’t have measurements of the confounder, but I estimate it from the data along with the treatment effect. These estimates come out close to the known-true parameter values.

In noconfound.stan I’m trying to get an IV estimate of the treatment effect. Those estimates have large posterior standard deviations, consistent with your point about IV uncertainty, but I’m not sure my implementation is actually correct, since the point estimates are so far off.

Do these make sense to you, or am I on the wrong track?