I want to do a simple instrumental variables model.
In the first stage, each of N units is encouraged (more or less strongly) to get treatment. In the second stage, each unit gets some amount of treatment, which depends on the amount of encouragement and on an unobserved confound. The outcome y depends on the treatment and the confound. I then want to estimate the effect of the treatment on the outcome.
I generated data using this model:
data {
  int<lower=0> N;
  real<lower=0> sigma;
  real alpha;
  real beta;
  real gamma;
}
parameters {
  vector[N] encourage;
  vector[N] treat;
  vector[N] confound;
  vector[N] y;
}
model {
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    confound[n] ~ normal(0, 1);  // prior needed so the confound can be sampled; unit scale assumed
    treat[n] ~ normal(encourage[n] + confound[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n] + gamma * confound[n], sigma);
  }
}
and these assumed values:
{
"N": 100,
"sigma": 1,
"alpha": 2,
"beta": 3,
"gamma": 4
}
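For reference, the same data-generating process can be sketched directly in NumPy (the standard-normal scale of the confound is my own assumption; nothing above pins it down):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100
sigma, alpha, beta, gamma = 1.0, 2.0, 3.0, 4.0

confound = rng.normal(0.0, 1.0, size=N)   # unobserved; unit scale assumed
encourage = rng.normal(0.0, 5.0, size=N)  # the instrument
treat = rng.normal(encourage + confound, 1.0)                   # first stage
y = rng.normal(alpha + beta * treat + gamma * confound, sigma)  # outcome
```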
Using the data generated by sampling from that model, I wrote a second model that moves sigma, alpha, beta, and gamma from the data block into the parameters block (with the generated encourage, treat, and y now passed in as data), in order to recover the assumed parameter values:
data {
  int<lower=0> N;
  vector[N] encourage;
  vector[N] treat;
  vector[N] y;
}
parameters {
  real<lower=0> sigma;
  real alpha;
  real beta;
  real gamma;
  vector[N] confound;
}
model {
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    confound[n] ~ normal(0, 1);
    treat[n] ~ normal(encourage[n] + confound[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n] + gamma * confound[n], sigma);
  }
}
That was successful, yielding estimates close to the true parameter values:
Row │ variable mean eltype
│ Symbol Float64 DataType
─────┼──────────────────────────────────────────
1 │ lp__ -85.6632 Float64
2 │ accept_stat__ 0.864591 Float64
3 │ stepsize__ 0.000888479 Float64
4 │ treedepth__ 9.664 Int64
5 │ n_leapfrog__ 951.83 Int64
6 │ divergent__ 0.092 Int64
7 │ energy__ 137.435 Float64
8 │ sigma 0.918336 Float64
9 │ alpha 2.44468 Float64
10 │ beta 3.04862 Float64
11 │ gamma 3.95177 Float64
...
But that parametrization required me to describe the confound variable explicitly in the model, which I haven't previously seen in an instrumental variables model. Is that the correct thing to do?
I tried removing the confound variable from the model:
data {
  int<lower=0> N;
}
parameters {
  real<lower=0> sigma;  // lower bound needed for the lognormal prior
  real alpha;
  real beta;
  vector[N] encourage;
  vector[N] treat;
  vector[N] y;
}
model {
  sigma ~ lognormal(0, 1);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  for (n in 1:N) {
    encourage[n] ~ normal(0, 5);
    treat[n] ~ normal(encourage[n], 1.0);
    y[n] ~ normal(alpha + beta * treat[n], sigma);
  }
}
but it yielded estimates very far from the true values:
Row │ variable mean eltype
│ Symbol Float64 DataType
─────┼────────────────────────────────────────
1 │ lp__ -236.601 Float64
2 │ accept_stat__ 0.7093 Float64
3 │ stepsize__ 0.0112572 Float64
4 │ treedepth__ 8.799 Int64
5 │ n_leapfrog__ 564.953 Int64
6 │ divergent__ 0.122 Int64
7 │ energy__ 388.643 Float64
8 │ sigma 2.59038 Float64
9 │ alpha 0.173213 Float64
10 │ beta -2.61633 Float64
...
I saw that there is a blog post with an example instrumental variables model, but it seemed more complicated and I didn't really understand it.
Questions:
- Is it OK to model the confounder explicitly, as I did in the first estimation example? I think doing so is consistent with the "generative modeling" style I've heard about, but I haven't seen confounders modeled explicitly before, so (even though the estimates line up with the assumed values) I worry I'm doing something wrong.
- Is there a simple way to model instrumental variables without explicitly modeling the confound parameter? Are there advantages to doing it that way?
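For context on the second question, here is a sketch of the classic two-stage least squares estimator, which uses the instrument without ever representing the confound explicitly. The simulation scales mirror my setup above, with the confound's unit scale again my own assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
sigma, alpha, beta, gamma = 1.0, 2.0, 3.0, 4.0
confound = rng.normal(0.0, 1.0, size=N)   # unobserved; unit scale assumed
encourage = rng.normal(0.0, 5.0, size=N)  # the instrument
treat = rng.normal(encourage + confound, 1.0)
y = rng.normal(alpha + beta * treat + gamma * confound, sigma)

# Stage 1: regress treatment on the instrument, keep the fitted values,
# which are purged of the confound.
X1 = np.column_stack([np.ones(N), encourage])
treat_hat = X1 @ np.linalg.lstsq(X1, treat, rcond=None)[0]

# Stage 2: regress the outcome on the fitted treatment.
# beta_hat should land close to the true beta = 3.
X2 = np.column_stack([np.ones(N), treat_hat])
alpha_hat, beta_hat = np.linalg.lstsq(X2, y, rcond=None)[0]

# Naive OLS of y on the raw treatment, for comparison; it absorbs part of
# the confound's effect and is biased.
X3 = np.column_stack([np.ones(N), treat])
_, beta_ols = np.linalg.lstsq(X3, y, rcond=None)[0]
```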