FIML Estimation for Heckman-Style Selection Models

Hello everybody,

I have a fairly easy model, but do not seem to recover all parameters correctly. The setting is a two-period diff-in-diff experiment, where we have a group of treated people that self-select into the treatment (e.g., adoption of an app) and a control group. I model the selection endogeneity using a Tobit-5 Heckman-type approach and I am trying to fit the Full Information Maximum Likelihood directly using the likelihood derived in Maddala (1983) and implemented in the etregress function of STATA: etregress.

All parameters except the treatment variable itself are recovered nicely. The treatment (delta1) is over-estimated by quite a bit and the 95% HDI does not contain the true value. I do not know which part is erroneous.

THE DATA (SIMULATED IN R)

N <- 1500 #cross-sections
T <- 2  #time points
NT <- N*T #observations

#correlated error terms
rho <- 0.8
sd1 <- 0.8
Sigma <- matrix(c(sd1, rho, rho, 1), nrow=2, ncol=2) #variance-covariance matrix
errors <- mvrnorm(n=N, mu=c(0, 0), Sigma=Sigma)

e1 <-  rep(errors[,1], T)
e2 <- errors[,2]

#variables for outcome and selection equations
ys1 <- rnorm(N,0,1)
ys2 <- rnorm(N,0,1)
x1 <- rnorm(N*T,0,1)
x2 <- rnorm(N*T,0,1)
id <- data.frame(id=1:N)
time <- data.frame(t=0:1)
id <- rep(id$id, T)
t_fix <- rep(time$t, each=N)

#parameters
b0 <- 2.5
b1 <- 0
b2 <- 1
delta <- 1.3 #this is the treatment effect
gamma <- 0.4 #this is the post effect
gammadelta <- 0.8 #this is the diff-in-diff effect

w0 <- 1.2
w1 <- -0.8
w2 <- 1.6

#selection eqn
w_star <- c()
w_star <- w0+w1*ys1+w2*ys2+e2 

#selection variable
w_sel = as.numeric(w_star>0) 
w_out <- rep(w_sel, T)
#outcome variable and eqn
tw_fix <- t_fix*w_out
y <- b0+b1*x1+b2*x2+ gamma*t_fix+  delta*w_out + gammadelta* tw_fix + e1

intercept_sel <- rep(1, N) 
intercept_obs <- rep(1, N*T)

#### independent variables that go into selection eqn
zy <- matrix(c(intercept_sel, ys1, ys2), nrow=N, ncol=3) #matrix for the selection equation
zy_columns = ncol(zy) #number of predictors for the treatment including the intercept

#### independent variables that go into outcome eqn
xb <- matrix(c(intercept_obs, x1, x2), nrow=N*T, ncol=3) #all covariates in the outcome equation without the three Diff-in-Diff variables treat, post, and treat x post
xb_columns = ncol(xb)

standata_heckman_diff <- list(
  y = y, #outcome vector of full sample (not observed outcomes are denoted 0)
  zy = zy, #predictors of selection equation (time-invariant)
  xb = xb, #covariates of outcome equation
  zy_columns = zy_columns, # number of predictors in selection
  xb_columns = xb_columns, # number of covariates in outcome equation
  N = N,    #number of cross sections
  T=T,      #number of time points
  NT = N*T, #number of observations overall
  id=id,    #IDs of the participants N
  treat = w_out, #vector of outcomes of the selection equation (1/0)
  post = t_fix,  #indicator for the time point (0 or 1)
  treatpost = tw_fix #interaction between time point and treatment indicators

AND HERE THE STAN CODE:

heckman_diff = "data{
  int<lower=1> N; //cross-sections
  int<lower=1> T; //time points
  int<lower=1> NT; //all observations
  int<lower=1> xb_columns; //number of covariates in outome equation
  int<lower=1> zy_columns; //number of predictors in selection equation
  matrix[NT,xb_columns] xb; //covariate regressor matrix of outcome equation
  matrix[N,zy_columns] zy; //regressor matrix of selection equation
  vector[NT] y; //vector of outcome variable
  int<lower=1> id[NT]; //vector of proband ids
  vector[NT] post; //vector of time variable
  vector<lower=0, upper=1> [NT] treat; //self-selection vector of outcome equation (observed 1 and 0)
  vector<lower=0, upper=1> [NT] treatpost; //interaction effect
}

parameters {
  real<lower=-1,upper=1> rho;         //covariance between errors
  vector[xb_columns] beta_outcome;
  vector[zy_columns] beta_selection;
  real delta1;                        //coefficient of selection dummy variable
  real gamma1;                        //time effect
  real gammadelta1;                   //interaction effect
  real<lower=0> sd1;                  //standard deviation of the outcome model
}

transformed parameters{
real<lower=0> lambda;
lambda = sd1 * rho;
}

model{
  vector[NT] mu_y;
  vector[N] mu_z;

//Priors
  beta_outcome ~ normal(0, 10);   //substantive betas
  beta_selection ~ normal(0, 10); //selection equation betas
  delta1 ~ normal(0,1);          //coefficient on the selection dummy w
  gamma1 ~ normal(0,1);          //coefficient on the selection dummy w
  gammadelta1 ~ normal(0,1);
  
  sd1 ~ normal(1,10);
  
  for (n in 1:NT) {
  mu_y[n] = xb[n,]*beta_outcome+post[n]*gamma1+treat[n]*delta1+treatpost[n]*gammadelta1;
  }
  for (j in 1:N) {
  mu_z[j] = zy[j]*beta_selection; //selection equation
  }
  
  // log likelihood for the overall model
  for (n in 1:NT) { 
    if(treat[id[n]]==1) {
        target += log(Phi((mu_z[id[n]]+(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                  -0.5*((y[n]-mu_y[n])/sd1)^2
                  -log(sqrt(2*pi())*sd1); //Likelihood for positive w-values;
    } else {
       target += log(Phi((-mu_z[id[n]]-(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                 -0.5*((y[n]-mu_y[n])/sd1)^2
                 -log(sqrt(2*pi())*sd1); //Likelihood for negative w-values
    }
  }
  
  //Likelihood for outcome model
  y ~ student_t(4,mu_y, sd1); 
}"

fit_heckman_diff = stan(model_code = heckman_diff,
                       data = standata_heckman_diff,
                       chains = 6, 
                       iter = 1000, 
                       warmup =200,
                       control = list(max_treedepth = 15))
} 

The two-step approach using inverse Mill’s Ratios works well, but the FIML is off in the treatment parameter delta1. Does anybody have a clue what I am doing wrong?

Thank you!

What’s going on here?:

for (n in 1:NT) { 
    if(treat[id[n]]==1) {
        target += log(Phi((mu_z[id[n]]+(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                  -0.5*((y[n]-mu_y[n])/sd1)^2
                  -log(sqrt(2*pi())*sd1); //Likelihood for positive w-values;
    } else {
       target += log(Phi((-mu_z[id[n]]-(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                 -0.5*((y[n]-mu_y[n])/sd1)^2
                 -log(sqrt(2*pi())*sd1); //Likelihood for negative w-values
    }
  }

Is there a mathy way to write this? You can surround things by dollar signs to get Latex equations in Discourse (https://meta.discourse.org/t/discourse-math-plugin/65770).

These are the likelihood functions in both cases (treatment vs control) as derived in Maddala (1983) and the implemented by STATA:

I thought I had them translated to Stan correctly, but something does not work out. The only edit I made was to also include the interaction term between the treatment and the time periods for the diff-in-diff estimation. However, even leaving this out and implementing everything exactly as stated overestimates the treatment effect delta.

Do you know what the likelihood is for though? Like is it p(y_j | \beta, ...) or something?

Cause this term seems like a likelihood too: y ~ student_t(4,mu_y, sd1);

I used rtrangucci’s code from another thread as a starting point. I figured the likelihoods were the different likelihoods for the selection and the outcome model.

model {
 vector[N] mu_y2;
 vector[N_pos] mu_y1;

 b1 ~ normal(0, 1);
 b2 ~ normal(0, 1);
 sd1 ~ normal(0, 1);

 mu_y2 = X2 * b2;
 mu_y1 = X1 * b1;

 for (n in 1:N_neg) {
   target += (log(Phi(-mu_y2[n_neg[n]])));
 }
 for (n in 1:N_pos) {
   target += log(Phi(sqrt(1 - rho^2)^(-1)*(mu_y2[n_pos[n]] 
                              + rho / (sd1 * sqrt(1 - rho^2))
                              * (y1[n] - mu_y1[n]))));
 }
 y1 ~ normal(mu_y1, sd1);
}

But apparently this was wrong. In fact, deleting this line solves the problem. Thanks a lot! I did try deleting that line in the original Tobit-2 Heckman and rtrangucci’s code and it gave strange results - so I never tried it again. This was easier than suspected. Sorry about that!

1 Like

Oooh, it looks like in rtrangucci’s original term he was using y1 ~ normal(mu_y1, sd1); to handle the terms that look like: -0.5*((y[n]-mu_y[n])/sd1)^2 - log(sqrt(2*pi())*sd1).

Glad you got it working!

1 Like