FIML Estimation for Heckman-Style Selection Models

Letchkov · March 17, 2020, 3:41pm

Hello everybody,

I have a fairly easy model, but do not seem to recover all parameters correctly. The setting is a two-period diff-in-diff experiment, where we have a group of treated people that self-select into the treatment (e.g., adoption of an app) and a control group. I model the selection endogeneity using a Tobit-5 Heckman-type approach and I am trying to fit the Full Information Maximum Likelihood directly using the likelihood derived in Maddala (1983) and implemented in the etregress function of STATA: etregress.

All parameters except the treatment variable itself are recovered nicely. The treatment (delta1) is over-estimated by quite a bit and the 95% HDI does not contain the true value. I do not know which part is erroneous.

THE DATA (SIMULATED IN R)

N <- 1500 #cross-sections
T <- 2  #time points
NT <- N*T #observations

#correlated error terms
rho <- 0.8
sd1 <- 0.8
Sigma <- matrix(c(sd1, rho, rho, 1), nrow=2, ncol=2) #variance-covariance matrix
errors <- mvrnorm(n=N, mu=c(0, 0), Sigma=Sigma)

e1 <-  rep(errors[,1], T)
e2 <- errors[,2]

#variables for outcome and selection equations
ys1 <- rnorm(N,0,1)
ys2 <- rnorm(N,0,1)
x1 <- rnorm(N*T,0,1)
x2 <- rnorm(N*T,0,1)
id <- data.frame(id=1:N)
time <- data.frame(t=0:1)
id <- rep(id$id, T)
t_fix <- rep(time$t, each=N)

#parameters
b0 <- 2.5
b1 <- 0
b2 <- 1
delta <- 1.3 #this is the treatment effect
gamma <- 0.4 #this is the post effect
gammadelta <- 0.8 #this is the diff-in-diff effect

w0 <- 1.2
w1 <- -0.8
w2 <- 1.6

#selection eqn
w_star <- c()
w_star <- w0+w1*ys1+w2*ys2+e2 

#selection variable
w_sel = as.numeric(w_star>0) 
w_out <- rep(w_sel, T)
#outcome variable and eqn
tw_fix <- t_fix*w_out
y <- b0+b1*x1+b2*x2+ gamma*t_fix+  delta*w_out + gammadelta* tw_fix + e1

intercept_sel <- rep(1, N) 
intercept_obs <- rep(1, N*T)

#### independent variables that go into selection eqn
zy <- matrix(c(intercept_sel, ys1, ys2), nrow=N, ncol=3) #matrix for the selection equation
zy_columns = ncol(zy) #number of predictors for the treatment including the intercept

#### independent variables that go into outcome eqn
xb <- matrix(c(intercept_obs, x1, x2), nrow=N*T, ncol=3) #all covariates in the outcome equation without the three Diff-in-Diff variables treat, post, and treat x post
xb_columns = ncol(xb)

standata_heckman_diff <- list(
  y = y, #outcome vector of full sample (not observed outcomes are denoted 0)
  zy = zy, #predictors of selection equation (time-invariant)
  xb = xb, #covariates of outcome equation
  zy_columns = zy_columns, # number of predictors in selection
  xb_columns = xb_columns, # number of covariates in outcome equation
  N = N,    #number of cross sections
  T=T,      #number of time points
  NT = N*T, #number of observations overall
  id=id,    #IDs of the participants N
  treat = w_out, #vector of outcomes of the selection equation (1/0)
  post = t_fix,  #indicator for the time point (0 or 1)
  treatpost = tw_fix #interaction between time point and treatment indicators

AND HERE THE STAN CODE:

heckman_diff = "data{
  int<lower=1> N; //cross-sections
  int<lower=1> T; //time points
  int<lower=1> NT; //all observations
  int<lower=1> xb_columns; //number of covariates in outome equation
  int<lower=1> zy_columns; //number of predictors in selection equation
  matrix[NT,xb_columns] xb; //covariate regressor matrix of outcome equation
  matrix[N,zy_columns] zy; //regressor matrix of selection equation
  vector[NT] y; //vector of outcome variable
  int<lower=1> id[NT]; //vector of proband ids
  vector[NT] post; //vector of time variable
  vector<lower=0, upper=1> [NT] treat; //self-selection vector of outcome equation (observed 1 and 0)
  vector<lower=0, upper=1> [NT] treatpost; //interaction effect
}

parameters {
  real<lower=-1,upper=1> rho;         //covariance between errors
  vector[xb_columns] beta_outcome;
  vector[zy_columns] beta_selection;
  real delta1;                        //coefficient of selection dummy variable
  real gamma1;                        //time effect
  real gammadelta1;                   //interaction effect
  real<lower=0> sd1;                  //standard deviation of the outcome model
}

transformed parameters{
real<lower=0> lambda;
lambda = sd1 * rho;
}

model{
  vector[NT] mu_y;
  vector[N] mu_z;

//Priors
  beta_outcome ~ normal(0, 10);   //substantive betas
  beta_selection ~ normal(0, 10); //selection equation betas
  delta1 ~ normal(0,1);          //coefficient on the selection dummy w
  gamma1 ~ normal(0,1);          //coefficient on the selection dummy w
  gammadelta1 ~ normal(0,1);
  
  sd1 ~ normal(1,10);
  
  for (n in 1:NT) {
  mu_y[n] = xb[n,]*beta_outcome+post[n]*gamma1+treat[n]*delta1+treatpost[n]*gammadelta1;
  }
  for (j in 1:N) {
  mu_z[j] = zy[j]*beta_selection; //selection equation
  }
  
  // log likelihood for the overall model
  for (n in 1:NT) { 
    if(treat[id[n]]==1) {
        target += log(Phi((mu_z[id[n]]+(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                  -0.5*((y[n]-mu_y[n])/sd1)^2
                  -log(sqrt(2*pi())*sd1); //Likelihood for positive w-values;
    } else {
       target += log(Phi((-mu_z[id[n]]-(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                 -0.5*((y[n]-mu_y[n])/sd1)^2
                 -log(sqrt(2*pi())*sd1); //Likelihood for negative w-values
    }
  }
  
  //Likelihood for outcome model
  y ~ student_t(4,mu_y, sd1); 
}"

fit_heckman_diff = stan(model_code = heckman_diff,
                       data = standata_heckman_diff,
                       chains = 6, 
                       iter = 1000, 
                       warmup =200,
                       control = list(max_treedepth = 15))
}

The two-step approach using inverse Mill’s Ratios works well, but the FIML is off in the treatment parameter delta1. Does anybody have a clue what I am doing wrong?

Thank you!

bbbales2 · March 19, 2020, 2:35pm

What’s going on here?:

for (n in 1:NT) { 
    if(treat[id[n]]==1) {
        target += log(Phi((mu_z[id[n]]+(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                  -0.5*((y[n]-mu_y[n])/sd1)^2
                  -log(sqrt(2*pi())*sd1); //Likelihood for positive w-values;
    } else {
       target += log(Phi((-mu_z[id[n]]-(y[n]-mu_y[n])*rho/sd1)/sqrt(1-rho^2)))
                 -0.5*((y[n]-mu_y[n])/sd1)^2
                 -log(sqrt(2*pi())*sd1); //Likelihood for negative w-values
    }
  }

Is there a mathy way to write this? You can surround things by dollar signs to get Latex equations in Discourse (https://meta.discourse.org/t/discourse-math-plugin/65770).

Letchkov · March 19, 2020, 3:06pm

These are the likelihood functions in both cases (treatment vs control) as derived in Maddala (1983) and the implemented by STATA:

I thought I had them translated to Stan correctly, but something does not work out. The only edit I made was to also include the interaction term between the treatment and the time periods for the diff-in-diff estimation. However, even leaving this out and implementing everything exactly as stated overestimates the treatment effect delta.

bbbales2 · March 19, 2020, 3:13pm

Do you know what the likelihood is for though? Like is it p(y_j | \beta, ...) or something?

Cause this term seems like a likelihood too: y ~ student_t(4,mu_y, sd1);

Letchkov · March 19, 2020, 3:32pm

I used rtrangucci’s code from another thread as a starting point. I figured the likelihoods were the different likelihoods for the selection and the outcome model.

model {
 vector[N] mu_y2;
 vector[N_pos] mu_y1;

 b1 ~ normal(0, 1);
 b2 ~ normal(0, 1);
 sd1 ~ normal(0, 1);

 mu_y2 = X2 * b2;
 mu_y1 = X1 * b1;

 for (n in 1:N_neg) {
   target += (log(Phi(-mu_y2[n_neg[n]])));
 }
 for (n in 1:N_pos) {
   target += log(Phi(sqrt(1 - rho^2)^(-1)*(mu_y2[n_pos[n]] 
                              + rho / (sd1 * sqrt(1 - rho^2))
                              * (y1[n] - mu_y1[n]))));
 }
 y1 ~ normal(mu_y1, sd1);
}

But apparently this was wrong. In fact, deleting this line solves the problem. Thanks a lot! I did try deleting that line in the original Tobit-2 Heckman and rtrangucci’s code and it gave strange results - so I never tried it again. This was easier than suspected. Sorry about that!

bbbales2 · March 19, 2020, 8:35pm

Oooh, it looks like in rtrangucci’s original term he was using y1 ~ normal(mu_y1, sd1); to handle the terms that look like: -0.5*((y[n]-mu_y[n])/sd1)^2 - log(sqrt(2*pi())*sd1).

Glad you got it working!

Topic		Replies	Views
Heckman sample selection model (working but biased selection coefficients) Modeling	16	4097	October 22, 2017
Heckman selection model code + simulation Modeling	34	6959	March 9, 2023
Simple selection model for MNAR data Modeling missing-data	2	397	February 27, 2024
Identification issue in Heckman model with correlated individual-specific intercepts Modeling fitting-issues , specification	1	465	August 10, 2022
Heckman selection model Modeling	7	1358	August 11, 2017

FIML Estimation for Heckman-Style Selection Models

Related topics