Heckman selection model

donghyeon_ko · August 2, 2017, 5:17am

Hi, I am currently trying to build Heckman selection model using STAN.

This is my stan code for the model.

data {
  int<lower=0> N; //Number of total data
  int<lower=0,upper=1> R[N]; //Indicator for the observation
  int<lower=0> nx; //Number of coefficients for the outcome model
  int<lower=0> nz; //Nmber of coefficients for the selection model
  vector[N] y;
  matrix[N, nx] x; //Covariates for the outcome model
  matrix[N, nz] z; //Covariates for the selection model
  int<lower=0> n; //Number of total coefficients
  vector[n] pri_mean;
  matrix[n,n] pri_var;
}
parameters {
  vector[nx] beta;
  vector[nz] gamma;
  real<lower=0> sigma;
  real rho;
}
model {
  append_row(beta,gamma) ~ multi_normal(pri_mean,pri_var);
  for(j in 1:N){
    if (R[j] == 1) {
    real err = (y[j] - x[j,] * beta) / sigma;
    target += normal_lcdf((z[j,] * gamma + err * rho) / sqrt(1 - square(rho)) | 0, 1) - 0.5 * square(err) - log(sqrt(2 * pi()) * sigma);
  }
  else target += normal_lcdf(-z[j,] * gamma | 0, 1);
  }
}

I don’t see anything wrong in this model but the result is pretty much off for some reason.

Should I tune the prior distribution or something?

Can you please help me building this model?

Thank you.

[edit: added code blocks]

Bob_Carpenter · August 6, 2017, 9:20pm

It’s not clear from looking at this what model you’re trying to build. I don’t even see how this could compile with err being declared as real.

Also, it’s very hard to answer questions with only “off for some reason” as a hint as to what’s going wrong.

donghyeon_ko · August 8, 2017, 5:29am

What I’m trying to build can be found here: http://www.stata.com/manuals14/rheckman.pdf#rheckmanMethodsandformulas

I used same notations with the link.

Sorry that I couldn’t explain more about the question, but since I think the model is correct, I don’t know how I can state the question more clearly.

I talked about this model with bgoodri(developer) before so maybe he or she knows what I’m talking about.

bgoodri · August 8, 2017, 5:46am

I’m confused, but I would start by constraining the correlation like real<lower=-1,upper=1> rho;.

donghyeon_ko · August 8, 2017, 5:53am

I asked the question about missing data at this post:

which was the part of this post.

So, I tried to build the model in the link you sent and I think I did it correctly, but the result is off for some reason.

I tried real<lower=-1,upper=1> rho; but this doesn’t make any difference either.

Do you seen why it’s giving off result?

bgoodri · August 8, 2017, 5:59am

What leads you to believe the results are off?

donghyeon_ko · August 8, 2017, 6:06am

Generating simulated data

x_1 = rnorm(1000, 0, 2)
z_1 = rnorm(1000, 0, 2)
library(mvtnorm)
omega = matrix(c(1, 0.3, 0.3, 1), nrow = 2)
error = rmvnorm(1000, c(0,0), omega)
eps = error[,1]
eta = error[,2]
y_star = x_1+eps
y_star = y_star+0.5
u_star = 2+x_1+(1.5*z_1)+eta
count = 0
u = rep(1, 1000)
for(i in 1:1000){
if(u_star[i]<=0){
u[i] = 0
count = count + 1
}
}

y = rep(0,1000)
for(i in 1:1000){
if(u[i] == 1){
y[i] = y_star[i]
}
}

R <- rep(1,1000)
R[y==0]=0

This is the code to generate simulation data.

Using this as data, I should get beta[1]=0.5, beta[2]=1, gamma[1]=2, gamma[2]=1, gamma[3]=1.5, sigma=1, rho=0.3, but as a result of my stan model, I got this:

            mean

beta[1] 0.36
beta[2] 0.98
gamma[1] 0.77
gamma[2] 0.44
gamma[3] 0.68
sigma 1.03
rho 0.51

This result is too far from what I expected.

FYI, below is the code I used to make data suitable for the stan model.

#Assign values to the input parameters for heckman model
N <- 1000
nx <- 2
nz <- 3
x <- as.matrix(cbind(1,x_1))
z <- as.matrix(cbind(1,x_1,z_1))
n <- nx+nz
pri_mean <- rep(0,n)
pri_var <- as.matrix(diag(rep(.01,n)))
dat <- list(N,R,nx,nz,y,x,z,n,pri_mean,pri_var)

#Heckman model regression
f <- stan(file = ‘heckman_ko.stan’, data = dat, iter = 1000, chains = 4)

Do you see anything wrong in my code?

Bob_Carpenter · August 11, 2017, 10:57pm

I don’t see the beta being simulated anywhere.

When you look at output, you don’t want to just look at the posterior mean—look at the posterior intervals.

Topic		Replies	Views
Heckman selection model code + simulation Modeling	34	6971	March 9, 2023
Linear Mixed Model Modeling	2	445	January 21, 2020
Examples for modeling selection effect, non-response? Modeling	2	636	February 9, 2020
Missing data handling Modeling	10	4437	June 23, 2017
Initialization error Modeling	5	2465	June 11, 2017

Heckman selection model

Generating simulated data

Related topics