Heckman selection model


Hi, I am currently trying to build Heckman selection model using STAN.

This is my stan code for the model.

data {
  int<lower=0> N; //Number of total data
  int<lower=0,upper=1> R[N]; //Indicator for the observation
  int<lower=0> nx; //Number of coefficients for the outcome model
  int<lower=0> nz; //Nmber of coefficients for the selection model
  vector[N] y;
  matrix[N, nx] x; //Covariates for the outcome model
  matrix[N, nz] z; //Covariates for the selection model
  int<lower=0> n; //Number of total coefficients
  vector[n] pri_mean;
  matrix[n,n] pri_var;
parameters {
  vector[nx] beta;
  vector[nz] gamma;
  real<lower=0> sigma;
  real rho;
model {
  append_row(beta,gamma) ~ multi_normal(pri_mean,pri_var);
  for(j in 1:N){
    if (R[j] == 1) {
    real err = (y[j] - x[j,] * beta) / sigma;
    target += normal_lcdf((z[j,] * gamma + err * rho) / sqrt(1 - square(rho)) | 0, 1) - 0.5 * square(err) - log(sqrt(2 * pi()) * sigma);
  else target += normal_lcdf(-z[j,] * gamma | 0, 1);

I don’t see anything wrong in this model but the result is pretty much off for some reason.

Should I tune the prior distribution or something?

Can you please help me building this model?

Thank you.

[edit: added code blocks]


It’s not clear from looking at this what model you’re trying to build. I don’t even see how this could compile with err being declared as real.

Also, it’s very hard to answer questions with only “off for some reason” as a hint as to what’s going wrong.


What I’m trying to build can be found here: http://www.stata.com/manuals14/rheckman.pdf#rheckmanMethodsandformulas

I used same notations with the link.

Sorry that I couldn’t explain more about the question, but since I think the model is correct, I don’t know how I can state the question more clearly.

I talked about this model with bgoodri(developer) before so maybe he or she knows what I’m talking about.


I’m confused, but I would start by constraining the correlation like real<lower=-1,upper=1> rho;.


I asked the question about missing data at this post:

which was the part of this post.

So, I tried to build the model in the link you sent and I think I did it correctly, but the result is off for some reason.

I tried real<lower=-1,upper=1> rho; but this doesn’t make any difference either.

Do you seen why it’s giving off result?


What leads you to believe the results are off?


Generating simulated data

x_1 = rnorm(1000, 0, 2)
z_1 = rnorm(1000, 0, 2)
omega = matrix(c(1, 0.3, 0.3, 1), nrow = 2)
error = rmvnorm(1000, c(0,0), omega)
eps = error[,1]
eta = error[,2]
y_star = x_1+eps
y_star = y_star+0.5
u_star = 2+x_1+(1.5*z_1)+eta
count = 0
u = rep(1, 1000)
for(i in 1:1000){
u[i] = 0
count = count + 1

y = rep(0,1000)
for(i in 1:1000){
if(u[i] == 1){
y[i] = y_star[i]

R <- rep(1,1000)

This is the code to generate simulation data.

Using this as data, I should get beta[1]=0.5, beta[2]=1, gamma[1]=2, gamma[2]=1, gamma[3]=1.5, sigma=1, rho=0.3, but as a result of my stan model, I got this:


beta[1] 0.36
beta[2] 0.98
gamma[1] 0.77
gamma[2] 0.44
gamma[3] 0.68
sigma 1.03
rho 0.51

This result is too far from what I expected.

FYI, below is the code I used to make data suitable for the stan model.

#Assign values to the input parameters for heckman model
N <- 1000
nx <- 2
nz <- 3
x <- as.matrix(cbind(1,x_1))
z <- as.matrix(cbind(1,x_1,z_1))
n <- nx+nz
pri_mean <- rep(0,n)
pri_var <- as.matrix(diag(rep(.01,n)))
dat <- list(N,R,nx,nz,y,x,z,n,pri_mean,pri_var)

#Heckman model regression
f <- stan(file = ‘heckman_ko.stan’, data = dat, iter = 1000, chains = 4)

Do you see anything wrong in my code?


I don’t see the beta being simulated anywhere.

When you look at output, you don’t want to just look at the posterior mean—look at the posterior intervals.