Missing data handling


#1

I’m developing Heckman selection model using Rstan. Data for Heckman selection model includes missing values which are NA’s in R and we cannot just delete those values since they play important role to the model. To deal with these missing data with Rstan, I tried to implement Ch.7 of Stan User Manual (Missing data & Partially known parameters). However, when I tried to do that, I got some error. Below is the code I wrote (Practice code using t-distribution to check missing data handling is working),

data {
vector[1000] x;
vector[800] y_obs;
}
parameters {
real beta;
real mu;
real sig;
real v;
vector[200] y_mis;
}
model {
(y_obs-(betax))~ student_t(v, mu, sig);
(y_mis-(beta
x))~ student_t(v, mu, sig);
}

Below is the code for the R file.

x <- rnorm(1000,0,2)
eps <- rt(1000,3)
y=1.5*x+eps
mis <- sample(1:1000, 200, replace = F)
mis
y[mis] = NA
y_mis = y[is.na(y)]
y_obs = y[!is.na(y)]
y_obs[800]
length(y_mis)
#x = x[!is.na(y)]
x
length(x)
schools_dat <- list(x,y_obs)

fit <- stan(file = ‘test.stan’, data = schools_dat,
iter = 1000, chains = 4)
print(fit)
plot(fit)

Below is the error message that I got.

DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
subtract(y_obs,multiply(beta,x)) ~ student_t(…)
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
subtract(y_mis,multiply(beta,x)) ~ student_t(…)

hash mismatch so recompiling; make sure Stan code ends with a blank line
In file included from C:/Users/g1310/Documents/R/win-library/3.3/BH/include/boost/config.hpp:39:0,
from C:/Users/g1310/Documents/R/win-library/3.3/BH/include/boost/math/tools/config.hpp:13,
from C:/Users/g1310/Documents/R/win-library/3.3/StanHeaders/include/stan/math/rev/core/var.hpp:7,
from C:/Users/g1310/Documents/R/win-library/3.3/StanHeaders/include/stan/math/rev/core/gevv_vvv_vari.hpp:5,
from C:/Users/g1310/Documents/R/win-library/3.3/StanHeaders/include/stan/math/rev/core.hpp:12,
from C:/Users/g1310/Documents/R/win-library/3.3/StanHeaders/include/stan/math/rev/mat.hpp:4,
from C:/Users/g1310/Documents/R/win-library/3.3/StanHeaders/include/stan/math.hpp:4,
from C:/Users/g1310/Documents/R/win-library/3.3/StanHeaders/include/src/stan/model/model_header.hpp:4,
from file1fe47b81e23.cpp:8:
C:/Users/g1310/Documents/R/win-library/3.3/BH/include/boost/config/compiler/gcc.hpp:186:0: warning: “BOOST_NO_CXX11_RVALUE_REFERENCES” redefined

define BOOST_NO_CXX11_RVALUE_REFERENCES

^
:0:0: note: this is the location of the previous definition
cc1plus.exe: warning: unrecognized command line option “-Wno-ignored-attributes”

SAMPLING FOR MODEL ‘test’ NOW (CHAIN 1).

Rejecting initial value:
Error evaluating the log probability at the initial value.

Rejecting initial value:
Error evaluating the log probability at the initial value.

Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] “Error : Initialization failed.”
[1] “error occurred during calling the sampler; sampling not done”

Is there something wrong with my code?


#2

These both need <lower=0> after real. Also, the point of a Heckman model is to estimate the correlation between the error terms, which you are not doing. I would suggest starting with the Heckman likelihood function (which does not require y_mis)

and get that working before you start making modifications.


#3

The code I wrote above was just the sample code to check whether STAN can handle missing data. Therefore I used simple t-distribution instead of Heckman model.

Yes, the point of a Heckman model is to estimate the correlation between the error terms, but I also would like to estimate other parameters like coefficients of selection equations that will be used to determine whether the data is observed or not. What I eventually want to do is to develop Heckman model using bivariate t-distribution for the error terms instead of the bivariate normal distribution.

Below is the code to create actual simulated data that I’ll use to test Heckman model.

x = rnorm(1000, 0, 2) //independent variable 1
w = rnorm(1000, 0, 2) //independent variable 2
library(mvtnorm)
omega = matrix(c(1, 0.3, 0.3, 1), nrow = 2) //covariance matrix for error terms
error = rmvnorm(1000, c(0,0), omega)
eps = error[,1]
eta = error[,2]
y_star = x+eps //regression equation
y_star = y_star+0.5
u_star = 2+x+(1.5*w)+eta //selection equation
count = 0
u = rep(1, 1000)
for(i in 1:1000){
if(u_star[i]<=0){
u[i] = 0
count = count + 1
}
}

y = array(dim=1000)
for(i in 1:1000){
if(u[i] == 1){
y[i] = y_star[i]
}
}

This includes many NA’s.

Do I still not need y_mis in this case?

If I need to use y_mis, can you please give me some examples using it(including code for the data)?

(Also, is the link you gave describing function in R?)


#4

You do not need a parameter called y_mis to estimate a Heckman model. However, in the data block you need to pass something that indicates which observations were selected out so that you could implement the likelihood function that I linked to. If you want to see the same math in an R vignette go to


#5

I am also in the process of developing a Heckman model. It is running, but the coefficients of the selection equation are still off.
Maybe it helps you as a starting point for your model:


#6

Johannes_Auer,

Thank you for providing this. I checked the code on your post and I saw you changed all of the NA’s to 0’s. What I want to do is to handle this NA’s without changing those to 0’s, since changing to 0’s will affect the coefficients. This might be the reason why you are getting off values too.


#7

You either have to use a magic number for the missing values such as negative infinity, positive infinity, etc. or you have to pass in a separate variable that is 1 if the observation is observed and 0 if it is missing.


#8

I’m trying to pass in a separate variable, but having a hard time.
Is there an example code that I can refer from?


#9

In R,

R <- !is.na(y)

In Stan,

data {
  int<lower=0> N;
  int<lower=0,upper=1> R[N];
  ...
}

#10

How can I use that R[N] was my question.
Can you show me the basic regression code using that R[N]?


#11
for(j in 1:N) { // see http://www.stata.com/manuals14/rheckman.pdf#rheckmanMethodsandformulas
  if (R[j] == 1) {
    real err = (y[j] - x[j] * beta) / sigma;
    target += normal_lcdf((z[j] * gamma + err * rho) / sqrt(1 - square(rho)) | 0, 1) 
            - 0.5 * square(err) - log(sqrt(2 * pi()) * sigma);
  }
  else target += -normal_lcdf(-z[j] * gamma | 0, 1);
}

Heckman selection model