Heckman selection model code + simulation

Hi all –

I’ve put together a little Heckman selection type model. It seems to run OK, but in simulation I’m finding the correlation parameter rho is biased towards 1. @bgoodri or @jonah (or others) might have some ideas on what to do.

The Stan code:

// saved as heckman_correction.stan
functions {
  vector inverse_mills(vector z) {
    vector[rows(z)] out;
    for(i in 1:rows(z)) {
       out[i] = exp(normal_lpdf(z[i] | 0, 1)) / (Phi(z[i])); // yes I know but I pass it positive z rather than negative...

data {
  int N; // number of observations
  int P; // number of covariates
  int P2; // number of instruments
  vector[N] Y; // income or log of income
  vector<lower = 0, upper = 1>[N] participation; // participation in the workforce
  matrix[N, P] X; // covariates
  matrix[N, P2] Z_raw; // instruments
  int estimate_model;
transformed data {
  matrix[N, P + P2] Z = append_col(X, Z_raw);
parameters {
  real alpha;
  real alpha_1;
  vector[P] beta;
  vector[P + P2] gamma;
  real<lower = 0> sigma_u;
  real<lower = -1, upper = 1> rho;
transformed parameters {
  vector[N] p = Phi(alpha_1 + Z * gamma);
model {
  // priors
  alpha ~ student_t(3, 0, 2);
  alpha_1 ~ student_t(3, 0, 2);
  beta ~ normal(0, 1);
  gamma ~ normal(0, 1);
  sigma_u ~ inv_gamma(1.5, 2);
  rho ~ normal(0, .5);
  // log likelihood for selection model
  target += participation' * log(p) + (1 - participation)' * log(1 - p);
  // log likelihood for outcome model
  for(n in 1:N) {
    if(participation[n] == 1.0) {
      Y[n] ~ normal(alpha + X[n] * beta + sigma_u * rho * inverse_mills(rep_vector(p[n],1) ), sigma_u);

And some R code to simulate and run it

library(rstan); library(tidyverse)

# Simulate some fake data

N <- 500
P <- 5
P2 <- 5
X = matrix(rnorm(N*P, 0, .5), N, P)
Z_raw = matrix(rnorm(N*P2, 0, .5), N, P2)

gamma <- rnorm(P + P2)
Z <- cbind(X, Z_raw)
alpha_1 <- rnorm(1)
alpha <- rnorm(1)
beta <- rnorm(P)
sigma_u <- abs(rnorm(1))

# Draw the structural shocks
structural_shocks <- MASS::mvrnorm(N, c(0, 0), matrix(c(1, sigma_u * .5, sigma_u * .5, sigma_u^2), 2, 2))

# Simulate reporting a salary (participation) and the wage
participation <- as.numeric((alpha_1 + as.matrix(Z) %*% gamma + structural_shocks[,1])> 0)
wage <- participation * (alpha + X %*% beta + structural_shocks[,2])

# Record the actual probabilities
actual_p <- pnorm(alpha_1 + as.matrix(Z) %*% gamma)

# Create the data list
data_list <- list(N = N, 
                  P = P, 
                  P2 = P2, 
                  X = X, Z_raw = Z_raw, 
                  Y = as.numeric(wage), 
                  participation = participation, 
                  estimate_model = 1)

# Compile the model
compiled_model <- stan_model("heckman_correction.stan")

# Sample
fit_sim <- sampling(compiled_model, data = data_list, chains = 4, iter = 1000, cores = 4)

confints <- broom::tidy(fit_sim, pars = c("alpha", "alpha_1", "beta", "gamma", "rho", "sigma_u"), conf.int = T)

confints_p <- broom::tidy(fit_sim, pars = "p", conf.int = T)

data_frame(parameter = c("alpha", "alpha_1", "beta", "gamma", "rho", "sigma_u"), 
           values = c(alpha, alpha_1, list(beta), list(gamma), 0.5, sigma_u)) %>% 
  unnest() %>% 
  mutate(estimates = confints$estimate,
         lower = confints$conf.low,
         upper = confints$conf.high) %>% 
  ggplot(aes(x = values, y = estimates, colour = parameter)) +
  geom_point() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  geom_abline(aes(intercept= 0, slope = 1))

data_frame(parameter = "p", 
           values = list(actual_p)) %>% 
  unnest() %>% 
  mutate(estimates = confints_p$estimate,
         lower = confints_p$conf.low,
         upper = confints_p$conf.high,
         misestimated = values < lower | values > upper) %>% 
  ggplot(aes(x = values, y = estimates, colour = misestimated, alpha = (misestimated+ 1)/2)) +
  geom_point() +
  scale_color_manual(values = c("black", "red")) +
  guides(alpha = F) +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  geom_abline(aes(intercept= 0, slope = 1))

Which gives plots like theseimage and image


In the example you plotted, the estimate was lower than the true value.

If you do a bunch of simulations from the prior, the simulated value should be uniformly distributed in rank in the posteriors (that’s just the Cook-Gelman-Rubin test, or “simulation based calibration” as I believe Andrew now prefers).

I don’t understand why positive z is important in the function, but then I don’t know this model.

Does there exist a source that discusses putting the inverse Mills ratio into the linear predictor in a MCMC context? I think I got a Heckman model to work a few years ago using the likelihood function that supplanted Heckman’s two-step approach in a frequentist context.

Here’s a Stan program that implements the Heckman likelihood and fits the program to fake data. It recovers the parameters well. I haven’t yet fitted it to repeated simulated datasets though.
EDIT: The code below has an error in the likelihood, which @RachaelMeager caught in her post below (link). Follow the link for the explanation, the correct code, and an extension!

The likelihood contribution for observed units below should be:

log(Phi(sqrt(1 - rho^2)^(-1)*(mu_y2[n_pos[n]] 
                               + rho / sd1
                               * (y1[n] - mu_y1[n]))));

but is erroneously coded as:

log(Phi(sqrt(1 - rho^2)^(-1)*(mu_y2[n_pos[n]] 
                               + rho / (sd1 * sqrt(1 - rho^2))
                               * (y1[n] - mu_y1[n]))));
data {
  int<lower=1> N;
  int<lower=1> N_neg;
  int<lower=1> N_pos;
  int<lower=1> D;
  vector[N_pos] y1;
  int<lower=0,upper=1> y2[N];
  matrix[N_pos,D] X1;
  matrix[N,D] X2;
transformed data {
  int<lower=1,upper=N> n_pos[N_pos];
  int<lower=0,upper=N> n_neg[N_neg];
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1:N) {
      if (y2[n] == 1) {
        n_pos[i] = n;
        i = i + 1;
      } else {
        n_neg[j] = n;
        j = j + 1;
parameters {
  real<lower=-1,upper=1> rho;
  vector[D] b1;
  vector[D] b2;
  real<lower=0> sd1;
model {
  vector[N] mu_y2;
  vector[N_pos] mu_y1;

  b1 ~ normal(0, 1);
  b2 ~ normal(0, 1);
  sd1 ~ normal(0, 1);

  mu_y2 = X2 * b2;
  mu_y1 = X1 * b1;

  for (n in 1:N_neg) {
    target += (log(Phi(-mu_y2[n_neg[n]])));
  for (n in 1:N_pos) {
    target += log(Phi(sqrt(1 - rho^2)^(-1)*(mu_y2[n_pos[n]] 
                               + rho / (sd1 * sqrt(1 - rho^2))
                               * (y1[n] - mu_y1[n]))));
  y1 ~ normal(mu_y1, sd1);
n <- 1000
d <- 5

sig1 <- abs(rnorm(1))
sig2 <- 1
rho <- 2*runif(1)-1
Sigma <- diag(c(sig1,sig2)) %*% matrix(c(1,rho,rho,1),2,2) %*% diag(c(sig1, sig2))

X <- matrix(rnorm(n * 3 * d),n,3 * d)
X1 <- X[,1:d]
X2 <- X[,(d+1):(2 * d)]
X3 <- X[,(2 * d+1):(3 * d)]

b1 <- rnorm(d)
b2 <- rnorm(d)

noise <- mvrnorm(n = n, mu = c(0, 0), Sigma = Sigma)

y2 <- (X2 %*% b2 + noise[,2]) >= 0
y1 <- (X1 %*% b1 + noise[,1])
y1_sel <- y1[y2]
X1_sel <- X1[y2,] 

stan_data <- list(y2 = y2[,1], y1 = y1_sel, X1 = X1_sel, 
                  X2 = X2, N = n, D = d, N_pos = sum(y2[,1]), 
                  N_neg = as.integer(length(y2[,1]) - sum(y2[,1])))
heck <- stan_model(file='stan_code/heck.stan')
fit <- sampling(heck, data = stan_data, iter = 2000,chains=4,cores = 4)
draws <- as.matrix(fit, pars = c('rho','b1','b2','sd1')) 
true <- c(rho, b1, b2, sig1)

bayesplot::mcmc_recover_intervals(draws, true)

fake_data_heck.R (967 Bytes)


Sounds like a good example for a simple case study. This is a model I’ve heard of and Jim asked about and Rob had already coded, so presumably it’s of general interest.

What is being assumed in these examples about the joint distribution of the errors in the outcome and selection equations? Bivariate normality? If this rather strong assumption is violated you can have severe bias (non-Gaussian copulas can be used to relax it). Another important assumption is that the X variables are independent of the joint distribution of the errors, conditional on the selection probability… This assumption can and should be tested.

Angrist, J.D. (1997). Conditional independence in sample selection models. Economics Letters, 54(2), 103-112. https://doi.org/10.1016/S0165-1765(97)00022-0

Huber, M., & Melly, B. (2015). A Test of the Conditional Independence Assumption in Sample Selection Models. Journal of Applied Econometrics, 30(7), 1144-1168. https://doi.org/10.1002/jae.2431

Pigini, C. (2014). Bivariate Non-Normality in the Sample Selection Model. Journal of Econometric Methods, 4(1), 123-144. doi:10.1515/jem-2013-0008

Smith, M.D. (2003). Modelling sample selection using Archimedean copulas. Econometrics, 6(1), 99–123.

Hasebe, T. (2013). Copula-based maximum-likelihood estimation of sample-selection models. The Stata Journal, 13(3), 547–573.

Giampiero, M., & Wyszynski, K. (2016). Semi-parametric copula sample selection models for count responses. Computational Statistics & Data Analysis, 104, 110-129.

McGovern, M.E., Bärnighausen, T., Marra, G., Radice, R. (2015). On the assumption of bivariate normality in selection models: a Copula approach applied to estimating HIV prevalence. Epidemiology, 26(2), 229-237.

Marra, G., Radice, R., Bärnighausen, T., Wood, S.N., & McGovern, M.E. (2017). A Simultaneous Equation Approach to Estimating HIV Prevalence With Nonignorable Missing Responses. Journal of the American Statistical Association, 112(518), 484-496.

Andrew’s spent a lot of time discussing error terms, including in today’s blog post.

Rather than the kind of tests @Cam is suggesting, we tend to recommend calibration tests of predictive uncertainty quantification (things like simulation-based calibration or more general posterior predicitve checks to probe specific aspects of models).

Pretty much any of our parametric approximations to something are going to be wrong in some sense. We don’t really think most of the things we write logistic regressions for truly involve linear functions of predictors on the log odds scale. Yet it can be surprisingly effective and well calibrated.

What you have to be more careful about than basic shape are things like multimodality and wide tails (if you’re trying to estimate wider posterior intervals, say).

I had a look at Andrew’s post. Yes, I would concede the point that normality of error distributions may not be something to lose too much sleep over. But I don’t know that I would agree with: “except for prediction, who cares about the error term, it’s the least important part of a regression model…” If the conditional moment restriction E(epsilon|X) = 0 is violated, no parametric approximation (or non-parametric model) will help. Many an econometrician’s career has been built around that problem.

Assuming we’re talking about the usual situation where epsilon = Y - X * beta in a regression with outcomes Y, predictor matrix X and coefficients beta, then isn’t that a result of the basic regression being biased? It doesn’t seem related to the error distribution not being normal.

P.S. I’m also not sure why Andrew says “except for prediction” since that seems like one of the main applications of Bayesian models.

I guess that’s my way of saying that I “care” about the (conditional) distribution of the error term given the predictor matrix. If the X_i correlate with the errors (here I mean the unobservable error representing the omitted causal influences on Y, not the fitted residuals which are uncorrelated with X by construction), then you will have biased estimates and tests. Possible reasons why corr(X,e) ≠ 0 include relevant omitted predictors and measurement error in X. True, it doesn’t have to do with the “normality” issue specifically – I just think that generally characterizing the error as the “least important part of a regression model” is perhaps a little bit cavalier, given that assumptions about its relationship with other variables in the model are essential for causal inference.

I think Andrew may have hurt his case by being too flippant, here. He can’t possibly mean don’t look at errors. He spends lots of time plotting errors, etc. I think what he means is that modeling the distribution of error terms isn’t as important as fixing the expectations (e.g., linear predictors).

When you find patterns in the errors, such as correlated errors, I thought the usual route to fixing them was to improve the expectation part of the model, for example by adding predictors as you suggest, or by adding priors to control overfitting, or by moving to something like a time-series structure.

The only case where I’m familiar with people changing error models is in “robust regression” where they use something like a Student-t error distribution. Or by adding correlations in seemingly-unrelated regressions (SUR), but that’s a multivariate case.

I’m not claiming to be an expert in this. I can’t make heads or tails of the language in most econometrics papers.

Agreed, and I’m sure he is also familiar with the issue I raise (“endogeneity” or corr(X,e) ≠ 0).

Systematic patterns among the errors themselves can be addressed using the techniques you mention. But if you have the endogeneity problem specifically, then you need to bring in “instrumental” variables, or add copula functions for estimating corr(X,e) as part of the model. I also agree that most of the econometrics papers on these and other issues are decidedly unkind to the reader.

This might be up your alley:

Or this heterogeneous treatment effects IV model with a Gaussian process prior on the effects.

1 Like

Indeed it is, thanks Jim! :-) What would be your take on the situation where the unobservables and the observables (X, and possibly Z) in the equations are also non-independent? This seems a likely scenario in real data. Could we potentially include a (Gaussian) copula in the model to capture/soak up the resulting regressor-error correlations? I have never seen that tried in the selection/treatment model context specifically, but in more general (endogenous) regression contexts it has been gaining some traction.

Park, S., & Gupta, S. (2012). Handling Endogenous Regressors by Joint Estimation Using Copulas. Marketing Science, 31(4), 567-586.

Tran, K.C., & Tsionas, E.G. (2015). Endogeneity in Stochastic Frontier Models: Copula Approach without External Instruments. Economics Letters, 133, 85-88.

Blauw, S.L, & Franses, Ph.H.B.F. (2016). Off the Hook: Measuring the Impact of Mobile Telephone Use on Economic Development of Households in Uganda using Copulas. Journal of Development Studies, 52(3), 315–330.

Christopoulos, D. McAdam, P., & Tzavalis, E. (March 1, 2018). Dealing with Endogeneity in Threshold Models Using Copulas: An Illustration to the Foreign Trade Multiplier. ECB Working Paper No. 2136, ISBN: 978-92-899-3241-7. Available at SSRN: https://ssrn.com/abstract=3141725.

I’m not sure I understand exactly what you mean by “unobservables and the observables (X, and possibly Z) in the equations are also non-independent”. When \mbox{cov}(X, e) \neq 0 then you instrument for X. When \mbox{cov}(Z, e) \neq 0 then Z isn’t a valid instrument.

I’ve not played with the Park-Gupta stuff, but my prior is that like other instrument-free methods it’d be fairly straightforward to break it.

Sorry, yes that is what I meant – your preferred solution in the case of cov(X,e) ≠ 0 and a possibly invalid Z. Valid external instruments are a rare animal. Swamy claims that it is simply impossible to find one (I’m not sure if I buy that extreme view… I would say “extremely difficult”):

Swamy, P.A.V.B., Tavlas, G.S., & Hall, S.G. (2015). On the Interpretation of Instrumental Variables in the Presence of Specification Errors. Econometrics, 3, 55–64.

Raunig, B. (2017). On The Interpretation of Instrumental Variables in the Presence of Specification Errors: A Causal Comment. Econometrics, 5(3), 31. doi:10.3390/econometrics5030031

Swamy, P.A.V.B., Hall, S.G., Tavlas, G.S., & von zur Muehlen, P. (2017). On the Interpretation of Instrumental Variables in the Presence of Specification Errors: A Reply. Econometrics 2017, 5(3), 32.
Econometrics 2015, 3(1), 55-64; doi:10.3390/econometrics5030032

The Park-Gupta simulations show that the Gaussian copula is fairly robust to other forms. I haven’t seen anyone take a Bayesian angle on it. Heteroskedasticity-based IVs are also a neat idea (again, a use for errors! :-) but they make some strong parametric assumptions.

Klein, R., & Vella, F. (2009). Estimating the Return to Endogenous Schooling Decisions via Conditional Second Moments. Journal of Human Resources, 44(4), 1047-1065. http://jhr.uwpress.org/cgi/reprint/44/4/1047

Klein, R., & Vella, F. (2010). Estimating a class of triangular simultaneous equations models without exclusion restrictions. Journal of Econometrics, 154(2), 154-164. https://doi.org/10.1016/j.jeconom.2009.05.005

Milunovich, G., & Yang, M. (2018). Simultaneous Equation Systems With Heteroscedasticity: Identification, Estimation, and Stock Price Elasticities. Journal of Business & Economic Statistics, 36(2), 288-308. https://doi.org/10.1080/07350015.2016.1149072

Lewbel, A. (1997). Constructing Instruments for Regressions With Measurement Error When No Additional Data are Available, With an Application to Patents and R&D. Econometrica, 65(5), 1201-1213.

Lewbel, A. (2012). Using Heteroscedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models. Journal of Business & Economic Statistics, 30(1), 67-80.

Lewbel, A. (2018). Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case. Economics Letters, 165, 10-12. https://doi.org/10.1016/j.econlet.2018.01.003

Fernihough, A. (February 20, 2015). Package ivlewbel: Uses heteroscedasticity to estimate mismeasured and endogenous regressor models. R Package Version 1.1. https://cran.r-project.org/web/packages/ivlewbel/index.html

Gui, R., Meierer, M., & Algesheimer, R. (Dec. 1, 2015). REndo: Fitting Linear Models with Endogenous Regressors when No External Instruments are Available. R Package Version 1.0. https://cran.r-project.org/web/packages/REndo/index.html

Chau, T.W. (2017). Identification through Heteroscedasticity: What If We Have the Wrong Form? Economics Bulletin, 37(4), 2413-2421. http://www.accessecon.com/Pubs/EB/2017/Volume37/EB-17-V37-I4-P215.pdf

Some non-parametric approaches swap the usual exclusion restriction for a flimsier additivity assumption:

Ben-Moshe, D., D’Haultfœuille, X., & Lewbel, A. (2017). Identification of additive and polynomial models of mismeasured regressors without instruments. Journal of Econometrics, 200(2), 207-222. https://doi.org/10.1016/j.jeconom.2017.06.006

Anyway, econometrics = identification soup.

Sorry for bumping this topic, but this type of model has been discussed a few times over the past years and I liked this implementation by @James_Savage. As far as I can see this initial issue was not really addressed here (“bias” in rho), so if someone digs this up in the future this might be helpful:

I think there is a small mistake in the log likelihood of the outcome model: Since this is a conditional normal, the conditional variance should be \text{Var}(Y|\text{participation})=(1-\rho^2)\sigma_u^2, right? So in Stan, one would have to put something like \sqrt{\sigma_u^2-\rho^2 \sigma_u^2} into the (conditional) outcome likelihood.

So instead of this line:
Y[n] ~ normal(alpha + X[n] * beta + sigma_u * rho * inverse_mills(rep_vector(p[n],1) ), sigma_u);

…it should be something like this:
Y[n] ~ normal(alpha + X[n] * beta + sigma_u * rho * inverse_mills(rep_vector(p[n],1) ), sqrt(square(sigma_u) - square(rho)*square(sigma_u)));
(Of course, one can do this more efficiently…)

Or am I completely off here?

1 Like

Hey @Max_Mantei – have you tried replicating the simulation above with your change? (Sorry, don’t have time to dive back into this right now).

Hey, I didn’t systematically test this. However, it seems that the second parametrization generally recovers lower values of rho. This comes at the cost that it doesn’t sample that well (also, I think that \alpha_1 +Z\gamma should go into the inverse Mills ration, and not p, right? This also leads to lower rho). But, sample selection and IV models in general tend to be a bit brittle, so maybe that’s that… In the first parametrization, the \sigma_u really is the conditional sd, which is lower than the marginal variance sd, and thus leads to higher values of \rho.

I guess the main problem is, that \rho and \sigma_u are not so easily disentangled, since it’s a conditional normal. Practically, I found that something like Y[n] ~ normal(alpha + X[n] * beta + theta * inverse_mills(rep_vector(z[n],1) ), sigma); worked well with putting theta and sigma into the parameters block.

Max, rtrangucci’s model above addresses the bias problem. In simulations I have run, it recovers all the parameters very well with great coverage properties and I didn’t notice bias in rho. If you spot any issues with rtrangucci’s model though please update me/us as I am about to use it for a paper so would like to know!