# Heckman sample selection model (working but biased selection coefficients)

#1

Dear Stan-users,

I am in the process of developing a simple heckman selection model (aka Tobit Type 2 model) using the Wooldridge data set “mroz” (Wooldridge, 2002, page 565, example 17.6). The data set contains the wages for married females and a set of covariates. However, we only observe wages for 428 women who choose to work. We do not observe the wages for the 325 women who are not working.

The heckman model works with a probit selection equation to capture whether or not a woman is working (whether or not wage is observed) and an outcome equation to analyze the influence of covariates on wage.

My code works with a latent series that links selection and outcome (wage) equation. Furthermore, it splits the estimation of the outcome equation for observed and unobserved wages (missing values).

For the outcome equation, I am able to estimate coefficients that are similar to the estimates from a frequentist model based on the mroz data set. Also, my code can recover the parameters in the outcome equation from simulated data quite well.

For the selection equation, all of my estimates are off.

I would greatly appreciate input on the following questions:

1. Do you agree with the general approach that I chose?
2. Where’s the mistake that causes the estimates in the selection (probit equation) to be biased?

Below is the code for the model and a R script that automatically loads and analyzes the data set. The R script also includes a simulation for a tobit type 2 selection model to show that my results are not an artifact of the mroz data set.

Any help would be greatly appreciated!

Wooldridge, J. (2002): Econonometric analysis of cross section and panel data, MIT Press.

######### heckman selection model with mroz dataset###########################
library(MASS)
library(rstan)
library(shinystan)
library(foreign)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
set.seed(20151204)

rm(list=ls())

###########simulated data###########
####number of observations
N <- 700

####correlated error terms
Sigma <- matrix(c(1, .5, .5, 1), nrow=2, ncol=2)
errors <- mvrnorm(n=N, mu=c(0, 0), Sigma=Sigma)

e1 <- errors[,1]
e2 <- errors[,2]

####variables for outcome eqn
x0a <- rnorm(N,0,1)
x0b <- rnorm(N,0,1)
x1 <- rnorm(N,0,1)
x2 <- rnorm(N,0,1)

####selection eqn
z_star <- 1*x0a+1*x0b+1*x1+1*x2+e2
####selection variable
z = as.numeric(pnorm(z_star)>.25)

####outcome variable and eqn
y <-  ifelse(z==1,1+1*x1+1*x2+e2,0)

####frequentist results
ols <- lm(y ~  x1 + x2, subset=z==1)
summary(ols)

library(sampleSelection)
heckit.ml <- heckit(selection = z ~ x0a + x0b +x1 + x2, outcome = y ~ x1 + x2, method = "ml")
summary(heckit.ml)

#### create intercepts
intercept <- rep(1, N)

#### independent variables that go into selection eqn
xa <- matrix(c(intercept, x0a,x0b, x1, x2), nrow=N, ncol=5)
nxa = ncol(xa)

#### independent variables that go into outcome eqn
xb <- matrix(c(intercept, x1, x2), nrow=N, ncol=3)
nxb = ncol(xb)

################# prep data for Stan######################
dat <- list(
lwage = y,
xa = xa,
xb = xb,
nxa = nxa,
nxb = nxb,
N = N,
P = 1,
observed = z
)

#######prepare data########
####dependent variable for outcome eqn
#lwage <- mydata\$lwage
#replace unobserved values with 0
#lwage[is.na(lwage)]<-0

####independent variables for outcome eqn
#educ <- mydata\$educ
#exper <- mydata\$exper
#expersq <- mydata\$expersq

####dependent variable for selection eqn
#observed <- mydata\$inlf

####additional independent variables for selection eqn
#nwifeinc <- mydata\$nwifeinc
#age <- mydata\$age
#kidslt6 <- mydata\$kidslt6
#kidsge6 <- mydata\$kidsge6

####total number of obs
#N <- length(lwage)

#### create intercepts
#intercept <- rep(1, N)

#### independent variables that go into selection eqn
#xa <- matrix(c(intercept, nwifeinc, educ, exper, expersq, age, kidslt6, kidsge6), nrow=N, ncol=8)
#nxa = ncol(xa)

#### independent variables that go into outcome eqn
#xb <- matrix(c(intercept, educ, exper, expersq), nrow=N, ncol=4)
#nxb = ncol(xb)

################# prep data for Stan######################
#dat <- list(
#  lwage = lwage,
#  xa = xa,
#  xb = xb,
#  nxa = nxa,
#  nxb = nxb,
#  N = N,
#  P = 1,
#  observed = observed
#)

mroz_heckman <- '
data {
int N;             // number of obs
int P;             // number of continuous outcomes
int observed[N];   // dependent variable for selection eqn
int<lower=1> nxa;  // number of variables in selection eqn
int<lower=1> nxb;  // number of variables in outcome eqn
vector[N] lwage;   // dependent variable for outcome eqn
vector[nxa] xa[N]; // variables for selection eqn
vector[nxb] xb[N]; // variables for outcome eqn
}

parameters{
vector[N] latent_series;  // latent series to link selection and outcome eqn
vector[N] wage_mis;       // variable capturing missing lwages

row_vector[nxa] alpha;    // coefficients for selection eqn
row_vector[nxb] beta;     // coefficients for outcome eqn

vector<lower=0>[P] scale;
corr_matrix[P+1] Omega;

}
transformed parameters {
matrix[N, P+1] Y2;
matrix[N, P+1] Y2_m;
vector[P+1] scales;

Y2 = append_col(lwage, latent_series);
Y2_m = append_col(wage_mis, latent_series);

scales[1]=scale[1];
scales[2] = 1.0; // set to one because of probit

}
model{
vector[P+1] mu[N];
scale ~ student_t(4,0,1);
Omega ~ lkj_corr(4);

for(n in 1:N) {
mu[n,1] = beta*xb[n];  // for outcome eqn
mu[n,2] = alpha*xa[n]; // for latent series that later goes into selection eqn
}

for(n in 1:N) {
if( observed[n] == 1 ){
Y2[n] ~ multi_normal(mu[n], quad_form_diag(Omega, scales));   // linear model for outcome eqn and latent series for observed lwages
}
else{
Y2_m[n] ~ multi_normal(mu[n], quad_form_diag(Omega, scales)); // linear model for outcome eqn and latent series for missing lwages
}

observed[n] ~ bernoulli(Phi_approx(latent_series[n])); // probit for selection
}
}
'

##########fit model#######
fit <-  stan(model_code = mroz_heckman, data=dat, iter = 1000 , warmup = 500 , init="random" , chains = 2 )

######inspect coefficients######
print(fit, pars = c("alpha", "beta"))

shiny <- launch_shinystan(fit)

########compare to frequentist results#######

ols <- lm(lwage ~ educ + exper + expersq, subset=observed==1)
summary(ols)

library(sampleSelection)
heckit.ml <- heckit(selection = observed ~ nwifeinc +educ + exper + expersq + age + kidslt6 + kidsge6, outcome = lwage ~ educ + exper + expersq, method = "ml")
summary(heckit.ml)

Missing data handling
#2

I had a look at this, but I’m not really familiar enough with the problem to help.

But! You’ll make your life easier vectorizing this model.

The big ticket is writing your draws for the multinormals in one big go:

vector[P + 1] Y2[N];

The computational problem with how you had things written is that the loop written like:

for(n in 1:N) {
if( observed[n] == 1 ){
Y2[n] ~ multi_normal(mu[n], quad_form_diag(Omega, scales));   // linear model for outcome eqn and latent series for observed lwages
} else{
Y2_m[n] ~ multi_normal(mu[n], quad_form_diag(Omega, scales)); // linear model for outcome eqn and latent series for missing lwages
}

requires a bunch of work to handle to covariance matrix N times. If we do things in one line, the overhead in handling the covariance matrix can be shared between all N draws. This requires a little rearrangement of variables throughout the model.

I made a slight modification to the R that calls the code:

idxs = order(z)
dat <- list(
lwage = y[idxs],
xa = xa[idxs,],
xb = xb[idxs,],
nxa = nxa,
nxb = nxb,
N = N,
P = 1,
L = N - sum(z),
observed = z[idxs]
)

With this we’re sorting the data so that all the observed == 0 folks come first (there’s an extra variable, L, which tells us how many observed == 0 folks there are). With this (and the goal of vectorizing the model), we can rewrite the actual Stan model like:

data {
int N;             // number of obs
int P;             // number of continuous outcomes
int L;             // number of observed waves
int observed[N];   // dependent variable for selection eqn
int<lower=1> nxa;  // number of variables in selection eqn
int<lower=1> nxb;  // number of variables in outcome eqn
vector[N] lwage;   // dependent variable for outcome eqn
vector[nxa] xa[N]; // variables for selection eqn
vector[nxb] xb[N]; // variables for outcome eqn
}

parameters{
vector[N] latent_series;  // latent series to link selection and outcome eqn
vector[L] wage_mis;       // variable capturing missing lwages

row_vector[nxa] alpha;    // coefficients for selection eqn
row_vector[nxb] beta;     // coefficients for outcome eqn

vector<lower=0>[P] scale;
corr_matrix[P+1] Omega;
}

transformed parameters {
vector[P + 1] Y2[N];
vector[P + 1] scales;

for(n in 1:N) {
if(observed[n] == 0)
Y2[n, 1] = wage_mis[n];
else
Y2[n, 1] = lwage[n];

Y2[n, 2] = latent_series[n];
}

scales[1] = scale[1];
scales[2] = 1.0; // set to one because of probit
}

model{
vector[P+1] mu[N];
scale ~ student_t(4, 0, 1);
Omega ~ lkj_corr(4);

for(n in 1:N) {
mu[n, 1] = beta * xb[n];  // for outcome eqn
mu[n, 2] = alpha * xa[n]; // for latent series that later goes into selection eqn
}

observed ~ bernoulli(Phi_approx(latent_series)); // probit for selection
}

This model takes a few minutes to run. This gave me similar results to the original model which took like an hour. I apologize for the rushed message and lack of actual insight on your problem, but hopefully this helps you figure out what’s happening!

#3

Thank you Ben! That makes life way easier.
I added your code to the original R script for the simulated data and the mroz data set:

######### heckman selection model with mroz dataset###########################
library(MASS)
library(rstan)
library(shinystan)
library(foreign)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
set.seed(20151204)

rm(list=ls())

###########simulated data###########
#number of observations
N <- 700

#correlated error terms
Sigma <- matrix(c(1, .5, .5, 1), nrow=2, ncol=2)
errors <- mvrnorm(n=N, mu=c(0, 0), Sigma=Sigma)

e1 <- errors[,1]
e2 <- errors[,2]

#variables for outcome eqn
x0a <- rnorm(N,0,1)
x0b <- rnorm(N,0,1)
x1 <- rnorm(N,0,1)
x2 <- rnorm(N,0,1)

#selection eqn
z_star <- 1*x0a+1*x0b+1*x1+1*x2+e2
#selection variable
z = as.numeric(pnorm(z_star)>.25)

#outcome variable and eqn
y <-  ifelse(z==1,1+1*x1+1*x2+e2,0)

#frequentist results
ols <- lm(y ~  x1 + x2, subset=z==1)
summary(ols)

library(sampleSelection)
heckit.ml <- heckit(selection = z ~ x0a + x0b +x1 + x2, outcome = y ~ x1 + x2, method = "ml")
summary(heckit.ml)

# create intercepts
intercept <- rep(1, N)

# independent variables that go into selection eqn
xa <- matrix(c(intercept, x0a,x0b, x1, x2), nrow=N, ncol=5)
nxa = ncol(xa)

# independent variables that go into outcome eqn
xb <- matrix(c(intercept, x1, x2), nrow=N, ncol=3)
nxb = ncol(xb)

idxs = order(z)
################# prep data for Stan######################
dat <- list(
lwage = y[idxs],
xa = xa[idxs,],
xb = xb[idxs,],
nxa = nxa,
nxb = nxb,
N = N,
P = 1,
L = N - sum(z),
observed = z[idxs]
)

#######prepare data########
#dependent variable for outcome eqn
#lwage <- mydata\$lwage
#replace unobserved values with 0
#lwage[is.na(lwage)]<-0

#independent variables for outcome eqn
#educ <- mydata\$educ
#exper <- mydata\$exper
#expersq <- mydata\$expersq

#dependent variable for selection eqn
#observed <- mydata\$inlf

#additional independent variables for selection eqn
#nwifeinc <- mydata\$nwifeinc
#age <- mydata\$age
#kidslt6 <- mydata\$kidslt6
#kidsge6 <- mydata\$kidsge6

#total number of obs
#N <- length(lwage)

# create intercepts
#intercept <- rep(1, N)

# independent variables that go into selection eqn
#xa <- matrix(c(intercept, nwifeinc, educ, exper, expersq, age, kidslt6, kidsge6), nrow=N, ncol=8)
#nxa = ncol(xa)

# independent variables that go into outcome eqn
#xb <- matrix(c(intercept, educ, exper, expersq), nrow=N, ncol=4)
#nxb = ncol(xb)

#idxs = order(observed)
################# prep data for Stan######################
#dat <- list(
#  lwage = lwage[idxs],
#  xa = xa[idxs,],
#  xb = xb[idxs,],
#  nxa = nxa,
#  nxb = nxb,
#  N = N,
#  P = 1,
#  L = N - sum(observed),
#  observed = observed[idxs]
#)

mroz_heckman <- '
data {
int N;             // number of obs
int P;             // number of continuous outcomes
int L;             // number of non-observed wages
int observed[N];   // dependent variable for selection eqn
int<lower=1> nxa;  // number of variables in selection eqn
int<lower=1> nxb;  // number of variables in outcome eqn
vector[N] lwage;   // dependent variable for outcome eqn
vector[nxa] xa[N]; // variables for selection eqn
vector[nxb] xb[N]; // variables for outcome eqn
}

parameters{
vector[N] latent_series;  // latent series to link selection and outcome eqn
vector[L] wage_mis;       // variable capturing missing lwages

row_vector[nxa] alpha;    // coefficients for selection eqn
row_vector[nxb] beta;     // coefficients for outcome eqn

vector<lower=0>[P] scale;
corr_matrix[P+1] Omega;
}

transformed parameters {
vector[P + 1] Y2[N];
vector[P + 1] scales;

for(n in 1:N) {
if(observed[n] == 0)
Y2[n, 1] = wage_mis[n];
else
Y2[n, 1] = lwage[n];

Y2[n, 2] = latent_series[n];
}

scales[1] = scale[1];
scales[2] = 1.0; // set to one because of probit
}

model{
vector[P+1] mu[N];
scale ~ student_t(4, 0, 1);
Omega ~ lkj_corr(4);

for(n in 1:N) {
mu[n, 1] = beta * xb[n];  // for outcome eqn
mu[n, 2] = alpha * xa[n]; // for latent series that later goes into selection eqn
}

observed ~ bernoulli(Phi_approx(latent_series)); // probit for selection
}
'

##########fit model#######
fit <-  stan(model_code = mroz_heckman, data=dat, iter = 1000 , warmup = 500 , init="random" , chains = 2 )

######inspect coefficients######
print(fit, pars = c("alpha", "beta"))

shiny <- launch_shinystan(fit)

########compare to frequentist results#######

ols <- lm(lwage ~ educ + exper + expersq, subset=observed==1)
summary(ols)

library(sampleSelection)
heckit.ml <- heckit(selection = observed ~ nwifeinc +educ + exper + expersq + age + kidslt6 + kidsge6, outcome = lwage ~ educ + exper + expersq, method = "ml")
summary(heckit.ml)

#4

I don’t know your model, but can give you two general pieces of advice. First, start small and build up your model a bit at a time so if something goes wrong, you know where. Second, test on data simulated from the priors in the model—that should fit and lets you debug your program independently of fitting it to real data.

I’d recommend not including the program as a string—you can’t use matrix transpose with a ' as a delimter and can’t use reject/print with a " and you lose line numbers.

You’ll get much better computational performance using Cholesky factors of correlation matrices rather than correlation matrices as parameters. There’s an lkj_corr_cholesky that works directly, the a multi_normal that takes a Cholesky, so it’s easier to scale than quad_form_diag. This should make things much faster.

In general, ifyou’re going to compute something like quad_form_diag(Omega, scales) in a loop, do it once on the outside as a local variable and save the result and reuse. It’s much faster. But in general, it’d be beter to separate the observed/unobserved thing.

What happens to the Y2_m[n] where observed[n] == 1? If they don’t get distributions, everythnig will blow up, and even if they do, I think the values are meaningless. You might want to check out the missing data chapter of the manual.

#5

I was also going to build a selection model and thought I start with Tobit-2.

I ran your model (with tiny changes) on the mroz data (3 chains, 2000 iterations each) and got this:

(vertical bars are ML estimates using sampleSelect::heckit)

The model took some time to fit (~20 minutes), but other than fairly low n_eff on the outcome stage (100-200 per parameter…) the fit seems fine.

Is this similar to the bias you were referring to? I’m personally more inclined to trust in the MCMC samples, especially considering the implications of a point estimate for something like expersq.

#6

sampleSelect::heckit does two-step estimation rather than ML. But yes, you should trust MCMC at least as much as any other estimation technique for selection models, particularly if you use informative priors.

#7

Well, I did both ML and 2-stage via the method = "ml" option (and "2stage" or whatever respectively). The dotted vertical line in the plot ist the 2-stage estimate. It’s essentially equal to the ML.

This was one of the tiny changes I made: Adding (stupidly wide) normal priors for the coefficients. I was afraid it would not fit using uniform priors.

I’ll try the simulated data next week and will then move on to a Tobit type 5 on other data obviously. BTW: @bgoodri, I really appreciate your posts from the 2015 mailing list regarding that model. :)

#8

EDIT: WHOOPS! Obviously these results are very off on the outcome stage now… I will update as soon as I’ve figuered out what’s wrong…

@Johannes_Auer, I think I found the problem. Please ignore my earlier post – there apparently was something wrong with the samples.

I rewrote the model and the key part to get better results was to reparameterize the multivariate normal in terms of the Cholesky decomposed correlation matrix (pages 344–345 in the 2.16 user manual).

Using 2,000 iterations on 3 chains with default options (alpha_adapt = 0.8 and max_treedepth = 10) I got very good results: No divergent transitions, all Rhats equal to 1.00, lowest n_eff slightly above 1,000 (majority of parameters actually at 3,000). And these are the estimates, again with MLE and 2-stage estimates from the sampleSelect package as vertical bars:

… makes much more sense!

Here is the model code. I’ll put this here so people can easily see how the reparameterization stuff looks. like I rewrote some things, so you maybe have to declare the data a bit differently in R.

Stan code:

int N;                       // number of obs
int N_tr1;                   // number treated
vector[N_tr1] y_tr1;         // observed y
int<lower=0,upper=1> tr[N];  // treatment assigned
int<lower=1> k;              // number covariates
int<lower=1> k_select;       // number covariates (selection)
vector[k] X[N];              // covariates
vector[k_select] X_select[N];// covariates (selection)
}
parameters{
row_vector[k] beta;
row_vector[k_select] beta_select;

real<lower=0> sigma;
cholesky_factor_corr[2] L_Omega;

vector[2] alpha;
}
transformed parameters{
vector[2] Y[N];
vector[2] mu[N];
vector<lower=0>[2] sdvector;

// Put in observed values of y (where tr == 1)
for(n1 in 1:N_tr1)
Y[n1, 2] = y_tr1[n1];

for(n in 1:N){
mu[n, 1] = beta_select*X_select[n];
mu[n, 2] = beta*X[n];
// Implies bivariate normal
Y[n] = mu[n] + sdvector .* (L_Omega * alpha);
}

// Fix first sd to 1, because of Probit
sdvector[1] = 1.0;
sdvector[2] = sigma;
}
model{
// Priors on coefficients (pretty vague...)
beta ~ normal(0, 5);
beta_select ~ normal(0, 5);

// Priors on sd's
L_Omega ~ lkj_corr_cholesky(4);
sigma ~ student_t(4, 0, 1);

// Probit for the selection stage (treatment)
tr ~ bernoulli(Phi_approx(Y[,1]));

// Aux variable
alpha ~ normal(0, 1);
}

#9

@Max_Mantei , thank you very much for your help!
I tried to declare my data so that it fits to your code, however I am not able to reproduce your results, yet.
Is it possible that you provide your data declaration? When I run your code I get the error message: “bernoulli_log: Probability parameter[1] is -1.#IND, but must be finite!”

#10

Note that @Max_Mantei wrote that the results are off for the outcome stage.

One possible reason I can see is that in the transformed parameters block some elements of Y are first filled with oberved data:

// Put in observed values of y (where tr == 1)
for(n1 in 1:N_tr1)
Y[n1, 2] = y_tr1[n1];

but then overwritten with modeles values :

for(n in 1:N){
mu[n, 1] = beta_select*X_select[n];
mu[n, 2] = beta*X[n];
// Implies bivariate normal
Y[n] = mu[n] + sdvector .* (L_Omega * alpha);
}

as Y[n] includes Y[n,2].

I also don’t see how this model conditons on outcomes for the treated.

#11

@Guido_Biele You are right. I was stupid about that part of the model. Had I taken @Bob_Carpenter’s suggestion about reading the Missing Data chapter in the manual, I’d have spotted that earlier.

Still, the key was using the non-centered parametrization. But only on the latent variable for the selection stage. Here is the model code that should give reasonable results. (I don’t claim the code is optimal.)

data{
int N;                       // number of obs
int N_tr1;                   // number treated
int N_tr0;                   // number not treated
vector[N_tr1] y_tr1;         // observed y
int<lower=0,upper=1> tr[N];  // treatment assigned
int<lower=1> k;              // number covariates
int<lower=1> k_select;       // number covariates (selection)
vector[k] X[N];              // covariates
vector[k_select] X_select[N];// covariates (selection)
}
parameters{
row_vector[k] beta;
row_vector[k_select] beta_select;

real<lower=0> sigma;
corr_matrix[2] Omega;

vector[N_tr0] y_tr0;

// aux variable for non-centered parameterization
real alpha;
}
transformed parameters{
real z[N];
cov_matrix[2] Sigma;
vector<lower=0>[2] sdvector;

// Fix first sd to 1, because of Probit
sdvector[1] = 1.0;
sdvector[2] = sigma;

// this is a bit stupid, since sqrt(Sigma[1, 1]) == 1.0;
// I left it in to make clear what's going on
for(n in 1:N)
z[n] = beta_select*X_select[n] + sqrt(Sigma[1, 1]) * alpha;

}
model{
// Priors on coefficients
beta ~ normal(0, 2.5);
beta_select ~ normal(0, 2.5);

// Priors on sd's
Omega ~ lkj_corr(4.0);
sigma ~ normal(0, 1);

for(n1 in 1:N_tr1)
y_tr1[n1] ~ normal(beta*X[n1], sqrt(Sigma[2, 2]));
for(n0 in 1:N_tr0)
y_tr0[n0] ~ normal(beta*X[N_tr0 + n0], sqrt(Sigma[2, 2]));

tr ~ bernoulli(Phi_approx(z));

// aux variable
alpha ~ normal(0, 1);
}

Note that the model is stupid about the order in which the data comes in. Since the mroz data is ordered already (first all treated, then all untreated) this is okay here. But you’d need to pay attention to that when the data looks differently (see @bbbales2’s post).

The estimates look much better now (using 2,000 iteration on 3 chains and specifying init = "0").

R script (data prep and plot) and Stan model:
mroz-tobit-2.R (2.6 KB)
mroz-tobit-2.stan (1.5 KB)

#12

@Max_Mantei , thank you for the codes.
Have you tried running your code on simulated data (sorted)?
When I run your code on simulated data, I am not able to estimate values similar to the “true” simulated value.

#13

You want to test posterior interval coverage, not similarity of posterior mean to true parameter value.

#14

This is my last try with this, and I think I’ve come full circle on this one. Sorry for the confusion I caused in the meanwhile.

First I thought I understood the simulated data that you posted, but I realized that I didn’t. You used the error term e2 twice: on the selection and the outcome stage. That’s a typo, right? Could also be that I don’t really understand whats going on.

Furthermore, the latent variable z_star should have Var(z_star) = 1.0 (therefore sd(z_star) = 1.0). This is also the (correct) restriction that is used in your original Stan model. However, the simulated data has sd(z_star) > 1. That could be a problem right? Maybe sampleSelect::heckit give “unbiased” estimates of the selection stage, because in their implementation they never explicitly make this restriction? (This is a wild and uninformed guess.)

Moreover, you’ve set the “cutoff” for the latent variable at pnorm(z_star)>.25, but for z_star ~ Normal(0, 1) it should be pnorm(z_star)>.5 to have the conventional if z_star > 0, then z = 1, else z = 0 interpretation of the latent variable z (conditional on some mu obviously). Is that correct?

Given all that, I don’t really see what the unbiased selection stage coefficients would be. Strictly speaking sampleSelect::heckit does not give unbiased estimates either, since it estimates a non-zero intercept. (?)

Anyways, please see the code below. The outcome stage coefficients are being estimated correctly – even better compared to heckit (not the intercept though).

The problem that remains with the model below however is, that Bayesian Fraction of Missing Information is low for all chains. I’ve to research a bit on that. Perhaps one of the smart people here can see what’s going on?

sim_data-tobit-2.R (3.0 KB)
tobit-2_centered.stan (1.1 KB)

#15

Not sure what the problem is with converging, but I’d suggest backing off the covariance estimation and seeing if you can get it to fit with independent errors.

Are there any warnings you’re getting like divergences?

What’s the n_eff and R-hat for the parameters in the posterior?

You can also simplify a bunch of the code. The transformed parameters block can just be

cholesky_factor_cov[2] L_Sigma
= diag_pre_multiply([1, sigma]', L_Omega);

You can eliminate loops with

Y[ , 1] = z1;

Your z1 and z0 parameters don’t seem to have any priors, which can cause problems with sampling.

You’re giving sigma a very thick tail, so you might want a tighter prior.

#16

Thanks, Bob!

I didn’t know [1, sigma]' was possible. Nice!

Eliminating the loops is a bit difficult, since Y[ , 1] is a real. Also, z0 and z1 would have to be concatenated/appended to something like one one long z vector. Same for y1 and y0.

However, speed is not really and issue. On my work laptop I get a little under 1,000 iterations per minute per chain (3). I had to set max_treedepth = 15, though. No warnings about divergent transitions. The BMFI low warning pops up for all chains. (I still have to read more on that.)

With 6,000 post-warmup iterations (2,000 per chain) there are four parameters with n_eff < 200 and 1.01 < R-hat < 1.02: L_Omega[2,2], L_Sigma[2,2], L_Omega[1,2], and L_Sigma[1,2].

• While the [2, 2] entries for L_Omega and L_Sigma show low n_eff, the corresponding sd sigma has R-hat = 1 and n_eff = 2868.
• The [1,2] entries are and should be 0 anyways, right? I don’t know what R-hat and n_eff means in this context…

The regression coefficients look “correct”, but also some have R-hats of 1.007 and almost all of them n_eff of 300-500.

Estimating independent errors is not an option, since accounting for the covariance between observed and unobserved outcomes, y1 and y0 respectively, and the latent variable that indicating a “tendency/propensity” for being observed z ~ Normal(bs*Xs, 1), which I split into z0 and z1, is exactly what this model tries accomplish. (Sorry for this very long, very German sentence.) This is also why z0 and z1 don’t have priors, if that makes sense.

Thanks again for giving so much useful tips and advice! I already learned a lot in this forum.

#17

That should be fine as long as your n_eff is high enough (that also takes R-hat and mixing into account in Stan’s version).

That still usually a win. If you have a matrix of data y, it’s better to do to_vector(y) ~ than to try to loop over the columns (which is better than looping over rows due to memory locality concerns).

That’s not surprising. Because of the non-linearities, some parameters or derived quantities mix better than others.

Perfectly grammatical and comprehensible English :-) The German that would throw us for a loop would be auxiliary verb second, verb final, and subject/object out of (English) order. Not to mention 50-character long noun compounds. (Sorry, recovering linguist.)