Hierarchical Heckman-style Selection Models

Continuing the discussion from Heckman selection model code + simulation, I am building a hierarchical version of these models for a project.

I’m running into an unusual situation: the HMC seems to work fine, but the calibration/interval coverage is bad. Usually, the HMC fails spectacularly and signals the problem. But not here.

In the simulations I did (code below), the 50% intervals cover the truth only about 30% of the time, and the 95% intervals only about 75% of the time. I've already done the usual re-centering. Allegedly there are no divergent transitions (DTs) at all, and the pairs and trace plots both look fine as far as I can tell, though the trace plots do show a bit more serial correlation than one would like, to my eye. This is very unusual to me! Usually it's DTs all over the place at the first sign of a problem. I've put all of this, plus a working example with the calibration test, below. Let me walk through the model first.
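For concreteness, the calibration test is the standard Monte Carlo one: draw truths from the prior, simulate data, fit, and count how often the central posterior intervals contain the truth. Here is a self-contained toy version in Python using an exact conjugate posterior as a stand-in for the actual model and fits (which are in the attached R/Stan files):

```python
import numpy as np

# Toy calibration check: draw the truth from the prior, simulate data,
# compute the (here: exact conjugate) posterior, and record whether the
# central 50%/95% intervals contain the truth.  A correctly implemented
# model + fit should hit the nominal coverage rates.
rng = np.random.default_rng(42)
S = 200  # number of simulation replications
covered_50 = np.empty(S, dtype=bool)
covered_95 = np.empty(S, dtype=bool)
for s in range(S):
    truth = rng.normal(0.0, 1.0)                # theta ~ N(0, 1) prior
    y = rng.normal(truth, 1.0)                  # one observation, sigma = 1
    post_mean, post_sd = y / 2.0, np.sqrt(0.5)  # exact conjugate posterior
    draws = rng.normal(post_mean, post_sd, size=4000)  # stand-in for MCMC
    lo50, hi50 = np.quantile(draws, [0.25, 0.75])
    lo95, hi95 = np.quantile(draws, [0.025, 0.975])
    covered_50[s] = lo50 <= truth <= hi50
    covered_95[s] = lo95 <= truth <= hi95
print(covered_50.mean(), covered_95.mean())  # should be near 0.50 and 0.95
```

In the hierarchical Heckman runs, these means come out around 0.30 and 0.75 instead.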

I start with the classic Heckman selection model from the other thread, which works great and is calibrated perfectly. Then I add a "lower level" to the model: instead of observing the data from the selected units (the y1s) directly, I observe an unbiased but noisy signal of them, y1_hat, with known standard error, y1_hat_se.
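In generative terms, the setup looks like this Python sketch (dimensions and parameter values are made up, and for illustration I'm assuming the same covariates drive both the outcome and selection equations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 3
rho, sd1 = 0.6, 1.2            # made-up Heckman parameters
b1 = rng.normal(size=D)        # outcome coefficients
b2 = rng.normal(size=D)        # selection coefficients
X = rng.normal(size=(N, D))    # shared covariates (an assumption here)

# Correlated errors: sd(e1) = sd1, sd(e2) = 1, corr(e1, e2) = rho
cov = np.array([[sd1**2, rho * sd1],
                [rho * sd1, 1.0]])
e = rng.multivariate_normal(np.zeros(2), cov, size=N)

y2 = (X @ b2 + e[:, 1] > 0).astype(int)  # selection indicator
sel = y2 == 1
y1_sel = X[sel] @ b1 + e[sel, 0]         # outcome, exists only if selected

# The new "lower level": we observe only an unbiased noisy signal of y1,
# with known standard error
y1_hat_se = rng.uniform(0.1, 0.5, size=sel.sum())
y1_hat = rng.normal(y1_sel, y1_hat_se)
```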

This is basically 8 schools stacked onto a Heckman model. And yet! The intervals don't cover properly. I wondered whether the lack of global concavity of the Heckman likelihood might be interacting with the hierarchy in some bad way, but I don't think it should, because the hierarchy only activates conditional on a unit being selected at all (whereas it's the unconditional inference that poses the non-concavity issue).

Here's the model. It's the same as the one that works from the previous thread, but with y1_hat and y1_hat_se added, plus some parameter transformation to do the re-centering trick:

data {
  int<lower=1> N;
  int<lower=1> N_neg;
  int<lower=1> N_pos;
  int<lower=1> D;
  vector[N_pos] y1_hat;
  vector[N_pos] y1_hat_se;
  int<lower=0,upper=1> y2[N];
  matrix[N_pos,D] X1;
  matrix[N,D] X2;
}
transformed data {
  int<lower=1,upper=N> n_pos[N_pos];
  int<lower=0,upper=N> n_neg[N_neg];
  {
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1:N) {
      if (y2[n] == 1) {
        n_pos[i] = n;
        i = i + 1;
      } else {
        n_neg[j] = n;
        j = j + 1;
      }
    }
  }
}
parameters {
  real<lower=-1,upper=1> rho;
  vector[D] b1;
  vector[D] b2;
  real<lower=0> sd1;
  vector[N_pos] eta_y1;
}
transformed parameters {
  vector[N_pos] y1;
  vector[N_pos] mu_y1;
  mu_y1 = X1 * b1;
  // non-centered ("re-centering") parameterization: y1 ~ normal(mu_y1, sd1)
  y1 = mu_y1 + sd1 * eta_y1;
}
model {
  vector[N] mu_y2;

  rho ~ normal(0,0.5);
  b1 ~ normal(0, 1);
  b2 ~ normal(0, 1);
  sd1 ~ normal(0, 1);

  
  mu_y2 = X2 * b2;
  
  for (n in 1:N_neg) {
    // non-selected units: P(y2 = 0) = Phi(-mu_y2)
    target += (log(Phi(-mu_y2[n_neg[n]])));
  }
  
  for (n in 1:N_pos) {
    // selected units: P(y2 = 1 | y1) via the conditional normal cdf
    target += log(Phi(sqrt(1 - rho^2)^(-1)*(mu_y2[n_pos[n]] 
                               + (rho / sd1)
                               * (y1[n] - mu_y1[n])))); 
  }
  eta_y1 ~ normal(0,1);
  y1_hat ~ normal(y1, y1_hat_se);
}
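For reference, the second loop is the selected units' contribution P(y2 = 1 | y1) = Phi((mu_y2 + (rho/sd1)(y1 - mu_y1)) / sqrt(1 - rho^2)). A quick Monte Carlo sanity check of that conditional-probability formula (Python, arbitrary made-up values):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
rho, sd1, mu_y2 = 0.6, 1.2, 0.4  # arbitrary values
n = 1_000_000

# Joint draws of the outcome error e1 (= y1 - mu_y1) and selection error e2
cov = np.array([[sd1**2, rho * sd1],
                [rho * sd1, 1.0]])
e = rng.multivariate_normal(np.zeros(2), cov, size=n)
e1, e2 = e[:, 0], e[:, 1]

# Analytic P(selected | e1): the term inside the second loop of the model
p_analytic = norm.cdf((mu_y2 + (rho / sd1) * e1) / np.sqrt(1.0 - rho**2))
selected = mu_y2 + e2 > 0

# Conditional on e1 the indicator has mean p_analytic, so empirical and
# analytic selection rates should agree on any subset defined through e1
mask = e1 > 0
print(selected.mean(), p_analytic.mean())
print(selected[mask].mean(), p_analytic[mask].mean())
```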

Here's the accuracy from one run for the parameters we care about (the Heckman parameters). Not very good. (It's a coincidence that in this run all three b1s are bad; across runs I haven't seen a pattern, except that sd1 does better than rho most of the time.)


Here are the traceplots and pairs plots:

Here's the Stan file and the simulation file. Just run it with S = 1 if you want to do a single run of the model.
fake_data_heck_hierarchical_montecarlo_calibration_check.R (3.2 KB)
heck_hierarchical.stan (1.3 KB)

Does anyone see any clues in the trace or pairs plots that I have missed? Or any other errors, or any thoughts about what might be going on here? I am worried that I am losing my mind and have set up the hierarchy wrong and just can't see it…

2 Likes

Is b1 being multiplied by anything? It looks to me like only b2 is being evaluated relative to the data (line 54 of your heck_hierarchical script). When I run it on my machine, all of those parameters sit at 0 because of the normal(0, 1) prior.

Edit: Ignore this, my mistake, I see it now!

–Brice

1 Like

Thanks for joining and digging in Brice!! Much appreciated.

Dan Simpson on Twitter has suggested this looks like an identification problem. I have now run the thing once for 10,000 iterations, and I see massive max_treedepth violations. I will check whether this is a general phenomenon and report back. Any other hunches are very welcome, no matter how vague!

1 Like

Haven't read the whole original thread, so sorry if I'm not understanding part of the model, but here you have

y1 = mu_y1 + sd1*eta_y1;

Then you have this in the target +=:

(y1[n] - mu_y1[n])

On that line, are you intending to do the equivalent of mu_y1 + sd1*eta_y1 - mu_y1?

3 Likes

Treating y1 directly as a parameter drastically speeds up the model and eliminates the treedepth issues, but I still have bias issues.

I replaced Phi with normal_lcdf because of Make `normal_lcdf(x | 0, 1)` equivalent to `Phi()` · Issue #2470 · stan-dev/math · GitHub.
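For anyone curious why this matters: log(Phi(x)) underflows for moderately negative x, while a dedicated log-cdf works in log space and stays finite. An illustration of the same effect using scipy:

```python
import numpy as np
from scipy.stats import norm

x = -40.0
# Composing log() with the cdf underflows: norm.cdf(-40) is exactly 0.0 in
# double precision, so its log is -inf ...
with np.errstate(divide="ignore"):
    naive = np.log(norm.cdf(x))
# ... while the dedicated log-cdf works in log space and stays finite,
# analogous to Stan's normal_lcdf vs log(Phi())
stable = norm.logcdf(x)
print(naive, stable)  # -inf vs roughly -804.6
```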

data {
  int<lower=1> N;
  int<lower=1> N_neg;
  int<lower=1> N_pos;
  int<lower=1> D;
  vector[N_pos] y1_hat;
  vector[N_pos] y1_hat_se;
  int<lower=0,upper=1> y2[N];
  matrix[N_pos, D] X1;
  matrix[N,D] X2;
}
transformed data {
  int<lower=1,upper=N> n_pos[N_pos];
  int<lower=0,upper=N> n_neg[N_neg];
  
  {
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1:N) {
      if (y2[n] == 1) {
        n_pos[i] = n;
        i = i + 1;
      } else {
        n_neg[j] = n;
        j = j + 1;
      }
    }
  }
}
parameters {
  real<lower=-1,upper=1> rho;
  vector[D] b1;
  vector[D] b2;
  real<lower=0> sd1;
  vector[N_pos] y1;
}
model {
  vector[N_pos] mu_y1 = X1 * b1;
  vector[N] mu_y2 = X2 * b2;

  rho ~ normal(0, 0.5);
  b1 ~ normal(0, 5);
  b2 ~ normal(0, 5);
  sd1 ~ exponential(1);
  
  for (n in 1:N_neg) {
    target += normal_lcdf(-mu_y2[n_neg[n]] | 0, 1);
  //  target += (log(Phi(-mu_y2[n_neg[n]])));
  }
  
  for (n in 1:N_pos) {
     target += normal_lcdf(
        sqrt(1 - rho^2)^(-1) * (mu_y2[n_pos[n]] + (rho / sd1) * (y1[n] - mu_y1[n])) | 0, 1); 
    // target += log(Phi(sqrt(1 - rho^2)^(-1)*(mu_y2[n_pos[n]] 
    //                            + (rho / sd1)
    //                            * (y1[n] - mu_y1[n])))); 
  }

  y1_hat ~ normal(y1, y1_hat_se);
  y1 ~ normal(mu_y1, sd1);
}
5 Likes

Ah, super interesting point @stevebronder. Initially I thought, "No, this is not important; it's just the reparameterisation making it look funny." But you're right to point to it, because I think that by introducing uncertainty about the actual value of y1, on top of the foundational uncertainty we have about the parameters of the Heckman model, we intuitively make it quite a bit harder to learn about mu_y1 (even when the two uncertainties are orthogonal, as they are here)… hmm… I'll have a think. Thank you!!

Thank you so much @spinkney for the speed-up!! This should make it so much easier for me to ferret out the problem, I really appreciate it !

4 Likes

It seems like we should be able to integrate out the latent y1 and just use y1_hat, because y1, y1_hat, and y2 are jointly multivariate normal. I don’t see why what you wrote above doesn’t work though.

Something like:

// w/ spinkney's changes to likelihood below
parameters {
  real<lower=-1,upper=1> rho;
  vector[D] b1;
  vector[D] b2;
  real<lower=0> sd1;
}
model {
  vector[N] mu_y2 = X2 * b2;
  vector[N_pos] mu_y1 = X1 * b1;
  vector[N_pos] y1_hat_obs_sd;
  for (n in 1:N_pos)
    y1_hat_obs_sd[n] = sqrt(y1_hat_se[n]^2 + sd1^2);

  b1 ~ normal(0, 1);
  b2 ~ normal(0, 1);
  sd1 ~ normal(0, 1);
  
  for (n in 1:N_neg) {
    target += normal_lcdf(-mu_y2[n_neg[n]] | 0, 1);
  }
  
  for (n in 1:N_pos) {
    target += normal_lcdf(
                          sqrt(1 - rho^2)^(-1) * (mu_y2[n_pos[n]]
                                                  + (rho / y1_hat_obs_sd[n])
                                                  * (y1_hat[n] - mu_y1[n])) | 0, 1);
  }
  y1_hat ~ normal(mu_y1, y1_hat_obs_sd);
}

I just checked, and it doesn't solve the inference issue, but it eliminates the latent variable, which might speed things up a bit.
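One possible reason it doesn't quite work: after integrating out y1, the conditional moments of the selection error given y1_hat involve corr(y1_hat, e2) = rho*sd1/s with s = sqrt(sd1^2 + se^2), so the selected-unit term would be Phi((mu_y2 + (rho*sd1/s^2)(y1_hat - mu_y1)) / sqrt(1 - rho^2*sd1^2/s^2)), i.e. a different slope and residual sd than the rho / y1_hat_obs_sd and sqrt(1 - rho^2) used above. A Monte Carlo sketch of that algebra (Python, made-up values; not the thread's code):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
rho, sd1, se, mu_y2 = 0.6, 1.2, 0.8, 0.3  # arbitrary made-up values
n = 1_000_000

# e1 = y1 - mu_y1 and selection error e2; then add measurement noise
cov = np.array([[sd1**2, rho * sd1],
                [rho * sd1, 1.0]])
e = rng.multivariate_normal(np.zeros(2), cov, size=n)
e1, e2 = e[:, 0], e[:, 1]
w = e1 + rng.normal(0.0, se, size=n)  # w = y1_hat - mu_y1
selected = mu_y2 + e2 > 0

# After marginalizing y1: cov(w, e2) = rho*sd1 and var(w) = s2, so
# E[e2 | w] = (rho*sd1/s2) * w and Var(e2 | w) = 1 - rho^2*sd1^2/s2
s2 = sd1**2 + se**2
p_marg = norm.cdf((mu_y2 + (rho * sd1 / s2) * w)
                  / np.sqrt(1.0 - rho**2 * sd1**2 / s2))

# Given w the selection indicator has mean p_marg, so empirical and analytic
# rates should agree on any subset defined through w
mask = w > 0.5
print(selected[mask].mean(), p_marg[mask].mean())
```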

1 Like

I think I might see what the issue is. In the simulation code line 38 reads:

  y1_sel_hat <- rnorm(y1_sel, y1_hat_se)

but I think the code should read

  y1_sel_hat <- rnorm(length(y1_sel), y1_sel, y1_hat_se)

(In R, the first argument of rnorm() is the sample size, so the original line drew length(y1_sel) values centred at y1_hat_se with unit sd, never using y1_sel itself.) That sort of seems to fix things:

but the calibration still looks weird for the rho parameter (the last column below)

> colMeans(calibration_50)
[1] 0.40 0.52 0.52 0.52 0.54 0.62 0.48 0.16
> colMeans(calibration_95)
[1] 0.94 0.94 0.92 1.00 0.94 1.00 0.96 0.60

and I get divergent transitions in some fits unless I bound y1_hat_se below by 0.01 in the simulation code. I wanted to avoid situations where some y1_hat_se entries are tiny and sd1 is also tiny. Bounding seems to eliminate the datasets that generate divergent transitions, but I can't be certain that this is the reason why.

I also changed the prior on rho to match the DGP code by adding a parameter rho_raw that is uniform on [0, 1].

fake_data_heck_hierarchical_montecarlo_calibration_check.R (3.7 KB)
heck_hierarchical.stan (1.3 KB)

EDIT: A bit more on the undercoverage for rho:

It looks like the intervals for rho undercover when abs(rho) is large or when sd1 is small. Here's a glm for the 95% interval coverage (I reduced the number of observations to 250 and increased the number of simulations to 100 in order to generate the data for the intervals):

> summary(glm(calibration_95[,8] ~ sds + abs(rhos), family = binomial))

Call:
glm(formula = calibration_95[, 8] ~ sds + abs(rhos), family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9148   0.1909   0.3862   0.5980   1.3332  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.8169     0.7909   3.562 0.000369 ***
sds           1.3378     0.6316   2.118 0.034164 *  
abs(rhos)    -4.3304     1.1458  -3.779 0.000157 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 102.791  on 99  degrees of freedom
Residual deviance:  79.004  on 97  degrees of freedom
AIC: 85.004

Number of Fisher Scoring iterations: 5

This maybe isn't so surprising, but it's fixable if you're willing to change your priors for rho and sd1. Putting a Beta(3, 3) prior on rho_raw and a Gamma(2, 2) prior on sd1 seems to nearly fix the undercoverage in the simulation study:

> colMeans(calibration_50)
[1] 0.50 0.56 0.47 0.49 0.40 0.53 0.55 0.36
> colMeans(calibration_95)
[1] 0.98 0.96 0.95 0.94 0.92 0.93 0.92 0.85
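For intuition on why the new prior helps: assuming the mapping rho = 2*rho_raw - 1 (inferred from the rho_raw parameterisation; check against the actual file), a Beta(3, 3) on rho_raw pulls rho away from the near-non-identified boundary |rho| → 1. A quick Python look at the implied prior:

```python
import numpy as np

rng = np.random.default_rng(3)
# rho_raw ~ Beta(3, 3), mapped to rho in (-1, 1); the 2*rho_raw - 1 mapping
# is an assumption about the parameterisation in the attached file
rho_raw = rng.beta(3.0, 3.0, size=100_000)
rho = 2.0 * rho_raw - 1.0

# Symmetric around 0, with very little mass near |rho| = 1, where the
# Heckman likelihood approaches non-identification
print(rho.mean(), np.mean(np.abs(rho) > 0.9))
```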

heck_hierarchical.stan (1.3 KB)
fake_data_heck_hierarchical_montecarlo_calibration_check_alt_priors.R (4.1 KB)

11 Likes

THIS IS AMAZING!!! Thank you so much!!

I have run a check on my machine too, and this looks pretty good overall even without the prior changes, but yes, the prior changes eliminate the DTs. And it makes sense that the new priors work better – the model approaches non-identification as abs(rho) → 1 or as any of the SEs go to 0, so we should never expect good performance there, and we would not be fitting the model if we thought those cases were salient. The Beta(3, 3) prior on rho_raw is therefore a much better choice!!

Thank you so much for catching the DGP bug and also this very thoughtful priors work!!

3 Likes

You’re welcome! I’m glad it was helpful!

1 Like