Implementing two-step estimation

I am trying to implement a two-step estimator to account for my DV being discretized. In the first step, I need to estimate the mean av_y and st. dev. \sigma_y of the underlying normal distribution using the midpoints of the bins y_{mp} for N observations. I do that with something like this:

model {
  ymp ~ normal( avy, sigmay); 
  avy ~ normal(muavy, sigmaavy); 
  sigmay ~ cauchy(musigmay, sigmasigmay);  
  // then some hierarchical priors for muavy and sigmaavy
}

Then I need to combine these estimates with the bin edges X_n and X_n^{+1} to create a new dependent variable, the expected value of y conditional on its bin:
y_{EC} = av_y + \sigma_y \frac{\phi( (X_n - av_y) / \sigma_y ) - \phi( (X_n^{+1} - av_y) / \sigma_y ) } { \Phi( (X_n^{+1} - av_y) / \sigma_y ) - \Phi( (X_n - av_y) / \sigma_y ) }
and estimate the second-step parameters (say \alpha and \sigma_\zeta) in a way that accounts for the uncertainty in the estimates of av_y and \sigma_y. Let’s say for simplicity that the second-stage model is y_{EC} \sim N(\alpha, \sigma_\zeta). The last thing I tried is

transformed parameters{
  vector[N] yEC;
  for (n in 1:N) {
     yEC[n] = avy + sigmay *
                ( exp(normal_lpdf((X[y[n]  ] - avy)/sigmay | 0, 1)) -
                  exp(normal_lpdf((X[y[n]+1] - avy)/sigmay | 0, 1)) ) / 
                ( exp(normal_lcdf((X[y[n]+1] - avy)/sigmay | 0, 1)) -
                  exp(normal_lcdf((X[y[n]  ] - avy)/sigmay | 0, 1)) );
  }
}
model {
    target += normal_lpdf( avy | muavy, sigmaavy);
    target += normal_lpdf( sigmay | musigmay, sigmasigmay);
    target += normal_lpdf( yEC  | alpha , sigmazeta);        
      
  // priors for parameters alpha and sigmazeta
}

The true model is quite a bit more complicated, but I believe this code gets to the core of my question.

My problem is that the av_y estimated in the second stage has a mean far lower than muavy, and this problem does not happen if I use the original data to estimate the second step directly (pretending the bin midpoints are the actual data). I originally used a single program to estimate both steps, but got the bias. I then ran the first step on its own, processed the chains to compute muavy, sigmaavy, musigmay, and sigmasigmay, and passed these as data to the second step. The bias survives.
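
For concreteness, the declarations in the second-step program look roughly like the sketch below (K is a placeholder name I am using for the number of bins; the other names match the snippets above):

data {
  int<lower=1> N;
  int<lower=2> K;                    // number of bins (placeholder)
  vector[K+1] X;                     // bin edges, increasing
  array[N] int<lower=1, upper=K> y;  // bin index for each observation
  real muavy;                        // first-step posterior mean of avy
  real<lower=0> sigmaavy;            // first-step posterior sd of avy
  real musigmay;                     // first-step posterior mean of sigmay
  real<lower=0> sigmasigmay;         // first-step posterior sd of sigmay
}
parameters {
  real avy;
  real<lower=0> sigmay;
  real alpha;
  real<lower=0> sigmazeta;
}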

I suspect that the likelihood of y_{EC} \sim N(\alpha, \sigma_\zeta) is pulling the mean of av_y towards zero. That is, I cannot fully separate the two models.

Thanks in advance.

Apologies if I'm misreading, but this sounds like a measurement error model? The docs below may be useful if so.


@stevebronder, thanks for the prompt reply and for the recommendation. This indeed can be regarded as a measurement error model. I previously tried to use data augmentation to estimate the true underlying values of my DV, but I could not get significant estimates. I suspect that the bins are narrow enough that the distribution within each bin is quite flat and therefore the uncertainty around the imputed values is just too high. Basically, I do not have enough information and cannot use the distribution shape to identify the underlying values.
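
Roughly, what I mean by data augmentation is treating the position of each observation inside its bin as a parameter, along the lines of this sketch (u is a placeholder name; X and y are the bin edges and bin labels as in my first post):

parameters {
  vector<lower=0, upper=1>[N] u;   // latent position of each observation within its bin
  real avy;
  real<lower=0> sigmay;
}
transformed parameters {
  vector[N] y_latent;              // imputed underlying value of the DV
  for (n in 1:N)
    y_latent[n] = X[y[n]] + u[n] * (X[y[n]+1] - X[y[n]]);
}
model {
  // bin edges are data, so the Jacobian of this rescaling is constant and can be dropped
  y_latent ~ normal(avy, sigmay);
  // ... priors and the rest of the model
}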

So, instead, I thought of estimating y_{EC}, the expected value of y conditional on the bin, and then using that as a data point (accounting for the spread of its distribution). In the frequentist world, that generates consistent estimates. See:

Stewart, Mark B. “On least squares estimation when the dependent variable is grouped.” The Review of Economic Studies 50, no. 4 (1983): 737-753.

I realize I am mixing consistency with Bayesian methods, but I have not found a better option yet.

Oh! If I understand this right, then you might save yourself some time by using Stan’s ordered distributions. If you know the cutpoints that define the bins, you can submit them as data along with the bin label for each observation, leaving you to do inference on the eta parameter, which I think you can then hack to achieve inference on both the location and scale of a latent normal. Something like:

data{
    int n_bins_plus_2 ; //plus2 bc first and last bins go to -Inf/+Inf, respectively 
    vector[n_bins_plus_2-1] cutpoints ; 
    int n_obs ;
    array[n_obs] int<lower=1,upper= n_bins_plus_2 > bin_for_obs ; // remember that no observation should have the label “1”
}
parameters{
    real mu ;
    real<lower=0> sigma ;
}
model{
    bin_for_obs ~ ordered_logistic( rep_vector(mu/sigma, n_obs) , cutpoints ) ;
}

And then, should you have some other observed data X that you want to include as a predictor to assess the relationship between X and your binned outcome, you can do:

data{
    int n_bins_plus_2 ;
    vector[n_bins_plus_2-1] cutpoints ;
    int n_obs ;
    array[n_obs] int<lower=1,upper= n_bins_plus_2 > bin_for_obs ;
    vector[n_obs] X ;
}
parameters{
    vector[2] mu ; // intercept and effect-of-X on mu
    vector[2] log_sigma ; // intercept and effect-of-X on log_sigma
}
model{
    //priors would go here
    bin_for_obs ~ ordered_logistic( 
        (mu[1]+mu[2]*X) 
        ./  
        exp(log_sigma[1]+log_sigma[2]*X) 
        , cutpoints 
    ) ;
}

Similarly, if you have the above plus some other observed data variable Y and you want to let the binned outcome and X jointly inform on Y, you can use a structural-equation-style latent variable model:

data{
    int n_bins_plus_2 ;
    vector[n_bins_plus_2-1] cutpoints ;
    int n_obs ;
    array[n_obs] int<lower=1,upper= n_bins_plus_2 > bin_for_obs ;
    vector[n_obs] X ;
    vector[n_obs] Y ;
}
parameters{
    vector[3] mu ; // locations for X, the binned outcome, and Y
    vector<lower=0>[3] sigma ; // scales for X, the binned outcome, and Y
    vector[n_obs] Z ; // latent factor shared by X, the binned outcome, and Y
    real<lower=0> Z_X ; // must be positive for identifiability
    real<lower=-1, upper=1> Z_bin; 
    real Z_Y;
    vector[n_obs] bin_unique ; // latent component unique to the binned outcome
}
model{
    // other priors would go here
    bin_unique ~ std_normal() ;  // must be std_normal for identifiability
    Z ~ std_normal() ;  // must be std_normal for identifiability
    X ~ normal( mu[1] + Z*Z_X , sigma[1] ) ;
    bin_for_obs ~ ordered_logistic( 
        mu[2] + (
          Z*Z_bin + bin_unique*(1-Z_bin)
        )*sigma[2]
        , cutpoints 
    ) ;
    Y ~ normal( mu[3] + Z*Z_Y , sigma[3] ) ;
}

What do you think @stevebronder ?


This is really helpful for a number of applications!

If I can ask one noob question: should the ‘model’ block be the ‘parameters’ block and vice versa?

Ha, good catch, mixed them up somehow 🤦‍♂️

Edited my post to fix it and avoid confusing others.


I am afraid I sacrificed important parts of my problem description for the sake of conciseness and clarity. The model I tried to describe is the first equation of a system, which in its simplest, original form looks like
y_{it}=\pi_{it}+\zeta_{it} + \epsilon_{it}
x_{it} = \phi \zeta_{it} + \psi \epsilon_{it} + \xi_{it},
where \pi and \zeta are latent variables, i indexes observation units and t indexes time.

This model is identified when y is a continuous variable, but in my data y is discretized. I actually did try to code the first equation as an interval regression, which I believe is what you are proposing. I had to give that up because I need to recover some estimate of \epsilon to estimate the second equation, and I was not able to do that. I suspect discretization removes the information associated with \epsilon. Please correct me if I am wrong and it is possible to extend your code to draw an estimate of \epsilon.

In the meantime, I will use the interval regression you propose to estimate the mean and spread of the distribution of y.
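
By interval regression I mean something along the lines of the sketch below, written with a normal latent variable rather than the ordered logistic (K is a placeholder for the number of bins and the priors are placeholders; the other names match my earlier snippets):

data {
  int<lower=1> N;
  int<lower=2> K;                    // number of bins (placeholder)
  vector[K+1] X;                     // bin edges, increasing
  array[N] int<lower=1, upper=K> y;  // bin index per observation
}
parameters {
  real avy;
  real<lower=0> sigmay;
}
model {
  avy ~ normal(0, 10);               // placeholder priors
  sigmay ~ normal(0, 10);
  for (n in 1:N)
    target += log_diff_exp(normal_lcdf(X[y[n]+1] | avy, sigmay),
                           normal_lcdf(X[y[n]]   | avy, sigmay));
}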

Have you confirmed this with simulations? I feel like even in the case of no binning of y, \zeta and \epsilon are under-identified thanks to their additive role in both equations. The fact that they have equal weight for y and possibly-unequal weights in x helps, but under circumstances where x also supports equal weights, they’ll certainly be under-identified.
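
For instance, taking a single cross-section and assuming (these are my assumptions, not necessarily yours) that \pi, \zeta, \epsilon and \xi are mutually independent and mean-zero, the observable second moments are only
Var(y) = \sigma_\pi^2 + \sigma_\zeta^2 + \sigma_\epsilon^2
Var(x) = \phi^2 \sigma_\zeta^2 + \psi^2 \sigma_\epsilon^2 + \sigma_\xi^2
Cov(y, x) = \phi \sigma_\zeta^2 + \psi \sigma_\epsilon^2,
which is three equations in six unknowns, so the identification has to be coming from somewhere else (the panel structure, sign/scale restrictions, or informative priors).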

I did not do the simulation, but I estimated the model using the midpoints of the bins for y. The chains converge and are stable if I impose a sign restriction on \phi (theory implies it should be positive). The estimates are also consistent with estimates published in a couple of papers for the same model with different data and a continuous y. One of those papers demonstrated identification theoretically for the case in which y is continuous.