Estimate an indicator variable within a linear regression


#1

I am estimating a linear model, where I want to estimate the threshold value for an indicator variable together with the normal selection coefficients.

I have the following model in Stan:

data {
  int N; // no observastions
  
  // Covariates
  real x1[N];
  real x2[N];
  
  // Count outcome
  real y[N];
}

parameters {
  real betas[3];
  real <lower=0> gamma;
  real <lower=0> sigma;
}

model {
  real mu[N];
  
  // Likelihood 
  for (i in 1:N) {
    mu[i] = betas[1] + betas[2] * x1[i] + betas[3] * (x2[i] < gamma);
  }
  y ~ normal(mu, sigma);
}

Where gamma is the the threshold to classify x2 into 0 and 1. I have two questions:

  1. Is this the right way to specify my problem in Stan? If I run the model on simulated data, I can recover the correct parameter values.
  2. Are there any tricks I could use to speed up computation?

#2

No, because it is not differentiable with respect to betas[3] when x2[i] > gamma

Doesn’t matter until you write down a differentiable model so that Stan can sample from without warnings.

It is hard to think of data-generating processes where x2 has no influence on y until it reaches gamma and then is linear. But if you were going to estimate a model such as this, you will need an informative prior on gamma and presumably the other parameters as well. Then you can write your model as a two component mixture where the mixing weight is the probability that x2[i] < gamma under some CDF for x2. Something like

model {
  vector[N] eta = betas[1] * betas[2] * x1; // need to make x1 a vector first
  for (i in 1:N) {
    real theta = foo_cdf(gamma | ...);
    target += log_mix(theta, normal_lpdf(y[i] | eta[i] + betas[3] * x2[i], sigma),
                             normal_lpdf(y[i] | eta[i], sigma));
  }
  // priors
}

#3

Many thanks for your suggestions, I am still having a hard time understanding and adjusting the line

real theta = foo_cdf(gamma | ...);

Are you ware of any papers/tutorials/examples I could refer to?


#4

It is just whatever CDF you choose from the distribution of x2.