Estimate an indicator variable within a linear regression

I am estimating a linear model, where I want to estimate the threshold value for an indicator variable together with the normal selection coefficients.

I have the following model in Stan:

data {
  int N; // no observastions
  
  // Covariates
  real x1[N];
  real x2[N];
  
  // Count outcome
  real y[N];
}

parameters {
  real betas[3];
  real <lower=0> gamma;
  real <lower=0> sigma;
}

model {
  real mu[N];
  
  // Likelihood 
  for (i in 1:N) {
    mu[i] = betas[1] + betas[2] * x1[i] + betas[3] * (x2[i] < gamma);
  }
  y ~ normal(mu, sigma);
}

Where gamma is the the threshold to classify x2 into 0 and 1. I have two questions:

  1. Is this the right way to specify my problem in Stan? If I run the model on simulated data, I can recover the correct parameter values.
  2. Are there any tricks I could use to speed up computation?

No, because it is not differentiable with respect to betas[3] when x2[i] > gamma

Doesn’t matter until you write down a differentiable model so that Stan can sample from without warnings.

It is hard to think of data-generating processes where x2 has no influence on y until it reaches gamma and then is linear. But if you were going to estimate a model such as this, you will need an informative prior on gamma and presumably the other parameters as well. Then you can write your model as a two component mixture where the mixing weight is the probability that x2[i] < gamma under some CDF for x2. Something like

model {
  vector[N] eta = betas[1] * betas[2] * x1; // need to make x1 a vector first
  for (i in 1:N) {
    real theta = foo_cdf(gamma | ...);
    target += log_mix(theta, normal_lpdf(y[i] | eta[i] + betas[3] * x2[i], sigma),
                             normal_lpdf(y[i] | eta[i], sigma));
  }
  // priors
}

Many thanks for your suggestions, I am still having a hard time understanding and adjusting the line

real theta = foo_cdf(gamma | ...);

Are you ware of any papers/tutorials/examples I could refer to?

It is just whatever CDF you choose from the distribution of x2.

I have been trying to a lot and also extended to model to the following:

y = b[1] * x1 + b[2] * (x1 * (x2 < gamma))

I want to estimate b[1],b[2] and gamma. Is it also possible to reformulate this model within a mixture model?

I implemented the model as it is in Stan and got unbiased parameter estimates for some data and convergence problems (i.e., Rhat values > 1.2) for other data sets. As I understood your previous replies to my questions, that model is unidentifiable and I am having a hard time to reformulate it.

Any hints would be greatly appreciated.

You have to start with a generative model that can simulate the data and from there, it should be clearer how to formulate a mixture model to estimate the parameters given wild data. It is going to depend a lot on how gamma is generated.

My idea is to use gamma to differentiate between two hidden states that can be derived from x2. A simple model to generate data in R could look like this:

n <- 100
x1 <- runif(n)
x2 <- runif(n)
b <- c(1, -2)
gamma <- 0.5
sig <- 0.2

mu <- b[1] * x1 + b[2] * (x1 * (x2 < gamma))
y <- rnorm(100, mu, sig)

That is not a generative model or at least not the generative model that your Stan code estimates the parameters of. What distributions are the parameters drawn from?

It’s a problem for Stan that d/d.gamma (x2[i] < gamma) isn’t continuous. It effectively cuts information flow back to gamma because small changes in gamma don’t affect the posterior until you get near each x2[i], at which point you get a step function.

In the best scenario, Stan will produce a random walk for gamma.

As Ben points out, it’s probably better to think through a continous generative model.