# Estimate an indicator variable within a linear regression

I am estimating a linear model, where I want to estimate the threshold value for an indicator variable together with the normal selection coefficients.

I have the following model in Stan:

``````data {
int N; // no observastions

// Covariates
real x1[N];
real x2[N];

// Count outcome
real y[N];
}

parameters {
real betas;
real <lower=0> gamma;
real <lower=0> sigma;
}

model {
real mu[N];

// Likelihood
for (i in 1:N) {
mu[i] = betas + betas * x1[i] + betas * (x2[i] < gamma);
}
y ~ normal(mu, sigma);
}
``````

Where `gamma` is the the threshold to classify `x2` into 0 and 1. I have two questions:

1. Is this the right way to specify my problem in Stan? If I run the model on simulated data, I can recover the correct parameter values.
2. Are there any tricks I could use to speed up computation?

No, because it is not differentiable with respect to `betas` when `x2[i] > gamma`

Doesn’t matter until you write down a differentiable model so that Stan can sample from without warnings.

It is hard to think of data-generating processes where `x2` has no influence on `y` until it reaches `gamma` and then is linear. But if you were going to estimate a model such as this, you will need an informative prior on `gamma` and presumably the other parameters as well. Then you can write your model as a two component mixture where the mixing weight is the probability that `x2[i] < gamma` under some CDF for `x2`. Something like

``````model {
vector[N] eta = betas * betas * x1; // need to make x1 a vector first
for (i in 1:N) {
real theta = foo_cdf(gamma | ...);
target += log_mix(theta, normal_lpdf(y[i] | eta[i] + betas * x2[i], sigma),
normal_lpdf(y[i] | eta[i], sigma));
}
// priors
}
``````

Many thanks for your suggestions, I am still having a hard time understanding and adjusting the line

``````real theta = foo_cdf(gamma | ...);
``````

Are you ware of any papers/tutorials/examples I could refer to?

It is just whatever CDF you choose from the distribution of `x2`.

I have been trying to a lot and also extended to model to the following:

`y = b * x1 + b * (x1 * (x2 < gamma))`

I want to estimate `b`,`b` and `gamma`. Is it also possible to reformulate this model within a mixture model?

I implemented the model as it is in Stan and got unbiased parameter estimates for some data and convergence problems (i.e., Rhat values > 1.2) for other data sets. As I understood your previous replies to my questions, that model is unidentifiable and I am having a hard time to reformulate it.

Any hints would be greatly appreciated.

You have to start with a generative model that can simulate the data and from there, it should be clearer how to formulate a mixture model to estimate the parameters given wild data. It is going to depend a lot on how `gamma` is generated.

My idea is to use `gamma` to differentiate between two hidden states that can be derived from `x2`. A simple model to generate data in R could look like this:

``````n <- 100
x1 <- runif(n)
x2 <- runif(n)
b <- c(1, -2)
gamma <- 0.5
sig <- 0.2

mu <- b * x1 + b * (x1 * (x2 < gamma))
y <- rnorm(100, mu, sig)
``````

That is not a generative model or at least not the generative model that your Stan code estimates the parameters of. What distributions are the parameters drawn from?

It’s a problem for Stan that `d/d.gamma (x2[i] < gamma)` isn’t continuous. It effectively cuts information flow back to `gamma` because small changes in `gamma` don’t affect the posterior until you get near each `x2[i]`, at which point you get a step function.

In the best scenario, Stan will produce a random walk for `gamma`.

As Ben points out, it’s probably better to think through a continous generative model.