I am estimating a linear model in which I want to estimate the threshold value for an indicator variable together with the usual regression coefficients.

I have the following model in Stan:

```
data {
  int N; // number of observations
  // Covariates
  real x1[N];
  real x2[N];
  // Outcome
  real y[N];
}
parameters {
  real betas[3];
  real<lower=0> gamma;
  real<lower=0> sigma;
}
model {
  real mu[N];
  // Likelihood
  for (i in 1:N) {
    mu[i] = betas[1] + betas[2] * x1[i] + betas[3] * (x2[i] < gamma);
  }
  y ~ normal(mu, sigma);
}
```

where `gamma` is the threshold that classifies `x2` into 0 and 1. I have two questions:

- Is this the right way to specify my problem in Stan? If I run the model on simulated data, I can recover the correct parameter values.
- Are there any tricks I could use to speed up computation?

No, because the posterior is not differentiable with respect to `gamma`, and the derivative with respect to `betas[3]` is exactly zero whenever `x2[i] > gamma`.

The second question doesn't matter until you write down a differentiable model that Stan can sample from without warnings.

It is hard to think of data-generating processes where `x2` has no influence on `y` until it reaches `gamma` and then is linear. But if you were going to estimate a model such as this, you would need an informative prior on `gamma` and presumably on the other parameters as well. Then you can write your model as a two-component mixture where the mixing weight is the probability that `x2[i] < gamma` under some CDF for `x2`. Something like

```
model {
  vector[N] eta = betas[1] + betas[2] * x1; // need to make x1 a vector first
  for (i in 1:N) {
    real theta = foo_cdf(gamma | ...);
    target += log_mix(theta, normal_lpdf(y[i] | eta[i] + betas[3] * x2[i], sigma),
                             normal_lpdf(y[i] | eta[i], sigma));
  }
  // priors
}
```
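For intuition, `log_mix(theta, lp1, lp2)` is just a numerically stable evaluation of `log(theta * exp(lp1) + (1 - theta) * exp(lp2))`. A rough Python sketch of what Stan computes (not code from this thread):

```python
import math

def log_mix(theta, lp1, lp2):
    """Stable log of theta*exp(lp1) + (1-theta)*exp(lp2), like Stan's log_mix."""
    a = math.log(theta) + lp1
    b = math.log1p(-theta) + lp2
    m = max(a, b)  # subtract the max before exponentiating to avoid underflow
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# With theta = 0.5 and equal component log densities, the mixture
# collapses to the common density:
print(log_mix(0.5, -1.0, -1.0))  # -1.0
```

This is why the discrete indicator disappears from the model: instead of branching on `x2[i] < gamma`, each observation contributes a smooth weighted average of the two component densities, which keeps the target differentiable in `gamma`.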

Many thanks for your suggestions. I am still having a hard time understanding and adjusting the line

```
real theta = foo_cdf(gamma | ...);
```

Are you aware of any papers/tutorials/examples I could refer to?

It is just whatever CDF you choose for the distribution of `x2`.
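For example, if you assume `x2 ~ Uniform(0, 1)` (as in the simulation later in this thread), then `theta = P(x2 < gamma)` is simply `gamma`; if you instead assumed `x2` were normal, `theta` would be a normal CDF evaluated at `gamma`. A small Python sketch of the mixing weight under these two assumed distributions (the distribution choice is an assumption, not from the thread):

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def uniform_cdf(x, a=0.0, b=1.0):
    """P(X < x) for X ~ Uniform(a, b)."""
    return min(max((x - a) / (b - a), 0.0), 1.0)

gamma = 0.5
# Mixing weight theta = P(x2 < gamma) under the assumed distribution of x2:
print(uniform_cdf(gamma))          # 0.5
print(normal_cdf(gamma, 0.5, 1.0)) # 0.5
```

In the Stan model, `foo_cdf` would be replaced by the corresponding built-in CDF for whichever distribution you pick for `x2`.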

I have been experimenting a lot and have also extended the model to the following:

`y = b[1] * x1 + b[2] * (x1 * (x2 < gamma))`

I want to estimate `b[1]`, `b[2]`, and `gamma`. Is it also possible to reformulate this model as a mixture model?

I implemented the model as it is in Stan and got unbiased parameter estimates for some data sets but convergence problems (i.e., Rhat values > 1.2) for others. As I understood your previous replies, the model is unidentifiable, and I am having a hard time reformulating it.

Any hints would be greatly appreciated.

You have to start with a generative model that can simulate the data; from there, it should be clearer how to formulate a mixture model to estimate the parameters given wild data. It is going to depend a lot on how `gamma` is generated.

My idea is to use `gamma` to differentiate between two hidden states that can be derived from `x2`. A simple model to generate data in R could look like this:

```
n <- 100
x1 <- runif(n)
x2 <- runif(n)
b <- c(1, -2)
gamma <- 0.5
sig <- 0.2
mu <- b[1] * x1 + b[2] * (x1 * (x2 < gamma))
y <- rnorm(n, mu, sig)
```

That is not a generative model or at least not the generative model that your Stan code estimates the parameters of. What distributions are the parameters drawn from?

It’s a problem for Stan that `d/d.gamma (x2[i] < gamma)` isn’t continuous. It effectively cuts information flow back to `gamma`, because small changes in `gamma` don’t affect the posterior until you get near each `x2[i]`, at which point you get a step function.

In the best scenario, Stan will produce a random walk for `gamma`.
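You can see this numerically: holding everything else fixed, the log likelihood is a step function of `gamma`, exactly flat between consecutive sorted values of `x2`, so the gradient with respect to `gamma` is zero almost everywhere. A small Python illustration, borrowing the simulation setup from the R snippet above (the specific seed and indices are just for demonstration):

```python
import random, math

random.seed(1)
n = 100
x1 = [random.random() for _ in range(n)]
x2 = [random.random() for _ in range(n)]
b1, b2, sig, true_gamma = 1.0, -2.0, 0.2, 0.5
y = [random.gauss(b1 * x1[i] + b2 * x1[i] * (x2[i] < true_gamma), sig)
     for i in range(n)]

def loglik(gamma):
    """Normal log likelihood of y, viewed as a function of the threshold gamma."""
    ll = 0.0
    for i in range(n):
        mu = b1 * x1[i] + b2 * x1[i] * (x2[i] < gamma)
        ll += -0.5 * ((y[i] - mu) / sig) ** 2 - math.log(sig * math.sqrt(2 * math.pi))
    return ll

# Any two gammas falling between the same pair of adjacent sorted x2 values
# classify every observation identically, so the log likelihood is exactly flat:
xs = sorted(x2)
print(loglik(0.5 * (xs[40] + xs[41])) == loglik(0.3 * xs[40] + 0.7 * xs[41]))  # True
```

Between those steps the sampler sees no gradient in the `gamma` direction, which is exactly the random-walk behavior described above.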

As Ben points out, it’s probably better to think through a continuous generative model.