I am estimating a linear model, where I want to estimate the threshold value for an indicator variable together with the normal selection coefficients.
I have the following model in Stan:
data {
int N; // no observastions
// Covariates
real x1[N];
real x2[N];
// Count outcome
real y[N];
}
parameters {
real betas[3];
real <lower=0> gamma;
real <lower=0> sigma;
}
model {
real mu[N];
// Likelihood
for (i in 1:N) {
mu[i] = betas[1] + betas[2] * x1[i] + betas[3] * (x2[i] < gamma);
}
y ~ normal(mu, sigma);
}
Where gamma
is the the threshold to classify x2
into 0 and 1. I have two questions:
- Is this the right way to specify my problem in Stan? If I run the model on simulated data, I can recover the correct parameter values.
- Are there any tricks I could use to speed up computation?
No, because it is not differentiable with respect to betas[3]
when x2[i] > gamma
Doesn’t matter until you write down a differentiable model so that Stan can sample from without warnings.
It is hard to think of data-generating processes where x2
has no influence on y
until it reaches gamma
and then is linear. But if you were going to estimate a model such as this, you will need an informative prior on gamma
and presumably the other parameters as well. Then you can write your model as a two component mixture where the mixing weight is the probability that x2[i] < gamma
under some CDF for x2
. Something like
model {
vector[N] eta = betas[1] * betas[2] * x1; // need to make x1 a vector first
for (i in 1:N) {
real theta = foo_cdf(gamma | ...);
target += log_mix(theta, normal_lpdf(y[i] | eta[i] + betas[3] * x2[i], sigma),
normal_lpdf(y[i] | eta[i], sigma));
}
// priors
}
Many thanks for your suggestions, I am still having a hard time understanding and adjusting the line
real theta = foo_cdf(gamma | ...);
Are you ware of any papers/tutorials/examples I could refer to?
It is just whatever CDF you choose from the distribution of x2
.
I have been trying to a lot and also extended to model to the following:
y = b[1] * x1 + b[2] * (x1 * (x2 < gamma))
I want to estimate b[1]
,b[2]
and gamma
. Is it also possible to reformulate this model within a mixture model?
I implemented the model as it is in Stan and got unbiased parameter estimates for some data and convergence problems (i.e., Rhat values > 1.2) for other data sets. As I understood your previous replies to my questions, that model is unidentifiable and I am having a hard time to reformulate it.
Any hints would be greatly appreciated.
You have to start with a generative model that can simulate the data and from there, it should be clearer how to formulate a mixture model to estimate the parameters given wild data. It is going to depend a lot on how gamma
is generated.
My idea is to use gamma
to differentiate between two hidden states that can be derived from x2
. A simple model to generate data in R could look like this:
n <- 100
x1 <- runif(n)
x2 <- runif(n)
b <- c(1, -2)
gamma <- 0.5
sig <- 0.2
mu <- b[1] * x1 + b[2] * (x1 * (x2 < gamma))
y <- rnorm(100, mu, sig)
That is not a generative model or at least not the generative model that your Stan code estimates the parameters of. What distributions are the parameters drawn from?
It’s a problem for Stan that d/d.gamma (x2[i] < gamma)
isn’t continuous. It effectively cuts information flow back to gamma
because small changes in gamma
don’t affect the posterior until you get near each x2[i]
, at which point you get a step function.
In the best scenario, Stan will produce a random walk for gamma
.
As Ben points out, it’s probably better to think through a continous generative model.