I am trying to implement a simple threshold logistic model where the data have been generated using this data generation process:
logit(y_i) = -1.3 + 2.6 \times I(x_i > -0.6), where x \sim N(0, 1)
In the model \tau is an unknown parameter reflecting the threshold (whose true value is -0.6) and \beta_0 and \beta_1 are the intercept and effect of passing the threshold, respectively. The model doesn’t fit so well, and I’m trying to figure out why. I’ve tried increasing tree depth and the number of chains, but has no effect. The chains are generally in the right area, but are not mixing well. Is there anything obvious way I can modify the model to get better results?
Here is the stan code:
data {
int<lower=1> N; // number of observations
real x[N];
int y[N];
}
parameters {
real tau; // cutoffs for low/not low
real beta0; // overall intercept for outcome model;
real beta1; // effect of not low
}
transformed parameters{
vector[N] yhat;
for (i in 1:N)
yhat[i] = beta0 + beta1*(x[i] > tau);
}
model {
// priors
beta0 ~ normal(0, 5);
beta1 ~ normal(0, 5);
tau ~ normal(0, 3);
// outcome model
for (i in 1:N)
y[i] ~ bernoulli_logit(yhat[i]);
}
Here is the R code in which I generate the simulated data (using package simstudy
) and fit the model first with the chngpt
package, and then fit the stan
model:
library(simstudy)
library(chngpt)
#--- Generate simulated data
d1 <- defData(varname = "V1", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "L1", formula = "-1.3 + 1.6 * (V1 > -.6)",
link= "logit", dist = "binary")
dd <- genData(1000, d1)
#--- Fit using chngpt package (MLE)
fit <- chngptm(formula.1 = L1 ~ 1, formula.2 = ~ V1, data = dd, type="step", family="binomial")
summary(fit)
#--- Fit with stan
rt <- stanc("KSG/changepoint.stan");
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
N <- nrow(dd)
y <- dd[, L1]
x <- dd[, V1]
studydata <- list(N=N, x=x, y=y)
fit <- sampling(sm, data = studydata, iter = 4000, warmup = 1000,
cores = 4L, chains = 4, control = list(max_treedepth = 10))