I am fitting a probit model (below) using the standard approach described in the Stan documentation.
I am fitting the model to time-series data, rather than cross-sectional. There are long periods where the dependent variable (y) is 0 and then short periods where it is 1. As a result, I noticed that when I calculate residuals (not shown), then there is a strong presence of autocorrelation (ignoring issues on calculating residuals on probit models).
I tried to figure some ways to mitigate the issue, but honestly it’s a bit tricky. For instance, the most common way to account for autocorrelation is to include the lag of the dependent variable. However, in this case, I might not be able to observe the value of the dependent variable after just one lag. I might not observe it for 6 lags, which doesn’t help so much. Even then, I would need to transform the variable to include it in the Phi() function. I was thinking that using a latent variable approach MIGHT work (i.e. what is in the Conceptual Framework here: https://en.wikipedia.org/wiki/Probit_model), but I wasn’t sure how to do that in Stan. If I can’t get that to work, then I was also debating using hidden Markov Model.
Any suggestions?
data {
int<lower=0> T;
int<lower=0> N_X;
int<lower=0,upper=1> y[T];
matrix[T, N_X] X;
}
parameters {
real A;
vector[N_X] B;
}
model {
A ~ normal(0, 1);
B ~ normal(0, 10);
y ~ bernoulli(Phi(A + X * B));
}
generated quantities {
vector[T] y_sim;
for (i in 1:T) {
y_sim[i] = bernoulli_rng(Phi(A + X[i] * B));
}
}