I have L arbitrarily long time series, indexed by l = 1 \dots L, each with Nt_l normally distributed measurements D_l.
Each series contains a change point located at an arbitrary time \tau_l, so that:
D_l(t) \sim \begin{cases} \mathcal{N}(\mu1_{p,l}, \sigma_p) & \text{if } t < \tau_l \\ \mathcal{N}(\mu2_{p,l}, \sigma_p) & \text{if } t \geq \tau_l \end{cases}
Moreover, it is known that \mu1_{p,l} \sim \mathcal{N}(\mu1, \sigma) and \mu2_{p,l} \sim \mathcal{N}(\mu2, \sigma).
Since I want to write a Stan model to do inference on my measurements, I first generated some synthetic data to check whether the parameters (especially the \tau_l) are correctly recovered:
L = 3        # number of series
Nt = 50      # common series length for this first test
# change points lie roughly in the middle third of each series
tau = sample(x = seq(from = round(Nt / 3), to = round(2 * Nt / 3)), size = L, replace = TRUE)
toEnd = Nt - tau   # samples after each change point (not used below)
mu1 = -0.7         # population mean before the change point
mu2 = 1.3          # population mean after the change point
sigma = 0.4        # between-series standard deviation
mu1_p = rnorm(L, mean = mu1, sd = sigma)   # per-series pre-change means
mu2_p = rnorm(L, mean = mu2, sd = sigma)   # per-series post-change means
sigma_p = 0.2      # within-series (measurement) standard deviation
D = matrix(data = NA_real_, nrow = Nt, ncol = L)
for (t in 1:Nt) {
  for (l in 1:L) {
    D[t, l] = ifelse(t < tau[l],
                     rnorm(1, mean = mu1_p[l], sd = sigma_p),
                     rnorm(1, mean = mu2_p[l], sd = sigma_p))
  }
}
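As a quick visual sanity check (illustrative only, plain base R), the simulated series and the true change points can be overlaid:

# plot the three series and mark the true change points with dashed lines
matplot(D, type = "l", lty = 1, xlab = "t", ylab = "D")
abline(v = tau, lty = 2)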
In this case all L = 3 time series have the same length Nt = 50, which is not true of the real data I will apply the model to. This is because I started coding the Stan model by extending the nowave.it script and adapting it to the model above; as a consequence, D is a 50x3 matrix. The three \tau are generated to lie roughly in the middle of each time series, which is consistent with the physical mechanism I want to describe.
So, here’s the translation of the math model to Stan:
data {
  int<lower=1> Nt;
  int<lower=1> L;
  vector[L] D[Nt];      // D[t, l]: measurement of series l at time t
}
parameters {
  real mu1;             // population mean before the change point
  real mu2;             // population mean after the change point
  real<lower=0> sigma;  // between-series standard deviation
  vector[L] mu1_p_raw;  // non-centered per-series offsets
  vector[L] mu2_p_raw;
  real<lower=0> sigma_p;  // within-series standard deviation
}
transformed parameters {
  // non-centered parameterization of the per-series means
  vector[L] mu1_p = mu1 + sigma * mu1_p_raw;
  vector[L] mu2_p = mu2 + sigma * mu2_p_raw;
  // log_p[l, t]: log joint density of series l with its change point at t,
  // including the uniform prior on tau_l
  vector[Nt] log_p[L];
  for (l in 1:L) {
    vector[Nt + 1] log_p_e;  // running log density under the "early" regime
    vector[Nt + 1] log_p_l;  // running log density under the "late" regime
    log_p_e[1] = 0;
    log_p_l[1] = 0;
    for (i in 1:Nt) {
      log_p_e[i + 1] = log_p_e[i] + normal_lpdf(D[i, l] | mu1_p[l], sigma_p);
      log_p_l[i + 1] = log_p_l[i] + normal_lpdf(D[i, l] | mu2_p[l], sigma_p);
    }
    log_p[l] = rep_vector(-log(Nt) + log_p_l[Nt + 1], Nt)
               + head(log_p_e, Nt) - head(log_p_l, Nt);
  }
}
model {
  mu1 ~ std_normal();
  mu2 ~ std_normal();
  sigma ~ std_normal();
  mu1_p_raw ~ std_normal();
  mu2_p_raw ~ std_normal();
  sigma_p ~ std_normal();
  // marginalize the discrete change points out of the likelihood
  for (l in 1:L)
    target += log_sum_exp(log_p[l]);
}
generated quantities {
  int<lower=1, upper=Nt> tau[L];
  // draw each change point from its marginal posterior
  for (l in 1:L) {
    tau[l] = categorical_rng(softmax(log_p[l]));
  }
}
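For reference, log_p implements the usual change-point marginalization (as in the change point section of the Stan user's guide): with a uniform prior \tau_l \sim \mathrm{Uniform}(1, Nt), each entry is

\log p(D_l, \tau_l = t) = -\log Nt + \sum_{i < t} \log \mathcal{N}(D_l(i) \mid \mu1_{p,l}, \sigma_p) + \sum_{i \geq t} \log \mathcal{N}(D_l(i) \mid \mu2_{p,l}, \sigma_p)

and the two running sums log_p_e and log_p_l make this computable for all t in O(Nt) instead of O(Nt^2); log_sum_exp(log_p[l]) then gives the likelihood of series l with \tau_l marginalized out.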
I am quite happy with the results, although the non-centered reparametrization of \mu1_p and \mu2_p and adapt_delta = 0.99 have not eliminated some sporadic divergent transitions after warmup. Here are the results:
mean se_mean sd 50% n_eff Rhat
mu1_p[1] -1.3634953 0.0005508518 0.03529051 -1.3634321 4104.3675 0.9999053
mu1_p[2] -0.2586235 0.0005489039 0.03590872 -0.2591301 4279.6400 0.9993275
mu1_p[3] -0.7160250 0.0007111709 0.04712015 -0.7162227 4390.0093 1.0004151
mu2_p[1] 1.2859460 0.0006491305 0.04319718 1.2852867 4428.3997 0.9994939
mu2_p[2] 1.3433570 0.0007406048 0.04210413 1.3435833 3232.0374 1.0004466
mu2_p[3] 1.8209715 0.0005387842 0.03373185 1.8207860 3919.6797 0.9991053
sigma_p 0.1962541 0.0002211053 0.01155979 0.1961298 2733.3879 1.0001940
mu1 -0.7018092 0.0092852402 0.33567857 -0.7159256 1306.9559 1.0012539
mu2 1.3134600 0.0112513183 0.37241814 1.3493183 1095.6071 1.0013171
sigma 0.5961289 0.0080242368 0.25166286 0.5370761 983.6277 1.0025183
tau[1] 31.0000000 NaN 0.00000000 31.0000000 NaN NaN
tau[2] 30.0000000 NaN 0.00000000 30.0000000 NaN NaN
tau[3] 19.0000000 NaN 0.00000000 19.0000000 NaN NaN
which are consistent with the true parameters:
> mu1_p
[1] -1.3758223 -0.2042016 -0.7435864
> mu2_p
[1] 1.253103 1.373233 1.812222
> tau
[1] 31 30 19
So far, so good, but this is not the structure of my measurements, because my real time series have arbitrary lengths. The input structure should be similar to:
L = 3
Nt = sample(x = seq(from = 50, to = 100), size = L, replace = TRUE)  # per-series lengths
tau = c()
for (l in 1:L) {
  # change point in the middle third of each series
  tau = c(tau, sample(x = seq(from = round(Nt[l] / 3), to = round(2 * Nt[l] / 3)), size = 1))
}
N = sum(Nt)   # total number of observations
D = data.frame()
for (l in 1:L) {
  for (n in 1:Nt[l]) {
    D = rbind(D, data.frame(time = n,
                            IDx = l,
                            D = ifelse(n < tau[l],
                                       rnorm(1, mean = mu1_p[l], sd = sigma_p),
                                       rnorm(1, mean = mu2_p[l], sd = sigma_p))))
  }
}
D now becomes a data frame, and each series is identified by the value of IDx.
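For reference, the corresponding Stan data block could be declared along these lines (a sketch; the names simply mirror the data frame above):

data {
  int<lower=1> L;                  // number of series
  int<lower=1> N;                  // total number of observations
  int<lower=1> Nt[L];              // length of each series
  int<lower=1, upper=L> IDx[N];    // series each observation belongs to
  vector[N] D;                     // measurements, stacked series by series
}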
And here comes the question: with this structure, I am not able to correctly index a vector[N] log_p in Stan (log_p is no longer a rectangular array, since the series are not equally long), nor to work out the indexing when marginalizing out the \tau_l in the model block and when drawing them in the generated quantities block.
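For concreteness, the kind of bookkeeping I suppose is needed is the ragged-data pattern from the Stan user's guide, with per-series start positions and segment() (just a sketch of the idea, not working code):

transformed data {
  int pos[L];   // start index of each series inside the stacked vector D
  pos[1] = 1;
  for (l in 2:L)
    pos[l] = pos[l - 1] + Nt[l - 1];
}
// ...so that series l would be addressed as segment(D, pos[l], Nt[l]),
// but I cannot see how to build and index log_p on top of this.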
Any help is appreciated.