# Identify change points in multiple time-series

I have L arbitrarily long time-series, indexed by l = 1 \dots L, each with Nt_l normally distributed measurements D_l.

Each series contains a change-point located at an arbitrary time \tau_l, so that:

D_l(t) \sim \begin{cases} \mathcal{N}(\mu1_{p,l}, \sigma_p) & \mbox{if } t < \tau_l \\ \mathcal{N}(\mu2_{p,l}, \sigma_p) & \mbox{if } t \geq \tau_l \end{cases}

Moreover, it is known that \mu1_{p,l} \sim \mathcal{N}(\mu1, \sigma) and \mu2_{p,l} \sim \mathcal{N}(\mu2, \sigma).

Since I want to write a Stan model to do inference on my measurements, I first generated some synthetic data to check whether the parameters (especially the \tau_l) are correctly recovered:

L = 3      # number of series
Nt = 50    # common length of the series

# changepoints lie roughly in the middle third of each series
tau = sample(x = seq(from = round(Nt / 3), to = round(2 * Nt / 3)), size = L, replace = TRUE)
toEnd = Nt - tau

# hyperparameters
mu1 = -0.7
mu2 = 1.3
sigma = 0.4

# per-series means drawn from the hyperpriors
mu1_p = rnorm(L, mean = mu1, sd = sigma)
mu2_p = rnorm(L, mean = mu2, sd = sigma)

sigma_p = 0.2

D = matrix(data = NA_real_, nrow = Nt, ncol = L)

for (t in 1:Nt) {
  for (l in 1:L) {
    D[t, l] = ifelse(t < tau[l],
                     rnorm(1, mean = mu1_p[l], sd = sigma_p),
                     rnorm(1, mean = mu2_p[l], sd = sigma_p))
  }
}


In this case, all the L = 3 time-series have the same length Nt = 50, which is not true of the real data I will apply the model to. This is because I started coding the Stan model by extending the nowave.it script and adapting it to the model above; as a consequence, D is a 50x3 matrix. The three \tau_l are generated to lie somewhere in the middle of each time-series, which is consistent with the physical mechanism I want to describe.
So, here’s the translation of the math model to Stan:

data {
  int<lower=1> Nt;       // common length of the series
  int<lower=1> L;        // number of series
  vector[L] D[Nt];       // D[t, l] = measurement of series l at time t
}

parameters {
  real mu1;
  real mu2;
  real<lower=0> sigma;
  vector[L] mu1_p_raw;
  vector[L] mu2_p_raw;
  real<lower=0> sigma_p;
}

transformed parameters {
  // non-centered parametrization of the per-series means
  vector[L] mu1_p = mu1 + sigma * mu1_p_raw;
  vector[L] mu2_p = mu2 + sigma * mu2_p_raw;
  vector[Nt] log_p[L];

  for (l in 1:L) {
    // cumulative log-likelihoods of the first i points under the
    // early regime and under the late regime
    vector[Nt + 1] log_p_e = rep_vector(0, Nt + 1);
    vector[Nt + 1] log_p_l = rep_vector(0, Nt + 1);
    for (i in 1:Nt) {
      log_p_e[i + 1] = log_p_e[i] + normal_lpdf(D[i, l] | mu1_p[l], sigma_p);
      log_p_l[i + 1] = log_p_l[i] + normal_lpdf(D[i, l] | mu2_p[l], sigma_p);
    }
    // log_p[l, t]: log joint density of series l if the changepoint is at t,
    // including the uniform prior 1 / Nt on tau
    log_p[l] = rep_vector(-log(Nt) + log_p_l[Nt + 1], Nt)
               + head(log_p_e, Nt) - head(log_p_l, Nt);
  }
}

model {
  mu1 ~ std_normal();
  mu2 ~ std_normal();
  sigma ~ std_normal();

  mu1_p_raw ~ std_normal();
  mu2_p_raw ~ std_normal();
  sigma_p ~ std_normal();

  // tau is discrete, so it is marginalized out of the likelihood
  for (l in 1:L)
    target += log_sum_exp(log_p[l]);
}

generated quantities {
  int<lower=1, upper=Nt> tau[L];
  for (l in 1:L) {
    // softmax(log_p[l]) is the posterior over the changepoint of series l
    tau[l] = categorical_rng(softmax(log_p[l]));
  }
}
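
For reference, the log_p construction implements the usual marginalization of the discrete changepoint, under the uniform prior on \tau_l:

p(D_l \mid \theta) = \sum_{\tau = 1}^{Nt} \frac{1}{Nt} \prod_{t < \tau} \mathcal{N}(D_l(t) \mid \mu1_{p,l}, \sigma_p) \prod_{t \geq \tau} \mathcal{N}(D_l(t) \mid \mu2_{p,l}, \sigma_p)

so log_p[l][\tau] is the log of the \tau-th summand, target += log_sum_exp(log_p[l]) adds \log p(D_l \mid \theta) to the target, and softmax(log_p[l]) is exactly the posterior over \tau_l that is sampled in the generated quantities block.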



I am quite happy with the results, although the non-centered reparametrization of \mu1_p and \mu2_p and adapt_delta = 0.99 have not eliminated some sporadic divergent transitions after warmup. Here are the results:

               mean      se_mean         sd        50%     n_eff      Rhat
mu1_p[1] -1.3634953 0.0005508518 0.03529051 -1.3634321 4104.3675 0.9999053
mu1_p[2] -0.2586235 0.0005489039 0.03590872 -0.2591301 4279.6400 0.9993275
mu1_p[3] -0.7160250 0.0007111709 0.04712015 -0.7162227 4390.0093 1.0004151
mu2_p[1]  1.2859460 0.0006491305 0.04319718  1.2852867 4428.3997 0.9994939
mu2_p[2]  1.3433570 0.0007406048 0.04210413  1.3435833 3232.0374 1.0004466
mu2_p[3]  1.8209715 0.0005387842 0.03373185  1.8207860 3919.6797 0.9991053
sigma_p   0.1962541 0.0002211053 0.01155979  0.1961298 2733.3879 1.0001940
mu1      -0.7018092 0.0092852402 0.33567857 -0.7159256 1306.9559 1.0012539
mu2       1.3134600 0.0112513183 0.37241814  1.3493183 1095.6071 1.0013171
sigma     0.5961289 0.0080242368 0.25166286  0.5370761  983.6277 1.0025183
tau[1]   31.0000000          NaN 0.00000000 31.0000000       NaN       NaN
tau[2]   30.0000000          NaN 0.00000000 30.0000000       NaN       NaN
tau[3]   19.0000000          NaN 0.00000000 19.0000000       NaN       NaN


which are consistent with the original parameters:

> mu1_p
 -1.3758223 -0.2042016 -0.7435864
> mu2_p
 1.253103 1.373233 1.812222
> tau
 31 30 19


So far, so good, but this is not the structure of my measurements, because the real time-series have arbitrary lengths. The input structure should be similar to:

L = 3
# each series now has its own length between 50 and 100
Nt = sample(x = seq(from = 50, to = 100), size = L, replace = TRUE)

tau = c()
for (l in 1:L) {
  tau = c(tau, sample(x = seq(from = round(Nt[l] / 3), to = round(2 * Nt[l] / 3)), size = 1))
}

N = sum(Nt)

# long format: one row per observation, IDx identifies the series
D = data.frame()
for (l in 1:L) {
  for (n in 1:Nt[l]) {
    D = rbind(D, data.frame(time = n,
                            IDx = l,
                            D = ifelse(n < tau[l],
                                       rnorm(1, mean = mu1_p[l], sd = sigma_p),
                                       rnorm(1, mean = mu2_p[l], sd = sigma_p))))
  }
}


D now becomes a long-format data frame, and each series is identified by the value of IDx.

And here comes the question:
With this structure I am not able (in Stan) to correctly index log_p in the model block (log_p would now be a single vector[N], no longer a matrix, since the series are not equally long), nor to index it when marginalizing out the \tau_l and sampling them in the generated quantities block.

Any help is appreciated.

Since nobody replied, I will give it a try.

But first:

Regarding the sporadic divergent transitions: it usually does not pay off to ignore those, as they often bite you later. I don't deeply understand changepoint models, so I can't help you with them directly, but I've written about a few tricks to handle divergences at https://www.martinmodrak.cz/2018/02/19/taming-divergences-in-stan-models/

But on to the implementation question:
What I would do is concatenate all the time series into a single vector and manually unserialize them as needed. It could look something like this (not tested, just a sketch, but I made the same approach work for a different task):

data {
  int<lower=1> L;              // number of series
  int<lower=1> lengths[L];     // length of each series
  vector[sum(lengths)] D;      // all series concatenated
}

transformed data {
  int<lower=1> series_start[L];   // index in D where each series begins

  series_start[1] = 1;
  for (l in 2:L) {
    series_start[l] = series_start[l - 1] + lengths[l - 1];
  }
}

transformed parameters {
  vector[sum(lengths)] log_p;
  for (l in 1:L) {
    vector[lengths[l]] log_p_for_series;
    //compute log_p for the given series
    //...
    log_p[series_start[l] : (series_start[l] + lengths[l] - 1)] = log_p_for_series;
  }
}
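
And to fill in the omitted piece: here is my guess at how the complete model could look, reusing your cumulative-sum construction on each unserialized series. Again untested, just a sketch; the only new ingredient is segment(v, start, n), which slices n elements out of a vector starting at position start. Note that tau[l] is then the changepoint position within series l, and the code assumes D is sorted by IDx, then time, exactly as your simulation produces it.

data {
  int<lower=1> L;              // number of series
  int<lower=1> lengths[L];     // length of each series
  vector[sum(lengths)] D;      // all series concatenated, sorted by IDx then time
}

transformed data {
  int series_start[L];         // index in D where each series begins

  series_start[1] = 1;
  for (l in 2:L)
    series_start[l] = series_start[l - 1] + lengths[l - 1];
}

parameters {
  real mu1;
  real mu2;
  real<lower=0> sigma;
  vector[L] mu1_p_raw;
  vector[L] mu2_p_raw;
  real<lower=0> sigma_p;
}

transformed parameters {
  vector[L] mu1_p = mu1 + sigma * mu1_p_raw;
  vector[L] mu2_p = mu2 + sigma * mu2_p_raw;
  vector[sum(lengths)] log_p;  // serialized just like D

  for (l in 1:L) {
    int n = lengths[l];
    vector[n] D_l = segment(D, series_start[l], n);  // unserialize series l
    vector[n + 1] log_p_e = rep_vector(0, n + 1);
    vector[n + 1] log_p_l = rep_vector(0, n + 1);
    for (i in 1:n) {
      log_p_e[i + 1] = log_p_e[i] + normal_lpdf(D_l[i] | mu1_p[l], sigma_p);
      log_p_l[i + 1] = log_p_l[i] + normal_lpdf(D_l[i] | mu2_p[l], sigma_p);
    }
    log_p[series_start[l] : (series_start[l] + n - 1)]
      = rep_vector(-log(n) + log_p_l[n + 1], n)
        + head(log_p_e, n) - head(log_p_l, n);
  }
}

model {
  mu1 ~ std_normal();
  mu2 ~ std_normal();
  sigma ~ std_normal();
  mu1_p_raw ~ std_normal();
  mu2_p_raw ~ std_normal();
  sigma_p ~ std_normal();

  // marginalize tau out of each series' likelihood, one slice at a time
  for (l in 1:L)
    target += log_sum_exp(segment(log_p, series_start[l], lengths[l]));
}

generated quantities {
  int<lower=1> tau[L];
  for (l in 1:L)
    tau[l] = categorical_rng(softmax(segment(log_p, series_start[l], lengths[l])));
}

The same segment slice solves the generated quantities indexing as well: softmax over the slice belonging to series l is the posterior over \tau_l.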


Hope that helps and best of luck!