# Identify change points in multiple time-series

I have l = 1 \dots L arbitrarily long time-series, each one with Nt_l normally distributed measurements D_l.

Each series contains a change point located at an arbitrary time \tau_l, so that:

D_l(t) \sim \begin{cases} \mathcal{N}(\mu1_{p,l}, \sigma_p) & \text{if } t < \tau_l \\ \mathcal{N}(\mu2_{p,l}, \sigma_p) & \text{if } t \geq \tau_l \end{cases}

Moreover, it is known that \mu1_{p,l} \sim \mathcal{N}(\mu1, \sigma) and \mu2_{p,l} \sim \mathcal{N}(\mu2, \sigma).

Since I want to write a Stan model to do inference on my measurements, I first generated some synthetic data to check whether the parameters (especially the \tau) are correctly recovered:

L = 3
Nt = 50

tau = sample(x = seq(from = round(Nt / 3), to = round(2 * Nt / 3)), size = L, replace = TRUE)
toEnd = Nt - tau

mu1 = -0.7
mu2 = 1.3
sigma = 0.4

mu1_p = rnorm(L, mean = mu1, sd = sigma)
mu2_p = rnorm(L, mean = mu2, sd = sigma)

sigma_p = 0.2

D = matrix(data = NA_real_, nrow = Nt, ncol = L)

for (t in 1:Nt) {
  for (l in 1:L) {
    D[t, l] = ifelse(t < tau[l],
                     rnorm(1, mean = mu1_p[l], sd = sigma_p),
                     rnorm(1, mean = mu2_p[l], sd = sigma_p))
  }
}


In this case, all L = 3 time-series have the same length Nt = 50, which is not true of the real data I will apply the model to. This is because I started coding the Stan model by extending the nowave.it script and adapting it to the model above; as a consequence, D is a 50x3 matrix. The three \tau are generated to lie somewhere in the middle of each time-series, which is consistent with the physical mechanism I want to describe.
So, here is the translation of the math model to Stan:

data {
  int<lower=1> Nt;
  int<lower=1> L;
  vector[L] D[Nt];
}

parameters {
  real mu1;
  real mu2;
  real<lower=0> sigma;
  vector[L] mu1_p_raw;
  vector[L] mu2_p_raw;
  real<lower=0> sigma_p;
}

transformed parameters {
  vector[L] mu1_p = mu1 + sigma * mu1_p_raw;
  vector[L] mu2_p = mu2 + sigma * mu2_p_raw;
  vector[Nt] log_p[L];

  for (l in 1:L) {
    vector[Nt + 1] log_p_e;
    vector[Nt + 1] log_p_l;
    log_p_e[1] = 0;
    log_p_l[1] = 0;
    for (i in 1:Nt) {
      log_p_e[i + 1] = log_p_e[i] + normal_lpdf(D[i, l] | mu1_p[l], sigma_p);
      log_p_l[i + 1] = log_p_l[i] + normal_lpdf(D[i, l] | mu2_p[l], sigma_p);
    }
    log_p[l] = rep_vector(-log(Nt) + log_p_l[Nt + 1], Nt)
               + head(log_p_e, Nt) - head(log_p_l, Nt);
  }
}

model {
  mu1 ~ std_normal();
  mu2 ~ std_normal();
  sigma ~ std_normal();

  mu1_p_raw ~ std_normal();
  mu2_p_raw ~ std_normal();

  sigma_p ~ std_normal();

  for (l in 1:L)
    target += log_sum_exp(log_p[l]);
}

generated quantities {
  int<lower=1, upper=Nt> tau[L];
  for (l in 1:L) {
    tau[l] = categorical_rng(softmax(log_p[l]));
  }
}
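Since \tau_l is discrete, Stan cannot sample it directly, so the transformed parameters block marginalizes it out. For each series l and each candidate change point i, with a uniform prior p(\tau_l = i) = 1/Nt, the code accumulates

\log p(D_l, \tau_l = i) = -\log Nt + \sum_{t < i} \log \mathcal{N}(D_l(t) \mid \mu1_{p,l}, \sigma_p) + \sum_{t \geq i} \log \mathcal{N}(D_l(t) \mid \mu2_{p,l}, \sigma_p)

so that log_sum_exp(log_p[l]) in the model block adds \log p(D_l) to the target, and softmax(log_p[l]) in generated quantities is the posterior distribution over \tau_l.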



I am quite happy with the results, although the reparametrization of \mu1_p and \mu2_p and adapt_delta = 0.99 have not eliminated some sporadic divergent transitions after warmup. Here are the results:

               mean      se_mean         sd        50%     n_eff      Rhat
mu1_p[1] -1.3634953 0.0005508518 0.03529051 -1.3634321 4104.3675 0.9999053
mu1_p[2] -0.2586235 0.0005489039 0.03590872 -0.2591301 4279.6400 0.9993275
mu1_p[3] -0.7160250 0.0007111709 0.04712015 -0.7162227 4390.0093 1.0004151
mu2_p[1]  1.2859460 0.0006491305 0.04319718  1.2852867 4428.3997 0.9994939
mu2_p[2]  1.3433570 0.0007406048 0.04210413  1.3435833 3232.0374 1.0004466
mu2_p[3]  1.8209715 0.0005387842 0.03373185  1.8207860 3919.6797 0.9991053
sigma_p   0.1962541 0.0002211053 0.01155979  0.1961298 2733.3879 1.0001940
mu1      -0.7018092 0.0092852402 0.33567857 -0.7159256 1306.9559 1.0012539
mu2       1.3134600 0.0112513183 0.37241814  1.3493183 1095.6071 1.0013171
sigma     0.5961289 0.0080242368 0.25166286  0.5370761  983.6277 1.0025183
tau[1]   31.0000000          NaN 0.00000000 31.0000000       NaN       NaN
tau[2]   30.0000000          NaN 0.00000000 30.0000000       NaN       NaN
tau[3]   19.0000000          NaN 0.00000000 19.0000000       NaN       NaN


which are consistent with the original parameters:

> mu1_p
[1] -1.3758223 -0.2042016 -0.7435864
> mu2_p
[1] 1.253103 1.373233 1.812222
> tau
[1] 31 30 19


So far, so good, but this is not the structure of my measurements: in the real case the time-series are arbitrarily long. The input structure should be similar to:

L = 3
Nt = sample(x = seq(from = 50, to = 100), size = L, replace = TRUE)

tau = c()
for (l in 1:L) {
  tau = c(tau, sample(x = seq(from = round(Nt[l] / 3), to = round(2 * Nt[l] / 3)), size = 1))
}

N = sum(Nt)

D = data.frame()
for (l in 1:L) {
  for (n in 1:Nt[l]) {
    D = rbind(D, data.frame(time = n,
                            IDx = l,
                            D = ifelse(n < tau[l],
                                       rnorm(1, mean = mu1_p[l], sd = sigma_p),
                                       rnorm(1, mean = mu2_p[l], sd = sigma_p))))
  }
}


D now becomes a data frame and each series is identified by the value of IDx.

And here comes the question:
With this structure I am not able (in Stan) to correctly index the vector[N] log_p in the model block (log_p can no longer be a matrix, since the series are not all the same length), nor to index it when marginalizing out the \tau in the generated quantities block.

Any help is appreciated.

Since nobody replied, I will give it a try.

But first:

It usually does not pay off to ignore those divergences, as they often bite you later. I don't deeply understand changepoint models, so I can't help you directly, but I've written about a few tricks to handle divergences at https://www.martinmodrak.cz/2018/02/19/taming-divergences-in-stan-models/

But on to the implementation question:
What I would do is concatenate all the time series into a single vector and manually deserialize them as needed. It could look something like this (not tested, just a sketch, but I made it work for a different task):

data {
  int<lower=1> Nt;                 // number of series
  int<lower=1> lengths[Nt];        // length of each series
  vector[sum(lengths)] D;          // all series concatenated
}

transformed data {
  int<lower=0> series_start[Nt];

  series_start[1] = 1;
  for (i in 2:Nt) {
    series_start[i] = series_start[i - 1] + lengths[i - 1];
  }
}

transformed parameters {
  vector[sum(lengths)] log_p;
  for (i in 1:Nt) {
    vector[lengths[i]] log_p_for_series;
    // compute log_p for the given series
    // ...
    log_p[series_start[i] : (series_start[i] + lengths[i] - 1)] = log_p_for_series;
  }
}
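To double-check the slicing, here is the same start-index arithmetic written out in Python, purely to illustrate the bookkeeping (all names here are made up for the example); note that Stan's a[m : n] slice includes both endpoints, so the last element of series i sits at series_start[i] + lengths[i] - 1:

```python
# Ragged-series bookkeeping: concatenate series of unequal lengths and
# recover each one from its start index and length.

lengths = [5, 3, 7]                # lengths[i] = length of series i
D = list(range(sum(lengths)))      # stand-in for the concatenated data vector

# series_start[i]: 1-based start of series i, as in the transformed data block
series_start = [1] * len(lengths)
for i in range(1, len(lengths)):
    series_start[i] = series_start[i - 1] + lengths[i - 1]

def get_series(i):
    """Extract series i; its last element is at series_start[i] + lengths[i] - 1."""
    start = series_start[i] - 1    # 1-based (Stan) to 0-based (Python)
    return D[start : start + lengths[i]]

# Concatenating the extracted series recovers the original vector
assert sum((get_series(i) for i in range(len(lengths))), []) == D
```

The same computation gives you the offsets and lengths to pass from R when you prepare the data.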


Hope that helps and best of luck!