Simulation and inference for a stochastic process with a random time of event in rstan

SuPie · June 30, 2020, 4:29pm

I would like to model a continuous-time (partly) stochastic process X(t) that has a jump at a random time point \tau and that is not directly observed.

The process is deterministic up to \tau

X(t) \equiv 0 \qquad \text{ for } t < \tau,

at \tau, it jumps to a random value m_0

X(\tau) = m_0,

and after \tau, it evolves as a Markov process with a transition probability that depends on a parameter \theta:

\left\lbrace X(t)\, {\large\vert}\, X(s) = x_s\right\rbrace \sim \mathcal{N}_{>0} \left(x_s - \theta \cdot x_s\cdot (t-s), \,\sqrt{\theta \cdot x_s\cdot (t-s)}\right)\qquad \text{ for } \tau\leq s < t.

I assume multiplicative measurement errors for the observations y_1,\dots,y_M at time points t_1,\dots, t_M:

\log(y_k)\sim\mathcal{N}(\log(X(t_k) + 1),\, \sigma)\qquad \text{ for } k=1,\dots M.

Moreover, I assume the following priors

\tau \sim \mathcal{Unif}(0,10), \quad m_0 \sim \mathcal{N}_{>0}(300,300), \quad \theta \sim \mathcal{N}_{>0}(0,5), \quad \sigma\sim\mathcal{Unif}(0.001,2),

where \mathcal{N}_{>0} denotes a truncated normal distribution. Note that the random time point \tau does not need to be one of the observed time point.

I am able to simulate the model in rstan with simulate_stoch_exp_decay_with_random_release_time_fix_param.stan (2.1 KB) :

library(rstan)
stansim <- stan(file = "simulate_stoch_exp_decay_with_random_release_time_fix_param.stan", 
                data = list(M=181, time_points=seq(from=0, to=30, length.out=181), 
                            tau=1.1, m0=200, theta=.11, sigma=.01), 
                algorithm = "Fixed_param", seed = 4242, iter = 1, chains = 1)
y_obs <- extract(stansim, pars = "y_obs")[[1]][1,]
write.table(y_obs, file = "y_obs.txt", col.names=FALSE, row.names=FALSE)

Now, I would like to infer the four parameters \tau, m_0, \theta, \sigma from the simulated dataset y_obs.txt (3.0 KB).

Among others, I’ve tried the following Stan Code (stanmodel_stoch_exp_decay_with_random_release_time_v1.stan (2.1 KB) ):

data {
  int<lower=0> M;
  vector<lower=0>[M] y_obs; // observations
  vector<lower=0>[M] time_points; // time points of the observations
}

// we assume log-normally distributed noise and therefore log-transform the data
transformed data{
  vector[M] log_y_obs;
  log_y_obs = log(y_obs);
}

parameters {
  real<lower=0, upper=10> tau; // time point at which initial amount m0 is released
  real<lower=0> m0; // initial amount released at time tau
  real<lower=0> theta; // decay rate
  real<lower=0.001, upper=2> sigma; // measurement noise parameter
  real<lower=0> x_k[M];// random process states
}

transformed parameters{
  real<lower=0> x[M]; // process states at observed time points
  for (t in 1:M) {
    if (time_points[t] < tau ){
      x[t] = 0; // process states before tau are deterministic
    }else{
      if(time_points[t] == tau ){// tau is an observed time point
        x[t] = m0;
      }else{ // time_points[t] > tau
      x[t] = x_k[t];
      }
    }
  }
}

model {
  // log-likelihood part 1: prod. of transition probabilities of the diffusion process
  int first; // variable to check whether the first transition after tau is yet to come
  first = 1;
  for (t in 1:M) {
    if(time_points[t] >= tau ){
      if(time_points[t] == tau ){ // tau is an observed time point
        first = 0;
      }else{// time_points[t] > tau
        if(first){// transtition prob. between tau and the first observed time point after tau
          x[t] ~ normal(m0 - theta * m0 * (time_points[t] - tau),
                      sqrt(theta * m0 * (time_points[t] - tau)));
          first = 0;
        }else{// transitions prob. between observed time points after tau
          x[t] ~ normal(x[t-1] - theta * x[t-1] * (time_points[t] - time_points[t-1]),
                        sqrt(theta * x[t-1] * (time_points[t] - time_points[t-1])));
        }
      }
    }
  }
  // log-likelihood part 2: multiplicative normal noise
  for (i in 1:M){
    log_y_obs[i] ~ normal(log(x[i] + 1), sigma);
  }
  // priors
  theta ~ normal(0, 5);
  m0 ~ normal(300,300);
}

and in R:

y_obs <- read.table("y_obs.txt")[[1]]
stanfit1 <- stan(file = "stanmodel_stoch_exp_decay_with_random_release_time_v1.stan", 
                data = list(M=181, time_points=seq(from=0, to=30, length.out=181), 
                            y_obs=y_obs), 
                seed = 42424)

This results in mostly divergent transitions (increasing adapt_delta did not help) and the chains just get stuck:

I’m suspecting that sampling with Stan is simply not possible for this kind of model due to the fact that the components of X would have to switch between a deterministic and a stochastic state depending on the value of \tau. But I am not completely sure. Maybe there is some technique that I have not come across yet.

Does anyone have an idea on how to resolves this issue or can confirm my conjecture that sampling with Stan is not possible?

martinmodrak · July 8, 2020, 2:01pm

Hi, the main problem is IMHO that conditioning on a parameter in the way you do on tau makes the posterior discontinuous and this breaks Stan. When you need to handle a discrete parameter (and \tau is a discrete parameter here) in Stan, you need to marginalize it out, this amounts to (very roughly):

parameters {
  simplex[11] tau_p; //tau_p[i - 1] == P(tau = i)
}

model {
  vector[11] partial_lp;
  for(tau in 0:10) {
    partial_lp[tau + 1] = ... //Compute the likelihood assuming a single given value of tau
  }

  //Weight the likelihood by the probability
  target += log_sum_exp(log(tau_p) + partial_lp);
}

Nevertheless even the rest of the model seems weird, so I would start by treating tau as known and making sure the model works in this case and only then move to marginalizing tau out.

In particular, the hard bounds on sigma are likely to be problematic, and you will likely get better results with a soft bound (e.g. real<lower=0> sigma and sigma ~ normal(0,1) or sigma ~ inv_gamma(something suitable), which has low prior probability of both sigma < 0.01 and sigma > 2 but will not hinder sampling.

The way the process is described is also a bit weird - if your values are bounded by 0 from the bottom but have varying means, maybe the process can be better described by a log normal or gamma distribution?

Best of luck with your model!

SuPie · July 9, 2020, 9:24pm

Hi Martin,
Thank you for your reply!

Actually \tau is not supposed to be a discrete parameter. By \tau\sim\mathcal{U}inf(0,10), I meant to refer to the continuous uniform distribution. Sorry for not making that clear! (I did try to emphasise that \tau does not necessarily coincide with one of the time points t_k of the observations though) However, I do realize that the conditioning on \tau is weird, I just don’t know how to prevent it.
Do you think it would make sense to “split” \tau into two parameters by setting \tau = k \cdot\Delta t - s with discrete parameter k being the index of the first observed time point after \tau, continuous parameter s being the time interval between \tau and the observed time point t_k, and \Delta t being the known time step between observations; and then treating k in the way you have suggested?

I had already tried whether inference for the remaining parameters works when \tau is assumed to be known. In that case, things are fine:
E.g. for the simulated data from my first post (where the parameters used for simulating were m_0=200, \theta=0.11, and \sigma=0.01), the Stan code stanmodel_stoch_exp_decay_with_known_release_time.stan (1.9 KB) and in R:

y_obs <- read.table("y_obs.txt")[[1]]
stanfit <- stan(file = "stanmodel_stoch_exp_decay_with_known_release_time.stan", 
                data = list(M=181, time_points=seq(from=0, to=30, length.out=181), 
                            y_obs=y_obs, tau=1.1, index_before_tau=7), 
                seed = 42424)

I obtained the following output:

##          mean se_mean   sd   2.5%    50%  97.5% n_eff  Rhat
## m0    201.439    0.03 1.93 197.73 201.45 205.29  3708 1.001
## theta   0.111    0.00 0.01   0.10   0.11   0.13  3638 1.000
## sigma   0.009    0.00 0.00   0.01   0.01   0.01   485 1.003

Describing the transition probabilities of the process by the normal distribution comes from the Euler-Maruyama approximation for the solution of a stochastic differential equation which is the kind of process that I am trying to model, therefore, I would definitely like to keep that part the way it is.

martinmodrak · July 10, 2020, 8:45am

Oh, I didn’t notice that the actual value of \tau enters the likelihood for the first observation. But it enters the likelihood primarily via comparison to time_points, so I think it’s “main” influence is discrete, so splitting the discrete part and continuous part + marginalizing over the discrete part would probably help.

Note however, that you would need to make 11 copies of all parameters describing the series (m0, theta, x_k), separately for each possible choice of k because you IMHO cannot rule out that the data can be for example explained equally well by low (early) tau + small m0 or high (late) tau + big m0… (which might be the reason for divergences in your original model as that would imply multimodality in your original parametrization)

Oh, so x[t] are not bounded, but only y_obs are bounded? (your initial description seems to assume x[t] are bounded). If so, then I would still think whether the observation model (log_y_obs[i] ~ normal(log(x[i] + 1), sigma);) really represents your understanding on how the data came about…

Does that make sense?

SuPie · July 28, 2020, 12:06pm

Thanks again for your comments! I’ve tried to follow your advice and came up with the following Stan model that spilts the parameter \tau in a discrete and a continuous part and includes several copies of all parameters (one copy for each of the possible values for the discrete part):

// This Stan program defines a stochastic model of exponential decay where the
// initial time point tau is random.
// in this model, the index of the observation after tau is included as a parameter
// tau = index_after_tau * time_step - s
// and we have one set of parameters per possible value of index_after_tau

data {
  int<lower=0> M; // number of observed time points
  vector<lower=0>[M] y_obs; // observations
  vector<lower=0>[M] time_points; // time points of the observations
  real<lower=0> time_step; // time step between observations
  int<lower=2> max_index; // upper bound for index of the observation after tau
  int<lower=2, upper=max_index> min_index; // lower bound for index of the observation after tau
  // the index after tau cannot be one (i.e. that of the first observation)
}


transformed data{
  vector[M] log_y_obs;
  real log_unif; 
  int n_poss_ind;
  n_poss_ind = max_index - min_index + 1; // number of possible values that the index after tau can take
  log_unif = -log(n_poss_ind); // we assume a uniform prior for the index after tau;
  log_y_obs = log(y_obs); // we assume log-normally distributed noise and therefore log-transform the data
}

parameters {
  vector<lower=0, upper=time_step>[n_poss_ind] s; // time step from tau (at which 
  // initial amount m0 is released) to the next observed time point  
  vector<lower=0>[n_poss_ind] m0; // initial amount released at time tau
  vector<lower=0>[n_poss_ind] theta; // decay rate
  vector<lower=0.001, upper=2>[n_poss_ind] sigma; // measurement noise parameter
  vector<lower=0>[n_poss_ind] x[M];// random process states
  simplex[n_poss_ind] index_after_tau_p; // index_after_tau_p[i] == P(index_after_tau = i + min_index - 1)?
}

transformed parameters {
  vector[n_poss_ind] partial_lp;
  partial_lp = rep_vector(log_unif, n_poss_ind); 
  for (index_after_tau in min_index:max_index) {
    //Compute the likelihood assuming a single given value of the index after tau
     for (t in 1:(index_after_tau - 1)) {
       // observation model; before tau x[t] = 0
       partial_lp[index_after_tau - min_index + 1] += 
        normal_lpdf(log_y_obs[t] | 0, sigma[index_after_tau - min_index + 1]);
       partial_lp[index_after_tau - min_index + 1] += 
        normal_lpdf(x[t, index_after_tau - min_index + 1] | 0, 0.00001);
     }
     
     // factor of the likelihood of the process for the time step from the state at 
     // time point tau to the state at the first observed time point
     partial_lp[index_after_tau - min_index + 1] += 
      normal_lpdf(x[index_after_tau] | 
                  m0[index_after_tau - min_index + 1] - 
                    theta[index_after_tau - min_index + 1] * 
                    m0[index_after_tau - min_index + 1] * 
                    s[index_after_tau - min_index + 1], 
                  sqrt(theta[index_after_tau - min_index + 1] * 
                    m0[index_after_tau - min_index + 1] * 
                    s[index_after_tau - min_index + 1]));
    // observation model for the first observation after tau
    partial_lp[index_after_tau - min_index + 1] += 
      normal_lpdf(log_y_obs[index_after_tau] | 
                  log(x[index_after_tau, index_after_tau - min_index + 1] + 1),
                  sigma[index_after_tau - min_index + 1]);
                                        
    for (t in (index_after_tau + 1):M) {
      // factor of the likelihood of the process
      partial_lp[index_after_tau - min_index + 1] += 
          normal_lpdf(x[t,index_after_tau - min_index + 1] | 
                      x[t-1,index_after_tau - min_index + 1] - 
                        theta[index_after_tau - min_index + 1] * 
                        x[t-1,index_after_tau - min_index + 1] * 
                        (time_points[t] - time_points[t-1]), 
                      sqrt(theta[index_after_tau - min_index + 1] * 
                        x[t-1,index_after_tau - min_index + 1] * 
                        (time_points[t] - time_points[t-1])));
      // observation model for the remaining observations after tau
      partial_lp[index_after_tau - min_index + 1] += 
        normal_lpdf(log_y_obs[t] | log(x[t, index_after_tau - min_index + 1] + 1),
                    sigma[index_after_tau - min_index + 1]);
     }
  }
}

model {
  //Weight the likelihood by the probability
  target += log_sum_exp(log(index_after_tau_p) + partial_lp);
  
  // priors
  theta ~ normal(0, 5);
  m0 ~ normal(300, 300);
}

If I provide the correct value for the index after \tau by setting the data min_index as well as max_index to this value, sampling works fine (though s cannot be identified, but that’s not surprising):

y_obs <- read.table("y_obs.txt")[[1]]
stanfit <- stan(file = "stanmodel_stoch_exp_decay_with_random_release_time_incl_index_with_sep_param.stan", 
                data = list(M=181, time_points=seq(from=0, to=30, length.out=181), 
                            y_obs=y_obs, time_step=1/6, max_index=8, min_index=8), 
                seed = 42424)

##             mean se_mean   sd   2.5%    50%  97.5% n_eff  Rhat
## m0[1]    201.837    0.05 2.30 197.71 201.70 206.77  2502 1.001
## theta[1]   0.111    0.00 0.01   0.10   0.11   0.13  2180 1.001
## sigma[1]   0.009    0.00 0.00   0.01   0.01   0.01   398 1.013
## s[1]       0.084    0.00 0.05   0.01   0.08   0.16  1877 1.001

However, if I now try to allow for several possible values for the index after \tau, e.g. max_index=9, min_index=7, then this again results in mostly divergent transitions and the chains getting stuck:
traceplot_tau_split

I’ve looked over the Stan code a thousand times but can’t find my mistake. Do you have an idea what is causing this behaviour?

log_y_obs[i] ~ normal(log(x[i] + 1), sigma); is not the final version of the observation model, it’s just a simplified version that I’m using for now until I have figured out how to deal with \tau which I would like to be the focus of this thread.

martinmodrak · July 31, 2020, 11:44am

That’s a good step in checking your model! I tried running this and I get a few dozen divergent transitions even with min_index == max_index, so I suspect something might be fishy already when \tau is fixed.

Examining the pairs plots it seems that a) some of the x values are quite correlated and b) there can be funnel-like shapes, somewhat similar to what is discussed at: Diagnosing Biased Inference with Divergences - e.g.

So maybe what you need is non-centered parametrization, i.e. make x_raw ~ normal(0, 1) and then, whenever you have x[t,i] ~ normal(mu, sd) compute x[t,i] = x_raw * sd + mu.

Does that make sense?

Only after mitigating the divergences in the easy case I would move to the more complex case of unknown switching point.

Two things that I find strange but are probably not a major trouble

You don’t actually need to treat x as parameters for times before \tau. To avoid this (and increase efficiency) you would need to implement a ragged array, which might be a bit of a hassle E.g. if you had min_index = 2, max_index = 3 and M = 6 you would have the parameters as vector[(6 - 2) + (6 - 3)] x_flat; and use transformed data to compute int x_start_index[2]; so that x_start_index[1] = 1 and x_start_index[2] = 5. (obviously there is a way to make this general). Then, for t >= min_index + i - 1 you could access what currently is x[t, i] as x_flat[x_start_index[i] + t - (min_index + i - 1) - 1 (hopefully I didn’t make an off-by-1 error). In any case, if you have parameters you don’t use, give them something like normal(0,1) distribution, normal( .., 0.00001) can also cause problems sometimes.
Hard bounds on sigma seem unnecessary and potentially problematic. I would just have vector<lower=0>[n_poss_ind] sigma; and have sigma ~ normal(0, 1); - this induces soft constraint so that sigma will mostly be < 2 but should be easier on the sampler.

SuPie · August 12, 2020, 8:41am

Hi! Thank you for your suggestions! I have implemented all three of them.
This is my new Stan model code:

// This Stan program defines a simple stochastic model of exponential decay where the
// initial time point tau is random.
// in this model, the index of the observation after tau is included as a parameter
// tau = index_after_tau_p * time_step - s
// and we have one set of parameters per possible value of index_after_tau_p

// non-centered parametrization

data {
  int<lower=0> M; // number of observed time points
  vector<lower=0>[M] y_obs; // observations
  vector<lower=0>[M] time_points; // time points of the observations
  real<lower=0> time_step; // time step between observations
  int<lower=2> max_index; // upper bound for index of the observation after tau
  int<lower=2, upper=max_index> min_index; // lower bound for index of the observation after tau
  // the index after tau cannot be one (i.e. that of the first observation)
}

transformed data{
  vector[M] log_y_obs;
  real log_unif; 
  int n_poss_ind;
  int n_random_states; 
  int x_start_index[max_index - min_index + 1];
  log_y_obs = log(y_obs); // we assume log-normally distributed noise and therefore log-transform the data
  n_poss_ind = max_index - min_index + 1; // number of possible values that the index after tau can take
  log_unif = - log(n_poss_ind); // we assume a uniform prior for the index after tau;
  n_random_states = (M - max_index + 1) * n_poss_ind + (n_poss_ind - 1) * 
                    n_poss_ind / 2; // the number of random process states = length of x_flat
  // calculate the index within x_flat where the values of the random process states 
  // start for the i-th possible value of index_after_tau
  // i = index_after_tau - min_index + 1
  for (i in 1:n_poss_ind) {
    x_start_index[i] = (i - 1) * (M - i - min_index + 3) + (i - 1) * (i - 2) / 2 + 1;
  }
}

parameters {
  vector<lower=0, upper=time_step>[n_poss_ind] s; // time step from tau (at which 
  // initial amount m0 is released) to the next observed time point  
  vector<lower=0>[n_poss_ind] m0; // initial amount released at time tau
  vector<lower=0>[n_poss_ind] theta; // decay rate
  vector<lower=0>[n_poss_ind] sigma; // measurement noise parameter
  vector[n_random_states] x_flat; // random process states
  simplex[n_poss_ind] index_after_tau_p; //index_after_tau_p[i] == P(index_after_tau = i + min_index - 1)
}

transformed parameters {
  vector[n_poss_ind] partial_lp;
  vector<lower=0>[n_poss_ind] x[M]; // all process states
  partial_lp = rep_vector(log_unif, n_poss_ind);
  for (i in 1:n_poss_ind) { // again: i = index_after_tau - min_index + 1
    //Compute the likelihood assuming a single given value of the index after tau
    for (t in 1:(i + min_index - 2)) { // before tau x[t] = 0
     x[t, i] = 0;
     // observation model
     partial_lp[i] += normal_lpdf(log_y_obs[t] | 0, sigma[i]);
    }
    
    // factor of the likelihood of the process for the time step from the state at 
    // time point tau to the state at the first observed time point
    partial_lp[i] +=  normal_lpdf(x_flat[x_start_index[i]] | 0, 1);
    x[i + min_index - 1, i] =  m0[i] - theta[i] * m0[i] * s[i] + 
                                sqrt(theta[i] * m0[i] * s[i]) * x_flat[x_start_index[i]];

    // observation model for the first observation after tau
    partial_lp[i] +=  normal_lpdf(log_y_obs[i + min_index - 1] | 
                                    log(x[i + min_index - 1, i] + 1), sigma[i]);
                                        
    for (t in (i + min_index):M) { // (index_after_tau + 1):M
      // factor of the likelihood of the process
      partial_lp[i] +=  normal_lpdf(x_flat[x_start_index[i] + t - i - min_index + 1] | 0, 1);
      x[t, i] = x[t-1, i] - theta[i] * x[t-1, i] * (time_points[t] - time_points[t-1]) +
                sqrt(theta[i] * x[t-1, i] * (time_points[t] - time_points[t-1])) *
                  x_flat[x_start_index[i] + t - i - min_index + 1];
      // observation model for the remaining observations after tau
      partial_lp[i] += normal_lpdf(log_y_obs[t] | log(x[t, i] + 1), sigma[i]);
    }
  }
}

model {
  //Weight the likelihood by the probability
  target += log_sum_exp(log(index_after_tau_p) + partial_lp);
  
  // priors
  theta ~ normal(0, 5);
  m0 ~ normal(300, 300);
  sigma ~ normal(0,1);
}

When I now run (still providing the correct value of the index after \tau via min_index and max_index)

y_obs <- read.table("y_obs.txt")[[1]]
init_values <- c(rep(list(list(x_flat=as.array(rep(1, times= 174)))), times = 4))
stanfit <- stan(file = "stanmodel_stoch_exp_decay_with_random_release_time_incl_index_with_sep_param_x_non_centered.stan", 
                data = list(M=181, time_points=seq(from=0, to=30, length.out=181), 
                            y_obs=y_obs, time_step=1/6, max_index=8, min_index=8), 
                seed = 42424,
                control = list(max_treedepth = 15),
                init = init_values)

there are no divergent transitions and the estimates look reasonable:

##             mean se_mean   sd   2.5%    50%  97.5% n_eff  Rhat
## m0[1]    201.798    0.03 2.31 197.71 201.68 206.61  4667 1.000
## theta[1]   0.111    0.00 0.01   0.10   0.11   0.13  1392 1.004
## sigma[1]   0.009    0.00 0.00   0.01   0.01   0.01   448 1.015
## s[1]       0.084    0.00 0.05   0.00   0.08   0.16  6319 1.000

(I’m initializing x_flat to ensure that the sampling starts in a region where the transformed parameters x are non-negative. I couldn’t find a better way to deal with the lower bound on x.)

Still, when I try to allow for several possible values for the index after \tau , e.g. max_index=9, min_index=7 , then this results in mostly divergent transitions and the chains looking messy:

y_obs <- read.table("y_obs.txt")[[1]]
M <- 181
max_index <- 9
min_index <- 7
length_x_flat <- (M - max_index + 1) * (max_index - min_index + 1) + 
              (max_index - min_index) * (max_index - min_index + 1) / 2
init_values <- c(rep(list(list(x_flat=as.array(rep(1, times = length_x_flat)))),
                     times = 4))
stanfit21 <- stan(file = "stanmodel_stoch_exp_decay_with_random_release_time_incl_index_with_sep_param_x_non_centered.stan",
                data = list(M=M, time_points=seq(from=0, to=30, length.out=181),
                            y_obs=y_obs, time_step=1/6, max_index=max_index, min_index=min_index),
                seed = 42424,
                control = list(max_treedepth = 15),
                init = init_values)

traceplot_tau_split_non_centered_param

I am pretty sure that there is no error in the indexing of x or x_flat, as I’ve looked at the values of x at different iterations after warm up and those look reasonable for those chains that sample from a region around the true value (chains 1,2, and 4 for index_after_tau=8, and chain 3 for index_after_tau=9):

But now I don’t know what else to look at. Do have another idea what could still be causing these problems?

martinmodrak · August 16, 2020, 7:44pm

First, I would like to hear a bit more on what the model you are working with is expected to model (i.e. what the data represent and what is the scientific question), after looking into it a bit more, I strongly suspect htere might be more direct ways to express the model and avoid some of the pains. If all you wanted to express is an exponential decay with some noise, I guess we could find better representations.

In particular, the model is currently a bit incoherent - for example you do not enforce that x[t,i] + 1 > 0, so the sampler hits this boundary repeatedly, hindering sampling (this is actually reported in Stan output as a warning - it usually pays off to not ignore those warnings). The main cuplrit is probably that the model allows large theta so that theta * (time_points[t] - time_points[t-1]) > 1,

The fact that you needed max_treedepth = 15 is usually a huge warning that something is fishy…

SuPie · August 21, 2020, 11:11am

My question is really rather about “can I use Stan to infer parameters (in particular the initial time point) for this kind of model?” than “what kind of model can I use to infer parameters from a particular data set with Stan?”.
The model that I have described in my first post is only a simplified version of the model that I actually want to use in the end. In the actual model, the process X(t) will have several components of which only a subset will be observed. But if I can’t get the sampling to work for this simplified model from simulated data, I won’t even try for the more complex one.

I cannot see this warning. Where would I have to look for it (in R)?
Also, how could I enforce x[t,i] + 1 > 0? In the declaration vector<lower=0>[n_poss_ind] x[M];, I had already included the lower bound that would also ensure x[t,i] + 1 > 0. I had also tried to include bounds for x_flat that depend on other parameters (by declaring vector<lower=0>[n_random_states] x_flat and having another transformed parameter x_flat2 that is shifted), but couldn’t get this to work.

Could you explain or point my to some resource explaining why this is problematic?

martinmodrak · August 21, 2020, 12:57pm

What I am trying to say is that as I try to debug the math, I find it hard to believe there is an actual physical process that is described by this mathematical formulation, because there is a bunch of implied constraints on the parameters and it is hard to make them all satisfied at once and not introduce geometry that would break the sampler. So I think that at this point, it would be useful to take a step back and think: does the math actually represent the problem well? Maybe there is a different mathematical formulation that would model the process as well - or even better - and that would be easier to handle in Stan. And for that I would need to understand at least a bit of the real world phenomena you are trying to model.

If you insist that you 100% need this exact mathematical formulation, I’ll be glad to try to help you make it work in Stan, but my experience is that quite frequently you can gain a lot by rethinking the link between the model, reality and the goals of your project.

If the model has issues with the default max_treedepth, it means it is difficult to sample. I would also expect the sampling to take really long. And there is the Folk theorem of statistical computing (which is not a theorem, just an observation) that if your model runs slowly/has divergences/… it often turns out that it is also a suboptimal model for the data you have.

SuPie · August 21, 2020, 1:48pm

Thank you for your advice! I understand your concerns. Nevertheless, as I have pointed out before, I am mostly interested in the (somewhat theoretical) question whether it is possible in Stan to infer a random initial time point specifically for such an (approximated) SDE model. That’s why I’m using only simulated data.

I’m using simulated data that was generated from the same model that I would like to use to infer the parameters, so in that case, does the need for a higher max_treedepth still indicate problems?

martinmodrak · August 21, 2020, 2:15pm

OK, I see. Still it is possible that there are multiple alternative formulations of the model. I admit I have no knowledge of SDE theory, but maybe you can separate the process into a deterministic and stochastic part, i.e. that after the initial timepoint, it would be a sum of exponential decay and say a Gaussian process? Are you sure \theta is plausibly somewhere between 0 and 10? Where does the weird \log(y_k)\sim\mathcal{N}(\log(X(t_k) + 1) come from?

EDIT: Additionally, truncated normal is a unwieldy distribution (especially if you allow negative mean as the model does). Couldn’t the model be reformulated to have linear decrease on the unobserved space (x) and lognormal observation model?

martinmodrak · August 24, 2020, 9:18am

So I spend a bit more time with your model and I am still puzzled by the markov process you describe. One thing I would expect is that the unobserved process is the same regardless of the time points I chose to observe, so if I use f_{t,s}(x_t, x_s) for the density of X(t) = x_t given X(s) = x_s and I take s < q < t it should hold that:

f_{t,s}(x_t, x_s) = \int_0^\infty f_{t,q} (x_t, x_q) f_{q,s} (x_q, x_s) d x_q

However, this doesn’t seem to hold a quick example simulating the process in R, using the same m_0, \theta over the same overall timeframe, but increasing the amount of time points I measure in between:

library(truncnorm)
library(tidyverse)
set.seed(52244)
m0 <- 200
theta <- 0.8
max_time <- 3
N <- 30


res <- list()
i <- 1
for(time_step in c(0.1, 1.5)){
  times <- seq(0, max_time, by = time_step)
  for(n in 1:N) {
    x <- numeric(length(time))
    x[1] <- m0
    for(t in 2:length(times)) {
      x[t] <- rtruncnorm(1, a = 0, mean = x[t -1] - theta * x[t - 1] * time_step , sd = sqrt(theta * x[t - 1] * time_step))
    }
    res[[i]] <- data.frame(time_step = time_step, run = n, time = times, x = x)
    i <- i + 1
  }
}

res_df <- do.call(rbind, res)

res_df %>% ggplot(aes(y = x, x = time, color = factor(time_step), group = interaction(time_step, run))) + geom_line(alpha = 0.5)

One can make the difference arbitrarily big by assuming even bigger diferences in time_step.

Obviously this isn’t against the law and you are free to define a your process in whichever way you like but I am once again concerned whether we are focusing on the correct process here. As I don’t understand the background, I don’t know which of the process assumptions you make are fundamental and which are auxiliary and this makes it hard for me to figure out a good way to (re)parametrize the process to make it play neatly with Stan.

I see two possible ways forward:
a) I can (try to) show how I would use Stan to model a simpler but similar model that I understand
b) We can dig into the math of your actual problem and try to understand better what is going on.

SuPie · August 27, 2020, 4:31pm

Thank you so much for your efforts!!

You are right! That’s because the transition density for an SDE is generally not known, but can only be approximated. Here, we use the Euler-Maruyama scheme that approximates the transition density by a normal density. The approximation is only valid for small time steps. Moreover, the true process described by the underlying SDE can only take non-negative values. That’s why the normal density is truncated.

what you are suggesting is similar to the so-called linear noise approximation; however, that is a different kind of approximation which I would not like to use because it’s less general.

I expect \theta to be pretty small, but don’t really know an upper bound. That’s why I chose a rather broad prior.

In the final version of my model, (one component of) the process X(t) will not be observed directly, but the observation is a linear transformation (aX(t)+b, where a,b>0 also need to be inferred) that is corrupted by multiplicative measurement error. For now, I just set a=b=1. However, I guess this part is not essential to my question of whether I can infer the initial time point. Therefore, we might as well assume y_k = X(t_k) + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2) for now.

As I have tried to point out before: I am interested in the question whether it is possible in Stan to infer the random initial time point as described in my first post for a general SDE model (approximated by the Euler-Maruyama scheme). If the answer is ‘no’, that’s fine. In that case, I would just try to better understand the reasons.

Do you happen to know the answer to any of the open questions from my previous replies?:

martinmodrak · August 27, 2020, 6:32pm

I honestly don’t know. It seems borderline possible. The process given the initial timepoint looks a bit like random walk model and those are done quite a lot. I am around 90% confident it is possible to make it work for the SDE approximation you describe without extreme effort. For the other part, I am also around 90% confident that for a simple well behaved process the scheme we implemented to infer the time point would work. What I am uncertain about is whether the interaction of a complex process and unknown initial timepoint willl be tractable. The model is definitely a hard one - as a reference point, even quite innocous-looking ODEs can turn out to be a challenge and the stochastic part is unlikely to make it easier.

There are also other ways to encode the time point that may or may not behave better - most notably, you could use some smooth approximation to the step function (e.g. sigmoid) and take the process as a multiplication of sigmoid and the decay. This however is likely to have other issues… In both cases getting the model to work reliably and efficiently with a known time point would be a necessary first step.

Ok, I think I am starting to understand what this is about :-) If I understand it correctly, the target system is fixed, but details of the approximation are not, right? In my limited experience with ODE modelling, when a process is positive, you would either model the log of the process or apply a log1p_exp (aka Softplus). On the other hand, truncated normal can be badly behaved computationally. I would actually expect the approximation to work more naturally on the log of the process as it seems to assume unbounded support for the process. That would obviously change the behavior of the approximation, so you need to check if it makes sense for your case (and re-derive the conditional probability for the log).

Back to the other points:

I see them in the viewer pane in RStudio or directly in console when I run with chains = 1

That can be tricky - I would prefer directly modelling the log as discussed above.

Yes. If you get divergences with lower treedepth, it means the geometry of the model is somehow degenerate and the sampling, while likely accurate with larger treedepeth (and hence smaller step size) is inefficient. Since we need to add additional complexity on top of the “known timepoint” base case which almost certainly would make the geometry less well behaved, having the base case behave nicely is important. Or, more precisely, if the more complex model doesn’t work, it is unlikely it can be made to work without polishing the base case.

Am I making sense here?

SuPie · September 30, 2020, 10:01am

Again, thank you for suggestions and sorry for my late reply!

I had already considered using a sigmoid function instead of a step/jump, but came to the conclusion that it wouldn’t solve the problem. The values of the process between the observed time points do not enter into the calculation of the posterior anyway. Only the time point when the process reaches m_0 would be of interest. So I don’t think the jump is the problem, but rather the fact that the process switches from a deterministic to a stochastic evolution.

For a known time point, sampling worked perfectly fine (see my second post). Also after splitting the time point into the discrete index and the remaining continuous part and including your other suggestions, the model in this post worked when the index was known, i.e. there were no divergent transitions and the estimates looked reasonable. (Unfortunately, I was not able to reproduce the warnings that you have mentioned).

Since I did not get any divergent transitions with lower treedepth, but only the warning that the majority of the transitions after warmup exceeded the maximum treedepth of 10. Is that still problematic?

Thanks for this suggestion, I’ll try to further investigate this soon.

martinmodrak · October 5, 2020, 2:44pm

Maybe there was some discrepancy between the exact versions of code /data you and I’ve been running (also might be on my side, so sorry if I messed up).

Yes, that still means something is not completely right and the geometry is difficult for the sampler (though it is usually less severe than divergences). So if the model as a whole doesn’t work, I would definitely investigate if I can get rid of the threedepth warnings.

Best of luck with further steps!

Topic		Replies	Views
Change from global model fit to one-step-ahead prediction Modeling rstan , fitting-issues	7	1497	May 25, 2020
Extended Kalman filter, problem estimating process and observation error size Modeling rstan , specification	7	1014	July 14, 2022
Brownian motion with measurement noise Modeling	11	1592	April 26, 2021
How to consider a process in the model without it being a parameter or data? (Spatio-temporal process)) Modeling rstan , hierarchical-model	3	257	March 12, 2024
Problem computing integrals Modeling rstan , fitting-issues , integrator	5	939	September 18, 2020

Simulation and inference for a stochastic process with a random time of event in rstan

Related topics