Constrained latent non-centered time series


#1

Hello!

I’m currently trying to build a model with an AR(1) time series in the parameter space, as my simpler model (with simple hierarchical pooling on each of the regression parameters) requires additional regularisation (frequent across chain disagreement as to the location of the posterior).

I am getting into trouble for a number of reasons:

  • Based on the advice I have seen on the various stan discussion forums, I am parameterising this time series by the innovations. However, I have some constraints in place:
    • I have positivity constraints that are motivated from the physical nature of the process / data.
    • I have an ordering constraint on the midpoint of the tanh component for identifiability.
  • I cannot, ahead of time, constrain the innovations to respect these constraints (at least, I do not think I can?)

As such, it is easy to generate an initial set of innovations that lead to parameter values that violate the above constraints (leading to the model being difficult to initialize). Specifying my own initial values doesn’t help a great deal as (i suspect) appropriate values/proposals for the innovations are difficult to generate.

The combination of this parameterisation and set of constraints is proving difficult to deal with.
I am beginning to suspect that I am pushing the limits of what this parametric model can fit, and a major reparameterization / new model is on my horizon, but I’d really like to make this one work. Are there any major suggestions for ways to constrain / parameterise the innovations differently so that I can sample this model?

functions {
  real ar_next(real prev, int index, vector kappa, vector phi, vector innovation) {
    real result;
    result = kappa[index] + (phi[index] * prev) + innovation[index];
    return(result);
  }
}

data {
  // Number of depths
  int<lower = 0> n_depths;
  
  // Number of time points
  int<lower = 0> n_times;
  
  // Vector of depths measurements were taken at
  vector[n_depths] depths;
  
  // Matrix of measurements, each row is a set of measurements at one time point
  // These measurements must be on the ORIGINAL scale in order for the priors 
  // to make any kind of physical sense.
  matrix[n_times, n_depths] densities;
}

parameters {  
  real<lower = 0> beta_zero_t0;
  real<lower = 0> beta_one_t0;
  real<lower = 0> beta_three_t0;
  real<lower = 0> beta_six_t0;

  // beta_two and beta_five
  // need to be ordered for identifiability.
  positive_ordered[2] beta_midpoint_t0;

  // ar parameters
  // they are all positive (even though the _could_ be negative) as there is
  // zero empircal chance they are negative
  vector<lower = 0, upper = 1>[6] phi;
  
  // ar constant, might require stabilising.
  // these processes all have positive mean, so kappa must be positive
  vector<lower = 0>[6] kappa;

  // ar error distribution
  vector<lower = 0>[6] beta_sigma;

  // innovation container
  // I don't know how to make these innovations satisfy the positive_ordered
  // constraint that we require????
  matrix[n_times - 1, 6] innovations;


  real<lower = 0> sigma_curve;
}

transformed parameters {
  matrix[n_times, n_depths] fitted_values;
  
  real<lower = 0> beta_zero[n_times];
  real<lower = 0> beta_one[n_times];
  real<lower = 0> beta_three[n_times];
  real<lower = 0> beta_six[n_times];

  positive_ordered[2] beta_midpoint[n_times];

  beta_zero[1] = beta_zero_t0;
  beta_one[1] = beta_one_t0;
  beta_three[1] = beta_three_t0;
  beta_six[1] = beta_six_t0;  

  beta_midpoint[1, 1] = beta_midpoint_t0[1];
  beta_midpoint[1, 2] = beta_midpoint_t0[2];

  for (tt in 1:n_times) {
    if (tt >= 2) {
      beta_zero[tt] = ar_next(beta_zero[tt - 1], 1, kappa, phi, innovations[tt - 1, ]');
      beta_one[tt] = ar_next(beta_one[tt - 1], 2, kappa, phi, innovations[tt - 1, ]');
      beta_midpoint[tt, 1] = ar_next(beta_midpoint[tt - 1, 1], 3, kappa, phi, innovations[tt - 1, ]');
      beta_three[tt] = ar_next(beta_three[tt - 1], 4, kappa, phi, innovations[tt - 1, ]');
      beta_midpoint[tt, 2] = ar_next(beta_midpoint[tt - 1, 2], 5, kappa, phi, innovations[tt - 1, ]');
      beta_six[tt] = ar_next(beta_six[tt - 1], 6, kappa, phi, innovations[tt - 1, ]');
    }
   fitted_values[tt,] = (beta_zero[tt] - beta_one[tt] * (tanh((depths + beta_midpoint[tt, 1]) / beta_three[tt]) + tanh((depths + beta_midpoint[tt, 2]) / beta_six[tt])))'; 
  }
  
}

model {
  for (tt in 1:n_times) {
    // vectorwise over individuals.
    densities[tt,] ~ normal(fitted_values[tt,], sigma_curve);
    if (tt < n_times) { // otherwise we will out-of-bounds the innovations matrix
      innovations[tt, ] ~ normal(0, beta_sigma); 
    }
  }

  // ar priors
  phi ~ beta(3, 1);
  // I really have no idea what on earth this should be
  kappa ~ normal(100, 500);
  // this might be too tight? We shall see, again unsure as to the 
  // scale of these in this time series structure.
  beta_sigma ~ normal(0, 10);

  // initial value priors
  beta_zero_t0 ~ normal(1023, 2);
  beta_one_t0 ~ normal(1, 2);

  beta_three_t0 ~ normal(42, 3);
  beta_six_t0 ~ normal(45, 3);

  beta_midpoint_t0[1] ~ normal(50, 5);
  beta_midpoint_t0[2] ~ normal(150, 5);

  // It's tiny.
  sigma_curve ~ normal(0, 0.1);

}

Unfortunately I cannot share the original data, however I have attached an R script that generates data that is very similar. I have also attached an R script with some appropriate initial values / model fitting code.

Some other tidbits:

  • I can’t rescale the data, as the parameter estimates have physical interpretation / priors, and I’m not sure how to translate them onto a different scale, subject to the nonlinear regression function.
  • I have fitted simpler models & have the typical swath of diagnostics, if you’re interested in how I’ve ended up needing time dependency in the parameter space.

Unrelated: I’ve been using Stan for 3 years now, and 99.9% of the time it’s been great! So I’d just like to say thanks for all the work that’s gone into it!

datagenerator.R (2.0 KB)
fitter.R (1.8 KB)


#2

For posterity: I ended up imposing the time dependency on the log scale, and switching to an additive structure inside the \text{tanh} function to avoid the ordering constraint. Specifically, the model is now:

\pmb{y}_t = \beta_{0, t} - \beta_{1, t} \left(\text{tanh}\left(\frac{\pmb{z}_t + \beta_{2, t}}{\beta_{3, t}}\right) + \text{tanh}\left(\frac{\pmb{z}_t + \beta_{2, t} + \beta_{4, t}} {\beta_{5, t}} \right) \right) + \pmb{\varepsilon}_{t} \\ \pmb{\beta}_{*, t} = \text{exp}\left\{\pmb{\beta}^{'}_{*, t} \right\} \\ \pmb{\beta}^{'}_{*, t} = \pmb{\kappa}_{\beta^{'}_{*}} +\pmb{\phi}_{\beta^{'}_{*}} \pmb{\beta}^{'}_{*, t - 1} + \pmb{\varepsilon}_{\pmb{\beta^{'}_{*}}, t} \\ \pmb{\varepsilon}_{t} \sim \text{N} \left(0, \sigma^2 \right) \qquad \qquad \pmb{\varepsilon}_{\pmb{\beta^{'}_{*}}, t} \sim \text{N} \left(0, \sigma^2_{\pmb{\beta^{'}_{*}}} \right)

This also made the stan code a lot nicer, and I think that it should be fittable with some careful initialisation. Note that the variable names in the stan code don’t match the model above.

data {
  // Number of depths
  int<lower = 0> n_depths;
  
  // Number of time points
  int<lower = 0> n_times;
  
  // Vector of depths measurements were taken at
  vector[n_depths] depths;
  
  // Matrix of measurements, each row is a set of measurements at one time point
  // These measurements must be on the ORIGINAL scale in order for the priors 
  // to make any kind of physical sense.
  matrix[n_times, n_depths] densities;
}

parameters {  
  // initial values for time.
  vector[6] beta_prime_t0;

  // ar parameters
  // they are all positive (even though the _could_ be negative) as there is
  // zero empircal chance they are negative
  // now that innovations are on the log scale, this assumption might need
  // revisiting.
  vector<lower = 0, upper = 1>[6] phi_prime;
  
  // ar constant, might require stabilising.
  // these processes all have positive mean, so kappa must be positive
  vector[6] kappa_prime;

  // ar error distribution
  vector<lower = 0>[6] beta_prime_sigma;

  // innovation container
  // I don't know how to make these innovations satisfy the positive_ordered
  // constraint that we require????
  matrix[n_times - 1, 6] innovations_prime;

  // error variance (original scale)
  real<lower = 0> sigma_curve;
}

transformed parameters {
  matrix[n_times, n_depths] fitted_values;
  matrix[n_times, 6] beta_orig;
  vector[6] t0_prior_means = log([1023, 1.1, 50, 42, 100, 45])';
  
  beta_orig[1, ] = exp(beta_prime_t0)';
  
  for (tt in 1:n_times) {
    if (tt >= 2) {
      beta_orig[tt, ] = exp(kappa_prime + phi_prime .* log(beta_orig[tt - 1,]') + innovations_prime[tt - 1,]')';
    }
    
    fitted_values[tt, ] = (beta_orig[tt, 1] - beta_orig[tt, 2] * (tanh((depths + beta_orig[tt, 3]) / beta_orig[tt, 4]) + 
                                                                  tanh((depths + beta_orig[tt, 3] + beta_orig[tt, 5]) / beta_orig[tt, 6])))';
  }
}

model {
  for (tt in 1:n_times) {

    densities[tt,] ~ normal(fitted_values[tt,], sigma_curve);

    if (tt >= 2) {
      innovations_prime[tt - 1, ] ~ normal(0, beta_prime_sigma);
      // print("innovations_prime: ", innovations_prime[tt - 1,]);
    }
  }

  // ar priors
  // that these should be closer to 1 than to zero. 
  phi_prime ~ beta(3, 1);

  // I really have no idea what on earth this should be
  kappa_prime ~ normal(0, 3);

  // this might be too tight? We shall see, again unsure as to the 
  // scale of these in this time series structure.
  beta_prime_sigma ~ normal(0, 2);

  // initial value priors
  // log scale
  beta_prime_t0 ~ normal(t0_prior_means, log(3));

  // It's tiny.
  sigma_curve ~ normal(0, 0.1);

}

#3

More updates for posterity:
Unsurprisingly, this model takes a really really really really long time to sample (excluding the fact I lost a model run to the magic of computers: https://github.com/stan-dev/rstan/issues/520 ). 90 warmup and 10 samples takes a large, highly variable amount of time (the number of lp evaluations is pretty bonkers):

> rstan::get_elapsed_time(test_fit)
        warmup    sample
chain:1 191929 16346.000
chain:2 112053     0.235
chain:3 222754 30129.100

> res <- get_sampler_params(test_fit)
> lapply(res, function(X) {
+   t <- sum(X[, 4])
+   names(t) <- colnames(X)[4]
+   t
+ })
[[1]]
n_leapfrog__ 
    57488258 

[[2]]
n_leapfrog__ 
    28316721 

[[3]]
n_leapfrog__ 
    72956735 
  

I’ve tried rescaling the data to avoid overly specific prior means, but that (unsurprisingly) doesn’t make much of a difference. I think most of the sampling difficulties lie in the varying scales of \pmb{\varepsilon}_{\pmb{\beta^{'}_{*, t-1}}}:

The above plot is stratified by chain, and doesn’t really account for a lot of things, but does show that the bulk of the means of the innovations are very close to zero (on the log scale), whilst some of them really aren’t. Coupled with the following plot of posterior standard deviations of the innovations really shows the differences in scale:

My (primitive) understanding of how NUTS adapts to sample from the posterior leads me to believe that this “difficult parameter space geometry” is a contributing factor to the long run time, but I can’t see any way to regularise the scale of the innovations across each time point, so this might be the end of the road for this model.


#4

Sorry this went so long without an answer. We’re getting a bit overwhelmed with the volume.

Have you tried modeling the values directly rather than the innovations? It should work out to the same thing to do:

y[n] ~ normal(y[n - 1], sigma)

as to do:

(y[n] - y[n-1]) ~ normal(), 1);

How complicated are the constraints? Positivity is easy enough.

Glad to hear it.

One thing you can do is take things like this:

beta_six_t0 ~ normal(45, 3);

and use:

transformed parameters {
  real beta_six_t0 = 45 + 3 * beta_six_t0_std;
  ...
model {
  beta_six_t0_std ~ normal(0, 1);

You may be having problems with initialization otherwise, though it looks like you’re doing that by hand.


#5

Hmm. I missed your response here, because Discourse made it look like there was only one post in this thread. Thanks for reporting back.

Are you expecting that multimodality or might it be an artifact of poor mixing?

You could try putting a tighter prior on those \varepsilon to try and control the computations.

You can also try starting with a smaller step size and upping the acceptance target. This might make some iterations slower, but it should even them out some.


#6

Appreciate the response, things sure do seem busy around here!

I think I am expecting the multimodality, but can’t be sure. I have a simpler version of this model, with no time dependency and a hierarchical mean on each of the \beta_{*}. I took the posterior distributions of \beta_{*}, for each time point and plotted the boxplots, so that I get some kind of pseudo-time series that I can inspect (I’ve attached one as an example: 2018-04-09_beta-three-through-time-boxplot.pdf (46.0 KB)
). This plots gives me a few reasons for concern:

  • There is way more time dependency here than an AR(1) can provide, if I want to stick with an AR structure, I should probably include more terms. (Fitting an AR(k) model to the posterior medians suggests that k should be at least 5)

  • The corresponding AR assumption that \sigma^2_{\beta^{'}_{*}} is constant over time is very likely to be violated.

I would put my money on the violation of this assumption being the culprit for the multimodality in the posterior, but I’m no good at gambling.

Really I think I’m just bumping up against the disagreement between my model and my data, and I’ll probably have to think more about the generative process to come up with a better model.


#7

The Folk Theorem!