Semilocal Linear Trend Model, opinions/suggestions welcome

I’ve been playing around with a Stan version of the semilocal linear trend model, as found in the R package bsts. The model scales automatically to the data and matches bsts’s fit fairly well. Still, I’m hoping to get some feedback about:

  1. starting points mu[1] and nu[1],
  2. choice of priors (I offer my thoughts in line), and
  3. possible ways to eliminate the few sporadic divergences, since the simple fix of increasing adapt_delta is already in use.

data {                                                                                   
  int <lower=1> T;                                                                       
  vector[T] y;                                                                           
}                                                                                        
transformed data {                                                                       
  real sd_y = sd(y);                                                                     
}                                                                                        
parameters {                                                                             
  real<lower=0> sigma_y;                                                                 
  vector[T] gamma;                                                                       
  real<lower=0> sigma_gamma;                                                             
  vector[T] zeta;                                                                        
  real<lower=0> sigma_zeta;                                                              
  real eta;                                                                              
  real<lower=-1, upper=1> phi;                                                           
}                                                                                        
transformed parameters {                                                                 
  vector[T] mu;                                                                          
  vector[T] nu;                                                                          
                                                                                         
  mu[1] = y[1] + sigma_gamma * gamma[1];                                                 
  nu[1] = zeta[1];                                                                       
  for (t in 2:T) {                                                                       
    mu[t] = mu[t-1] + nu[t-1] + sigma_gamma * gamma[t];                                  
    nu[t] = eta + phi * (nu[t-1] - eta) + sigma_zeta * zeta[t];                          
  }                                                                                      
}                                                                                        
model {                                                                                  
  // priors                                                                              
  //sigma_y ~ exponential(1 / sd_y);                                                     
  sigma_y ~ normal(0, 2.5 * sd_y);                                                       
  gamma ~ normal(0, 1);                                                                  
  // sigma_gamma ~ gamma(2, 1 / sd_y);                                                   
  sigma_gamma ~ normal(0, 2.5 * sd_y);                                                   
  zeta ~ normal(0, 1);                                                                   
  // sigma_zeta ~ gamma(2, 1 / sd_y);                                                    
  sigma_zeta ~ normal(0, 2.5 * sd_y);                                                    
  eta ~ normal(0, sd_y);                                                                 
  phi ~ normal(0, 0.5);                                                                  

  // likelihood                                                                          
  y ~ normal(mu, sigma_y);  
}                                                                                        
generated quantities {                                                                   
  vector[T] y_pred;                                                                      
  for (t in 1:T)                                                                         
    y_pred[t] = normal_rng(mu[t], sigma_y);                                              
} 

If you’re interested, here’s some R code to get you started. The code below compares some features from Stan and bsts fits across two data sets, one from the datasets package and one from bsts.

library(rstan)                                                                           
options(mc.cores = parallel::detectCores())                                              
rstan_options(auto_write = TRUE)                                                         
library(bsts)                                                                            
                                                                                         
sm <- stan_model('semilocal_linear_trend.stan')                                          
                                                                                         
## Pick one data set to fit: Nile (from datasets) or rsxfs (from bsts)

## Nile
y <- as.vector(Nile)                                                                     
T <- length(y)                                                                           
x <- 1:T                                                                                 
                                                                                         
## rsxfs                                                                                 
data(rsxfs)                                                                              
                                                                                         
y <- as.vector(rsxfs)                                                                    
T <- length(y)                                                                           
x <- 1:T                                                                                 
                                                                                         
### Stan                                                                                 
sfit <- sampling(sm, data=list(y = y, T = T),                                            
                 control=list(adapt_delta=0.99,                                          
                              max_treedepth=15))                                         
                                                                                         
### bsts                                                                                 
m <- list()                                                                              
m <- AddSemilocalLinearTrend(m, y=y)                                                     
bfit <- bsts(y, m, niter=2000)                                                           
                                                                                         
### plot                                                                                 
plot(x, y, type='l')                                                                     
lines(x, colMeans(bfit$state.contributions), col='red')                                  
lines(x, colMeans(extract(sfit, "y_pred")$y_pred), col='blue')                           
                                                                                         
plot(density(bfit$trend.slope.mean), col='red')                                          
lines(density(extract(sfit, "eta")$eta), col='blue')                                     
                                                                                         
summary(bfit$trend.slope.mean)                                                           
summary(extract(sfit, "eta")$eta) 

Thanks in advance for any and all comments.


This is another nice case study topic! Especially given there’s a popular package in R that already implements the functions.

I’m not familiar with these models, but @andrewgelman and @bgoodri and @vehtari should be more familiar (I’m calling them out so they don’t think this is a complete answer!).

I see that bsts allows spike-and-slab priors on the AR coefficients. Presumably it also allows other features that you matched?

For a handful of AR components, it would be possible to marginalize out the mixture components of a proper spike-and-slab model, but probably inefficient. This might be something nice to compare to Piironen and Vehtari’s horseshoe approach.
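For what it’s worth, a marginalized version typically replaces the point-mass spike with a very narrow normal and mixes the two log densities. A minimal sketch for a single coefficient (the names, scales, and uniform inclusion-probability prior here are illustrative, not anything from bsts):

parameters {
  real beta;                      // one AR coefficient
  real<lower=0, upper=1> theta;   // slab (inclusion) probability, uniform prior
}
model {
  // spike-and-slab as a two-component normal mixture, with the discrete
  // inclusion indicator marginalized out
  target += log_mix(theta,
                    normal_lpdf(beta | 0, 1),      // slab: coefficient active
                    normal_lpdf(beta | 0, 0.01));  // spike: effectively zero
}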

I can recommend recoding the transformation for efficiency:

  1. Store that 2.5 * sd_y and reuse. This will be minor. Not sure where the scaling comes from.

  2. You can even non-center a single parameter to remove the prior dependency between sigma_gamma and sd_y with:

sigma_gamma_z ~ normal(0, 1);
sigma_gamma = 2.5 * sd_y * sigma_gamma_z;
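To spell that out, here is roughly how those pieces would be distributed across blocks, with the scale factor computed once in transformed data (a sketch; because sigma_gamma_z is constrained positive, the standard normal prior is a half-normal):

data {
  int<lower=1> T;
  vector[T] y;
}
transformed data {
  real scale_gamma = 2.5 * sd(y);   // store once and reuse
}
parameters {
  real<lower=0> sigma_gamma_z;      // unit-scale version of sigma_gamma
}
transformed parameters {
  real sigma_gamma = scale_gamma * sigma_gamma_z;
}
model {
  sigma_gamma_z ~ normal(0, 1);     // implies a half-normal(0, scale_gamma) on sigma_gamma
}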

Hi, I don’t really have anything to add on this except to suggest simulating some fake data from the prior and see what these simulated time series look like.
Andrew

Haven’t matched much else yet. Thinking I should clean up the simpler model, in terms of getting rid of the sporadic divergences and what not, before adding complexity.

Great thoughts. Along with some other slight changes, the transformed parameters block is now

transformed parameters {                                                                 
  vector[T] mu;                                                                          
  vector[T] nu;                                                                          
                                                                                         
  real sigma_y = scale_error * sigma_y_z;                                           
  real sigma_gamma = scale_error * sigma_gamma_z;                                   
  real sigma_zeta = scale_error * sigma_zeta_z;                                     
  real eta = scale_slope * eta_z;                                                   
                                                                                         
  mu[1] = y[1] + sigma_gamma * gamma[1];                                                 
  nu[1] = sigma_zeta * zeta[1];                                                          
  for (t in 2:T) {                                                                       
    mu[t] = mu[t-1] + nu[t-1] + sigma_gamma * gamma[t];                                  
    nu[t] = eta + phi * (nu[t-1] - eta) + sigma_zeta * zeta[t];                          
  }                                                                                      
} 

with updated scaling scale_error = 10 * sd(y) and scale_slope = 2.5 * sd(y). I want the priors to scale automatically to the magnitude of the data y, and then to be spread out.

For more details on the model above and the simulations below, see the GitHub repository I set up: roualdes/semilocal-linear-trend (Semilocal linear trend model in Stan).

Short answer, here’s a plot of some simulated data. These data were simulated from this model with the slope parameter \eta \in \{-1, 0.1, 1\}, \phi \sim N(-0.5, 0.25), and all standard deviations set to 100, for some flavor. The solid lines below are the means of the simulated data at each time point.
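For concreteness, a stripped-down simulator for this generative process could look like the following Stan program, run with rstan’s algorithm = "Fixed_param" (a sketch only, not the script from the repo; the initial level is started near zero since there is no y[1] to anchor on, and parameter values are passed in as data):

data {
  int<lower=1> T;
  real eta;                     // e.g. -1, 0.1, or 1
  real<lower=-1, upper=1> phi;  // e.g. a draw from N(-0.5, 0.25)
  real<lower=0> sigma_y;        // e.g. 100
  real<lower=0> sigma_gamma;    // e.g. 100
  real<lower=0> sigma_zeta;     // e.g. 100
}
generated quantities {
  vector[T] mu;
  vector[T] nu;
  vector[T] y_sim;
  mu[1] = sigma_gamma * normal_rng(0, 1);  // no y[1] available when simulating
  nu[1] = sigma_zeta * normal_rng(0, 1);
  for (t in 2:T) {
    mu[t] = mu[t-1] + nu[t-1] + sigma_gamma * normal_rng(0, 1);
    nu[t] = eta + phi * (nu[t-1] - eta) + sigma_zeta * normal_rng(0, 1);
  }
  for (t in 1:T)
    y_sim[t] = normal_rng(mu[t], sigma_y);
}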

I then fit this model to the simulated data. The dashed lines in the plot above are 95% credible intervals from this model fit to the simulated data. Below is a plot of the posterior densities of the slope parameters \eta, one for each value of \eta from the simulated data. The true values of \eta are drawn as short, vertical lines along the x-axis.

Some things I learned from this recommendation to simulate data:

  1. The generating process is much more interesting when the scale of the error terms is much larger than the scale of the slope \eta.

  2. Fitting this model, even to the simulated data, was not easy for HMC: double- to triple-digit divergences even with adapt_delta = .95, transitions beyond max_treedepth = 15, and occasionally low BFMI across some of the four chains.

  3. The true parameter values are well captured by this model when fit to the simulated data.

I’m happy with the fits of the model, but displeased with the poor diagnostics of the fit. It would be nice to find a way to reparameterize this model to improve the diagnostics.

Thanks for your input thus far.

It’s best to check the variances as well as the means of the output. You can use the new simulation-based calibration (SBC) techniques (based on the original Cook-Gelman-Rubin diagnostics).

If you want more feedback on efficiency, could you please post the whole model as it stands? Thanks.

I haven’t gotten to the simulation-based calibration paper yet, but it’s on my reading list.

Re variances: oof. The variances are not recovered.

My latest thoughts are to draw more heavily from Durbin and Koopman’s Time Series Analysis by State Space Methods and compute more of the model analytically. Though this will take me some time, as the coming academic year will be a busy one.

Here’s the full model:

data {
  int <lower=1> T;
  vector[T] y;
}
transformed data {
  real sd_y = sd(y);
  real scale_error = 10 * sd_y;
  real scale_slope = 2.5 * sd_y;
}
parameters {
  real<lower=0> sigma_y_z;
  vector[T] gamma;
  real<lower=0> sigma_gamma_z;
  vector[T] zeta;
  real<lower=0> sigma_zeta_z;
  real eta_z;
  real<lower=-1, upper=1> phi;
}
transformed parameters {
  vector[T] mu;
  vector[T] nu;

  real sigma_y = scale_error * sigma_y_z;
  real sigma_gamma = scale_error * sigma_gamma_z;
  real sigma_zeta = scale_error * sigma_zeta_z;
  real eta = scale_slope * eta_z;

  mu[1] = y[1] + sigma_gamma * gamma[1];
  nu[1] = sigma_zeta * zeta[1];
  for (t in 2:T) {
    mu[t] = mu[t-1] + nu[t-1] + sigma_gamma * gamma[t];
    nu[t] = eta + phi * (nu[t-1] - eta) + sigma_zeta * zeta[t];
  }
}
model {
  // priors
  sigma_y_z ~ normal(0, 1);
  gamma ~ normal(0, 1);
  sigma_gamma_z ~ normal(0, 1);
  zeta ~ normal(0, 1);
  sigma_zeta_z ~ normal(0, 1);
  eta_z ~ normal(0, 1);
  phi ~ normal(0, 0.5);

  // likelihood
  y ~ normal(mu, sigma_y);
}
generated quantities {
  vector[T] y_pred;
  for (t in 1:T)
    y_pred[t] = normal_rng(mu[t], sigma_y);
}

That’s always a good idea.

The book by Simo Särkkä is also good on state-space models.

One thing you can do is pull out some of the operations that can be done with vectors, like sigma_gamma * gamma, e.g.,

vector[T] scaled_gamma = scale_error * sigma_gamma_z * gamma;
vector[T] scaled_zeta = scale_error * sigma_zeta_z * zeta;

Those two scalars get multiplied first, then they multiply the vector, so the whole thing is much more efficient. Then you can just use scaled_gamma[t] instead of sigma_gamma * gamma[t],

  nu = eta + scaled_zeta;
  nu[1] -= eta;  // nu[1] has no eta term in the original recursion
  for (t in 2:T)
    nu[t] += phi * (nu[t-1] - eta);
  mu = scaled_gamma;
  mu[1] += y[1];
  mu = cumulative_sum(mu);
  mu[2:T] += cumulative_sum(nu[1:T - 1]);

You’ll have to check my algebra, but I hope the point is clear—minimize the number of operations and use compound operations wherever possible (I don’t actually know if cumulative_sum has good analytic derivatives, but now with adjoint-Jacobian product forms, it’ll be a lot easier to rewrite them so that they do).
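Putting those pieces together, the transformed parameters block might end up as something like this (a consolidation of the suggestions above, untested):

transformed parameters {
  real sigma_y = scale_error * sigma_y_z;
  real sigma_gamma = scale_error * sigma_gamma_z;
  real sigma_zeta = scale_error * sigma_zeta_z;
  real eta = scale_slope * eta_z;
  vector[T] scaled_gamma = sigma_gamma * gamma;
  vector[T] scaled_zeta = sigma_zeta * zeta;
  vector[T] mu;
  vector[T] nu;

  // nu[1] = scaled_zeta[1];  nu[t] = eta + phi * (nu[t-1] - eta) + scaled_zeta[t]
  nu = eta + scaled_zeta;
  nu[1] -= eta;
  for (t in 2:T)
    nu[t] += phi * (nu[t-1] - eta);

  // mu[t] = y[1] + sum(scaled_gamma[1:t]) + sum(nu[1:(t-1)])
  mu = scaled_gamma;
  mu[1] += y[1];
  mu = cumulative_sum(mu);
  mu[2:T] += cumulative_sum(nu[1:T-1]);
}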