Speeding up and reducing divergences for multivariate hierarchical stochastic factor model

On my local machine, it’s

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin19.3.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.9/lib/libopenblasp-r0.3.9.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rstan_2.19.3        StanHeaders_2.19.2  cmdstanr_0.0.0.9000 reticulate_1.15     forcats_0.5.0      
 [6] stringr_1.4.0       dplyr_0.8.5         purrr_0.3.4         readr_1.3.1         tidyr_1.0.2        
[11] tibble_3.0.1        ggplot2_3.3.0       tidyverse_1.3.0     posterior_0.0.2     Matrix_1.2-18      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.8       lubridate_1.7.8    lattice_0.20-38    prettyunits_1.1.1  ps_1.3.2          
 [6] assertthat_0.2.1   R6_2.4.1           cellranger_1.1.0   backports_1.1.6    reprex_0.3.0      
[11] stats4_3.6.3       httr_1.4.1         pillar_1.4.3       rlang_0.4.5        readxl_1.3.1      
[16] rstudioapi_0.11    callr_3.4.3        checkmate_2.0.0    loo_2.2.0          munsell_0.5.0     
[21] broom_0.5.5        compiler_3.6.3     modelr_0.1.6       pkgconfig_2.0.3    pkgbuild_1.0.6    
[26] tidyselect_1.0.0   gridExtra_2.3      matrixStats_0.56.0 fansi_0.4.1        crayon_1.3.4      
[31] dbplyr_1.4.3       withr_2.1.2        grid_3.6.3         nlme_3.1-144       jsonlite_1.6.1    
[36] gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0          magrittr_1.5       scales_1.1.0      
[41] cli_2.0.2          stringi_1.4.6      fs_1.4.1           xml2_1.3.1         ellipsis_0.3.0    
[46] generics_0.0.2     vctrs_0.2.4        tools_3.6.3        glue_1.4.0         hms_0.5.3         
[51] parallel_3.6.3     processx_3.4.2     abind_1.4-5        inline_0.3.15      colorspace_1.4-1  
[56] rvest_0.3.5        haven_2.2.0   

and on the server it’s

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] rstan_2.18.2        StanHeaders_2.18.1  cmdstanr_0.0.0.9000
 [4] reticulate_1.15     forcats_0.4.0       stringr_1.4.0
 [7] dplyr_0.8.5         purrr_0.3.3         readr_1.3.1
[10] tidyr_1.0.0         tibble_3.0.1        ggplot2_3.3.0
[13] tidyverse_1.3.0     posterior_0.0.2     Matrix_1.2-18

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4         lubridate_1.7.4    lattice_0.20-38    prettyunits_1.1.1
 [5] ps_1.3.2           assertthat_0.2.1   R6_2.4.1           cellranger_1.1.0
 [9] backports_1.1.6    reprex_0.3.0       stats4_3.6.2       httr_1.4.1
[13] pillar_1.4.3       rlang_0.4.5        readxl_1.3.1       rstudioapi_0.11
[17] callr_3.4.3        checkmate_2.0.0    loo_2.2.0          munsell_0.5.0
[21] broom_0.5.3        compiler_3.6.2     modelr_0.1.5       pkgconfig_2.0.3
[25] pkgbuild_1.0.6     tidyselect_0.2.5   gridExtra_2.3      matrixStats_0.56.0
[29] fansi_0.4.1        crayon_1.3.4       dbplyr_1.4.2       withr_2.1.2
[33] grid_3.6.2         nlme_3.1-142       jsonlite_1.6.1     gtable_0.3.0
[37] lifecycle_0.2.0    DBI_1.1.0          magrittr_1.5       scales_1.1.0
[41] cli_2.0.2          stringi_1.4.5      fs_1.3.1           xml2_1.2.2
[45] ellipsis_0.3.0     generics_0.0.2     vctrs_0.2.4        tools_3.6.2
[49] glue_1.4.0         hms_0.5.3          parallel_3.6.2     processx_3.4.2
[53] abind_1.4-5        inline_0.3.15      colorspace_1.4-1   rvest_0.3.5
[57] haven_2.2.0

Do you have any additional insights on reducing the number of divergences and improving stability? I’m hoping to move on to the full model soon…

The only thing that sticks out to me is to replace the Cauchy prior on tau with a normal and see if that helps.

Ah, I’ve done that already and it helps quite a bit; all the results I’m getting are with std_normal(). I’m still getting 1% divergences, though; do you think the results are trustworthy?

If that helped, tighten down the priors more!

If you aren’t sure how to do that, you could work it out with prior predictive checks. Check here: https://www.youtube.com/watch?v=ZRpo41l02KQ&t=2694
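To make that concrete, here is a rough sketch of a prior-predictive-only Stan program for the hierarchical scale part of a model like this: no likelihood, just _rng draws in generated quantities. The names and priors are purely illustrative and should be swapped for whatever you are actually using; run it with the fixed-param sampler and look at the exp(log_f_sd) paths the priors imply.

// Prior predictive sketch (illustrative names and priors only): draw the
// hyperparameters from their priors and simulate log-volatility paths, so
// you can see what factor scales exp(log_f_sd) the priors allow.
data {
  int<lower=1> F;   // number of latent factors
  int<lower=1> TS;  // number of time steps
}
generated quantities {
  vector[F] log_f_sd_loc;
  vector[F] log_f_sd_scale;
  vector[F] tau;
  vector[F] log_f_sd[TS];

  for (f in 1:F) {
    log_f_sd_loc[f] = normal_rng(0, 0.5);
    log_f_sd_scale[f] = lognormal_rng(-2, 1);
    tau[f] = fabs(normal_rng(0, 0.2));  // half-normal stand-in; correlations ignored here
  }
  for (t in 1:TS) {
    for (f in 1:F) {
      log_f_sd[t, f] = log_f_sd_loc[f] + tau[f] * normal_rng(0, 1);
      if (t > 1)
        log_f_sd[t, f] += log_f_sd_scale[f] * log_f_sd[t - 1, f];
    }
  }
}

If the simulated exp(log_f_sd) paths routinely explode or collapse over the length of your series, the priors are looser than you intend, and tightening them is exactly the kind of change suggested above.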

I’ve just realised that log_f_sd_scale should be positive and have changed my DGP so that it samples log_f_sd_scale ~ abs(rnorm(F, mean=x, sd=y)). Empirically speaking, it seems that using a lognormal prior is more efficient than using a Gaussian or a truncated Gaussian, but I can’t seem to reduce the number of divergences to 0 out of 4000. Which prior would you recommend?

If it’s on the log scale I don’t see why it should be positive necessarily?

It’s probably the zero-avoiding aspects of that which are making sampling more efficient.

The goal of the non-centered parameterization is to make these funnel-like hierarchical things more efficient, but it sounds like you might still be having trouble there.

What does it mean to constrain log_f_sd_scale to be away from zero?

We’re trying to give additional structure to the problem so that we don’t see large changes in log_f_sd.

We have log_f_sd[t] = log_f_sd_loc + log_f_sd_scale * log_f_sd[t-1] + noise. So to ensure stationarity, we require log_f_sd_scale to be smaller than 1 but non-zero. I’ve already pushed adapt_delta = 0.999 and max treedepth = 15, so I feel it’s pretty much pathological.
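To be concrete, one way to encode that requirement directly in the declaration would be something like this (a sketch, not what I’m currently running; the beta prior is just an example):

parameters {
  // stationarity (coefficient below 1) plus the positivity from my DGP,
  // written into the constraint rather than left only to the prior
  vector<lower=0, upper=1>[F] log_f_sd_scale;
}
model {
  log_f_sd_scale ~ beta(2, 2);  // example prior on (0, 1)
}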

Also, is there a bug in cmdstan for vector<lower=0>? I’ve defined
vector<lower=0>[F] log_f_sd_scale; but I’m getting negative log_f_sd_scale estimates. Somehow I don’t encounter this problem with log_f_sd_tau, the vector that defines the covariance matrix for log_f_sd under the LKJ prior.

This definitely shouldn’t happen if you have lower = 0.0 defined. Can you upload the model that does this?

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] L_diag = rep_vector(1.0, F);
  int beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // reparameterisation parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  vector[F] z_fac[TS]; // reparameterised latent factors
  
  // parameters
  vector[beta_l] L_lower_init; // initial lower diagonal loadings
  
  vector[F] log_f_sd_loc; // latent factor standard deviation location
  vector<lower=0>[F] log_f_sd_scale; // latent factor standard deviation scale
  
  // priors
  cholesky_factor_corr[F] L_Omega;
  vector<lower=0>[F] tau;
  real<lower=0> x_sd; // x standard deviation
}
transformed parameters {
  vector[F] log_f_sd[TS];
  matrix[P, F] beta;
  cholesky_factor_cov[F] chol_log_f_sd_sd = diag_pre_multiply(tau, L_Omega);

  {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[j, j] = L_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[j, k] = 0; // set upper triangle values to 0
      }
      for (i in (j+1):P){
        beta[i, j] = L_lower_init[idx]; // set lower diagonal loadings
        idx = idx + 1;
      }
    }
  }
  
  for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
        log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
  }
}
model {
  // priors
  L_lower_init ~ std_normal();
  
  log_f_sd_loc ~ normal(0, 0.5);
  log_f_sd_scale ~ lognormal(-2, 1);

  L_Omega ~ lkj_corr_cholesky(1);
  tau ~ normal(0, 0.2);
  x_sd ~ gamma (2, 1./10.);
  
  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    z_fac[t] ~ std_normal();
    x[t] ~ normal(beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);
  }
}

It happens when I use std_normal and truncated normal priors as well…

Output of Stan summary

log_f_sd_scale[1]     -0.146649006 0.013875799 0.37592818 -0.841831150 -0.42677275 -0.1634185000  0.12027275
log_f_sd_scale[2]     -0.089736568 0.014582019 0.36360744 -0.786463700 -0.33482600 -0.0864614500  0.16266400
log_f_sd_scale[3]      0.066691309 0.027746054 0.34006852 -0.590690475 -0.16980650  0.0671905500  0.32088975
log_f_sd_scale[4]     -0.052568976 0.023578581 0.37958898 -0.748239900 -0.33513575 -0.0575047000  0.21341725
log_f_sd_scale[5]     -0.109591160 0.012928292 0.35942434 -0.764557975 -0.38136325 -0.1054650000  0.15912700

I’ll check this later. Same data generator as before?

Almost

# library(Matrix)
library(posterior)
library(tidyverse)

# library(reticulate)
# use_python("/usr/bin/python")
# use_python("/apps/python/3.6/3.6.3/bin/python3")
# np <- import("numpy")

library(cmdstanr)
library(rstan)
# options(mc.cores = parallel::detectCores())
# rstan_options(auto_write = TRUE)

seed <- 1
set.seed(seed)

# Simulate data according to Lopes and Carvalho (2007) without switching
n_T <- 100
n_F <- 5
n_X <- 10
n_D <- 1000
n_W <- 1000
R <- 0.5
AD <- 0.999
TD <- 15
SS <- 0.01
discount <- 0.99

# Create data structures
X <- array(0, c(n_T, n_X))
B <- array(0, c(n_T, n_X, n_F))
log_F_sigma <- array(0, c(n_T, n_F))

# x hyperparameters 
# x_sigma
x_sigma <- exp(rep(rnorm(1), n_X))

# f hyperparameters
# Covariance
log_F_sigma_loc <- rnorm(n_F, mean=0, sd=0.25)
log_F_sigma_scale <- abs(rnorm(n_F, mean=0, sd=0.25))
# chol_F_sigma <- diag(abs(rnorm(n_F, mean=0, sd=0.1)))
chol_F_sigma <- matrix(rnorm(n_F * n_F, sd=0.1), n_F, n_F)
chol_F_sigma[upper.tri(chol_F_sigma)] <- 0
diag(chol_F_sigma) <- abs(diag(chol_F_sigma))

# B hyperparameters
# Want B to be 'almost sparse' where B contains mostly values close to 0.05
base_order <- 0.05
bias_order <- 0.5
p_connect <- 0.3
B_init <- matrix(0, n_X, n_F)
diag(B_init) <- 1
n_l_t <- sum(lower.tri(B_init))
B_init[lower.tri(B_init)] <- rnorm(n_l_t, 0, .05) +
  rbinom(n_l_t, 1, .3) * (2 * rbinom(n_l_t, 1, .5) - 1) * 
  rnorm(n_l_t, bias_order, 1)
# (bias_order + abs(rnorm(n_l_t)))

# Initialise
log_F_sigma[1, ] <- log_F_sigma_loc + chol_F_sigma %*% rnorm(n_F)
B[1, , ] <- B_init
# for (j in 1:n_N){
#   X[1, j, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
# }
X[1, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)


# Generate data
for (i in 2:n_T){
  log_F_sigma[i, ] <- log_F_sigma_loc + (log_F_sigma_scale * log_F_sigma[i-1, ]) + chol_F_sigma %*% rnorm(n_F)
  B[i, , ] <- B[i-1, , ] # + discount * base_order * lower.tri(B[i, , ]) * matrix(rnorm(n_X * n_F), c(n_X, n_F))
  # for (j in 1:n_N){
  #   X[i, j,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
  # }
  X[i,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
}

# Standardise X
X_mu <- colMeans(X, dims = 1)
X_sd <- apply(X, 2, sd)
# X_s <- t((t(X) - X_mu) / X_sd)
X_s <- scale(X)

# matplot(X, type='l')

base <- cmdstan_model('infer_tv_base.stan', quiet=FALSE)
fit_b <- base$sample(data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X), 
                     num_samples=n_D, 
                     num_warmup=n_W, 
                     refresh=100, 
                     init=R, 
                     seed=seed, 
                     adapt_delta=AD, 
                     max_depth=TD, 
                     stepsize=SS,
                     num_chains=1,
                     num_cores=4)

stanfit <- read_stan_csv(fit_b$output_files())
par_sum <- summary(stanfit, pars=c('x_sd', 'chol_log_f_sd_sd', 'log_f_sd_loc', 'log_f_sd_scale'))$summary
print(par_sum)

print(x_sigma)
print(chol_F_sigma) 
print(log_F_sigma_loc) 
print(log_F_sigma_scale)

params <- extract(stanfit)
beta_hat <- colMeans(params$beta, dims=1)
chol_F_sigma_hat <- colMeans(params$chol_log_f_sd_sd, dims=1)
log_F_sigma_loc_hat <- colMeans(params$log_f_sd_loc, dims=1)
log_F_sigma_scale_hat <- colMeans(params$log_f_sd_scale, dims=1)
pdf("comp.pdf")
plot(c(t(beta_hat)) ~ c(t(B_init)), cex=2, pch=10, col="dark gray", xlab="Simulated Loadings", ylab="Estimated Loadings")
fit_B <- lm(c(t(beta_hat)) ~ c(t(B_init))) # Fit a linear model
abline(fit_B)
abline(0,1, lty = 2) # 1:1 line
plot(c(t(chol_F_sigma_hat)) ~ c(t(chol_F_sigma)), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Cholesky", ylab="Estimated Log Factor Cholesky")
fit_chol <- lm(c(t(chol_F_sigma_hat)) ~ c(t(chol_F_sigma))) # Fit a linear model
abline(fit_chol)
abline(0,1, lty = 2) # 1:1 line
plot(c(log_F_sigma_loc_hat) ~ c(log_F_sigma_loc), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Location", ylab="Estimated Log Factor Location")
fit_loc <- lm(c(log_F_sigma_loc_hat) ~ c(log_F_sigma_loc)) # Fit a linear model
abline(fit_loc)
abline(0,1, lty = 2) # 1:1 line
plot(c(log_F_sigma_scale_hat) ~ c(log_F_sigma_scale), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Scale", ylab="Estimated Log Factor Scale")
fit_scale <- lm(c(log_F_sigma_scale_hat) ~ c(log_F_sigma_scale)) # Fit a linear model
abline(fit_scale)
abline(0,1, lty = 2) # 1:1 line
dev.off()

I’ve also realised that log_f_sd_scale is probably the object causing the divergences. Even without the restriction that it be positive, I can’t seem to estimate it well. But I’m not sure how I can even reparameterise it.

If it helps, using fabs works, but then the estimates I get are the same for almost all dimensions; see the 4th plot in the attachment (9.0 KB).

Hmm, any chance you could get rid of it? You’re already estimating a covariance matrix there with the LKJ?

Or at least simplify it? Maybe only one value?
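Roughly something like this, relative to the model you posted (just a sketch, only the changed pieces shown; everything else stays as it is):

parameters {
  // one shared AR coefficient instead of a vector of F of them
  real<lower=0> log_f_sd_scale;
}
transformed parameters {
  for (t in 1:TS) {
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1)
      log_f_sd[t] += log_f_sd_scale * log_f_sd[t - 1];  // scalar times vector
  }
}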

With regards to the negative numbers, when I ran the model I got this:

> fit_b$draws() %>%
+   subset_draws(variable = c("log_f_sd_scale"), regex = TRUE) %>%   # need regex=TRUE
+   summarise_draws()
# A tibble: 5 x 10
  variable           mean median    sd    mad     q5   q95  rhat ess_bulk ess_tail
  <chr>             <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 log_f_sd_scale[1] 0.153  0.110 0.134 0.0908 0.0238 0.428  1.00    3383.    2534.
2 log_f_sd_scale[2] 0.150  0.108 0.130 0.0886 0.0239 0.429  1.00    2941.    2595.
3 log_f_sd_scale[3] 0.158  0.120 0.125 0.0998 0.0260 0.432  1.00    1262.     692.
4 log_f_sd_scale[4] 0.152  0.113 0.130 0.0954 0.0230 0.425  1.00    2279.    2136.
5 log_f_sd_scale[5] 0.153  0.112 0.131 0.0919 0.0241 0.426  1.00    3383.    2861.

Maybe force a recompile so you definitely have an up to date model:

base <- cmdstan_model('m5.stan', quiet = FALSE, force_recompile = TRUE)

Forcing a recompile worked, thanks!

Without log_f_sd_scale, log_f_sd would accumulate across time, which yields an increasing sd for log_f. The values of log_f_sd_scale all being very similar could mean that my DGP just isn’t doing enough to differentiate between dimensions; what do you think?

I think the relevant code is this:

for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
        log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
  }

It seems like the chol_log_f_sd_sd would already take care of a lot of correlations across the P dimensions.

Adding another vector of noise on top of this… I dunno. Seems like too many degrees of freedom to be identified.

It’s like:

x \sim N(0, \Sigma) \\
y \sim N(\mu, \Sigma_\text{diagonal}) \\
z = x + y

Estimate \Sigma, \Sigma_\text{diagonal}, and \mu. Seems like too much. Seems like it should be one covariance term.

I think those lines of code mean

x \sim N(0, \Sigma) \\
y = \mu + \sigma \cdot z[t-1] \\
z[t] = y + x

It’s a common construction in stochastic volatility models.

Also, the choice of prior for log_f_sd_scale has a pretty strong effect on the estimates I get. Is there any way to reduce this effect?

Also, by defining log_f_sd_scale as vector<lower=0, upper=1>[F] log_f_sd_scale, I eliminated the divergences and got a 2-3x speedup. However, it’s still not identifiable.

Unexpected but cool.

I dunno I’m having trouble parsing this.

Like if I’m pattern matching terms then I’d do:

if (t > 1){
  log_f_sd[t] += log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc);
}
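Spelled out in the context of your transformed parameters loop, that would be (a sketch; the comment is my reading of the difference between the two forms):

for (t in 1:TS) {
  // written this way the stationary mean of log_f_sd is log_f_sd_loc itself,
  // whereas with the original recursion it is log_f_sd_loc / (1 - log_f_sd_scale)
  log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
  if (t > 1)
    log_f_sd[t] += log_f_sd_scale .* (log_f_sd[t - 1] - log_f_sd_loc);
}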

I guess the way to build up from a simpler model would be:

  1. Do a one component output (P = 1)
  2. Do a multiple component output, but keep the innovation terms the same for every dimension (I’m calling \delta_t \sigma from the manual the innovations)
  3. Do a multiple component output, but keep tau the same for every dimension (innovations different)
  4. Allow the scale of the innovations in each dimension to change
  5. Do the full multivariate thing

Maybe that would give some clues about the identifiability?
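For step 1 in the list above, a minimal one-component version might look something like the sketch below: placeholder priors, the same naming style as your model, and the mean-reverting AR(1) form from the earlier post. If this samples cleanly, steps 2-5 are mostly a matter of promoting the scalars back to vectors and matrices one at a time.

// Step 1 sketch: single-component (P = 1, F = 1) stochastic volatility model,
// non-centered; priors are placeholders to be tuned with prior predictive checks.
data {
  int<lower=1> TS;   // number of time steps
  vector[TS] x;      // observed series
}
parameters {
  vector[TS] z_log_f_sd;                   // non-centered log-volatility innovations
  vector[TS] z_fac;                        // standardised latent factor
  real log_f_sd_loc;                       // log-volatility location
  real<lower=0, upper=1> log_f_sd_scale;   // AR(1) coefficient (stationary)
  real<lower=0> tau;                       // sd of the log-volatility innovations
  real<lower=0> x_sd;                      // observation noise sd
}
transformed parameters {
  vector[TS] log_f_sd;
  log_f_sd[1] = log_f_sd_loc + tau * z_log_f_sd[1];
  for (t in 2:TS)
    log_f_sd[t] = log_f_sd_loc
                  + log_f_sd_scale * (log_f_sd[t - 1] - log_f_sd_loc)
                  + tau * z_log_f_sd[t];
}
model {
  z_log_f_sd ~ std_normal();
  z_fac ~ std_normal();
  log_f_sd_loc ~ normal(0, 0.5);
  log_f_sd_scale ~ beta(2, 2);
  tau ~ normal(0, 0.2);
  x_sd ~ gamma(2, 1. / 10.);
  x ~ normal(exp(log_f_sd) .* z_fac, x_sd);  // loading fixed to 1 in the one-component case
}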