Speeding up and reducing divergences for multivariate hierarchical stochastic factor model

On my local machine, it’s

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin19.3.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.9/lib/libopenblasp-r0.3.9.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rstan_2.19.3        StanHeaders_2.19.2  cmdstanr_0.0.0.9000 reticulate_1.15     forcats_0.5.0      
 [6] stringr_1.4.0       dplyr_0.8.5         purrr_0.3.4         readr_1.3.1         tidyr_1.0.2        
[11] tibble_3.0.1        ggplot2_3.3.0       tidyverse_1.3.0     posterior_0.0.2     Matrix_1.2-18      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.8       lubridate_1.7.8    lattice_0.20-38    prettyunits_1.1.1  ps_1.3.2          
 [6] assertthat_0.2.1   R6_2.4.1           cellranger_1.1.0   backports_1.1.6    reprex_0.3.0      
[11] stats4_3.6.3       httr_1.4.1         pillar_1.4.3       rlang_0.4.5        readxl_1.3.1      
[16] rstudioapi_0.11    callr_3.4.3        checkmate_2.0.0    loo_2.2.0          munsell_0.5.0     
[21] broom_0.5.5        compiler_3.6.3     modelr_0.1.6       pkgconfig_2.0.3    pkgbuild_1.0.6    
[26] tidyselect_1.0.0   gridExtra_2.3      matrixStats_0.56.0 fansi_0.4.1        crayon_1.3.4      
[31] dbplyr_1.4.3       withr_2.1.2        grid_3.6.3         nlme_3.1-144       jsonlite_1.6.1    
[36] gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0          magrittr_1.5       scales_1.1.0      
[41] cli_2.0.2          stringi_1.4.6      fs_1.4.1           xml2_1.3.1         ellipsis_0.3.0    
[46] generics_0.0.2     vctrs_0.2.4        tools_3.6.3        glue_1.4.0         hms_0.5.3         
[51] parallel_3.6.3     processx_3.4.2     abind_1.4-5        inline_0.3.15      colorspace_1.4-1  
[56] rvest_0.3.5        haven_2.2.0   

and on the server it’s

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] rstan_2.18.2        StanHeaders_2.18.1  cmdstanr_0.0.0.9000
 [4] reticulate_1.15     forcats_0.4.0       stringr_1.4.0
 [7] dplyr_0.8.5         purrr_0.3.3         readr_1.3.1
[10] tidyr_1.0.0         tibble_3.0.1        ggplot2_3.3.0
[13] tidyverse_1.3.0     posterior_0.0.2     Matrix_1.2-18

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4         lubridate_1.7.4    lattice_0.20-38    prettyunits_1.1.1
 [5] ps_1.3.2           assertthat_0.2.1   R6_2.4.1           cellranger_1.1.0
 [9] backports_1.1.6    reprex_0.3.0       stats4_3.6.2       httr_1.4.1
[13] pillar_1.4.3       rlang_0.4.5        readxl_1.3.1       rstudioapi_0.11
[17] callr_3.4.3        checkmate_2.0.0    loo_2.2.0          munsell_0.5.0
[21] broom_0.5.3        compiler_3.6.2     modelr_0.1.5       pkgconfig_2.0.3
[25] pkgbuild_1.0.6     tidyselect_0.2.5   gridExtra_2.3      matrixStats_0.56.0
[29] fansi_0.4.1        crayon_1.3.4       dbplyr_1.4.2       withr_2.1.2
[33] grid_3.6.2         nlme_3.1-142       jsonlite_1.6.1     gtable_0.3.0
[37] lifecycle_0.2.0    DBI_1.1.0          magrittr_1.5       scales_1.1.0
[41] cli_2.0.2          stringi_1.4.5      fs_1.3.1           xml2_1.2.2
[45] ellipsis_0.3.0     generics_0.0.2     vctrs_0.2.4        tools_3.6.2
[49] glue_1.4.0         hms_0.5.3          parallel_3.6.2     processx_3.4.2
[53] abind_1.4-5        inline_0.3.15      colorspace_1.4-1   rvest_0.3.5
[57] haven_2.2.0

Do you have any additional insights on reducing the number of divergences and improving stability? I’m hoping to move on to the full model soon…

The only thing that sticks out to me is to replace the Cauchy prior on tau with a normal and see if that helps.

Ah, I’ve done that already and it helps quite a bit; all the results I’m getting are with std_normal(). I’m still getting 1% divergences, though; do you think the results are trustworthy?

If that helped, tighten down the priors more!

If you aren’t sure how to do that, you could work it out with prior predictive checks. Check here: https://www.youtube.com/watch?v=ZRpo41l02KQ&t=2694
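To make that concrete, here is a rough sketch of a prior-predictive-only Stan program for the hierarchical scale part of a model like this: no likelihood, just _rng draws in generated quantities. The names and priors are purely illustrative and should be swapped for whatever you are actually using; run it with the fixed-param sampler and look at the exp(log_f_sd) paths the priors imply.

// Prior predictive sketch (illustrative names and priors only): draw the
// hyperparameters from their priors and simulate log-volatility paths, so
// you can see what factor scales exp(log_f_sd) the priors allow.
data {
  int<lower=1> F;   // number of latent factors
  int<lower=1> TS;  // number of time steps
}
generated quantities {
  vector[F] log_f_sd_loc;
  vector[F] log_f_sd_scale;
  vector[F] tau;
  vector[F] log_f_sd[TS];

  for (f in 1:F) {
    log_f_sd_loc[f] = normal_rng(0, 0.5);
    log_f_sd_scale[f] = lognormal_rng(-2, 1);
    tau[f] = fabs(normal_rng(0, 0.2));  // half-normal stand-in; correlations ignored here
  }
  for (t in 1:TS) {
    for (f in 1:F) {
      log_f_sd[t, f] = log_f_sd_loc[f] + tau[f] * normal_rng(0, 1);
      if (t > 1)
        log_f_sd[t, f] += log_f_sd_scale[f] * log_f_sd[t - 1, f];
    }
  }
}

If the simulated exp(log_f_sd) paths routinely explode or collapse over the length of your series, the priors are looser than you intend, and tightening them is exactly the kind of change suggested above.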

I’ve just realised that log_f_sd_scale should be positive and have changed my DGP so that it samples log_f_sd_scale ~ abs(rnorm(F, mean=x, sd=y)). Empirically speaking, it seems that using a lognormal prior is more efficient than using a Gaussian or a truncated Gaussian, but I can’t seem to reduce the number of divergences to 0 out of 4000. Which prior would you recommend?

If it’s on the log scale I don’t see why it should be positive necessarily?

It’s probably the zero-avoiding aspects of that which are making sampling more efficient.

The goal of the non-centered parameterization is to make these funnel-like hierarchical things more efficient, but it sounds like you might still be having trouble there.

What does it mean to constrain log_f_sd_scale to be away from zero?

We’re trying to give additional structure to the problem so that we don’t see large changes in log_f_sd.

We have log_f_sd[t] = log_f_sd_loc + log_f_sd_scale * log_f_sd[t-1] + noise. So to ensure stationarity, we require log_f_sd_scale to be smaller than 1 but non-zero. I’ve already pushed adapt_delta = 0.999 and max treedepth = 15, so I feel it’s pretty much pathological.
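To be concrete, one way to encode that requirement directly in the declaration would be something like this (a sketch, not what I’m currently running; the beta prior is just an example):

parameters {
  // stationarity (coefficient below 1) plus the positivity from my DGP,
  // written into the constraint rather than left only to the prior
  vector<lower=0, upper=1>[F] log_f_sd_scale;
}
model {
  log_f_sd_scale ~ beta(2, 2);  // example prior on (0, 1)
}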

Also, is there a bug in cmdstan for vector<lower=0>? I’ve defined
vector<lower=0>[F] log_f_sd_scale; but I’m getting negative log_f_sd_scale estimates. Somehow I don’t encounter this problem with log_f_sd_tau, the vector that defines the covariance matrix for log_f_sd under the LKJ prior.

This definitely shouldn’t happen if you have lower = 0.0 defined. Can you upload the model that does this?

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] L_diag = rep_vector(1.0, F);
  int beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // reparameterisation parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  vector[F] z_fac[TS]; // reparameterised latent factors
  
  // parameters
  vector[beta_l] L_lower_init; // initial lower diagonal loadings
  
  vector[F] log_f_sd_loc; // latent factor standard deviation location
  vector<lower=0>[F] log_f_sd_scale; // latent factor standard deviation scale
  
  // priors
  cholesky_factor_corr[F] L_Omega;
  vector<lower=0>[F] tau;
  real<lower=0> x_sd; // x standard deviation
}
transformed parameters {
  vector[F] log_f_sd[TS];
  matrix[P, F] beta;
  cholesky_factor_cov[F] chol_log_f_sd_sd = diag_pre_multiply(tau, L_Omega);

  {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[j, j] = L_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[j, k] = 0; // set upper triangle values to 0
      }
      for (i in (j+1):P){
        beta[i, j] = L_lower_init[idx]; // set lower diagonal loadings
        idx = idx + 1;
      }
    }
  }
  
  for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
        log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
  }
}
model {
  // priors
  L_lower_init ~ std_normal();
  
  log_f_sd_loc ~ normal(0, 0.5);
  log_f_sd_scale ~ lognormal(-2, 1);

  L_Omega ~ lkj_corr_cholesky(1);
  tau ~ normal(0, 0.2);
  x_sd ~ gamma (2, 1./10.);
  
  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    z_fac[t] ~ std_normal();
    x[t] ~ normal(beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);
  }
}

It happens when I use std_normal and truncated normal priors as well…

Output of Stan summary

log_f_sd_scale[1]     -0.146649006 0.013875799 0.37592818 -0.841831150 -0.42677275 -0.1634185000  0.12027275
log_f_sd_scale[2]     -0.089736568 0.014582019 0.36360744 -0.786463700 -0.33482600 -0.0864614500  0.16266400
log_f_sd_scale[3]      0.066691309 0.027746054 0.34006852 -0.590690475 -0.16980650  0.0671905500  0.32088975
log_f_sd_scale[4]     -0.052568976 0.023578581 0.37958898 -0.748239900 -0.33513575 -0.0575047000  0.21341725
log_f_sd_scale[5]     -0.109591160 0.012928292 0.35942434 -0.764557975 -0.38136325 -0.1054650000  0.15912700

I’ll check this later. Same data generator as before?

Almost

# library(Matrix)
library(posterior)
library(tidyverse)

# library(reticulate)
# use_python("/usr/bin/python")
# use_python("/apps/python/3.6/3.6.3/bin/python3")
# np <- import("numpy")

library(cmdstanr)
library(rstan)
# options(mc.cores = parallel::detectCores())
# rstan_options(auto_write = TRUE)

seed <- 1
set.seed(seed)

# Simulate data according to Lopes and Carvalho (2007) without switching
n_T <- 100
n_F <- 5
n_X <- 10
n_D <- 1000
n_W <- 1000
R <- 0.5
AD <- 0.999
TD <- 15
SS <- 0.01
discount <- 0.99

# Create data structures
X <- array(0, c(n_T, n_X))
B <- array(0, c(n_T, n_X, n_F))
log_F_sigma <- array(0, c(n_T, n_F))

# x hyperparameters 
# x_sigma
x_sigma <- exp(rep(rnorm(1), n_X))

# f hyperparameters
# Covariance
log_F_sigma_loc <- rnorm(n_F, mean=0, sd=0.25)
log_F_sigma_scale <- abs(rnorm(n_F, mean=0, sd=0.25))
# chol_F_sigma <- diag(abs(rnorm(n_F, mean=0, sd=0.1)))
chol_F_sigma <- matrix(rnorm(n_F * n_F, sd=0.1), n_F, n_F)
chol_F_sigma[upper.tri(chol_F_sigma)] <- 0
diag(chol_F_sigma) <- abs(diag(chol_F_sigma))

# B hyperparameters
# Want B to be 'almost sparse' where B contains mostly values close to 0.05
base_order <- 0.05
bias_order <- 0.5
p_connect <- 0.3
B_init <- matrix(0, n_X, n_F)
diag(B_init) <- 1
n_l_t <- sum(lower.tri(B_init))
B_init[lower.tri(B_init)] <- rnorm(n_l_t, 0, .05) +
  rbinom(n_l_t, 1, .3) * (2 * rbinom(n_l_t, 1, .5) - 1) * 
  rnorm(n_l_t, bias_order, 1)
# (bias_order + abs(rnorm(n_l_t)))

# Initialise
log_F_sigma[1, ] <- log_F_sigma_loc + chol_F_sigma %*% rnorm(n_F)
B[1, , ] <- B_init
# for (j in 1:n_N){
#   X[1, j, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
# }
X[1, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)


# Generate data
for (i in 2:n_T){
  log_F_sigma[i, ] <- log_F_sigma_loc + (log_F_sigma_scale * log_F_sigma[i-1, ]) + chol_F_sigma %*% rnorm(n_F)
  B[i, , ] <- B[i-1, , ] # + discount * base_order * lower.tri(B[i, , ]) * matrix(rnorm(n_X * n_F), c(n_X, n_F))
  # for (j in 1:n_N){
  #   X[i, j,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
  # }
  X[i,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
}

# Standardise X
X_mu <- colMeans(X, dims = 1)
X_sd <- apply(X, 2, sd)
# X_s <- t((t(X) - X_mu) / X_sd)
X_s <- scale(X)

# matplot(X, type='l')

base <- cmdstan_model('infer_tv_base.stan', quiet=FALSE)
fit_b <- base$sample(data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X), 
                     num_samples=n_D, 
                     num_warmup=n_W, 
                     refresh=100, 
                     init=R, 
                     seed=seed, 
                     adapt_delta=AD, 
                     max_depth=TD, 
                     stepsize=SS,
                     num_chains=1,
                     num_cores=4)

stanfit <- read_stan_csv(fit_b$output_files())
par_sum <- summary(stanfit, pars=c('x_sd', 'chol_log_f_sd_sd', 'log_f_sd_loc', 'log_f_sd_scale'))$summary
print(par_sum)

print(x_sigma)
print(chol_F_sigma) 
print(log_F_sigma_loc) 
print(log_F_sigma_scale)

params <- extract(stanfit)
beta_hat <- colMeans(params$beta, dims=1)
chol_F_sigma_hat <- colMeans(params$chol_log_f_sd_sd, dims=1)
log_F_sigma_loc_hat <- colMeans(params$log_f_sd_loc, dims=1)
log_F_sigma_scale_hat <- colMeans(params$log_f_sd_scale, dims=1)
pdf("comp.pdf")
plot(c(t(beta_hat)) ~ c(t(B_init)), cex=2, pch=10, col="dark gray", xlab="Simulated Loadings", ylab="Estimated Loadings")
fit_B <- lm(c(t(beta_hat)) ~ c(t(B_init))) # Fit a linear model
abline(fit_B)
abline(0,1, lty = 2) # 1:1 line
plot(c(t(chol_F_sigma_hat)) ~ c(t(chol_F_sigma)), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Cholesky", ylab="Estimated Log Factor Cholesky")
fit_chol <- lm(c(t(chol_F_sigma_hat)) ~ c(t(chol_F_sigma))) # Fit a linear model
abline(fit_chol)
abline(0,1, lty = 2) # 1:1 line
plot(c(log_F_sigma_loc_hat) ~ c(log_F_sigma_loc), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Location", ylab="Estimated Log Factor Location")
fit_loc <- lm(c(log_F_sigma_loc_hat) ~ c(log_F_sigma_loc)) # Fit a linear model
abline(fit_loc)
abline(0,1, lty = 2) # 1:1 line
plot(c(log_F_sigma_scale_hat) ~ c(log_F_sigma_scale), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Scale", ylab="Estimated Log Factor Scale")
fit_scale <- lm(c(log_F_sigma_scale_hat) ~ c(log_F_sigma_scale)) # Fit a linear model
abline(fit_scale)
abline(0,1, lty = 2) # 1:1 line
dev.off()

I’ve also realised that log_f_sd_scale is probably the object causing the divergences. Even without the restriction that it be positive, I can’t seem to estimate it well. But I’m not sure how I can even reparameterise it.

If it helps, using fabs works, but then the estimates I get are the same for almost all dimensions; see the 4th plot in the attachment (9.0 KB).

Hmm, any chance you could get rid of it? You’re already estimating a covariance matrix there with the LKJ?

Or at least simplify it? Maybe only one value?
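Roughly something like this, relative to the model you posted (just a sketch, only the changed pieces shown; everything else stays as it is):

parameters {
  // one shared AR coefficient instead of a vector of F of them
  real<lower=0> log_f_sd_scale;
}
transformed parameters {
  for (t in 1:TS) {
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1)
      log_f_sd[t] += log_f_sd_scale * log_f_sd[t - 1];  // scalar times vector
  }
}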

With regards to the negative numbers, when I ran the model I got this:

> fit_b$draws() %>%
+   subset_draws(variable = c("log_f_sd_scale"), regex = TRUE) %>%   # need regex=TRUE
+   summarise_draws()
# A tibble: 5 x 10
  variable           mean median    sd    mad     q5   q95  rhat ess_bulk ess_tail
  <chr>             <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 log_f_sd_scale[1] 0.153  0.110 0.134 0.0908 0.0238 0.428  1.00    3383.    2534.
2 log_f_sd_scale[2] 0.150  0.108 0.130 0.0886 0.0239 0.429  1.00    2941.    2595.
3 log_f_sd_scale[3] 0.158  0.120 0.125 0.0998 0.0260 0.432  1.00    1262.     692.
4 log_f_sd_scale[4] 0.152  0.113 0.130 0.0954 0.0230 0.425  1.00    2279.    2136.
5 log_f_sd_scale[5] 0.153  0.112 0.131 0.0919 0.0241 0.426  1.00    3383.    2861.

Maybe force a recompile so you definitely have an up to date model:

base <- cmdstan_model('m5.stan', quiet = FALSE, force_recompile = TRUE)

Forcing a recompile worked, thanks!

Without log_f_sd_scale, log_f_sd would accumulate across time, which yields an increasing sd for log_f. The values of log_f_sd_scale all being very similar could mean that my DGP just isn’t doing enough to differentiate between dimensions; what do you think?

I think the relevant code is this:

for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
        log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
  }

It seems like the chol_log_f_sd_sd would already take care of a lot of correlations across the P dimensions.

Adding another vector of noise on top of this… I dunno. Seems like too many degrees of freedom to be identified.

It’s like:

x \sim N(0, \Sigma) \\
y \sim N(\mu, \Sigma_\text{diagonal}) \\
z = x + y

Estimate \Sigma, \Sigma_\text{diagonal}, and \mu. Seems like too much. Seems like it should be one covariance term.

I think those lines of code mean

x \sim N(0, \Sigma) \\
y = \mu + \sigma \cdot z[t-1] \\
z[t] = y + x

It’s a common construction in stochastic volatility models.

Also, the choice of prior for log_f_sd_scale has a pretty strong effect on the estimates I get. Is there any way to reduce this effect?

Also, by defining log_f_sd_scale as vector<lower=0, upper=1>[F] log_f_sd_scale, I eliminated the divergences and got a 2-3x speedup. However, it’s still not identifiable.

Unexpected but cool.

I dunno I’m having trouble parsing this.

Like if I’m pattern matching terms then I’d do:

if (t > 1){
  log_f_sd[t] += log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc);
}
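Spelled out in the context of your transformed parameters loop, that would be (a sketch; the comment is my reading of the difference between the two forms):

for (t in 1:TS) {
  // written this way the stationary mean of log_f_sd is log_f_sd_loc itself,
  // whereas with the original recursion it is log_f_sd_loc / (1 - log_f_sd_scale)
  log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
  if (t > 1)
    log_f_sd[t] += log_f_sd_scale .* (log_f_sd[t - 1] - log_f_sd_loc);
}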

I guess the way to build up from a simpler model would be:

  1. Do a one component output (P = 1)
  2. Do a multiple component output, but keep the innovation terms the same for every dimension (I’m calling \delta_t \sigma from the manual the innovations)
  3. Do a multiple component output, but keep tau the same for every dimension (innovations different)
  4. Allow the scale of the innovations in each dimension to change
  5. Do the full multivariate thing

Maybe that would give some clues about the identifiability?
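For step 1 in the list above, a minimal one-component version might look something like the sketch below: placeholder priors, the same naming style as your model, and the mean-reverting AR(1) form from the earlier post. If this samples cleanly, steps 2-5 are mostly a matter of promoting the scalars back to vectors and matrices one at a time.

// Step 1 sketch: single-component (P = 1, F = 1) stochastic volatility model,
// non-centered; priors are placeholders to be tuned with prior predictive checks.
data {
  int<lower=1> TS;   // number of time steps
  vector[TS] x;      // observed series
}
parameters {
  vector[TS] z_log_f_sd;                   // non-centered log-volatility innovations
  vector[TS] z_fac;                        // standardised latent factor
  real log_f_sd_loc;                       // log-volatility location
  real<lower=0, upper=1> log_f_sd_scale;   // AR(1) coefficient (stationary)
  real<lower=0> tau;                       // sd of the log-volatility innovations
  real<lower=0> x_sd;                      // observation noise sd
}
transformed parameters {
  vector[TS] log_f_sd;
  log_f_sd[1] = log_f_sd_loc + tau * z_log_f_sd[1];
  for (t in 2:TS)
    log_f_sd[t] = log_f_sd_loc
                  + log_f_sd_scale * (log_f_sd[t - 1] - log_f_sd_loc)
                  + tau * z_log_f_sd[t];
}
model {
  z_log_f_sd ~ std_normal();
  z_fac ~ std_normal();
  log_f_sd_loc ~ normal(0, 0.5);
  log_f_sd_scale ~ beta(2, 2);
  tau ~ normal(0, 0.2);
  x_sd ~ gamma(2, 1. / 10.);
  x ~ normal(exp(log_f_sd) .* z_fac, x_sd);  // loading fixed to 1 in the one-component case
}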