Speeding up and reducing divergences for multivariate hiearchical stochastic factor model

I’m fitting a multivariate hierchical stochastic factor model with time-varying factor standard deviations and am planning to extend it to the case where the factor loadings are stochastic too. I’m hoping to get better estimates of the covariance matrix of the factor standard deviations too.

To avoid divergences and speed this up, I’ve set init_r to be 0.5, adapt_delta to 0.99, tree-depth to 15 and step_size to be 0.25. However, I still get divergences and have no idea how to fix them.

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] L_diag = rep_vector(1.0, F);
  int<lower=1> beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // reparameterisation parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  vector[F] z_fac[TS]; // reparameterised latent factors
  
  // parameters
  vector[beta_l] L_lower_init; // initial lower diagonal loadings
  
  vector[F] log_f_sd_loc; // latent factor standard deviation transform scale
  vector[F] log_f_sd_scale; // latent factor standard deviation transform scale
  // vector[P] x_loc;
  // vector[P] x_scale;
  
  // priors
  cholesky_factor_corr[F] L_Omega;
  vector<lower=0,upper=pi()/2>[F] tau_unif;
  // vector<lower=0>[F] chol_log_f_sd_sd; // diagonal standard deviation
  real<lower=0> x_sd; // x standard deviation
  // vector[F] log_f_sd_sd; // diagonal standard deviation
  // real log_x_sd; // x standard deviation
}
transformed parameters {
  vector[F] log_f_sd[TS];
  matrix[P, F] beta;
  vector[F] mu[TS];
  vector[F] tau = tan(tau_unif);
  cholesky_factor_cov[F] chol_log_f_sd_sd = diag_pre_multiply(tau, L_Omega);
  // vector[F] chol_log_f_sd_sd = exp(log_f_sd_sd);
  // real x_sd = exp(log_x_sd);
  
  {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[j, j] = L_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[j, k] = 0; // set upper triangle values to 0
      }
      for (i in (j+1):P){
        beta[i, j] = L_lower_init[idx]; // set lower diagonal loadings
        idx = idx + 1;
      }
    }
  }
  
  for (t in 1:TS){
    // log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd .* z_log_f_sd[t];
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
      log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
    // mu[t] =  x_loc + x_scale .* (beta * (exp(log_f_sd[t]) .* z_fac[t]));
    mu[t] = beta * (exp(log_f_sd[t]) .* z_fac[t]);
  }
}
model {
  // priors
  L_lower_init ~ std_normal();
  
  log_f_sd_loc ~ normal(0, 0.5);
  log_f_sd_scale ~ normal(0, 0.5);
  // x_loc ~ std_normal();
  // x_scale ~ std_normal();
  
  L_Omega ~ lkj_corr_cholesky(4);
  // chol_log_f_sd_sd ~ lognormal(-1, 1);
  x_sd ~ lognormal(2, 1./10);
  // log_f_sd_sd ~ normal(-2, 0.1);
  // log_x_sd ~ normal(-2, 0.1);
  
  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    z_fac[t] ~ std_normal();
    // if (is_nan(mu[t][1]) || is_nan(mu[t][2]) || is_nan(mu[t][3]) || is_nan(mu[t][4]) || is_nan(mu[t][5]) || 
    //     is_nan(mu[t][6]) || is_nan(mu[t][7]) || is_nan(mu[t][8]) || is_nan(mu[t][9]) || is_nan(mu[t][10])){
    //   print("chol_log_f_sd_sd", chol_log_f_sd_sd);
    //   print("log_f_sd_loc", log_f_sd_loc);
    //   print("log_f_sd_scale", log_f_sd_scale);
    //   print("z_log_f_sd", z_log_f_sd[t]);
    //   print("log_f_sd", log_f_sd[t-1]);
    //   print("log_f_sd", log_f_sd[t]);
    //   print("f_sd", exp(log_f_sd[t]));
    // }
    x[t] ~ normal(mu[t], x_sd);
  }
}

I’m thinking of potentially concatentating all the auxiliary reparameterisation parameters into one long vector but I’m not sure how much help that would be. Currently, I don’t think I’m doing any redundant computations. Also I’m unsure if I could whiten my parameterisations to make them more stable.

Any help and performance tips would be much appreciated.

Bump

If there’s a way to debug this one piece at a time, go for that.

Like if you remove the multivariate part does it work? Or if you remove the hierarchical part?

Not easy things to do but it’s hard to know what’s going from the outside.

vector<lower=0,upper=pi()/2>[F] tau_unif;

Also anything heavy tailed I’m wary of, though I see you’ve done the fancy reparameterizations.

2 Likes

It’s generally working at the moment but I’m hoping to get it to run quicker and more accurately. I’ve discovered that removing mu[t] from transformed parameters and computing it directly in the model section yields better estimates but is much slower. Why is this so?

I took this from the hiearchical models guide. Is there a way to not make it heavy tailed? Also, it seems like using the LKJ prior is much quicker than using a vectorised prior and it’s all very confusing for me as intuitively this shouldn’t be happening.

If anything is changing here, I think this indicates a bug somewhere. Having things in transformed parameters can make models run slow just cause there’s more IO, but I doubt that’s happening here.

If you need a heavy tailed prior, what you did the is the thing to do. But then the question is do you need a heavy tailed prior?

Thanks for the tips, I’ll have a look as this happened with an extended version of this to account for multiple observations at the same timepoint.

How much do the changing the order of magnitudes of parameters help? Multiple people have mentioned mutilplying and dividing by 100/1000 on the forum.

It seems that doing this yields different quite different neffs, is much faster and causes less divergences. The values I get are slightly different for most parameters. Do you have any intuition on what’s potentially causing this as the diffs are only two lines different?

Keep investigating. There’s a difference. Post the two models here and I can look at them when I get a chance.

Without mu in transformed

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] L_diag = rep_vector(1.0, F);
  int<lower=1> beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // reparameterisation parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  vector[F] z_fac[TS]; // reparameterised latent factors
  
  // parameters
  vector[beta_l] L_lower_init; // initial lower diagonal loadings
  
  vector[F] log_f_sd_loc; // latent factor standard deviation transform scale
  vector[F] log_f_sd_scale; // latent factor standard deviation transform scale
  
  // priors
  cholesky_factor_corr[F] L_Omega;
  vector<lower=0,upper=pi()/2>[F] tau_unif;
  real<lower=0> x_sd; // x standard deviation
}
transformed parameters {
  vector[F] log_f_sd[TS];
  matrix[P, F] beta;
  vector[F] tau = tan(tau_unif);
  cholesky_factor_cov[F] chol_log_f_sd_sd = diag_pre_multiply(tau, L_Omega);
  
  {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[j, j] = L_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[j, k] = 0; // set upper triangle values to 0
      }
      for (i in (j+1):P){
        beta[i, j] = L_lower_init[idx]; // set lower diagonal loadings
        idx = idx + 1;
      }
    }
  }
  
  for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
      log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
  }
}
model {
  // priors
  L_lower_init ~ std_normal();
  
  log_f_sd_loc ~ std_normal();
  log_f_sd_scale ~ std_normal();

  L_Omega ~ lkj_corr_cholesky(4);
  x_sd ~ gamma (2, 1./10.);
  
  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    z_fac[t] ~ std_normal();
    x[t] ~ normal(beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);
  }
}

Mu in transformed

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] L_diag = rep_vector(1.0, F);
  int<lower=1> beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // reparameterisation parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  vector[F] z_fac[TS]; // reparameterised latent factors
  
  // parameters
  vector[beta_l] L_lower_init; // initial lower diagonal loadings
  
  vector[F] log_f_sd_loc; // latent factor standard deviation transform scale
  vector[F] log_f_sd_scale; // latent factor standard deviation transform scale
  
  // priors
  cholesky_factor_corr[F] L_Omega;
  vector<lower=0,upper=pi()/2>[F] tau_unif;
  real<lower=0> x_sd; // x standard deviation
}
transformed parameters {
  vector[F] log_f_sd[TS];
  matrix[P, F] beta;
  vector[F] mu[TS];
  vector[F] tau = tan(tau_unif);
  cholesky_factor_cov[F] chol_log_f_sd_sd = diag_pre_multiply(tau, L_Omega);
  
  {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[j, j] = L_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[j, k] = 0; // set upper triangle values to 0
      }
      for (i in (j+1):P){
        beta[i, j] = L_lower_init[idx]; // set lower diagonal loadings
        idx = idx + 1;
      }
    }
  }
  
  for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
      log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
    mu[t] = beta * (exp(log_f_sd[t]) .* z_fac[t]);
  }
}
model {
  // priors
  L_lower_init ~ std_normal();
  
  log_f_sd_loc ~ std_normal();
  log_f_sd_scale ~ std_normal();
  
  L_Omega ~ lkj_corr_cholesky(4);
  x_sd ~ gamma (2, 1./10.);
  
  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    z_fac[t] ~ std_normal();
    x[t] ~ normal(mu[t], x_sd);
  }
}

Oh jeez, these do look like they should be the same. Now I’m scared of a Stan bug.

What Stan interface/version are you running? Do you mind uploading data to run this model?

Here’s the DGP, I’m on Stan 2.18.2 and I get the same number of divergences somehow this time. I’m running on a server that doesn’t allow me to run 2.19.x. I tested it on my local machine with Stan 2.19.3 and am getting subtle differences too but less divergences; it might just be precision issues and code updates.

library(Matrix)
library(mvtnorm)

library(reticulate)
use_python("/usr/bin/python")
# use_python("/apps/python/3.6/3.6.3/bin/python3")
np <- import("numpy")

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

seed <- 1
set.seed(seed)

# Simulate data according to Lopes and Carvalho (2007) without switching
n_T <- 100
n_F <- 5
n_X <- 10
n_D <- 4000
n_W <- 1000
R <- 0.5
AD <- 0.99
TD <- 15
SS <- 0.1
discount <- 0.99

# Create data strucutures
X <- array(0, c(n_T, n_X))
B <- array(0, c(n_T, n_X, n_F))
log_F_sigma <- array(0, c(n_T, n_F))

# x hyperparameters 
# x_sigma
x_sigma <- exp(rep(rnorm(1), n_X))
# y_sigma <- matrix(rep(1. / rgamma(1, 3, 1), n_Y), c(n_Y, 1))

# f hyperparameters
# Covariance
log_F_sigma_loc <- rnorm(n_F, mean=-0, sd=0.25)
log_F_sigma_scale <- rnorm(n_F, mean=0, sd=0.25)
chol_F_sigma <- diag(abs(rnorm(n_F, mean=0, sd=0.1)))
# chol_f_sigma <- matrix(rnorm(n_F * n_F, sd=0.1), c(n_F, n_F))
# diag(chol_f_sigma) <- abs(diag(chol_f_sigma))

# B hyperparameters
# Want B to be 'almost sparse' where B contains mostly values close to 0.05
base_order <- 0.05
bias_order <- 0.5
p_connect <- 0.3
B_init <- matrix(0, n_X, n_F)
diag(B_init) <- 1
n_l_t <- sum(lower.tri(B_init))
B_init[lower.tri(B_init)] <- rnorm(n_l_t, 0, .05) +
  rbinom(n_l_t, 1, .3) * (2 * rbinom(n_l_t, 1, .5) - 1) * 
  # rnorm(n_l_t, bias_order, 1)
  (bias_order + abs(rnorm(n_l_t)))

# Initialise
log_F_sigma[1, ] <- log_F_sigma_loc + chol_F_sigma %*% rnorm(n_F)
B[1, , ] <- B_init
# for (j in 1:n_N){
#   X[1, j, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
# }
X[1, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)


# Generate data
for (i in 2:n_T){
  log_F_sigma[i, ] <- log_F_sigma_loc + (log_F_sigma_scale * log_F_sigma[i-1, ]) + chol_F_sigma %*% rnorm(n_F)
  B[i, , ] <- B[i-1, , ] # + discount * base_order * lower.tri(B[i, , ]) * matrix(rnorm(n_X * n_F), c(n_X, n_F))
  # for (j in 1:n_N){
  #   X[i, j,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
  # }
  X[i,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
}

# Standardise X
X_mu <- colMeans(X, dims = 1)
X_sd <- apply(X, 2, sd)
# X_s <- t((t(X) - X_mu) / X_sd)
X_s <- scale(X)

# matplot(X, type='l')

fit <- stan(file='infer_tv_base.stan', data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X), iter=n_D, warmup=n_W, 
            refresh=n_W, init_r=R, seed=seed, control=list(adapt_delta=AD, max_treedepth=TD, stepsize=SS))

# fit <- stan(file='infer_tv.stan', data=list(P=n_X, F=n_F, TS=n_T, N=n_N, disc=discount, x=X), iter=n_D, warmup=n_W, 
# refresh=100, seed=seed, chains=1, control=list(adapt_delta=AD, max_treedepth=TD))

# summary(fit, pars=c('x_sd', 'chol_log_f_sd'))
par_sum <- summary(fit, pars=c('x_sd', 'chol_log_f_sd_sd', 'log_f_sd_loc', 'log_f_sd_scale'))$summary
print(par_sum)

print(x_sigma)
print(chol_F_sigma) 
print(log_F_sigma_loc) 
print(log_F_sigma_scale)

# monitor(extract(fit, pars=c('x_sd', 'chol_log_f_sd_sd', 'log_f_sd_loc', 'log_f_sd_scale'), 
                # permuted = FALSE, inc_warmup = FALSE), digits_summary=5)

params <- extract(fit)
beta_hat <- colMeans(params$beta, dims=1)
pdf("beta_comp_alt.pdf")
plot(c(t(beta_hat)) ~ c(t(B_init)), cex=2, pch=10, col="dark gray", xlab="Simulated Loadings", ylab="Estimated Loadings")
fit_B <- lm(c(t(beta_hat)) ~ c(t(B_init))) # Fit a linear model
abline(fit_B)
abline(0,1, lty = 2) # 1:1 line
dev.off()

# saveRDS(fit, "draws.rds")
# params <- extract(fit)
# print(colMeans(params$x_sd, dims=1))
# print(colMeans(params$log_f_sd_sd=1))

# Save draws as Python pickle
# py_save_object(r_to_py(params), 'draws.pkl')

2 Likes

Thanks for posting. I’ll try to take a look at this tomorrow.

1 Like

I had a look at this with the develop cmdstan (roughly version 2.23) using cmdstanr (https://mc-stan.org/cmdstanr/).

So something like this:

library(cmdstanr)
library(tidyverse)
library(posterior) # https://github.com/jgabry/posterior

m1 = cmdstan_model("m1.stan", quiet = FALSE)
m2 = cmdstan_model("m2.stan", quiet = FALSE)

fit1 = m1$sample(data = list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X),
          num_samples = 1000,
          num_warmup = 1000,
          init=R,
          seed=seed,
          adapt_delta=AD,
          max_depth=10,
          refresh=100,
          num_chains = 4,
          num_cores = 4)

fit2 = m2$sample(data = list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X),
                 num_samples = 1000,
                 num_warmup = 1000,
                 init=R,
                 seed=seed,
                 adapt_delta=AD,
                 max_depth=10,
                 refresh=100,
                 num_chains = 4,
                 num_cores = 4)

# See if the 4 chains from each separate fit mix
# I had to add mu as a transformed parameter in  the model where you don't use it
# in the model block to get this check to work. (I still don't use mu in the model block)
bind_draws(fit1$draws(), fit2$draws(), along = 'chain') %>%
  summarise_draws() %>%
  arrange(-rhat)

Using your generated data, things roughly mixed. Like, there were Rhats of 1.01.

Do you have a computer you could install a later version of either Rstan or cmdstanr and give this a try and see if you still have the problems?

Also try this variation:

    vector[F] tmp;
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
      log_f_sd[t] += log_f_sd_scale .* log_f_sd[t-1];
    }
    tmp = exp(log_f_sd[t]) .* z_fac[t];
    mu[t] = beta * tmp;

In the model where you use mu[t] and see if you find a difference.

Thanks for all the help, I’ve verified that the Rhats are all around 1.02 for all three models and found a small error in model 2.

  1. They had all the same number of divergences which is reassuring. Do you have any tips for further reducing those apart from increasing adapt_delta to 0.999 (this reduces it to 1 per 1000 draws) as I can’t reparameterise any more?
  2. Interestingly, model 1 (the one without mu) doesn’t yield treedepth warnings but the rest do. Why is that so?

Also is there a way of selecting variables while doing a summary? I had to convert the CmdStanMCMC files to StanFit files which seem unnecessary…

I just feel like there’s a bug if these models don’t do the same thing.

What was the error?

I doubt that there’s an error somewhere as diffing the models just yields minor bracketing differences when I assign the temp variable: temp = exp(log_f_sd[t]) .* z_fac[t]; vs temp = (exp(log_f_sd[t]) .* z_fac[t]);

vector[F] mu[TS] instead of vector[P] mu[TS] as the observations are p-dimensional. I have no idea how that got past compilation.

Ooooo! A delicious clue!

There’s definitely a bug there. I filed an issue with a simplified model: https://github.com/stan-dev/stanc3/issues/523

If this is the problem affecting you, then I think the model where you don’t use mu and do:

x[t] ~ normal(beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);

is the one to trust for now. I wasn’t able to reproduce that bug in rstan 2.19.3, but I was able to reproduce it in the 2.23 release candidate. If you’re using rstan 2.18.2 then it’s kinda like this bug went away for a few releases bug came back?

Could you do a sessionInfo() and print the results here? In particular I’m curious what version of StanHeaders you have.