Speeding up and reducing divergences for multivariate hierarchical stochastic factor model

I think one can do it with or without recentering, as that is what was specified in the model I’m replicating. The only thing I’m not doing is letting x_sd vary across time and dimensions?

Did you mean phi when you wrote tau? If I keep the scale constant, I get a near-constant bias in log_f_sd_loc, as I think it makes it harder to differentiate between factors.

The tau here:

cholesky_factor_cov[F] chol_log_f_sd_sd = diag_pre_multiply(tau, L_Omega);
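
(For clarity, that tau is the vector of marginal standard deviations of the innovation covariance rather than the AR coefficient, since diag_pre_multiply gives

L = \text{diag}(\tau) \, L_\Omega, \qquad \Sigma = L L^\top = \text{diag}(\tau) \, \Omega \, \text{diag}(\tau)

where \Omega = L_\Omega L_\Omega^\top is the correlation matrix.)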

After some experimenting, I have a few things to report.

1. It seems that having different variants of the same mathematical statements yields different results, i.e.

for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[t];
    if (t > 1){
        log_f_sd[t] += log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc);
    }
  }

performs differently than

  log_f_sd[1] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[1];
  for (t in 2:TS){
    log_f_sd[t] = log_f_sd_loc + (log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc)) + (chol_log_f_sd_sd * z_log_f_sd[t]);
    }
2. If I reparameterise log_f_sd = abs(raw_log_f_sd), where raw_log_f_sd is an unconstrained vector, the mean of log_f_sd != abs(mean of raw_log_f_sd).

3. The choice of prior values makes a big difference when using normals, truncated normals, and lognormals, but not with gammas.

Setting the values of log_f_sd_loc and log_f_sd_scale to be the same in every dimension, together with a diagonal innovation covariance matrix, does not lead to identifiability.

I’m worried I’m overlooking something big. The model I’m trying to replicate is Factor Stochastic Volatility with Time Varying Loadings and Markov Switching Regimes.pdf, which is based on Aguilar and West.pdf. Currently, I can’t get the latter to work even if I remove time-varying innovations. Employing time-varying innovations just leads to poorly estimated scales for both log_f_sd and the innovations, while everything else is estimated well.


I feel like I am too.

Can you check that the two models have the same gradients given some unconstrained parameters?

Use grad_log_prob to check; see the log_prob and grad_log_prob functions (log_prob-methods) in the rstan documentation.

Do something like:

params = rnorm(get_num_upars(fit1))
grad_log_prob(fit1, params)
grad_log_prob(fit2, params)

And compare the difference in the first and the second. They should be the same.

As far as the model goes, maybe start with a non-multivariate re-write?

They’re not the same when n_warmup = n_draw = 1000.

params = rnorm(get_num_upars(stanfit))
> sum(grad_log_prob(stanfit, params) - grad_log_prob(stanfit_alt, params))
[1] -1.019831

However, when n_warmup = n_draw = 100,

> sum(grad_log_prob(stanfit, params) - grad_log_prob(stanfit_alt, params))
[1] 6.087279e-10

Do you have an idea of what’s going on?

I’m going back to a non-dynamic model to see if it works in the first place. Otherwise I will have to do a non-multivariate rewrite…


Yeah this is super suspect. I will do some more investigation on this later.

The n_draw/n_warmup difference makes no sense. Maybe it was more to do with the random parameters? Or were those the same random params in both cases?

Also throw an abs in there so errors aren’t canceling out:

sum(abs(grad_log_prob(stanfit, params) - grad_log_prob(stanfit_alt, params)))

Usually when we do gradient checks we just generate one warmup and one post-warmup draw – the gradients should be totally independent of that.

So n_warmup = n_draws = 1? I assume the random params should be different between runs, since I’m using the same seed but have drawn more by the time params is generated when n_warmup = n_draws = 1000…

Doing that, I get

> sum(abs(grad_log_prob(stanfit, params) - grad_log_prob(stanfit_alt, params)))
[1] 1.337243

Well, when you call the gradients, the only degrees of freedom are the parameters you’re passing in, and you control those.

Could you post the latest two models? I’m sure I can dig up one of the data scripts you posted already to pass to them, but just so I’m working with exactly what you have from the model standpoint.

Thanks for looking into this.

They are:

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] beta_diag = rep_vector(1.0, F);
  int beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // nuisance parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  vector[F] z_fac[TS]; // reparameterised latent factors
  
  // parameters
  vector[F] log_f_sd_loc; // latent factor standard deviation location
  // vector[F] log_f_sd_scale; // latent factor standard deviation scale
  vector<lower=0, upper=1.0>[F] log_f_sd_scale; // latent factor standard deviation scale
  // real<lower=0.0, upper=1.0> r_log_f_sd_scale; // latent factor standard deviation scale
  // vector[F] log_f_sd_init; // initial log_f_sd
  
  vector[beta_l] beta_lower_init; // initial lower diagonal loadings
  // vector<lower=0.0>[F] beta_diag; // positive diagonal loadings
  
  // vector[P] x_loc;
  real<lower=0.0> x_sd; // x standard deviation
  
  // priors
  cholesky_factor_corr[F] log_f_sd_Omega;
  vector<lower=0.0>[F] log_f_sd_tau;
}
transformed parameters { 
  vector[F] log_f_sd[TS];
  matrix[P, F] beta;
  
  // vector[F] abs_log_f_sd_scale;
  // vector[F] log_f_sd_scale = rep_vector(r_log_f_sd_scale, F);
  matrix[F, F] chol_log_f_sd_sd = diag_pre_multiply(log_f_sd_tau, log_f_sd_Omega);
  
  // for (f in 1:F){
    //   abs_log_f_sd_scale[f] = abs(log_f_sd_scale[f]);
    // }
  
  {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[j, j] = beta_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[j, k] = 0; // set upper triangle values to 0
      }
      for (i in (j+1):P){
        beta[i, j] = beta_lower_init[idx]; // set lower diagonal loadings
        idx = idx + 1;
      }
    }
  }
  
  log_f_sd[1] = log_f_sd_loc + (chol_log_f_sd_sd * z_log_f_sd[1]);
  for (t in 2:TS){
    // log_f_sd[t] = log_f_sd_loc + (log_f_sd_scale .* log_f_sd[t-1]) + (chol_log_f_sd_sd * z_log_f_sd[t]);
    log_f_sd[t] = log_f_sd_loc + (log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc)) + (chol_log_f_sd_sd * z_log_f_sd[t]);
    // log_f_sd[t] = log_f_sd_loc + (abs_log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc)) + (chol_log_f_sd_sd * z_log_f_sd[t]);
  }
  // for (t in 1:TS){
  //   log_f_sd[t] = log_f_sd_loc + (chol_log_f_sd_sd * z_log_f_sd[t]);
  //   if (t > 1){
  //     log_f_sd[t] += log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc);
  //   }
  // }
}
model {
  // priors
  beta_lower_init ~ std_normal();
  // beta_diag ~ normal(0.5, 1);
  
  log_f_sd_loc ~ normal(0, 0.5);
  log_f_sd_scale ~ normal(0, 0.5);
  // for (f in 1:F){
    //   log_f_sd_scale[f] ~ normal(0, 0.3) T[0, ];
    // }
  // log_f_sd_scale ~ gamma(2, 1./20.);
  // log_f_sd_scale ~ lognormal(-2, 0.5);
  // r_log_f_sd_scale ~ gamma(2, 1./10.);
  
  // x_loc ~ normal(0, 1);
  x_sd ~ gamma(2, 1./10.);
  
  log_f_sd_Omega ~ lkj_corr_cholesky(1);
  log_f_sd_tau ~ normal(0, 0.2);
  
  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    z_fac[t] ~ std_normal();
    x[t] ~ normal(beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);
    // x[t] ~ normal(x_loc + beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);
  }
}

and

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] beta_diag = rep_vector(1.0, F);
  int beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // nuisance parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  vector[F] z_fac[TS]; // reparameterised latent factors
   
  // parameters
  vector[F] log_f_sd_loc; // latent factor standard deviation location
  // vector[F] log_f_sd_scale; // latent factor standard deviation scale
  vector<lower=0.0, upper=1.0>[F] log_f_sd_scale; // latent factor standard deviation scale
  // real<lower=0.0, upper=1.0> r_log_f_sd_scale; // latent factor standard deviation scale
  // vector[F] log_f_sd_init; // initial log_f_sd
  
  vector[beta_l] beta_lower_init; // initial lower diagonal loadings
  // vector<lower=0.0>[F] beta_diag; // positive diagonal loadings
  
  // vector[P] x_loc;
  real<lower=0.0> x_sd; // x standard deviation
  
  // priors
  cholesky_factor_corr[F] log_f_sd_Omega;
  vector<lower=0.0>[F] log_f_sd_tau;
}
transformed parameters { 
  vector[F] log_f_sd[TS];
  matrix[P, F] beta;
  
  // vector[F] abs_log_f_sd_scale;
  // vector[F] log_f_sd_scale = rep_vector(r_log_f_sd_scale, F);
  matrix[F, F] chol_log_f_sd_sd = diag_pre_multiply(log_f_sd_tau, log_f_sd_Omega);
  
  // for (f in 1:F){
  //   abs_log_f_sd_scale[f] = abs(log_f_sd_scale[f]);
  // }
  
  {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[j, j] = beta_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[j, k] = 0; // set upper triangle values to 0
      }
      for (i in (j+1):P){
        beta[i, j] = beta_lower_init[idx]; // set lower diagonal loadings
        idx = idx + 1;
      }
    }
  }
  
  // log_f_sd[1] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[1];
  // log_f_sd[1] = log_f_sd_init;
  // for (t in 2:TS){
  //   // log_f_sd[t] = log_f_sd_loc + (log_f_sd_scale .* log_f_sd[t-1]) + (chol_log_f_sd_sd * z_log_f_sd[t]);
  //   log_f_sd[t] = log_f_sd_loc + (log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc)) + (chol_log_f_sd_sd * z_log_f_sd[t]);
  //   // log_f_sd[t] = log_f_sd_loc + (abs_log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc)) + (chol_log_f_sd_sd * z_log_f_sd[t]);
  // }
  for (t in 1:TS){
    log_f_sd[t] = log_f_sd_loc + (chol_log_f_sd_sd * z_log_f_sd[t]);
    if (t > 1){
        log_f_sd[t] += log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc);
    }
  }
}
model {
  // priors
  beta_lower_init ~ std_normal();
  // beta_diag ~ normal(0.5, 1);
  
  log_f_sd_loc ~ normal(0, 0.5);
  log_f_sd_scale ~ normal(0, 0.5);
  // for (f in 1:F){
  //   log_f_sd_scale[f] ~ normal(0, 0.3) T[0, ];
  // }
  // log_f_sd_scale ~ gamma(2, 1./20.);
  // log_f_sd_scale ~ lognormal(-2, 0.5);
  // r_log_f_sd_scale ~ gamma(2, 1./10.);
  
  // x_loc ~ normal(0, 1);
  x_sd ~ gamma(2, 1./10.);
  
  log_f_sd_Omega ~ lkj_corr_cholesky(1);
  log_f_sd_tau ~ normal(0, 0.2);

  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    z_fac[t] ~ std_normal();
    x[t] ~ normal(beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);
    // x[t] ~ normal(x_loc + beta * (exp(log_f_sd[t]) .* z_fac[t]), x_sd);
  }
}

and the R code is

library(posterior)
library(tidyverse)
options(tibble.print_max = Inf)

# library(reticulate)
# use_python("/usr/bin/python")
# np <- import("numpy")

library(cmdstanr)
library(rstan)
options(mc.cores = parallel::detectCores())

seed <- 1
set.seed(seed)

# Simulate data according to Lopes and Carvalho (2007) without switching
n_T <- 100
n_F <- 5
n_X <- 10
n_D <- 1000
n_W <- 1000
R <- 1
AD <- 0.99
TD <- 10
# SS <- 0.01
discount <- 0.99

# Create data structures
X <- array(0, c(n_T, n_X))
B <- array(0, c(n_T, n_X, n_F))
log_F_sigma <- array(0, c(n_T, n_F))
log_X_sigma <- array(0, c(n_T, n_X))

# x hyperparameters
x_sigma <- exp(rep(rnorm(1), n_X))
# x_sigma <- abs(rep(rnorm(1), n_X))
# log_X_sigma_loc <- rnorm(n_X, mean=0, sd=0.25)
# log_X_sigma_scale <- abs(rnorm(n_X, mean=0, sd=0.25))
# chol_X_sigma <- matrix(0, n_X, n_X)
# diag(chol_X_sigma) <- abs(rnorm(n_X, mean=0, sd=0.1))

# f hyperparameters
log_F_sigma_loc <- rnorm(n_F, mean=0, sd=0.25)
# log_F_sigma_scale <- rnorm(n_F, mean=0, sd=0.25)
log_F_sigma_scale <- abs(rnorm(n_F, mean=0, sd=0.25))
# chol_F_sigma <- diag(abs(rnorm(n_F, mean=0, sd=0.1)))
chol_F_sigma <- matrix(rnorm(n_F * n_F, sd=0.1), c(n_F, n_F))
chol_F_sigma[upper.tri(chol_F_sigma)] <- 0
diag(chol_F_sigma) <- abs(diag(chol_F_sigma))

# B hyperparameters
# Want B to be 'almost sparse' where B contains mostly values close to 0.05
base_order <- 0.05
bias_order <- 0.5
p_connect <- 0.3
B_init <- matrix(0, n_X, n_F)
diag(B_init) <- 1
n_l_t <- sum(lower.tri(B_init))
B_init[lower.tri(B_init)] <- rnorm(n_l_t, 0, .05) +
                            rbinom(n_l_t, 1, .3) * (2 * rbinom(n_l_t, 1, .5) - 1) *
                            # rnorm(n_l_t, bias_order, 1)
                            (bias_order + abs(rnorm(n_l_t)))

# Initialise
log_F_sigma[1, ] <- log_F_sigma_loc + chol_F_sigma %*% rnorm(n_F)
B[1, , ] <- B_init
# log_X_sigma[1, ] <- log_X_sigma_loc + chol_X_sigma %*% rnorm(n_X)
X[1, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
# X[1, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + exp(log_X_sigma[1, ]) * rnorm(n_X)

# Generate data
for (i in 2:n_T){
  # log_F_sigma[i, ] <- log_F_sigma_loc + (log_F_sigma_scale * log_F_sigma[i-1, ]) + chol_F_sigma %*% rnorm(n_F)
  log_F_sigma[i, ] <- log_F_sigma_loc + (log_F_sigma_scale * (log_F_sigma[i-1, ] - log_F_sigma_loc)) + chol_F_sigma %*% rnorm(n_F)
  B[i, , ] <- B[i-1, , ]
  # log_X_sigma[i, ] <- log_X_sigma_loc + (log_X_sigma_scale * log_X_sigma[i-1, ]) + chol_X_sigma %*% rnorm(n_X)
  X[i,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + x_sigma * rnorm(n_X)
  # X[i,  ] <- (B[i, , ] %*% (exp(log_F_sigma[i, ]) * rnorm(n_F))) + exp(log_X_sigma[i, ]) * rnorm(n_X)
}

# Standardise X
X_mu <- colMeans(X, dims=1)
X_sd <- apply(X, 2, sd)
X_s <- scale(X)

matplot(X, type='l')
matplot(log_F_sigma, type='l')
# matplot(log_X_sigma, type='l')

# base <- cmdstan_model('infer_tv_base.stan', quiet=FALSE, force_recompile=TRUE)
# fit_b <- base$sample(data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X),
#                      num_samples=n_D,
#                      num_warmup=n_W,
#                      refresh=100,
#                      # init=R,
#                      seed=seed,
#                      adapt_delta=AD,
#                      max_depth=TD,
#                      # stepsize=SS,
#                      num_chains=1,
#                      num_cores=4)

stanfit <- stan(file='infer_tv_base.stan',
            data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X),
            iter=(n_D+n_W),
            warmup=n_W,
            refresh=100,
            seed=seed,
            chains=1,
            control=list(adapt_delta=AD, max_treedepth=TD))

stanfit_alt <- stan(file='infer_tv_base_alt.stan',
                data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X),
                iter=(n_D+n_W),
                warmup=n_W,
                refresh=100,
                seed=seed,
                chains=1,
                control=list(adapt_delta=AD, max_treedepth=TD))

params = rnorm(get_num_upars(stanfit))
sum(abs(grad_log_prob(stanfit, params) - grad_log_prob(stanfit_alt, params)))

It looked to me like these two models have the same gradients.

Here’s a params vector I used: params.RDS.r

You can load that up and test it with:

m1 = stan_model("m1.stan")
m2 = stan_model("m2.stan")

f1 = sampling(m1, data = list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X), iter = 1, chains = 1)
f2 = sampling(m2, data = list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X), iter = 1, chains = 1)

params = readRDS("params.RDS.r")
g1 = grad_log_prob(f1, params)
g2 = grad_log_prob(f2, params)
max(abs(g1 - g2))

The result was: 7.629395e-06.

For some random values the differences are large, but then the absolute values of the gradients in those cases are really large (like 1e16). So I think the gradients are the same.

So I think we’re back to here: Speeding up and reducing divergences for multivariate hierarchical stochastic factor model

I went back to basics and implemented a multivariate stochastic volatility model. Even then I can’t get log_f_sd_scale to work…

DGP

rm(list=ls())
gc()

library(posterior)
library(tidyverse)
options(tibble.print_max = Inf)

library(cmdstanr)
library(rstan)
options(mc.cores = parallel::detectCores())

seed <- 1
set.seed(seed)

# Simulate data according to Lopes and Carvalho (2007) without switching
n_T <- 100
n_F <- 3
n_X <- 10
n_D <- 1000
n_W <- 1000
R <- 2
AD <- 0.95
TD <- 11
# SS <- 0.01
discount <- 1

# Create data structures
X <- array(0, c(n_T, n_F))
log_F_sigma <- array(0, c(n_T, n_F))

# f hyperparameters
log_F_sigma_loc <- rnorm(n_F, sd=0.5)
log_F_sigma_scale <- abs(rnorm(n_F, mean=0, sd=0.5))
# log_F_sigma_scale <- rbeta(n_F, 2, 8)
chol_F_sigma <- matrix(rnorm(n_F * n_F, sd=0.2), c(n_F, n_F))
chol_F_sigma[upper.tri(chol_F_sigma)] <- 0
diag(chol_F_sigma) <- abs(diag(chol_F_sigma))

# Initialise
log_F_sigma[1, ] <- log_F_sigma_loc + chol_F_sigma %*% rnorm(n_F) # log_F_sigma_loc * rnorm(n_F)
# log_X_sigma[1, ] <- log_X_sigma_loc + chol_X_sigma %*% rnorm(n_X)
X[1, ] <- exp(log_F_sigma[1, ]) * rnorm(n_F)
# X[1, ] <- (B[1, , ] %*% (exp(log_F_sigma[1, ]) * rnorm(n_F))) + exp(log_X_sigma[1, ]) * rnorm(n_X)

# Generate data
for (i in 2:n_T){
  # log_F_sigma[i, ] <- log_F_sigma_loc + (log_F_sigma_scale * log_F_sigma[i-1, ]) + chol_F_sigma %*% rnorm(n_F)
  log_F_sigma[i, ] <- log_F_sigma_loc + (log_F_sigma_scale * (log_F_sigma[i-1, ] - log_F_sigma_loc)) + chol_F_sigma %*% rnorm(n_F)
  X[i,  ] <- exp(log_F_sigma[i, ]) * rnorm(n_F)
}

# Standardise X
X_mu <- colMeans(X, dims=1)
X_sd <- apply(X, 2, sd)
# X_s <- scale(X)

matplot(X, type='l')
matplot(log_F_sigma, type='l')
# matplot(log_X_sigma, type='l')

base <- cmdstan_model('infer_tv_MSV.stan', quiet=FALSE, force_recompile=TRUE)
fit_b <- base$sample(data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X),
                     num_samples=n_D,
                     num_warmup=n_W,
                     refresh=1000,
                     # init=R,
                     seed=seed,
                     adapt_delta=AD,
                     max_depth=TD,
                     # stepsize=SS,
                     num_chains=1,
                     num_cores=4)

stanfit <- read_stan_csv(fit_b$output_files())
par_sum <- summary(stanfit, pars=c('chol_log_f_sd_sd', 'log_f_sd_loc', 'log_f_sd_scale'))$summary

params <- extract(stanfit)
chol_F_sigma_hat <- colMeans(params$chol_log_f_sd_sd, dims=1)
log_F_sigma_loc_hat <- colMeans(params$log_f_sd_loc, dims=1)
log_F_sigma_scale_hat <- colMeans(params$log_f_sd_scale, dims=1)
par(pty="s")
plot(c(t(chol_F_sigma_hat)) ~ c(t(chol_F_sigma)), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Cholesky", ylab="Estimated Log Factor Cholesky", asp=1)
fit_chol <- lm(c(t(chol_F_sigma_hat)) ~ c(t(chol_F_sigma))) # Fit a linear model
abline(fit_chol)
abline(0,1, lty = 2) # 1:1 line
plot(c(log_F_sigma_loc_hat) ~ c(log_F_sigma_loc), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Location", ylab="Estimated Log Factor Location", asp=1)
fit_loc <- lm(c(log_F_sigma_loc_hat) ~ c(log_F_sigma_loc)) # Fit a linear model
abline(fit_loc)
abline(0,1, lty = 2) # 1:1 line
plot(c(log_F_sigma_scale_hat) ~ c(log_F_sigma_scale), cex=2, pch=10, col="dark gray", xlab="Simulated Log Factor Scale", ylab="Estimated Log Factor Scale", asp=1)
fit_scale <- lm(c(log_F_sigma_scale_hat) ~ c(log_F_sigma_scale)) # Fit a linear model
abline(fit_scale)
abline(0,1, lty = 2) # 1:1 line

print(fit_b$draws() %>%
        subset_draws(variable = c('lp__', 'x_sd', 'chol_log_f_sd_sd', 'log_f_sd_loc', 'log_f_sd_scale'), regex = TRUE) %>%
        summarise_draws())

print(log_F_sigma_scale)

Stan code

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[F] x[TS];
}
transformed data {
  // vector[F] beta_diag = rep_vector(1.0, F);
  // int beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
  matrix[F, F] ones = rep_matrix(1.0, F, F); // only used by the commented-out stationary initialisation below
}  
parameters {
  // nuisance parameters
  vector[F] z_log_f_sd[TS]; // reparameterised log latent factor standard deviations
  // vector[F] z_fac[TS]; // reparameterised latent factors
  
  // parameters
  vector[F] log_f_sd_loc; // latent factor standard deviation location
  vector<lower=0.0, upper=1.0>[F] log_f_sd_scale; // latent factor standard deviation scale

  cholesky_factor_corr[F] log_f_sd_Omega;
  vector<lower=0.0>[F] log_f_sd_tau;
}
transformed parameters { 
  vector[F] log_f_sd[TS];
  matrix[F, F] chol_log_f_sd_sd = diag_pre_multiply(log_f_sd_tau, log_f_sd_Omega);
  // matrix[F, F] init_chol = cholesky_decompose(chol_log_f_sd_sd * chol_log_f_sd_sd' ./ (ones - log_f_sd_scale * log_f_sd_scale'));

  log_f_sd[1] = log_f_sd_loc + chol_log_f_sd_sd * z_log_f_sd[1];
  // log_f_sd[1] = log_f_sd_loc + init_chol * z_log_f_sd[1];
  for (t in 2:TS){
    log_f_sd[t] = log_f_sd_loc + (log_f_sd_scale .* (log_f_sd[t-1] - log_f_sd_loc)) + (chol_log_f_sd_sd * z_log_f_sd[t]);
  }
}
model {
  vector[F] sigma;
  // priors
  log_f_sd_loc ~ std_normal();
  log_f_sd_scale ~ normal(0, 1);
  // for (f in 1:F){
  //   log_f_sd_scale[f] ~ student_t(2,0,1) T[0, ];
  // }
  // log_f_sd_scale ~ beta(2, 1./20.);
  // log_f_sd_scale ~ gamma(2, 1./8.);
  // log_f_sd_scale ~ lognormal(-2, 0.5);
  
  log_f_sd_Omega ~ lkj_corr_cholesky(1);
  log_f_sd_tau ~ cauchy(0, 0.5);
  
  // likelihood 
  for (t in 1:TS){
    z_log_f_sd[t] ~ std_normal();
    x[t] ~ normal(0, exp(log_f_sd[t]));
  }
}

Any ideas?

I haven’t been looking at the model, but do you think the reference Math in the manual might be wrong?: https://mc-stan.org/docs/2_23/stan-users-guide/stochastic-volatility-models.html

I’m looking specifically at where it goes from:

y_t = \epsilon_t e^{h_t / 2}

To:

y_t \sim N(0, e^{h_t / 2})

That just seems wrong. It looks like it is based on the assumption that \epsilon \sim N(0, 1). Looking at http://apps.olin.wustl.edu/faculty/chib/papers/KimShephardChib98.pdf, it seems like the most basic thing to do is assume \log \epsilon_t is normally distributed (and they say even that is an approximation to something else). This has a different implication.

We can write that out:

\log \epsilon_t \sim N(\mu, \sigma), \qquad \log y_t = \log \epsilon_t + h_t / 2

And then get a likelihood for y_t:

\log y_t \sim N(h_t / 2 + \mu, \sigma)

I’m not sure what assumptions are made about \mu and \sigma. Maybe it should be 0, 1, but I don’t know. In any case the Stan program would be:

y[t] ~ lognormal(h[t] / 2 + mu, sigma);

I think there’s a misunderstanding here: if y had a lognormal distribution, it wouldn’t be able to take on negative values; it’s the sd/variance of y that is lognormally distributed, i.e. lognormal(mu + scale .* (h[t-1] - mu), sigma), with scale in (-1, 1), though I’ve set it to (0, 1) to make things easier.
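
To spell out the process I mean (this is just restating what the Stan code above does, writing \mu for log_f_sd_loc, \phi for log_f_sd_scale and L for chol_log_f_sd_sd):

\lambda_t = \mu + \phi \circ (\lambda_{t-1} - \mu) + L z_t, \qquad z_t \sim N(0, I)

x_{t,i} \sim N(0, e^{\lambda_{t,i}})

where the second argument of N is the standard deviation, as in Stan.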

My intuition of what’s wrong with scale is that the maximum likelihood solution, given a normal(0, x) prior, would shrink its elements towards 0, which works if all elements of scale have small values. However, since I’m generating scale as abs(rnorm(0, y)), there is a non-zero chance that scale contains elements that are quite different (e.g. 0.1 vs 0.8).

I’ll concede I’m wrong. The paper also says that \epsilon_t is white noise, so I guess the derivation makes sense.

Is what is being implemented here exactly the 2007 Lopes/Carvalho paper (https://faculty.mccombs.utexas.edu/carlos.carvalho/LopesCarvalhoFSV.pdf)?

Or are there extensions? Asking cuz if I’m gonna check the model any closer I gotta have somewhere to look. As you can probably tell, I don’t know crap about these models.

I’m trying to implement Lopes/Carvalho ideally but I’m defaulting to the base case of Aguilar and West first as I can’t get even this to work.

Thanks so much helping out >.<

I had a look at the paper today (this one).

So the likelihood in eq. 2 is:

y_t = \theta_t + X_t f_t + \epsilon_t

Apparently we’re making the assumptions X_t = X and \theta_t = \theta, so we’ll drop those.

We also have:

f_t \sim N(0, H_t)

Where H_t = \text{diag}(h_{t1}, h_{t2}, \ldots, h_{tk})

\epsilon_t \sim N(0, \Psi_t)

Where \Psi_t = \text{diag}(\psi_{t1}, \psi_{t2}, \ldots, \psi_{tk})

So we get rid of \epsilon_t like:

y_t \sim N(\theta + X f_t, \Psi_t)

And now we gotta figure out where f_t and \Psi_t come from (X is on page 2, though I don’t know what the priors would be).

So it looks like h_{ti} = e^{\lambda_{ti}}, and then \lambda_{ti} is an autoregressive thing with multivariate innovations (with covariance U).

It looks like \psi_{tj} = e^{\eta_{tj}}, and then \eta_{tj} is another autoregressive thing, but this time the innovations have a diagonal covariance.

Alright enough for now. I think this model has two autoregressive processes in it, and the likelihood for y should look something like:

x[t] ~ normal(theta + X[t] * exp(nu[t]), exp(lambda[t]));

This seems like it will be super non-identified unless there are more simplifications or assumptions I missed. If there are N * P values of x, there are at least that many values of X, nu, and lambda to estimate, along with whatever other hyperparameters go in there.

I spoke to my collaborators and they said that a simpler model would suffice at the moment, and I’ve gotten it to work with an updated DGP and Stan model. Apart from a few divergences here and there, I get decent results. Could I get you to have a look at the model to see if you can reduce the divergences to zero or further optimise the runtime?

We’ve simplified the DGP to non-time-varying f and \epsilon_y while still allowing \beta to change across time. In equations, we have f_t \sim N(0, I); \epsilon_{y,t} \sim N(0, \Sigma_y), where \Sigma_y is a diagonal matrix; and \beta_t = \beta_l + \beta_s(\beta_{t-1} - \beta_l) + \epsilon_{\beta,t}, where \beta_l is the location parameter, \beta_s the scale parameter, and \epsilon_{\beta,t} \sim N(0, \Sigma_\beta) with \Sigma_\beta a diagonal matrix.
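
Written out as a system (restating the above; the x_t equation is implicit and matches the DGP code below):

x_t = \beta_t f_t + \epsilon_{y,t}

f_t \sim N(0, I), \qquad \epsilon_{y,t} \sim N(0, \Sigma_y)

\beta_t = \beta_l + \beta_s \circ (\beta_{t-1} - \beta_l) + \epsilon_{\beta,t}, \qquad \epsilon_{\beta,t} \sim N(0, \Sigma_\beta)

with \Sigma_y and \Sigma_\beta diagonal and \circ denoting elementwise multiplication.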

The weirdest thing is that generating \beta_s \sim Unif(0, 1) was the key to making inference work. I’ve checked that the results are relatively stable across priors. Some of the rhats for \Sigma_y are larger than 1.01, while the rest are between 0.99 and 1.01.

The DGP and Stan call:

library(posterior)
library(tidyverse)
options(tibble.print_max = Inf)

# library(reticulate)
# use_python("/usr/bin/python")
# np <- import("numpy")

library(cmdstanr)
library(rstan)
# options(mc.cores = parallel::detectCores())

seed <- 1
set.seed(seed)

# Simulate data according to Lopes and Carvalho (2007) without switching
n_T <- 100
n_F <- 3
n_X <- 10
n_D <- 1000
n_W <- 1000
R <- 0.5
AD <- 0.999
TD <- 15
# SS <- 0.1
discount <- 1

# Create data structures
X <- array(0, c(n_T, n_X))
B <- array(0, c(n_T, n_X, n_F))

# x hyperparameters 
# x_sigma <- abs(rep(rnorm(1, sd=0.25), n_X))
x_sigma <- abs(rnorm(n_X, sd=0.25))
# x_sigma <- rep(1, n_X)


# f hyperparameters
# F_sigma <- abs(rnorm(n_F, sd=0.5))

# B hyperparameters
# Want B to be 'almost sparse' where B contains mostly values close to 0.05
base_order <- 0.1
bias_order <- 0.5
p_connect <- 0.3
B_loc <- matrix(0, n_X, n_F)
diag(B_loc) <- 1
n_l_t <- sum(lower.tri(B_loc))
B_loc[lower.tri(B_loc)] <- rnorm(n_l_t, 0, .05) +
                           rbinom(n_l_t, 1, .3) * (2 * rbinom(n_l_t, 1, .5) - 1) * 
                           rnorm(n_l_t, bias_order, 1)
                           # (bias_order + abs(rnorm(n_l_t)))
B_scale <- matrix(0, n_X, n_F)
B_scale[lower.tri(B_scale)] <- runif(n_l_t, 0, 1) # abs(rnorm(n_l_t, mean=0.5, sd=0.5)) # rbeta(n_l_t, 2, 2) # rep(0.5, n_l_t)
B_sigma <- matrix(0, n_X, n_F)
B_sigma[lower.tri(B_sigma)] <- abs(rnorm(n_l_t, mean=0, sd=0.2)) # rbeta(n_l_t, 2, 2) # rep(0.5, n_l_t)

# Initialise
# B[1, , ] <- B_loc + base_order * lower.tri(B[1, , ]) * matrix(rnorm(n_X * n_F), c(n_X, n_F))
B[1, , ] <- B_loc + B_sigma * matrix(rnorm(n_X * n_F), c(n_X, n_F))
# X[1, ] <- (B[1, , ] %*% (F_sigma * rnorm(n_F))) + x_sigma * rnorm(n_X)
X[1, ] <- (B[1, , ] %*% rnorm(n_F)) + x_sigma * rnorm(n_X)

# Generate data
for (i in 2:n_T){
  # B[i, , ] <- B_loc + (B_scale * (B[i-1, , ] - B_loc)) + discount * base_order * lower.tri(B[i, , ]) * matrix(rnorm(n_X * n_F), c(n_X, n_F))
  B[i, , ] <- B_loc + (B_scale * (B[i-1, , ] - B_loc)) + discount * B_sigma * matrix(rnorm(n_X * n_F), c(n_X, n_F))
  X[i,  ] <- (B[i, , ] %*% rnorm(n_F)) + x_sigma * rnorm(n_X)
}

# # Standardise X
# X_mu <- colMeans(X, dims = 1)
# X_sd <- apply(X, 2, sd)
# X_s <- scale(X)

matplot(X, type='l')
matplot(matrix(B, c(n_T, n_X * n_F)), type='l')

model <- cmdstan_model('infer_tv.stan', quiet=FALSE, force_recompile=TRUE)
fit <- model$sample(data=list(P=n_X, F=n_F, TS=n_T, disc=discount, x=X), 
                    num_samples=n_D, 
                    num_warmup=n_W, 
                    refresh=100, 
                    init=R, 
                    seed=seed, 
                    adapt_delta=AD, 
                    max_depth=TD, 
                    # stepsize=SS,
                    num_chains=1,
                    num_cores=4)

stanfit <- read_stan_csv(fit$output_files())

# print(x_sigma)
# print(chol_F_sigma) 
# print(log_F_sigma_loc) 
# print(log_F_sigma_scale)

params <- extract(stanfit)
x_sd_hat <- colMeans(params$x_sd, dims=1)
beta_lower_sd_hat <- colMeans(params$beta_lower_sd, dims=1)
beta_lower_loc_hat <- colMeans(params$beta_lower_loc, dims=1)
beta_lower_scale_hat <- colMeans(params$beta_lower_scale, dims=1)
# F_sigma_hat <- colMeans(params$f_sd, dims=1)
pdf("comp.pdf")
par(pty="s")
plot(c(x_sd_hat) ~ c(x_sigma), cex=2, pch=10, col="dark gray", xlab="Simulated X SD", ylab="Estimated X SD")
fit_scale <- lm(c(x_sd_hat) ~ c(x_sigma)) # Fit a linear model
abline(fit_scale)
abline(0,1, lty = 2) # 1:1 line
plot(c(beta_lower_sd_hat) ~ c(B_sigma[lower.tri(B_sigma)]), cex=2, pch=10, col="dark gray", xlab="Simulated Beta SD", ylab="Estimated Beta SD")
fit_scale <- lm(c(beta_lower_sd_hat) ~ c(B_sigma[lower.tri(B_sigma)])) # Fit a linear model
abline(fit_scale)
abline(0,1, lty = 2) # 1:1 line
plot(c(beta_lower_loc_hat) ~ c(B_loc[lower.tri(B_loc)]), cex=2, pch=10, col="dark gray", xlab="Simulated Beta Loc", ylab="Estimated Beta Loc")
fit_scale <- lm(c(beta_lower_loc_hat) ~ c(B_loc[lower.tri(B_loc)])) # Fit a linear model
abline(fit_scale)
abline(0,1, lty = 2) # 1:1 line
plot(c(beta_lower_scale_hat) ~ c(B_scale[lower.tri(B_scale)]), cex=2, pch=10, col="dark gray", xlab="Simulated Beta Scale", ylab="Estimated Beta Scale")
fit_scale <- lm(c(beta_lower_scale_hat) ~ c(B_scale[lower.tri(B_scale)])) # Fit a linear model
abline(fit_scale)
abline(0,1, lty = 2) # 1:1 line
# plot(c(F_sigma_hat) ~ c(F_sigma), cex=2, pch=10, col="dark gray", xlab="Simulated F SD", ylab="Estimated F_SD")
# fit_scale <- lm(c(F_sigma_hat) ~ c(F_sigma)) # Fit a linear model
# abline(fit_scale)
# abline(0,1, lty = 2) # 1:1 line
dev.off()

print(fit$draws() %>%
        subset_draws(variable = c('lp__', 'x_sd', 'beta_lower_scale'), regex = TRUE) %>%
        summarise_draws())
# saveRDS(fit, "draws.rds")
# 
# # Save draws as Python pickle
# py_save_object(r_to_py(params), 'draws.pkl')

The Stan model is largely the same:

data {
  int <lower=1> P; // number of dimensions 
  int <lower=1> F; // number of latent factors
  int <lower=1> TS; // number of time steps
  real disc; // discount
  vector[P] x[TS];
}
transformed data {
  vector[F] beta_diag = rep_vector(1.0, F);
  int beta_l = F * (P - F) + F * (F - 1) / 2; // number of lower-triangular, non-zero loadings
}
parameters {
  // nuisance parameters
  vector[F] z_fac_sd[TS]; // reparameterised latent factors
  vector[beta_l] z_beta_lower_sd[TS]; // reparameterised lower diagonal loading standard deviation
  
  // parameters
  vector[beta_l] beta_lower_loc; // lower diagonal loadings location
  vector<lower=0.0, upper=1.0>[beta_l] beta_lower_scale; // lower diagonal loadings scale
  // real<lower=0.0, upper=1.0> raw_beta_lower_scale; // lower diagonal loadings scale
  // vector<lower=0>[F] beta_diag; // positive diagonal loadings
  
  // priors
  // real<lower=0> raw_beta_lower_sd;
  vector<lower=0>[beta_l] beta_lower_sd;
  // real<lower=0> x_sd; // x standard deviation
  vector<lower=0>[P] x_sd;
}
transformed parameters {
  vector[beta_l] beta_lower[TS];
  matrix[P, F] beta [TS];
  
  // vector[beta_l] beta_lower_scale = rep_vector(raw_beta_lower_scale, beta_l);
  // vector[beta_l] beta_lower_sd = rep_vector(raw_beta_lower_sd, beta_l);
  
  for (t in 1:TS){
    beta_lower[t] = beta_lower_loc + beta_lower_sd .* z_beta_lower_sd[t];
    // beta_lower[t] = beta_lower_loc + beta_lower_sd * z_beta_lower_sd[t];
    if (t > 1){
      beta_lower[t] += beta_lower_scale .* (beta_lower[t-1] - beta_lower_loc);
      }
      
    {
    int idx;
    idx = 1;
    for (j in 1:F) {
      beta[t][j, j] = beta_diag[j]; // set positive diagonal loadings
      for (k in (j+1):F){
        beta[t][j, k] = 0; // set upper triangle values to 0
        }
      for (i in (j+1):P){
        beta[t][i, j] = beta_lower[t][idx]; // set lower diagonal loadings
        idx = idx + 1;
        }
      }
    }
  }
}
model {
  vector[P] mu;
  // priors
  beta_lower_loc ~ std_normal();
  // beta_lower_scale ~ std_normal();
  beta_lower_scale ~ cauchy(0, 0.5);
  // beta_lower_scale ~ lognormal(-2, 1);
  // beta_lower_scale ~ beta(2, 5);
  // beta_lower_scale ~ gamma(2, 1./10.);
  // beta_lower_scale ~ beta(2., 2.);
  
  // beta_lower_sd ~ std_normal();
  beta_lower_sd ~ normal(0, 0.1);
  
  // x_sd ~ gamma (2, 1./10.);
  // x_sd ~ std_normal();
  x_sd ~ normal(0, 0.1);
  
  for (t in 1:TS){
    z_fac_sd[t] ~ std_normal();
    z_beta_lower_sd[t] ~ std_normal();
    // x[t] ~ normal(beta[t] * (f_sd .* z_fac_sd[t]), x_sd);
    // x[t] ~ normal(beta[t] * z_fac_sd[t], x_sd);
    mu = beta[t] * z_fac_sd[t];
    for (p in 1:P){
      x[t][p] ~ normal(mu[p], x_sd[p]);
    }
  }
}

Let me know if you have any thoughts.

I had a look at this, but I wasn’t able to completely match the code with the equations. It looks reasonable though.

I guess this:

mu = beta[t] * z_fac_sd[t];

Is \mu = f_t, X = I (going from the paper y_t = \theta_t + X_t f_t + \epsilon_t)?

In which case that would mean f_t \sim N(0, B B^T) where B is a matrix constructed from the process described by \beta_t = \beta_l + \beta_s (\beta_{t - 1} - \beta_l).

Not saying things are wrong, just repeating what I’m reading.

Yeah that is. If the prior doesn’t seem to matter, I wonder if the constraints are just making the initial conditions easier to figure out?

In the end, do the estimates push up against the boundary at 1.0?

Parameters in Stan are initialized by default randomly in [-2, 2] on the unconstrained space. By changing the constraints on that parameter, you change where it is initialized.

The way I’m reading this code:

x[t][p] ~ normal(mu[p], x_sd[p]);

Is something like:

\mu \sim N(0, \Sigma_\mu), \qquad x_t \sim N(\mu, \Sigma_x)

In which case I think you can integrate out \mu. From the definitions here, you have p(x_t | \mu) and p(\mu), and they’re both normal, so you should be able to work out p(x_t, \mu) and then marginalise over \mu, which gives you p(x_t). I think that’s how that works. You get your likelihood in terms of your hyperparameters (I didn’t indicate those in the notation), and then those are the only ones you need to sample.
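
If I’m reading the model right, that works out to the following (a sketch, with \Sigma_x the diagonal matrix of squared x_sd values):

\mu = \beta_t z_t, \quad z_t \sim N(0, I) \;\Rightarrow\; \mu \sim N(0, \beta_t \beta_t^\top) \;\Rightarrow\; x_t \sim N(0, \beta_t \beta_t^\top + \Sigma_x)

so z_fac_sd would not need to be sampled at all.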

If that is not possible, the normal_id_glm function can speed things up a bit (see the Stan Functions Reference).
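
As a rough, untested sketch of how normal_id_glm could slot into the model block above, treating beta[t] as the design matrix and z_fac_sd[t] as the coefficients (passing x_sd as a vector may require a recent Stan version; otherwise use a scalar x_sd):

// hypothetical replacement for the per-dimension likelihood loop in the model block
for (t in 1:TS){
  z_fac_sd[t] ~ std_normal();
  z_beta_lower_sd[t] ~ std_normal();
  // y = x[t], design matrix = beta[t], intercept = 0, coefficients = z_fac_sd[t]
  x[t] ~ normal_id_glm(beta[t], 0.0, z_fac_sd[t], x_sd);
}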

Sorry for the delay, was busy with NeurIPS.

That’s exactly right.

Yep, you’re spot on there. I have a bunch of entries in beta_scale that are around 0.8 and they tend to be overestimated by the model. See the last plot of the attached comparison plots.

I’ve been initialising between [-0.5, 0.5] the whole time.

I actually started with drawing x \sim N(0, BB^T + \Sigma) directly, where \Sigma is a diagonal matrix, but doing that in Stan is super slow as I’ll be running multi_normal(0, BB^T + \Sigma). Wouldn’t it also encourage divergences?
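
For reference, the marginalised likelihood I mean looks roughly like this (a sketch only; the cholesky_decompose of a P x P matrix at every time step is presumably what makes it slow):

// sketch of the marginalised likelihood with z_fac_sd integrated out
for (t in 1:TS){
  matrix[P, P] Sigma_marg = tcrossprod(beta[t]); // beta[t] * beta[t]'
  for (p in 1:P){
    Sigma_marg[p, p] += square(x_sd[p]); // add the diagonal idiosyncratic variances
  }
  x[t] ~ multi_normal_cholesky(rep_vector(0, P), cholesky_decompose(Sigma_marg));
}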