Hello,
I am fairly new to Stan and I am experimenting with a Bayesian Dynamic Factor Model. The model has a large number of time series, N>75, and I want to estimate K~5 factors that capture most of the variation of these time series. Formally, the model is the following:
Where Y_t is a N \times 1 vector of observations at time t, F_t is a K\times 1 vector of unobserved factors, \Lambda is the N \times K loadings matrix, \Psi and \Phi are VAR(1) transition matrices, \epsilon and \nu are appropriately sized vectors of errors.
My issues are with N>75, when the onion method fails and I cannot start sampling since the covariance matrix is not appropriate and/or contains nan
values. Also, I get occasionally the following error
Chain 1: Rejecting initial value:
Chain 1: Log probability evaluates to log(0), i.e. negative infinity.
Chain 1: Stan can't start sampling from this initial value.
Strangely enough, though, I also seem not to sample correctly for one component of the onion machinery. Indeed, L
ends up containing only constant values and inspecting Sigma
I see correlation values which I cannot really interpret compared to observed ones.
Warning messages:
1: In validityMethod(object) :
The following variables have undefined values: L[1,2],The following variables have undefined values: L[1,3],The following variables have undefined values: L[2,3],The following variables have undefined values: L[1,4],The following variables have undefined values: L[2,4],
Now, is there something wrong with my Stan code that I should amend? Is it the onion implementation or some other part that does not check with the model outlined above?
Here is the full stan code I am working on
data {
int<lower=0> T; // Number of time periods
int<lower=0> N; // Number of time series
int<lower=0> K; // Number of factors
matrix[N, T] Y; // Scaled observed time series data, T cols are periods
}
parameters {
matrix[N, K] loadings; // loadings
matrix[K, K] Phi; // VAR(1) matrix for factors
matrix[N, N] Psi; // VAR(1) matrix for observables
matrix[K, T] nu_s; // errors on measurement equation
// onion
vector[choose(N, 2) - 1] l; // do NOT init with 0 for all elements
vector<lower = 0,upper = 1>[N-1] R2; // first element is not really a R^2 but is on (0,1)
}
transformed parameters {
matrix[K, T] factors; // factors for the state equation
// init factors and sample from the error to speed up
factors[:, 1] = rep_vector(1, K);
for (t in 2:T) {
factors[:, t] = Phi*factors[:, t-1] + eps_s[:, t];
}
// onion
real<lower=0> eta = 2;
real<lower = 0> alpha = eta + (N - 2) / 2.0;
vector<lower = 0>[N-1] shape1;
vector<lower = 0>[N-1] shape2;
shape1[1] = alpha;
shape2[1] = alpha;
for(n in 2:(N-1)) {
alpha = alpha - 0.5;
shape1[n] = n / 2.0;
shape2[n] = alpha;
}
matrix[N,N] Sigma;
matrix[N,N] L;
{
int start = 1;
int end = 2;
L[1,1] = 1.0;
L[2,1] = 2.0 * R2[1] - 1.0;
L[2,2] = sqrt(1.0 - square(L[2,1]));
for(n in 2:(N-1)) {
int kp1 = n + 1;
vector[n] l_row = segment(l, start, n);
real scale = sqrt(R2[n] / dot_self(l_row));
for(j in 1:n) L[kp1,j] = l_row[j] * scale;
L[kp1,kp1] = sqrt(1.0 - R2[n]);
start = end + 1;
end = start + n - 1;
}
}
Sigma = multiply_lower_tri_self_transpose(L);
}
model {
// one restriction on the loadings
to_vector(loadings[:, K]) ~ normal(1, .01);
to_vector(loadings[:, 1:(K-1)]) ~ std_normal();
to_vector(Phi) ~ std_normal();
to_vector(eps_s) ~ std_normal();
to_vector(Psi) ~ std_normal();
// onion
R2 ~ beta(shape1, shape2);
vs_y ~ cauchy(0, 2.5);
l ~ std_normal();
for (t in 2:T){
Y[:, t] ~ multi_normal(
loadings*factors[:,t] + Psi*Y[:, t-1],
Sigma
);
}
}
Here is the code I have written with bayesdfm
to simulate some data and test the model. Playing around with the number of time series I notice that with N<25 all works fine (built-in methods and onion one), while with N>50 I start getting errors also with the onion method: the sampler sputters but gets a good draw before hitting the limit. After N>75, the sampler doesn’t even start
library(rstan)
library(bayesdfm)
# number of factors
K <- 5
# periods
TT <- 300
# observables
N <- 250
# declare loadings matrix
loadings <- matrix(
nrow = N,
ncol = K,
data = c(abs(rnorm(K-1, 0, 3)), 1),
byrow = T
)
# simulate some data
sim <- sim_dfa(
num_trends = K,
num_years = TT,
num_ts = N,
loadings_matrix = loadings,
sigma = 3
)
data_v <- sim$y_sim %>%
t %>%
as.data.frame() %>%
# make all series positive and take growth rates
mutate(
across(
.cols = everything(),
.fns = ~ .x + abs(min(.)) + 10
),
across(
.cols = everything(),
.fns = ~ 100*(.x/dplyr::lag(.x) - 1)
)
) %>%
# scale series to be standard normal
apply(2, scale) %>%
na.omit(.) %>%
# select some or all series
# .[, 1:ncol(.)]
.[, 1:10]
data_h <- data_v %>%
t
list_h <- list(
T = ncol(data_h),
N = nrow(data_h),
Y = data_h,
K = 3
)
test <- stan_model("./bsln_onion.stan")
cmg_propcov <- rstan::sampling(
object = test,
data = list_h,
iter = 100,
chains = 1,
cores = 1,
verbose = T,
show_messages = T
)
A side note: I am a big fan of Stan and its community, it would be great if more time series material was made available, increasing even further the user base and Stan’s applications!