 # Hierarchical dynamic linear model with rSTAN, problems with fitting a covariance matrix

I have a HDLM that looks as follows:
y_t \sim \mathcal{N}(X_t \cdot \beta_t, V),
\beta_t \sim \mathcal{N}(\beta_{t-1}, W),
\beta_0 \sim \mathcal{N}(m_0, W),

with t \in \{1,\ldots,N\}, y_t being a vector of length m, X_t being a matrix of size m \times mp and \beta_t – a column vector of length mp.

I have implemented it successfully using JAGS, setting the prior on W to be inverse Wishart with the expected value being some random positive-definite matrix. I’m getting a very satisfying fit with that.

Now, looking for a speed-up, I’ve tried to implement the model using STAN. Here, I’m providing the covariance matrix V on input. Following STAN documentation, to describe W I’m doing

parameters {
corr_matrix[m*p] Omega;   // prior correlation
vector<lower=0>[m*p] tau; // prior scale
...
}

model {
matrix[m*p, m*p] W;
tau ~ cauchy(0, 2.5);
Omega ~ lkj_corr(2);
...


The whole model is

data {
int<lower=1> N;     // maximal time
int<lower=1> m;     // the number of cells
int<lower=1> p;     // number of covariates, including intercept
cov_matrix[m] V;    // covariance matrix of sales
matrix[m, N] Y;     // the matrix of sales: Y[k,t] = sales in cell k at time t; Y[:,t] = sales at time t
real X[N,m,m*p];    // this is an array; X[t,:,:] = covariates at time t
vector[m*p] m0;     // expected coefficients at time 0
}

parameters {
matrix[m*p, N+1] beta;   // beta[:,t] = coefficients at time t
// for prior covariance matrix
corr_matrix[m*p] Omega;  // prior correlation
vector<lower=0>[m*p] tau;// prior scale
}

transformed parameters {
matrix[m, N] ExpY;
matrix[N, m] ExpYtransposed; // can assign values only to rows, and not columns of matrices
for(t in 1:N){
ExpYtransposed[t] = to_row_vector(to_matrix(X[t,:,:]) * beta[:,t]);
}
ExpY = ExpYtransposed';
}

model {
matrix[m*p, m*p] W;
tau ~ cauchy(0, 2.5);
Omega ~ lkj_corr(2);

//the coefficients
col(beta,1) ~ multi_normal(m0, W);
for(t in 2:N){
col(beta,t) ~ multi_normal(col(beta,t-1), W);
}

//the sales
for(t in 1:N){
col(Y,t) ~ multi_normal(col(ExpY,t), V);
}

}


When trying to fit it, I end up with some warnings (“There were 1911 divergent transitions after warmup”, “The largest R-hat is NA, indicating chains have not mixed.”, etc.). When plotting traceplots of ‘tau’, I can see that not only have the chains not mixed, but that each of the three chains is constant.

Then I set W to be the identity matrix and provided it on input. I got a small number of divergent transitions, and a very poor fit, but the chains somewhat mixed. So I’m guessing (or hoping) that the above problems are due to the specification of W, and not the model being bad.

I’m new to STAN. I’d be grateful for tips on improving the model in any way, but especially with regards to the covariance matrix W.

1 Like

Sampling the latent states (beta) in a multivariate context was always a massive pain for me. I tried a range of different approaches but never got anything to match the performance (efffective samples per second) of the (code heavy) Kalman filtering approach. My R package ctsem will set all this up for you and allow sampling via Stan’s HMC or ML / MAP optimization (default). Checkout https://cdriver.netlify.app/ and or the quick start at bottom of GitHub - cdriveraus/ctsem: Hierarchical continuous time state space modelling . Most of the emphasis is on continuous time systems but you can just set it to discrete time in the model spec for the context you want.

2 Likes

Thanks a lot, I’ll have a look into your package. Do you maybe have any idea why JAGS does a decent job fitting the covariance matrix of the betas and STAN has troubles with that?

Don’t know sorry. Best guesses are either – whatever Gibbs sampling tricks JAGS does is more effective than HMC for such conditions, or JAGS is just not telling you that it’s also having trouble.

1 Like

There are two things which I think will give you a pretty significant improvement here.

First, try parameterising your covariance matrices using cholesky factors, these tend to be more efficient in sampling and the multi_normal_cholesky distribution is more efficient than the multi_normal. There’s an example of re-writing the multi_normal to multi_normal_cholesky in this section of the user’s guide: 1.15 Multivariate outcomes | Stan User’s Guide

Second, rather than directly putting a multi_normal prior on beta, try using the multivariate non-centered parameterisation (which also uses the cholesky factor). This tends to sample more efficiently (but not always), more information and examples here: 24.7 Reparameterization | Stan User’s Guide

1 Like

Hi @andrjohns and thanks for your input!

In fact, I did try using Cholesky factors before (for the betas). The way I did it was

data {
int<lower=1> N;        // maximal time
int<lower=1> m;        // the number of cells
int<lower=1> p;         // number of covariates, including intercept
cov_matrix[m] Ycov; // covariance matrix of sales
matrix[m, N] Y;          // the matrix of sales: Y[k,t] = sales in cell k at time t; Y[:,t] = sales at time t
real X[N,m,m*p];       // this is an array
vector[m*p] m0;
}

parameters {
matrix[m*p, N+1] beta; //beta[:,t] = coefficients at time t
// other
cholesky_factor_corr[m*p] L_Omega;  // Cholesky correlation
vector<lower=0>[m*p] L_sigma;           // cholesky scale
}

transformed parameters {
matrix[m*p, m*p] L_Sigma;
matrix[m, N] ExpY;
matrix[N, m] ExpYtransposed; // can assign values only to rows, and not columns of matrices

for(t in 1:N){
ExpYtransposed[t] = to_row_vector(to_matrix(X[t,:,:]) * beta[:,t]);
}
ExpY = ExpYtransposed';
L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
}

model {
//priors
L_sigma ~ cauchy(0, 2.5);
L_Omega ~ lkj_corr_cholesky(4);

col(beta,1) ~ multi_normal_cholesky(m0, L_Sigma);
for(t in 2:N){
col(beta,t) ~ multi_normal_cholesky(col(beta,t-1), L_Sigma);
}

//the sales
for(t in 1:N){
col(Y,t) ~ multi_normal(col(ExpY,t), Ycov);
}

}



The placement of the declarations of L_sigma, L_Sigma and L_Omega is different than in the documentation page you pointed to, but I thought it wouldn’t make too much of a difference. Is that true?

Anyway, with the above formulation I’m also ending up with the chains not mixing (so it seems). This is what I get when calling traceplot(fit, pars = c("beta[1,1]")):

I’ll try debugging using print statements. I’d still be grateful for any explanations as to what may be causing this behaviour.