# Hierarchial state-space model with warnings

Hi everybody,

I am kinda new to Stan modelling, but I have some experience with state-space (linear dynamic) models already. My intention is to build and fit model using Stan in order to infer parameters of hierarchial state-space model (time varying dynamics/time varying process noice etc.).
But to start, I tried simple model that could be solvable using Kalman Filter (but I do not want to use KF because it does not have the flexibility to infer parameters of more complex model).

So the mathematical formulation of the model goes like this

\mathbf{x}_{t} \sim \mathcal{N} \left( \mathbf{A} * \mathbf{x}_{t-1}, \Sigma_{x} \right)
\mathbf{y}_{t} \sim \mathcal{N} \left( \mathbf{x}_{t}, \Sigma_{y} \right)

which is pretty straightforward and KF would estimate the latent states \mathbf{x}_{t} from measurements \mathbf{y}_{t} easily as long as the process noise covariance matrix \Sigma_{x} and measurement covariance matrix \Sigma_{y} are proper covariance matrices. My idea of formulation of this model in Stan is this (model is saved in the file “simpleKalman.stan”)

data {
real dt;
int N;
vector y[N];
matrix[2,2] V;
matrix[2,2] A;
matrix[2,2] Q;
}
parameters {
vector x[N];
}
model {

x ~ multi_normal(y, Q + V);

for( t in 2:N )
{

x[t] ~ multi_normal(A*x[t-1], Q);

}

for( t in 1:N )
{

y[t] ~ multi_normal(x[t], V);

}

}


where N is the amount of data, dt is input prepared for more sophisticated models, A is the state transformation matrix, Q is the process noise matrix (in this formulation, i want Stan to only estimate the latent states) and V is the measurement covariance matrix. The following R code describes the way the data are generated (including the state transition matrix definition) and also the parameters that are provided to Stan in order to fit the model.

nT <- 50;
dt <- 0.9;
A <- matrix(c(1, dt, 0, 1), nrow = 2, ncol = 2, byrow = TRUE);
origX <- matrix(data = 0, nrow = 2, ncol = nT);
X <- matrix(data = 0, nrow = 2, ncol = nT);

x0 <- matrix(c(2000, 100), nrow = 2, ncol = 1);

X[, 1] <- x0;
origX[, 1] <- x0;

for( iT in 2:nT )
{

origX[, iT] <- A %*% origX[, iT - 1];

}

Q <- matrix(c(20^2, 0, 0, 0.01^2), nrow = 2, ncol = 2, byrow = TRUE);
sqrtQ <- t(chol(Q));

for(iT in 1:nT )
{

X[, iT] <- origX[, iT] + sqrtQ %*% matrix(rnorm(2), nrow = 2, ncol = 1);

}

Y <- matrix(0, nrow = 2, ncol = nT)
V <- matrix(c(100^2, 3e-5, 3e-5, 4^2), nrow = 2, ncol = 2);
sqrtV <- t(chol(V));

for( iT in 1:nT )
{

Y[, iT] <- X[, iT] + sqrtV %*% matrix(rnorm(2), nrow = 2, ncol = 1);

}

fit <- stan("simpleKalman.stan",
data=list(dt=dt, N=nT, y=t(Y), V=V, A=A, Q=Q),
iter=10000, chains=4, control = list(adapt_delta = 0.8));


Edit:
This is embarrasing, but I have noticed, that the process noise covariance matrix is stated as model data, however it was not provided to the model in the data list (already corrected in the provided model). The interesting part is that this is even possible (without even warning about the fact that something is stated as a data in the stan file, but is not present in the input).
After correction of this horrible mistake I have tried this model again. The warning about tree depth was still present. But I have read some stan articles since I have posted my question and looked at the n_eff parameter of estimated quantities (the latent states x). I have noticed that for the second part of the state vector this value was always approximately ten times smaller. Therefore I have tried the same model, but this time with

Q <- matrix(c(20^2, 0, 0, 1^2), nrow = 2, ncol = 2, byrow = TRUE);


and then there is no warning and the n_eff improved even for the first part of the state vectors (with the original covariance matrix x[ , 1] had n_eff approx. 1e4 and x[, 2] had approx 1e3 but with the new, all of the elements have n_eff approx. 2e4). The estimated states make sense to me, because if I try to estimate Q using them, I get approximately the right values.
Since I consider this part to be solved and I am able to infer the latent states, I have decided to expand the model. In this toy example, my intention is to estimate the process noise covariance matrix given the data and the measurement covariance matrix. I.e. I want to move the matrix Q from the data part of the stan file to the parameters part. Since I have done some reading about Bayesian modelling, I have decided to put some prior on this parameter. I know there is a lot of discussion about THE proper priors for covariance matrices but honestly I do not thing that the Wishart prior is the issue in this case.
Here is the model with process noise covariance matrix as parameter (the model is again saved in simpleKalman.stan model file)

data {
real dt;
int N;
vector y[N];
cov_matrix V;
matrix[2,2] A;
cov_matrix Q0;
}
parameters {
vector x[N];
cov_matrix Q;
}
model {

Q ~ wishart(2, Q0);

x ~ multi_normal(y, Q + V);

for( t in 2:N )
{

x[t] ~ multi_normal(A*x[t-1], Q);

}

for( t in 1:N )
{

y[t] ~ multi_normal(x[t], V);

}

}


and here is the R script which generates data (I use generative model approach, as is suggested in multiple resources) and provides them to stan

nT <- 100;
dt <- 0.9;
A <- matrix(c(1, dt, 0, 1), nrow = 2, ncol = 2, byrow = TRUE);
origX <- matrix(data = 0, nrow = 2, ncol = nT);
X <- matrix(data = 0, nrow = 2, ncol = nT);

x0 <- matrix(c(2000, 100), nrow = 2, ncol = 1);

X[, 1] <- x0;
origX[, 1] <- x0;

for( iT in 2:nT )
{

origX[, iT] <- A %*% origX[, iT - 1];

}

Q <- matrix(c(20^2, 0, 0, 1^2), nrow = 2, ncol = 2, byrow = TRUE);
sqrtQ <- t(chol(Q));

for(iT in 1:nT )
{

# X[, iT] <- X[, iT] + trnQ*rnorm(1, 0, sigmaa);
X[, iT] <- origX[, iT] + sqrtQ %*% matrix(rnorm(2), nrow = 2, ncol = 1);

}

Y <- matrix(0, nrow = 2, ncol = nT)
V <- matrix(c(100^2, 3e-5, 3e-5, 4^2), nrow = 2, ncol = 2);
sqrtV <- t(chol(V));

for( iT in 1:nT )
{

Y[, iT] <- X[, iT] + sqrtV %*% matrix(rnorm(2), nrow = 2, ncol = 1);

}

centerQ <- Q/5;

fit <- stan("simpleKalman.stan",
data=list(dt=dt, N=nT, y=t(Y), V=V, A=A, Q0=centerQ),
iter=10000, chains=4, control = list(adapt_delta = 0.8));


Inference using this model returns multiple warnings, so I suppose there is something fundamentally wrong with the model

Warning messages:
1: There were 127 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: There were 5211 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
3: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
4: Examine the pairs() plot to diagnose sampling problems

5: The largest R-hat is 1.43, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
6: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
7: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess


but I have no clue about what is it. Please, if somebody could suggest improvement of my code, I would be grateful.

Whenever you have the situation:

a ~ normal(x, c);


where a and c are parameters and x is anything, you run into the danger of getting a funnel and you should look at the noncentered parameterizations. Check this (it’s a description of what happens + a way to get around it): https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html.

There’s a description of how to do it for multivariate outputs under the heading “Hierarchical Models and the Non-Centered Parameterization” here: https://mc-stan.org/docs/2_18/stan-users-guide/reparameterization-section.html

2 Likes

Thank you @bbbales2 for the answer and links provided to me. I read through the article and manual and I have tried to implement this type of solution so it fits my situation. I have come up with new version of the stan model

data {
real dt;
int N;
vector y[N];
cov_matrix V;
matrix[2,2] A;
}
transformed data {
matrix[2, 2] V_sqrt;
V_sqrt = cholesky_decompose(V);
}
parameters {
vector x_tilde[N];
cov_matrix Q;
}
transformed parameters {
matrix[2, 2] Q_sqrt;
matrix[2, 2] QV_sqrt;
vector x[N];
Q_sqrt = cholesky_decompose(Q);
QV_sqrt = cholesky_decompose(Q + V);

x = y + QV_sqrt*x_tilde;

for( t in 2:N )
{

x[t] = A*x[t-1] + Q_sqrt*x_tilde[t];

}

}
model {

for( t in 1:N )
{

x_tilde[t] ~ std_normal();

}

for( t in 1:N )
{

y[t] ~ multi_normal(x[t], V);

}

}


However, even with this model, there are still some divergent samples. Inspired by the analyses that has been done in the article, I have performed a plot of iterations of the range variance (element Q[1, 1]) and compared it with the actual value. What I understand that might be a problem is the sampling from the distribution of the data y_t, but I am not able to come up with change of this statement and that is in particular due to time dependence of the mean value (the transformation of latent state x between epochs).
Overall my findings are such that the value of Q[1,1] which is the value I am checking against the actual (simulated value) changes with the amount of data provided, the number of iterations and the value of adapt_delta. It is really confusing. I You have any other hint, I would really appreciate it.

This could be possible, right? More data and the posterior changes.

This should not happen though.

What parameter has the highest Rhat or the lowest Neffs? What do traceplots of these parameters colored by chain look like? Are there any clues there?

Looking at the model, probably put a prior on Q, or reparameterize and use an LKJ (24 Correlation Matrix Distributions | Stan Functions Reference, 1.13 Multivariate Priors for Hierarchical Models | Stan User’s Guide).

The model might be faster if you do:

y[t] ~ multi_normal_cholesky(x[t], V_sqrt);


It would definitely be faster if V had much larger dimension than 2, but maybe even for 2x2 you’ll notice it.

Stan has a Kalman filter function gaussian_dlm which could be of interest. I would guess it handles using a Cholesky factor for the user?