# Multivariate time series with missing data

I’m trying to write a multivariate-outcome, AR(1) time-series model with missing data. An additional challenge is that in my data only one component of the outcome is observable at a time. The user’s guide section “Missing Data & Partially Known Parameters” offers some useful ideas, but I don’t know how to deal with the only-one-component-at-a-time issue. The best I could come up with is the following (bivariate only), though I expected it not to work:

``````stan
data {
  int<lower=0> N;                 // number of observations
  real y_obs[N];                  // observed data, missing one component
  real<lower=1, upper=2> obs[N];  // indicates which component is observed
}
parameters {
  real alpha;
  real beta;
  cov_matrix[2] SIG;
}
transformed parameters {
  real y_mis[N];
  vector[2] y[N];
  for (i in 1:N) {
    if (obs[i] == 1)
      y[i] = [y_obs[i], y_mis[i]]';
    else
      y[i] = [y_mis[i], y_obs[i]]';
  }
}
model {
  y[1] ~ normal(0, 100);
  for (n in 2:N)
    y[n] ~ multi_normal(rep_vector(alpha, 2)
                        + rep_vector(beta, 2) .* y[n-1], SIG);
}
``````

And indeed it doesn’t work; I get “Exception: normal_lpdf: Random variable[2] is nan, but must not be nan!”, presumably because Stan reaches the first sampling statement in the model block and finds that one component of y[1] is nan.

The manual section “Missing Multivariate Data” has the following code snippet:

``````stan
for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2])
    y[n] ~ multi_normal(mu, Sigma);
  else if (y_observed[n, 1])
    y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  else if (y_observed[n, 2])
    y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}
``````

I’ve tried to think of a way to adapt that to my model, but since in my data y[n,1] and y[n,2] are never observed simultaneously, I’m guessing the multi_normal branch would never be used; the two components would in effect be estimated independently (i.e., with no correlation).

Perhaps this means it can’t be done, but intuitively I’m thinking that the time-series structure makes it possible, since the data are correlated across time.

Anyway, here’s some sample data:
temp.data.R (3.4 KB)

Any hints, comments, insights, etc. are appreciated.

Thanks


Almost there, you just need to move `real y_mis[N];` into the `parameters` block (not `transformed parameters`).

I recommend adding a prior for `SIG`, maybe `diagonal(SIG) ~ normal(0, 100);`.
Also, you don’t need those `rep_vector` calls; `alpha + beta * y[n-1]` works too, because Stan broadcasts scalars in vector arithmetic.


In addition to Niko’s suggestion, especially if you’re planning on expanding the model to more covariates, you probably want to re-parameterize so you have a prior on the SD of each covariate that is independent of the prior on the correlation matrix. I find that parameterization easier to think about and to set sensible priors on (e.g., half-normal or Weibull for the SDs, LKJ for the correlations) than a full covariance matrix.

Also, unless your data indeed have an SD on the order of 100, I suspect that `normal(0, 100)` is far too diffuse. The general recommendation for a weakly-informative prior is to use an SD within an order of magnitude of the observed data; otherwise you’re putting a lot of credibility in regions of parameter space that are very extreme. A prior predictive check is certainly recommended to make sure that your prior generates reasonable data.
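A prior predictive check can be sketched even outside Stan. For instance, a minimal pure-Python simulation (illustrative only; `prior_predictive` is a hypothetical helper, and it collapses the model to a single component) shows what trajectories such wide priors imply: with beta drawn from normal(0, 100), |beta| > 1 almost surely, so the simulated AR(1) series explodes.

```python
import math
import random

def prior_predictive(N=50, prior_sd=100.0, seed=1):
    """Draw one AR(1) trajectory with alpha, beta, y[1] ~ normal(0, prior_sd)."""
    rng = random.Random(seed)
    alpha = rng.gauss(0, prior_sd)
    beta = rng.gauss(0, prior_sd)  # |beta| > 1 with probability ~0.99
    y = [rng.gauss(0, prior_sd)]
    for _ in range(N - 1):
        y.append(alpha + beta * y[-1] + rng.gauss(0, 1))  # unit innovation noise
    return y

y = prior_predictive()
print(max(abs(v) for v in y))  # typically enormous under these priors
```

If most prior draws look like this, the prior is placing substantial mass on wildly implausible data.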


Yup, that did the trick. And indeed, the rep_vector stuff appears to be unnecessary.

Thanks

Now that I have the basic model working, I’ll work on specifying more sensible priors per your suggestions.

Thank you

Please have a look at my updated model, incorporating your suggestions. I also expanded it to accommodate any number of covariates. It appears to run properly, but a bit slowly; perhaps there are opportunities for improved efficiency.

Thanks

``````stan
data {
  int<lower=0> N;       // number of observations
  int<lower=0> K;       // number of covariates
  real y_obs[N];        // one component of observed data
  int<lower=1> obs[N];  // indicates which component is observed
}
parameters {
  real alpha;             // AR(1) model alpha
  real beta;              // AR(1) model beta
  vector[K-1] y_mis[N];   // missing data
  vector<lower=0>[K] SD;  // standard deviations
  corr_matrix[K] R;       // correlation matrix
}
transformed parameters {
  vector[K] y[N];
  cov_matrix[K] SIG = diag_matrix(SD) * R * diag_matrix(SD);

  // Construct array combining observed and missing data:
  for (i in 1:N) {
    int k = 1;
    for (j in 1:K) {
      if (obs[i] == j)
        y[i, j] = y_obs[i];
      else {
        y[i, j] = y_mis[i, k];
        k += 1;
      }
    }
  }
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(1, 1);
  y[1] ~ normal(0, 20);
  SD ~ normal(0, 20);
  R ~ lkj_corr(1.5);
  for (n in 2:N)
    y[n] ~ multi_normal(alpha + beta * y[n-1], SIG);
}
``````
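The index bookkeeping in the transformed parameters loop is easy to get wrong, so here is the same interleaving logic transcribed to plain Python (a hypothetical `fill_row` helper with illustrative values): the single observed value goes in slot `obs[i]`, and the remaining slots are filled from `y_mis` in order.

```python
def fill_row(K, obs_j, y_obs_i, y_mis_i):
    """Rebuild the full K-vector for one time point: the observed value goes
    in slot obs_j (1-based, as in Stan); missing values fill the rest in order."""
    y, k = [], 0
    for j in range(1, K + 1):
        if j == obs_j:
            y.append(y_obs_i)
        else:
            y.append(y_mis_i[k])
            k += 1
    return y

print(fill_row(3, 2, 10.0, [1.0, 2.0]))  # → [1.0, 10.0, 2.0]
```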

I think this could be `SIG = quad_form_diag(R, SD);`, which computes `diag_matrix(SD) * R * diag_matrix(SD)` more efficiently.
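Element-wise, `quad_form_diag(R, SD)` is just `SD[i] * R[i][j] * SD[j]`, which avoids materializing the diagonal matrices. A pure-Python sanity check with illustrative values:

```python
def quad_form_diag(R, sd):
    """diag(sd) @ R @ diag(sd), written element-wise."""
    K = len(sd)
    return [[sd[i] * R[i][j] * sd[j] for j in range(K)] for i in range(K)]

R = [[1.0, 0.5], [0.5, 1.0]]  # correlation matrix
sd = [1.0, 2.0]               # standard deviations
print(quad_form_diag(R, sd))  # → [[1.0, 1.0], [1.0, 4.0]]
```

The off-diagonal entry is sd[0] * sd[1] * rho = 1 * 2 * 0.5 = 1, as expected for a covariance built from SDs and a correlation.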


OK, thanks.

You can also use the Cholesky form of the MVN to gain a bit of speed:

``````stan
data {
  int<lower=0> N;       // number of observations
  int<lower=0> K;       // number of covariates
  real y_obs[N];        // one component of observed data
  int<lower=1> obs[N];  // indicates which component is observed
}
parameters {
  real alpha;                   // AR(1) model alpha
  real beta;                    // AR(1) model beta
  vector[K-1] y_mis[N];         // missing data
  vector<lower=0>[K] SD;        // standard deviations
  cholesky_factor_corr[K] L_R;  // Cholesky factor of correlation matrix
}
transformed parameters {
  vector[K] y[N];
  cholesky_factor_cov[K] L_SIG = diag_pre_multiply(SD, L_R);  // Cholesky factor of covariance matrix

  // Construct array combining observed and missing data:
  for (i in 1:N) {
    int k = 1;
    for (j in 1:K) {
      if (obs[i] == j)
        y[i, j] = y_obs[i];
      else {
        y[i, j] = y_mis[i, k];
        k += 1;
      }
    }
  }
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(1, 1);
  y[1] ~ normal(0, 20);
  SD ~ normal(0, 20);
  L_R ~ lkj_corr_cholesky(1.5);  // prior on Cholesky factor of correlation matrix
  for (n in 2:N)
    y[n] ~ multi_normal_cholesky(alpha + beta * y[n-1], L_SIG);  // Cholesky form of the MVN
}
``````
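The reason the Cholesky form works: if z ~ N(0, I), then mu + L·z has covariance L·Lᵀ, so `diag_pre_multiply(SD, L_R)` yields draws with the intended SDs and correlations. A small pure-Python sanity check (bivariate only, illustrative values, hypothetical helpers; not the Stan model itself):

```python
import math
import random

def sample_bvn_chol(n, mu, L, seed=0):
    """Draw n bivariate normal samples as mu + L @ z with z ~ N(0, I)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mu[0] + L[0][0] * z0,
                    mu[1] + L[1][0] * z0 + L[1][1] * z1))
    return out

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sxy / (sx * sy)

# Target: SDs (1, 2), correlation 0.8; L is diag_pre_multiply(SD, L_R) by hand
rho = 0.8
L = [[1.0, 0.0], [2.0 * rho, 2.0 * math.sqrt(1 - rho ** 2)]]
samples = sample_bvn_chol(20000, (0.0, 0.0), L)
r = corr([s[0] for s in samples], [s[1] for s in samples])
print(round(r, 2))  # close to 0.8
```

The empirical correlation of the draws recovers the target rho, confirming the factorization.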

Yes, that appears to speed things up a bit.

Thank you
