How are missing and censored multivariate data estimated in Stan?


#1

Here is the scenario that leads to my questions: Suppose I have P-dimensional multivariate data from N subjects which contains observed values (y_obs), missing values (y_mis), and censored values (y_cens). I define the missing and censored values as parameters in the parameters block (with the censoring point defined as a lower or upper bound for the censored values) and then define vector[P] y[N] in the transformed parameters block for which y[n,p] = y_obs if the index is observed, y[n,p] = y_mis if the index is missing, or y[n,p] = y_cens if the index is censored. Finally, my goal is to estimate the parameters of y ~ multi_normal(mu, Sigma), with mu and Sigma unknown.

I’ve got this model to work very well, but I would like to understand how the joint posterior of the missing values, censored values, and unknown parameters, given the observed data, is estimated. Are the missing and censored values being imputed from the conditional multivariate normal distribution given the observed values and the previous iteration’s parameter values and then the complete data are used to update the parameters in the current iteration? I searched through the manual and some textbooks but was unable to find an answer to this question.

If I explained the scenario poorly, I found this example with only missing data from a bivariate normal distribution https://gist.github.com/ashiklom/d790474834ef1c67fde6466728ea71f4.


#2

There are chapters in the manual for both univariate cases. Missing data is handle the same way if it’s univariate or multivariate—declare parameters for the missing data and make sure to give them priors.

For censored data, you need the CDFs. Those don’t exist in easy forms for anything other than bivariate cases. We have an example of how to implement bivariate CDFs in the manual for bivariate normal now, but that’s as far as we’ve gone.

We’ll eventually add the bivariate normal CDF to the language. For now, you can look at Ben Goodrich’s implementation in Stan in the issue.


#3

Bob, thank you very much for your reply.

I meant my question as more of a behind-the-scenes question as opposed to a model question.

This may make my question a little more clear. Below are two model codes. The first uses the method shown in the manual for bivariate normal data with missing data. Additionally, I have added a generated quantities block with the missing values imputed using the normal_rng function and the appropriate conditional mean and standard deviation for each missing data point (it’s a little messy, my apologies). The second uses “index fiddling” to create a data matrix with missing values treated as parameters and also returns what I would think would be imputed values.

model1 <- ’
data{
int<lower = 1> N;
vector[2] y[N];
int<lower = 0, upper = 1> y_observed[N, 2];
int<lower = 0, upper = N> TotalMissing;
}
parameters{
vector[2] mu;
cov_matrix[2] Sigma;

}
model{
for (n in 1:N) {
if (y_observed[n, 1] && y_observed[n, 2])
y[n] ~ multi_normal(mu, Sigma);
else if (y_observed[n, 1])
y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
else if (y_observed[n, 2])
y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}
}
generated quantities{
real y_mis[TotalMissing];
y_mis[1] = normal_rng(mu[1] + (Sigma[1,1]/Sigma[2,2])*(Sigma[1,2]/(sqrt(Sigma[1,1])sqrt(Sigma[2,2])))(y[1,2] - mu[2]), (1 - (Sigma[1,2]/(sqrt(Sigma[1,1])*sqrt(Sigma[2,2])))^2)Sigma[1,1]);
y_mis[2] = normal_rng(mu[2] + (Sigma[2,2]/Sigma[1,1])
(Sigma[1,2]/(sqrt(Sigma[1,1])sqrt(Sigma[2,2])))(y[4,1] - mu[1]), (1 - (Sigma[1,2]/(sqrt(Sigma[1,1])*sqrt(Sigma[2,2])))^2)Sigma[2,2]);
y_mis[3] = normal_rng(mu[2] + (Sigma[2,2]/Sigma[1,1])
(Sigma[1,2]/(sqrt(Sigma[1,1])sqrt(Sigma[2,2])))(y[10,1] - mu[1]), (1 - (Sigma[1,2]/(sqrt(Sigma[1,1])*sqrt(Sigma[2,2])))^2)Sigma[2,2]);
y_mis[4] = normal_rng(mu[1] + (Sigma[1,1]/Sigma[2,2])
(Sigma[1,2]/(sqrt(Sigma[1,1])sqrt(Sigma[2,2])))(y[22,2] - mu[2]), (1 - (Sigma[1,2]/(sqrt(Sigma[1,1])*sqrt(Sigma[2,2])))^2)Sigma[1,1]);
y_mis[5] = normal_rng(mu[1] + (Sigma[1,1]/Sigma[2,2])
(Sigma[1,2]/(sqrt(Sigma[1,1])sqrt(Sigma[2,2])))(y[43,2] - mu[2]), (1 - (Sigma[1,2]/(sqrt(Sigma[1,1])*sqrt(Sigma[2,2])))^2)*Sigma[1,1]);
}

model2 <- ’
data{
int<lower = 1> N;
vector[2] y[N];
int<lower = 0, upper = 1> y_observed[N, 2];
int<lower = 0, upper = N> TotalMissing;
int<lower = 0, upper = N> NumMissing[N,2];
}
parameters{
vector[2] mu;
cov_matrix[2] Sigma;
real y_mis[TotalMissing];
}
transformed parameters{
vector[2] y_star[N];
for(n in 1:N){
for(j in 1:2){
if(y_observed[n, j])
y_star[n,j] = y[n,j];
else
y_star[n,j] = y_mis[NumMissing[n,j]];
}
}
}
model{
y_star ~ multi_normal(mu, Sigma);
}

N <- 50
y <- rmvnorm(N, mean = rep(200, 2), sigma = diag(rep(100^2, 2)))
y_observed <- matrix(c(0, rep(1, 20), 0, rep(1, 20), 0, rep(1, 10), 0,
rep(1, 5), 0, rep(1, 40)), nrow = N, ncol = 2)
y[which(y_observed == 0)] <- 9999999
TotalMissing <- sum(y_observed == 0)
NumMissing <- matrix(0, nrow = N, ncol = 2)
k <- 1
for(i in 1:N){
for(j in 1:2){
if(y_observed[i,j] == 0){
NumMissing[i, j] <- k
k <- k + 1
}
}
}
Missing <- apply(y_observed, 1, function(x) sum(x) < 2) + 0.0
ContainsMissing <- which(Missing == 1)

fit1 <- stan(model_code = model1,
data = c(“N”,
“y”,
“y_observed”,
“TotalMissing”),
iter = 1e4,
warmup = 1e3,
chains = 1,
seed = 315315)
fit2 <- stan(model_code = model2,
data = c(“N”,
“y”,
“y_observed”,
“TotalMissing”,
“NumMissing”),
iter = 1e4,
warmup = 1e3,
chains = 1,
seed = 315315)

The resulting estimates for the mean vector and covariance matrix are VERY similar. However, the second model provides a much more narrow 95% credible intervals which do all contain the missing values and posterior means which are more accurate. While the second model appears messier than necessary for a bivariate normal, once you get to larger dimensions it is MUCH faster!

This is all to lead up to the question: How is the joint distribution being estimated behind-the-scenes given that missing data is present in the data matrix and the data model? Are the values of the missing data imputed from the posterior predictive distribution of the missing data values given the observed data and parameter values from the previous iteration and then the “complete” data are used to update the parameter estimates? Or the other way around?

I would like to feel confident about the model that was built, but I lack the understanding of what is going on behind the scenes to defend this model with absolute certainty.

Thank you in advance for all of your help and time.


#4

The best thing to read for a short overview of how Stan works under the hood is the JSS paper by me and others.

Nope, nothing like that in general. Stan works by computing the log posterior density of the parameters and then uses derivatives of that log density function to perform MCMC. So there’s no fancy ordering of anyting—it’s just a log density calculation.

If you’re talking about the same quantity, there’s a well-defined 95% interval, so at most one of your calculations can be right.

You almost certainly need priors on those missing parameters in your programs.

To comment code, you want to do this:

```
code
```

#5

Bob, thank you very much for your help; I really appreciate it. I’ll make sure to format my code correctly next time, my apologies.