Data imputation/missing data in a correlation model

Hello!

I am presenting a correlation model with some extra generated statistics in the generated quantities box. The model runs fine if the sample sizes are equal and I present the data as a 2 column vector in the data blox. However, I would now like to run the model with unbalanced sample sizes. Prior to posting this, I read chapter 11 and 16 in the Stan manual and I think the model is close; however, I can’t quite put my finger as to wear it is getting caught up.

I also read this post on missing data and used it as a guide.

I understand what the error message (see below) means; however, I can’t seem to figure out a way to fix it. If I turn x into a vector I can no longer add a_obs and a_mis in the transformed parameter block.

A found another post suggestion to make x a row vector, which also did not end up with a solution.

If i declare x as vector[2] x[n] in the data block the row assignments for x in transformed parameters results in errors d/t illegal expressions in the transformed parameter box again.

Below is my model and some simulated data.


use

library(rstan)
library(MASS)

#making up data
a <- mvrnorm(n=50, 0, 4)
b <- mvrnorm(n=45, 0, 3.3)
a_mis <- mvrnorm(n=5,0, 4)

standata <- list(a_obs <- as.vector(a),
                 b_obs <- as.vector(b),
                 N_a_obs <- nrow(a),
                 N_a_mis <- nrow(a_mis),
                 N_b_obs <- nrow(b),
                 a_mis <- a_mis)

#initial start values
myinits <- list(
  list(r=.5, mu=c(1, 2), sigma=c(1, 0.1)),
  list(r=-.5, mu=c(0, -2), sigma=c(.1, 1)),
  list(r=.9, mu=c(.5, -0.5), sigma=c(0.5, 5)),
  list(r=-.9, mu=c(-1, 3), sigma=c(0.001, 3)))

# parameters to be monitored: 
parameters <- c("r", "mu", "sigma","T", "bias", "varErr", "rmsd", "diffsigma")

library(rstan)
model <- "
// Pearson Correlation
data { 
int<lower=0> N_a_obs; //number of observed a
int<lower=0> N_a_mis; //number of missing a
int<lower=0> N_b_obs;
vector [N_a_obs] a_obs;
vector [N_b_obs] b_obs;
}
transformed data {
int<lower=0> N = N_a_obs + N_a_mis;
}
parameters {
vector[N_a_mis] a_mis;
vector[2] mu;
vector<lower=0>[2] sigma;
real<lower=-1, upper=1> r;
} 

transformed parameters {
matrix[2,48] x;
cov_matrix[2] T;
x[,1] = a_obs;
x[,1] = a_mis;
x[,2] = b_obs;
// Reparameterization
T[1,1] <- square(sigma[1]);
T[1,2] <- r * sigma[1] * sigma[2];
T[2,1] <- r * sigma[1] * sigma[2];
T[2,2] <- square(sigma[2]);
}

model {
// Priors
mu ~ normal(0, 100);
sigma ~ cauchy(0,3);
r ~ uniform(-1, 1);
// Data
x ~ multi_normal(mu, T);
}
generated quantities{
vector<lower=-5>[1] bias;
vector<lower=-5>[1] varErr;
vector<lower=-5>[1] rmsd;
vector<lower=-5>[1] diffsigma;
bias[1] = mu[1] - mu[2];
varErr[1] = sqrt(T[2,2] + T[1,1] - 2*T[1,2]);
rmsd[1] = sqrt(bias[1]^2 + varErr[1]^2);
diffsigma[1] = sigma[1] - sigma[2];
}"

samples<- stan(model_code=model,   
               data=standata,
               init=myinits, 
               pars=parameters,
               iter=1000,
               chains=4, 
               cores = 4,
               seed = 100
)

The error I am getting is:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Warning (non-fatal): assignment operator <- deprecated in the Stan language; use = instead.
Warning (non-fatal): assignment operator <- deprecated in the Stan language; use = instead.
Warning (non-fatal): assignment operator <- deprecated in the Stan language; use = instead.
Warning (non-fatal): assignment operator <- deprecated in the Stan language; use = instead.
No matches for: 

  matrix ~ multi_normal(vector, matrix)

Available argument signatures for multi_normal:

  vector ~ multi_normal(vector, matrix)
  vector ~ multi_normal(row vector, matrix)
  row vector ~ multi_normal(vector, matrix)
  row vector ~ multi_normal(row vector, matrix)
  vector ~ multi_normal(vector[], matrix)
  vector ~ multi_normal(row vector[], matrix)
  row vector ~ multi_normal(vector[], matrix)
  row vector ~ multi_normal(row vector[], matrix)
  vector[] ~ multi_normal(vector, matrix)
  vector[] ~ multi_normal(row vector, matrix)
  row vector[] ~ multi_normal(vector, matrix)
  row vector[] ~ multi_normal(row vector, matrix)
  vector[] ~ multi_normal(vector[], matrix)
  vector[] ~ multi_normal(row vector[], matrix)
  row vector[] ~ multi_normal(vector[], matrix)
  row vector[] ~ multi_normal(row vector[], matrix)

require real scalar return type for probability function.
  error in 'modelb4d467c49a9b_3a768e768399d96b5f12d9161ad86093' at line 44, column 25
  -------------------------------------------------
    42: r ~ uniform(-1, 1);
    43: // Data
    44: x ~ multi_normal(mu, T);
                                ^
    45: }
  -------------------------------------------------

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model '3a768e768399d96b5f12d9161ad86093' due to the above error.

I understand the correlation model is a simple one, so I am hoping it will also be a good model to learn about missing data in Stan.

Thank you,

Alex Swiderski

The error message is telling you that trying x ~ multinormal(mu, T); doesn’t work because x is a matrix. I haven’t read you model carefully, but maybe what you want is

for (i in 1:48) x[, i] ~ multi_normal(mu, T);

but I can’t tell you whether that’s the most efficient way to do it – very likely not.

I also found this

x[,1] = a_obs;
x[,1] = a_mis;

a bit odd: are you not populating the same object twice with different things here?

Hi Max,

Thanks for the reply. I am not clear as to where you are suggesting putting the statement:

for (i in 1:48) x[, i] ~ multi_normal(mu, T);

Also, what you found in regard to a_obs and a_mis (see below) was my attempt to create a a column in matrix x built upon the observed and simulated missing observations. Does that make sense?

x[,1] = a_obs;
x[,1] = a_mis;

That should go in the model block, replacing the old statement.

Not to me, it doesn’t. What you’re doing there is populating the first column of x with a_obs and then replacing what you just put in there with a_mis. I noticed N is defined but never used. So maybe what you want looks something like this

transformed parameters {
  matrix[N, 2] x;
  cov_matrix[2] T;
  x[1:N_a_obs, 1] = a_obs;
  x[(N_a_obs + 1):N, 1] = a_mis;
  x[,2] = b_obs;
  // Reparameterization
  T[1,1]  = square(sigma[1]);
  T[1,2]  = r * sigma[1] * sigma[2];
  T[2,1] = r * sigma[1] * sigma[2];
  T[2,2]  = square(sigma[2]);
}

?
This only makes sense if N_b_obs == N, otherwise I don’t follow why you’re not modelling b_mis as well. If you could write out the maths of your model that’d be helpful (to me, at least) too.

1 Like

Hi Max, This is really helpful and I know have just one hiccup when i start sampling. I am attaching the new model with your suggestions and the error I am getting. The model now compiles, but with this warning:

DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    x[ , i] ~ multi_normal(...)

Then, when the model tries to sample I get this error:

SAMPLING FOR MODEL '9d6c5d5d1db58fc4894bce113ad9998e' NOW (CHAIN 4).
Unrecoverable error evaluating the log probability at the initial value.
Exception: multi_normal_lpdf: Size of random variable (48) and size of location parameter (2) must match in size  (in 'modelcc8443667780_9d6c5d5d1db58fc4894bce113ad9998e' at line 35)

[1] "Error in sampler$call_sampler(args_list[[i]]) : "                                                                                                                                        
[2] "  Exception: multi_normal_lpdf: Size of random variable (48) and size of location parameter (2) must match in size  (in 'modelcc8443667780_9d6c5d5d1db58fc4894bce113ad9998e' at line 35)"

Where line 35 is:

for (i in 1:48) x[, i] ~ multi_normal(mu, T);

Any suggestions? The most updated model is below:

model <- "
// Pearson Correlation
data { 
int<lower=0> N;
int<lower=0> N_a_obs; //number of observed a
int<lower=0> N_a_mis; //number of missing a
int<lower=0> N_b_obs;
vector [N_a_obs] a_obs;
vector [N_b_obs] b_obs;
}
parameters {
vector[N_a_mis] a_mis;
vector[2] mu;
vector<lower=0>[2] sigma;
real<lower=-1, upper=1> r;
} 
transformed parameters {
matrix[N, 2] x; 
cov_matrix[2] T;
x[1:N_a_obs, 1] = a_obs;
x[(N_a_obs + 1):N, 1] = a_mis;
x[,2] = b_obs;
// Reparameterization
T[1,1]  = square(sigma[1]);
T[1,2]  = r * sigma[1] * sigma[2];
T[2,1] = r * sigma[1] * sigma[2];
T[2,2]  = square(sigma[2]);
}
model {
// Priors
mu ~ normal(0, 100);
sigma ~ cauchy(0,3);
r ~ uniform(-1, 1);
// Data
for (i in 1:48) x[, i] ~ multi_normal(mu, T);
}
generated quantities{
vector<lower=-5>[1] bias;
vector<lower=-5>[1] varErr;
vector<lower=-5>[1] rmsd;
vector<lower=-5>[1] diffsigma;
bias[1] = mu[1] - mu[2];
varErr[1] = sqrt(T[2,2] + T[1,1] - 2*T[1,2]);
rmsd[1] = sqrt(bias[1]^2 + varErr[1]^2);
diffsigma[1] = sigma[1] - sigma[2];
}"

My last question here is as to if this can be done in the generated quantities block so I can view the posterior samples that were used to estimate the missing values?

Thanks again,

Alex

There are multiple problems with this program, I think. If you can write out what exactly you’re trying to do, others and I can give you some specific advice. The data generation process in R seems weird as well.

Hi Max,

Ultimately I am just trying to learn how to run my model with unbalanced sample sizes. The model I am running is a correlation. It works without error as posted below; however, I now have data for an analysis of cognitive assessments where I have results from all participants on one test and not on the second test.

The data generation for my example was just to help provide data to be used for simulation as I cannot post my data due to IRB restrictions. However, if you need me to change the format to gain some support on this topic I am happy to do so in a manner you feel is the most appropriate.

Attached below is my working model without any of the additions for my attempt in running the correlation with unbalanced samples to show you where I have started from.

Please let me know what additional information I can provide and I will happily do so.

Thank you,

Alex

model <- "
// Pearson Correlation
data { 
int<lower=0> n;
vector[2] x[n];
}
parameters {
vector[2] mu;
vector<lower=0>[2] sigma;
real<lower=-1,upper=1> r;
} 
transformed parameters {
cov_matrix[2] T; //creating covariance matrix
// Reparameterization
T[1,1] = square(sigma[1]);
T[1,2] = r * sigma[1] * sigma[2];
T[2,1] = r * sigma[1] * sigma[2];
T[2,2] = square(sigma[2]);
}
model {
// Priors
mu ~ normal(0, 100); // mean prior
sigma ~ cauchy(0,3); // standard deviation prior
r ~ uniform(-1, 1); // correlation
// Data
x ~ multi_normal(mu, T); //multivariable normal distribution 
}
generated quantities{
vector<lower=-5>[1] bias; // additional estimate of error "bias"
vector<lower=-5>[1] varErr; // additional estimate of error "variable error"
vector<lower=-5>[1] rmsd; // additional estimate of error "root mean square deviation"
vector<lower=-5>[1] diffsigma; // additional estimate of error "difference in standard deviations"
bias[1] = mu[1] - mu[2];
varErr[1] = sqrt(T[2,2] + T[1,1] - 2*T[1,2]);
rmsd[1] = sqrt(bias[1]^2 + varErr[1]^2);
diffsigma[1] = sigma[1] - sigma[2];
}"