I am new to stan so I apologize ahead of time for simplicity of my question. I am trying to understand the way stan handles missing data imputation, specifically how it merges imputed data into observed data. I have built a regression model with missing data on the dependent variable (DV). All predictors are observed. Conceptually, I understand that I am using parameters in the model to estimate missing values and then merging these into observed DV.

data {
int<lower=0> N;
int anxiety_num_missing;
real Anxiety[N];
vector [N] Tx;
vector [N] Time;
vector [N] TxT;
int anxiety_missing[N];
}
parameters{
real alpha;
real<lower=0> sigma;
real bN;
real bM;
real bI;
vector[anxiety_num_missing] anxiety_impute;
real mu_anxiety;
real<lower=0> sigma_anxiety;
}
model {
vector[N] mu;
vector[N] anxiety_merged;
alpha ~ normal(0,10);
bN ~ normal(0,10);
bM ~ normal(0,10);
bI ~ normal(0,10);
mu_anxiety ~ normal(3.5,1);
sigma ~ cauchy(0,1);
sigma_anxiety ~ cauchy(0,1);
//imputation
anxiety_merged ~ normal(mu_anxiety, sigma_anxiety);
//regression
mu = alpha + bN*Tx + bM*Time + bI*TxT;
Anxiety ~ normal(mu, sigma);
//merge missing and observed
for (i in 1:N){
anxiety_merged[i] = Anxiety[i];
if (anxiety_missing[i] > 0) anxiety_merged = anxiety_impute[anxiety_missing[i]];
}
}

This code throws the following error:

Dimension mismatch in assignment; variable name = anxiety_merged, type = vector; right-hand side type = real.
Illegal statement beginning with non-void expression parsed as
anxiety_merged
Not a legal assignment, sampling, or function statement. Note that

Assignment statements only allow variables (with optional indexes) on the left;

Sampling statements allow arbitrary value-denoting expressions on the left.

Functions used as statements must be declared to have void returns

which I understand means that anxiety_merged is vector and can’t be on the right side of the merging step, but am not sure how to code around it? Any help would be appreciated.

No worries! Though I’ll also point out that given your model, I believe your posteriors will be unaffected by simply excluding the missing data cases and not bothering with imputation. This is because no parameters are uniquely dependent on the imputed data.

I appreciate the additional thoughts. I am trying to work from very simple to more complex so I understand how he math translates through the code more effectively. If you have thoughts or favorite references/tutorials on how to adapt this simple example to more complex and meaningful, I would be greatly appreciated. At this stage, only my DV has missing data, but I am sure I will encounter other times when some parameters in the model are influenced by missing data. I generally end up dealing with missing data in repeated measures experimental designs, but rarely have IVs that are missing. Any experience to share would be appreciated!

Here’s a lecture and code I did a while ago that talks about missing data. Generally you want to do missing data imputation whenever you have missing data in the IVs or when your DV is a non-hierarchical[^1] multivariate outcome (a.k.a. repeated-measures outcome).

[^1]: hierarchical models will generally automatically do imputation for cells missing data

I have had a chance to apply your code from this lecture to the model described above, but have run into another error that I can’t seem to figure out. The chains have errors because it rejects the first value fed into the model:

Exception: multi_normal_lpdf: Random variable[1] is nan, but must not be nan!

The error identifies the following line as the source of the error.

The best I can come up with is that there is something wrong with cor_mat or scaled_coef_sds that I can’t sort out. Looking at other posts with the same error, it appears that solutions include adding the left side outcome to the parameter section, which doesn’t make sense to me.

The error message points to scaled_coefs as being the problem.
I’m guessing you’re doing something like this

model {
vector[N] scaled_coefs[nSubj];
...
for (s in 1:nSubj) {
scaled_coefs[s,] ~ multi_normal(...);
}
...
for (s in nSubj) {
scaled_coefs[s,] = do_scaling(s, ...);
}
}

Stan requires defining the variable before sampling statments so it should be

model {
vector[N] scaled_coefs[nSubj];
...
for (s in nSubj) {
scaled_coefs[s,] = do_scaling(s, ...);
}
...
for (s in 1:nSubj) {
scaled_coefs[s,] ~ multi_normal(...);
}
}

By the way, multi_normal_cholesky is more efficient than multi_normal.

matrix[...] cor_mat_cholesky = cholesky_decompose(cor_mat);
matrix[...] cov_mat_cholesky = diag_pre_multiply(
scaled_coef_sds, cor_mat_cholesky);
for (s in 1:NSubj) {
scaled_coefs[s] ~ multi_normal_cholesky(
scaled_coef_means, cov_mat_cholesky);
}

If cor_mat is a parameter then you can just declare instead parameter

cholesky_factor_corr[...] cor_mat_cholesky;

and no need to cholesky_decompose(). The correlation matrix (if needed) can be reconstructed from the cholesky factor

Great point about multi_normal_cholesky()@nhuurre. For this adaptation, scaled_coefs[] variable and cor_mat are named in the transformed_data{} block. I’ve included the full code below, which mirrors the generous example by @mike-lawrence linked above.

functions{
//a function to turn the long data to wide coefficients
matrix get_coefs(matrix X, vector Y, int [,] Z) {
matrix[size(Z),cols(X)] coefs ;
for(i in 1:size(Z)){
coefs[i] = transpose(
inverse(
transpose( X[Z[i,1]:Z[i,2]] )
* X[Z[i,1]:Z[i,2]]
)
* transpose( X[Z[i,1]:Z[i,2]] )
* Y[Z[i,1]:Z[i,2]]
);
}
return coefs ;
}
}
data {
// nSubj: number of subjects
int nSubj ;
// nY: total number of observations
int nY ;
// nX: number of predictors
int nX ;
// Y: vector of observations
vector[nY] Y ;
// X: predictor matrix
matrix[nY,nX] X ;
// subjIndices: list of start and end indices in Y
// corresponding to data from each subject
int subjIndices[nSubj,2] ;
}
transformed data{
// scaled_coefs: matrix of wide-format coefficients
// for each subject given their data and predictors
matrix[nSubj,nX] scaled_coefs ;
//use the get_coefs() function, simultaneously scaling Y
// for easy default priors
scaled_coefs = get_coefs(X,(Y-mean(Y))/sd(Y),subjIndices) ;
}
parameters {
// scaled_coef_means: z-scale coefficient means
vector[nX] scaled_coef_means ;
// scaled_coef_sds: z-scale coefficient sds
vector<lower=0>[nX] scaled_coef_sds ;
// cor_mat: correlations amongst coefficients
corr_matrix[nX] cor_mat ;
}
model {
// weakly informed priors
scaled_coef_means ~ normal(0,1) ;
scaled_coef_sds ~ weibull(2,1) ;
cor_mat ~ lkj_corr(1) ;
//coefficients as multivariate normal
for(s in 1:nSubj){
scaled_coefs[s,] ~ multi_normal(
scaled_coef_means
, quad_form_diag(cor_mat, scaled_coef_sds)
) ;
}
}
generated quantities{
// coef_means: coefficient means on the original
// scale of Y
vector[nX] coef_means ;
// coef_means: coefficient sds on the original
// scale of Y
vector[nX] coef_sds ;
coef_sds = scaled_coef_sds * sd(Y) ;
coef_means = scaled_coef_means * sd(Y) ;
//adding back the mean to the intercept only
coef_means[1] = coef_means[1] + mean(Y) ;
}

I will also note that this runs perfectly with the data used to build his example. It was 100% reproducible! My data mirrors his structure, but already includes missing variables on Y, replaced as -1 in the file attached here data.missing.csv (1.5 KB)

I am thinking that the perhaps the function call is doing something different to my data within Stan than it it did with the reproducible example.

I found the solution. The design I had pulled for my example was between/within and not just within, so the contrast matrix W was incorrectly calculating. I changed the contrast matrix to be just the within variable and it worked.

@mike-lawrence The other videos on the list are also helpful! You mention in the 3.13a lecture on hierarchical models that imputation is a unnecessary step in stan for hierarchical mixed models. Conceptually, I understand this statement, but what do we do with the NA that Stan doesn’t like in the Y variable in the hierarchical models?

Here’s a hierarchical scenario: multiple participants, each measured many times in each of two conditions. A brms/stan_glmer formula for a model of this data might be ‘outcome ~ condition + (1+condition | participant)’, which estimates (among other things), an effect of condition for each participant. Now if there’s a glitch in the experiment and and a few participants are randomly missing data for one condition or the other, there will be no NAs in the data, just less data (assuming long format). Nonetheless, despite this missing data, estimates for the condition effect will be produced for each participant such that estimates for a participant missing data will be informed by the distribution of effects from participants not missing data (plus some influence from the data still present for that participant).

I guess to clarify, the automatic handling of missing data for hierarchical models comes from the fact that hierarchical data are usually structured in a long data format where missing data manifests simply as fewer rows. If you have NAs you’d just toss those rows.

I see. That makes sense for MAR. For my own enjoyment and practice, I am proving to myself that these estimates are similar to those achieved if you impute.