# Posterior predictive check in within regression model with considerable missingness

Greetings–

I am checking the difference in scores on tests within one group. There is considerable missingness, so I’ve been using a model to accommodate missing data (below). I want to do a posterior predictive check, using simulated data in the generated quantities section but I’m not sure which standard deviation parameter to use within that random number generator requires a real number (my transformed sd (coef_sds) is a vector). I’ve used the sd of the present values (Y_sd), but this can’t be right, since I should be checking the observed data against the simulated sds. Any suggestions on how to simulate data in generated quantities given my model?

Many thanks!

model to accommodate missingness

functions{
//a function to turn the long data to wide coefficients
matrix get_coefs(matrix X, vector Y, int [,] Z) {
matrix[size(Z),cols(X)] coefs ;
for(i in 1:size(Z)){
coefs[i] = transpose(
inverse(
transpose( X[Z[i,1]:Z[i,2]] )
* X[Z[i,1]:Z[i,2]]
)
* transpose( X[Z[i,1]:Z[i,2]] )
* Y[Z[i,1]:Z[i,2]]
);
}
return coefs ;
}
}
data {
// nSubj: number of subjects
int nSubj ;
// nY: total number of observations
int nY ;
// nX: number of predictors
int nX ;
// Y: vector of observations
vector[nY] Y ;
// X: predictor matrix
matrix[nY,nX] X ;
// subjIndices: list of start and end indices in Y corresponding to data from
// each subject
int subjIndices[nSubj,2] ;
// n_missing: number of missing values total
int n_missing ;
// indices_of_missing: indices in Y of missing values
int indices_of_missing[n_missing] ;
// indices_of_present: indices in Y of present values
int indices_of_present[nY-n_missing] ;
}
transformed data{
// Y_mean: mean of present values in Y
real Y_mean ;
// Y_sd: sd of present values in Y
real Y_sd ;
// scaled_Y: z-scaled Y values, using Y_mean & Y_sd
vector[nY] scaled_Y ;
Y_mean = mean( Y[indices_of_present] ) ;
Y_sd = sd( Y[indices_of_present] ) ;
scaled_Y = ( Y - Y_mean ) / Y_sd ;
}
parameters {
// scaled_coef_means: z-scale coefficient means
vector[nX] scaled_coef_means ;
// scaled_coef_sds: z-scale coefficient sds
vector<lower=0>[nX] scaled_coef_sds ;
// cor_mat: correlations amongst coefficients
corr_matrix[nX] cor_mat ;
// imputed_for_missing: values imputed in place of
// missing values
vector[n_missing] imputed_for_missing ;
}
transformed parameters{
// scaled_Y_with_imputed: vector combining non-missing & imputed values
vector[nY] scaled_Y_with_imputed ;
// scaled_coefs: matrix of wide-format coefficients for each subject given
// their data and predictors
matrix[nSubj,nX] scaled_coefs ;
// add the present values to scaled_Y_with_imputed
scaled_Y_with_imputed[indices_of_present] = scaled_Y[indices_of_present] ;
// add the imputed values to scaled_Y_with_imputed
scaled_Y_with_imputed[indices_of_missing] = imputed_for_missing ;
// use the get_coefs() function
scaled_coefs = get_coefs(X,scaled_Y_with_imputed,subjIndices) ;
}
model {
// weakly informed priors
scaled_coef_means ~ normal(0,1) ;
scaled_coef_sds ~ weibull(2,1) ;
cor_mat ~ lkj_corr(1) ;
imputed_for_missing ~ normal(0,1) ;
// Each subject’s coefficients as multivariate normal
for(s in 1:nSubj){
scaled_coefs[s,] ~ multi_normal(
scaled_coef_means
) ;
}
}
generated quantities{
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_sds: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
}

generated quantities for model 2:

generated quantities{
//posterior predictive check variable
vector[nY] y2 ;
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_means: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
for(i in 1:nY){
y2[i] = normal_rng( (coef_means)[i], Y_sd) ;
}
}

In the transformed data section, you transform the non-missing data to the z-score scale using the observed mean and sd, which makes for relatively straight-forward specification of weakly-informed priors thereafter. In your generated quantities section, you correctly anticipate that you’ll need to undo this earlier transform to get back to the scale of the original measurements. I think however you’re getting tripped up on the fact that the data is supplied in long-format but then converted to a wide/coefficient representation by `get_coefs`. To get the values you’d want for PPC’ing, you need to generate new coefs from multi_normal_rng then reverse the transform achieved by get_coefs:

``````generated quantities{
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_sds: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
// Y2: new samples for posterior predictive checks
vector[nY] Y2 ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
{ /sub-environment to avoid saving ppc_coefs
matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
for(s in 1:nSubj){
scaled_coefs[s,] = multi_normal_rng(
coef_means
) ;
Y2 = rows_dot_product(ppc_coefs, X); //inverse of get_coefs()
}
}
}
``````

p.s. When pasting code in posts here, you can add 4 spaces to the beginning of each line and it’ll show up nicely as above.

I got an error when running this code, perhaps because of:

``````**{ /**sub-environment to avoid saving ppc_coefs
``````

So I changed this to

``````//sub-environment to avoid saving ppc_coefs
``````

When I do this, I get a new error:

``````Left-hand-side of sampling statement:
scaled_coefs[s,  ] ~ multi_normal(...)
variable "matrix" does not exist.
error in 'model_within_long_missing_ppc' at line 100, column 15
98:     coef_means[1] = coef_means[1] + Y_mean ;
99:     //sub-environment to avoid saving ppc_coefs
100:         matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
^
101:         for(s in 1:nSubj){
``````

Seems like an odd error given that I’m trying to specify a new variable, not referencing an existing variable. Any suggestions on how to debug?

``````generated quantities{
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_sds: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
// Y2: new samples for posterior predictive checks
vector[nY] Y2 ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
//sub-environment to avoid saving ppc_coefs
matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
for(s in 1:nSubj){
scaled_coefs[s,] = multi_normal_rng(
coef_means
) ;
Y2 = rows_dot_product(ppc_coefs, X); //inverse of get_coefs
}
}
}``````

Oops, sorry, it should be:

``````generated quantities{
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_sds: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
// Y2: new samples for posterior predictive checks
vector[nY] Y2 ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
{ //sub-environment to avoid saving ppc_coefs
matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
for(s in 1:nSubj){
scaled_coefs[s,] = multi_normal_rng(
coef_means
) ;
Y2 = rows_dot_product(ppc_coefs, X); //inverse of get_coefs()
}
}
}
``````

(you added the `/` I forgot but also deleted a `{` which needed to be there)

Unfortunately, I’m still getting an error from Stan…

``````Illegal statement beginning with non-void expression parsed as
scaled_coefs[s,  ]
Not a legal assignment, sampling, or function statement.  Note that
* Assignment statements only allow variables (with optional indexes) on the left;
if you see an outer function logical_lt (<) with negated (-) second argument,
it indicates an assignment statement A <- B with illegal left
side A parsed as expression (A < (-B)).
* Sampling statements allow arbitrary value-denoting expressions on the left.
* Functions used as statements must be declared to have void returns
``````

## 100: matrix[nSubj,nX] ppc_coefs ; // simulating new subjects’ unscaled coefficients 101: for(s in 1:nSubj){ 102: scaled_coefs[s,] = multi_normal_rng( ^ 103: coef_means

``````PARSER EXPECTED: "}"
Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  :
failed to parse Stan model 'within_long_missing_ppc' due to the above error.
``````

This confuses me, because there doesn’t appear to be assignment in the sense they’ve indicated (i.e., <-).

Again, any suggestion on debugging?

Cheers!

Sorry, my bad; the assignment should be to ppc_coefs, not scaled_coefs:

Oops, sorry, it should be:

``````generated quantities{
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_sds: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
// Y2: new samples for posterior predictive checks
vector[nY] Y2 ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
{ //sub-environment to avoid saving ppc_coefs
matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
for(s in 1:nSubj){
ppc_coefs[s,] = multi_normal_rng(
coef_means
) ;
Y2 = rows_dot_product(ppc_coefs, X); //inverse of get_coefs()
}
}
}``````

Stubborn error (see below) any other suggestions?

Thanks!

``````Left-hand-side of sampling statement:
scaled_coefs[s,  ] ~ multi_normal(...)
Base type mismatch in assignment; variable name = ppc_coefs, type = row vector; right-    hand side type=vector
Illegal statement beginning with non-void expression parsed as
ppc_coefs[s,  ]
Not a legal assignment, sampling, or function statement.  Note that
* Assignment statements only allow variables (with optional indexes) on the left;
if you see an outer function logical_lt (<) with negated (-) second argument,
it indicates an assignment statement A <- B with illegal left
side A parsed as expression (A < (-B)).
* Sampling statements allow arbitrary value-denoting expressions on the left.
* Functions used as statements must be declared to have void returns
``````
``````  error in 'model_within_long_missing_ppc' at line 102, column 13
-------------------------------------------------
100:         matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
101:         for(s in 1:nSubj){
102:             ppc_coefs[s,] = multi_normal_rng(
^
103:                 coef_means
-------------------------------------------------

PARSER EXPECTED: "}"
Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  :
failed to parse Stan model 'within_long_missing_ppc' due to the above error.
``````

[edited to esccape code]

It helps to see the whole model when trying to diagnose parsing errors. I can’t tell what the problem is from the fragment in the error message.

Also, you can escape code by wrapping it in three back-ticks like this:

``````
code
```
```

Here is my full model:

``````functions{
//a function to turn the long data to wide coefficients
matrix get_coefs(matrix X, vector Y, int [,] Z) {
matrix[size(Z),cols(X)] coefs ;
for(i in 1:size(Z)){
coefs[i] = transpose(
inverse(
transpose( X[Z[i,1]:Z[i,2]] )
* X[Z[i,1]:Z[i,2]]
)
* transpose( X[Z[i,1]:Z[i,2]] )
* Y[Z[i,1]:Z[i,2]]
);
}
return coefs ;
}
}
data {
// nSubj: number of subjects
int nSubj ;
// nY: total number of observations
int nY ;
// nX: number of predictors
int nX ;
// Y: vector of observations
vector[nY] Y ;
// X: predictor matrix
matrix[nY,nX] X ;
// subjIndices: list of start and end indices in Y corresponding to data from
//  each subject
int subjIndices[nSubj,2] ;
// n_missing: number of missing values total
int n_missing ;
// indices_of_missing: indices in Y of missing values
int indices_of_missing[n_missing] ;
// indices_of_present: indices in Y of present values
int indices_of_present[nY-n_missing] ;
}
transformed data{
// Y_mean: mean of present values in Y
real Y_mean ;
// Y_sd: sd of present values in Y
real <lower=0> Y_sd ;
// scaled_Y: z-scaled Y values, using Y_mean & Y_sd
vector[nY] scaled_Y ;
Y_mean = mean( Y[indices_of_present] ) ;
Y_sd = sd( Y[indices_of_present] ) ;
scaled_Y = ( Y - Y_mean ) / Y_sd ;
}
parameters {
// scaled_coef_means: z-scale coefficient means
vector[nX] scaled_coef_means ;
// scaled_coef_sds: z-scale coefficient sds
vector[nX] scaled_coef_sds ;
// cor_mat: correlations amongst coefficients
corr_matrix[nX] cor_mat ;
// imputed_for_missing: values imputed in place of
//  missing values
vector[n_missing] imputed_for_missing ;
}
transformed parameters{
// scaled_Y_with_imputed: vector combining non-missing & imputed values
vector[nY] scaled_Y_with_imputed ;
// scaled_coefs: matrix of wide-format coefficients for each subject given
//  their data and predictors
matrix[nSubj,nX] scaled_coefs ;
// add the present values to scaled_Y_with_imputed
scaled_Y_with_imputed[indices_of_present] = scaled_Y[indices_of_present] ;
// add the imputed values to scaled_Y_with_imputed
scaled_Y_with_imputed[indices_of_missing] = imputed_for_missing ;
// use the get_coefs() function
scaled_coefs = get_coefs(X,scaled_Y_with_imputed,subjIndices) ;
}
model {
// weakly informed priors
scaled_coef_means ~ normal(0,1) ;
scaled_coef_sds ~ weibull(2,1) ;
cor_mat ~ lkj_corr(1) ;
imputed_for_missing ~ normal(0,1) ;
// Each subject's coefficients as multivariate normal
for(s in 1:nSubj){
scaled_coefs[s,] ~ multi_normal(
scaled_coef_means
) ;
}
}
generated quantities{
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_sds: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
// Y2: new samples for posterior predictive checks
vector[nY] Y2 ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
{ //sub-environment to avoid saving ppc_coefs
matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
for(s in 1:nSubj){
ppc_coefs[s,] = multi_normal_rng(
coef_means
) ;
Y2 = rows_dot_product(ppc_coefs, X); //inverse of get_coefs()
}
}
}
``````

And the error is:

``````SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
scaled_coefs[s,  ] ~ multi_normal(...)
Base type mismatch in assignment; variable name = ppc_coefs, type = row vector; right-hand side type=vector
Illegal statement beginning with non-void expression parsed as
ppc_coefs[s,  ]
Not a legal assignment, sampling, or function statement.  Note that
* Assignment statements only allow variables (with optional indexes) on the left;
if you see an outer function logical_lt (<) with negated (-) second argument,
it indicates an assignment statement A <- B with illegal left
side A parsed as expression (A < (-B)).
* Sampling statements allow arbitrary value-denoting expressions on the left.
* Functions used as statements must be declared to have void returns

error in 'model_within_long_missing_ppc' at line 102, column 13
-------------------------------------------------
100:         matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
101:         for(s in 1:nSubj){
102:             ppc_coefs[s,] = multi_normal_rng(
^
103:                 coef_means
-------------------------------------------------

PARSER EXPECTED: "}"
Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  :
failed to parse Stan model 'within_long_missing_ppc' due to the above error.
``````

My understanding is that the first error is okay in this particular instance of modeling missing data. The following error (beginning with “Illegal Statement…” has me stumped, though.

Any advice would be greatly appreciated!

Thanks. Much easier now. We need spaces between warnings and errors, because the key thing is here (I’ve added line breaks to make it easy to read):

``````Base type mismatch in assignment;
variable name = ppc_coefs,
type = row vector; right-hand side type=vector
``````

So what that says is that you’re trying to assign a vector to a row vector. Those are different types in Stan. If you really need to do this, you want to transpose the vector on the right-hand side so it’s a row vector:

``````ppc_coefs[s, ]
= multi_normal_rng(coef_means,
``````

Note the apostrophe at the end—that’s the tranpose operator.

If you were to pack them into columns, it’d be much more efficient because matrices are column major.

I also think there may be a bug with your assigning `Y2` within the scope of the loop. I think you want that `Y2` assignment out of the for loop. As is, you just keep reassinging it with partially constructed ppc_coefs.

You’re much better off for efficiency doing this:

``````{
matrix[ , ] L_Sigma = cholesky_factor(quad_form_diag(cor_mat, coef_sds));
matrix[nX, nSubj] ppc_coefs_tr;
for (n in 1:nSub)
ppc_coefs_tr[ , n] = multi_normal_cholesky_rng(coef_means, L_Sigma);
Y2 = rows_dot_product(ppc_coefs', X);
}
``````

Now `Y2` is outside of the loop and the transposition is all at once. You could also just set the elements of `Y2` in the loop with plain `dot_product` and then no transpositions are needed.

Hi Bob,

Thanks for your help. I got stuck trying to fit the model back together again from what you wrote. Here’s what I ended up with:

``````functions{
//a function to turn the long data to wide coefficients
matrix get_coefs(matrix X, vector Y, int [,] Z) {
matrix[size(Z),cols(X)] coefs ;
for(i in 1:size(Z)){
coefs[i] = transpose(
inverse(
transpose( X[Z[i,1]:Z[i,2]] )
* X[Z[i,1]:Z[i,2]]
)
* transpose( X[Z[i,1]:Z[i,2]] )
* Y[Z[i,1]:Z[i,2]]
);
}
return coefs ;
}
}
data {
// nSubj: number of subjects
int nSubj ;
// nY: total number of observations
int nY ;
// nX: number of predictors
int nX ;
// Y: vector of observations
vector[nY] Y ;
// X: predictor matrix
matrix[nY,nX] X ;
// subjIndices: list of start and end indices in Y corresponding to data from
//  each subject
int subjIndices[nSubj,2] ;
// n_missing: number of missing values total
int n_missing ;
// indices_of_missing: indices in Y of missing values
int indices_of_missing[n_missing] ;
// indices_of_present: indices in Y of present values
int indices_of_present[nY-n_missing] ;
}
transformed data{
// Y_mean: mean of present values in Y
real Y_mean ;
// Y_sd: sd of present values in Y
real <lower=0> Y_sd ;
// scaled_Y: z-scaled Y values, using Y_mean & Y_sd
vector[nY] scaled_Y ;
Y_mean = mean( Y[indices_of_present] ) ;
Y_sd = sd( Y[indices_of_present] ) ;
scaled_Y = ( Y - Y_mean ) / Y_sd ;
}
parameters {
// scaled_coef_means: z-scale coefficient means
vector[nX] scaled_coef_means ;
// scaled_coef_sds: z-scale coefficient sds
vector[nX] scaled_coef_sds ;
// cor_mat: correlations amongst coefficients
corr_matrix[nX] cor_mat ;
// imputed_for_missing: values imputed in place of
//  missing values
vector[n_missing] imputed_for_missing ;
}
transformed parameters{
// scaled_Y_with_imputed: vector combining non-missing & imputed values
vector[nY] scaled_Y_with_imputed ;
// scaled_coefs: matrix of wide-format coefficients for each subject given
//  their data and predictors
matrix[nSubj,nX] scaled_coefs ;
// add the present values to scaled_Y_with_imputed
scaled_Y_with_imputed[indices_of_present] = scaled_Y[indices_of_present] ;
// add the imputed values to scaled_Y_with_imputed
scaled_Y_with_imputed[indices_of_missing] = imputed_for_missing ;
// use the get_coefs() function
scaled_coefs = get_coefs(X,scaled_Y_with_imputed,subjIndices) ;
}
model {
// weakly informed priors
scaled_coef_means ~ normal(0,1) ;
scaled_coef_sds ~ weibull(2,1) ;
cor_mat ~ lkj_corr(1) ;
imputed_for_missing ~ normal(0,1) ;
// Each subject's coefficients as multivariate normal
for(s in 1:nSubj){
scaled_coefs[s,] ~ multi_normal(
scaled_coef_means
) ;
}
}
generated quantities{
// coef_means: coefficient means on the original scale of Y
vector[nX] coef_means ;
// coef_sds: coefficient sds on the original scale of Y
vector[nX] coef_sds ;
// Y2: new samples for posterior predictive checks
vector[nY] Y2 ;
coef_sds = scaled_coef_sds * Y_sd ;
coef_means = scaled_coef_means * Y_sd ;
// adding back the mean to the intercept only
coef_means[1] = coef_means[1] + Y_mean ;
{ //sub-environment to avoid saving ppc_coefs
matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
ppc_coefs[s, ]
= multi_normal_rng(coef_means,
{
matrix[ , ] L_Sigma = cholesky_factor(quad_form_diag(cor_mat, coef_sds));
matrix[nX, nSubj] ppc_coefs_tr;
for (n in 1:nSub)
ppc_coefs_tr[ , n] = multi_normal_cholesky_rng(coef_means, L_Sigma);
Y2 = rows_dot_product(ppc_coefs', X);
}
}
}

``````

I have a feeling this isn’t what you meant, though because I got the following error:

``````SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
scaled_coefs[s,  ] ~ multi_normal(...)
variable "s" does not exist.
error in 'model_within_long_missing_ppc' at line 101, column 20
-------------------------------------------------
99:     { //sub-environment to avoid saving ppc_coefs
100:         matrix[nSubj,nX] ppc_coefs ; // simulating new subjects' unscaled coefficients
101:         ppc_coefs[s, ]
^
102:   = multi_normal_rng(coef_means,
-------------------------------------------------

Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  :
failed to parse Stan model 'within_long_missing_ppc' due to the above error.
``````

Do you have any ideas about how to debug the code from this point?
Thanks again!