Missing response model (section 10.3 of Stan manual)

I’m developing a regression model using Rstan that will eventually accommodate missing observations in the predictors and the response. I am starting with the example in section 10.3 of the Stan manual, which describes how to use index arrays to handle missing data. The code below is an exact reproduction of what is in the manual. Attempting to fit the model on simulated data produces an error that I don’t understand. The code is pasted below along with a few lines from R to make this a reproducible example. What can I do to resolve this error?

From section 10.3 of the Stan manual (sliced missing data)

model <- "
data{

int<lower = 0> N_obs; // number of non-missing observations
int<lower = 0> N_mis; // number of missing observations
int<lower = 1, upper = N_obs + N_mis> ii_obs[N_obs]; // indexes of non-missing
int<lower = 1, upper = N_obs + N_mis> ii_mis[N_mis]; // indexes of missing
real y_obs[N_obs]; // the non-missing observations

}

transformed data{

int<lower = 0> N = N_obs + N_mis; // sample size

}

parameters{

real y_mis[N_mis]; // missing data are a parameter to be estimated
real<lower=0> sigma; // sigma must be gte 0

}

transformed parameters{

real y[N]; // the data, which will include missing and non-missing components
y[ii_obs] = y_obs; // populate y with the observed part of the data
y[ii_mis] = y_mis; // and the missing part of the data

}

model{

sigma ~ gamma(1, 1);
y[1] ~ normal(0, sigma);
y[2:N] ~ normal(y[1:(N-1)], sigma);

}
"

# Simulate some data
set.seed(1)
y_obs <- rnorm(1000,0,2)
ii_obs <- seq(1,1000,1)
ii_mis <- sample(seq(2,1000,1),10,replace=F)
ii_obs <- ii_obs[-ii_mis]
N_obs <- 990
N_mis <- 10

data_list <- list(y_obs=y_obs, ii_obs=ii_obs, ii_mis=ii_mis, N_obs=N_obs, N_mis=N_mis)

fit <- stan(model_code = model, data = data_list, cores = 1, chains = 2, iter = 2000)

DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
y[1] ~ normal(…)
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
y[2:N] ~ normal(…)

In file included from C:/Program Files/R/R-3.4.0/library/BH/include/boost/config.hpp:39:0,
from C:/Program Files/R/R-3.4.0/library/BH/include/boost/math/tools/config.hpp:13,
from C:/Program Files/R/R-3.4.0/library/StanHeaders/include/stan/math/rev/core/var.hpp:7,
from C:/Program Files/R/R-3.4.0/library/StanHeaders/include/stan/math/rev/core/gevv_vvv_vari.hpp:5,
from C:/Program Files/R/R-3.4.0/library/StanHeaders/include/stan/math/rev/core.hpp:12,
from C:/Program Files/R/R-3.4.0/library/StanHeaders/include/stan/math/rev/mat.hpp:4,
from C:/Program Files/R/R-3.4.0/library/StanHeaders/include/stan/math.hpp:4,
from C:/Program Files/R/R-3.4.0/library/StanHeaders/include/src/stan/model/model_header.hpp:4,
from file1806e89432d.cpp:8:
C:/Program Files/R/R-3.4.0/library/BH/include/boost/config/compiler/gcc.hpp:186:0: warning: “BOOST_NO_CXX11_RVALUE_REFERENCES” redefined

define BOOST_NO_CXX11_RVALUE_REFERENCES

^
:0:0: note: this is the location of the previous definition
cc1plus.exe: warning: unrecognized command line option "-Wno-ignored-attributes"
Error in new_CppObject_xp(fields$.module, fields$.pointer, …) :
mismatch in dimension declared and found in context; processing stage=data initialization; variable name=y_obs; position=0; dims declared=(990); dims found=(1000)
failed to create the sampler; sampling not done

You need to take the missing values out of y_obs.

Thanks for catching that. I made the correction and that code now works.

I modified the code to accommodate missing observations in the matrix of predictors for a Normal regression model. Now I’m treating the missing elements in X as parameters to be estimated. In the transformed parameters block, I am trying to produce a matrix that includes the observed predictors and simulated missing predictors. Stan throws an error that says:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

base type mismatch in assignment; variable name = x, type = real; right-hand side type=matrix

ERROR at line 32

_ 31: matrix[1000, 2] x;_
_ 32: x[all_rows,1] = intercept;_
_ ^_
_ 33: x[ii_X_1_obs,2] = X_1_obs;_

PARSER EXPECTED:
Error in stanc(file = file, model_code = model_code, model_name = model_name, : _
_ failed to parse Stan model ‘f62932ae6446ddff2bd49b4a1af22eb1’ due to the above error.

But I do not understand the error, because x is defined as a matrix with 1000 rows and 2 columns. I am trying to populate the first column with 1s so I can estimate an intercept. I do this at line 32 by indexing the rows with all_rows, which is defined in the data block as integers of length 1000. intercept is defined as a matrix with 1000 rows and 1 column. Line 33 attempts to populate the non-missing rows of the second column of the matrix with the non-missing elements in the predictor variable. What is wrong with the assignment at line 32?

Here is the code for a reproducible example:

model <- "
data{

int<lower = 0> N_X_1_obs;
int<lower = 0> N_X_1_mis;
int<lower = 1, upper = N_X_1_obs + N_X_1_mis> ii_X_1_obs[N_X_1_obs];
int<lower = 1, upper = N_X_1_obs + N_X_1_mis> ii_X_1_mis[N_X_1_mis];
real X_1_obs[N_X_1_obs];
vector[N_X_1_obs + N_X_1_mis] y;
int<lower=0, upper=N_X_1_obs + N_X_1_mis> all_rows;
matrix[1000,1] intercept;

}

transformed data{

int<lower=0> N = N_X_1_obs + N_X_1_mis;

}

parameters{

vector[2] beta;
real<lower=0> sigma;
real X_1_mis[N_X_1_mis];

}

transformed parameters{

matrix[1000, 2] x;
x[all_rows,1] = intercept;
x[ii_X_1_obs,2] = X_1_obs;
x[ii_X_1_mis,2] = X_1_mis;

}

model{

y ~ normal(x * beta, sigma);

}
"

set.seed(1)
x <- matrix(runif(1000,0,100),1000,1)
x <- cbind(1,x)
beta <- matrix(c(3,2),ncol=1)
y <- as.vector(x%*%beta + rnorm(1000,0,1))
mle <- lm(y ~ x -1)
print(mle)
ii_X_1_obs <- seq(1,1000,1)
ii_X_1_mis <- sample(seq(2,1000,1),10,replace=F)
ii_X_1_obs <- ii_X_1_obs[-ii_X_1_mis]
X_1_obs <- x[-ii_X_1_mis,2]
N_X_1_obs <- 990
N_X_1_mis <- 10
all_rows <- matrix(seq(1,1000), ncol=1)
intercept <- matrix(rep(1,1000), ncol=1)
data_list <- list(X_1_obs=X_1_obs, N_X_1_obs=N_X_1_obs, N_X_1_mis=N_X_1_mis, ii_X_1_obs=ii_X_1_obs, ii_X_1_mis=ii_X_1_mis, y = y, all_rows=all_rows, intercept=intercept)

fit <- stan(model_code = model, data = data_list, cores = 1, chains = 2, iter = 2000)

I don’t think that works with a matrix, even if the matrix has only one column (in which case it should be a vector). I tend to just take the intercept out of the design matrix and do target += normal(y | alpha + x * beta, sigma); where alpha is an additional parameter to estimate.

Thanks again for the feedback. I got rid of the matrix of predictors and replaced it with a vector as you suggested, and I also removed the intercept column so that I am now fitting the model with a separate parameter, alpha, for the intercept. Stan throws a similar error, now at line 33. For some reason I do not understand, it does not like the attempt to populate the vector x with observed and simulated missing observations. What’s weird about this is that the same procedure works when I use it to simulate y based on example 10.3 in the manual. I’ve attached two text files that each contain a reproducible example.

The file called: simulate-missing-y-this-works.txt is example 10.3 from the manual and runs.

The file called: simulate-missing-x-does-not-work.txt produces the following error.

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

base type mismatch in assignment; variable name=x, type=vector; right-hand side type=real
Illegal statement beginning with non-void expression parsed as
_ x[ii_X_1_obs]_
Not a legal assignment, sampling, or function statement. Note that
_ * Assignment statements only allow variables (with optional indexes) on the left;_
_ if you see an outer function logical_lt (<) with negated (-) second argument,_
_ it indicates an assignment statement A <- B with illegal left_
_ side A parsed as expression (A < (-B))._
_ * Sampling statements allow arbitrary value-denoting expressions on the left._
_ * Functions used as statements must be declared to have void returns_

ERROR at line 31

_ 30: vector[1000] x;_
_ 31: x[ii_X_1_obs] = X_1_obs;_
_ ^_
_ 32: x[ii_X_1_mis] = X_1_mis;_

PARSER EXPECTED: <one of the following:
_ a variable declaration, beginning with type_
_ (int, real, vector, row_vector, matrix, unit_vector,_
_ simplex, ordered, positive_ordered,_
_ corr_matrix, cov_matrix,_
_ cholesky_corr, cholesky_cov_
_ or a _
_ or ‘}’ to close variable declarations and definitions>_
_Error in stanc(file = file, model_code = model_code, model_name = model_name, : _
_ failed to parse Stan model ‘8003038a56728195ef95b52bb08c86d6’ due to the above error.

_simulate-missing-x-does_not-work.txt (1.2 KB)
simulate-missing-y-this-works.txt (1.4 KB)

Try using the vector type for the observed and missing values, instead of a real array

Where are all those underscores coming from? Is that in one of our interfaces?

I changed the data types to ensure that the observed and missing data are vectors as you suggested. That did not resolve the issue. However, it produced a new error suggesting that it did not like parameters coded as vectors of length 1, like this:

vector[1] beta;

so I changed those to real, like this:

real beta;

This, in combination with your suggestion allows the code to run without error. Thanks for your help.

Also, regarding your comment above about the assignment of missing data not working on a matrix using a statement like:

x[all_rows,1] = intercept

I looked back at the Stan manual, and found that it shows an example like this on page 180. Here’s a screen grab:

So my suspicion is that the issue was lack of correspondence among data definitions.

Assuming you are referring to the underscores in the error messages, those showed up when I pasted the text here on the forum (not sure why). They are not in the error message produced in the R console.

You had intercept declared as a matrix but I think it needs to be a vector if you want to use it like this. And you’d need X_1_mis and X_1_obs to be vectors too. Here’s a slightly modified version of your program from above:

data{
  int<lower = 0> N_X_1_obs;
  int<lower = 0> N_X_1_mis;
  int<lower = 1, upper = N_X_1_obs + N_X_1_mis> ii_X_1_obs[N_X_1_obs];
  int<lower = 1, upper = N_X_1_obs + N_X_1_mis> ii_X_1_mis[N_X_1_mis];
  vector[N_X_1_obs] X_1_obs;
  int<lower=0, upper=N_X_1_obs + N_X_1_mis> all_rows;
  vector[all_rows] intercept;
}
transformed data{
  int<lower=0> N = N_X_1_obs + N_X_1_mis;
}
parameters{
  vector[N_X_1_mis] X_1_mis;
}
transformed parameters{
  matrix[all_rows, 2] x;
  x[,1] = intercept;
  x[ii_X_1_obs,2] = X_1_obs;
  x[ii_X_1_mis,2] = X_1_mis;
}
...

There’s nothing wrong with

vector[1] beta;

but you have to give it a length-1 array, which R makes difficult. You have to use something like as.array(x, c(1)) or something like that.

You can also use the slicing and multiple indexing—it works just like it works in R or MATLAB. Unlike R, it requires type consistency. So I’m sure you’re right that the problem is mismatched data types.

Thanks Jonah. I’ll give this a try and will post back here when I have another update. I appreciate you taking the time to help me understand what was going on in the code.

1 Like