Importing dataset with missing variables

I am trying to import a dataset (y_raw) with multiple missing values into Stan (using CmdStan). I have replaced the missing values (NAs) with an easily recognizable number (-999) outside the observed data’s range. I created a row vector in the “parameters” block (y_mis) containing the imputed values, which I then want to substitute for the missing values in the dataset in “transformed parameters”. For this I wrote a for-loop that inserts a value into the dataset whenever a match for the missing-value code, -999, is found. To do this I created an index variable that ticks upwards every time a missing value is found. However, the “transformed parameters” block doesn’t accept integers, and integers are needed to index the vector. I therefore created a variable of type real and tried casting it to integer with `to_int()`. This approach doesn’t work because casting to integer only works on data, not on parameters. A different approach is needed. Any suggestions?

data {
    int<lower=1> P; // number of variables
    int<lower=0> N; // number of observations
    array[P] int N_mis; // number of missing observations
    array[N] row_vector[P] y_raw; // Missing are assigned -999
}

parameters {
    row_vector[sum(N_mis)] y_mis;
}

transformed parameters {
  array[P] row_vector[N] y;
  real total_missing_so_far = 0.0;
  
  for (p in 1:P) {
      // to_int() only accepts data arguments, so this line is rejected
      int y_mis_index = to_int(total_missing_so_far + 1);
      
      for (n in 1:N) {
          if (y_raw[p, n] == -999) {
              y[p, n] = y_mis[y_mis_index];
              y_mis_index = y_mis_index + 1;
          } else {
              y[p, n] = y_raw[p, n];
          }
      }
      total_missing_so_far = total_missing_so_far + N_mis[p];
  }
}

Hi

Firstly, there seems to be a mistake in your code: the y and y_raw variables do not have the same shape.

I am assuming they should be the same size, so I have changed y to match the shape of y_raw.

I had a similar problem, but in a matrix completion setting. My suggestion is to create an array of integers holding the positions of the missing entries. In your case this would be array[sum(N_mis), 2] int Idx_mis, where row i gives the position of the i-th missing entry: Idx_mis[i, 1] is the observation index (1 to N) and Idx_mis[i, 2] is the variable index (1 to P). This should also speed up computation, since the loop then runs only over the missing entries.

Sample code:

data {
    int<lower=1> P; // number of variables
    int<lower=0> N; // number of observations
    array[P] int N_mis; // number of missing observations
    array[N] row_vector[P] y_raw; // Missing are assigned -999
    array[sum(N_mis), 2] int Idx_mis; // Idx[n,1] corresponds to N, Idx[n,2] corresponds to P for the missing entry y_mis[n]
}

parameters {
    row_vector[sum(N_mis)] y_mis;
}

transformed parameters {
  array[N] row_vector[P] y = y_raw; // make copy of y_raw; also fixed sizes to be the same.
  
  for (i in 1:sum(N_mis)) {
       y[Idx_mis[i,1], Idx_mis[i,2]] = y_mis[i];
  }
}

If you do not want to import this array of integers beforehand, you can add a loop in the transformed data block that assigns Idx_mis (and drop its declaration from the data block):

transformed data {
    array[sum(N_mis), 2] int Idx_mis;
    int counter = 1;
    for (p in 1:P) {
        for (n in 1:N) {
            if (y_raw[n, p] == -999) {
                Idx_mis[counter, 1] = n;
                Idx_mis[counter, 2] = p;
                counter += 1;
            }
        }
    }
}
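For completeness, here is a minimal sketch of a variant that stays closer to the original loop. It assumes that the restriction on integers in "transformed parameters" applies only to the block-level declarations, so an int counter declared inside a nested local block (the name pos is made up for illustration) should be accepted without any call to to_int(). As above, y is given the same shape as y_raw.

```stan
data {
  int<lower=1> P;                // number of variables
  int<lower=0> N;                // number of observations
  array[P] int<lower=0> N_mis;   // number of missing observations per variable
  array[N] row_vector[P] y_raw;  // missing entries are coded as -999
}

parameters {
  row_vector[sum(N_mis)] y_mis;  // one imputed value per missing entry
}

transformed parameters {
  array[N] row_vector[P] y = y_raw;  // start from a copy of the raw data
  {
    // Local scope: pos is an ordinary local variable, not a transformed
    // parameter, so it may be an int (assumption stated above).
    int pos = 1;
    for (p in 1:P) {
      for (n in 1:N) {
        if (y_raw[n, p] == -999) {
          y[n, p] = y_mis[pos];  // fill the gap with the next imputed value
          pos += 1;
        }
      }
    }
  }
}
```

This version rescans all N * P entries every time the block is evaluated, whereas the Idx_mis approach loops only over the missing entries, so the index array is likely the better choice when only a small fraction of the data is missing.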