Importing dataset with missing variables

pehkawn · April 12, 2024, 12:52pm

I am trying to import a dataset (y_raw) with multiple missing values into Stan (using cmdstan). I have replaced every missing value (NA) with an easily recognizable number (-999) outside the observed data’s interval. In the “parameters” block I created a row vector (y_mis) containing the imputed values, which I then want to substitute for the missing values in “transformed parameters”. For this I wrote a for-loop that inserts a value into the dataset whenever it finds the missing-value code, -999. To do this I need an index variable that ticks upwards every time a missing value is found. However, the “transformed parameters” block doesn’t accept integers, and integers are needed to index the vector. I therefore created a variable of type real and tried casting it to integer (`to_int()`). This approach doesn’t work because, for some reason, casting to integer only works on data and not on parameters. A different approach is needed. Any suggestions?

data {
    int<lower=1> P; // number of variables
    int<lower=0> N; // number of observations
    array[P] int N_mis; // number of missing observations
    array[N] row_vector[P] y_raw; // Missing are assigned -999
}

parameters {
    row_vector[sum(N_mis)] y_mis;
}

transformed parameters {
  array[P] row_vector[N] y;
  real total_missing_so_far = 0.0;
  
  for (p in 1:P) {
      // to_int() is rejected here: its argument is not data
      int y_mis_index = to_int(total_missing_so_far + 1);
      
      for (n in 1:N) {
          if (y_raw[p, n] == -999) {
              y[p, n] = y_mis[y_mis_index];
              y_mis_index = y_mis_index + 1;
          } else {
              y[p, n] = y_raw[p, n];
          }
      }
      total_missing_so_far = total_missing_so_far + N_mis[p];
  }
}

Garren_Hermanus · April 13, 2024, 1:11am

Hi

Firstly, there seems to be a mistake in your code: the y and y_raw variables do not have the same shape.

I am assuming these should be the same size, so in the code below I have declared y with the same shape as y_raw.

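I had a similar problem in a matrix completion setting. My suggestion is to create an array of integers; in your case this would be array[sum(N_mis), 2] int Idx_mis. Row i of Idx_mis holds the position of the i-th missing entry: Idx_mis[i,1] is its observation index (1 to N) and Idx_mis[i,2] is its variable index (1 to P). So y_mis[1] goes to the position given by Idx_mis[1,:]. This also speeds up computation, because transformed parameters then loops over only the sum(N_mis) missing entries instead of testing all N × P entries on every evaluation.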

Sample code:

data {
    int<lower=1> P; // number of variables
    int<lower=0> N; // number of observations
    array[P] int N_mis; // number of missing observations
    array[N] row_vector[P] y_raw; // Missing are assigned -999
    array[sum(N_mis), 2] int Idx_mis; // Idx[n,1] corresponds to N, Idx[n,2] corresponds to P for the missing entry y_mis[n]
}

parameters {
    row_vector[sum(N_mis)] y_mis;
}

transformed parameters {
  array[N] row_vector[P] y = y_raw; // make copy of y_raw; also fixed sizes to be the same.
  
  for (i in 1:sum(N_mis)) {
       y[Idx_mis[i,1], Idx_mis[i,2]] = y_mis[i];
  }
}
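
As an aside, the original counter approach can also be kept without to_int(). As far as I know, Stan only forbids integer declarations at the top level of transformed parameters; local variables declared inside a nested { } block may be integers. A minimal sketch under that assumption (k is just an illustrative name, and the model block is left out):

```stan
data {
  int<lower=1> P;                 // number of variables
  int<lower=0> N;                 // number of observations
  array[P] int N_mis;             // number of missing observations per variable
  array[N] row_vector[P] y_raw;   // missing entries are coded as -999
}

parameters {
  row_vector[sum(N_mis)] y_mis;   // one imputed value per missing entry
}

transformed parameters {
  array[N] row_vector[P] y = y_raw;  // start from a copy of the raw data
  {
    // Nested block: k is a local variable, so it is allowed to be an int.
    int k = 1;
    for (p in 1:P) {
      for (n in 1:N) {
        if (y_raw[n, p] == -999) {
          y[n, p] = y_mis[k];     // overwrite the -999 code with the parameter
          k += 1;
        }
      }
    }
  }
}
```

This still tests all N × P entries on every evaluation, so the index-array version above remains the cheaper option.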

If you do not want to import this array of integers beforehand, you can instead add a loop in the transformed data block that fills in Idx_mis (in that case, remove the Idx_mis declaration from the data block):

transformed data {
    array[sum(N_mis), 2] int Idx_mis;
    int counter = 1;
    for (p in 1:P) {
        for (n in 1:N) {
            if (y_raw[n,p] == -999) {
                Idx_mis[counter,1] = n;
                Idx_mis[counter,2] = p;
                counter += 1;
            }
        }
    }
}
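
Putting the pieces together, a complete program might look like the sketch below. N_mis_total, the reject() check and the normal(0, 10) prior are additions for illustration only and are not part of the code above; the prior in particular is a placeholder to be replaced by your actual model.

```stan
data {
  int<lower=1> P;                   // number of variables
  int<lower=0> N;                   // number of observations
  array[P] int<lower=0> N_mis;      // number of missing observations per variable
  array[N] row_vector[P] y_raw;     // missing entries are coded as -999
}

transformed data {
  int N_mis_total = sum(N_mis);     // illustrative name, not in the original
  array[N_mis_total, 2] int Idx_mis;
  {
    int counter = 1;
    for (p in 1:P) {
      for (n in 1:N) {
        if (y_raw[n, p] == -999) {
          Idx_mis[counter, 1] = n;  // observation index
          Idx_mis[counter, 2] = p;  // variable index
          counter += 1;
        }
      }
    }
    // Sanity check (an addition): the -999 count must agree with N_mis.
    if (counter - 1 != N_mis_total) {
      reject("Found ", counter - 1, " entries coded -999, but sum(N_mis) = ",
             N_mis_total);
    }
  }
}

parameters {
  row_vector[N_mis_total] y_mis;    // one imputed value per missing entry
}

transformed parameters {
  array[N] row_vector[P] y = y_raw; // copy, then overwrite the missing entries
  for (i in 1:N_mis_total) {
    y[Idx_mis[i, 1], Idx_mis[i, 2]] = y_mis[i];
  }
}

model {
  // Placeholder prior (an assumption): replace with the real model for y.
  y_mis ~ normal(0, 10);
}
```

Because the loops run over p first and n second, y_mis is ordered variable by variable, which matches the per-variable counts in N_mis.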
