How to deal with NaN for the outcome variable/dependent variables

Dear Stan community,

I am fitting a basic linear model to my data, following is the Stan model.

data {
  int<lower=0> N;
  vector[N] y;
}

parameters {
  real mu;
  real<lower=0> sigma;
}

model {
  y ~ normal(mu, sigma);
}

However, around 50% of the y (dependent variable) are missing values (NaN). I don’t want to delete these missing values, as in my experiment the trial sequence is meaningful.

Hence I wonder if is it OK to keep these NaNs as they were in Stan when fitting the model to my data, or is there any tricky thing I need to do to deal with these NaNs?

Many thanks and very much appreciated to any suggestions/comments/solutions.

Best,

1 Like

You can’t put NaNs into Stan but you can declare the missing data points as unknown (in the parameters block), which lets you take into account the uncertainty due to the missingness (and also estimate the missing values if you want):

https://mc-stan.org/docs/stan-users-guide/missing-data.html

1 Like

Hi @jonah ,

Thank you very much for your help!
I’ll take a look.

Best

1 Like

Hi @jonah ,

Sorry for my disturb, may I ask for your suggestion?
Following is my Stan script, which is syntactically correct, and I include random intercept and slope at the participant level.

data {
  int<lower=0> N_obs; // number of observed data points
  int<lower=0> N_mis; // number of missing data points 
  int<lower=1> N; // number of total data points (i.e., observed + missing)
  int<lower=1, upper=N> ii_obs[N_obs]; // indexes of the observed data
  int<lower=1, upper=N> ii_mis[N_mis]; // indexes of the missing data
  int<lower=1> L; // number of participants
  
  int<lower=1, upper=L> ID[N]; // participant's ID
  int<lower=1, upper=2> x[N]; // the predictor variable
  array[N_obs] real y_obs; // observed data
  
}


parameters {
  
  // missing data
  array[N_mis] real y_mis; // missing data

 // Hierarchical mu parameters (at the population-level)
  real mu_b0; // intercept
  real mu_b1; // slope
  
  // sigma parameters
  real<lower=0> sigma; // from the Normal distribution of y
  real<lower=0> sigma_b0; // from the hierarchical distribution of intercept
  real<lower=0> sigma_b1; // from the hierarchical distribution of slope
  
  // Normal distributions for non-center parameterization, i.e., Matt trick
  real z_b0[L];
  real z_b1[L]; 

  
}


transformed parameters {
  
  // data
  array[N] real y; 
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
  
  real b0[L];
  real b1[L];
  
  // trial-level parameters
  
  real u[N]; // mean estimate per trial
  
  for (l in 1:L) {
    b0[l] = mu_b0 + z_b0[l]*sigma_b0;
    b1[l] = mu_b1 + z_b1[l]*sigma_b1;
  }
  
  for (n in 1:N) {
    
    // the mean estimate per trial
    u[n] = b0[ID[n]] + b1[ID[n]]*x[n];
    
  }
  

}
  

model {
  
  mu_b0 ~ normal(0, 10);
  mu_b1 ~ normal(0, 10);
  
  
  sigma ~ gamma(2, 2);
  sigma_b0 ~ gamma(2, 2);
  sigma_b1 ~ gamma(2, 2);
  
  z_b0 ~ normal(0, 1);
  z_b1 ~ normal(0, 1);
  
  y ~ normal(u, sigma);
  
}
 
 
generated quantities {
  vector[N] log_lik;
  
  {for (n in 1:N) {
    log_lik[n] = normal_lpdf(y[n] | u[n], sigma);
  }
    
  }
  
}

I am not sure whether it is the right code, may I have your suggestions?
Thank you very much in advance.

Best

I haven’t had a chance to look at the whole model in detail, but the code related to the missing data points looks good! (Feel free to start a new topic if you have different questions about the model structure or problems fitting it.)

This is ok but uses deprecated syntax. You can update it to array[N_obs] int<lower=1, upper=N> ii_obs, etc.

1 Like

Hi @jonah ,

I see, thank you very much for your reply!
I will start a new topic if I encounter problems fitting it.

1 Like