I am fitting a basic linear model to my data, following is the Stan model.
data {
int<lower=0> N;
vector[N] y;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
y ~ normal(mu, sigma);
}
However, around 50% of the y (dependent variable) are missing values (NaN). I don’t want to delete these missing values, as in my experiment the trial sequence is meaningful.
Hence I wonder if is it OK to keep these NaNs as they were in Stan when fitting the model to my data, or is there any tricky thing I need to do to deal with these NaNs?
Many thanks and very much appreciated to any suggestions/comments/solutions.
You can’t put NaNs into Stan but you can declare the missing data points as unknown (in the parameters block), which lets you take into account the uncertainty due to the missingness (and also estimate the missing values if you want):
Sorry for my disturb, may I ask for your suggestion?
Following is my Stan script, which is syntactically correct, and I include random intercept and slope at the participant level.
data {
int<lower=0> N_obs; // number of observed data points
int<lower=0> N_mis; // number of missing data points
int<lower=1> N; // number of total data points (i.e., observed + missing)
int<lower=1, upper=N> ii_obs[N_obs]; // indexes of the observed data
int<lower=1, upper=N> ii_mis[N_mis]; // indexes of the missing data
int<lower=1> L; // number of participants
int<lower=1, upper=L> ID[N]; // participant's ID
int<lower=1, upper=2> x[N]; // the predictor variable
array[N_obs] real y_obs; // observed data
}
parameters {
// missing data
array[N_mis] real y_mis; // missing data
// Hierarchical mu parameters (at the population-level)
real mu_b0; // intercept
real mu_b1; // slope
// sigma parameters
real<lower=0> sigma; // from the Normal distribution of y
real<lower=0> sigma_b0; // from the hierarchical distribution of intercept
real<lower=0> sigma_b1; // from the hierarchical distribution of slope
// Normal distributions for non-center parameterization, i.e., Matt trick
real z_b0[L];
real z_b1[L];
}
transformed parameters {
// data
array[N] real y;
y[ii_obs] = y_obs;
y[ii_mis] = y_mis;
real b0[L];
real b1[L];
// trial-level parameters
real u[N]; // mean estimate per trial
for (l in 1:L) {
b0[l] = mu_b0 + z_b0[l]*sigma_b0;
b1[l] = mu_b1 + z_b1[l]*sigma_b1;
}
for (n in 1:N) {
// the mean estimate per trial
u[n] = b0[ID[n]] + b1[ID[n]]*x[n];
}
}
model {
mu_b0 ~ normal(0, 10);
mu_b1 ~ normal(0, 10);
sigma ~ gamma(2, 2);
sigma_b0 ~ gamma(2, 2);
sigma_b1 ~ gamma(2, 2);
z_b0 ~ normal(0, 1);
z_b1 ~ normal(0, 1);
y ~ normal(u, sigma);
}
generated quantities {
vector[N] log_lik;
{for (n in 1:N) {
log_lik[n] = normal_lpdf(y[n] | u[n], sigma);
}
}
}
I am not sure whether it is the right code, may I have your suggestions?
Thank you very much in advance.
I haven’t had a chance to look at the whole model in detail, but the code related to the missing data points looks good! (Feel free to start a new topic if you have different questions about the model structure or problems fitting it.)
This is ok but uses deprecated syntax. You can update it to array[N_obs] int<lower=1, upper=N> ii_obs, etc.