Missing data imputation in Stan

I am a little confused about missing data imputation in Stan. As a simple example, suppose Y and X have the following relation:

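data { int<lower=0> N; vector[N] x; vector[N] y; }
parameters { real alpha; real beta; real<lower=0> sigma; }
model {
  vector[N] mu = alpha + beta * x;  // linear predictor
  alpha ~ normal(0, 100);
  beta ~ normal(0, 100);
  y ~ normal(mu, sigma);
}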

Suppose now that we have a missing-data problem: we observe all of the Y values (e.g., a vector of length n) but only some of the X values (e.g., a vector of length 2n/3), and we are interested in imputing the missing values of X. In this case, should I put X in the ‘data’ block or the ‘parameters’ block?

I am also confused about the general idea of using Bayesian methods for missing data imputation. My understanding is that in a Bayesian setting we should treat the missing data as ‘parameters’. However, in the situation above we also observe 2n/3 of the X values, so declaring all of X as a ‘parameter’ does not seem to make sense. Should we treat the observed 2n/3 of X as data and the missing n/3 of X as parameters?


I’d recommend reading through the Missing Data section of the Stan User’s Guide: https://mc-stan.org/docs/2_25/stan-users-guide/missing-data-and-partially-known-parameters.html

If you have access, I’d also suggest reading the 2nd edition of Statistical Rethinking by Richard McElreath (Chapter 15), which covers this in more depth.

I also have a lecture here (the discussion of missing data starts at around the halfway mark).
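As a rough sketch of the approach described in the User’s Guide section linked above, the observed X values can be declared as data and the missing ones as parameters, with both sharing a model for X. The names below (N_obs, N_mis, x_obs, x_mis, y_xobs, y_xmis, mu_x, sigma_x) are illustrative and not from the original post. The normal model for X is an assumption, and the sketch also assumes the values are missing at random; your problem may need a different model for X.

```stan
data {
  int<lower=0> N_obs;        // rows where x is observed
  int<lower=0> N_mis;        // rows where x is missing
  vector[N_obs] x_obs;       // observed predictor values
  vector[N_obs] y_xobs;      // outcomes paired with observed x
  vector[N_mis] y_xmis;      // outcomes whose x is missing
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
  real mu_x;                 // mean of the assumed model for x
  real<lower=0> sigma_x;     // sd of the assumed model for x
  vector[N_mis] x_mis;       // missing x values, estimated as parameters
}
model {
  alpha ~ normal(0, 100);
  beta ~ normal(0, 100);
  // sigma, mu_x, and sigma_x are left with Stan's implicit flat priors here

  // Assumed model for x, shared by observed and missing values
  x_obs ~ normal(mu_x, sigma_x);
  x_mis ~ normal(mu_x, sigma_x);

  // Same regression for both groups of rows
  y_xobs ~ normal(alpha + beta * x_obs, sigma);
  y_xmis ~ normal(alpha + beta * x_mis, sigma);
}
```

The posterior draws of x_mis then serve as the imputed values. The observed Y for those rows informs them through the regression, and the observed X values inform them through mu_x and sigma_x.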


Those are very interesting resources, thanks again!