Missing data in GP

Hi everyone,

I feel like the answer to my question is very simple, but I just cannot get it. I am modeling a Gaussian process on a time series, with x as the time points and y as the outcome variable. I have some missing data in y. I can fit y using

y ~ multi_normal_cholesky(mu, L_K);

as it appears in the Stan guide, but how do I skip over the missing time points? Should I just pass a time vector with gaps in it? For example, x = [1 2 4 5] with time point 3 missing? I think this might not be right because of how the kernel is calculated with

matrix[N, N] K = cov_exp_quad(x, alpha, rho);

Thanks in advance!

You have observations y_i at points x_i, so you fit the GP over those. Then you predict y_j at whatever other points x_j you want.

A GP doesn’t assume gridded points.
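To spell out why gaps are fine: the exponentiated quadratic kernel that cov_exp_quad computes depends only on the distance between each pair of inputs, not on their position in the vector. Written with the alpha and rho from the question:

```latex
K_{ij} = k(x_i, x_j) = \alpha^2 \exp\!\left( -\frac{(x_i - x_j)^2}{2\rho^2} \right)
```

So for x = [1 2 4 5], the entry for times 2 and 4 simply uses a distance of 2 rather than 1, which weakens the correlation between those two points as it should.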

Did I miss something?

Thanks! This makes total sense, and I know it’s possible to use a GP to predict unobserved future data. However, my data has “holes” in it. For example, I collect continuous data (y) on one subject every day (x) for a month, but on some days during the month I don’t have data because the subject forgot to fill out the questionnaire. So is it possible to use a GP to interpolate the missing days? From my understanding (maybe it’s wrong!) it should be possible, I just don’t know how to phrase it in Stan in the fitting line

y ~ multi_normal_cholesky(mu, L_K);

Hope I explained myself better!

Hi Nerpa,

What I think Ari said is that you fit your model as usual, and then you look at what the model predicts for the days that are “missing”.

So just to make sure I understand: if I have 4 time indexes [1 2 3 4] and time 3 is missing, I just fit the GP to the data at time indexes [1 2 4] and then predict the missing time index in the generated quantities block?

If I understood Ari correctly, yes :)
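For anyone landing here later, a minimal sketch of that recipe might look like the following. It is not taken from this thread: the names N_obs, x_obs, y_obs, N_mis, x_mis and y_mis are made up for illustration, the mean is assumed to be zero instead of mu, sigma is an assumed observation noise, and the priors are placeholders. It also uses gp_exp_quad_cov (the newer name for cov_exp_quad) and the array[N] syntax, so it assumes a reasonably recent Stan version.

```stan
data {
  int<lower=1> N_obs;            // number of days with data
  array[N_obs] real x_obs;       // observed days, e.g. {1, 2, 4}
  vector[N_obs] y_obs;           // outcomes on those days
  int<lower=1> N_mis;            // number of days to interpolate
  array[N_mis] real x_mis;       // missing days, e.g. {3}
}
parameters {
  real<lower=0> rho;             // length scale
  real<lower=0> alpha;           // marginal standard deviation
  real<lower=0> sigma;           // observation noise (assumed)
}
transformed parameters {
  // Cholesky factor of kernel plus noise, built from observed days only.
  // Note: this gets saved for every draw; fine for a month of data.
  matrix[N_obs, N_obs] L_K = cholesky_decompose(
    add_diag(gp_exp_quad_cov(x_obs, alpha, rho), square(sigma)));
}
model {
  // Placeholder priors; adjust to the scale of your data.
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  // Zero mean assumed here; replace rep_vector(0, N_obs) with mu if needed.
  y_obs ~ multi_normal_cholesky(rep_vector(0, N_obs), L_K);
}
generated quantities {
  vector[N_mis] y_mis;           // draws for the missing days
  {
    // Cross-covariance between observed and missing days.
    matrix[N_obs, N_mis] K_om = gp_exp_quad_cov(x_obs, x_mis, alpha, rho);
    // v = L_K^{-1} K_om and w = L_K^{-1} y_obs
    matrix[N_obs, N_mis] v = mdivide_left_tri_low(L_K, K_om);
    vector[N_obs] w = mdivide_left_tri_low(L_K, y_obs);
    // Conditional mean and covariance of y at the missing days.
    vector[N_mis] mu_mis = v' * w;
    matrix[N_mis, N_mis] cov_mis = add_diag(
      gp_exp_quad_cov(x_mis, alpha, rho) - crossprod(v),
      square(sigma) + 1e-9);     // noise plus a small jitter for stability
    y_mis = multi_normal_rng(mu_mis, cov_mis);
  }
}
```

The model block only ever sees the observed days, so nothing about the missing ones affects the fit. The generated quantities block then conditions on the observed data to draw y_mis for each posterior draw of rho, alpha and sigma.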

Thanks! (Hope this second-hand understanding is correct :)

Yes. You are now thinking in “discrete” timesteps, but what I usually think of is a continuous space (e.g. spatial or time) where I have observations at some locations (no grid, just some locations), and then you fit the GP over those values. The predictions (in this case interpolation) are then made at some set of points (which could be the same as in fitting or not).

So what you will see is that the uncertainty will increase for locations with “missing” data.
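One way to see where that extra uncertainty comes from is the standard GP predictive variance at a new point x_*. Here k_* is the vector of kernel values between x_* and the observed points, K is the kernel matrix over the observed points, and sigma is an assumed observation noise that does not appear in the posts above:

```latex
\operatorname{Var}\!\left[ f(x_*) \mid y \right]
  = k(x_*, x_*) - k_*^{\top} \left( K + \sigma^2 I \right)^{-1} k_*
```

The subtracted term shrinks as x_* moves away from the observed points, because the entries of k_* decay with distance. Far from any data the variance approaches the prior value k(x_*, x_*), which is alpha^2 for the exponentiated quadratic kernel.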

Great! Thank you so much.