Time series modeling: Credible intervals don't correlate with model's prediction error / uncertainty

Hi there!

I’m not-so-new to Bayesian modeling but fairly new to time series modeling. I’m writing here for advice on using the credible intervals of a time series model as a measure of model uncertainty (the main motivation for using a Bayesian model for our current project). The goal of our project is to use historical data to forecast values for the coming week – and also to report the uncertainty of these predictions.

Data (time series)

y_{obs}: daily sales of a retail store for (almost) every day from Jan 2nd, 2017 onwards; with weekly and yearly seasonality and a holiday effect.

Some data points are missing (see model).
Data are standardized before given to the model (but rescaled in the analysis plots below).

Model

I constructed the following state-space model drawing from many different resources (e.g., [1], [2], [2a], [2b], [3]):

data {
  int<lower=1> n_obs; // number of observed values
  int<lower=0> n_miss; // number of missing values
  int<lower=0> n_predict; // number of values to predict

  int<lower = 1, upper = n_obs + n_miss> indices_obs[n_obs]; // indices of observed values
  int<lower = 1, upper = n_obs + n_miss> indices_miss[n_miss];  // indices of missing values

  real y_obs[n_obs]; // observed values

  vector[n_obs + n_miss + n_predict] holiday; // dummy coded holiday vector (0 or 1)
}

transformed data {
  int<lower=0> n = n_obs + n_miss; // length of historical data (incl. missing values)

  int<lower=n> predict_start; // start of prediction period
  int<lower=n> predict_end; // end of prediction period

  int<lower=1, upper=365> yday[(n + n_predict)]; // day-of-the-year feature vector
  int<lower=1, upper=7> wday[(n + n_predict)]; // day-of-the-week feature vector

  // fill seasonality feature vectors (1-365, 1-7)
  for (i in 1:(n + n_predict)) {
    yday[i] = i % 365 + 1;
  }
  for (i in 1:(n + n_predict)) {
    wday[i] = i % 7 + 1;
  }

  predict_start = n + 1;  // index (aka timepoint) at which to start the prediction
  predict_end = n + n_predict; // index to end the prediction
}

parameters {
  real x0; // first value
  real drift; // drift
  real<lower=-1, upper=1> phi; // controls to what extent the random walk reverts to the mean, see https://nwfsc-timeseries.github.io/atsa-labs/sec-stan-ar1.html

  vector[predict_end - 1] state_error; // noise of state

  real<lower=0> sigma_state; // standard deviation for state noise
  real<lower=0> sigma_irreg; // standard deviation for irregular component

  real y_miss[n_miss]; // vector of missing values

  vector[6] seas_w; // weekly seasonality
  vector[364] seas_y; // yearly seasonality

  real h; // holiday parameter
}

transformed parameters {
  vector[predict_end] estimation; // estimated outcome
  vector[predict_end] state; // latent state of the time series model

  // zero-mean seasonal terms
  vector[7] seasonal_week = append_row(seas_w, -sum(seas_w));
  vector[365] seasonal_year = append_row(seas_y, -sum(seas_y));

  state[1] = x0; // first state

  for(i in 2:predict_end) {
    state[i] = phi * (state[i-1] - drift) + drift + sigma_state * state_error[i-1];
  }

  // the central model equation:
  // the estimated outcome is the state + yearly/weekly seasonality + the holiday feature (only relevant if holiday != 0)
  estimation = state + seasonal_week[wday] + seasonal_year[yday] + h * holiday;
}

model {
  // create vector that contains observed and missing values
  real y[n];
  y[indices_obs] = y_obs;
  y[indices_miss] = y_miss;
  // missing values are assumed to be normally distributed
  y_miss ~ normal(0, 1);

  // prior distributions of parameters
  x0 ~ normal(0, 1);
  drift ~ normal(0, 1);
  h ~ normal(0, 1);
  sigma_state ~ exponential(2);
  state_error ~ normal(0, 1);
  sigma_irreg ~ exponential(2);

  // likelihood
  y ~ normal(estimation[1:n], sigma_irreg);
}

generated quantities {
  vector[n] yhat; // posterior predictive to check whether the model has fitted the data
  vector[n_predict] prediction; // predicted values

  for(i in 1:n) {
    yhat[i] = normal_rng(estimation[i], sigma_irreg);
  }

  for (i in predict_start:predict_end) {
    // sample with a random generator, cf. with the likelihood
    prediction[i - n] = normal_rng(estimation[i], sigma_irreg);
  }
}

I’m using pystan to compile this model and to hand over the data to the model:

model = pystan.StanModel(file = 'model.stan')
data = {'n_obs': number_observations,
        'n_miss': len(indices_missing),
        'indices_obs': indices_observations,
        'indices_miss': indices_missing,
        'holiday': pd.concat([df.holiday, df_prediction.holiday]),
        'n_predict': days_to_predict,
        'y_obs': df[df.yobs.notnull()].yobs
       }
fit = model.sampling(data = data, verbose = True)

Model fit

In general, the model captures the data quite well. To test the model on unknown data (and to simulate the actual production environment which requires to forecast the coming week), we have the the following training/testing pipeline:

  • fit the model on data from January 2017 - December 2018
  • predict first week of January 2019
  • add first week of January 2019 to training set and re-fit the model
  • predict second week of January 2019
  • add second week of January 2019 to training set and re-fit the model
  • predict third week of January 2019
  • etc. until (almost) the end of 2019

This results in a year of predictions that we can compare to actually known data (but unknown to the model).
Here is a plot of actual data (unknown to the model) and the model’s mean outcome (i.e., the means of the prediction vector), including credible intervals as computed with arviz (arviz.hpd(prediction_df.values, credible_interval=0.95)).

Although this fit looks quite okay, the prediction error is still too high in my opinion. Here is a corresponding plot of the absolute prediction error:

Problem

Now, here’s the problem I need your help with:
We would like to use the model

  • to predict/forecast the next week and
  • to deliver an uncertainty about these predictions.

I thought that the (width of the) credible intervals of the prediction vector contains the uncertainty of the model, following this logic: A larger error should come with a larger model uncertainty, indicated by larger width of the credible intervals.
It turns out that this is not the case. Here is the same error plot from above together with the width of the CIs:

The CI width is relatively large in comparison to the error, however it shows way less variation. Most importantly, the varation in CI widths does not correlate with the magnitude of the error.
In an attempt to identify a potential time-dependent systemacity between error and CI widths, here are the 10 largest errors vs. CI widths:

These plots do look the same for ca. 50 similar datasets. I couldn’t find a dependency between CI widths and model errors (as conformed by close to zero correlation between these two variables).

So, I’m stuck here. My ideas would be to reduce the overall model error (as they indicate substantial trends in the data not captured by the model) by improving the model in general (e.g., adding a slope component, rethinking the 365-day yearly seasonality, …). I hope to see a better correlation of model error and CI width with a better fitting model. Another idea is to add a Kalman filter to the model, because I read this somewhere. However, I actually didn’t yet get the general idea of a Kalman filter to be able to tell if this is a meaningful idea at all …

Or am I actually missing a different point here? Potentially my model is misspecified somehow? Am I computing the wrong intervals? Is it actually possible to use the CIs as an indicator of prediction uncertainty? Any help is appreciated!

Thanks,
kartoffelsalat

1 Like

I don’t use pystan much but it would worth looking into what the default parameters are for the model run. I am guessing you may need to change the default behavior.

Hi, have you tried plotting invidual draws to see how your sample looks like?

Like 100 draws with alpha=0.01?

Do you mean parameters like number of iterations or adapt_delta? Not sure, to what extent changing some of these help with my problem. The model is converging well, at least judging from the Rhat values.

Not sure, if I understand you right. I just plotted the 1% credible intervals. These lie very close to the mean posterior of the prediction.

You have different ways to plot your time series sample. One common way to visualize is to calculate mean vector + credible intervals. This is a good way to see where are the “limits” of your time series.

Another way to is to plot multiple draws from the posterior and see how the autocorrelation works for your time series.

See for example from here https://distill.pub/2019/visual-exploration-gaussian-processes/, where one of the images shows the envelop+the individual draws. I usually like to see individual draws and see how they are behaving

Sorry, I’m still not fully understanding your suggestion. Let me try to be more concrete with my model to tell you what I understood:

I do have the prediction vector that forecasts for the coming week (from predict_start to predict_end). Each of the to-be-predicted-days contains 4000 samples. If I understood you right, you suggest to take a subset of these 4000 samples (e.g., samples no. 300-400) and plot the mean + CIs for these samples (CI with 0.01?). Repeating this for several sample subsets (i.e., creating a new plot for each subset) could reveal potential insights where the model behaves weird.

No. (soft no, hard to add tone ones writing)

So let’s assume you have your prediction data in pred and days in t

# now let's assume your prediction data for each fit is gathered in one big vector
pred = ...
# pred.shape = (samples, days_prediction)
# or edit this how you gather your samples

fig, ax = plt.subplot(1, figsize=(10,5), dpi=120)

# for sequantial selection use range(100)
# or select random choice with np.random.choice(np.arange(len(pred)), size=100, replace=False)
for i in range(100):
    # get one draw (each subset of your prediction are "independent" so the
    # the borders might look "funny"
    y = pred[i, :]
    ax.plot(t, y, alpha=0.05)

Ok, thanks for the code! Here are two pictures (full view and one zoomed in) of a plot generated like this (the data are from a different store; also, due to technical reasons (we didn’t store the whole prediction values) the plotted model outcome is now not the prediction variable from the Stan model but the yhat variable; this shouldn’t make too much difference, right?).

Not sure, to what extent this now helps with my problem.

Ok, looks good.

Now compare to hdp plot. Are they similar.

Now you can move forward and think more about your model. Are the predictions good or is there too much uncertainty in the predictions, are the predictions smooth enough?

Is there prior information you could include?