# Time series modeling: Credible intervals don't correlate with model's prediction error / uncertainty

Hi there!

I’m not-so-new to Bayesian modeling but fairly new to time series modeling. I’m writing here for advice on using the credible intervals of a time series model as a measure of model uncertainty (the main motivation for using a Bayesian model for our current project). The goal of our project is to use historical data to forecast values for the coming week – and also to report the uncertainty of these predictions.

#### Data (time series)

y_{obs}: daily sales of a retail store for (almost) every day from Jan 2nd, 2017 onwards; with weekly and yearly seasonality and a holiday effect.

Some data points are missing (see model).
Data are standardized before given to the model (but rescaled in the analysis plots below).

#### Model

I constructed the following state-space model drawing from many different resources (e.g., [1], [2], [2a], [2b], [3]):

``````data {
int<lower=1> n_obs; // number of observed values
int<lower=0> n_miss; // number of missing values
int<lower=0> n_predict; // number of values to predict

int<lower = 1, upper = n_obs + n_miss> indices_obs[n_obs]; // indices of observed values
int<lower = 1, upper = n_obs + n_miss> indices_miss[n_miss];  // indices of missing values

real y_obs[n_obs]; // observed values

vector[n_obs + n_miss + n_predict] holiday; // dummy coded holiday vector (0 or 1)
}

transformed data {
int<lower=0> n = n_obs + n_miss; // length of historical data (incl. missing values)

int<lower=n> predict_start; // start of prediction period
int<lower=n> predict_end; // end of prediction period

int<lower=1, upper=365> yday[(n + n_predict)]; // day-of-the-year feature vector
int<lower=1, upper=7> wday[(n + n_predict)]; // day-of-the-week feature vector

// fill seasonality feature vectors (1-365, 1-7)
for (i in 1:(n + n_predict)) {
yday[i] = i % 365 + 1;
}
for (i in 1:(n + n_predict)) {
wday[i] = i % 7 + 1;
}

predict_start = n + 1;  // index (aka timepoint) at which to start the prediction
predict_end = n + n_predict; // index to end the prediction
}

parameters {
real x0; // first value
real drift; // drift
real<lower=-1, upper=1> phi; // controls to what extent the random walk reverts to the mean, see https://nwfsc-timeseries.github.io/atsa-labs/sec-stan-ar1.html

vector[predict_end - 1] state_error; // noise of state

real<lower=0> sigma_state; // standard deviation for state noise
real<lower=0> sigma_irreg; // standard deviation for irregular component

real y_miss[n_miss]; // vector of missing values

vector[6] seas_w; // weekly seasonality
vector[364] seas_y; // yearly seasonality

real h; // holiday parameter
}

transformed parameters {
vector[predict_end] estimation; // estimated outcome
vector[predict_end] state; // latent state of the time series model

// zero-mean seasonal terms
vector[7] seasonal_week = append_row(seas_w, -sum(seas_w));
vector[365] seasonal_year = append_row(seas_y, -sum(seas_y));

state[1] = x0; // first state

for(i in 2:predict_end) {
state[i] = phi * (state[i-1] - drift) + drift + sigma_state * state_error[i-1];
}

// the central model equation:
// the estimated outcome is the state + yearly/weekly seasonality + the holiday feature (only relevant if holiday != 0)
estimation = state + seasonal_week[wday] + seasonal_year[yday] + h * holiday;
}

model {
// create vector that contains observed and missing values
real y[n];
y[indices_obs] = y_obs;
y[indices_miss] = y_miss;
// missing values are assumed to be normally distributed
y_miss ~ normal(0, 1);

// prior distributions of parameters
x0 ~ normal(0, 1);
drift ~ normal(0, 1);
h ~ normal(0, 1);
sigma_state ~ exponential(2);
state_error ~ normal(0, 1);
sigma_irreg ~ exponential(2);

// likelihood
y ~ normal(estimation[1:n], sigma_irreg);
}

generated quantities {
vector[n] yhat; // posterior predictive to check whether the model has fitted the data
vector[n_predict] prediction; // predicted values

for(i in 1:n) {
yhat[i] = normal_rng(estimation[i], sigma_irreg);
}

for (i in predict_start:predict_end) {
// sample with a random generator, cf. with the likelihood
prediction[i - n] = normal_rng(estimation[i], sigma_irreg);
}
}

``````

I’m using `pystan` to compile this model and to hand over the data to the model:

``````model = pystan.StanModel(file = 'model.stan')
data = {'n_obs': number_observations,
'n_miss': len(indices_missing),
'indices_obs': indices_observations,
'indices_miss': indices_missing,
'holiday': pd.concat([df.holiday, df_prediction.holiday]),
'n_predict': days_to_predict,
'y_obs': df[df.yobs.notnull()].yobs
}
fit = model.sampling(data = data, verbose = True)
``````

#### Model fit

In general, the model captures the data quite well. To test the model on unknown data (and to simulate the actual production environment which requires to forecast the coming week), we have the the following training/testing pipeline:

• fit the model on data from January 2017 - December 2018
• predict first week of January 2019
• add first week of January 2019 to training set and re-fit the model
• predict second week of January 2019
• add second week of January 2019 to training set and re-fit the model
• predict third week of January 2019
• etc. until (almost) the end of 2019

This results in a year of predictions that we can compare to actually known data (but unknown to the model).
Here is a plot of actual data (unknown to the model) and the model’s mean outcome (i.e., the means of the `prediction` vector), including credible intervals as computed with `arviz` (`arviz.hpd(prediction_df.values, credible_interval=0.95)`).

Although this fit looks quite okay, the prediction error is still too high in my opinion. Here is a corresponding plot of the absolute prediction error:

#### Problem

Now, here’s the problem I need your help with:
We would like to use the model

• to predict/forecast the next week and
• to deliver an uncertainty about these predictions.

I thought that the (width of the) credible intervals of the `prediction` vector contains the uncertainty of the model, following this logic: A larger error should come with a larger model uncertainty, indicated by larger width of the credible intervals.
It turns out that this is not the case. Here is the same error plot from above together with the width of the CIs:

The CI width is relatively large in comparison to the error, however it shows way less variation. Most importantly, the varation in CI widths does not correlate with the magnitude of the error.
In an attempt to identify a potential time-dependent systemacity between error and CI widths, here are the 10 largest errors vs. CI widths:

These plots do look the same for ca. 50 similar datasets. I couldn’t find a dependency between CI widths and model errors (as conformed by close to zero correlation between these two variables).

So, I’m stuck here. My ideas would be to reduce the overall model error (as they indicate substantial trends in the data not captured by the model) by improving the model in general (e.g., adding a slope component, rethinking the 365-day yearly seasonality, …). I hope to see a better correlation of model error and CI width with a better fitting model. Another idea is to add a Kalman filter to the model, because I read this somewhere. However, I actually didn’t yet get the general idea of a Kalman filter to be able to tell if this is a meaningful idea at all …

Or am I actually missing a different point here? Potentially my model is misspecified somehow? Am I computing the wrong intervals? Is it actually possible to use the CIs as an indicator of prediction uncertainty? Any help is appreciated!

Thanks,
kartoffelsalat

1 Like

I don’t use pystan much but it would worth looking into what the default parameters are for the model run. I am guessing you may need to change the default behavior.

Hi, have you tried plotting invidual draws to see how your sample looks like?

Like 100 draws with `alpha=0.01`?

Do you mean parameters like number of iterations or `adapt_delta`? Not sure, to what extent changing some of these help with my problem. The model is converging well, at least judging from the Rhat values.

Not sure, if I understand you right. I just plotted the 1% credible intervals. These lie very close to the mean posterior of the prediction.

You have different ways to plot your time series sample. One common way to visualize is to calculate mean vector + credible intervals. This is a good way to see where are the “limits” of your time series.

Another way to is to plot multiple draws from the posterior and see how the autocorrelation works for your time series.

See for example from here https://distill.pub/2019/visual-exploration-gaussian-processes/, where one of the images shows the envelop+the individual draws. I usually like to see individual draws and see how they are behaving

Sorry, I’m still not fully understanding your suggestion. Let me try to be more concrete with my model to tell you what I understood:

I do have the `prediction` vector that forecasts for the coming week (from `predict_start` to `predict_end`). Each of the to-be-predicted-days contains 4000 samples. If I understood you right, you suggest to take a subset of these 4000 samples (e.g., samples no. 300-400) and plot the mean + CIs for these samples (CI with 0.01?). Repeating this for several sample subsets (i.e., creating a new plot for each subset) could reveal potential insights where the model behaves weird.

No. (soft no, hard to add tone ones writing)

So let’s assume you have your prediction data in `pred` and days in `t`

``````# now let's assume your prediction data for each fit is gathered in one big vector
pred = ...
# pred.shape = (samples, days_prediction)
# or edit this how you gather your samples

fig, ax = plt.subplot(1, figsize=(10,5), dpi=120)

# for sequantial selection use range(100)
# or select random choice with np.random.choice(np.arange(len(pred)), size=100, replace=False)
for i in range(100):
# get one draw (each subset of your prediction are "independent" so the
# the borders might look "funny"
y = pred[i, :]
ax.plot(t, y, alpha=0.05)``````

Ok, thanks for the code! Here are two pictures (full view and one zoomed in) of a plot generated like this (the data are from a different store; also, due to technical reasons (we didn’t store the whole `prediction` values) the plotted model outcome is now not the `prediction` variable from the Stan model but the `yhat` variable; this shouldn’t make too much difference, right?).

Not sure, to what extent this now helps with my problem.

Ok, looks good.

Now compare to hdp plot. Are they similar.

Now you can move forward and think more about your model. Are the predictions good or is there too much uncertainty in the predictions, are the predictions smooth enough?

Is there prior information you could include?