Hi there!
I am new to Stan (and time series) and I like the reasoning of probabilistic programming :-) However, I am having trouble estimating uncertainty for time series with multiplicative noise… Out of laziness I have asked this question earlier this week over at Cross Validated without a satisfying answer yet, so I try it here too.
Say we have a monthly time series y_t \geq 0 dominated by seasonality, where the absolute differences from year to year are much smaller during the low season. To avoid negative values and capture the higher uncertainty during high seasons I would suggest to fit such time series as
y_t = \alpha \cdot y_{t-12} \cdot \varepsilon_t
with e.g.
\varepsilon_t \sim \mathcal{N}(1,\, \sigma).
I would then argue that y_t \sim \mathcal{N}(\alpha \cdot y_{t-12}, \,\sigma \cdot \alpha \cdot y_{t-12}), leading to the following Pystan example with simulated data:
import numpy as np
import pystan
code = """
data {
int<lower=0> N;
vector[N] y;
}
parameters {
real alpha;
real<lower=0> sigma;
}
model {
y[13:N] ~ normal(alpha * y[1:N-12], sigma * alpha * y[1:N-12]);
}
"""
n_years = 4
y = (np.sin(np.linspace(0, n_years*np.pi, n_years*12))**2 + 0.05) * (1 + np.random.randn(n_years*12)/10)
season = pystan.StanModel(model_code=code)
fit = season.sampling(data={'N': len(y), 'y': y}, iter=1000, chains=4)
The resulting posterior for \alpha seems reasonable to me, but \sigma ends up around 0.2 while I used 0.1 to generate the data:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha 1.0 1.0e-3 0.04 0.93 0.98 1.0 1.02 1.07 1188 1.0
sigma 0.21 8.3e-4 0.03 0.17 0.19 0.21 0.23 0.28 1091 1.0
lp__ 73.14 0.04 1.05 70.32 72.74 73.47 73.88 74.14 891 1.0
Furthermore, if I now draw \alpha and \sigma and plot the fit, I get the following quartile bands:
Pretty much half of the data points lie between the first and the third quartile, but minimum and maximum during high season seem fairly extreme to me …
What is the reason for this large uncertainty during high season? What would be a good model for forecasting that kind of data?
Thank you!