Is it just very hard to estimate ARMA models?

I mostly want a sanity check here that I’m not doing something stupid. The tl;dr is that I’m surprised, when doing simulations for ARMA-style models, that you need a pretty large N to recover the parameters in a meaningful way (for both stats::arima and Stan via brms), and I’m wondering whether better priors or something else would help recover the parameters at smaller N.

Trying to be a good boy, I wanted to sanity check that I can recover the parameters from an ARMA(2, 2) model with stats::arima and brms before I do anything more complicated.

set.seed(1234)
# simulate an ARMA(2, 2) series with known coefficients
arma_mod = list(ar = c(0.3, -0.3), ma = c(0.1, -0.2), order = c(2, 0, 2))
arma_22 = arima.sim(n = 500, model = arma_mod, mean = 0, sd = 1)
# try using arima(2, 0, 2) for a sanity check
arma_fit = stats::arima(arma_22, order = c(2, 0, 2), include.mean = TRUE, 
  transform.pars = TRUE, method = "ML",
  optim.method = "BFGS",
  optim.control = list(maxit = 1000))
arma_fit
# Call:
# stats::arima(x = arma_22, order = c(2, 0, 2), include.mean = TRUE, transform.pars = TRUE, 
#     method = "ML", optim.method = "BFGS", optim.control = list(maxit = 1000))
# 
# Coefficients:
#          ar1      ar2     ma1      ma2  intercept
#       0.2981  -0.2711  0.0339  -0.2706     0.0683
# s.e.  0.0979   0.0857  0.0994   0.0985     0.0342

So like, not great, but if I squint at the parameters together with the standard errors I’m like “okay, this is sort of recovering these parameters, fine fine fine”.
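
To make the squinting concrete, here’s the comparison I’m doing in my head, using the arma_fit object from above (the “true” values are just the simulation settings):

# compare true values, point estimates, and rough +/- 2 SE intervals
true <- c(ar1 = 0.3, ar2 = -0.3, ma1 = 0.1, ma2 = -0.2, intercept = 0)
est  <- coef(arma_fit)
se   <- sqrt(diag(arma_fit$var.coef))
round(cbind(true, est, lower = est - 2 * se, upper = est + 2 * se), 3)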

Now I put this into brms like so:

arma_df = data.frame(y = arma_22)
mod1 = bf(y ~ arma(p = 2, q = 2))
# get_prior(mod1, arma_df)
prior1 = c(set_prior("normal(0, .5)", class = "ar"),
           set_prior("normal(0, .5)", class = "ma"),
           set_prior("normal(0, .5)", class = "sigma"),
           set_prior("normal(0, .5)", class = "Intercept"))

# not setting init to 0 causes some bad R-hat values
fit1 = brm(mod1, data = arma_df, prior = prior1, iter = 2000, cores = 4, init = 0)
summary(fit1)
#  Family: gaussian 
#   Links: mu = identity; sigma = identity 
# Formula: y ~ arma(p = 2, q = 2) 
#    Data: arma_df (Number of observations: 1000) 
# Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#          total post-warmup samples = 4000
# 
# Correlation Structures:
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# ar[1]    -0.12      0.28    -0.65     0.46 1.00     1790     2128
# ar[2]    -0.02      0.26    -0.54     0.48 1.00     1484     1697
# ma[1]     0.13      0.28    -0.45     0.66 1.00     1792     2050
# ma[2]     0.07      0.26    -0.45     0.58 1.00     1488     1643
# 
# Population-Level Effects: 
#           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept    -0.00      0.03    -0.06     0.06 1.00     3195     2217
# 
# Family Specific Parameters: 
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     0.95      0.02     0.91     1.00 1.00     2898     2132
# 
# Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).
# Warning message:
# There were 18 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 

So ya know, that doesn’t seem great? All the true parameters sit within the credible intervals, but gosh those intervals are pretty wide!

I figured it might just need more observations, so I kept turning up the number of generated points until stats::arima started giving answers that seemed a little nicer. At around 1500 data points stats::arima gets closer to recovering the parameters, and so does brms.
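
Roughly, this is the kind of loop I was running (just a sketch; the Ns are arbitrary and I was eyeballing when the standard errors got small relative to the coefficients):

# refit stats::arima at increasing sample sizes and watch the
# estimates / standard errors tighten (Ns chosen arbitrarily)
set.seed(1234)
Ns <- c(250, 500, 1000, 1500)
res <- lapply(Ns, function(n) {
  y   <- arima.sim(n = n, model = arma_mod, sd = 1)
  fit <- stats::arima(y, order = c(2, 0, 2), include.mean = FALSE, method = "ML")
  data.frame(n = n, coef = names(coef(fit)), est = coef(fit),
             se = sqrt(diag(fit$var.coef)))
})
do.call(rbind, res)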

set.seed(1234)
# simulate a longer ARMA(2, 2) series with the same coefficients
arma_mod = list(ar = c(0.3, -0.3), ma = c(0.1, -0.2), order = c(2, 0, 2))
arma_22 = arima.sim(n = 1500, model = arma_mod, mean = 0, sd = 1)
# try using arima(2, 0, 2) for a sanity check
arma_fit = stats::arima(arma_22, order = c(2, 0, 2), include.mean = FALSE,
  transform.pars = TRUE, method = "ML",
  optim.method = "BFGS",
  optim.control = list(maxit = 1000))
arma_fit
# Call:
#   stats::arima(x = arma_22, order = c(2, 0, 2), include.mean = FALSE, transform.pars = TRUE, 
#                method = "ML", optim.method = "BFGS", optim.control = list(maxit = 1000))
# 
# Coefficients:
#          ar1      ar2     ma1      ma2
#       0.2775  -0.2523  0.1174  -0.2820
# s.e.  0.0604   0.0470  0.0607   0.0577
# 
# sigma^2 estimated as 0.9862:  log likelihood = -2118.3,  aic = 4246.6

arma_df = data.frame(y = arma_22)
mod1 = bf(y ~ 0 + arma(p = 2, q = 2))
# get_prior(mod1, arma_df)
prior1 = c(set_prior("normal(0, .5)", class = "ar"),
           set_prior("normal(0, .5)", class = "ma"),
           set_prior("normal(0, .5)", class = "sigma"))

# not setting init to 0 causes some bad R-hat values
fit1 = brm(mod1, data = arma_df, prior = prior1, iter = 2000, cores = 4, init = 0)
summary(fit1)
# Family: gaussian 
# Links: mu = identity; sigma = identity 
# Formula: y ~ 0 + arma(p = 2, q = 2) 
# Data: arma_df (Number of observations: 1500) 
# Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
# total post-warmup samples = 4000
# 
# Correlation Structures:
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# ar[1]     0.27      0.06     0.14     0.38 1.00     2133     1962
# ar[2]    -0.26      0.05    -0.35    -0.17 1.00     2075     2360
# ma[1]     0.13      0.06     0.01     0.26 1.00     2082     1948
# ma[2]    -0.27      0.06    -0.38    -0.15 1.00     1782     2008
# 
# Family Specific Parameters: 
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     0.99      0.02     0.96     1.03 1.00     3337     2406
# 
# Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).

So that looks a bit nicer: the Stan results are at least close to stats::arima, there are no divergent transition issues, etc.

But that seems like a lot of data points to recover the model? 1500 observations is about four years and change of daily observations. Are my priors too wide? Do I need to think about better starting positions for the parameters? One thing that does seem to help is using dense_e as the metric instead of diag_e. For N = 500, using dense_e recovers parameters similar to what stats::arima brings back:

# running the above with n = 500
arma_fit
# Call:
#   stats::arima(x = arma_22, order = c(2, 0, 2), include.mean = FALSE, transform.pars = TRUE, 
#                method = "ML", optim.method = "BFGS", optim.control = list(maxit = 1000))
# 
# Coefficients:
#          ar1      ar2      ma1      ma2
#       0.4161  -0.2276  -0.0593  -0.3134
# s.e.  0.0956   0.0851   0.0942   0.0918
# 
# sigma^2 estimated as 1.138:  log likelihood = -742.06,  aic = 1494.13

fit1 = brm(mod1, data = arma_df, prior = prior1, iter = 2000, cores = 4, init = 0,
  control = list(metric = "dense_e"))
summary(fit1)
# Family: gaussian 
# Links: mu = identity; sigma = identity 
# Formula: y ~ 0 + arma(p = 2, q = 2) 
# Data: arma_df (Number of observations: 500) 
# Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
# total post-warmup samples = 4000
# 
# Correlation Structures:
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# ar[1]     0.38      0.10     0.19     0.57 1.00     4398     2875
# ar[2]    -0.24      0.08    -0.40    -0.07 1.00     4720     2812
# ma[1]    -0.03      0.10    -0.21     0.17 1.00     4060     2855
# ma[2]    -0.28      0.09    -0.45    -0.09 1.00     4885     3112
# 
# Family Specific Parameters: 
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     1.07      0.03     1.01     1.14 1.00     6063     3127
# 
# Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).

Are there some other tricks for recovering parameters in ARMA models that I should be aware of? The generated Stan code from brms is below:

# gen code and data for brms model
make_stancode(mod1, data = arma_df, prior = prior1)
make_standata(mod1, data = arma_df)
// generated with brms 2.14.11
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data needed for ARMA correlations
  int<lower=0> Kar;  // AR order
  int<lower=0> Kma;  // MA order
  // number of lags per observation
  int<lower=0> J_lag[N];
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int max_lag = max(Kar, Kma);
}
parameters {
  vector[Kar] ar;  // autoregressive coefficients
  vector[Kma] ma;  // moving-average coefficients
  real<lower=0> sigma;  // residual SD
}
transformed parameters {
}
model {
  // likelihood including constants
  if (!prior_only) {
    // matrix storing lagged residuals
    matrix[N, max_lag] Err = rep_matrix(0, N, max_lag);
    vector[N] err;  // actual residuals
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    // include ARMA terms
    for (n in 1:N) {
      mu[n] += Err[n, 1:Kma] * ma;
      err[n] = Y[n] - mu[n];
      for (i in 1:J_lag[n]) {
        Err[n + 1, i] = err[n + 1 - i];
      }
      mu[n] += Err[n, 1:Kar] * ar;
    }
    target += normal_lpdf(Y | mu, sigma);
  }
  // priors including constants
  target += normal_lpdf(ar | 0, .5)
    - 1 * log_diff_exp(normal_lcdf(1 | 0, .5), normal_lcdf(-1 | 0, .5));
  target += normal_lpdf(ma | 0, .5)
    - 1 * log_diff_exp(normal_lcdf(1 | 0, .5), normal_lcdf(-1 | 0, .5));
  target += normal_lpdf(sigma | 0, .5)
    - 1 * normal_lccdf(0 | 0, .5);
}
generated quantities {
}

I’m guessing the parameters are just highly correlated with each other. You’re probably recovering the combined output of the parameters (i.e., the predictions) just fine.
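
Something like this should show it (assuming the posterior draws are named ar[1], …, ma[2], as in the generated Stan code above):

# posterior correlations among the ARMA coefficients
draws <- as.matrix(fit1)
round(cor(draws[, c("ar[1]", "ar[2]", "ma[1]", "ma[2]")]), 2)
# and a quick check that the predictions themselves look fine
pp_check(fit1)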


Yeah, the predictions look fine, so I think this could be the issue.

Yes, but if the parameters themselves are of interest, that’s still a problem. (Even if not, chances are you can do better.)

I’d recommend Minnesota priors, a class of priors designed to encode reasonable prior information about time series, e.g. that autoregressive parameters on longer lags will usually be smaller than those on shorter lags, and that the sum of the parameters will usually be close to, but not quite equal to, 1. This handbook includes a good introduction.
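
To sketch the flavor of it (the lambda and decay values below are arbitrary tuning choices, and as far as I know brms doesn’t expose per-lag priors on the ar class, so scales like these would have to be baked into the Stan code from make_stancode()):

# Minnesota-style shrinkage: prior sd on lag l is lambda / l^d, so longer
# lags get pulled harder toward zero (lambda and d are tuning choices)
lambda <- 0.5
d <- 1
p <- 2
data.frame(lag = 1:p, prior_sd = lambda / (1:p)^d)
# gives prior sds of 0.50 for lag 1 and 0.25 for lag 2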


Are you using N(0, .5) for the ar parameters? Note that if the absolute value is close to 1, convergence is hard; you can use a uniform U(-1, 1) or a rescaled Beta (2 * Beta(a, b) - 1) instead.
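
For intuition, here is what those two suggestions look like as priors on an AR coefficient (the Beta(2, 2) shape is just an arbitrary example):

# flat prior vs. a rescaled Beta: both respect the (-1, 1) bounds,
# but the Beta concentrates mass away from the boundary
hist(runif(1e4, -1, 1), breaks = 50)          # uniform on (-1, 1)
hist(2 * rbeta(1e4, 2, 2) - 1, breaks = 50)   # 2 * Beta(2, 2) - 1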

This isn’t directly related to prior choice, but my understanding of time series (and stochastic processes more generally) is that if a process has a long characteristic length scale (i.e. high autocorrelation), having to observe it for a while to infer the parameters accurately seems unavoidable.
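
A back-of-the-envelope way to see this is the standard effective-sample-size formula for estimating the mean of an AR(1) process (just for intuition, not specific to the ARMA(2, 2) above):

# effective number of independent observations for the mean of an AR(1)
# process with coefficient phi
ess_ar1 <- function(N, phi) N * (1 - phi) / (1 + phi)
round(ess_ar1(500, c(0.3, 0.7, 0.9)))
# roughly 269, 88, and 26 effective observations out of 500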