First order autoregressive model + noise, estimation fails with Stan

I’m having trouble with estimating a simple AR(1) + noise model with Stan:

set.seed(1)

x <- arima.sim(model = list(ar = 0.5), n = 500)
y <- 1 + rnorm(500, x, 0.5)

model <- "
data {
  int<lower=0> T;
  vector[T] y;
}

parameters {
  real alpha;
  real<lower=0, upper=1> rho;
  real<lower=0> sigma_y;
  real<lower=0> sigma_x;
  vector[T] x_raw;
}
 transformed parameters {
   vector[T] x;
   x[1] = sigma_x * inv(sqrt(1 - rho^2)) * x_raw[1];
   for(t in 2:T) {
    x[t] = rho * x[t - 1] + sigma_x * x_raw[t];
   }
 }
model {
  x_raw ~ std_normal();
  alpha ~ normal(0, 2);
  rho ~ beta(2, 2);
  sigma_y ~ normal(0, 2);
  sigma_x ~ normal(0, 2);
  y ~ normal(alpha + x, sigma_y);
} 

"

fit <- rstan::stan(
  model_code = model, 
  data = list(y = y, T = length(y)), 
  cores = 4, iter = 1e4)

I get low BFMI warnings and large Rhats, and some divergences:

Warning messages:
1: There were 2221 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. 
2: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low 
3: Examine the pairs() plot to diagnose sampling problems
 
4: The largest R-hat is 1.25, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat 
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess 
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess 
> print(fit, pars = c("x", "x_raw"), include = FALSE, use_cache = FALSE)
Inference for Stan model: anon_model.
4 chains, each with iter=10000; warmup=5000; thin=1; 
post-warmup draws per chain=5000, total post-warmup draws=20000.

         mean se_mean     sd    2.5%     25%    50%    75%  97.5% n_eff Rhat
alpha    1.00    0.01   0.09    0.85    0.94   1.00   1.07   1.17    45 1.08
rho      0.44    0.01   0.07    0.32    0.39   0.43   0.47   0.58   101 1.02
sigma_y  0.41    0.04   0.21    0.10    0.23   0.44   0.58   0.77    27 1.18
sigma_x  1.03    0.01   0.10    0.80    0.96   1.05   1.12   1.17    49 1.08
lp__    22.54   90.67 315.59 -374.18 -225.73 -87.70 241.85 631.77    12 1.31

Switching to centered parameterization does not help.

I know that as both the latent AR process and observations are Gaussian, I could easily marginalize x out using Kalman filter(*), and then sampling should work much better, but I’m still surprised that Stan struggles with such a simple model? Any tips on how get this working?

edit: Here is also the pairs plot from the noncentered model above:


And same from the centered version:

(*) Estimating the same model with bssm which in this case uses Kalman filter and random walk Metropolis works OK:

library(bssm)
bssm_model <- ar1_lg(
  y, 
  rho = uniform(init = 0.4, min = 0, max = 0.999), 
  sd_y = halfnormal(init = 1, sd = 2),
  sigma = halfnormal(init = 1, sd = 2),
  mu = 0, 
  xreg = matrix(1, nrow = length(y)),
  beta = normal(init = 0, mean = 0, sd = 2)
)
out <- replicate(4, 
  run_mcmc(bssm_model, iter = 1e5, burnin = 1e4), 
  simplify = FALSE)
draws <- posterior::bind_draws(
  lapply(out, function(x) as_draws(x, states = 0)),
  along = "chain")

draws |> posterior::summarise_draws()
# A tibble: 4 × 10
  variable  mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 rho      0.423  0.415 0.0689 0.0663 0.324  0.548  1.00    5085.    7031.
2 sigma    1.04   1.07  0.103  0.0978 0.849  1.18   1.00    3166.    4841.
3 sd_y     0.383  0.391 0.211  0.250  0.0411 0.717  1.00    1454.     873.
4 coef_1   1.01   1.01  0.0842 0.0835 0.876  1.15   1.00   22587.   31678.

edit: multiple chains with bssm, tidied the codes.

I think you might have an identifiability issue, close but not exactly like an additive degeneracy issue with sigma_y and sigma_x (see 5.2 in this @betanalpha case study Identity Crisis). I think the estimation learns that sigma_y + sigma_x sums to around 1.5 but has trouble splitting this 1.5 between the two sigmas because it does not have enough information to distinguish them. You see in the pairs plot a clearly negative and strong correlation between the two, although not exactly a straight line.

There is some information in that the innovations to the latent process are more persistent (because of the AR coefficient) while the sigma_y affects measurement errors which are not persistent. Indeed the estimation seems to converge to the true values even if the diagnostics complain all the way there. However, this information seems to not be enough here. In the limit, if the AR coefficient of the latent process x was 0, the two would be indistinguishable.

If this interpretation is correct, then either you bring in prior information or additional data that helps distinguish them.

I tried giving an extremely tight prior to sigma_y (I tried with normal(0.5, .01), just as an example), the diagnostics are fine, as this pins down the measurement error variability.

However, other things that I thought would help didn’t seem to. Generating a second series y2 based on the same x did not work. I though also that generating data with a higher persistence (0.9) would also fix the issue by helping pin down how persistent is the impact of sigma_x, but it doesn’t seem to be the case. Also playing with longer time series (1000) or providing just slightly tighter priors for sigma_y didn’t work sigma_y ~ normal(0, .5); sigma_x ~ normal(0, 1);

I feel like I’m close to grasping the issue but I’m not there yet. Please provide updates if you or someone else figures it out.