Chains not mixing for Gegenbauer Time Series

Hi,

Upfront warning: I am definitely still learning about Stan and MCMC generally! My apologies in advance if I have missed something obvious here.

I am trying to fit a long memory Gegenbauer ARMA time series model with 3 parameters to some synthetic data whose parameter values I already know. If you are not familiar with Gegenbauer ARMA (GARMA) models, they can be defined as:
\phi(B)(1-2uB+B^{2})^{d}Y_{t}=\theta(B)\epsilon_{t}

In my case \theta(B)\equiv 1 and there is only 1 autoregressive parameter \phi. We define the Gegenbauer frequency f using the relation u=\cos(2\pi f).
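
For concreteness, here is a small R sketch of the recursion for the coefficients of (1-2uB+B^{2})^{-d} (it mirrors the loop in the Stan program further down; the name ggbr_coef is just for illustration):

ggbr_coef <- function(m, d, f) {
  # coefficients C_0, C_1, ... of (1 - 2uB + B^2)^(-d), with u = cos(2*pi*f)
  u <- cos(2 * pi * f)
  g <- numeric(m)
  g[1] <- 1
  g[2] <- 2 * u * d
  for (j in 3:m) {
    n <- j - 1
    g[j] <- (2 * u * (n + d - 1) * g[j - 1] - (n + 2 * d - 2) * g[j - 2]) / n
  }
  g
}

head(ggbr_coef(50, d = 0.22, f = 0.40))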

Long memory models in general have their own Wikipedia page, but as yet there is no Wikipedia page covering Gegenbauer models. (Probably I will need to correct that one day soon!)

I have built up my Stan program slowly, starting with a fractionally differenced white noise process and graduating from there to an ARFIMA(1,1,0) process (an ARIMA process with a fractional differencing parameter and one autoregressive parameter), both of which worked very well and recovered the parameter values I already knew.

For the Gegenbauer models, I have checked that the parameters can be estimated with reasonable accuracy via a number of frequentist methods, including Whittle estimation, so I am fairly confident of the true values for the process. For reference, the true values I used for the 3 parameters are d=0.22, f=0.40, \phi=0.80.
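
As a rough illustration (not my actual simulation script), a synthetic series with these values could be generated in R from the truncated expansion implied by the model below, along these lines:

garma_sim <- function(N, d, f, phi, sigma = 1) {
  # rough sketch: (1 - phi*B) y_t = (1 - 2uB + B^2)^(-d) e_t, truncated at lag t
  u <- cos(2 * pi * f)
  g <- numeric(N)
  g[1] <- 1
  g[2] <- 2 * u * d
  for (j in 3:N) {
    n <- j - 1
    g[j] <- (2 * u * (n + d - 1) * g[j - 1] - (n + 2 * d - 2) * g[j - 2]) / n
  }
  e <- rnorm(N, 0, sigma)
  y <- numeric(N)
  for (t in 1:N) {
    lm_part <- sum(g[1:t] * e[t:1])    # sum_{j>=0} g_j * e_{t-j}, truncated
    y[t] <- if (t == 1) lm_part else phi * y[t - 1] + lm_part
  }
  y
}

set.seed(1)
y_sim <- garma_sim(1000, d = 0.22, f = 0.40, phi = 0.80)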

What I am finding is that the traceplot indicates that, of the 4 chains, 2 are not mixing well for the f and \phi parameters. The pairs plot shows (I think) that 2 chains are producing samples centered on the correct values, while the other two are converging to some other values which are clearly not correct. For the \phi and f parameters, the pairs plot shows a multimodal distribution in any case.

However, if I switch to using only 2 chains, those two chains mix fairly well and the results I get are fairly close to the results I would expect to see (d is a little high at 0.30).

I assume there is a good reason for using 4 chains rather than 2, and in any case I had thought that increasing the number of chains just meant you could explore the probability mass more effectively and thoroughly than with fewer chains. So I am a bit puzzled as to why this might happen, and more concerned about what to do about it.

I have been trying to read as much as I can about this problem, but I have not been able to resolve it by increasing the number of warmup samples, or by increasing the max_treedepth or adapt_delta parameters, etc.
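
For reference, a sketch of how those settings are passed via rstan (the file name, data names and specific values here are placeholders rather than my exact run):

library(rstan)

fit <- stan(
  file    = "garma.stan",                  # placeholder file name
  data    = list(N = length(y), y = y),
  chains  = 4,
  iter    = 4000,
  warmup  = 2000,                          # more warmup than the default
  control = list(adapt_delta   = 0.99,     # push towards 1 to reduce divergences
                 max_treedepth = 14)       # raise if treedepth warnings appear
)

traceplot(fit, pars = c("phi", "d", "f"))
pairs(fit, pars = c("phi", "d", "f"))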

All the files required to reproduce the problem (the .stan file, .R data files, etc.) are available in my GitHub account.

If anyone is able to provide any help with this problem I would be very appreciative!

Thanks,
Richard

(Traceplot image attached: fit1_traceplot.jpeg)


data {
  int<lower=1> N;            // num observations
  real y[N];                 // observed outputs
}
parameters {
  real<lower=-1,upper=1> phi;                  // autoregression coeff
  real<lower=0,upper=0.5> d;
  real<lower=0,upper=0.5> f;
}
model {
  vector[N] nu;
  vector[N] err;
  real g[N+1];
  real u;
  
  u=cos(2.0*pi()*f);
  
  g[1] = 1.0;                // this and next few lines to derive the Ggbr Coefficients.
  g[2] = 2.0*u*d;
  for (j in 3:(N+1)) {
    // Gegenbauer polynomial recursion: C_n = (2u(n+d-1)*C_{n-1} - (n+2d-2)*C_{n-2}) / n
    real rj1;
    rj1 = j-1;
    g[j]=(2.0*u*(rj1+d-1.0)*g[j-1]-(rj1+2.0*d-2.0)*g[j-2])/rj1;
  }

  nu[1] = 0.0;
  err[1] = y[1] - nu[1];
  for (t in 2:N) {
    real sum_g;
    sum_g = 0.0;
    for (j in 1:(t-1)) sum_g += g[j+1]*err[t-j];

    nu[t] = phi * y[t-1] + sum_g;
    err[t] = y[t] - nu[t];
  }
  phi ~ uniform(-1.0, 1.0);
  d   ~ uniform(0.0, 0.5);
  f   ~ uniform(0.0, 0.5);
  err ~ normal(0.0, 1.0);
}

I’d guess your f is not sufficiently constrained, as it wouldn’t make sense for a frequency to go to zero (which is nearly what happens in some chains). What does a simulation look like when you run with those values (f near 0)? If it isn’t reasonable then changing limits or prior will help.


Also I think the general advice is to start with normal prior instead of constrained support (lower, upper) + uniform prior as the former (normal) behaves better during sampling.


@maedoc many thankyous for your reply!

The first thing to point out is that if f is 0, then this Gegenbauer case reverts to a “long memory” case. When I used the weights for that case (assuming a pure long memory process), I found that the chains mix very well and that I can recover the parameter values in a fairly straightforward manner, using the current priors.
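
To spell that out: at f=0 we have u=\cos(0)=1, so
(1-2uB+B^{2})^{d}=\big((1-B)^{2}\big)^{d}=(1-B)^{2d}
and the model reduces to \phi(B)(1-B)^{2d}Y_{t}=\epsilon_{t}, i.e. ordinary fractional differencing (an ARFIMA-type model) with memory parameter 2d.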

I have been taking your advice to use normal rather than uniform priors, and to remove the constrained support for the parameters, as below:

    phi ~ normal(0.0, 0.1);
    d   ~ normal(0.0, 0.1);
    f   ~ normal(0.0, 0.1);
    err ~ normal(0.0, 1.0);

I suppose the point with these priors would be to assume that there is no long memory, and/or no auto-regressive component and then let the estimation process tell us otherwise.

Unfortunately I have run into a lot of trouble with treedepth and adapt_delta warnings and have had to increase those settings several times to overcome the problems. My current settings are max_treedepth=16, adapt_delta=0.999.

The traceplot and the pairs plot are reproduced below. Are there any other hints you can point me towards, please?

Thanks,
Richard

What are some possible values of f given the model? My instinct is to use something like normal(0.5, 0.5) if the values might fall between 0 and 1.

In my experience, a multimodal posterior plus tight priors means your model is not as strongly identified by your (simulated?) data as you expect. If you are accustomed to forward modeling, this can be a little counterintuitive, but consider, for example, that the true f=0.4 is 4 standard deviations away from your prior. At this point you run into a noise floor inducing a multimodal posterior (this is at least my mental analogy).

Since you are trying to recover parameters, I would suggest parameterizing your recovery by the effective confidence interval whereby your true value is described by your prior. Concretely, start with something easy, like a prior 1 standard deviation off (f ~ normal(0.3, 0.1)), perhaps in one parameter instead of many. When that works, work your way up.

Conversely, the onus is on you to check that, for each of these posterior modes, your model isn't actually generating the same prediction. If it is, then you have a symmetry to break, which requires a reparameterization of the model or better data.
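
As a rough sketch of what I mean (assuming fit is your stanfit object and that each chain is sitting in a single mode), you could compare the first few coefficients of the MA expansion implied by each chain's posterior mean:

library(rstan)

# posterior draws as iterations x chains x parameters, then per-chain means
draws <- extract(fit, pars = c("phi", "d", "f"), permuted = FALSE)
chain_means <- apply(draws, c(2, 3), mean)

implied_psi <- function(phi, d, f, m = 5) {
  # first m+1 coefficients of (1 - phi*B)^(-1) (1 - 2uB + B^2)^(-d)
  u <- cos(2 * pi * f)
  g <- numeric(m + 1)                       # Gegenbauer coefficients C_0..C_m
  g[1] <- 1
  g[2] <- 2 * u * d
  for (j in 3:(m + 1)) {
    n <- j - 1
    g[j] <- (2 * u * (n + d - 1) * g[j - 1] - (n + 2 * d - 2) * g[j - 2]) / n
  }
  ar <- phi^(0:m)                           # coefficients of (1 - phi*B)^(-1)
  sapply(0:m, function(k) sum(ar[1:(k + 1)] * rev(g[1:(k + 1)])))
}

# one row of implied coefficients per chain; very different rows mean
# the modes really are different processes, similar rows mean a symmetry
t(apply(chain_means, 1, function(p) implied_psi(p["phi"], p["d"], p["f"])))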


To make the noise floor analogy concrete, consider 100 samples from normal(0,1) which you try to fit with a normal(4,1): you'd be lucky to have even one or two samples in the interval 3.5 to 4.5, so the joint would just be multimodal, with modes at the few samples in that interval.
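
A quick numerical check of that:

set.seed(42)
x <- rnorm(100)                      # 100 draws from normal(0, 1)
sum(x >= 3.5 & x <= 4.5)             # almost always 0
100 * (pnorm(4.5) - pnorm(3.5))      # expected count, roughly 0.02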

I’m sure there’s a well known description of this that I’m unaware of but in my domain we’d call it a noise floor. In my home state we call it being out in the boonies, living in the sticks…

@maedoc Thank you very much for this advice. I think I have to agree with your comment that

the multimodal posterior and tight priors means your model is not as strongly identified by your (simulated?) data as you expect

What I have been doing is (a) using Gaussian (normal) priors and (b) restricting the range of allowable values. I can, I think, restrict d to (0, 0.5) and f to (0, 0.5), but \phi really should be (-1, 1). Unfortunately, using that range results in a multimodal distribution again; only by restricting \phi to (0, 1) can I obtain the single distribution I would expect.

What I don't know is why the solution is not uniquely identified, but I think that is my problem, not yours. To explain what I mean: if I do some Taylor series and Gegenbauer series expansions, I get, for the “correct” solution:
Y_t = (1-1.16 B + 0.48 B^2 + ...) \epsilon_t
and for the second solution I get
Y_t = (1-0.43 B + 0.24 B^2 + ...) \epsilon_t

Now these are quite different models, I think, so either I've made a mistake in my expansions or something else is going on.

In any case, my sincere thanks for your assistance (and of course if anything else occurs to you please do not hesitate to contact me!)

Regards
Richard

@ara_winter - f can reasonably be restricted to be between 0 and 1/2. At f=0, the model collapses to form a more standard ‘long memory’ or ARFIMA model.

I have made some progress, which you might note in my response to maedoc.

But if you have anything to add, I would appreciate knowing it!

Many thanks,
Richard

I think you have to go a little further than just a Taylor series. The model is assigning similar log prob values to the different modes (or you wouldn't be getting multiple modes), so maybe the time series is too short? You should compare the posterior predictive density with the data here.
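
A rough way to check the "similar log prob" part (assuming fit is the stanfit object and y is the observed series, and re-using the same one-step-ahead recursion as the Stan model; garma_loglik is just an illustrative name):

library(rstan)

garma_loglik <- function(y, phi, d, f) {
  # re-implements the Stan model's one-step-ahead predictions in R
  N <- length(y)
  u <- cos(2 * pi * f)
  g <- numeric(N + 1)
  g[1] <- 1
  g[2] <- 2 * u * d
  for (j in 3:(N + 1)) {
    n <- j - 1
    g[j] <- (2 * u * (n + d - 1) * g[j - 1] - (n + 2 * d - 2) * g[j - 2]) / n
  }
  err <- numeric(N)
  err[1] <- y[1]
  for (t in 2:N) {
    nu <- phi * y[t - 1] + sum(g[2:t] * err[(t - 1):1])
    err[t] <- y[t] - nu
  }
  sum(dnorm(err, 0, 1, log = TRUE))        # Gaussian log likelihood of the residuals
}

# evaluate the log likelihood at each chain's posterior mean
draws <- extract(fit, pars = c("phi", "d", "f"), permuted = FALSE)
chain_means <- apply(draws, c(2, 3), mean)
apply(chain_means, 1, function(p) garma_loglik(y, p["phi"], p["d"], p["f"]))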