Chain convergence of complex nonlinear models highly dependent on initial values

Hi everyone,

I am fitting a complex nonlinear financial model known as the Log-Periodic Power Law Singularity (LPPLS) model, and I have found that successful chain convergence is highly dependent on the initial values of the chains. With a lucky choice of initial values, the multiple chains converge successfully and sample relatively fast. However, a different set of initial values, or random initial values, mostly leads to slow sampling and non-convergence of the chains. Even the same lucky initial values that work for one dataset fail to produce convergence when applied to a different dataset.

I suspect this is not a sampling issue specific to the model I am using, but a general issue for complex nonlinear models. For such models, the likelihood often exhibits multiple local maxima in parameter space, and hence the log posterior density to be evaluated has multiple modes as well. The sampling can therefore be very sensitive to the initial values: given different sets of initial values, the different chains may each get stuck around one or more local maxima (I am not sure whether a single chain can get stuck around more than one local maximum, and if it can, whether that is why we see a multimodal posterior distribution, assuming no prior-data conflict).

I wonder how we should deal with such an issue, which seems to be quite common for complex nonlinear models. In frequentist inference for such models, a cost function (from an ML or LS estimator) with multiple local minima is often attacked by running an iterative optimizer from several different sets of initial values, which similarly produces different locally optimal solutions. One seemingly common approach is then to select, among these local solutions, the one with the smallest cost as the best estimate. This sounds reasonable to me provided the number of starting points is reasonably large (say, more than 5), even though we may not obtain the globally optimal solution.
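
To make concrete what I mean by the frequentist multi-start approach, here is a rough Python/SciPy sketch of a least-squares calibration of the LPPLS equation with random restarts. The placeholder data, the ranges for the random starts, and the optimizer settings are only illustrative, not what I actually use:

import numpy as np
from scipy.optimize import minimize

# placeholder data; in practice t and log_p come from the actual price series
t = np.arange(1.0, 101.0)
log_p = np.log(50 + 0.1 * t)

def sse(theta):
    # LPPLS sum of squared errors; theta = (A, B, C1, C2, tc, m, omega)
    A, B, C1, C2, tc, m, omega = theta
    if tc <= t.max() or not (0.0 < m < 1.0):   # keep t_c - t positive and m in (0, 1)
        return 1e12
    dt = tc - t
    fit = (A + B * dt**m
             + C1 * dt**m * np.cos(omega * np.log(dt))
             + C2 * dt**m * np.sin(omega * np.log(dt)))
    return np.sum((log_p - fit) ** 2)

rng = np.random.default_rng(1)
best = None
for _ in range(20):                            # 20 random restarts
    x0 = np.concatenate([rng.normal([4.0, -0.5, 0.0, 0.0], 0.5),                # A, B, C1, C2
                         rng.uniform([101.0, 0.1, 5.0], [200.0, 0.9, 15.0])])   # tc, m, omega
    res = minimize(sse, x0, method="Nelder-Mead", options={"maxiter": 20000})
    if best is None or res.fun < best.fun:
        best = res                             # keep the local solution with the smallest cost

print("best parameters:", best.x, "cost:", best.fun)

The point is simply that each restart can land in a different local minimum, and the smallest-cost solution among them is kept.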

I wonder whether, in the Bayesian paradigm, we should similarly select the best chain out of these multiple non-converged chains, each of which roughly represents a different locally optimal solution. If so, should the selection criterion be the chain with the maximum log posterior density compared to the other parallel chains?

I would really appreciate it if anyone could give me some insights!

Alan

For your information, the LPPLS model is written as (the parameters are A, B, C_1, C_2, m, t_c and \omega):

\ln p_t = A + B(t_c - t)^m + C_1 (t_c - t)^m \cos\big(\omega \ln(t_c - t)\big) + C_2 (t_c - t)^m \sin\big(\omega \ln(t_c - t)\big) + \varepsilon_t ,

and the Stan code is:

data {
  int<lower=0> N;           // number of data/trading days
  vector[N] t;              // time index of trading days
  vector<lower=0>[N] p;     // asset price
}

parameters {
  real<lower=0> A;         
  real<upper=0> B;
  real<lower=-1, upper=1> C1;
  real<lower=-1, upper=1> C2;
  real<lower=max(t)+1> tc;
  real<lower=0, upper=1> m;
  real<lower=0> omega;
  real<lower=0> sigma;  // standard deviation of the noise of log price
}

model {
  // priors
  A ~ normal(0, 5);
  B ~ normal(0, 2.5);
  C1 ~ normal(0, 0.5);
  C2 ~ normal(0, 0.5);
  tc ~ normal(100, 50);
  m ~ normal(0.5, 0.2);
  omega ~ normal(10, 3);
  sigma ~ normal(0, 2.5);
  
  // likelihood
  for (n in 1:N) {
    real dtm = (tc - t[n])^m;               // power-law term
    real phase = omega * log(tc - t[n]);    // log-periodic phase
    p[n] ~ lognormal(A + B * dtm + C1 * dtm * cos(phase) + C2 * dtm * sin(phase), sigma);
  }
}
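
To show how I am experimenting with initial values, below is a minimal CmdStanPy-style driver that passes one initial-value dictionary per chain. CmdStanPy is used here just for illustration, and the file name, the placeholder data, and the specific starting values are not my "lucky" values:

import numpy as np
from cmdstanpy import CmdStanModel

# placeholder data; in practice t and p are the actual trading days and prices
t = np.arange(1.0, 101.0)
p = 50 + 0.1 * t
stan_data = {"N": len(t), "t": t.tolist(), "p": p.tolist()}

model = CmdStanModel(stan_file="lppls.stan")   # the Stan program above

# one initial-value dictionary per chain; the numbers are only illustrative
inits = [
    {"A": 4.0, "B": -0.5, "C1": 0.0, "C2": 0.0, "tc": 120.0, "m": 0.5, "omega": 8.0,  "sigma": 0.1},
    {"A": 4.0, "B": -0.5, "C1": 0.0, "C2": 0.0, "tc": 150.0, "m": 0.5, "omega": 12.0, "sigma": 0.1},
    {"A": 4.0, "B": -0.5, "C1": 0.0, "C2": 0.0, "tc": 180.0, "m": 0.5, "omega": 6.0,  "sigma": 0.1},
    {"A": 4.0, "B": -0.5, "C1": 0.0, "C2": 0.0, "tc": 200.0, "m": 0.5, "omega": 15.0, "sigma": 0.1},
]

fit = model.sample(data=stan_data, chains=4, inits=inits, seed=1234)
print(fit.summary())    # R-hat and ESS across chains
print(fit.diagnose())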

Yeah, better defaults and default behavior are being discussed. Check section 11 here: Bayesian Workflow « Statistical Modeling, Causal Inference, and Social Science or [2006.12335] Stacking for Non-mixing Bayesian Computations: The Curse and Blessing of Multimodal Posteriors for a couple of different takes on it.

I don’t think there’s anything specific being coded right now – so these things are out there but you’ll need to implement them yourself. Picking chains by density is one choice (with the downside that high density isn’t necessarily what you want – this is probably discussed in that second link but I forget offhand).

I suspect that this is going to be an issue with all models with a periodic component that has a strong representation in the likelihood. I posted over in the slack some explorations I’ve done showing that even a ridiculously simple periodic model has dramatic multimodality in the likelihood.
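
The gist of those explorations is easy to reproduce with a toy example. For instance (a rough numpy sketch of the kind of thing I mean, not the exact code from Slack), profiling the Gaussian log-likelihood of y = sin(omega * t) + noise over the frequency alone already produces a whole comb of local maxima:

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
y = np.sin(3.0 * t) + rng.normal(0, 0.3, t.size)   # true frequency = 3

omegas = np.linspace(0.5, 10, 2000)
# Gaussian log-likelihood (up to a constant) as a function of the frequency alone
loglik = np.array([-0.5 * np.sum((y - np.sin(w * t)) ** 2) / 0.3**2 for w in omegas])

# count local maxima of the likelihood curve
is_peak = (loglik[1:-1] > loglik[:-2]) & (loglik[1:-1] > loglik[2:])
print("number of local maxima in omega:", is_peak.sum())
print("best omega:", omegas[np.argmax(loglik)])

A chain initialized near one of the side peaks can happily stay there.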


Hi Mike,

Thanks for your reply, but I can’t access your posts in your slack workspace. Do you mind sharing an accessible link here? Thank you!

Try this: https://join.slack.com/t/mc-stan/shared_invite/enQtMzAyNzg1ODQ5MDczLTc1M2Q1YzM4ZjY5MzRjMGFlNDcyYzRhOGYxNTRlZjRlZjI2YzYxZjYyMDRlNDYzOTY5YzU5MTgzM2JlZjAxNTk

Hi Ben:

Thank you very much for your reply. I will read through the papers you shared.

Hi Ben:

Running as many chains as we can and then picking the chain that gives the highest posterior density may be much easier to implement in Stan than stacking these chains with weights obtained by, say, importance sampling.
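
Concretely, I am picturing something as simple as the following (a CmdStanPy-based sketch for illustration; ranking chains by their mean lp__ is just one possible criterion):

import numpy as np

# fit is a CmdStanMCMC object from model.sample(...); draws() returns an
# array of shape (n_draws, n_chains, n_columns) when chains are not concatenated
draws = fit.draws(inc_warmup=False)
lp_idx = fit.column_names.index("lp__")

mean_lp = draws[:, :, lp_idx].mean(axis=0)     # average log posterior density per chain
best_chain = int(np.argmax(mean_lp))
print("mean lp__ by chain:", mean_lp, "-> keeping chain", best_chain)

best_draws = draws[:, best_chain, :]           # draws from the selected chain only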

I am assuming here that picking the highest-density chain out of a sufficiently large number of chains, by and large, lets us treat the mode of that chain as the best point estimate of each parameter (though not necessarily the globally best one). The issue with this approach, however, is that the interval/uncertainty estimate from the selected chain can seriously underestimate the true uncertainty of the parameter, since we never recover the full posterior distribution for interval estimation.

Please correct me if I have misunderstood anything here. Could you also share your thoughts on how we can properly obtain an uncertainty estimate with this chain-selection approach?

Many thanks,
Alan

Hi Mike:

Thanks for the Slack invitation. I’ve joined the Stan Slack workspace but don’t know where to find your relevant posts.

Alan

You should now be able to click the link I included in my response above :)

The link somehow directs me to the random channel.

Yeah that’s sorta the problem you run into here. Some problems just have sketchy posteriors and there’s a set of Bayesian models where you can be happy with the output even with the knowledge that mathematically you’re not really getting correct expectations (things like LDA fit in here).

Models under development also fit in here (the workflow paper talks about this) – like all sorts of crazy stuff happens in model development and you can’t just make decisions based on models with Rhat = 1.0 and large ESS and whatnot, even if eventually you want a model that behaves like that.

It’s definitely true that once we spend time working with a model we might do initialization differently to dodge these problems.

I’ve not thought much about any of this. Hit up @yuling . Read his papers and ask him questions, the harder the better (https://statmodeling.stat.columbia.edu/2021/01/25/hierarchical-stacking/).

Even “running as many chains as we can and then picking a chain that gives the highest posterior density” might not be trivial, as you need to distinguish between high-but-narrow modes and low-but-wide modes. In the paper Ben pointed to, chain stacking is nearly automated, and you get weighted posterior draws that give you the mean and sd of all parameters.
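
For anyone curious what the weight computation looks like mechanically, here is a bare-bones sketch of the stacking objective: maximize, over simplex weights, the summed log of the weighted mixture of per-chain pointwise predictive densities. This leaves out the per-chain PSIS-LOO step that would produce the lpd matrix and everything else the paper adds on top:

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def stacking_weights(lpd):
    # lpd: (n_data, n_chains) matrix of pointwise log predictive densities,
    # one column per chain (e.g. from PSIS-LOO run chain by chain).
    n, K = lpd.shape

    def neg_objective(z):
        w = np.exp(z - logsumexp(z))                       # softmax -> simplex weights
        return -np.sum(logsumexp(lpd + np.log(w), axis=1)) # negative stacking objective

    res = minimize(neg_objective, np.zeros(K), method="BFGS")
    return np.exp(res.x - logsumexp(res.x))

# toy usage: 3 chains; the middle chain predicts best on average,
# so it should end up with nearly all the weight
rng = np.random.default_rng(0)
lpd = np.column_stack([rng.normal(-1.5, 0.2, 200),
                       rng.normal(-1.0, 0.2, 200),
                       rng.normal(-2.0, 0.2, 200)])
print(stacking_weights(lpd))

The combined posterior you then work with is just each chain’s draws weighted (or resampled) according to these weights.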


Ben, thank you so much for your helpful reply and for bringing in another expert, Yuling!

Hi Yuling, thank you for jumping in; your paper comes just at the right time for my problem. Also, thanks for pointing out the issue of distinguishing between high-but-narrow and low-but-wide modes when we want to pick one “best” chain for computing posterior estimates.

I will read through your paper and hope it’ll solve all the problems concerning my models.