Pathfinder failure without a hint as to why it failed

Hello there!

I am currently adapting Facebook’s Prophet to my needs. In contrast to Prophet, I use the Pathfinder algorithm, which works in most cases. However, on some occasions I encounter an error. The error message doesn’t say much about what went wrong, and a Google search didn’t yield anything either. First, the complete log as produced by cmdstanpy; further down I give some more details about my setup:

method = pathfinder
  pathfinder
    init_alpha = 0.001 (Default)
    tol_obj = 1e-12 (Default)
    tol_rel_obj = 10000 (Default)
    tol_grad = 1e-08 (Default)
    tol_rel_grad = 1e+07 (Default)
    tol_param = 1e-08 (Default)
    history_size = 5 (Default)
    num_psis_draws = 1000 (Default)
    num_paths = 4 (Default)
    save_single_paths = false (Default)
    psis_resample = true (Default)
    calculate_lp = true (Default)
    max_lbfgs_iters = 1000 (Default)
    num_draws = 1000 (Default)
    num_elbo_draws = 25 (Default)
id = 1 (Default)
data
  file = C:\Users\bkambs\AppData\Local\Temp\tmp2vsplre3\q5jib5wn.json
init = C:\Users\bkambs\AppData\Local\Temp\tmp2vsplre3\5dqex9x2.json
random
  seed = 37534
output
  file = C:\Users\bkambs\AppData\Local\Temp\tmp2vsplre3\poissongdlmhx6a\poisson-20241113140824.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
  save_cmdstan_config = false (Default)
num_threads = 1 (Default)

Path [1] :Initial log joint density = 288294.244959
Path [1] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes
              1       2.883e+05      8.325e-07   1.645e+02    7.732e-09  1.000e-03        26       -inf       -inf                  
Path [1] :Failure: None of the LBFGS iterations completed successfully
Pathfinder iteration: 0 failed.
Path [2] :Initial log joint density = 288294.244959
Path [2] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes
              1       2.883e+05      8.325e-07   1.645e+02    7.732e-09  1.000e-03        26       -inf       -inf                  
Path [2] :Failure: None of the LBFGS iterations completed successfully
Pathfinder iteration: 1 failed.
Path [3] :Initial log joint density = 288294.244959
Path [3] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes
              1       2.883e+05      8.325e-07   1.645e+02    7.732e-09  1.000e-03        26       -inf       -inf                  
Path [3] :Failure: None of the LBFGS iterations completed successfully
Pathfinder iteration: 2 failed.
Path [4] :Initial log joint density = 288294.244959
Path [4] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes
              1       2.883e+05      8.325e-07   1.645e+02    7.732e-09  1.000e-03        26       -inf       -inf                  
Path [4] :Failure: None of the LBFGS iterations completed successfully
Pathfinder iteration: 3 failed.
No pathfinders ran successfully
  • Prophet’s Stan file is mostly untouched so far. I simply use the Poisson GLM model instead of the normal-distribution one. Plus, I changed the prior widths to better suit the nonlinear link function.
  • Prophet doesn’t use Pathfinder but (unless the user wishes otherwise) only MAPE. I use MAPE to initialize Pathfinder. The MAPE part of my implementation runs error-free; the error occurs only in the subsequent Pathfinder run.
  • For my toy data set, the error occurs as soon as I reduce the number of possible change points below a certain threshold. For those who are not familiar with Prophet: it uses a piecewise linear model to fit a trend to time series data. Change points are the times at which rate changes occur. In Stan, the trend enters the GLM via the offset parameter alpha, and the rate changes delta are model parameters with a Laplace prior. The bottom line is: reducing the number of change points is equivalent to reducing the number of model parameters, i.e. an overly rigid model may be causing the algorithm to fail (though from visual inspection of the fit just above the threshold, I wouldn’t say it’s too rigid).

Does anyone by any chance know what can cause this error and have some hints on how to debug it?

Thanks!

Pathfinder runs its own optimization as well, so initializing at the MAP estimate is probably not a good idea. Have you tried leaving off that part?

I only added the initialization with the MAPE result because Pathfinder alone converged towards nonsensical results. The initialization increased the quality, speed, and robustness of the fit. Well, at least so far.

I’ll give turning off the initialization a try.

One of the primary things that can lead to the “Failure: None of the LBFGS iterations completed successfully” message in the code is Pathfinder being unable to find any point better than the initial point, which can definitely happen if the initial point is already exactly at a mode.

That’s an interesting point!

So I tried removing the initialization, and indeed it worked. I still like the approach of first running MAPE and then Pathfinder to speed up the sampling. Do you think it would help to add some epsilon to the MAPE results to avoid the error? Or is it possible to tell Pathfinder not to optimize but just to sample?

The issue is that Pathfinder doesn’t just sample around the mode. The basic description of the algorithm is that it optimizes toward the mode, remembering each point it visited along the way, and then samples from the best of the approximations centered at those points. So starting some distance away from the mode is required.

If you just want to (approximately) sample around the mode, optimization followed by Laplace sampling may be what you are looking for.

As Brian said, Pathfinder is based on LBFGS, which, in the version used by Pathfinder, finds the MAP estimate. So starting at the MAP means that LBFGS has nowhere to go, and it will just fail instantly.

Instead of starting with MAPE, have you tried starting Pathfinder with explicit random initial values, i.e. just generating random draws from the support of your parameters that make sense? Or you can take Pathfinder’s reasonable answers and feed them as inits to something like the Laplace approximation.

Thanks a lot for both your inputs!

I wanted to give the Laplace sampler a try, but it turns out my code around Stan is already too inflexible to make that possible without major changes.

However, what worked for me was to take the MAPE output, add normally distributed random numbers with a scale corresponding to 10% of each parameter value, and start Pathfinder with these as inits. So far, Pathfinder doesn’t break while still being fast and producing good-looking results.
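The jittering step described above can be sketched in plain NumPy. The parameter names and values here are hypothetical placeholders, not Prophet’s actual parameter block:

```python
import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical MAP-estimate output; names and values are placeholders
map_estimate = {
    "k": 0.35,
    "m": 1.2,
    "delta": np.array([0.05, -0.10, 0.02]),
}

# Add Gaussian noise with scale equal to 10% of each value's magnitude,
# so Pathfinder's LBFGS does not start exactly at the mode
inits = {
    name: value + rng.normal(0.0, 0.1 * np.abs(value))
    for name, value in map_estimate.items()
}
```

The resulting dict can then be passed as the inits for a Pathfinder run instead of the raw MAP estimate.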

What’s “MAPE”? I see “mean absolute percentage error” online.

Hey,

What’s “MAPE”? I see “mean absolute percentage error” online.

Yeah, you are right. That’s where I know it from, too. I thought I had read “MAPE” as an abbreviation for “MAP estimation” somewhere these days and just adopted it. Or maybe I am just hallucinating and made it up myself.

Thanks. You also have to be careful in that MAP and penalized MLE are not equivalent. Stan now lets you do both, the difference being that MAP uses the Jacobian from the change of variables, while the penalized MLE doesn’t. If there are no constrained parameters, they’re equivalent.
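The difference shows up even with a flat prior (in which case the “penalized MLE” is just the MLE). A sketch using SciPy rather than Stan: for y ~ normal(0, sigma) with sigma > 0 handled on the unconstrained scale eta = log(sigma), the plain MLE of sigma² is S/n, while the Jacobian-adjusted (MAP) optimum is S/(n−1), where S is the sum of squared observations:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, size=50)
n, S = len(y), float(np.sum(y**2))

def neg_log_lik(eta):
    # -log p(y | sigma) with sigma = exp(eta), dropping constants
    sigma2 = np.exp(2.0 * eta)
    return n * eta + S / (2.0 * sigma2)

# MLE: optimize the likelihood alone (no Jacobian term)
sigma_mle = np.exp(minimize(lambda e: neg_log_lik(e[0]), x0=[0.0]).x[0])

# MAP on the unconstrained scale: subtract log|d sigma / d eta| = eta
sigma_map = np.exp(minimize(lambda e: neg_log_lik(e[0]) - e[0], x0=[0.0]).x[0])
```

The two optima differ exactly by the Jacobian term, which is why starting values and estimates from the two modes are not interchangeable when parameters are constrained.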

I’d recommend using one of our standard interfaces such as cmdstanr or cmdstanpy. That will give you access to all the algorithms along with some nice bells and whistles.

What are you defining as a good result here? I’d be cautious: specifying inits very near the MAP estimate will give Pathfinder a very short search space.

Also, MAPE normally refers to Mean Absolute Percent Error. I had the same confusion as Bob. There’s no reason not to just say MAP estimate.

Also, most of our models are hierarchical and do not have a well-defined MAP estimate because p(\theta \mid y) has no upper bound.

Just be forewarned that Laplace is quadratic—if you have D dimensions, it’s going to estimate a D \times D covariance matrix, which can get expensive in both memory and computation.
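To make that quadratic cost concrete, here is the back-of-the-envelope memory for a dense D × D covariance matrix of 8-byte doubles:

```python
def laplace_cov_bytes(d: int) -> int:
    """Bytes needed to store a dense d x d float64 covariance matrix."""
    return d * d * 8

# d = 100    ->  ~0.08 MB
# d = 10000  ->  ~0.8 GB
# d = 100000 ->  ~80 GB
for d in (100, 10_000, 100_000):
    print(d, laplace_cov_bytes(d) / 1e9, "GB")
```

So a model with a few hundred parameters is cheap, but the storage alone becomes prohibitive somewhere in the tens of thousands of dimensions, before even counting the cost of computing the Hessian.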

If you’re working with samples, they’re in the same format for HMC, Laplace, and VI.