Variational Bayes "fitting finished unexpectedly": Arithmetic exception from scale_matrix_exp_multiply()

Hi,

I’m having a problem running a variational model. I’m getting the below error when I run a Linux server, but the exact same code works perfectly fine on a Mac.

> barcode_model$variational(experiment_stan_data, threads = 12)
------------------------------------------------------------ 
EXPERIMENTAL ALGORITHM: 
  This procedure has not been thoroughly tested and may be unstable 
  or buggy. The interface is subject to change. 
------------------------------------------------------------ 
Gradient evaluation took 0.522638 seconds 
1000 transitions using 10 leapfrog steps per transition would take 5226.38 seconds. 
Adjust your expectations accordingly! 
Begin eta adaptation. 
Iteration:   1 / 250 [  0%]  (Adaptation) 
Iteration:  50 / 250 [ 20%]  (Adaptation) 
Warning: Fitting finished unexpectedly! Use the $output() method for more information.

Finished in  309.5 seconds.

Notice, that I’m also running with 12 threads. On the machine where I get this error, I noticed that I didn’t see any threads being spawned in top. Turning off multithreading doesn’t change anything.

This is what I get from output():

method = variational
  variational
    algorithm = meanfield (Default)
      meanfield
    iter = 10000 (Default)
    grad_samples = 1 (Default)
    elbo_samples = 100 (Default)
    eta = 1 (Default)
    adapt
      engaged = true (Default)
      iter = 50 (Default)
    tol_rel_obj = 0.01 (Default)
    eval_elbo = 100 (Default)
    output_samples = 1000 (Default)
id = 1 (Default)
data
  file = <path>/standata-539dc49b86a.json
init = 2 (Default)
random
  seed = 1912800785
output
  file = <path>/barcode-sc-dynamics-202406172040-1-661654.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = <path>/barcode-sc-dynamics-profile-202406172040-1-3b728b.csv
  save_cmdstan_config = false (Default)
num_threads = 12 (Default)

------------------------------------------------------------
EXPERIMENTAL ALGORITHM:
  This procedure has not been thoroughly tested and may be unstable
  or buggy. The interface is subject to change.
------------------------------------------------------------



Gradient evaluation took 0.522638 seconds
1000 transitions using 10 leapfrog steps per transition would take 5226.38 seconds.
Adjust your expectations accordingly!


Begin eta adaptation.
Iteration:   1 / 250 [  0%]  (Adaptation)
Iteration:  50 / 250 [ 20%]  (Adaptation)
  • Operating System: Linux and Mac
  • cmdstanr Version: 0.8.1.9000
  • Stan: 2.35.0

Did you try doing what the error message suggested and running the $output() method (I have no idea what it does, but it’s where I’d start)?

The root cause of the problem is that the ADVI algorithm as coded in Stan is quite unstable. The arithmetic in C++ is platform dependent, so it’s not unusual to have slightly different behavior in Linux and on the Mac, especially if it’s an M1/2/3 Mac.

You might want to

  1. try a different seed,
  2. try to provide better inits (or at least a smaller uniform range than (-2, 2) default),
  3. try a range of fixed eta values rather than adapting (I think this is possible, but am not 100% sure), or
  4. try Pathfinder VI (which I’m not 100% sure is available in cmdstanr yet).

Multiple threads would only be spawned if you have parallelized code within your Stan program. I don’t know if CmdStan lets you run multiple optimizations in parallel.

Hi Bob,

Did you try doing what the error message suggested and running the $output() method (I have no idea what it does, but it’s where I’d start)?

Yes, I’ve included the output above. For pathfinder() I also included the output below.

ADVI seems to work fine with a Mac but is never able to get past the beta adaptation on Linux.

try a different seed

I never set a seed, so I’m guessing that might not be the problem, i.e., I’m always trying a different seed.

try to provide better inits (or at least a smaller uniform range than (-2, 2) default),

I tried init = 0 but that still didn’t help.

try Pathfinder

I still get the same error, shown below:

> xx <- barcode_model$pathfinder(experiment_stan_data, num_threads = 12)
Path [2] :Initial log joint density = -14768535422.992096 
Path [4] :Initial log joint density = -14918802812.028357 
Path [3] :Initial log joint density = -14792147284.162289 
Path [1] :Initial log joint density = -14828556428.684467 
Warning: Fitting finished unexpectedly! Use the $output() method for more information.
Finished in  16.0 seconds.
> xx$output()

method = pathfinder
  pathfinder
    init_alpha = 0.001 (Default)
    tol_obj = 1e-12 (Default)
    tol_rel_obj = 10000 (Default)
    tol_grad = 1e-08 (Default)
    tol_rel_grad = 1e+07 (Default)
    tol_param = 1e-08 (Default)
    history_size = 5 (Default)
    num_psis_draws = 1000 (Default)
    num_paths = 4 (Default)
    save_single_paths = false (Default)
    psis_resample = true (Default)
    calculate_lp = true (Default)
    max_lbfgs_iters = 1000 (Default)
    num_draws = 1000 (Default)
    num_elbo_draws = 25 (Default)
id = 1 (Default)
data
  file = <path>/RtmpER9qvW/standata-539d668205b5.json
init = 2 (Default)
random
  seed = 1956612295
output
  file = <path>/RtmpER9qvW/barcode-sc-dynamics-202406192153-1-006d3d.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = <path>/RtmpER9qvW/barcode-sc-dynamics-profile-202406192153-1-49a328.csv
  save_cmdstan_config = false (Default)
num_threads = 12 (Default)

Path [2] :Initial log joint density = -14768535422.992096
Path [4] :Initial log joint density = -14918802812.028357
Path [3] :Initial log joint density = -14792147284.162289
Path [1] :Initial log joint density = -14828556428.684467

Forget what I said about the threads. They are being spawned correctly and I’m using map_rect and it works fine. (I was incorrectly looking for processes not threads looking at the top output!) Could my use of map_rect be causing this problem (it works find on the Mac)?

It looked like the Stan process was crashing, so I attached gdb to it and this is what I see:

Program received signal SIGFPE, Arithmetic exception.
0x00000000004ee9de in void stan::math::matrix_exp_action_handler::set_approx_order<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, (void*)0, (void*)0>(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, double const&, int&, int&) [clone .isra.0] ()

I am using this somewhere in my model

      row_vector[2] cur_bc_prop_mix = [1, 1] * scale_matrix_exp_multiply(seq_day[t], treatment_barcode_A, t0_prob_mix);