Pathfinder to Rstan

I was wondering if there was a plan (and what would be a rough ETA) for adding the Pathfinder algorithm to Rstan.

Thanks.

Not that I know of. It's all @bgoodri can do to keep up with the requirements of CRAN. It's in cmdstanr, and that interface is getting expanded to make it easier to use for inits, following the design in cmdstanpy.
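For reference, a rough sketch of that workflow in cmdstanr (mod, model.stan, and stan_data are placeholders; recent cmdstanr versions accept a fit object directly as init, but check your version's documentation):

library(cmdstanr)
mod <- cmdstan_model("model.stan")           # placeholder Stan file
fit_pf <- mod$pathfinder(data = stan_data)   # fast approximate fit
fit_mcmc <- mod$sample(
  data = stan_data,
  init = fit_pf,   # recent cmdstanr versions accept a Pathfinder fit as init
  chains = 4
)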

1 Like

Yes, I imagine it's a pain (I'm interested in Rstan because I build a few Stan-based R packages).

I was just wondering if @bgoodri thought it was a matter of months or 1+ years.

I would use cmdstanr, but the feature that stops me most is that the returned object does not carry the draws, just a link to them. This is quite hard to manage for users who use a third-party Stan package occasionally (and do not know about Stan): they just want to save one object and maybe pass it to other methods in the future.

I would use cmdstanr, but the feature that stops me most is that the returned object does not carry the draws, just a link to them

Could you clarify this? After the first time the $draws() method is called on a fit object and the CSVs are read into R, the object stores all of the draws internally. You can see this in the source code here.

1 Like

Thanks @andrjohns, I was under the impression that if you delete the draw files, then the fit$draws() command fails to produce them.

If this is not the case, it would simplify things a lot for using cmdstanr in R packages.
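If that is the case, a rough sketch of the workflow would be (fit stands for any cmdstanr fit object; the file name is just an example):

# Call $draws() once so the CSVs are read and the draws are cached in the
# R object; $save_object() reads everything in and saves the object as RDS.
fit$draws()
fit$save_object(file = "fit.rds")
# Later, even if the CSV files are gone:
fit2 <- readRDS("fit.rds")
fit2$draws()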

Interestingly, on some of my models vb gets closer to HMC than Pathfinder does. Below I am plotting the point estimates. (Tagging @avehtari in case you find this of interest.)

I am using sccomp to call all three fitting methods. Below is the code.

devtools::install_github("stemangiola/sccomp@cmdstanr")
library(sccomp)

# Pathfinder fit
path =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ continuous_covariate * type,
    formula_variability = ~ 1,
    sample, cell_group,
    cores = 1,
    mcmc_seed = 42,
    max_sampling_iterations = 1000, inference_method = "pathfinder"
  )

# Variational (ADVI) fit
var =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ continuous_covariate * type,
    formula_variability = ~ 1,
    sample, cell_group,
    cores = 1,
    mcmc_seed = 42,
    max_sampling_iterations = 1000, inference_method = "variational"
  )

# HMC fit
hmc =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ continuous_covariate * type,
    formula_variability = ~ 1,
    sample, cell_group,
    cores = 1,
    mcmc_seed = 42,
    max_sampling_iterations = 1000, inference_method = "hmc"
  )

# Compare point estimates across the three methods
# (tibble() and pull() come from the tidyverse)
tibble(
  hmc  = hmc  |> pull(c_effect),
  vb   = var  |> pull(c_effect),
  path = path |> pull(c_effect)
) |> GGally::ggpairs()

Here is how I am calling those fitting methods

# HMC
mod$sample(
  data = data_for_model,
  chains = chains,
  parallel_chains = chains,
  threads_per_chain = 1,
  iter_warmup = warmup_samples,
  iter_sampling = as.integer(output_samples / chains),
  save_warmup = FALSE,
  init = init,
  output_dir = output_directory
)

# ADVI
model$variational(
  data = data,
  output_samples = output_samples,
  iter = iter,
  tol_rel_obj = 0.01,
  output_dir = output_dir,
  init = init,
  ...
)

# Pathfinder
model$pathfinder(
  data = data,
  tol_rel_obj = tol_rel_obj,
  output_dir = output_dir,
  init = init,
  num_paths = 10,
  ...
)

In the experiments of the Pathfinder paper, there were also 3 of 20 cases where ADVI had on average a smaller 1-Wasserstein distance than Pathfinder (see Figure 3), so this is not surprising. I would guess ADVI also used more computation time than Pathfinder in your case (see Figures 4 and 5).

Can you briefly summarize what sccomp, model, and data are here?

sccomp is a compositional linear model based on beta-binomial distributions (which are sum-constrained to obtain compositional properties).

https://www.pnas.org/doi/full/10.1073/pnas.2203828120

The data are counts across cell types; for a tissue specimen we might have, for example,

  • 100 immune cells
  • 900 epithelial cells and so on

And we have many tissue specimens, with treatment labels (for example). This model tests, for example, whether immune cells increase in proportion in treated specimens.
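To illustrate the layout just described, here is a hypothetical toy count table (column names are made up and do not necessarily match sccomp's expected input):

library(tibble)
# Hypothetical toy example: cell-type counts per tissue specimen,
# with a per-specimen treatment label.
counts <- tribble(
  ~sample, ~cell_group,  ~count, ~treatment,
  "s1",    "immune",        100, "treated",
  "s1",    "epithelial",    900, "treated",
  "s2",    "immune",         60, "untreated",
  "s2",    "epithelial",    940, "untreated"
)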

Cool!

Would you expect the posterior to be close to normal with low correlation?

Generally, it seems all marginal posteriors are normally distributed, and roughly independent.

  • beta_raw_raw intercept, for all components of the system (all cell types in the tissue) = the estimated proportion (in unconstrained space)
  • beta_raw_raw slope, for all components
  • beta_raw_raw = all covariates of one component
  • alpha = overdispersion of the estimated proportion

For Pathfinder, I am wondering if I am doing things right.

In the pairplot I get, all the posterior draws seem piled up on just a few values. If I extract the draws, the object appears to have 1000 draws.

> attr(my_estimate, "fit")$draws() 
# A draws_matrix: 1000 iterations, 1 chains, and 3452 variables
    variable
draw lp_approx__  lp__ beta_raw_raw[1,1] beta_raw_raw[2,1] beta_raw_raw[3,1]
  1           51 -3151              -1.7             0.097             -0.59
  2           51 -3151              -1.7             0.097             -0.59
  3           51 -3153              -1.6             0.068             -0.92
  4           51 -3153              -1.6             0.068             -0.92
  5           51 -3151              -1.7             0.097             -0.59
  6           51 -3151              -1.7             0.097             -0.59
  7           51 -3151              -1.7             0.097             -0.59
  8           51 -3151              -1.7             0.097             -0.59
  9           51 -3151              -1.7             0.097             -0.59
  10          51 -3151              -1.7             0.097             -0.59
    variable
draw beta_raw_raw[4,1] beta_raw_raw[1,2] beta_raw_raw[2,2]
  1             -0.086             -0.27              0.14
  2             -0.086             -0.27              0.14
  3             -0.049             -0.12              0.39
  4             -0.049             -0.12              0.39
  5             -0.086             -0.27              0.14
  6             -0.086             -0.27              0.14
  7             -0.086             -0.27              0.14
  8             -0.086             -0.27              0.14
  9             -0.086             -0.27              0.14
  10            -0.086             -0.27              0.14
# ... with 990 more draws, and 3444 more variables
> attr(my_estimate, "fit")$draws() |> dim()
[1] 1000 3452

The warning I get for Pathfinder is

Path [1] :Initial log joint density = -384975.954254 
Path [1] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.224e+03 -3.224e+03                   
Path [1] :Best Iter: [39] ELBO (-3221.331526) evaluations: (1101) 
Path [2] :Initial log joint density = -384975.954254 
Path [2] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.225e+03 -3.225e+03                   
Path [2] :Best Iter: [36] ELBO (-3223.888466) evaluations: (1101) 
Path [3] :Initial log joint density = -384975.954254 
Path [3] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.226e+03 -3.226e+03                   
Path [3] :Best Iter: [43] ELBO (-3223.387871) evaluations: (1101) 
Path [4] :Initial log joint density = -384975.954254 
Path [4] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.225e+03 -3.225e+03                   
Path [4] :Best Iter: [43] ELBO (-3222.358036) evaluations: (1101) 
Path [5] :Initial log joint density = -384975.954254 
Path [5] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.222e+03 -3.222e+03                   
Path [5] :Best Iter: [44] ELBO (-3222.415514) evaluations: (1101) 
Path [6] :Initial log joint density = -384975.954254 
Path [6] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.225e+03 -3.225e+03                   
Path [6] :Best Iter: [37] ELBO (-3223.245075) evaluations: (1101) 
Path [7] :Initial log joint density = -384975.954254 
Path [7] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.225e+03 -3.225e+03                   
Path [7] :Best Iter: [38] ELBO (-3222.993990) evaluations: (1101) 
Path [8] :Initial log joint density = -384975.954254 
Path [8] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.224e+03 -3.224e+03                   
Path [8] :Best Iter: [43] ELBO (-3222.202985) evaluations: (1101) 
Path [9] :Initial log joint density = -384975.954254 
Path [9] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.224e+03 -3.224e+03                   
Path [9] :Best Iter: [43] ELBO (-3223.049653) evaluations: (1101) 
Path [10] :Initial log joint density = -384975.954254 
Path [10] : Iter      log prob        ||dx||      ||grad||     alpha      alpha0      # evals       ELBO    Best ELBO        Notes  
             44      -3.822e+05      5.869e-03   1.540e-01    1.000e+00  1.000e+00      1101 -3.224e+03 -3.224e+03                   
Path [10] :Best Iter: [44] ELBO (-3223.958918) evaluations: (1101) 
Total log probability function evaluations:11760 
Pareto k value (2.219468) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor. 
Finished in  1.8 seconds.

The way I call pathfinder is

my_res = model$pathfinder(
  data = data,
  tol_rel_obj = tol_rel_obj,
  output_dir = output_dir,
  seed = seed + i,
  init = init,
  num_paths = 10,
  single_path_draws = output_samples / 10,
  ...
)

The same plot for vb looks good.

Yep, that's the only case where we would expect that ADVI might perform better than Pathfinder, but then we would also expect the Laplace algorithm to perform well, and that would be much faster than ADVI.
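For reference, a minimal sketch of what trying the Laplace approximation could look like in cmdstanr (assuming a recent cmdstanr version with the $laplace() method, and the same model, data, and output_dir objects as in the snippets above):

# Normal approximation centred at the posterior mode.
fit_laplace <- model$laplace(
  data = data,
  draws = 1000,          # number of approximate posterior draws
  output_dir = output_dir
)
fit_laplace$draws()       # draws from the Laplace approximation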

  • It's not clear what init is. Pathfinder benefits from having dispersed initial values in the tails of the posterior. If you start near the mode, the Pathfinder optimization ends quickly before getting a good normal approximation, and there is less variation between the different paths.
  • By default (see Run Stan's Pathfinder Variational Inference Algorithm — model-method-pathfinder • cmdstanr), Pathfinder does Pareto smoothed importance resampling at the end. Even if the target and proposal are close to normal, in the high-dimensional case the distribution of importance ratios gets worse (see Pareto Smoothed Importance Sampling), so that one weight dominates and resampling gets many copies of that draw. Based on the plot, you have one weight dominating and about 20 unique draws overall. You can turn off the resampling with the option psis_resample = FALSE. See the Birthdays case study for examples of how this makes a difference and how to count the number of unique draws (there is also a short sketch after this list).
  • I almost always set history_size to 50 or 100 for much improved Pathfinder performance.
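As a rough sketch of counting unique draws (using the posterior package; fit_pf stands for any Pathfinder fit object):

library(posterior)
# Number of distinct draws left after Pareto smoothed importance resampling;
# a very small number relative to the total indicates one weight dominates.
d <- as_draws_matrix(fit_pf$draws())
nrow(unique(as.matrix(d)))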

Thanks Aki. I understand. I am trying

psis_resample=FALSE

with cmdstanr version 0.8.0.9000, and strangely, although the documentation ?pathfinder shows psis_resample as an argument,

I get

Error in model$pathfinder(data = data, tol_rel_obj = tol_rel_obj, output_dir = output_dir,  : 
  unused argument (psis_resample = FALSE)

Also the autocompletion of the parameters does not show psis_resample.

Not sure if this has happened to anyone else.

Did you just install a new version of cmdstanr? Just checking, because the model object stores the functions, and if you install a new version you need to recreate the model objects too, to get the new function definitions.
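A minimal sketch of what that looks like (model.stan is a placeholder path):

library(cmdstanr)
# After updating cmdstanr, re-create (and force-recompile) the CmdStanModel
# so the object picks up the new method signatures, e.g. psis_resample.
model <- cmdstan_model("model.stan", force_recompile = TRUE)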

Thanks, I recompiled the model.

With your suggestions, I get a great match, at least for the point estimates.

So, while inits are quite important for efficient HMC (to the point that you run Pathfinder/vb to find inits), for Pathfinder you are saying it is almost better not to define them a priori?

my_res = model$pathfinder(
  data = data,
  tol_rel_obj = tol_rel_obj,
  output_dir = output_dir,
  seed = seed + i,
  # init = init,
  num_paths = 10,
  single_path_draws = output_samples / 10,
  max_lbfgs_iters = 100,
  history_size = 100,
  psis_resample = FALSE,
  ...
)
1 Like

Great!

HMC benefits if the inits are in or close to the typical set. As each HMC chain uses a different random seed and is stochastic, the init can be the same for all the chains, although it is better to use dispersed inits for different chains. HMC can struggle if the inits are really far away from the typical set. It seems that in your case HMC works just fine.

Pathfinder uses deterministic optimization, so using the same init for all paths means there are no differences between the paths, and the only variation comes from the stochastic ELBO estimate used to choose a normal approximation along the path. Thus, it is important to have diverse inits. Pathfinder is useful when you don't know where the typical set is, and your diverse inits are unlikely to be near the mode (or one of the modes) of the posterior, as the Pathfinder optimization quickly gets to higher density and the path often goes through or near the typical set. It would be possible to modify Pathfinder to also have some kind of expansion phase if the diagnostics indicated that the initial value was close to the mode and the path only moved away from the typical set.

1 Like

Hello @avehtari ,

Without resampling, I get that each of the 10 paths converges to different values.
I imagine this is expected.