I was wondering if there was a plan (and what would be a rough ETA) for adding the Pathfinder algorithm to Rstan.
Thanks.
Not that I know of. It’s all @bgoodri can do to keep up with the requirements of CRAN. It’s in cmdstanr and that interface is getting expanded to make it easier to use for inits following the design in cmdstanpy.
Yes, I imagine it’s a pain (I’m interested in RStan because I build a few Stan-based R packages).
I was just wondering if @bgoodri thought it was a matter of months or 1+ years.
I would use cmdstanr, but the feature that stops me most is that the returned object does not carry the draws, just a link to them. This is quite hard to manage for users who use a Stan-based third-party package only occasionally (and do not know about Stan): they just want to save one object, and maybe pass it to other methods in the future.
I would use cmdstanr, but the feature that stops me most is that the returned object does not carry the draws, just a link to them
Could you clarify this? After the first time the $draws() method is called on a fit object and the CSVs are read into R, the object stores all of the draws internally. You can see this in the source code here
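For instance, the caching behaviour can be checked like this (a sketch, not from this thread: it assumes CmdStan is installed and that `mod` is a compiled cmdstanr model with a data list `data_list`):

```r
fit <- mod$sample(data = data_list)

draws  <- fit$draws()              # first call: reads the CSVs and caches the draws
file.remove(fit$output_files())    # delete the CSV files on disk
draws2 <- fit$draws()              # still works: served from the cached copy

# To hand package users a single self-contained file, read everything in
# and save the fit object as an RDS:
fit$save_object(file = "fit.rds")
restored <- readRDS("fit.rds")
```

`$save_object()` forces all lazily-read results into the object before saving, so the resulting RDS no longer depends on the CSV files.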
Thanks @andrjohns, I was under the impression that if you delete the draw files, then the fit$draws() command fails to produce them.
If this is not the case, it simplifies things a lot for the use of cmdstanr in R packages.
Interestingly, on some models of mine VB gets closer to HMC than Pathfinder. Below I am plotting the point estimates. (Tagging @avehtari in case you find this of interest.)
I am using sccomp to call all three fitting methods. Below is the code.
devtools::install_github("stemangiola/sccomp@cmdstanr")

library(sccomp)

path =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ continuous_covariate * type,
    formula_variability = ~ 1,
    sample, cell_group,
    cores = 1,
    mcmc_seed = 42,
    max_sampling_iterations = 1000,
    inference_method = "pathfinder"
  )

var =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ continuous_covariate * type,
    formula_variability = ~ 1,
    sample, cell_group,
    cores = 1,
    mcmc_seed = 42,
    max_sampling_iterations = 1000,
    inference_method = "variational"
  )

hmc =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ continuous_covariate * type,
    formula_variability = ~ 1,
    sample, cell_group,
    cores = 1,
    mcmc_seed = 42,
    max_sampling_iterations = 1000,
    inference_method = "hmc"
  )

tibble(
  hmc = hmc |> pull(c_effect),
  vb = var |> pull(c_effect),
  path = path |> pull(c_effect)
) |> GGally::ggpairs()
Here is how I am calling those fitting methods
mod$sample(
  data = data_for_model,
  chains = chains,
  parallel_chains = chains,
  threads_per_chain = 1,
  iter_warmup = warmup_samples,
  iter_sampling = as.integer(output_samples / chains),
  save_warmup = FALSE,
  init = init,
  output_dir = output_directory
)

model$variational(
  data = data,
  output_samples = output_samples,
  iter = iter,
  tol_rel_obj = 0.01,
  output_dir = output_dir,
  init = init,
  ...
)

model$pathfinder(
  data = data,
  tol_rel_obj = tol_rel_obj,
  output_dir = output_dir,
  init = init,
  num_paths = 10,
  ...
)
In the experiments of the Pathfinder paper we also had 3 of 20 cases where ADVI had on average a smaller 1-Wasserstein distance than Pathfinder (see Figure 3), so this is not surprising. I would guess ADVI also used more computation time than Pathfinder in your case (see Figures 4 and 5).
Can you briefly summarize what is sccomp, model and data here?
sccomp is a compositional linear model based on Beta-binomials (which are sum-constrained to get compositional properties).
https://www.pnas.org/doi/full/10.1073/pnas.2203828120
The data is counts across cell types; basically, for a tissue specimen we might have a table of counts per cell type.
And we have many tissue specimens, with treatment labels (for example). This model tests, for example, whether immune cells increase in proportion in treated specimens.
Cool!
Would you expect the posterior to be close to normal with low correlation?
Generally, it seems all marginal posteriors are normally distributed, and roughly independent.
beta_raw_raw intercept for all components of the system (all cell types in the tissue) = the estimated proportion (in unconstrained space)
beta_raw_raw slope, all components
beta_raw_raw = all covariates of one component
alpha = overdispersion of the estimated proportion
For Pathfinder, I am wondering if I am doing things right. The pair plot I get is below: it seems all posterior draws are piled up. If I extract the draws, there appear to be 1000 draws.
> attr(my_estimate, "fit")$draws()
# A draws_matrix: 1000 iterations, 1 chains, and 3452 variables
variable
draw lp_approx__ lp__ beta_raw_raw[1,1] beta_raw_raw[2,1] beta_raw_raw[3,1]
1 51 -3151 -1.7 0.097 -0.59
2 51 -3151 -1.7 0.097 -0.59
3 51 -3153 -1.6 0.068 -0.92
4 51 -3153 -1.6 0.068 -0.92
5 51 -3151 -1.7 0.097 -0.59
6 51 -3151 -1.7 0.097 -0.59
7 51 -3151 -1.7 0.097 -0.59
8 51 -3151 -1.7 0.097 -0.59
9 51 -3151 -1.7 0.097 -0.59
10 51 -3151 -1.7 0.097 -0.59
variable
draw beta_raw_raw[4,1] beta_raw_raw[1,2] beta_raw_raw[2,2]
1 -0.086 -0.27 0.14
2 -0.086 -0.27 0.14
3 -0.049 -0.12 0.39
4 -0.049 -0.12 0.39
5 -0.086 -0.27 0.14
6 -0.086 -0.27 0.14
7 -0.086 -0.27 0.14
8 -0.086 -0.27 0.14
9 -0.086 -0.27 0.14
10 -0.086 -0.27 0.14
# ... with 990 more draws, and 3444 more variables
> attr(my_estimate, "fit")$draws() |> dim()
[1] 1000 3452
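Those repeated rows are consistent with importance resampling: the single-path draws are resampled with replacement by weight, so a few high-weight draws can appear many times. A toy base-R illustration of the effect (simulated data, no cmdstanr needed):

```r
# 10 distinct "draws", resampled 1000 times with heavily skewed weights,
# mimicking resampling with replacement when a few weights dominate.
set.seed(1)
distinct <- matrix(rnorm(20), nrow = 10)
idx <- sample(1:10, size = 1000, replace = TRUE,
              prob = c(0.9, rep(0.1 / 9, 9)))
resampled <- distinct[idx, ]

nrow(resampled)                          # 1000 rows...
nrow(unique(as.data.frame(resampled)))   # ...but at most 10 distinct ones
```

Running the same `unique()` count on the real draws matrix shows how many distinct Pathfinder draws survived the resampling.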
The complaint I get for Pathfinder is
Path [1] :Initial log joint density = -384975.954254
Path [1] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.224e+03 -3.224e+03
Path [1] :Best Iter: [39] ELBO (-3221.331526) evaluations: (1101)
Path [2] :Initial log joint density = -384975.954254
Path [2] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.225e+03 -3.225e+03
Path [2] :Best Iter: [36] ELBO (-3223.888466) evaluations: (1101)
Path [3] :Initial log joint density = -384975.954254
Path [3] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.226e+03 -3.226e+03
Path [3] :Best Iter: [43] ELBO (-3223.387871) evaluations: (1101)
Path [4] :Initial log joint density = -384975.954254
Path [4] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.225e+03 -3.225e+03
Path [4] :Best Iter: [43] ELBO (-3222.358036) evaluations: (1101)
Path [5] :Initial log joint density = -384975.954254
Path [5] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.222e+03 -3.222e+03
Path [5] :Best Iter: [44] ELBO (-3222.415514) evaluations: (1101)
Path [6] :Initial log joint density = -384975.954254
Path [6] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.225e+03 -3.225e+03
Path [6] :Best Iter: [37] ELBO (-3223.245075) evaluations: (1101)
Path [7] :Initial log joint density = -384975.954254
Path [7] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.225e+03 -3.225e+03
Path [7] :Best Iter: [38] ELBO (-3222.993990) evaluations: (1101)
Path [8] :Initial log joint density = -384975.954254
Path [8] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.224e+03 -3.224e+03
Path [8] :Best Iter: [43] ELBO (-3222.202985) evaluations: (1101)
Path [9] :Initial log joint density = -384975.954254
Path [9] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.224e+03 -3.224e+03
Path [9] :Best Iter: [43] ELBO (-3223.049653) evaluations: (1101)
Path [10] :Initial log joint density = -384975.954254
Path [10] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
44 -3.822e+05 5.869e-03 1.540e-01 1.000e+00 1.000e+00 1101 -3.224e+03 -3.224e+03
Path [10] :Best Iter: [44] ELBO (-3223.958918) evaluations: (1101)
Total log probability function evaluations:11760
Pareto k value (2.219468) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 1.8 seconds.
The way I call pathfinder is
my_res = model$pathfinder(
  data = data,
  tol_rel_obj = tol_rel_obj,
  output_dir = output_dir,
  seed = seed + i,
  init = init,
  num_paths = 10,
  single_path_draws = output_samples / 10,
  ...
)
The same plot for VB looks good.
Yep, that’s the only case where we would expect that ADVI might perform better than Pathfinder, but then we would also expect the Laplace algorithm to perform well, too, and that would be much faster than ADVI.
It would be useful to know what your init is. Pathfinder benefits from having disperse initial values in the tails of the posterior. If you start near the mode, the Pathfinder optimization ends quickly before getting a good normal approximation, and there is less variation between different paths.

You can also try psis_resample=FALSE. See examples of how this makes a difference in the Birthdays case study, and how to count the number of unique draws.

You can also try increasing history_size to 50 or 100 for much improved Pathfinder performance.

Thanks Aki, I understand. I am trying
psis_resample=FALSE
with cmdstanr version [Package cmdstanr version 0.8.0.9000 Index], and strangely, although the documentation ?pathfinder shows psis_resample as an argument, I get

Error in model$pathfinder(data = data, tol_rel_obj = tol_rel_obj, output_dir = output_dir, :
unused argument (psis_resample = FALSE)

Also, the autocompletion of the parameters does not show psis_resample.
Not sure if this has happened to anyone else.
Did you just install a new version of cmdstanr? Just checking, because the model object stores the functions; if you install a new version, you need to recreate the model objects, too, to get the new function definitions.
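Concretely, after upgrading the package you can force the model object to be rebuilt (a sketch; the file name is hypothetical):

```r
library(cmdstanr)

# Recompile so the CmdStanModel object is recreated with the current
# cmdstanr method definitions (including newer pathfinder arguments).
model <- cmdstan_model("model.stan", force_recompile = TRUE)
```

Simply calling `cmdstan_model()` again on the same file is not always enough, because the compiled model and its stored methods may be reused from cache.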
Thanks, I recompiled the model.
With your suggestions I get a great match, at least for the point estimates.
So while inits are quite important for efficient HMC (to the point that you run Pathfinder/VB to find inits), for Pathfinder you are saying that it is almost better not to define them a priori?
my_res = model$pathfinder(
  data = data,
  tol_rel_obj = tol_rel_obj,
  output_dir = output_dir,
  seed = seed + i,
  # init = init,
  num_paths = 10,
  single_path_draws = output_samples / 10,
  max_lbfgs_iters = 100,
  history_size = 100,
  psis_resample = FALSE,
  ...
)
Great!
HMC benefits if the inits are in or close to the typical set. As each HMC chain uses a different random seed and is stochastic, the init can be the same for all the chains, although it is better to use disperse inits for different chains. HMC can struggle if the inits are really far away from the typical set. It seems in your case HMC works just fine.
Pathfinder uses deterministic optimization, and using the same init for all paths means there are no differences in paths; the only variation comes from the stochastic ELBO estimate used to choose a normal approximation along the path. Thus, it is important to have diverse inits. Pathfinder is useful when you don’t know where the typical set is, and your diverse inits are unlikely to be near the mode (or one of the modes) of the posterior, as Pathfinder optimization quickly gets to higher density and often the path goes through or near the typical set. It would be possible to modify Pathfinder to also have some kind of expansion phase if the diagnostics indicated that the initial value was close to the mode and the path only went away from the typical set.
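One way to get disperse inits is simply to leave `init` unset, so each path starts from its own random draw on the unconstrained scale. Alternatively, you can pass an init function that returns a fresh random value on each call (a sketch; the parameter name and dimensions `K`, `P` are hypothetical, and it assumes cmdstanr calls the function separately for each path, as it does per chain for `$sample()`):

```r
# Fresh random starting values for a hypothetical K x P parameter,
# so the 10 L-BFGS paths start from different points rather than
# all from the same place.
init_fun <- function() {
  list(beta_raw_raw = matrix(runif(K * P, -2, 2), nrow = K))
}

fit <- model$pathfinder(
  data = data,
  num_paths = 10,
  init = init_fun
)
```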
Hello @avehtari ,
Without resampling, I get that each of the 10 paths converges on different values.
I imagine this is expected.