Issues with differences between MCMC and Pathfinder results; how to make Pathfinder (or something else) more accurate?

Hello!

I am fitting a negative binomial model with the following code:

data {
    int<lower=1> n_obs;
    int<lower=1> n_transcripts;
    
    array[n_obs] int<lower=1> transcript_id;
    array[n_obs] int<lower=0> count;
    array[n_obs] int<lower=1> freq;
}
parameters {
    array[n_transcripts] real<lower=0> mu;
    array[n_transcripts] real<lower=0> odisp;
    
    real<lower=0> odisp_mu;
    real<lower=0> odisp_sigma;
}
model {
    int t;
    
    // regularization of overdispersion factors
    odisp ~ lognormal(odisp_mu, odisp_sigma);
    
    for (i in 1:n_obs) {
        t = transcript_id[i];
        
        // negative binomial sampling process
        target += freq[i] * neg_binomial_2_lpmf(count[i] | mu[t], odisp[t]);
    }
}
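For intuition on the `freq[i] * neg_binomial_2_lpmf(...)` line: multiplying the log-likelihood by the frequency is equivalent to repeating the observation `freq[i]` times. Here is a quick numerical check using a pure-Python port of Stan's `neg_binomial_2` lpmf (the helper function and the numbers are mine, just for illustration):

```python
from math import lgamma, log

def neg_binomial_2_lpmf(y: int, mu: float, phi: float) -> float:
    """Stan's neg_binomial_2 log-pmf: mean mu, dispersion phi
    (variance = mu + mu^2 / phi)."""
    return (lgamma(y + phi) - lgamma(phi) - lgamma(y + 1)
            + phi * log(phi / (phi + mu))
            + y * log(mu / (phi + mu)))

count, freq, mu, phi = 7, 5, 4.0, 2.0

# weighting the lpmf by freq ...
weighted = freq * neg_binomial_2_lpmf(count, mu, phi)
# ... matches summing freq identical observations
repeated = sum(neg_binomial_2_lpmf(count, mu, phi) for _ in range(freq))

print(abs(weighted - repeated) < 1e-12)  # True
```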

Unfortunately, this is a much simpler model than the one I would eventually like to fit (I am slowly adding complexity to the model), and even now MCMC takes longer and longer as the number of unique transcripts n_transcripts increases. So MCMC does not currently seem like a realistic estimation method for me.

So I turned to using Pathfinder, but as a sanity check I wanted to ensure the posteriors looked similar enough. I started with these arguments:

## Fitting model with Pathfinder
fit_path = model.pathfinder(model_data,
                            num_paths = 1,
                            num_single_draws = 10_000)

I calculated the difference in posterior variances and then subsetted the 6 transcripts with the largest (in magnitude) differences:
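For reference, a minimal sketch of that comparison (assuming `draws_mcmc` and `draws_path` are draws-by-parameters arrays extracted from the two fits; the random stand-in data and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-ins for stacked posterior draws, shape (n_draws, n_transcripts)
draws_mcmc = rng.normal(0.0, 1.0, size=(1000, 8))
draws_path = rng.normal(0.0, 1.2, size=(1000, 8))

# per-parameter difference in posterior variance, Pathfinder minus MCMC
var_diff = draws_path.var(axis=0) - draws_mcmc.var(axis=0)

# indices of the 6 transcripts with the largest |difference|
worst6 = np.argsort(np.abs(var_diff))[::-1][:6]
print(worst6)
```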

This was too large a difference for me to be comfortable with, so I tried changing Pathfinder’s arguments like tol_grad, num_elbo_draws, num_paths, etc.:

## Fitting model with Pathfinder
fit_path = model.pathfinder(model_data,
                            draws = 10_000,
                            num_paths = 10,
                            psis_resample = True,
                            tol_grad = 1e-12,
                            tol_obj = 1e-12,
                            tol_param = 1e-12)

With those arguments I get plots like this:

I’ve repeated this many times, and every run has some transcripts with this behaviour (parameters tightly centred on the wrong value, far too uncertain, etc.). It’s enough of a problem that I would like to know whether there is anything I can do to prevent it before moving on to making my model more complex.

  • How should I change Pathfinder’s hyperparameters to make it more accurate?
  • My understanding was that Pathfinder is a better alternative to other variational inference (VI) methods; is that not necessarily true?
  • In my situation the posterior will always be approximately Gaussian; does that make other VI methods more accurate than Pathfinder?

Thank you!

## Fitting model with MCMC
fit_mcmc = model.sample(model_data,
                        chains = 10,
                        iter_warmup = 1000,
                        iter_sampling = 1000,
                        inits = 1)

## Fitting model with Pathfinder
fit_path = model.pathfinder(model_data,
                            num_paths = 1,
                            draws = 10_000,
                            tol_obj = 1e-12,
                            tol_grad = 1e-12,
                            tol_param = 1e-12,
                            max_lbfgs_iters = 20_000)

## Fitting model with ADVI Meanfield
fit_vari = model.variational(model_data,
                             draws = 10_000,
                             grad_samples = 1,
                             tol_rel_obj = 1e-6,
                             adapt_iter = 10,
                             eval_elbo = 500,
                             iter = 20_000,
                             require_converged = False)

It seems meanfield ADVI does not randomly produce very tight posteriors for odisp the way Pathfinder does, even if it is sometimes slightly off from the mode (though perhaps not by enough for me to care):

I think the Pathfinder paper makes it clear that Pathfinder is often better with the same or lower computation budget, but not necessarily always. If you get better results with some other method, then use it.

Pathfinder is often good with an approximately normal distribution, but based on your results that does not seem to hold jointly, and looking at the model code there is likely a funnel. As illustrated in the Pathfinder paper, with a funnel the initialization of Pathfinder matters more; that is, the initialization needs to be sufficiently overdispersed. Another possibility is that the importance ratios have extreme variation due to a very high-dimensional posterior (see the PSIS paper), and even with Pareto smoothing the resampling returns only a few unique draws (see the illustration of this in the Birthdays case study). You should have received warnings about high Pareto-k’s if this is the case.
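As a toy illustration of that resampling failure mode (the weights below are made up, not real Pathfinder output): when one importance ratio dominates, resampling with replacement collapses to almost no unique draws.

```python
import math
import random

random.seed(0)
n = 10_000
# made-up log importance ratios: one draw dominates, mimicking the
# extreme weight variation a very high-dimensional posterior can produce
log_w = [0.0] * n
log_w[0] = 20.0

# normalize on the log scale for numerical stability
m = max(log_w)
w = [math.exp(lw - m) for lw in log_w]
total = sum(w)
probs = [wi / total for wi in w]

# resample with replacement, as importance resampling does
resampled = random.choices(range(n), weights=probs, k=n)
print(len(set(resampled)))  # only a handful of unique draws survive
```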

If the posterior has a funnel shape, Pathfinder might be sensitive to the initial values, but if another VI method works well you don’t need to use Pathfinder. If the resampling is failing due to high dimensionality, you can turn off the resampling with psis_resample=False.

By default Stan’s Pathfinder uses a small history size, but I almost always get better results with a bigger history size, e.g., history_size=100. Even for very complex models, L-BFGS rarely needs more than 100 iterations, and you can get a substantial speedup with max_lbfgs_iters=100.
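Putting that advice together, a call might look like the sketch below (argument names follow the CmdStanPy calls earlier in the thread; the exact values are the suggestions above and should be tuned for your model):

```python
# settings suggested above, gathered in one place
pathfinder_kwargs = dict(
    num_paths=10,
    draws=10_000,
    history_size=100,     # larger L-BFGS history usually helps
    max_lbfgs_iters=100,  # L-BFGS rarely needs more than ~100 iterations
    psis_resample=False,  # disable if Pareto-k warnings indicate failure
)
# fit_path = model.pathfinder(model_data, **pathfinder_kwargs)
```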


Hi @avehtari - if resampling fails, would it then make sense to use the highest log-probability/ELBO path from the returned paths? This is in the sense of finding good inits.

I think the best approach is the one shown in the Birthdays case study mentioned above.
