Using pathfinder to initialize sampling

Hi, my MCMC sampling takes forever to run, which is not surprising since the model has a large number of parameters. I am trying to use Pathfinder for initialization, hoping this could speed up the estimation process. However, Pathfinder emits the following warning. I am wondering whether, in this case, I would still be better off using the results from Pathfinder.

Chain [1] Total log probability function evaluations:254755
Chain [1] Pareto k value (25) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
01:11:50 - cmdstanpy - INFO - Chain [1] done processing
Chain [1]

Try it and see. If you're in CmdStanPy, it's a one-liner to use Pathfinder as initialization to MCMC. I don't know how to do it in CmdStanR.
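
Roughly, the CmdStanPy version looks something like the sketch below (file and data names are hypothetical; it assumes a recent CmdStanPy where the Pathfinder fit object provides create_inits()):

```python
import cmdstanpy

# Compile the model (hypothetical file names).
model = cmdstanpy.CmdStanModel(stan_file="model.stan")

# Run Pathfinder to get an approximate posterior.
pf = model.pathfinder(data="data.json", seed=1234)

# Use Pathfinder draws as per-chain initial values for MCMC.
fit = model.sample(
    data="data.json",
    inits=pf.create_inits(chains=4),  # one init dict per chain
    chains=4,
    seed=1234,
)
```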

The problem's unlikely to be dimension because HMC scales really well in dimension. I've fit 400K parameter models in half an hour. Where HMC struggles is with varying curvature (multiple scales) and stiffness (poor conditioning or non-positive definiteness).


In CmdStanR it's also quite simple. You can pass the fitted model object created by the pathfinder method (or any other method) directly to the init argument of the sample method. I think it was @stevebronder who added that functionality, which is super convenient!


Thanks! I am using CmdStanPy. I asked ChatGPT about this and got the following reply:

If the Pareto k value is greater than 0.7 and importance resampling could not improve the approximation, it indicates that the pathfinder approximation is not reliable, especially in the tail regions of the posterior. This high k-value suggests that the pathfinder's posterior samples may have poor coverage or be biased, particularly in capturing the distribution's variability and outliers. As a result, using these results as initial values for further inference may carry risks.

My question is more about whether it is good practice to use the results from a poor Pathfinder run, rather than about ease of implementation.

The ChatGPT answer is not good (it's a typical vague mashup of correct-sounding statements that are not true for this specific case, for which there is not yet enough training material).

See the Birthdays case study for how to do the initialization in case of high Pareto-k warnings. These initial values are likely to be much better than the default approach used by Stan.


Haha, thank you very much for this information! I know ChatGPT is not very trustworthy, but given my limited knowledge, it's hard to judge correctly. This is good to know.

Let me take a look at the link. Thanks again!

What do you think is wrong with it, @avehtari? It looks OK to me. I used o1-preview, and I think its response is clearer (the response is too long to include here):

I asked it the follow-up question of whether I can still use it to initialize, and again the response seems reasonable to me.

@dmp didn't show us his query, but the response from GPT appears to be consistent with what our own error message says,

Chain [1] Pareto k value (25) is greater than 0.7. Importance
resampling was not able to improve the approximation, which
may indicate that the approximation itself is poor.

I guess there's a "may" in our response, but is there a case where the Pareto-k value is 25 and we expect Pathfinder to provide a reasonable approximation?

My query was almost the same as yours.

I have a follow-up question. Suppose I have run a simpler model but now need to add some more parameters. Which is the better approach for initialization: use some of the estimates from the previously fitted simple model, or use the estimates from Pathfinder (which typically comes with the warning that the Pareto k value is greater than 0.7)? I think in the Birthdays case study by @avehtari, they typically just go ahead with the Pathfinder initialization even when there is the large k-value warning.

A second naive question: for vector parameters, when initializing, should I specify values element by element, say, for vector[3] par, as 'par[1] =', 'par[2] =', 'par[3] =', or specify the vector itself with 'par = [ , , ]'? The ChatGPT answer was also not very clear to me. I tried both, and was not sure whether the sampling was actually using the initial values that I gave.
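
For reference, a minimal sketch of the whole-vector form in CmdStanPy (parameter name and values are hypothetical, and it assumes the compiled model object from earlier); Stan's init format takes whole containers rather than individual elements:

```python
# Whole-vector form: the init dict maps each parameter name to its full value,
# here for a hypothetical model with `vector[3] par`.
inits = {"par": [0.1, 0.2, 0.3]}

fit = model.sample(data="data.json", inits=inits, chains=4)

# The config block at the top of the output CSV files records the init setting,
# which is one way to check that the supplied initial values were picked up.
```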

@dmp did not show the specific question asked of ChatGPT, but I assume the question was about "Using pathfinder to initialize sampling", as that is the title of this thread. So I'm interpreting the answer based on that.

This is fine, although it misses mentioning that reliability depends on the use case. The approximation is not reliable for posterior inference, but it is likely to be useful for initialization.

This is just nonsense, and I can't guess where it picked that up, unless it's mixing things up because the Pareto-k diagnostic fits a Pareto distribution to the tail of the importance ratios. Given a high khat, the Pathfinder approximation is unreliable for any posterior quantity, in the "bulk" as well.

Another mash-up, as mentioning outliers doesn't make any sense in this case; it probably picked that up from texts about high Pareto-k in PSIS-LOO possibly being caused by outliers. The posterior draws themselves are not biased, but Monte Carlo estimates using those draws can be biased. Poor coverage and bias are not relevant to the question in the way this sentence makes it sound. The default initialization in Stan draws from uniform[-2,2] in the unconstrained space, and in most cases (a) uniform[-2,2] does not cover the posterior, (b) draws from that uniform would produce biased Monte Carlo estimates, and (c) even if the posterior really were uniform[-2,2] and we used the default 4 draws, they would not produce good estimates with only 4 draws.

Low Pareto-k would indicate approximate draws from the posterior, which would be nice, but really the comparison here should be to any other way of obtaining initial values when we are not able to get draws from the posterior. Compared to the default uniform[-2,2], Pathfinder initialization has lower risk.

ChatGPT is also missing the important part: what to do in case of high Pareto-k in Pathfinder. I think ChatGPT could have written this answer even with the material available before the Pathfinder paper, as something like it could be mashed up from sentences discussing PSIS and the Pareto-k diagnostic. If it does include the Pathfinder paper in its training material, that paper also focuses on Pathfinder as an approximation and not as initialization. By default, Pathfinder uses resampling with replacement to target small bias in the posterior approximation. If Pareto-k is high, the resampling with replacement might produce only one unique draw. For initialization, it is better to use resampling without replacement. This was discussed when Pathfinder was implemented in Stan C++, but the option was not included. My Birthdays case study shows how to use Pathfinder with resampling without replacement (see the sketch below). This way, the initial values for different chains are unique, which improves the convergence diagnostics. There are definitely other methods that can be used to get better initialization when Pathfinder fails (and Pathfinder itself could likely be improved, too).
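
A minimal CmdStanPy sketch of the idea (the Birthdays case study does the equivalent in R; here pf is a Pathfinder fit, model is the compiled model, and vector[3] par is a hypothetical parameter):

```python
import numpy as np

# Resample WITHOUT replacement from the Pathfinder draws so that each chain
# gets a distinct initial value (parameter name "par" is hypothetical).
par_draws = pf.stan_variable("par")                     # shape: (num_draws, 3)
rng = np.random.default_rng(1234)
rows = rng.choice(par_draws.shape[0], size=4, replace=False)

inits = [{"par": par_draws[i].tolist()} for i in rows]  # one init dict per chain
fit = model.sample(data="data.json", inits=inits, chains=4)
```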

To summarize: Pathfinder initialization using resampling without replacement, even in the case of very high Pareto-k, is very likely to be better than initialization from uniform[-2,2].

First, you did not ask about the use for initialization, and that part has some inaccurate statements as well as recommendations on the interpretation of Pareto-k that are a few years out of date. The second part has more misleading statements. Both parts include unnecessary text not relevant to the specific question, and both parts include way too many warnings, probably leaving most readers confused and scared.

I really hope the Stan Discourse does not turn into requests for checking ChatGPT answers. It has taken me 30 minutes so far to write this message. Listing and explaining the problems in the o1-preview answer would take much longer.

With resampling without replacement.

Thanks so much for all the clarification! I'm afraid I'm as out of date as GPT! In this case, I really wondered what you thought was wrong with the answer as it looked OK to me.

I agree, but I feel the whole world's moving this way. I'm just independently fascinated by what LLMs can do, having spent so long working on NLP and AI (more than twice as long as I've been in stats!).

I find computer scientists and statisticians react differently to warnings. We had a big debate when Stan first rolled out about how many warnings we'd have, with many people coming down on the side of minimizing warnings because they'd scare users. Not sure we wound up with a good compromise there: some of the warnings are too verbose and some not verbose enough.


Just some updates. I used the Pathfinder results as initialization values for sampling despite the large k-value warning, and in general the sampling is relatively faster. I also manually replaced some of the initialization values with estimates from a previously fitted model, and that works even better.
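
In case it is useful to others, a rough sketch of that second step (the parameter name beta and the objects pf and prev_fit are hypothetical; it assumes a CmdStanPy Pathfinder fit with create_inits() and an earlier MCMC fit of the simpler model):

```python
# Start from Pathfinder-based inits, then overwrite selected parameters with
# estimates from an earlier, simpler fit (names are hypothetical).
inits = pf.create_inits(chains=4)                        # one init dict per chain
beta_prev = prev_fit.stan_variable("beta").mean(axis=0)  # posterior mean, earlier fit

for chain_init in inits:
    chain_init["beta"] = beta_prev.tolist()

fit = model.sample(data="data.json", inits=inits, chains=4)
```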

Thank you both, @Bob_Carpenter and @avehtari, very much for the detailed advice. It is quite useful to understand what I should take away from the warnings.


Based on the discussion here, I've added short paragraphs to the Stan Reference Manual on the Pareto-k diagnostic for Pathfinder and on using Pathfinder for initialization. These will appear on the documentation web pages at the time of the next release (expected in three weeks).
