Using pathfinder to initialize sampling

Hi, my MCMC sampling takes forever to run, which is not surprising since the model has a large number of parameters. I am trying to use Pathfinder for initialization, hoping this could speed up the estimation process. However, Pathfinder emits the following warning. I am wondering whether, in this case, I would still be better off using the results from Pathfinder.

Chain [1] Total log probability function evaluations:254755
Chain [1] Pareto k value (25) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
01:11:50 - cmdstanpy - INFO - Chain [1] done processing
Chain [1]

Try it and see. If you're in CmdStanPy, it's a one-liner to use Pathfinder as initialization to MCMC. I don't know how to do it in CmdStanR.
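
Roughly, the CmdStanPy version looks something like the sketch below (file and data names are hypothetical; it assumes a recent CmdStanPy where the Pathfinder fit object provides create_inits()):

```python
import cmdstanpy

# Compile the model (hypothetical file names).
model = cmdstanpy.CmdStanModel(stan_file="model.stan")

# Run Pathfinder to get an approximate posterior.
pf = model.pathfinder(data="data.json", seed=1234)

# Use Pathfinder draws as per-chain initial values for MCMC.
fit = model.sample(
    data="data.json",
    inits=pf.create_inits(chains=4),  # one init dict per chain
    chains=4,
    seed=1234,
)
```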

The problem's unlikely to be dimension because HMC scales really well in dimension. I've fit 400K parameter models in half an hour. Where HMC struggles is with varying curvature (multiple scales) and stiffness (poor conditioning or non-positive definiteness).


In CmdStanR it's also quite simple. You can pass the fitted model object created by the pathfinder method (or any other method) directly to the init argument of the sample method. I think it was @stevebronder who added that functionality, which is super convenient!


Thanks! I am using CmdStanPy. I asked ChatGPT about this and got the following reply:

If the Pareto k value is greater than 0.7 and importance resampling could not improve the approximation, it indicates that the pathfinder approximation is not reliable, especially in the tail regions of the posterior. This high k-value suggests that the pathfinder's posterior samples may have poor coverage or be biased, particularly in capturing the distribution's variability and outliers. As a result, using these results as initial values for further inference may carry risks.

My question is more about whether it is good practice to use the results from a poor Pathfinder run, rather than about ease of implementation.

The ChatGPT answer is not good (it's a typical vague mashup of correct-sounding statements that are not true for this specific case, for which there is not yet enough training material).

See the Birthdays case study for how to do the initialization in case of high Pareto-k warnings. These initial values are likely to be much better than the default approach used by Stan.


Haha, thank you very much for this information! I know ChatGPT is not very trustworthy, but given my limited knowledge, it's hard to judge correctly. This is good to know.

Let me take a look at the link. Thanks again!

What do you think is wrong with it, @avehtari? It looks OK to me. I used o1-preview, and I think its response is clearer (the response is too long to include here):

I asked it the follow-up question of whether I can still use it to initialize, and again the response seems reasonable to me.

@dmp didn't show us his query, but the response from GPT appears to be consistent with what our own error message says,

Chain [1] Pareto k value (25) is greater than 0.7. Importance
resampling was not able to improve the approximation, which
may indicate that the approximation itself is poor.

I guess there's a "may" in our response, but is there a case where the Pareto-k value is 25 and we expect Pathfinder to provide a reasonable approximation?

My query was almost the same as yours.

I have a follow-up question. Suppose I have run a simpler model but now need to add some more parameters. Which is the better approach for initialization: use some of the estimates from the previously fitted simple model, or use the estimates from Pathfinder (which typically comes with the warning that the Pareto k value is greater than 0.7)? I think in the Birthdays case study by @avehtari, they typically just go ahead with the Pathfinder initialization even when there is the large k-value warning.

A second naive question: for vector parameters, when initializing, should I specify values element by element, say, for vector[3] par, as 'par[1] =', 'par[2] =', 'par[3] =', or specify the vector itself with 'par = [ , , ]'? The ChatGPT answer was also not very clear to me. I tried both, and was not sure whether the sampling was actually using the initial values that I gave.
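
For reference, a minimal sketch of the whole-vector form in CmdStanPy (parameter name and values are hypothetical, and it assumes the compiled model object from earlier); Stan's init format takes whole containers rather than individual elements:

```python
# Whole-vector form: the init dict maps each parameter name to its full value,
# here for a hypothetical model with `vector[3] par`.
inits = {"par": [0.1, 0.2, 0.3]}

fit = model.sample(data="data.json", inits=inits, chains=4)

# The config block at the top of the output CSV files records the init setting,
# which is one way to check that the supplied initial values were picked up.
```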

@dmp did not show the specific question asked of ChatGPT, but I assume the question was about "Using pathfinder to initialize sampling", as that is the title of this thread. So I'm interpreting the answer based on that.

This is fine, although it misses mentioning that reliability depends on the use case. The approximation is not reliable for posterior inference, but it is likely to be useful for initialization.

This is just nonsense, and I can't guess where it picked that up, unless it's mixing things up because the Pareto-k diagnostic fits a Pareto distribution to the tail of the importance ratios. Given a high khat, the Pathfinder approximation is unreliable for any posterior quantity, in the "bulk" as well.

Another mash-up, as mentioning outliers doesn't make any sense in this case; it probably picked that up from texts about high Pareto-k in PSIS-LOO possibly being caused by outliers. The posterior draws themselves are not biased, but Monte Carlo estimates using those draws can be biased. Poor coverage and bias are not relevant to the question in the way this sentence makes it sound. The default initialization in Stan draws from uniform[-2,2] in the unconstrained space, and in most cases (a) uniform[-2,2] does not cover the posterior, (b) draws from that uniform would produce biased Monte Carlo estimates, and (c) even if the posterior really were uniform[-2,2] and we used the default 4 draws, they would not produce good estimates with only 4 draws.

Low Pareto-k would indicate approximate draws from the posterior, which would be nice, but really the comparison here should be to any other way of obtaining initial values when we are not able to get draws from the posterior. Compared to the default uniform[-2,2], Pathfinder initialization has lower risk.

ChatGPT is also missing the important part: what to do in case of high Pareto-k in Pathfinder. I think ChatGPT could have written this answer even with the material available before the Pathfinder paper, as something like it could be mashed up from sentences discussing PSIS and the Pareto-k diagnostic. If it does include the Pathfinder paper in its training material, that paper also focuses on Pathfinder as an approximation and not as initialization. By default, Pathfinder uses resampling with replacement to target small bias in the posterior approximation. If Pareto-k is high, the resampling with replacement might produce only one unique draw. For initialization, it is better to use resampling without replacement. This was discussed when Pathfinder was implemented in Stan C++, but the option was not included. My Birthdays case study shows how to use Pathfinder with resampling without replacement (see the sketch below). This way, the initial values for different chains are unique, which improves the convergence diagnostics. There are definitely other methods that can be used to get better initialization when Pathfinder fails (and Pathfinder itself could likely be improved, too).
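
A minimal CmdStanPy sketch of the idea (the Birthdays case study does the equivalent in R; here pf is a Pathfinder fit, model is the compiled model, and vector[3] par is a hypothetical parameter):

```python
import numpy as np

# Resample WITHOUT replacement from the Pathfinder draws so that each chain
# gets a distinct initial value (parameter name "par" is hypothetical).
par_draws = pf.stan_variable("par")                     # shape: (num_draws, 3)
rng = np.random.default_rng(1234)
rows = rng.choice(par_draws.shape[0], size=4, replace=False)

inits = [{"par": par_draws[i].tolist()} for i in rows]  # one init dict per chain
fit = model.sample(data="data.json", inits=inits, chains=4)
```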

To summarize: Pathfinder initialization using resampling without replacement, even in the case of very high Pareto-k, is very likely to be better than initialization from uniform[-2,2].

First, you did not ask about the use for initialization, and that part has some inaccurate statements as well as recommendations on the interpretation of Pareto-k that are a few years out of date. The second part has more misleading statements. Both parts include unnecessary text not relevant to the specific question, and both parts include way too many warnings, probably leaving most readers confused and scared.

I really hope the Stan Discourse does not turn into requests for checking ChatGPT answers. It has taken me 30 minutes so far to write this message. Listing and explaining the problems in the o1-preview answer would take much longer.

With resampling without replacement.

Thanks so much for all the clarification! I'm afraid I'm as out of date as GPT! In this case, I really wondered what you thought was wrong with the answer as it looked OK to me.

I agree, but I feel the whole world's moving this way. I'm just independently fascinated by what LLMs can do, having spent so long working on NLP and AI (more than twice as long as I've been in stats!).

I find computer scientists and statisticians react differently to warnings. We had a big debate when Stan first rolled out about how many warnings we'd have, with many people coming down on the side of minimizing warnings because they'd scare users. Not sure we wound up with a good compromise there: some of the warnings are too verbose and some not verbose enough.


Just some updates. I used the Pathfinder results as initialization values for sampling despite the large k-value warning, and in general the sampling is relatively faster. I also manually replaced some of the initialization values with estimates from a previously fitted model, and that works even better.
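
In case it is useful to others, a rough sketch of that second step (the parameter name beta and the objects pf and prev_fit are hypothetical; it assumes a CmdStanPy Pathfinder fit with create_inits() and an earlier MCMC fit of the simpler model):

```python
# Start from Pathfinder-based inits, then overwrite selected parameters with
# estimates from an earlier, simpler fit (names are hypothetical).
inits = pf.create_inits(chains=4)                        # one init dict per chain
beta_prev = prev_fit.stan_variable("beta").mean(axis=0)  # posterior mean, earlier fit

for chain_init in inits:
    chain_init["beta"] = beta_prev.tolist()

fit = model.sample(data="data.json", inits=inits, chains=4)
```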

Thank you both, @Bob_Carpenter and @avehtari, very much for the detailed advice. It is quite useful to understand what I should take away from the warnings.


Based on the discussion here, I've added short paragraphs to the Stan Reference Manual on the Pareto-k diagnostic for Pathfinder and on using Pathfinder for initialization. These will appear on the documentation web pages at the time of the next release (expected in three weeks).
