Model comparison - lp__


I have various runs of the same model (and sometimes with one slightly changed parameter) and I want to compare models to get the model that fits my data the best. I was told this can be seen by the lp__ value, which is printed out in the model summary. Now my (stupid) question is, which value is better?
Lets say I got those values (I know, they are really close, but nevertheless), which model would be better?
I think I do not fully understand the meaning of log posterior…

Help would be very appreciated!


You are going to have to understand that to make good use of Stan, regardless of its (in)applicability to model comparison. In short, it is the kernel of the posterior density in log-units, possibly ignoring some or all constants. It is not useful for model comparison. Most people around here prefer to do comparisons involving (functions of) the log-likelihood, which excludes the contributions from the priors and the Jacobians for the transformations from the constrained space to the unconstrained space. See this paper:

Other people prefer to calculate the probability that each of several models is correct, conditional on at least one of them being (very close to) correct. See this paper

for some cautions about that approach, but if you are going to use it then you need the posterior density including the constants. If you have that, then you can use

See especially its tutorial.

The original question asked about comparing different runs on the same model, as well as different models.
If you are just comparing different runs, then surely the simple log probability is fine.

Thank you. And do you know if a smaller value is better or a larger value?

Logs are monotonic, so higher density is also higher log density. But what you’re looking for in sampling is to sample around the typical set, not around the mode.

Also, you probably don’t want to use lp__ as your yardstick as it drops constants and includes all the Jacobian adjustments to allow sampling to happen on the unconstrained scale.

Why would you compare different runs on the same model and the same data?
The original question explicitly mentions model comparison. You can’t do that with just lp__ values.

By “same model”, we mean the same log density. So there may be some misunderstanding in terminology here.

By “same model”, we mean the same log density. So there may be some misunderstanding in terminology here.

Oh that might be the case…
Actually, I am quiet confused right now :D

Lets just say, I have the same model code and I am starting some runs with this same model code but with random generated starting parameters, so there is always a slightly different outcome of the different runs.
Now I just want to know, which run was the best, and I read this was possible to answer by looking at the lp__ value, but I am not sure which lp__ value is the better one, a more negative one?

You should just do that with multiple chains in a single run. Or if you do it in CmdStan, then you should be combining the chains running the same model for posterior analysis.

You should not be looking for equal runs of the same model and finding the best. If they’re not all the same, you have convergence problems.

1 Like

@Bob_Carpenter Let’s say I wanna do multiple runs with different prior values. For example: (gamma(2,17) , gamma(17,2), gamma(3,17), … etc etc) to see the difference in the results.

How can I do that with multiple chains in a single run? Can you please illustrate an example if possible ? I am trying to learn how to do this

Thanks in advance,

@bgoodri, when you say the kernel of the posterior density, you are referring to the posterior distribution in log units?

got it , kernel of the posterior basically omits the normalization constant that does not depend on the parameter. I just did not know the term for this is “kernel”

Hi everyone,

sorry to revive this topic, but I sort of have the same issue.
I am running a Bayesian ANN, and I am having the issue that I get multiple answers while running the same model on the same data. In other words, I get a very low MC error per chain, but the chains do not converge (i.e. high Rhats). I actually recently made a post about it;

In that post I was forwarded to the following link;

In that link you can see under “Neural network: When ordering is not enough”, it says;

This means that to identify the model we need have to somehow choose one of those modes, but there is clearly not a “best” (much higher lp__) mode.

However, in my case, I actually can see that there is a difference in lp between the “solutions”. (Sorry if I mix up terminology here…)

If I run 8 chains, is there a way to “pick” the best chains automatically based on the lp_, or are there better ways to do it?

(PS: my goal is to mix machine learning with structural time series, hence I use Stan for ANN.)

See Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors.

1 Like

That’s brilliant! Model stacking, who would have thought.
I went through the paper and implemented the code successfully to my own by now.

Thank you for this. Love the research you guys are doing, it’s very applied and immediately usable.

1 Like