How to access the lp__ values for all sets of samples with CmdStanPy

Hi, I was wondering how I can access the “lp__” values for each sample generated by CmdStanPy.
For a current analysis it is important that I also assign the respective “lp__” value to each generated sample, however using the “stan_variable” function I am only able to extract the parameters.
I would be grateful for any response.

1 Like

I know there is some way, some sampler stats function, but maybe the easiest way to access it, is to use arviz?

import arviz as az
idata = az.from_cmdstanpy(fit)
lp = idata.sample_stats["lp"]

Where do you need lp? Remember that lp is a density value unless you make sure that all call to distributions in your stan code use target += ...._lpdf, then it is probability.

2 Likes

Regarding this I have a question.
Why is there an lp__ value for each sample for each parameter, while the lp__ values determined by PyStan are one for each set of of parameters or is their interpretation different ?
To clarify what I mean, in case I am not very precise.
For 6250 draws and 16 chains PyStan returns a 16*6250=100.000 entry long array of lp__ values, while ArviZ in this case returns an array that cannot be connected to the 100.000 entry long arrays for the parameters extracted from the sampling.

Edit: I also have another question regarding the r_hat values determined by ArviZ.
For my current fit the r_hat value from ArviZ differs from the R_hat value determined by the summary function from CmdStan. Is their interpretation different ?

1 Like

There is only one lp__ value per draw. Not sure what you mean by value per sample and per parameter.

I don’t know which algorithm is used by cmdstan, so you should make sure both libraries are using the same before diving deeper and seeing if there is a bug. ArviZ uses the one from [1903.08008] Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC, so this could be the reason for the difference if cmdstan still uses the traditional rhat.
Their interpretation is the same independently of the method used to compute it.

1 Like

I am refering to the fact that the arrays for the samples of the parameters are already connected into one long array, however the results I get from ArviZ and the sampler_diagnostics() function are separated into the different chains, so I am unsure how to reconnect them (to the respective draws) into an array similar to the array I get for the parameters (i.e. the order has to match the order extracted from the sampling array).
In this case I have 100.000 entry arrays for the 8 parameters of my fit and I need to reconnect the respective lp__ value to each of those 100.000 entries (to be in the same order).
PyStan already returns the respective lp__ value for each of the 100.000 sets of the 8 different parameters, so I would be surprised if CmdStanPy would not allow to do the same.

CmdStanPy has method stan_variables and sampler_variables - API Reference — CmdStanPy 0.9.76 documentation

sampler_vars = some_model_fit.sampler_variables()
sampler_vars['lp__'].shape

good question, and clearly, we need more documentation here. suggestions for an illustrative use case welcome.

2 Likes

I am sorry if this was a stupid question, also the documentation is fine, it just appears I managed to miss the description on how the sample arrays are built from the chains.
I have a brief question regarding that.
In the documentation it says that for stan_variable the array is effectively created by appending the chains in sequence, so if the quantity determined by sampler_variables is ordered into an array of adding the m samples from the first chain into entries 1 … m and then chain 2 to entries m+1 … 2m, then chain 3 to entries 2m+1 … 3m (and so on) one creates an array of the same order as the one the parameters are returned in ?
If yes then this completely answers my question.

yes, that’s exactly what’s going on.
the sampler output is a matrix where each row is a draw and each column is a variable.
as the primary use case for a sample is evaluating the per-variable estimates, the data is stored column-major for efficient access.

1 Like