How to get the best-fit values of the parameters from a pystan.StanModel.sampling result?

Optimizing just searches for a (local) maximum starting from the initial location; there is no concept of “going to all points”. The default algorithm is L-BFGS.

Sampling does MCMC (HMC). The default is NUTS.

This depends on how you define your model and what initial values you have. Some complex models might have their maximal value at the boundary.

This is only true from a frequentist point of view. What you really want is the full posterior distribution and its mean (or median), but also the correlations between different variables. This depends somewhat on what kind of posterior you have.

Frustrated with pystan. I still could not find a way to get the best-fit values of the parameters from the sampling. Yes, I absolutely need that. The documentation is so poor. I cannot even find a flat-chain option to get the samples in an array. I simply cannot believe how sloppy the documentation is in almost everything.

That does not make sense in the Bayesian way of thinking. You get a posterior, which is a distribution, not a point estimate. The posterior is the analog of the best-fit “value” (it is a distribution).
You can calculate the posterior mean if you want.

See https://github.com/betanalpha/jupyter_case_studies/blob/master/principled_bayesian_workflow/principled_bayesian_workflow.ipynb

fit.extract()

Sorry to hear that, I understand that this can be a bit confusing at start.

Maybe see example 1 in Getting Started — pystan 3.7.0 documentation

The method extract extracts samples into a dictionary of arrays for parameters of interest, or just an array.

la = fit.extract(permuted=True)  # return a dictionary of arrays
mu = la['mu']

(and to get the posterior mean for mu: mu.mean())

Thanks. Let us please agree on the fact that I want to get the best-fit point from my sampling. Say I have 6 chains. I did what you suggested, but there is still some confusion. Say I extract the log posterior in the following two ways:

  1. fit.get_logposterior(inc_warmup=False), and then concatenate all 6 chains.
  2. ex=fit.extract(permuted=True) as you suggested, and then extract the log posterior lp=ex['lp__']

I expect to get consistent values (if not identical) from the two methods above. I am finding that the values are not arranged the same way in the two cases! I spent quite some time investigating why and still could not figure it out. The first element in method 2 is placed somewhere inside one of the chains obtained in method 1, but not at its beginning. Why?

This is because the values are permuted. There is no good reason for this; it is legacy behavior from the start of the project. If you calculate the mean, it does not matter.
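
If it helps, here is a minimal sketch (assuming PyStan 2.x and your existing fit object) of how the two unpermuted views line up:

import numpy as np

# Unpermuted draws: array of shape (iterations, chains, parameters),
# with lp__ stored as the last column (PyStan 2.x behaviour)
draws = fit.extract(permuted=False)
lp_unpermuted = draws[:, :, -1]

# Per-chain log posterior without warmup, stacked into the same shape
lp_chains = np.column_stack(fit.get_logposterior(inc_warmup=False))

# These should agree draw for draw, unlike the permuted dictionary
# returned by fit.extract(permuted=True)
print(np.allclose(lp_unpermuted, lp_chains))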

Got it. When I set permuted=False, everything is arranged as per expectation. Now I will write my own function to get the best fit from the sampling. I am not interested in the mean of the posterior (the median would possibly make more sense for my job).

Thanks for the help, and I am sorry for being annoying.

I think I understand what you’re after (but correct me if I’m wrong). I haven’t worked in Python, but hopefully it translates from R easily enough.

Firstly, I don’t think you want to base your conclusions on the lp__ parameter, since it’s not quite the full log density (see this blog post for more info).

Instead, you’ll likely want to estimate the overall log-likelihood of your model (a similar approach to the loo calculation), except that whereas loo requires the pointwise log-likelihood, you want to sum across all individuals. Using one of my own models as an example:

generated quantities {
  vector[M] log_lik;
  real log_lik_sum;

  for (m in 1:M)
    log_lik[m] = ordered_logistic_lpmf(y[m] | ystar_corr[ii[m], jj[m]], thresholds[gg[m], jj[m]]);

  log_lik_sum = sum(log_lik);
}

After the model has finished sampling, you want to extract the log_lik_sum estimates without permuting, since permuting mixes the chains and iterations together and you need to know which chain and which iteration a given estimate came from. After you’ve extracted the estimates, you just need to find the location of the highest one. Using my own model as an example again:

out = extract(rrs,pars="log_lik_sum",permuted=F)
which(out==max(out),arr.ind=T)
     iterations chains
[1,]        107      1

This tells me that the largest log-likelihood for my model was found in iteration 107 in chain 1, and I can extract the other parameters from that iteration in that chain.
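
In Python the same idea should look roughly like this (an untested sketch, assuming PyStan 2.x; fit.flatnames is my assumption about the fit object’s attributes):

import numpy as np

# Unpermuted draws: shape (iterations, chains, parameters), keeping
# the chain/iteration structure intact
draws = fit.extract(permuted=False)

# Column index of log_lik_sum among the flat parameter names
idx = fit.flatnames.index("log_lik_sum")
out = draws[:, :, idx]

# Location of the largest summed log-likelihood
it, chain = np.unravel_index(np.argmax(out), out.shape)
print(f"largest log_lik_sum: iteration {it}, chain {chain}")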

Hope I got all that right!

Permuting keeps draws together, so this will work even with permuted=True.

Okay, I am again confused.
I have some data, say ~200 points, and a model with 30 free parameters. I just want to fit the model using pystan.
Please tell me where I am wrong.

  1. I had this idea that the probability at a point in parameter space = exp(-\chi^2/2) of that point.
  2. So I expect lp__ would give me -\chi^2/2, but I can see that it is not giving me that value (I checked the \chi^2 value).

lp__ = the log posterior density (up to a constant)

16.3 Notation for samples, chains, and draws | Stan Reference Manual

Okay, now I am fully convinced that I could never understand the basic concept of how these samplers work. If the probability density of a point in parameter space is exp(-\chi^2/2), then why does the sampler not just calculate and store that at each sampling point?

Anyway, could you please tell me if the following serves my purpose.

  1. What I want: from the sampling I just want to find which point has the highest probability (again, only among the points in the sample). I will call it the best fit, which should give me the lowest \chi^2.
  2. What I do: I find the maximum of lp__ and get the values of the parameters at that point (the maximum of lp__). Is this wrong for my particular goal?
    Additionally: I will also definitely plot the whole posterior (with 68% and 95% CL)

I roughly understood what you meant to say, but I could not understand the blog. Could you please tell me where \chi^2 fits into what you are saying/the blog?

Can you clarify what you mean by chi^2? Is there an example that you can point me to?

What that blog post explained is how Stan drops constants from the calculation of the log-density lp__.

lp__ refers to the joint log density of the model, with all constants dropped. What I mentioned in my answer is that you want to store the full log-likelihood (including constants) and then use that.

I have, say, N parameters theta = (theta_1, theta_2, …, theta_N).
At each point theta in parameter space: \chi^2 = \sum [data - model_value(theta)]^2 / error_in_data^2

I expect the probability p(data|theta) = exp(-\chi^2/2). Is that not correct? Is it reflected in lp__?

In the rest of your post, I am again lost in the jargon.

Oh I see, no that’s not what lp__ is calculating.

When you specify a model with Stan, you’re building up a joint log-likelihood for the data and your priors, which is just a single equation that quantifies your hypothesised model (very rough description, may not be entirely accurate). This joint log-likelihood is what Stan uses to sample/explore the posterior distributions of the parameters in your model.

If we go back to your model from the first post:

data {
    int<lower=0> N;
    vector[N] x;
    vector[N] y;
    vector[N] yerror;
    int<lower=0> NP;
}
parameters {
    real a[NP];
}
model {
    for (i in 1:N) {
        target += normal_lpdf(y[i] | a[3] + a[1] * x[i] + a[2] * x[i] * x[i], yerror[i]);
    }
}

Under this model, lp__ is the sum of the normal_lpdf for every individual, with any constants (values that do not depend on model parameters) dropped from the equation. The normal_lpdf is the log of the density given here.

If I understand correctly, that \chi^2 is a fitting procedure (an alternative to OLS) which you use to determine the ‘best fitting’ parameter values; it is not what Stan uses for estimating parameters. But if you know exactly the quantity you need, could you instead calculate it directly in the generated quantities block?
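
Alternatively, you could compute \chi^2 for every draw after sampling. A rough Python sketch (assuming PyStan 2.x, that x, y and yerror are the numpy arrays you passed as data, and that a[1], a[2], a[3] are the first three flat parameters):

import numpy as np

# Unpermuted draws: (iterations, chains, parameters)
draws = fit.extract(permuted=False)
a1, a2, a3 = draws[..., 0], draws[..., 1], draws[..., 2]

# Broadcast the N data points against every (iteration, chain) pair
pred = a3[..., None] + a1[..., None] * x + a2[..., None] * x ** 2
chi2 = np.sum(((y - pred) / yerror) ** 2, axis=-1)   # shape (iterations, chains)

# The draw with the smallest chi^2, i.e. the largest exp(-chi^2/2)
it, chain = np.unravel_index(np.argmin(chi2), chi2.shape)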

I forgot to mention: I treat every noise term as Gaussian and all my parameters have flat priors (say from -infinity to +infinity). Now should lp__ not be directly related to \chi^2? If not, then which formula does Stan use to calculate p(data|parameters)?

As I mentioned above, for your model you’ve specified p(data | parameters) as normal_lpdf(data | parameters), which is calculated as the log of the density given here.

This will not be what lp__ is, as lp__ drops constant values (values in the density that do not depend on the model parameters).
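
To make the “constants dropped” point concrete, here is the normal log density written out in plain Python/numpy (a sketch, not Stan itself):

import numpy as np

# Full normal log density, i.e. the formula behind normal_lpdf:
#   log p(y | mu, sigma) = -0.5*((y - mu)/sigma)**2 - log(sigma) - 0.5*log(2*pi)
def normal_logpdf(y, mu, sigma):
    return -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# In your model sigma = yerror is data, not a parameter, so the terms
# -log(sigma) - 0.5*log(2*pi) are constant with respect to a[1..3].
# What remains, summed over the data points, is exactly -chi^2/2, so
# lp__ differs from -chi^2/2 only by an additive constant.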

Okay, referring to my earlier question:

  1. What I want: among all the sampling points I just want to find which point has the highest probability (again, only among the points in the sample). I will call it the best fit. Please don’t mind even if it is statistically meaningless.

  2. What I do: I find the maximum of lp__ and get the values of the parameters at that point (the maximum of lp__). Is this wrong for my particular goal?

Yes, in that particular case lp__ equals -\chi^2/2 up to an additive constant, so the draw with the largest lp__ is also the draw with the smallest \chi^2 … but if you have any constrained parameters (e.g. something like real<lower=0> sigma;) then lp__ also includes a contribution from the constraining transform.

Please don’t try to extract “best-fit” values from HMC samples. The model object has an optimizing(...) method for finding the maximum likelihood estimate. It is much, much faster than sampling.
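
Something like this (a minimal sketch, assuming PyStan 2.x and the model_code string and data dictionary from your earlier posts):

import pystan

# Compile the model once
sm = pystan.StanModel(model_code=model_code)

# Penalized maximum likelihood / MAP point estimate found with L-BFGS;
# returns a dict mapping parameter names to their point estimates
opt = sm.optimizing(data=data)
print(opt)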