How to get the best-fit values of the parameters from a pystan.StanModel.sampling result?

I am very new to Stan/PyStan. I fit a model in PyStan using an array of parameters. I can see the posterior mean of the parameters (and I could calculate the median as well), but I want to know the best-fit values of the parameters, in other words the point where the probability is maximal (equivalently, where -log_prob is minimal). I am pretty sure there must be some easy function to obtain this best fit.
My MWE is given below:

import pystan
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

A0 = 0.5; A1 = 1.5; A2 = -0.2; A3 = -0.008; A4 = 0.00025
def func(xx):
    return A0 + A1*xx + A2*xx*xx  # + A3*(xx**3) + A4*(xx**4)

x = np.sort(10*np.random.rand(100))      # 100 random x values in [0, 10)
fx = func(x)
yerror = np.random.rand(len(x))          # per-point noise levels
y = np.random.normal(fx, scale=yerror)   # noisy observations

np.random.seed(101)

model = """
data {
    int<lower=0> N;
    vector[N] x;
    vector[N] y;
    vector[N] yerror;
    int<lower=0> NP;
}
parameters {
    real a[NP];
}
model {
    for(i in 1:N){
        target += normal_lpdf(y[i] | a[3] + a[1] * x[i] + a[2] * x[i] * x[i], yerror[i]);
    }
}
"""


NP = 3
data = {'N': len(x), 'x': x, 'y': y, 'yerror': yerror, 'NP': NP}
sm = pystan.StanModel(model_code=model)
fit = sm.sampling(data=data, iter=2000, chains=4, warmup=400, thin=3, seed=101)
print(fit)

I want to get the best-fit value of a[3].

Thanks in advance.

(If you want your code to be reproducible, you need to call np.random.seed() before calling any other np.random functions.)

For maximum likelihood estimation, StanModel objects have an optimizing method. Instead of sampling, call sm.optimizing(data=data) and you get back a dictionary of point estimates.
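
With the model above, a minimal sketch (assuming the PyStan 2.x API; the default optimizer is L-BFGS):

point = sm.optimizing(data=data)   # runs an optimizer on the joint log density, returns a dict of point estimates
print(point['a'])                  # point estimates for a[1], a[2], a[3]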


Is there a reason you would want to use optimized values (frequentist) rather than full Bayesian inference? What you get is samples from the posterior distribution, and you can see the posterior mean/median with print(fit) or get them numerically with fit.summary().
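
For instance, to pull those numbers out programmatically (a sketch assuming the PyStan 2.x summary format, and reusing the pandas import from the MWE):

summary = fit.summary()                      # dict with 'summary', 'summary_rownames', 'summary_colnames'
df = pd.DataFrame(summary['summary'],
                  index=summary['summary_rownames'],
                  columns=summary['summary_colnames'])
print(df.loc['a[3]', ['mean', '50%', '2.5%', '97.5%']])   # posterior mean, median and 95% interval for a[3]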

See here for an example of a Bayesian workflow: https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html

Yes, but priors should be flat for the MLE, shouldn’t they?

For the MLE, yes, but not for a penalized MLE (where the priors act as the penalty).
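
For example, adding a prior to the model block above (a hypothetical weakly informative prior, purely for illustration) turns what sm.optimizing returns into a penalized MLE, i.e. a MAP estimate, rather than a plain MLE:

model {
    a ~ normal(0, 10);   // hypothetical prior; acts as the penalty term for optimizing
    for(i in 1:N){
        target += normal_lpdf(y[i] | a[3] + a[1] * x[i] + a[2] * x[i] * x[i], yerror[i]);
    }
}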

Okay, so everyone is suggesting to use optimizing() rather than sampling. But suppose I have already run sampling; is there a way to get the maximum likelihood estimate from the samples (results) themselves? Isn’t it the point where lp__ is maximal (i.e. where -log_prob is minimal)?

I want both the 68% and 95% confidence levels as well as the best-fit value. Since the data set and the number of parameters are quite large, it would be a waste of resources to have to run both sampling and optimizing.

Thank you.

In my opinion you should use .sampling() and not try to find the MLE.

Okay, but I really need the MLE along with the sampling. For example, I always want to see whether the MLE is included in the 68% contour. After reading your comments, it seems there is no easy way of getting the MLE from sampling; I wanted to confirm whether I am correct about that.
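
A minimal sketch of that check, assuming the PyStan 2.x API (note these are per-parameter marginal 68% intervals, not a joint contour):

point = sm.optimizing(data=data)                  # point estimate, e.g. point['a']
draws = fit.extract(permuted=True)['a']           # posterior draws of a, shape (n_draws, NP)
lo, hi = np.percentile(draws, [16, 84], axis=0)   # marginal 68% credible limits per parameter
print((point['a'] >= lo) & (point['a'] <= hi))    # True where the point estimate falls inside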

BTW, in some cases I am finding that pystan.StanModel.optimizing(data=data) does not give me any sensible result, whereas the sampling works reasonably well. So I absolutely need the MLE from the sampling result. I wonder why it is not calculated automatically; it should be one of the most important numbers, at least in physics.

What do you mean by “not any sensible result”? The MLE is not guaranteed to exist, so it’s possible that your model simply doesn’t have one. If posterior sampling works fine, then the MLE should be irrelevant.

You could take the posterior draw with the highest lp__ as “the closest approximation” of the MLE, but any posterior draw would be just as representative.
The maximum lp__ among the sampling draws can be very far from the maximum lp__ found by optimization. For example, take a 100-dimensional IID standard normal posterior. The MLE is the all-zeros vector, yet in a typical posterior draw half the components are positive and half negative, and only a few are near zero. Draws where all components are near zero are extremely rare.
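
As a concrete sketch of the first option (assuming the PyStan 2.x extract API):

samples = fit.extract(permuted=True)   # dict of arrays, one entry per parameter plus 'lp__'
best = np.argmax(samples['lp__'])      # index of the draw with the highest log posterior
print(samples['a'][best])              # parameter values at that draw (at best a rough stand-in for the mode)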

I think there is some mix-up between concepts from frequentist statistics and Bayesian statistics. In the frequentist world you would be interested in the MLE and then calculate confidence intervals around it. In the Bayesian world you don’t try to find the optimal value of the likelihood; you sample from the joint posterior distribution (and those samples concentrate in the typical set).

So after you get the samples (which are an approximation of the posterior distribution) you can calculate the mean/median/quantiles and their accuracy. The credible interval you are interested in can be calculated with a suitable method (e.g. quantiles).
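
For example, central 68% and 95% intervals for a[3] can be read directly off the draws (a sketch assuming the PyStan 2.x extract API):

a3 = fit.extract(permuted=True)['a'][:, 2]   # draws of a[3] (third column; Python indexing is 0-based)
print(np.percentile(a3, [16, 84]))           # 68% central credible interval
print(np.percentile(a3, [2.5, 97.5]))        # 95% central credible interval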

See https://betanalpha.github.io/assets/case_studies/probability_theory.html


I am seeing that when there are many parameters, pystan.StanModel.optimizing does not explore the parameter space well and hence does not give the correct MLE. On the other hand, sampling captures information from the whole parameter space. Therefore, the latter should be trusted more in this scenario.

I could not understand the second part of your answer:

…but any posterior draw would be just as representative.
The maximum lp__ among the sampling draws can be very far from the maximum lp__ found by optimization. For example, take a 100-dimensional IID standard normal posterior. The MLE is the all-zeros vector, yet in a typical posterior draw half the components are positive and half negative, and only a few are near zero. Draws where all components are near zero are extremely rare.

Sorry for the beginner question: would it not make sense to do variational inference then, since it can be seen as an extension of the expectation-maximization algorithm used for MAP estimation (and hence MLE estimation)? I am asking because I have never worked with variational inference (I have had no chance to), but it is on my agenda.

Thank you for clarifying.

If the optimizer does not travel through the parameter space, that is a problem with the optimizer rather than with the MLE. For example, if you changed the initial conditions, would that improve your outcome?
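
For example, one could rerun the optimizer from several random initializations and keep the run with the best objective value (a sketch assuming the PyStan 2.x optimizing signature; with as_vector=False the result should contain 'par' and 'value' entries):

best = None
for s in range(1, 11):                                   # ten different random initializations
    opt = sm.optimizing(data=data, seed=s, init='random', as_vector=False)
    if best is None or opt['value'] > best['value']:     # 'value' is the log density at the optimum
        best = opt
print(best['par']['a'], best['value'])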

No and yes. It depends.

I would say (and this is my opinion):

- Always use dynamic HMC (NUTS); if it doesn’t work, find out why.
- Large data regime: use variational inference if NUTS with “optimal” code doesn’t work (too slow or some other reason).
- Very large data: use optimization and other diagnostic tools.

And it all depends on what you are trying to do and how complex your model is.

Edit: also, VB and optimization don’t look for the same thing.
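
If you want to try ADVI on the toy model above, a minimal sketch with the PyStan 2.x vb method (the structure of the returned dictionary varies between PyStan versions, so inspect its keys before relying on them):

vb_fit = sm.vb(data=data)   # ADVI approximation to the posterior
print(vb_fit.keys())        # inspect what this PyStan version actually returns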


Most likely. What I found is that optimizing does not visit ‘all’ points in a large parameter space, and sometimes reports the MLE at a point on the boundary. This should not be the case, but it is not surprising since I have no idea how optimizing works. It certainly does not do sampling. Things are really unclear here.

From a beginner’s point of view, I think there is a lot of confusion/misunderstanding between statisticians and many physicists. In physics, the point in parameter space where the probability (exp(-chi^2/2) for Gaussian noise) is maximal carries huge importance; we generally quote 68% confidence limits around that point. I am very surprised that Stan does not calculate this automatically from the sampling.

Probably it is a good idea to read some textbooks (for example, you may check Richard McElreath’s Statistical Rethinking; he also comments on the culture of reporting arbitrary CI percentages in one of his chapters). The MLE is found by an optimizer (such as R’s optim, or L-BFGS in Stan’s case), which searches for the maximum of the likelihood; it is definitely not about sampling.

Thanks a lot for the clarification; that was also somewhat my feeling about VB.
