Dear all,
I generated data X from a hierarchical model using some true parameter \theta^*, then fitted the same hierarchical model with PyStan on the generated data X.
I initialized the starting parameter (\theta_{stan}) to the true parameter \theta^*, but after the PyStan model converged, the parameter had moved to some wrong value \theta_{stan}.
When I compute the log probability with the fit.log_prob() function, I find that the true parameter \theta^* indeed has a higher log prob, yet the model still converges to the lower-log-prob parameter \theta_{stan}.
Does anyone know why? It seems that if PyStan uses log_prob as its maximization objective, it should find some \theta_{stan} close to the true \theta^*.
Here is the convergence diagnosis; the true sigma2 is 0.0155868 and the true v[0] is 1.76456398.
Here is the log_prob comparison
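Roughly, the comparison was computed like this (a sketch; fit is the PyStan sampling fit, and theta_star_dict / theta_stan_dict are hypothetical dicts holding the constrained parameter values):

# fit.log_prob() works on the *unconstrained* scale, so the constrained
# parameter dicts are first mapped through fit.unconstrain_pars().
upar_star = fit.unconstrain_pars(theta_star_dict)   # true theta*
upar_stan = fit.unconstrain_pars(theta_stan_dict)   # fitted theta_stan

lp_star = fit.log_prob(upar_star)
lp_stan = fit.log_prob(upar_stan)
print("log_prob at theta*:    ", lp_star)
print("log_prob at theta_stan:", lp_stan)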
Hi,
HMC/NUTS do not maximize log_prob, but our optimize method does.
See for example 15.1 Hamiltonian Monte Carlo | Stan Reference Manual
and
tl;dr
The typical set (at some level of coverage) is the set of parameter values for which the log density (the target function) is close to its expected value. As has been much discussed, this is not the same as the posterior mode. In a d-dimensional unit normal distribution with a high value of d, the 99% typical set looks like an annulus or sphere or doughnut, corresponding to the points whose distance from the origin is approximately \sqrt{d}. This intuition is important because it helps un…
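A quick numpy sketch of this effect (sizes are illustrative): in d = 100 dimensions, draws from a unit normal concentrate at radius about \sqrt{d} = 10, where the log density is far below its value at the mode.

import numpy as np

d, n = 100, 10000
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))        # draws from a d-dimensional unit normal

radii = np.linalg.norm(x, axis=1)
print(radii.mean())                    # ~ sqrt(d) = 10, not 0

# Unnormalized log density -0.5 * ||x||^2:
print((-0.5 * radii**2).mean())        # ~ -d/2 = -50 for typical draws
print(0.0)                             # log density at the mode (origin)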
Thanks for the reply. I just tried sm.optimizing(data=…, init=init_list),
but it still seems to find the sub-optimal point. Isn't that weird, given that \theta^* is my init_list?
Sorry, it's hard to tell without the model (Stan code).
How did you define the init?
Sure,
here is my Stan code:
ppca_code = """
data {
int D; //number of dimensions
int N; //number of data
int Q; //number of principle components
vector[D] x[N]; //data
real a_vj; // w_j prior
real epsilon;// w_j mean
real a_sigma2; // sigma2 prior
real beta_sigma2;// sigma2 mean
}
transformed data {
matrix[D,D] S;
S = x[1] * x[1]';
for (n in 2:N){
S += x[n] * x[n]';
}
S = S/N;
}
parameters {
ordered[Q] v; // v_j
real<lower=0> sigma2; //data sigma2
matrix[Q,D] W; //projection matrix
}
model {
matrix[D,D] C; //covaraince matrix
for(j in 1:Q){
v[j] ~ inv_gamma(a_vj, epsilon * (a_vj -1));
W[j] ~ multi_normal(rep_vector(0,D), v[j] * diag_matrix(rep_vector(1, D)));
}
sigma2 ~ inv_gamma(a_sigma2, beta_sigma2);
C = crossprod(W)+ sigma2 * diag_matrix(rep_vector(1, D));
target += - N/2 *(log_determinant (C) + trace( C\S));
}
generated quantities {
vector[D] y_hat[N]; //predictive
for (n in 1:N) {
y_hat[n] = multi_normal_rng(rep_vector(0,D), crossprod(W)+ sigma2 * diag_matrix(rep_vector(1, D)));
}
}
"""
And here is how I initialize the parameters:
init_list = []
for i_ in range(n_chains):
    temp_dict = {
        'v': sorted(v_star_list),
        'sigma2': sigma2_star,
        'W': W_star.T
    }
    init_list.append(temp_dict)
And after I ran
fit_standard = sm.optimizing(data=ppca_dat_standard, init=init_list)
fit_standard found values different from my init_list, even though the init is the true \theta^* that has the higher log_prob.
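For reference, the same call with the init passed as a single dict, which (as far as I can tell from the PyStan 2 API) is what optimizing expects, since there are no chains:

# optimizing performs one optimization run, so init is a single dict
# rather than a per-chain list (my reading of the PyStan 2 API).
fit_standard = sm.optimizing(data=ppca_dat_standard, init=init_list[0])
print(fit_standard['sigma2'])   # point estimates, keyed by parameter name
print(fit_standard['v'])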
btw, I just wanted to be a bit more specific about optimize:
Technically, our optimizer finds penalized maximum likelihood estimates. They're not MAP estimates because we throw away the Jacobian in the implicit priors on the transformed parameters. We'll probably be adding a proper MAP estimator soon.
Andrew likes to think of all these methods as approximate Bayes. The penalized max likelihood gives you a Laplace approximation from which you can gauge uncertainty in the same way as variational inference, with the difference being centering on the …