Hi!
Does anyone know whether the Hessian matrix returned by the optimizing function in rstan is the Hessian of the posterior or of the log-posterior distribution?
Regards
Robin
I’ll hazard a guess. The objective function of the optimization is in log space. I’m almost certain of this because Stan’s distribution functions assume log space and numerical stability requires it. The gradient is computed from that objective function, so it is the gradient of the log objective, and the Hessian is effectively the gradient of the gradient, so it too is based on the log space.
Mode and Hessian are for the log-density. A quadratic approximation using the mode and Hessian of the log-density corresponds to a Gaussian approximation of the density (see, e.g., Ch 4 in BDA3). Note also that the approximation is made in the unconstrained space. RStan draws from the approximation, computes importance weights with respect to the true density (used for diagnostics), and transforms the draws to the constrained space.
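For concreteness, here is a minimal rstan sketch (the model and data are made up for illustration) showing where the mode and the Hessian of the log-density appear in the output of optimizing:

```r
library(rstan)

# Toy model with one constrained parameter (sigma > 0), just for illustration.
model_code <- "
data { int<lower=0> N; vector[N] y; }
parameters { real mu; real<lower=0> sigma; }
model { y ~ normal(mu, sigma); }
"
sm  <- stan_model(model_code = model_code)
fit <- optimizing(sm,
                  data    = list(N = 20, y = rnorm(20)),
                  hessian = TRUE,   # Hessian of the log-density in the unconstrained space
                  draws   = 1000)   # draws from the Gaussian approximation

fit$par       # point estimate, reported on the constrained scale
fit$hessian   # Hessian of the log-density at the mode (unconstrained parameterization)

# Covariance of the Gaussian approximation is the inverse negative Hessian:
Sigma_approx <- solve(-fit$hessian)
```

(If I recall correctly, in recent rstan versions the draws themselves are returned in the theta_tilde element of the result, already transformed to the constrained space.)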
Ok, I’ll put it differently. Is there a possibility that the algorithm finds a point estimate in the unconstrained space? Or does the algorithm only search for an estimate in the constrained space?
I’m not able to parse this sentence and there are too many alternatives to guess. Is there a word missing?
No, the algorithm works in the unconstrained space because it’s easier there (to put it simply), and when it finds a solution it transforms it back into the original space, and that’s what you get at the end.
A point estimate in the unconstrained space is also a point estimate in the constrained space respecting the constraints. A mode in the unconstrained space is not necessarily a mode in the constrained space.
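To illustrate that last point with a small standalone example (not rstan code, just a change of variables): take the density p(sigma) = exp(-sigma) on sigma > 0, whose mode in the constrained space is at sigma = 0. On the unconstrained scale theta = log(sigma), the log-density including the Jacobian term is -exp(theta) + theta, which is maximized at theta = 0, i.e. at sigma = 1.

```r
# Mode of the unconstrained log-density -exp(theta) + theta (includes the
# log-Jacobian term theta from the change of variables sigma = exp(theta)).
theta_hat <- optimize(function(theta) -exp(theta) + theta,
                      interval = c(-10, 10), maximum = TRUE)$maximum
exp(theta_hat)  # about 1, whereas the mode of p(sigma) = exp(-sigma) is at sigma = 0
```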
Ok, thanks. It seems to me that the L-BFGS algorithm and the other two are algorithms specifically for unconstrained optimization, so that makes sense.
Unfortunately I have some more questions:
- The description of the argument draws of the optimizing function says: “(…) how many times to draw from a multivariate normal distribution whose parameters are the mean vector and the inverse negative Hessian in the unconstrained space.” What is meant here by “mean vector”? The mean of the joint posterior density? Or maybe the point estimate (mode) which was calculated? I think the parameters for the normal approximation are the mode and the inverse of the observed information at the mode.
- Which type of transformation is used to transform the draws to the constrained space?
The mean of the approximating normal is set to the mode of the log-density. In the multidimensional case the mean is vector-valued.
The mean of the normal distribution is also the mode of the normal distribution, but it is more common to say that the parameters of the normal distribution are the mean and the covariance.
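Putting the two answers together, what the draws argument does is conceptually something like the following sketch (mode_unc and H are hypothetical placeholders standing in for the mode and Hessian that rstan computes internally in the unconstrained space; this is not rstan’s actual code):

```r
library(MASS)

mode_unc <- c(0.3, -1.2)                       # placeholder: mode in the unconstrained space
H        <- matrix(c(-4, 0.5, 0.5, -2), 2, 2)  # placeholder: Hessian of the log-density there

Sigma     <- solve(-H)                         # covariance = inverse negative Hessian
draws_unc <- mvrnorm(n = 1000, mu = mode_unc, Sigma = Sigma)

# rstan then transforms each draw back to the constrained space and computes
# importance weights against the true density for the diagnostics mentioned above.
```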