Experiment for Latent Gaussian Models

Nothing urgent, but maybe someone has some comments on the set up.

In an initial experiment, proposed by @anon75146577 here (Algebraic solver problems (Laplace approximation in Stan)), we had a Poisson latent Gaussian model:

\phi \sim \mathrm{Normal}(0, \lambda) \\ \theta_i \sim \mathrm{Normal}(0, \Sigma_\phi) \\ y_{j \in g(i)} \sim \mathrm{Poisson}(e^{\theta_i})

where \Sigma_\phi was a diagonal covariance matrix with ith diagonal entry \sigma_i^2 = \phi^2. Now, as proposed by @avehtari, we want to generalize the problem a bit and fit a model with a dense covariance matrix. So how do we generate this covariance matrix?

My proposal is to use as diagonal elements \sigma_i^2 = \phi^2 + \epsilon_i, where \epsilon_i is a small random perturbation, and for the off-diagonal elements use:

\sigma_{ij} = \rho \times \sqrt{\sigma_i^2 \sigma_j^2}

where \rho is a correlation coefficient, which will vary during the experiment. Ok, should I add another perturbation to \sigma_{ij}? Is it ok for \Sigma_\phi to be a stochastic function of \phi?
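
In Stan, the construction I have in mind would look roughly like this (just a sketch; M, rho, and the perturbations eps are assumed to be supplied as data, and all names are illustrative):

data {
  int<lower=1> M;                       // number of local parameters theta
  real<lower=-1, upper=1> rho;          // correlation; needs rho > -1/(M-1) for positive definiteness
  vector<lower=0>[M] eps;               // fixed random perturbations of the variances
}
parameters {
  real<lower=0> phi;                    // global scale parameter
}
transformed parameters {
  cov_matrix[M] Sigma_phi;
  {
    vector[M] sigma2 = square(phi) + eps;   // diagonal: phi^2 + eps_i
    for (i in 1:M) {
      Sigma_phi[i, i] = sigma2[i];
      for (j in 1:(i - 1)) {
        // off-diagonal: rho * sqrt(sigma_i^2 * sigma_j^2)
        Sigma_phi[i, j] = rho * sqrt(sigma2[i] * sigma2[j]);
        Sigma_phi[j, i] = Sigma_phi[i, j];
      }
    }
  }
}
// ... then, e.g., theta ~ multi_normal(rep_vector(0.0, M), Sigma_phi) in the model block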

For the effective sample size experiments in N_eff BDA3 vs. Stan, I used a simple version:

transformed data {
  // D (dimension) and s (constant correlation) are declared in the data block
  vector[D] zero = rep_vector(0.0, D);
  cov_matrix[D] S;              // unit variances, constant correlation s
  cholesky_factor_cov[D] L;
  for (i in 1:D) {
    for (j in 1:(i - 1)) {
      S[i, j] = s;
      S[j, i] = s;
    }
    S[i, i] = 1.0;
  }
  L = cholesky_decompose(S);
}

but what you propose would work, too. I guess you could use the LKJ construction if you want to generate random correlation matrices, or you could use cov_exp_quad with random x and a varying lengthscale (increasing the lengthscale increases the correlation):

matrix[N, N] cov = cov_exp_quad(X, 1.0, lengthscale) + diag_matrix(rep_vector(square(sigma), N));

with some small sigma
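
Expanded into a full block it would be roughly this (a sketch; N and lengthscale come from the data block, and the random X is drawn with an RNG, which is allowed in transformed data in recent Stan versions):

transformed data {
  real sigma = 1e-3;                    // small nugget to keep the Cholesky stable
  real X[N];
  matrix[N, N] cov;
  matrix[N, N] L;
  for (n in 1:N)
    X[n] = normal_rng(0, 1);            // random inputs
  cov = cov_exp_quad(X, 1.0, lengthscale)
        + diag_matrix(rep_vector(square(sigma), N));
  L = cholesky_decompose(cov);
}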

Ok, so it looks like we have multiple choices.

I think the most important thing is to figure out how the global parameter \phi drives the covariance matrix of the local parameters, \theta. In my original proposal, \phi contains the average variance and the average correlation.

Would it make sense to use a scientific example for the case study? I’ll try to read up on specific applications of the Laplace approximation, but let me know if you have a preferred example.

Just to refine my comment: in applications of latent Gaussian models, do we usually specify the covariance matrix or its Cholesky decomposition in terms of the global parameters?

The covariance matrix can be a deterministic function of \phi, and it most often is.

It's better to first have a working simulation example with a speedup, where you can show the performance with respect to increasing correlation. You could make a plot like the one in N_eff BDA3 vs. Stan - #21 by avehtari; just use that same multivariate normal as the latent function, e.g., for the log-intensity of a Poisson.

You can generate Poisson-distributed data from your model, and you can check how the speed changes when you vary the parameters of the data-generating model.
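
For example, a small data-generating program could look like this (a sketch, run with algorithm=fixed_param; the names and the constant-correlation covariance are just placeholders):

data {
  int<lower=1> M;                       // number of latent variables (groups)
  int<lower=1> n_obs;                   // observations per group
  real<lower=0> phi;                    // marginal standard deviation of theta
  real<lower=0, upper=1> rho;           // keep rho < 1 so Sigma stays positive definite
}
transformed data {
  cov_matrix[M] Sigma;
  for (i in 1:M) {
    Sigma[i, i] = square(phi);
    for (j in 1:(i - 1)) {
      Sigma[i, j] = rho * square(phi);
      Sigma[j, i] = Sigma[i, j];
    }
  }
}
model {
  // empty: run with algorithm=fixed_param
}
generated quantities {
  // latent log-intensities and Poisson counts
  vector[M] theta = multi_normal_rng(rep_vector(0.0, M), Sigma);
  int y[M, n_obs];
  for (i in 1:M)
    for (j in 1:n_obs)
      y[i, j] = poisson_log_rng(theta[i]);
}

Varying rho (and phi) in the data then lets you plot speed or effective sample size against the correlation, as in the linked thread.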

Hey @charlesm93 -

I have a few models that use latent Gaussians that I’d like to see the Laplace approximation applied to, once the C++ code is working well enough. They include, but are not limited to: classification (but this is in the manual), time series, and survival. I can send over the code if you’re interested. The Poisson model has already been used in Betancourt’s case study and in Rob’s case study, so I’d like to see some diversity. For more information on GPs in survival analysis, Alan Saul’s dissertation is excellent and a great read.

Please take Aki’s suggestion, but I’m extremely interested in scientific examples too, and a survival model might make a good example. (We can use it on a problem that’s already been done in GPstuff, to verify the results are accurate. I have most of this work done, and I can send over the code. I left off with the latent AFT survival component almost identical to the GPstuff version, but my latent function was jagged, and I was trying to impose constraints and combine kernels to see if I could extract the smooth component.)

And I think the Laplace approximation is a general approximation method that can be applied to a wide variety of models, and it works particularly well for GPs.

We can have a dense covariance matrix with the covariance function describing the properties of the function (e.g. Matérn 5/2 is less smooth than the squared exponential). You generate the matrix with the covariance function; I haven’t put time into your proposed kernel. But when we use a well-known covariance function, we can see how well it performs on other well-known models.
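
For example, a Matérn 5/2 covariance matrix can be built by hand in a functions block if needed (a rough sketch for 1-D inputs; the function name is illustrative):

functions {
  // Matern 5/2 kernel: k(d) = alpha^2 * (1 + r + r^2 / 3) * exp(-r), with r = sqrt(5) * d / l
  matrix matern52_cov(real[] x, real alpha, real l) {
    int N = size(x);
    matrix[N, N] K;
    for (i in 1:N) {
      K[i, i] = square(alpha);
      for (j in 1:(i - 1)) {
        real r = sqrt(5.0) * fabs(x[i] - x[j]) / l;
        K[i, j] = square(alpha) * (1 + r + square(r) / 3.0) * exp(-r);
        K[j, i] = K[i, j];
      }
    }
    return K;
  }
}

and then compared against cov_exp_quad on the same simulated data.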

Although, admittedly, I’m blatantly ignoring the prior specification choices that can reduce sampling pathologies, as described in Betancourt’s case studies (i.e. the modeling in my case studies is trash), the following document might help you conceptualize the problem better. I’m aware of the problems and eager to fix them; it’s just not a priority right now:


@anon79882417 Thanks for all the great input. I’ll continue focusing on simulated examples, which seems to be the general consensus about what should be done first, and then I’ll dive into working with a few scientific examples, possibly some of the models you proposed.

I’ll post my progress on this thread.

Yeah, I wasn’t implying you shouldn’t recover parameters from simulated data, as this is key in generative modeling. It’s just not traditional, at least to my knowledge, to generate your own covariance matrices unless you have a specific reason for doing so, such as using the Matérn class of covariance functions, of which the exponentiated quadratic Aki recommends is a limiting case.

Getting excited to see this in action, thanks for your effort

Small point of confusion. When approximating p(\theta | y, \phi), we wish to match the mode and the curvature. According to @anon75146577, we can show that the Hessian of p(\theta | y, \phi) at \theta = \theta^* is:

\mathcal H = \Sigma^{-1} + H

where H_{ij} = \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p(\theta | y, \phi), evaluated at \theta^*. We then have:

p(\theta | y, \phi) \approx \mathrm{Normal}(\theta^*, \mathcal H^{-1})

From this, we can do a few calculations, as done in the attached file: MathAppendix.pdf (139.3 KB)

However, in the simple case where \Sigma is diagonal with each diagonal element being \sigma_\phi, the resulting log posterior is:

\log p(\phi | y) \approx \log p(\phi) + \log p(y | \theta^*, \phi) - \frac{1}{2} \left (M \log \sigma^2 + \sum_{i = 1}^M (\log m_i + \theta_i^*) + \theta^{*T} \sigma^{-1}_\phi \theta^* \right)

which is inconsistent with the example Stan code, attached laplace_dan.stan (2.0 KB).

We match the code if we use \mathcal H = H, i.e. without adding \Sigma^{-1}. In the case where we have a dense covariance matrix, using \mathcal H = H also works better (otherwise the model gets stuck in a never-ending sequence of rejected Metropolis proposals). But \mathcal H = H does not seem quite correct. I’ve tried working out the result, but right now it’s more algebra than I think should be needed. I thought I’d check whether I’m missing something basic.


Ok, here’s the answer to my own question.

For f the density of a normal \mathrm{N}(\mu, \Sigma), we indeed have that the Hessian of \log f satisfies H_f = - \Sigma^{-1}.

Now let g be the log density we want to approximate. Because we approximate the distribution with a Gaussian with covariance \Sigma_{\text{approx}}, we also make the approximation \Sigma_{\text{approx}}^{-1} \approx - H_g. The precision \Sigma_{\text{approx}}^{-1} gives us minus the curvature of the approximating Gaussian, so setting it equal to - H_g amounts to matching the curvature of the approximation to that of the real distribution.

Note further that the curvature of the real function is evaluated at its mode. For a Gaussian the curvature is constant, but in general this is not the case. The point is that, when doing a Laplace approximation, we focus on the mode of the distribution we approximate, which is why we match the modes and the curvature at the mode.
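
Spelling this out for the latent Gaussian case also reconciles the two forms above. Since

\log p(\theta | y, \phi) = \log p(y | \theta, \phi) + \log p(\theta | \phi) + \text{const},

taking second derivatives with respect to \theta gives

H_g = H_{\text{lik}} - \Sigma^{-1},

where H_{\text{lik}} is the Hessian of the log likelihood \log p(y | \theta, \phi). Matching curvatures at \theta^* then gives the precision of the approximation as

\Sigma_{\text{approx}}^{-1} = - H_g = \Sigma^{-1} - H_{\text{lik}},

so the earlier \mathcal H = \Sigma^{-1} + H expression is the same thing, provided H denotes minus the Hessian of the log likelihood rather than the Hessian of the full log posterior.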
