Saving the log-likelihood for each observation when using a marginalized likelihood


Hello everyone, I’m looking for feedback on whether I’m correctly computing and storing the log-likelihood for each observation, for a model where the likelihood is obtained by integrating a parameter out of a normal distribution.


I have a dataset where for each z_i (zcmb in the code) there is a corresponding value of m_i (mb in the code), where the latter has an associated error \sigma_i (dmb in the code).
The total number of observations is denoted by N.


This model has two parameters, \Omega_m and \mathcal{M}, where the latter will be marginalized out of the likelihood to avoid a degeneracy between them.

Starting with a normal likelihood which takes the form

\mathcal{L} = \prod\limits_{i=1}^{N} \frac{1}{\sqrt{2\pi} \sigma_i} \,\, \text{exp}\left(-\frac{1}{2} \left( \frac{m_i - m(z_i)}{\sigma_i} \right)^2\right) ,

where m(z_i) is given by

m(z_i) = \mathcal{M} + 5\log{D_L(z_i)}

where D_L(z_i) can be computed directly from the parameter \Omega_m, for a given z_i as

D_L(z_i) = (1 + z_i) \int_0^{z_i} (\Omega_m(1+x)^3 + 1 - \Omega_m)^{-1/2} dx .
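As a quick sanity check outside Stan, this integral can be evaluated with standard quadrature. The sketch below uses scipy; the values of z and \Omega_m are arbitrary examples, not values from my dataset:

```python
from scipy.integrate import quad

def D_L(z, Omega_m):
    """Dimensionless luminosity distance of a flat universe:
    D_L(z) = (1 + z) * int_0^z (Omega_m (1+x)^3 + 1 - Omega_m)^(-1/2) dx."""
    integral, _ = quad(lambda x: (Omega_m * (1.0 + x)**3 + 1.0 - Omega_m)**-0.5,
                       0.0, z)
    return (1.0 + z) * integral

# Sanity checks: D_L(0) = 0, and D_L(z) ~ z for small z
print(D_L(0.0, 0.3))             # 0.0
print(round(D_L(0.01, 0.3), 3))  # 0.01
```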

We now marginalize the parameter \mathcal{M}, by doing

\mathcal{L}' = \int_{-\infty}^\infty \mathcal{L} d \mathcal{M}

which, dropping constant factors that do not depend on the parameters, yields a new likelihood of the form

\mathcal{L}' = \left( \prod\limits_{i=1}^{N} \frac{1}{\sigma_i} \right) \frac{e^{-\frac{1}{2}\left(A - B^2/C\right)}}{\sqrt{C}},


where

A \equiv \sum\limits_{i=1}^{N} \frac{\Delta^2_i}{\sigma^2_i},
B \equiv \sum\limits_{i=1}^{N} \frac{\Delta_i}{\sigma^2_i},
C \equiv \sum\limits_{i=1}^{N} \frac{1}{\sigma^2_i},

and \Delta_i is an auxiliary variable defined as

\Delta_i \equiv m_i - 5\log{D_L(z_i )}.

Computing the natural logarithm of the likelihood yields

\log{\mathcal{L}}' = - \sum\limits_{i=1}^{N} \log{\sigma_i} - \frac{1}{2} \left( A - \frac{B^2}{C} \right) - \frac{1}{2} \log{C} .
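One way to gain confidence in these middle steps is to compare the closed form against a brute-force numerical integration over \mathcal{M}. The sketch below does this in Python with toy stand-ins for \Delta_i and \sigma_i (all constants are kept here so the two sides agree exactly):

```python
import numpy as np
from scipy.integrate import quad

# Toy stand-ins for Delta_i = m_i - 5 log10 D_L(z_i) and sigma_i (assumed values)
rng = np.random.default_rng(0)
Delta = rng.normal(5.0, 0.5, size=6)
sigma = rng.uniform(0.1, 0.3, size=6)
N = len(Delta)

A = np.sum((Delta / sigma)**2)
B = np.sum(Delta / sigma**2)
C = np.sum(1.0 / sigma**2)

# Closed form with all constants kept:
# log L' = -N/2 log(2 pi) - sum_i log sigma_i - (A - B^2/C)/2 + log(2 pi / C)/2
log_closed = (-0.5 * N * np.log(2 * np.pi) - np.sum(np.log(sigma))
              - 0.5 * (A - B**2 / C) + 0.5 * np.log(2 * np.pi / C))

# Brute force: integrate the unmarginalized likelihood over M around its peak
def likelihood(M):
    return np.prod(np.exp(-0.5 * ((Delta - M) / sigma)**2)
                   / (np.sqrt(2 * np.pi) * sigma))

M_hat = B / C                   # location of the peak in M
half_width = 10.0 / np.sqrt(C)  # +/- 10 standard deviations is plenty
integral, _ = quad(likelihood, M_hat - half_width, M_hat + half_width)
log_numeric = np.log(integral)

print(np.isclose(log_closed, log_numeric))  # True
```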

I’m fairly confident that all of the intermediate steps are correct, as I did them a few times and they match results I found in a paper; besides, they aren’t that hard to perform.

In Stan, the model is coded as follows:

functions {
  real integrand(real x, real xc, array[] real theta, array[] real x_r, array[] int x_i) {
    real Omega_m = theta[1];
    return ( Omega_m*(1.0+x)^3 + (1.0-Omega_m) )^(-0.5);
  }
}

data {
  int<lower=1> N1;
  array[N1] real zcmb;
  array[N1] real mb;
  array[N1] real dmb;
}

transformed data {
  // integrate_1d requires these data arrays even though they are empty here
  array[0] real x_r;
  array[0] int x_i;
}

parameters {
  real<lower=0, upper=1> Omega_m;
}

transformed parameters {
  real DL;
  real Delta;
  real A = 0;
  real B = 0;
  real C = 0;

  for (i in 1:N1) {
    // 5 log10 of the luminosity distance, via numerical integration
    DL = 5.0*log10((1.0+zcmb[i]) * integrate_1d(integrand, 0, zcmb[i], {Omega_m}, x_r, x_i));
    Delta = mb[i] - DL;
    A += (Delta/dmb[i])^2;
    B += Delta/dmb[i]^2;
    C += dmb[i]^(-2);
  }
}

model {
  // prior
  Omega_m ~ normal(0.284, 10);

  // likelihood for the SNIa
  // (constant factors which do not depend on the parameters have been dropped)
  target += -0.5*(A - B^2/C);
}


Because I’m looking to apply model selection criteria using PSIS-LOO-CV, I must save the log-likelihood for each observation in the generated quantities block. In my previous, simpler models I could just use normal_lpdf, but now I have a bunch of other terms.

My solution, which I’m not very confident with, was to write it as follows:

generated quantities {
  vector[N1] log_lik;
  for (i in 1:N1) {
    log_lik[i] = -log(dmb[i]) - 0.5*(A - B^2/C) - 0.5*log(C);
  }
}
After running the full Stan model (the two previous snippets joined together), PSIS-LOO-CV as implemented in ArviZ gave me the following output:

Computed from 10000 posterior samples and 40 observations log-likelihood matrix.

         Estimate       SE
elpd_loo -1123.98     1.76
p_loo       38.81        -

There has been a warning during the calculation. Please check the results.

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)        0    0.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)        40  100.0%
   (1, Inf)   (very bad)    0    0.0%

Pareto k data for each observation (ordered):
[0.9539276336319557, 0.9539276336319557, 0.953927633631959, 0.953927633631961, 0.9539276336319588, 0.9539276336319557, 0.953927633631959, 0.9539276336319612, 0.9539276336319557, 0.9539276336319656, 0.9539276336319557, 0.9539276336319612, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.953927633631961, 0.9539276336319557, 0.9539276336319609, 0.9539276336319557, 0.9539276336319557, 0.953927633631959, 0.953927633631959, 0.9539276336319557, 0.953927633631961, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319588, 0.9539276336319586, 0.9539276336319557, 0.953927633631961, 0.9539276336319613, 0.9539276336319557, 0.953927633631961, 0.953927633631961, 0.9539276336319658, 0.9539276336319557]

The previous output clearly indicates that the model is misspecified.
However, the best fit is just the value I expected, and it actually fits the data well (once you provide a value of \mathcal{M} obtained by finding the minimum of the unmarginalized likelihood \mathcal{L}).

Plotting the values of the \hat{k}'s reveals:

[plot omitted: \hat{k} for each observation]

Notice the scale and how nearly constant these values are.
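A possible way to see what is going on numerically (a sketch with toy stand-ins for \Delta_i and \sigma_i, not my actual data): summing the per-observation terms from the generated quantities block does not recover \log{\mathcal{L}'}, because the shared term -\frac{1}{2}(A - B^2/C) - \frac{1}{2}\log{C} is counted N times. The terms also differ between observations only through the parameter-independent -\log{\sigma_i}, which would explain why every observation ends up with an essentially identical \hat{k}:

```python
import numpy as np

# Toy stand-ins for Delta_i and sigma_i (assumed values)
rng = np.random.default_rng(1)
Delta = rng.normal(5.0, 0.5, size=6)
sigma = rng.uniform(0.1, 0.3, size=6)
N = len(Delta)

A = np.sum((Delta / sigma)**2)
B = np.sum(Delta / sigma**2)
C = np.sum(1.0 / sigma**2)

# Per-observation terms as written in generated quantities
log_lik = -np.log(sigma) - 0.5 * (A - B**2 / C) - 0.5 * np.log(C)

# The actual marginalized log-likelihood (same constants dropped as in the model block)
log_L = -np.sum(np.log(sigma)) - 0.5 * (A - B**2 / C) - 0.5 * np.log(C)

# The sum over-counts the shared term (N - 1) times
shared = -0.5 * (A - B**2 / C) - 0.5 * np.log(C)
print(np.isclose(log_lik.sum() - log_L, (N - 1) * shared))  # True
```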


Clearly I’m doing something wrong here, so I would like to ask you the following question:

  • How does one save the log-likelihood for each observation when the likelihood has factors which depend on all of the observations?

Further insight on the usage of PSIS-LOO-CV, as well as feedback on the question itself, is always greatly appreciated!

Hello everyone, I believe I have found a solution to my very specific problem, and I’m putting it here in the hope that somebody comes along and provides some feedback on it, and/or that it can be used by somebody in the future with a similar problem to mine.
Naturally, I’m far from an expert on the matter at hand and this should be read with caution.

Before the marginalization, the exponent of the exponential in the Gaussian likelihood (multiplied by -2), which I will refer to as \chi^2, was of the form
\chi^2 = A - 2\mathcal{M}B + C\mathcal{M}^2 \,,

where the minimum is located at

\mathcal{M} = B/C \,.

Substituting this minimum into the previous equation yields the marginalized \chi^2 = A - B^2/C, which intuitively makes sense to me.

Now, if I am to consider the \chi^2 for a single event, I will do so only for A, B and C, but not for \mathcal{M}, which results in

\chi^2_i = A_i - 2\mathcal{M}B_i + C_i\mathcal{M}^2 \,,

where the index i denotes the i-th event (so A_i \equiv \Delta_i^2/\sigma_i^2, and likewise for B_i and C_i) and the value of \mathcal{M} is computed from the minimum written above, using all of the events.
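As a consistency check, a Python sketch with toy stand-ins for \Delta_i and \sigma_i (not the actual Stan code): the per-event \chi^2_i, with the global best-fit \mathcal{M} = B/C plugged in, sum back to the marginalized \chi^2 = A - B^2/C:

```python
import numpy as np

# Toy stand-ins for Delta_i and sigma_i (assumed values)
rng = np.random.default_rng(2)
Delta = rng.normal(5.0, 0.5, size=6)
sigma = rng.uniform(0.1, 0.3, size=6)

A_i = (Delta / sigma)**2
B_i = Delta / sigma**2
C_i = 1.0 / sigma**2
A, B, C = A_i.sum(), B_i.sum(), C_i.sum()

M = B / C  # minimum of the unmarginalized chi^2, using all events

# Per-event chi^2 with the global best-fit M plugged in
chi2_i = A_i - 2.0 * M * B_i + C_i * M**2

# Summing recovers the marginalized chi^2 = A - B^2/C
print(np.isclose(chi2_i.sum(), A - B**2 / C))  # True
```

Each \chi^2_i here is just (\Delta_i - \mathcal{M})^2/\sigma_i^2, so every term is non-negative, as a squared residual should be.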

This works: the elpd statistics look very good, which matches the fact that the fit also looks very good visually (I won’t show the fit here, but trust me, it is always very close to the observations!)

Computed from 10000 posterior samples and 40 observations log-likelihood matrix.

         Estimate       SE
elpd_loo   -25.83     6.22
p_loo        0.93        -

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)       40  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

Further comments, ideas or feedback are always appreciated.
