Introduction
Hello everyone, I’m looking to get some feedback on whether the way that I’m computing and storing the log-likelihood for each observation, for a model where the likelihood is obtained by integrating a parameter out of a normal distribution, is correct.
Dataset
I have a dataset where for each z_i (zcmb
in the code) there is a corresponding value of m_i (mb
in the code) where the later as an associated error of \sigma_i (dmb
in the code).
The total number of observations will be represented by N.
Model
This model has two parameters, \Omega_m and \mathcal{M}, where the later will be marginalized out the likelihood to avoid a degeneracy between them.
Starting with a normal likelihood which takes the form
\mathcal{L} = \prod\limits_{i=1}^{N} \frac{1}{\sqrt{2\pi} \sigma_i} \,\, \text{exp}\left(-\frac{1}{2} \left( \frac{m_i - m(z_i)}{\sigma_i} \right)^2\right) ,
where m(z_i) is given by
m(z_i) = \mathcal{M} - 5\log{D_L(z_i)}
where D_L(z_i) can be computed directly from the parameter \Omega_m, for a given z_i as
D_L(z_i) = (1 + z_i) \int_0^{z_i} (\Omega_m(1+x)^3 + 1 - \Omega_m)^{-1/2} dx .
We now marginalize the parameter \mathcal{M}, by doing
\mathcal{L}' = \int_{-\infty}^\infty \mathcal{L} d \mathcal{M}
which yields a new likelihood of the form
\mathcal{L}' = \prod\limits_{i=1}^{N} \left( \frac{1}{\sigma_i} \right) \frac{e^{-1/2(A - B^2/C)}}{\sqrt{C}},
where
A \equiv \sum\limits_{i=1}^{N} \frac{\Delta^2_i}{\sigma^2_i},
B \equiv \sum\limits_{i=1}^{N} \frac{\Delta_i}{\sigma^2_i},
C \equiv \sum\limits_{i=1}^{N} \frac{1}{\sigma^2_i},
where \Delta_i is merely an auxiliary variable which is defined as
\Delta_i \equiv m_i - 5\log{D_L(z_i )}.
Computing the natural logarithm of the likelihood yields
\log{\mathcal{L}}' = - \sum\limits_{i=1}^{N} \log{\sigma_i} - \frac{1}{2} \left( A - \frac{B^2}{C} \right) - \frac{1}{2} \log{C} .
I’m fairly confident that all of the middle steps are well computed, as I did them a few times and also match with results I have found in a paper. Besides, they aren’t that hard to perform.
In Stan, the model is coded as follows:
functions{
real integrand(real x, real xc, array[] real theta, array[] real x_r, array[] int x_i) {
real Omega_m = theta[1];
return ( Omega_m*(1.0+x)^3 + (1.0-Omega_m) )^(-0.5);
}
}
data {
int N1;
array[N1] real zcmb;
array[N1] real mb;
array[N1] real dmb;
}
transformed data {
// create null data values to give to integrate_1d because it's required
array[0] real x_r;
array[0] int x_i;
}
parameters {
real<lower=0,upper=1> Omega_m;
}
transformed parameters {
real DL;
real Delta;
real A = 0;
real B = 0;
real C = 0;
for (i in 1:N1) {
DL = 5.0*log10((1.0+zcmb[i]) * integrate_1d(integrand, 0, zcmb[i], {Omega_m}, x_r, x_i));
Delta = mb[i] - DL;
A += (Delta/dmb[i])^2;
B += Delta/dmb[i]^2;
C += dmb[i]^(-2);
}
}
model {
// priors
Omega_m ~ normal(0.284, 10);
// likelihood for the SNIa
// already removed factors which do not depend on the parameters
target += -0.5*(A - B^2/C);
}
Results
Because I’m looking to apply model selection criteria using PSIS-LOO-CV, I must save the log-likelihood for each observation in the generated quantities block. Now, what I used to do before on my even simpler models, was just to use a normal_lpdf
, but now, I have a bunch of other terms.
My solution, which I’m not very confident with, was to write it as follows:
generated quantities {
vector[N1] log_lik;
for (i in 1:N1) {
log_lik[i] = -log(dmb[i]) - 0.5*(A - B^2/C) - 0.5*log(C);
}
}
After running the full Stan model (the two snippets before joined together), using PSIS-LOO-CV implemented in Arviz gave me the following output:
Computed from 10000 posterior samples and 40 observations log-likelihood matrix.
Estimate SE
elpd_loo -1123.98 1.76
p_loo 38.81 -
There has been a warning during the calculation. Please check the results.
------
Pareto k diagnostic values:
Count Pct.
(-Inf, 0.5] (good) 0 0.0%
(0.5, 0.7] (ok) 0 0.0%
(0.7, 1] (bad) 40 100.0%
(1, Inf) (very bad) 0 0.0%
Pareto k data for each observation (ordered):
[0.9539276336319557, 0.9539276336319557, 0.953927633631959, 0.953927633631961, 0.9539276336319588, 0.9539276336319557, 0.953927633631959, 0.9539276336319612, 0.9539276336319557, 0.9539276336319656, 0.9539276336319557, 0.9539276336319612, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.953927633631961, 0.9539276336319557, 0.9539276336319609, 0.9539276336319557, 0.9539276336319557, 0.953927633631959, 0.953927633631959, 0.9539276336319557, 0.953927633631961, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319557, 0.9539276336319588, 0.9539276336319586, 0.9539276336319557, 0.953927633631961, 0.9539276336319613, 0.9539276336319557, 0.953927633631961, 0.953927633631961, 0.9539276336319658, 0.9539276336319557]
The previous output clearly indicates the the model is misspecified.
However, the best-fit is just the value I expected, and actually fits well to the data (once you provide a value of \mathcal{M} that is obtained my finding the minima of the unmarginalized likelihood \mathcal{L}).
Plotting the values of the \hat{k}'s reveals:
Notice the previous scale and how constant these values are.
Questions
Clearly I’m doing something wrong here, so I would like to ask you the following question:
- How does one save the log-likelihood for each observation when the likelihood has factors which depend on all of the observations?
Further insight on the usage of PSIS-LOO-CV, as well as feedback on the question itself, is always greatly appreciated!