Pareto k warns of a misspecified model during model recovery when computing PSIS-LOO-CV

Hello everyone.
I’m working on a model recovery for a future gravitational wave observatory, the Einstein Telescope, using mock data generated from the model given below.
After performing the model recovery, PSIS-LOO-CV prints a warning, as the Pareto \hat{k} estimates give a fairly high value for some observations.
This happens even though the best-fit values are very close to the original values, lying inside the 1 \sigma region, which is itself rather small.

The Model

In short, the model I’m working with makes a prediction for an observable referred to as the luminosity distance (d_L), which is a function of another observable, the redshift (z), and of two parameters, h and \Omega_m. The function is given by:

d_L(z) = \frac{3(1+z)}{h} \int_0^z \frac{1}{\sqrt{\Omega_m(1+x)^3 + 1 - \Omega_m}} dx

The Data

The mock data used for the model recovery are generated in a rather straightforward manner:

  1. Get a redshift z_* for an event we expect to observe;
  2. Compute the luminosity distance for that event d_L(z_*);
  3. Get the expected error for the observed event \sigma(z_*);
  4. Obtain the final value for the luminosity distance by taking a random sample from a Gaussian distribution, i.e. d_L(z_*) \rightarrow \mathcal{N}(d_L(z_*), \sigma^2(z_*)), so that the observed value doesn’t sit exactly on top of the expected one.

We then consider a dataset of 1000 events.
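For concreteness, here is a minimal Python sketch of this generating procedure. The redshift range and the error model below are placeholder choices of mine, since the real z_* and \sigma(z_*) come from the observatory forecast:

import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(42)
h_true, Omega_m_true = 0.7, 0.3  # fiducial parameters (placeholder values)

def d_L(z, h, Omega_m):
    # luminosity distance from the expression above
    integral, _ = quad(lambda x: 1.0 / np.sqrt(Omega_m * (1.0 + x)**3 + 1.0 - Omega_m), 0.0, z)
    return 3.0 * (1.0 + z) / h * integral

z_star = np.sort(rng.uniform(0.07, 2.0, size=1000))                 # step 1 (toy redshifts)
dL_true = np.array([d_L(z, h_true, Omega_m_true) for z in z_star])  # step 2
sigma = 0.1 * dL_true                                               # step 3 (toy error model)
dL_obs = rng.normal(dL_true, sigma)                                 # step 4: scatter the observations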

The Results

The output of the Pareto \hat{k} diagnostics given by ArviZ, after running the chains in Stan using PyStan (version 3), is as follows:

Computed from 14000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1734.38    36.61
p_loo       15.47        -

There has been a warning during the calculation. Please check the results.
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      499   49.9%
 (0.5, 0.7]   (ok)        432   43.2%
   (0.7, 1]   (bad)        65    6.5%
   (1, Inf)   (very bad)    4    0.4%

The value of p_loo, according to the LOO package glossary, suggests that the model is badly misspecified, which I know is not the case.

In Fitting the true model and getting bad k-pareto from loo, Aki Vehtari says that PSIS can be expected to fail for very flexible models. However, that does not seem to be the case here, as we have 1000 observations versus 2 parameters.

I then removed all the troublesome events and re-ran the MCMC, and the Pareto \hat{k} diagnostic revealed no bad or very bad observations!
What’s interesting is that the value of PSIS-LOO-CV is essentially the same.
On the other hand, the constraints without the troublesome observations turned out to be slightly worse (about 2 times larger 1 \sigma region on the posterior), which I believe can be easily explained by the smaller number of events.

I then ran an MCMC with only the troublesome events, which was still able to recover the original parameters (with a 1 \sigma region larger than in the previous case); the value of PSIS-LOO-CV was much higher, and some observations were still considered bad (1 of 66) or very bad (7 of 66).
This is something I was expecting, so no surprises here.

The questions

I am left with some questions regarding these results:

  1. How could the same procedure, which is itself rather simple, create some events that turn out troublesome and others that are perfectly fine? If I had to guess, I would say that step 4 of the generating procedure might create events lying very far from the theoretical curve; could that explain it?
  2. Given that PSIS-LOO-CV stayed more or less the same with and without the troublesome events (-1739.907 ± 37.395 with and -1652.191 ± 33.998 without), what does that imply about the observations that were left out?
  3. Can I confidently use the results from PSIS-LOO-CV to compare two models, using this dataset with the troublesome events?
    A similar question was raised in How bad is a small percentage of very bad? (pareto-k-diagnostic of loo package), where the user asks how many bad observations one can have while model comparison remains possible. I don’t think I understand the provided answer, and I hope somebody could elaborate on it.
    However, in High Pareto-k values for the same observations across different models: Can I still use loo to compare these models?, Ben Goodrich states that if a model has high Pareto k values for some observations, then the comparison is not guaranteed to be reliable (or at least, that was my understanding of his reply).
  4. I also have the case of a different observatory with only 15 events. In some cases, the bad or very bad observations make up a significant portion of them (5-6). What should I do in that situation?

TL;DR: Doing model recovery with 1000 events generated by the same procedure leads to a warning from PSIS-LOO-CV about high Pareto \hat{k} values for some observations, yet the original values are recovered anyway. After the troublesome observations are removed, the value of PSIS-LOO-CV is more or less the same, while the size of the 1 \sigma region on the posterior increases. Running an MCMC with only the troublesome events, I am also able to recover the original values, with a much larger 1 \sigma region and a much higher value of PSIS-LOO-CV. See the questions above for my concerns.

Nice that you have already found a lot of useful material on the problem.

As p_loo is much larger than the total number of parameters, we can infer that some of the observations are highly influential. If your simulation and model match each other (and there are no errors in the implementation), then we know there is no model misspecification, and it has to be that, even with two parameters, the posterior predictive distribution changes a lot when leaving out an observation. As PSIS-LOO has problems, you could try brute-force leave-one-out to test this: pick one of the problematic observations, check its predictive distribution, leave it out, refit with the other observations, and compare how different the predictive distribution is. If computation time is not an issue, you can continue with brute-force leave-one-out for the rest, or use K-fold-CV.
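For concreteness, a brute-force leave-one-out refit along these lines could look like the following rough sketch, where program and data are placeholder names for the Stan model code and the data dictionary (matching the PyStan setup that appears later in the thread), and i is the index of one high-\hat{k} observation:

import numpy as np
import stan

i = 123  # index of one high-khat observation (placeholder)
keep = np.arange(len(data["redshift"])) != i
data_wo_i = {"N1": int(keep.sum()),
             "redshift": data["redshift"][keep],
             "luminosity_distance": data["luminosity_distance"][keep],
             "error": data["error"][keep]}

fit_full = stan.build(program, data=data).sample(num_chains=4, num_samples=2500)
fit_wo_i = stan.build(program, data=data_wo_i).sample(num_chains=4, num_samples=2500)

# Compare the predictive distribution for observation i under the two fits, e.g.
# by pushing the posterior draws of (h, Omega_m) through d_L(z_i) and adding the
# observation noise; a large change indicates a highly influential observation.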

Probably, and you can check this.

I think these are quite far away from each other.

It would be useful to understand the issue better. PSIS-LOO can sometimes be used in the early phase of the workflow even with high k’s, but if you want to say something more accurate, the challenge is that there is a small probability of a big error when there are very high khats.

What is the alternative model you are considering?

Can you share Stan code and data (or data simulation code)?


Thank you for the insight; hopefully I was able to provide a clear description of my problem.

So, to check whether the ideas in my mind are correct: if the Pareto \hat{k} diagnostic gives a high value for an observation, either it is a very influential observation or the model is incorrect. In this case, since p_loo is much larger than the number of parameters and the model is correct, that implies these are highly influential observations, correct?

What I did, and do correct me if this is the wrong approach, was to remove all of the high \hat{k} observations and run three MCMCs: all observations, only the high \hat{k} observations, and only the low \hat{k} observations. The posterior distributions for the three cases, compared side by side, are as follows:
[Figure 1: posterior distributions for the three cases, compared side by side.]
So it seems to me that the posterior is very similar between these 3 cases.
I have yet to become more familiar with the posterior predictive distribution in order to start looking into that. From my understanding, it’s the probability of observing a given event conditional on the data actually observed, and it seems that ArviZ has a way to compute it very easily, but I don’t know how to interpret it very well, nor how to adapt it to the case where one observation is 2D, as the examples in plot_ppc seem to be for 1D observations.
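For what it’s worth, here is a minimal sketch of how a posterior predictive check could be assembled for this model. Since only d_L is modeled, with z and \sigma acting as covariates, the check is effectively 1D per event. The arrays h_draws and Om_draws (posterior draws) and z, sigma, d_obs (the data) are placeholders:

import numpy as np
import arviz as az
from scipy.integrate import quad

def d_L(z, h, Omega_m):
    integral, _ = quad(lambda x: 1.0 / np.sqrt(Omega_m * (1.0 + x)**3 + 1.0 - Omega_m), 0.0, z)
    return 3.0 * (1.0 + z) / h * integral

rng = np.random.default_rng(0)
sub = rng.choice(len(h_draws), size=200, replace=False)  # thin the posterior draws for speed
pp = np.array([[rng.normal(d_L(zi, h_draws[s], Om_draws[s]), si) for zi, si in zip(z, sigma)]
               for s in sub])                            # shape (200, N_events)

idata_pp = az.from_dict(posterior_predictive={"luminosity_distance": pp[None]},
                        observed_data={"luminosity_distance": d_obs})
az.plot_ppc(idata_pp)  # compares replicated d_L values with the observed ones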

You are right, and I should have checked this beforehand.
After running an MCMC with only the very influential observations (Pareto \hat{k} > 0.5) and an MCMC with all the others, I decided to plot, for each observation, the absolute difference between the observed luminosity distance, labeled d_L^{(obs)}(z), and the true luminosity distance, d_L(z), divided by the error for that event, \sigma^{(obs)}. The result was the following:
[Figure: modulus of the observed minus the theoretical luminosity distance, divided by the error, for each event.]
As we can see, the highly influential observations sit essentially in two places: at low z, regardless of the error, and at high z, and they are among the closest to the function that was used to generate them.
This means it is not randomness generating very influential observations that lie very far from the theoretical value used to generate them.
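A plot like this can be produced along the following lines (a rough sketch of my own; fit is the PyStan fit object, and z, sigma, d_obs, d_true are placeholder arrays for the redshifts, errors, observed distances, and theoretical distances at the fiducial parameters):

import numpy as np
import arviz as az
import matplotlib.pyplot as plt

loo_res = az.loo(fit, pointwise=True)   # pointwise=True exposes the per-observation khats
khat = np.asarray(loo_res.pareto_k)     # one khat estimate per observation
high = khat > 0.5                       # the "very influential" threshold used above

resid = np.abs(d_obs - d_true) / sigma  # |d_L^(obs) - d_L| / sigma per event
plt.scatter(z[~high], resid[~high], s=8, label=r"$\hat{k} \leq 0.5$")
plt.scatter(z[high], resid[high], s=8, label=r"$\hat{k} > 0.5$")
plt.xlabel("z")
plt.ylabel(r"$|d_L^{(obs)} - d_L| / \sigma$")
plt.legend()
plt.show()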

Well, you have to add the estimated error to one of them about 2.5 times to get the other, so yes, I suppose that’s not very close, but I wouldn’t really know what to consider close here, as I have no intuition for those numbers. I should look into the paper provided for more details.

Yes, of course. This one is a little bit more complicated, and it’s given by
d_L(z) = \sqrt{ \frac{1 - \lambda}{1 - \lambda/E^2}} e^{\frac{\lambda}{2}(1 - 1/E^2)} \frac{3(1+z)}{h} \int_0^z \frac{1}{E(x)} dx
where the function E has to be solved numerically from the following equation
(E^2 - 2\lambda)e^{\lambda/E^2} = \Omega_m (1+z)^3
and \lambda is given by
\lambda = \frac{1}{2} + W_0 \left( -\frac{\Omega_m}{2e^{1/2}} \right)
As such, it still has the same two parameters: h and \Omega_m.
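To make the numerical step concrete, here is a rough Python sketch of my own for evaluating E(z) and this d_L(z), using the Lambert W function for \lambda and a bracketed root-find for E at each redshift (purely illustrative, not the Stan implementation attached below):

import numpy as np
from scipy.special import lambertw
from scipy.optimize import brentq
from scipy.integrate import quad

def lam_of(Omega_m):
    # lambda = 1/2 + W_0(-Omega_m / (2 e^{1/2}))
    return 0.5 + np.real(lambertw(-Omega_m / (2.0 * np.exp(0.5))))

def E_of_z(z, Omega_m):
    # solve (E^2 - 2 lambda) exp(lambda / E^2) = Omega_m (1 + z)^3 for E
    lam = lam_of(Omega_m)
    rhs = Omega_m * (1.0 + z)**3
    f = lambda E: (E**2 - 2.0 * lam) * np.exp(lam / E**2) - rhs
    lo = np.sqrt(2.0 * lam) * (1.0 + 1e-12)  # f(lo) < 0
    hi = np.sqrt(rhs + 2.0 * lam) + 1.0      # f(hi) > 0, so the root is bracketed
    return brentq(f, lo, hi)

def d_L_alt(z, h, Omega_m):
    lam = lam_of(Omega_m)
    Ez = E_of_z(z, Omega_m)
    integral, _ = quad(lambda x: 1.0 / E_of_z(x, Omega_m), 0.0, z)
    prefac = (np.sqrt((1.0 - lam) / (1.0 - lam / Ez**2))
              * np.exp(0.5 * lam * (1.0 - 1.0 / Ez**2)))
    return prefac * 3.0 * (1.0 + z) / h * integral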

The first model, which was used in the catalog generating procedure, is coded in the following file:
LCDM.stan (2.4 KB)
The dataset which I generated using the previous model, is given in the following .csv file:
ET-4.csv (54.8 KB)
The alternative model, which I wish to compare against the first model, is given the following file:
fotis-noOmegar.stan (3.9 KB)
After running the MCMC using the first model, I separated the dataset ET-4.csv into two files: one that includes only the very influential events (Pareto \hat{k} > 0.5), named ET-4-highk.csv, and one that includes all the other events, named ET-4-lowk.csv:
ET-4-highk.csv (3.8 KB)
ET-4-lowk.csv (51.1 KB)

Hi,

I used your code LCDM.stan and data ET-4.csv. I get no high khats, and p_loo is close to 2. The residuals follow the normal distribution very nicely.

> loo(f$draws("log_lik"))

Computed from 1000 by 1000 log-likelihood matrix

         Estimate   SE
elpd_loo  -1721.0 36.5
p_loo         2.1  0.5
looic      3442.0 73.1
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

I guess you had some error in your loo computation.

The fotis-noOmegar.stan model produces clearly wrong predictions (it’s very clear just from plotting the predictions and the luminosity distances), and a loo result that clearly indicates very bad model misspecification.

That is very strange… Would the number of warmup iterations plus samples matter? I ran 4 chains, each with 1000 warmup steps and 2500 samples. Also, I’m using PyStan 3, and I computed PSIS-LOO-CV with ArviZ, which should be the same computation as on your side, just in Python.
I also just re-ran the chains and everything seemed alright.
I’ll put together an MWE using PyStan 3 and ArviZ and send it here ASAP.

Yes, I’m aware of that, but I wanted to quantify that based on the value of PSIS-LOO-CV as well.
Although in this case it is fairly obvious, it might not be for other models or datasets, so I was taking the opportunity to learn more about the subject.
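For reference, once both models yield clean loo results, the quantitative comparison can be done directly with az.compare. A minimal sketch, where idata_lcdm and idata_alt are placeholder InferenceData objects containing the log-likelihood of each model:

import arviz as az

# Ranks the models by elpd_loo and reports the elpd difference to the best
# model together with its standard error.
comparison = az.compare({"LCDM": idata_lcdm, "noOmegar": idata_alt}, ic="loo")
print(comparison)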


Shouldn’t matter. The posterior is simple, so 1000 warmup iterations is plenty.

Once again, many thanks for your help so far, and here is the promised MWE in Python:

# imports
from random import gauss
import numpy as np
import arviz as az
import pandas
import stan

# model file
with open("LCDM.stan", "r") as file:
    program = file.read()

# data file
file = "ET-4.csv"
columns = pandas.read_csv(file, comment="#")
dic = {"N1": len(columns["redshift"]),
       "redshift": np.array(columns["redshift"]),
       "luminosity_distance": np.array(columns["luminosity_distance"]),
       "error": np.array(columns["error"])}

# initial conditions
init = []
for i in range(0, 4):  # num of chains is hardcoded
    init.append({})
    init[i]["h"] = gauss(0.7, 0.15)
    init[i]["Omega_m"] = gauss(0.7, 0.1)

# run!
posterior = stan.build(program, data=dic)
fit = posterior.sample(num_chains=4, num_samples=2500, num_warmup=1000, init=init, save_warmup=True)

# compute PSIS-LOO-CV
loo = az.loo(fit, pointwise=True)
print(loo)

And here’s the full output of this script:

$ python misc/mwe.py
Building: found in cache, done.
Messages from stanc:
Warning in '/tmp/httpstan_gie2dwdj/model_t5jd2i3v.stan', line 51, column 60: The
    variable integrand may not have been assigned a value before its use.
Sampling: 100% (14000/14000), done.
Messages received during sampling:
  Gradient evaluation took 0.029433 seconds
  1000 transitions using 10 leapfrog steps per transition would take 294.33 seconds.
  Adjust your expectations accordingly!
  Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
  Exception: normal_lpdf: Location parameter[1] is inf, but must be finite! (in '/tmp/httpstan_rsp79i50/model_t5jd2i3v.stan', line 63, column 2 to column 42)
  If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
  but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
  Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
  Exception: normal_lpdf: Location parameter[1] is inf, but must be finite! (in '/tmp/httpstan_rsp79i50/model_t5jd2i3v.stan', line 63, column 2 to column 42)
  If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
  but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
  Gradient evaluation took 0.033037 seconds
  1000 transitions using 10 leapfrog steps per transition would take 330.37 seconds.
  Adjust your expectations accordingly!
  Gradient evaluation took 0.03099 seconds
  1000 transitions using 10 leapfrog steps per transition would take 309.9 seconds.
  Adjust your expectations accordingly!
  Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
  Exception: normal_lpdf: Location parameter[1] is inf, but must be finite! (in '/tmp/httpstan_rsp79i50/model_t5jd2i3v.stan', line 63, column 2 to column 42)
  If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
  but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
  Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
  Exception: normal_lpdf: Location parameter[1] is inf, but must be finite! (in '/tmp/httpstan_rsp79i50/model_t5jd2i3v.stan', line 63, column 2 to column 42)
  If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
  but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
  Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
  Exception: normal_lpdf: Location parameter[1] is inf, but must be finite! (in '/tmp/httpstan_rsp79i50/model_t5jd2i3v.stan', line 63, column 2 to column 42)
  If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
  but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
  Gradient evaluation took 0.034786 seconds
  1000 transitions using 10 leapfrog steps per transition would take 347.86 seconds.
  Adjust your expectations accordingly!
/home/undercover/.micromamba/envs/fotis/lib/python3.10/site-packages/arviz/stats/stats.py:1048: RuntimeWarning: overflow encountered in exp
  weights = 1 / np.exp(len_scale - len_scale[:, None]).sum(axis=1)
/home/undercover/.micromamba/envs/fotis/lib/python3.10/site-packages/numpy/core/_methods.py:48: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
/home/undercover/.micromamba/envs/fotis/lib/python3.10/site-packages/arviz/stats/stats.py:811: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
  warnings.warn(
Computed from 14000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1762.19    44.80
p_loo       43.28        -

There has been a warning during the calculation. Please check the results.
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      563   56.3%
 (0.5, 0.7]   (ok)        149   14.9%
   (0.7, 1]   (bad)       276   27.6%
   (1, Inf)   (very bad)   12    1.2%

As you can see, there are in fact observations with a high Pareto \hat{k}.
This is very strange, but I don’t feel like I’m doing anything wrong here, and I’m using the same packages you seem to be using, only in Python.
Would you mind sharing your MWE in R so that I can run it on my machine?


Here is a copy of my Python virtual environment, which is managed by micromamba, a lightweight virtual environment manager which is fully compatible with conda:
micromamba-env.txt (1.8 KB)
pip-list.txt (1.8 KB)
The most relevant packages/versions are:

  • Python 3.10.4
  • pystan 3.4.0
  • arviz 0.12.0
  • numpy 1.22.3

As you have save_warmup=True, are those draws also used for loo? What if you change to save_warmup=False?


I thought precisely about that when I was writing the MWE, but for some reason I assumed there was a distinction between the warmup draws and the samples inside the fit object.
I’ve just re-run the MWE without saving the warmup, and it does in fact show that all values are well behaved!

A follow-up question: since I would like to store the warmup as well for future reference, how would I modify the fit in such a way as to remove the warmup?
That way I could save it to a file, even show the chain convergence plot, and then remove the warmup steps before computing model selection criteria.


Great!

Sorry, I’m not familiar enough with Python, but I’ll ping some people who probably know the answer immediately.

No, that’s fine, I’ll figure it out myself; that’s the one thing I feel shouldn’t be complicated in the middle of all of this.
I will do that, re-run all the PSIS-LOO-CV computations, and if none come out with high Pareto \hat{k} values, then I will move on confidently!
Thank you once again for your help.

Can you first try to create an InferenceData object with

idata = az.from_cmdstanpy(fit)
# or pystan
idata = az.from_pystan(fit)

Check that the log likelihood is found in the idata object.

And then pass that to loo

az.loo(idata)
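One quick way to do the "check that the log likelihood is found" step, assuming the idata object created above:

print(idata.groups())        # 'log_likelihood' should appear in this list
print(idata.log_likelihood)  # the pointwise log-likelihood as an xarray Dataset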

No, that didn’t work out.
After running the MWE provided, not forgetting to save the warmup steps, the commands you suggested yield the following:

In [1]: idata = az.from_pystan(fit)

In [2]: az.loo(idata)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/arviz/stats/stats.py:1048: RuntimeWarning: overflow encountered in exp
  weights = 1 / np.exp(len_scale - len_scale[:, None]).sum(axis=1)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/numpy/core/_methods.py:48: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/arviz/stats/stats.py:811: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
  warnings.warn(
Out[2]: 
Computed from 14000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1795.71    56.59
p_loo       76.71        -

There has been a warning during the calculation. Please check the results.
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      219   21.9%
 (0.5, 0.7]   (ok)        274   27.4%
   (0.7, 1]   (bad)       438   43.8%
   (1, Inf)   (very bad)   69    6.9%

The overflow seems to happen even without running the commands you have suggested:

In [3]: az.loo(fit)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/arviz/stats/stats.py:1048: RuntimeWarning: overflow encountered in exp
  weights = 1 / np.exp(len_scale - len_scale[:, None]).sum(axis=1)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/numpy/core/_methods.py:48: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/arviz/stats/stats.py:811: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
  warnings.warn(
Out[3]: 
Computed from 14000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1795.71    56.59
p_loo       76.71        -

There has been a warning during the calculation. Please check the results.
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      219   21.9%
 (0.5, 0.7]   (ok)        274   27.4%
   (0.7, 1]   (bad)       438   43.8%
   (1, Inf)   (very bad)   69    6.9%

Isn’t there a way to access the inside of the fit object and modify it so that it only trims the chains?
It is perhaps best to take this discussion to the PyStan GitHub repository, except that it only features an issues section, not a discussions section. Regardless, if you agree, I will open an issue there and ask this question (although, again, it’s not really an issue).

This is not a PyStan issue but an ArviZ issue.

az.from_pystan should move the warmup samples to another group, unless there was no flag indicating whether the sample contained warmup or not.

cc @OriolAbril

Ok, apparently we have not updated the reader to handle warmup for pystan3.

A quick fix before we fix the reader is to use the .isel method:

idata = idata.isel(draw=slice(-num_samples, None))

(I still need to test if that syntax works against idata or only on groups)


Many thanks, this seems to be working!
I would just like to clarify, for future reference, that the value of num_samples is the number of samples per chain, not the total number of samples (if you are running several chains).
I tried using the total number of samples and it didn’t work; it didn’t show a warning either, it just didn’t do anything.

Here’s the output of az.loo, after sampling the posterior distribution and getting the fit object:

>>> az.loo(fit, pointwise=True)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/arviz/stats/stats.py:1048: RuntimeWarning: overflow encountered in exp
  weights = 1 / np.exp(len_scale - len_scale[:, None]).sum(axis=1)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/numpy/core/_methods.py:48: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
/home/joseferreira/.micromamba/envs/fotis-r/lib/python3.10/site-packages/arviz/stats/stats.py:811: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
  warnings.warn(
Out[6]: 
Computed from 14000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1760.82    44.22
p_loo       41.92        -

There has been a warning during the calculation. Please check the results.
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      520   52.0%
 (0.5, 0.7]   (ok)         63    6.3%
   (0.7, 1]   (bad)       388   38.8%
   (1, Inf)   (very bad)   29    2.9%

And using your code to convert it to an InferenceData object and trimming it (again here the number of samples is 2500, which is the number of samples per chain):

>>> idata = az.from_pystan(fit)
>>> idata = idata.isel(draw=slice(-2500, None))
>>> az.loo(idata)
Computed from 10000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1720.96    36.52
p_loo        2.08        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)     1000  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

You can even see that it’s using the right number of samples by comparing with the previous output.


Apologies for reviving this thread after the issue had been sorted out.
However, after running the MWE file mwe.py (1.2 KB), which is now corrected and properly computes the value of loo, with ET-4.csv, I am getting similar values for both LCDM.stan (the model used to generate the data) and fotis-noOmegar.stan (the other model, which you mentioned was misspecified).

For LCDM.stan the output of az.loo was

Computed from 10000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1721.00    36.52
p_loo        2.12        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)     1000  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

Whereas for fotis-noOmegar.stan the result was very similar, as you can see:

Computed from 10000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1720.98    36.50
p_loo        2.20        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)     1000  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

Yet, you stated earlier that fotis-noOmegar.stan produces clearly wrong predictions and a loo result that clearly indicates very bad model misspecification, which I have not observed: both models have very similar values for loo, similar to the one you showed in that same post.

Now, I tried to plot ET-4.csv and the best fit for both LCDM.stan and fotis-noOmegar.stan in the luminosity distance versus redshift plane, yet due to the massive number of events it’s unclear, at least visually, which one is best.
So I decided to use another catalog, with fewer events spread out across a wider redshift range, available in LISA-N30-1.csv (1.9 KB), and plotted the best fit for LCDM.stan, the best fit for fotis-noOmegar.stan, and the catalog itself:
[Figure: LISA-N30-1.csv and the corresponding best fits for both LCDM.stan and fotis-noOmegar.stan, showing that LCDM.stan is clearly the better fit to the data.]
Which clearly shows that LCDM.stan is the best fit.

Yet, similarly to what happened for ET-4.csv, the computation of loo suggests that both models are at the same level.
For LCDM.stan with LISA-N30-1.csv we have:

Computed from 30000 by 30 log-likelihood matrix

         Estimate       SE
elpd_loo   -31.74    10.15
p_loo        0.89        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)       30  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

And for fotis-noOmegar.stan with LISA-N30-1.csv we have similarly:

Computed from 30000 by 30 log-likelihood matrix

         Estimate       SE
elpd_loo   -31.76    10.16
p_loo        0.88        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)       30  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

So the question is: why are the two models equivalent according to loo, when one is clearly better than the other when looking at the best fit?

EDIT: Disregard what I said about the size of the log-likelihood matrix, not only did I misread it, but after looking at the documentation its size does in fact make sense.

Maybe you have an error in your script?

The error was indeed coming from my script, and I have finally been able to find it and fix it.

I think that puts an end to this thread once and for all; thanks again!
