Hello everyone.

I’m working on a model recovery for a future gravitational wave observatory, the Einstein Telescope, using mock data generated from the model given below.

It happens that, after performing the model recovery, PSIS-LOO-CV prints a warning, as the Pareto \hat{k} estimates flag some observations with fairly high values.

This happens even though the best-fit values are very close to the original values, lying inside the 1\sigma region, which is itself rather small.

# The Model

In short, the model I’m working with predicts an observable called the luminosity distance (d_L) as a function of another observable, the redshift (z), and two parameters, h and \Omega_m. The function is given by:

d_L(z) = \frac{3(1+z)}{h} \int_0^z \frac{1}{\sqrt{\Omega_m(1+x)^3 + 1 - \Omega_m}} dx
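For concreteness, here is a minimal numerical sketch of this distance using only the standard library (composite Simpson's rule; the number of subintervals is an arbitrary choice, not something from the original analysis):

```python
import math

def luminosity_distance(z, h, omega_m, n=1000):
    """d_L(z) = 3(1+z)/h * integral_0^z dx / sqrt(omega_m*(1+x)^3 + 1 - omega_m),
    evaluated with composite Simpson's rule on n (even) subintervals."""
    if z == 0.0:
        return 0.0
    f = lambda x: 1.0 / math.sqrt(omega_m * (1.0 + x) ** 3 + 1.0 - omega_m)
    dx = z / n
    total = f(0.0) + f(z)  # endpoint terms
    for i in range(1, n):
        # Interior points alternate weights 4, 2, 4, 2, ...
        total += (4.0 if i % 2 else 2.0) * f(i * dx)
    integral = total * dx / 3.0
    return 3.0 * (1.0 + z) / h * integral
```

A handy sanity check: for \Omega_m = 1 the integral has the closed form 2(1 - 1/\sqrt{1+z}).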

# The Data

The mock data used to get the model recovery is generated in a rather straightforward manner:

1. Get a redshift z_* for an event we expect to observe;
2. Compute the luminosity distance for that event, d_L(z_*);
3. Get the expected error for the observed event, \sigma(z_*);
4. Obtain the final value for the luminosity distance by taking a random sample from a Gaussian distribution, i.e. d_L(z_*) \rightarrow \mathcal{N}(d_L(z_*), \sigma^2(z_*)), so that the observed value doesn’t sit exactly on the model prediction.

We then consider a dataset of 1000 events.
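The four steps can be sketched as follows. Note that the redshift distribution (uniform on [0.1, 2]) and the error model (\sigma = 10% of d_L) are placeholder assumptions for illustration, not the actual Einstein Telescope forecast:

```python
import math
import random

def luminosity_distance(z, h, omega_m, n=1000):
    # Composite Simpson's rule for the d_L integral of the model above
    if z == 0.0:
        return 0.0
    f = lambda x: 1.0 / math.sqrt(omega_m * (1.0 + x) ** 3 + 1.0 - omega_m)
    dx = z / n
    total = f(0.0) + f(z)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(i * dx)
    return 3.0 * (1.0 + z) / h * total * dx / 3.0

def make_mock_catalogue(n_events=1000, h=0.7, omega_m=0.3, seed=42):
    """Steps 1-4 above; redshift range and sigma(z) are placeholders."""
    rng = random.Random(seed)
    events = []
    for _ in range(n_events):
        z_star = rng.uniform(0.1, 2.0)                    # step 1: redshift
        d_true = luminosity_distance(z_star, h, omega_m)  # step 2: model d_L
        sigma = 0.1 * d_true                              # step 3: assumed error
        d_obs = rng.gauss(d_true, sigma)                  # step 4: scatter
        events.append((z_star, d_obs, sigma))
    return events
```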

# The Results

The output of the Pareto \hat{k} diagnostics given by `arviz`, after running the chain in Stan with `pystan` (version 3), is as follows:

```
Computed from 14000 by 1000 log-likelihood matrix

         Estimate       SE
elpd_loo -1734.38    36.61
p_loo       15.47        -

There has been a warning during the calculation. Please check the results.
------
Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)       499   49.9%
 (0.5, 0.7]   (ok)         432   43.2%
   (0.7, 1]   (bad)         65    6.5%
   (1, Inf)   (very bad)     4    0.4%
```
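For reference, a minimal sketch of how such a table is produced with `arviz` (using one of its bundled example datasets as a stand-in for my actual pystan fit; the threshold 0.7 is the usual "bad" cutoff):

```python
import arviz as az

# Any InferenceData with a log_likelihood group works; "centered_eight"
# is a bundled example standing in for the actual fit.
idata = az.load_arviz_data("centered_eight")

# pointwise=True keeps the per-observation Pareto k-hat values
loo_result = az.loo(idata, pointwise=True)
print(loo_result)

# Indices of observations whose k-hat exceeds the 0.7 "bad" threshold
bad = (loo_result.pareto_k.values > 0.7).nonzero()[0]
```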

The value of `p_loo`, according to the LOO package glossary, suggests that the model is badly misspecified, which I know is not the case.

However, in the thread “Fitting the true model and getting bad k-pareto from loo”, Aki Vehtari says that PSIS can be expected to fail for very flexible models. That does not seem to be the case here, as we have 1000 observations versus 2 parameters.

I then removed all the troublesome events and re-ran the MCMC; this time, the Pareto \hat{k} diagnostics revealed no bad or very bad observations!

What’s interesting is that the value of PSIS-LOO-CV is essentially the same.

On the other hand, the constraints without the troublesome observations turned out to be slightly worse (the 1\sigma region on the posterior is about 2 times larger), which I believe is easily explained by the smaller number of events.

I then ran an MCMC with only the troublesome events. It was still able to recover the original parameters (with a 1\sigma region larger than in the previous case), the value of PSIS-LOO-CV was much higher, and some observations were still flagged as bad (1 of 66) or very bad (7 of 66).

This is something I was expecting, so no surprises here.

# The questions

I am left with some questions regarding these results:

- How can the same procedure, which is itself rather simple, create some events that turn out troublesome and others that are perfectly fine? If I had to guess, I would say that step 4 of the generating procedure occasionally draws samples far out in the Gaussian tail; could that explain it?
- Given that PSIS-LOO-CV stayed more or less the same with and without the troublesome events (-1739.907 ± 37.395 with, -1652.191 ± 33.998 without), what does that imply about the observations which were left out?
- Can I confidently use the results from PSIS-LOO-CV to compare between two models, using this dataset with the troublesome events?
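On the first question, a quick standard-library check supports the Gaussian-tail guess: about 0.27% of draws from a Gaussian land beyond 3\sigma, i.e. roughly 2-3 events per 1000, which is comparable in scale to the 4 "very bad" points above (whether those specific points are the flagged ones is only a conjecture here):

```python
import random

rng = random.Random(0)
n = 100_000
draws = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Fraction of standard-normal draws beyond 3 sigma; expected ~0.27%
n_outliers = sum(abs(d) > 3.0 for d in draws)
print(n_outliers / n)
```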

A similar question was raised in “How bad is a small percentage of very bad? (pareto-k-diagnostic of loo package)”, where the user asks how many bad observations one can have such that model comparison is still possible. I don’t think I understand the provided answer, and I hope somebody can elaborate on it.

However, in “High Pareto-k values for the same observations across different models: Can I still use loo to compare these models?”, Ben Goodrich states that if a model has high Pareto k values for some observations, then comparison is not guaranteed to be possible (or at least, that was my understanding of his reply).

I also have the case of a different observatory with only 15 events. In some cases, the bad or very bad observations make up a significant portion of them (5-6). What should I do in that situation?

**TL;DR:** Doing model recovery with 1000 events generated by the same procedure leads to a warning from PSIS-LOO-CV about high Pareto \hat{k} values for some observations, yet the original values are recovered anyway. After the troublesome observations are removed, the value of PSIS-LOO-CV is more or less the same and the size of the 1\sigma region on the posterior increases. Running an MCMC with only the troublesome events, I am also able to recover the original values, with a much larger 1\sigma region and a much higher value of PSIS-LOO-CV. See the questions above for my concerns.