Number of points with pareto_k > 0.7 increases when a subset of the data set is used

Max_Mantei · June 13, 2019, 4:26pm

You are not the first one to struggle with the Lognormal (I did too for a while). You might want to check out the actually good Wiki article. Note, that the Lognormal has support (0,+\infty), but it’s location parameter has is \mu \in (-\infty,+\infty) – note, that this is not the mean. This is working, because (loosely speaking) the location parameter “runs through” the \exp() function, which maps the input values to the positive real numbers (-\infty,+\infty) \rightarrow (0,+\infty). This means, that \mu is on the log-scale. And that it is called \mu is kinda unfortunate. Well, it does actually make sense, because the way how the log-normal is often defined (in Stan as well) is

\log(Y) \sim \text{Normal}(\mu,\sigma) \leftrightarrow Y \sim \text{LogNormal}(\mu,\sigma).

Well, these are not really equivalent, because in the first you have a change of variables (\log(Y)) and since in Bayesian estimation you are working with probability densities, you need apply the change of variable formula… There is also a tiny section on this in the Stan manual.

Does this make sense?

Also, when I said that you need to take care of the model formula when applying link functions, I was a bit sloppy. I give you two examples from Simon Wood’s fantastic GAM book (Chapter 3, GLMs):

Suppose a disease epidemic with relatively few cases (y_i) for now. But the rate of new cases occurring increases exponentially through time. Let \eta_i be the expected number of cases at time t_i, the model is \eta_i = \gamma \exp(\delta t_i). Using a log-link the model becomes \log(\eta_i)=\log(\gamma) + \delta t_i and letting \beta_0 = \log(\gamma) and \beta_1 = \delta, one can write: \log(\eta_i) = \beta_0 + \beta_1 t_i. Note that \beta_0 is now on the log-scale. In R this could look like: glm(y ~ 1 + t, family = poisson(link="log")).
The rate of caputre of prey, y_i by a hunting animal increases with density of prey, x_i but levels off eventually when predators catch as much as they are capable of. Let the model by \eta_i=\dfrac{\alpha x_i}{h + x_i}, where \alpha is the max. capture rate and h the prey density at which the max. capture rate is halved. Using a reciprocal (i.e. inverse) link yields 1/\eta_i = \dfrac{1}{\alpha} + \dfrac{h}{\alpha} \dfrac{1}{x_i} = \beta_0 + \beta_1 \dfrac{1}{x_i}, by letting \beta_0=1/\alpha and \beta_1=h/\alpha. In R this could look like this: glm(y ~ 1 + I(1/x), family = Gamma(link="inverse")).

That being said, I’m not sure that this can be neatly done with your model…

Topic		Replies	Views
A quick note what I infer from p_loo and Pareto k values Modeling loo	35	15894	August 21, 2022
Pareto K diagnostics and kfold model comparison brms loo	15	4998	February 13, 2019
Recommendations for what to do when k exceeds 0.5 in the loo package? Modeling loo	21	7520	March 8, 2018
Good PP check and R square but large Pareto k values Modeling performance , loo	10	2155	September 2, 2020
Favors better LOO-CV results or Pareto k diagnostics for model selection Modeling loo , model-comparison	11	476	June 19, 2024

Number of points with pareto_k > 0.7 increases when a subset of the data set is used

Related topics