Pareto K diagnostics and kfold model comparison

Hello,

I am fitting binomial models with brms that are yielding Pareto k values > 0.5. I then tried zero_inflated_binomial and beta_binomial distributions and these models yielded fewer Pareto k values >5 but when I compared the models with kfold as the package suggested, the models with more bad Pareto k values had lower kfold ic values. Can I trust models with so many Pareto k warnings? Should I trust the kfold ic comparison despite it favoring models with many more Pareto k warnings? Also, none of the loo_pit diagnostic plots look particularly good to me, but I would appreciate some help in interpretation. I’ve given summaries of my 3 models below.

The data consists of 1547 observations of X successes out of 50 or 100 total trials. Each model has 499-500 parameters. I can give more info on the models if it helps but I’m hoping for general advice on how to deal with these high pareto k values. I tried following the advice on other threads (here and here) but different response distributions didn’t help much.

Binomial

loo(mBinomial)
#> Computed from 5700 by 1547 log-likelihood matrix

#>          Estimate    SE
#> elpd_loo  -4026.2  62.0
#> p_loo       545.5  24.3
#> looic      8052.5 124.1
#> ------
#> Monte Carlo SE of elpd_loo is NA.

#> Pareto k diagnostic values:
#>                          Count Pct.    Min. n_eff
#> (-Inf, 0.5]   (good)     1351  87.3%   430       
#>  (0.5, 0.7]   (ok)        157  10.1%   63        
#>    (0.7, 1]   (bad)        30   1.9%   18        
#>    (1, Inf)   (very bad)    9   0.6%   3         
#> See help('pareto-k-diagnostic') for details.

kfold(mBinomial)

#> Based on 10-fold cross-validation

#>            Estimate    SE
#> elpd_kfold  -4058.8  61.6
#> p_kfold          NA    NA
#> kfoldic      8117.6 123.2

Mbrmsmooth13qqmbrmsmooth13%20overlay

Zero_inflated_binomial

loo(mzBinomial)

#> Computed from 5700 by 1547 log-likelihood matrix

#>          Estimate    SE
#> elpd_loo  -4026.2  61.6
#> p_loo       541.2  24.8
#> looic      8052.3 123.2
#> ------
#> Monte Carlo SE of elpd_loo is NA.

#> Pareto k diagnostic values:
#>                          Count Pct.    Min. n_eff
#> (-Inf, 0.5]   (good)     1321  85.4%   289       
#>  (0.5, 0.7]   (ok)        174  11.2%   138       
#>    (0.7, 1]   (bad)        45   2.9%   20        
#>    (1, Inf)   (very bad)    7   0.5%   3         
#> See help('pareto-k-diagnostic') for details.

kfold(mzBinomial)

#> Based on 10-fold cross-validation

#>            Estimate    SE
#> elpd_kfold  -4019.6  57.7
#> p_kfold          NA    NA
#> kfoldic      8039.2 115.4

0qq
0overlay

Beta-Binomial

loo(mbBinomial)

#> Computed from 5700 by 1547 log-likelihood matrix

#>          Estimate   SE
#> elpd_loo  -3892.7 43.2
#> p_loo       320.2 12.6
#> looic      7785.4 86.3
#> ------
#> Monte Carlo SE of elpd_loo is NA.

#> Pareto k diagnostic values:
#>                          Count Pct.    Min. n_eff
#> (-Inf, 0.5]   (good)     1430  92.4%   358       
#>  (0.5, 0.7]   (ok)        103   6.7%   110       
#>    (0.7, 1]   (bad)        13   0.8%   27        
#>    (1, Inf)   (very bad)    1   0.1%   13        
#> See help('pareto-k-diagnostic') for details.

kfold(mbBinomial)

#> Based on 10-fold cross-validation

#>            Estimate   SE
#> elpd_kfold  -4251.4 47.0
#> p_kfold          NA   NA
#> kfoldic      8502.8 94.0

mbrmsmooth20qq
Mbrmsmooth20overlay

Thank you very much,
Sam

Probably not. Sometimes many Pareto k warnings indicate problems in models, but there are also cases where the model can be just fine, but it’s just that kind of model where most observations are highly influential and importance sampling fails (and same for waic).

LOO-PIT also indicate problems.Just visually zinb seems to be best. I highly recommend to use also posterior predictive checking plots.

You can trust kfold. I’m not certain how you read the values, but if we look at the elpd_kfold values, then
zinb is the best (-4019.6), binomial is the second (-4058.8), and beta-binomial is third (-4251.4).

And I repeat, that you should look at your data and compare to posterior predictions. Based on the information so far it’s likely that there is something wrong, but to say more then I would need to know more about the model and the data. You said that models have max 500 parameters, but two models have p_loo> 540, which often indicates problems.

Great, thank you. I think I was mostly confused because the zinb model is best by kfold but the beta-binomial seems way better in having fewer bad pareto k values and lower p_loo, despite the worst kfold value.

I am modeling the swimming behavior (either swimming or not) over time of larval crabs of 2 species by the sun intensity, tide height, and ultraviolet radiation treatment with random intercepts for replicate jar (4 jars per treatment per experiment) and experiment (I repeated the experiment 10 times for each species in different sun intensity / tide conditions). There were either 50 (species 1) or 100 (species 2) larvae / jar. Both sun intensity and tide height have non-linear effects and are modeled with poly(x,3) terms. I was worried that including both of these non-linear effects made the model too complex but fitting the model with only 1 of these non-linear covariates results in a much worse kfold ic and more pareto k warnings.

I fit the models in brms using this model formula:
Surface_count|trials(Total)~(Treatment+Species+Time+poly(Brightness.s,3)+poly(Tideheight.s,3))^3 + (1|Experiment+Uniquerep)

Posterior predictive checking plots all look good to me and pretty similar, except for the prop_zero check on the binomial model. Prop_zero plots for the beta binomial and zero inflated binomial models aren’t perfect but they don’t look terrible to me.

Binomial

ppc_dens_overlay


ppc_stat prop_zero

Zero inflated binomial

ppc_dens_overlay


ppc_stat prop_zero

Beta binomial

ppc_dens_overlay


ppc_stat prop_zero

Thanks again,
Sam

It is possible that poly is causing trouble as polynomials are often unstable and not recommended to be used. Could your try with splines s() instead? Even if it’s actually the following part which is the biggest problem, I would still recommend to switch from poly to s.

I’m not certain what (1|Experiment+Uniquerep) means, but I assume it means you have a parameter for each unique combination of Experiment and Uniquerep? There is about 3 observations per parameter, which can make each observation influential and when one observation is removed the posterior changes so much that importance sampling fails and we get high k value as warning for that.

It’s good that ppc diagnostics look ok. However, this seems to be a nice example where with enough many parameters the model is “overfitting” and the posterior predictions match the data well. LOO-PIT then reveals that if we remove an observation, predict it given all others, it is not actually so easy to predict that left out observation and the predictive distributions are biased downwards.

btw. we are wroking on improvin LOO-PIT plots which currently have many bad design choices…

I was worried poly could be causing the issues but I fit models excluding one or both of Tideheight and Brightness, and I fit a model with s() instead of poly for these two factors. All of these changes resulted in more pareto k warnings than the original models with poly terms.

The (1|Experiment+Uniquerep) are my random intercepts for experiment and replicate jar. I conducted 10 experiments on each species (20 total) and each experiment consisted of 4 replicate jars per UVR treatment.

Overfitting would make sense given my number of parameters but I am not certain I am counting my parameters correctly. I used the brms function parnames and used the number of parameters it spit out. This includes every random intercept as well as the fixed effects coefficients and scale parameters. Is this correct?

I’m also still confused on the total number of parameters, and how many observations you per each random effect. Anyway you can trust kfold for the predictive performance comparison. However, there can still be something to learn which could make the models even better. One thing you could do is plot ppc diagnostics with histograms instead of kernel density estimates. Given your observations are discrete, it is likely that kernel density estimate can hide something important. Another thing is, what if you remove that (1|Experiment+Uniquerep)? I guess the predictive performance is worse, but I’m interested how bad are ppc plots without it?

Is there a better way to calculate the number of parameters than counting up how many parameters are spit out by parnames(), which I think would be equivalent to counting up the parameters in the “Parameters” block in the Stan code?

My data are formatted as aggregated binomial outcomes, so I currently have just 5 observations per “Uniquerep” random effect, corresponding to the 5 time points at which I sampled the replicate jars. Replicate jar is nested within experiment in my experimental design, with 16 jars per experiment, so there are 16*5=80 observations per “Experiment” random effect. However, this is assuming that each row in my aggregated dataframe = 1 observation. But wouldn’t each observed event count as an observation? So if I recorded in row 1: 15 larvae swimming out of 50 total in the jar, wouldn’t that count as 50 observations? If so, there would be 250-500 observations per “Uniquerep” and 4000-8000 overvations per “Experiment.”

I fit the zinb without (1|Experiment+Uniquerep) and the number of Pareto-k warnings dropped from 52 to 6, indicating that these may be the issue. However, kfold indicated that the model without random intercepts was vastly worse (kfold ic=12,218.38 compared to 8,039.19 with (1|Experiment+Uniquerep)). The PPC density overlay plot for zinb without (1|Experiment+Uniquerep) also looks much worse, although the prop_zero plot looks good:

ppc_dens_overlay

ppc_stat prop_zero

I made new plots with pp_check(model, type="hist"), which I think is what you are referring to and they still look good to me (except for the new model without any random intercepts).

Binomial
mbrmsmooth13%20ppc_hist
Zero inflated binomial
0%20ppc_hist
Zero inflated binomial with no random intercepts
noER%20ppc_hist
Beta binomial
Mbrmsmooth20%20ppc_hist

Do you have any other recommendation for improving the model?

I’m curious as to whether disaggregating the data and fitting a bernouli model instead of my aggregated binomial model might give a different result. I’m not sure I could actually fit this model in practice since the dataset would have 121,900 rows. I just encountered a similar problem with pareto K warnings in a different dataset with a similar structure. When I disaggregated the data and fit a bernouli model, I ended up with 0 pareto k values >0.5. The aggregated binomial model had 33 pareto k values >0.7 and 4 values >1. Shouldn’t these two models be equivalent?

Thanks again,
Sam

@paul.buerkner?

With the aggregated data by default one log_lik is one row, and then in loo we leave out one aggregated observation. You need to write the model without aggregation or compute yourself non-aggergated log_lik’s to if you want it otherwise.

Does this count as (number of experiments) + (number of reps) or (number of experiments) x (number of reps)?

I repeat again that you may get Pareto-k warnings also when the model is good but very flexible. Based on the plot I would assume that binomial part in Zero inflated binomial model is not adequate and the overdispersion is modelled with the random effects. Could you try Zero inflated negative-binomial which can model overdispersion by adding just one more parameter?

That’s easy for the the zero inflated model as there is separate part for modeling just that.

In Bernoulli model you are leaving out one disaggreted observation and in aggregated binomial model you are leaving out one group of aggregated observations. If you leave out more data, the posterior changes more. You would need to decide which prediction or model part you are interested in.

@sbashevkin Please use the “Interfaces - brms” tag for brms related questions. Otherwise, I will likely overlook it (I changed the tag manually this time).

With regard to the number of parameters, parnames() won’t help you. To what kind of number of parameters are you referring to? Effective number of parameters?

It is (number of experiments) + (number of reps). So for example, Experiment 1 includes reps 1-16, Experiment 2 includes reps 17-32, etc. I have included a varying intercept for each level of Experiment and each level of replicate.

Is flexible the same as overfit? Do you mean that the model is flexible and changes easily to changes in the data because there are so many parameters?

I’m not sure how I would fit a negative binomial model since it doesn’t support a “trials” option and requires integer responses. The only way I could think to fit this model would be to calculate the percent success for each row in the dataframe and round to the nearest integer, but I’m not sure how valid that would be since it would be ignoring the sample size (total number of larvae in each jar).

Sorry about that. I thought this question would be a more general one about interpretation of loo diagnostics.

I am interested in the number of parameters that I should be comparing to the p_loo estimates, as @avehtari refers to here: A quick note what I infer from p_loo and Pareto k values where he says

I was thinking that the results of parnames() would be equivalent to the number of parameters in the Stan parameters block?

parnames() are the number of saved quantities of the model, which may be coming from the parameters, transformed parameters and generated quantities block. Also, some parameters from the parameters block may not be saved and thus do not appear in parnames().

I guess I should have suggested zero inflated beta binomial?

I thought you had counts? If the number of trials is much larger than counts, then neg-bin can be used as an accurate approximation

Overfit is difficult concept. Sometimes you need a flexible model and there is no overfit. It’s clear you need a bit more flexibility than zero inflated binomial, but it’s not clear do you need all those random effects.

Yes. Some random effects have quite few observations.

Apologies for the delayed response, I got busy with other projects.

The problem is that the number of trials is not constant.

Hmm do you or @paul.buerkner have any suggestions for how to code the zero-inflated beta-binomial as a custom response family in brms? Is it possible to modify perhaps the zero-inflated binomial distribution and substitute beta-binomial for the binomial component?

This doesn’t matter as you can include that as a offset. What matters for the approximation to be good is whether the number of trials is much larger than the observed counts.

@paul.buerkner may have easier solution, but you can generate the Stan code with `make_stancode’ and then modify that code as you like.

Look at how I implement the beta-binomial family in vignette("brms_customfamilies") and then how I implement zero-inflated models by looking at make_stancode and based on this, make you own custom family.

1 Like