How to decide on the best model?

montenegrina · December 1, 2020, 9:21pm

Hello,

I am having trouble deciding on which model to choose, here is what I did:

library("rstanarm")
library("loo")
library("bayesplot")

fit1 <-
  stan_glm(
    formula = PHENO~C1+C2+C3+C4+C5+C6+C7+C8+C9+C10+SEX+AGE+REN_INSF+mean_HBA1C+DDuration,
    data = mf,
    gaussian(link = "identity"),
    prior = normal(0, 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 5, autoscale = TRUE),
    seed = 12345
  )

loo1 <- loo(fit1, save_psis = TRUE)
print(loo1)

pdf("PSIS.pdf")
plot(loo1)
dev.off()


fit2 <-
  stan_glm(
    formula = PHENO~C10+REN_INSF+mean_HBA1C+DDuration,
    data = mf,
    gaussian(link = "identity"),
    prior = normal(0, 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 5, autoscale = TRUE),
    seed = 12345
  )

loo2 <- loo(fit2, save_psis = TRUE)

pdf("PSIS2.pdf")
plot(loo2,label_points = TRUE)
dev.off()


yrep <- posterior_predict(fit1)

pdf("LOO.pdf")
ppc_loo_pit_overlay(
  y = mf$PHENO,
  yrep = yrep,
  lw = weights(loo1$psis_object)
)
dev.off()


yrep <- posterior_predict(fit2)

pdf("LOO2.pdf")
ppc_loo_pit_overlay(
  y = mf$PHENO,
  yrep = yrep,
  lw = weights(loo1$psis_object)
)
dev.off()
 
loo_compare(loo1, loo2)
     elpd_diff se_diff
fit2  0.0       0.0   
fit1 -7.3       2.5 

> print(loo1)

Computed from 4000 by 1253 log-likelihood matrix

         Estimate    SE
elpd_loo  -2246.3  89.4
p_loo        38.8   5.8
looic      4492.6 178.9
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1251  99.8%   229       
 (0.5, 0.7]   (ok)          2   0.2%   307       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
> print(loo2)

Computed from 4000 by 1253 log-likelihood matrix

         Estimate    SE
elpd_loo  -2239.1  90.0
p_loo        29.7   4.5
looic      4478.1 180.1
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Relevant plots are in attach. It seems that judging by the Pareto k estimates model loo2 is better.

But when I look at the plots LOO1.pdf and LOO2.pdf none of them looks like it is fitting…

Can you please tell me what to do next?

Variables from the 2nd model (PHENO~C10+REN_INSF+mean_HBA1C+DDuration) come from the selection done with:

library(olsrr)
model <- lm(PHENO~C1+C2+C3+C4+C5+C6+C7+C8+C9+C10+SEX+AGE+REN_INSF+mean_HBA1C+DDuration, data = mf)
k <- ols_step_best_subset(model)[LOO2.pdf|attachment](upload://cwKk0zauUFfWvkTXzIUOxduASx6.pdf) (473.2 KB) [LOO.pdf|attachment](upload://8GnS4dCj1FBOiVzeASHWZPiXlgX.pdf) (473.2 KB) [PSIS.pdf|attachment](upload://935kcdKXxZwv8bFX7TIxkQQcKO.pdf) (31.4 KB) [PSIS2.pdf|attachment](upload://zUeUX6QqsT4l1XaQglF8g7csG9v.pdf) (31.3 KB)

Please advise,
AnaLOO.pdf (473.2 KB)LOO2.pdf (473.2 KB)PSIS.pdf (31.4 KB)PSIS2.pdf (31.3 KB)

Zhanglong · December 2, 2020, 7:17am

You may want to use loo2 rather than loo1 in this chunk for diagnosing model 2.

avehtari · December 3, 2020, 6:34pm

See this video and case studies at Model selection tutorials and talks explaining loo and Pareto k.

Pareto k is primarily a diagnostic for checking whether fast LOO computation is reliable. Use loo_compare(loo1,loo2) for comparing expected predictive performance.

montenegrina · December 3, 2020, 7:07pm

Thank you so much. There is one more thing I was wondering about and it is how my power changes with deciding to include or exclude a certain covariate?

Just briefly about my data. The phenotype for my data is the presence of Diabetic Retinopathy in subject (yes/no/NA). I would like to perform GWAS and see which variants are associated with the trait. I would also like to control for some covariates. (available are:C1+C2+C3+C4+C5+C6+C7+C8+C9+AGE+SEX+REN_INSF+mean_HBA1C+DDuration)

Some clinical covariates, such as age are not influenced by genetic factors and can be included it the model without concern of confounding due to shared genetics with the phenotype of interest, I guess? Other clinical covariates such as DDuration (Diabetes Duration) and REN_INSF (renal insufficiency) maybe related to disease status (e.g. retinopathy yes or no) and have a genetic basis themselves. In this case, including the covariate may alter the power to identify SNPs associated with phenotype via the covariate. And how does the proper choice of PCs influence all of this?

Do you by any chance know a software which does this kind of power calculation or related?

Topic		Replies	Views
Model comparison, Log-Likelihood and WAIC Modeling	1	1121	December 1, 2019
Model comparison questions (performance/accuracy evaluation) Modeling loo	10	2078	December 24, 2018
Quick question about update in rstanarm Modeling	6	410	October 12, 2020
Comparing results from lm and stan_lm rstanarm	4	927	September 18, 2017
Loo_compare() to compare models with different priors General	2	383	December 1, 2023

How to decide on the best model?

Related topics