How to decide on the best model?

Hello,

I am having trouble deciding on which model to choose, here is what I did:

library("rstanarm")
library("loo")
library("bayesplot")

fit1 <-
  stan_glm(
    formula = PHENO~C1+C2+C3+C4+C5+C6+C7+C8+C9+C10+SEX+AGE+REN_INSF+mean_HBA1C+DDuration,
    data = mf,
    gaussian(link = "identity"),
    prior = normal(0, 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 5, autoscale = TRUE),
    seed = 12345
  )

loo1 <- loo(fit1, save_psis = TRUE)
print(loo1)

pdf("PSIS.pdf")
plot(loo1)
dev.off()


fit2 <-
  stan_glm(
    formula = PHENO~C10+REN_INSF+mean_HBA1C+DDuration,
    data = mf,
    gaussian(link = "identity"),
    prior = normal(0, 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 5, autoscale = TRUE),
    seed = 12345
  )

loo2 <- loo(fit2, save_psis = TRUE)

pdf("PSIS2.pdf")
plot(loo2,label_points = TRUE)
dev.off()


yrep <- posterior_predict(fit1)

pdf("LOO.pdf")
ppc_loo_pit_overlay(
  y = mf$PHENO,
  yrep = yrep,
  lw = weights(loo1$psis_object)
)
dev.off()


yrep <- posterior_predict(fit2)

pdf("LOO2.pdf")
ppc_loo_pit_overlay(
  y = mf$PHENO,
  yrep = yrep,
  lw = weights(loo1$psis_object)
)
dev.off()
 
loo_compare(loo1, loo2)
     elpd_diff se_diff
fit2  0.0       0.0   
fit1 -7.3       2.5 

> print(loo1)

Computed from 4000 by 1253 log-likelihood matrix

         Estimate    SE
elpd_loo  -2246.3  89.4
p_loo        38.8   5.8
looic      4492.6 178.9
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1251  99.8%   229       
 (0.5, 0.7]   (ok)          2   0.2%   307       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
> print(loo2)

Computed from 4000 by 1253 log-likelihood matrix

         Estimate    SE
elpd_loo  -2239.1  90.0
p_loo        29.7   4.5
looic      4478.1 180.1
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Relevant plots are in attach. It seems that judging by the Pareto k estimates model loo2 is better.

But when I look at the plots LOO1.pdf and LOO2.pdf none of them looks like it is fitting…

Can you please tell me what to do next?

Variables from the 2nd model (PHENO~C10+REN_INSF+mean_HBA1C+DDuration) come from the selection done with:

library(olsrr)
model <- lm(PHENO~C1+C2+C3+C4+C5+C6+C7+C8+C9+C10+SEX+AGE+REN_INSF+mean_HBA1C+DDuration, data = mf)
k <- ols_step_best_subset(model)[LOO2.pdf|attachment](upload://cwKk0zauUFfWvkTXzIUOxduASx6.pdf) (473.2 KB) [LOO.pdf|attachment](upload://8GnS4dCj1FBOiVzeASHWZPiXlgX.pdf) (473.2 KB) [PSIS.pdf|attachment](upload://935kcdKXxZwv8bFX7TIxkQQcKO.pdf) (31.4 KB) [PSIS2.pdf|attachment](upload://zUeUX6QqsT4l1XaQglF8g7csG9v.pdf) (31.3 KB) 

Please advise,
AnaLOO.pdf (473.2 KB)LOO2.pdf (473.2 KB)PSIS.pdf (31.4 KB)PSIS2.pdf (31.3 KB)

You may want to use loo2 rather than loo1 in this chunk for diagnosing model 2.

See this video and case studies at Model selection tutorials and talks explaining loo and Pareto k.

Pareto k is primarily a diagnostic for checking whether fast LOO computation is reliable. Use loo_compare(loo1,loo2) for comparing expected predictive performance.

2 Likes

Thank you so much. There is one more thing I was wondering about and it is how my power changes with deciding to include or exclude a certain covariate?

Just briefly about my data. The phenotype for my data is the presence of Diabetic Retinopathy in subject (yes/no/NA). I would like to perform GWAS and see which variants are associated with the trait. I would also like to control for some covariates. (available are:C1+C2+C3+C4+C5+C6+C7+C8+C9+AGE+SEX+REN_INSF+mean_HBA1C+DDuration)

Some clinical covariates, such as age are not influenced by genetic factors and can be included it the model without concern of confounding due to shared genetics with the phenotype of interest, I guess? Other clinical covariates such as DDuration (Diabetes Duration) and REN_INSF (renal insufficiency) maybe related to disease status (e.g. retinopathy yes or no) and have a genetic basis themselves. In this case, including the covariate may alter the power to identify SNPs associated with phenotype via the covariate. And how does the proper choice of PCs influence all of this?

Do you by any chance know a software which does this kind of power calculation or related?