Log likelihood > 0 for stanreg objects in rstanarm

I have several models and applied waic/loo for model comparison.
I noticed that sometimes elpd_waic > 0 and therefore waic < 0.
If my understanding is correct, shouldn't waic and looic be > 0?
So I traced back and found that the log-likelihood matrix looks kind of weird.
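For reference, this is roughly how I pull out the log-likelihood matrix and compute waic (a sketch; fit1 stands for one of my fitted models):

library(rstanarm)
library(loo)

ll <- log_lik(fit1)   # pointwise log-likelihood matrix: draws x observations (1000 x 77 here)
waic(ll)              # waic computed from that matrix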
Here's some info on the model:

Model Info:

 function:     stan_mvmer
 formula (y1): Y ~ month + I(month^2) + (month | id)
 family  (y1): gaussian [identity]
 algorithm:    sampling
 priors:       see help('prior_summary')
 sample:       1000 (posterior sample size)
 num obs:      77 (y1)
 groups:       id (25)
 runtime:      <0.1 mins

Estimates:
                                          mean    sd      2.5%    25%     50%     75%     97.5%
y1|(Intercept)                            0.067   0.030   0.009   0.049   0.067   0.086   0.125
y1|month                                  0.035   0.052  -0.070   0.001   0.036   0.071   0.139
y1|I(month^2)                             0.022   0.017  -0.012   0.011   0.022   0.034   0.056
y1|sigma                                  0.145   0.014   0.121   0.136   0.145   0.153   0.176
y1|mean_PPD                               0.180   0.024   0.128   0.165   0.179   0.196   0.224
Sigma[id:y1|(Intercept),y1|(Intercept)]   0.003   0.004   0.000   0.000   0.002   0.005   0.015
Sigma[id:y1|month,y1|(Intercept)]         0.001   0.002  -0.004   0.000   0.000   0.002   0.005
Sigma[id:y1|month,y1|month]               0.009   0.005   0.003   0.006   0.008   0.011   0.021
log-posterior                           -50.256   7.200 -65.446 -54.853 -50.005 -45.344 -37.229

Diagnostics:
                                        mcse  Rhat  n_eff
y1|(Intercept)                          0.001 1.000 1079 
y1|month                                0.002 1.001  639 
y1|I(month^2)                           0.001 1.001  782 
y1|sigma                                0.000 0.999  838 
y1|mean_PPD                             0.001 1.000  889 
Sigma[id:y1|(Intercept),y1|(Intercept)] 0.000 1.001  454 
Sigma[id:y1|month,y1|(Intercept)]       0.000 1.007  114 
Sigma[id:y1|month,y1|month]             0.000 1.007  311 
log-posterior                           0.533 1.004  183 

I ran 2 chains with 1,000 iterations for each chain.
Here's the histogram of the log likelihood of the fitted model:
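(Roughly how the histogram is produced, reusing fit1 from above:)

ll <- log_lik(fit1)   # 1000 draws x 77 observations
hist(as.vector(ll), main = "Pointwise log likelihood", xlab = "log likelihood")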

I know it is possible for the log likelihood to be > 0, but this looks weird…
I don't know the exact reason, but I tried multiplying Y by 10.
Before, the range of Y was [0.05, 1.14]; now it is [0.5, 11.4].
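Roughly what I did (a sketch; dat is my data frame, and the other settings match the first fit):

dat$Y <- dat$Y * 10   # rescale the outcome
fit1_rescaled <- stan_mvmer(Y ~ month + I(month^2) + (month | id),
                            data = dat, chains = 2, iter = 1000)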
The model info for the rescaled fit:

Model Info:

 function:     stan_mvmer
 formula (y1): Y ~ month + I(month^2) + (month | id)
 family  (y1): gaussian [identity]
 algorithm:    sampling
 priors:       see help('prior_summary')
 sample:       1000 (posterior sample size)
 num obs:      77 (y1)
 groups:       id (25)
 runtime:      <0.1 mins

Estimates:
                                          mean     sd       2.5%     25%      50%      75%      97.5% 
y1|(Intercept)                             0.663    0.304    0.071    0.471    0.661    0.862    1.271
y1|month                                   0.369    0.541   -0.670    0.014    0.381    0.750    1.394
y1|I(month^2)                              0.218    0.190   -0.149    0.091    0.212    0.347    0.597
y1|sigma                                   1.456    0.145    1.219    1.346    1.444    1.551    1.775
y1|mean_PPD                                1.794    0.231    1.368    1.641    1.805    1.936    2.273
Sigma[id:y1|(Intercept),y1|(Intercept)]    0.303    0.380    0.000    0.036    0.152    0.424    1.331
Sigma[id:y1|month,y1|(Intercept)]          0.072    0.240   -0.477   -0.033    0.067    0.199    0.551
Sigma[id:y1|month,y1|month]                0.895    0.463    0.259    0.576    0.810    1.106    2.086
log-posterior                           -230.414    7.409 -245.770 -235.059 -230.053 -225.311 -216.799

Diagnostics:
                                        mcse  Rhat  n_eff
y1|(Intercept)                          0.010 1.001  900 
y1|month                                0.022 1.002  594 
y1|I(month^2)                           0.007 1.001  691 
y1|sigma                                0.006 1.007  499 
y1|mean_PPD                             0.007 0.999 1180 
Sigma[id:y1|(Intercept),y1|(Intercept)] 0.019 1.009  384 
Sigma[id:y1|month,y1|(Intercept)]       0.023 1.019  107 
Sigma[id:y1|month,y1|month]             0.029 1.010  250 
log-posterior                           0.651 1.024  130 

And here is the log-likelihood histogram:

Now the values look more typical, I guess…
Is this multiplication, Y * 10, the right way to fix this problem?
And are the weird values of elpd_waic and waic caused by this problem?
Any help would be appreciated!

BTW, 0.05 is a kind of threshold value.
Of the 77 observations, 49 are exactly 0.05.
The Y of some subjects increases over time, but for other subjects Y just stays at 0.05.

Good question. If the likelihood is between 0 and 1 then the log-likelihood will be negative, but the likelihood is not constrained to be less than 1 so it’s possible for the log-likelihood to be positive. This means that the ELPD computed by loo can also be positive or negative.

There's a bit more info in the first entry here (it refers to elpd_loo but the same is true for elpd_waic):

elpd_loo is the Bayesian LOO estimate of the expected log pointwise predictive density (Eq 4 in VGG2017) and is a sum of N individual pointwise log predictive densities. Probability densities can be smaller or larger than 1, and thus log predictive densities can be negative or positive.
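As a concrete illustration (using the residual SDs from your two fits, roughly 0.145 and 1.45): a Gaussian density with a small sd exceeds 1 near its mean, so the pointwise log density is positive there:

dnorm(0, mean = 0, sd = 0.145)              # ~2.75, a density greater than 1
dnorm(0, mean = 0, sd = 0.145, log = TRUE)  # ~1.01, so the log density is positive
dnorm(0, mean = 0, sd = 1.45,  log = TRUE)  # ~-1.29, negative after rescaling Y by 10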


Thanks, I see. I have one more question regarding the Pareto k diagnostics.
This dataset contains 25 subjects and 77 observations. I selected 24 subjects (73 observations) as the training dataset and left one subject (4 observations) out as holdout data.
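Roughly how I set this up (a sketch; dat is the full data frame and holdout_id is the left-out subject's id, names I made up here):

train_data   <- subset(dat, id != holdout_id)   # 24 subjects, 73 observations
holdout_data <- subset(dat, id == holdout_id)   # 1 subject, 4 observations

fit_train <- stan_mvmer(Y ~ month + I(month^2) + (month | id),
                        data = train_data, chains = 2, iter = 1000)
loo(fit_train)   # prints the Pareto k diagnostics below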
For the model fitted on the training subjects:

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     45    61.6%   210       
 (0.5, 0.7]   (ok)       22    30.1%   171       
   (0.7, 1]   (bad)       5     6.8%   12        
   (1, Inf)   (very bad)  1     1.4%   6         

I assume it is generally acceptable, which means the model is not badly misspecified.
But when I specify the holdout data as newdata in the log_lik function, the per-observation results look much worse.
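A minimal sketch of the call (fit_train and holdout_data as above), followed by its output:

ll_holdout <- log_lik(fit_train, newdata = holdout_data)   # 1000 draws x 4 holdout observations
loo(ll_holdout)$pointwise                                   # per-observation estimates shown below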

     elpd_loo mcse_elpd_loo      p_loo      looic influence_pareto_k
25  -1.483282    0.04369236  0.1106539   2.966563          0.9634965
26  -2.883195    0.42590082  1.2587956   5.766391          1.1475562
27 -23.120600    0.82810545 20.8118162  46.241199          4.2885984
28 -73.703965    1.15532140 69.5372476 147.407930          9.5307370

This looks terrible.
I also tried using other subjects as holdout data, and the results were not satisfying either.
In this scenario, should I consider the model misspecified?
(Or is it that loo/waic may not give stable estimates with such a small sample size?)

Actually, I'm not sure I compared the models in the right way…
Say I have 3 models, and I trained them on 24 subjects (73 observations). The goal is to obtain good predictions for the one remaining holdout subject.
I'm thinking of comparing the models like this:

models <- list(fit1, fit2, fit3)
# evaluate each model's pointwise log-likelihood on the holdout subject only
log_lik_list <- lapply(models, log_lik, newdata = holdout_data)
# waic for each model, computed from its holdout log-likelihood matrix
waic_list <- lapply(log_lik_list, waic)
loo_compare(waic_list)

Then I would choose the best model to predict the holdout data.
I'm wondering if I should instead use

log_lik_list <- lapply(models, log_lik)   # log-likelihoods evaluated on the training data instead

Which is the better (or the right) way to compare models when the goal is to predict the holdout data?