Infinite values in logLik matrix

Hey everyone,

For some reason, the pointwise logLik matrix contains infinite values:

LL <- log_lik(fit_S)
> table(is.finite(LL))

  FALSE    TRUE 
     51 4479949 
> LL[! is.finite(LL)]
 [1] -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf
[17] -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf
[33] -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf
[49] -Inf -Inf -Inf

This makes it impossible to compute WAIC or LOO-CV, for example:

> waic(fit_S)
Error in validate_ll(x) : All input values must be finite.
>
> w <- waic(fit_S, pointwise = TRUE)
Warning message:

NA (NA%) p_waic estimates greater than 0.4. We recommend trying loo instead. 
>
> w

Computed from 8000 by 560 log-likelihood matrix

          Estimate SE
elpd_waic      NaN NA
p_waic         NaN NA
waic           NaN NA

NA (NA%) p_waic estimates greater than 0.4. We recommend trying loo instead. 
>
> table(is.nan(w$pointwise[,3]))

FALSE  TRUE 
  535    25 
> 

A value of -Inf (or any other extremely small value) seems extremely unlikely if one looks at the quantiles of the finite values:

> summary(as.numeric(LL))
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
      -Inf -1.315e-07  0.000e+00       -Inf  0.000e+00  0.000e+00 
> quantile(as.numeric(LL), c(0.005, 0.01, 0.05, 0.1))
        0.5%           1%           5%          10% 
-1.328816100 -0.775278871 -0.048196403 -0.002251526 

Therefore, I don’t know how to handle these values.
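In case it helps, here is a quick sketch for locating which observations and draws produce the -Inf entries (using the LL matrix from log_lik() above):

bad <- which(!is.finite(LL), arr.ind = TRUE)  # one row per offending entry
table(bad[, "col"])                           # which observations are affected
length(unique(bad[, "row"]))                  # how many draws are involved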

Do you have any ideas what could be the cause and how to fix it?

I have tried both the current CRAN and GitHub versions of brms and loo.
Some more details on my brmsfit:

> summary(fit_S)
 Family: categorical 
  Links: mu1 = logit; mu2 = logit 
Formula: y ~ 1 
         mu1 ~ inside + bexotic * exotic1 + bnative * native1 + bdonation * donation1
         mu2 ~ inside + bexotic * exotic2 + bnative * native2 + bdonation * donation2
         inside ~ 1 + s(t, id, bs = "fs", xt = "ps", m = c(2, 1), k = 5)
         bexotic ~ 1 + s(t, id, bs = "fs", xt = "ps", m = c(2, 1), k = 5)
         bnative ~ 1 + s(t, id, bs = "fs", xt = "ps", m = c(2, 1), k = 5)
         bdonation ~ 1 + s(t, id, bs = "fs", xt = "ps", m = c(2, 1), k = 5)
   Data: data (Number of observations: 560) 
  Draws: 2 chains, each with iter = 6000; warmup = 2000; thin = 1;
         total post-warmup draws = 8000

Smooth Terms: 
                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(inside_stid_1)       15.22     12.35     0.62    46.22 1.00     3793     3846
sds(inside_stid_2)      120.14     65.75     9.72   260.15 1.00     1899     2033
sds(bexotic_stid_1)       1.19      0.91     0.06     3.47 1.00     2809     3882
sds(bexotic_stid_2)      27.13      9.01    13.02    47.86 1.00     1752     2695
sds(bnative_stid_1)       5.37      3.12     0.44    12.64 1.01     1044     1566
sds(bnative_stid_2)      17.36     11.03     1.38    43.33 1.00     1031     1848
sds(bdonation_stid_1)     3.68      1.85     0.53     7.86 1.00     1292     1367
sds(bdonation_stid_2)    20.29      7.34     8.90    37.19 1.00     1666     2362

Population-Level Effects: 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
inside_Intercept       10.29      4.63     0.79    18.98 1.00     4581     5921
bexotic_Intercept       4.16      1.46     1.85     7.44 1.00     1751     2609
bnative_Intercept       9.04      2.68     4.55    14.86 1.00     1719     2453
bdonation_Intercept    -4.55      1.45    -7.76    -2.19 1.00     1858     2979

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

I guess the problem comes from using the default flat priors, which give the posterior a very long tail. Add proper priors (proper meaning that they have a finite integral). See the set_prior function documentation on RDocumentation. Let me know if you need help with setting the priors.
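For illustration, a minimal sketch of what such priors could look like for this model (the scales here are placeholders, not recommendations):

# sketch only: proper priors for the population-level effects and smoothing SDs;
# the scales are placeholders and need to be chosen for the application
priors <- c(
  set_prior("normal(0, 2)",  class = "b",   nlpar = "inside"),
  set_prior("normal(0, 2)",  class = "b",   nlpar = "bexotic"),
  set_prior("normal(0, 2)",  class = "b",   nlpar = "bnative"),
  set_prior("normal(0, 2)",  class = "b",   nlpar = "bdonation"),
  set_prior("normal(0, 10)", class = "sds", nlpar = "inside"),
  set_prior("normal(0, 10)", class = "sds", nlpar = "bexotic"),
  set_prior("normal(0, 10)", class = "sds", nlpar = "bnative"),
  set_prior("normal(0, 10)", class = "sds", nlpar = "bdonation")
)
# then pass prior = priors to brm() when fitting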

Also pinging @paul.buerkner, as this is likely another example where flat priors are problematic; we're collecting examples where better priors matter.


Thanks for your reply @avehtari!

We are not using flat/improper priors, but proper weakly informative priors (half-normal with a large scale) for the smoothing standard deviations:

          prior class                                 coef group resp dpar     nlpar lb ub       source
   normal(0, 5)     b                                                      bdonation               user
   normal(0, 5)     b                            Intercept                 bdonation       (vectorized)
   normal(0, 5)     b                                                        bexotic               user
   normal(0, 5)     b                            Intercept                   bexotic       (vectorized)
   normal(0, 5)     b                                                        bnative               user
   normal(0, 5)     b                            Intercept                   bnative       (vectorized)
   normal(0, 5)     b                                                         inside               user
   normal(0, 5)     b                            Intercept                    inside       (vectorized)
 normal(0, 100)   sds                                                      bdonation  0            user
 normal(0, 100)   sds s(t,id,bs="fs",xt="ps",m=c(2,1),k=5)                 bdonation  0    (vectorized)
 normal(0, 100)   sds                                                        bexotic  0            user
 normal(0, 100)   sds s(t,id,bs="fs",xt="ps",m=c(2,1),k=5)                   bexotic  0    (vectorized)
 normal(0, 100)   sds                                                        bnative  0            user
 normal(0, 100)   sds s(t,id,bs="fs",xt="ps",m=c(2,1),k=5)                   bnative  0    (vectorized)
 normal(0, 100)   sds                                                         inside  0            user
 normal(0, 100)   sds s(t,id,bs="fs",xt="ps",m=c(2,1),k=5)                    inside  0    (vectorized)

Unfortunately, there is not enough prior knowledge to justify using informative priors.

Great! This wasn’t shown in your first post, so I assumed you had the default priors, which are flat for class b.

These are super wide (normal(0, 5) for class b is also quite wide). I would expect the computation to work better with narrower priors that are still weakly informative. You could do a prior predictive simulation with narrower priors to check that they are sensible.
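For example, a rough sketch of how such a prior predictive check could be run with brms (narrower priors defined via set_prior() as above; note that update() will recompile the model):

# sketch: sample from the prior only and inspect the implied data
narrower_priors <- c(
  set_prior("normal(0, 1)",  class = "b",   nlpar = "inside"),
  set_prior("normal(0, 10)", class = "sds", nlpar = "inside")
  # ... analogous priors for bexotic, bnative, bdonation
)
fit_prior <- update(fit_S, prior = narrower_priors, sample_prior = "only")
pp_check(fit_prior, type = "bars")  # do the simulated category frequencies look plausible?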


Is there any way to fix the logLik computation without changing the priors or refitting the models? We fitted 100 models for a simulation procedure, which took about a week on a server. Changing the priors would mean refitting all models and resubmitting an application for server time.

In addition, it seems that changing the priors might solve the problem, but it is not guaranteed to, especially if the new priors are still weakly informative (i.e. still have high variance).

I also understood this post to mean that the problem is caused by a perfect fit for the majority of the observations. This could also be the source of our problem, since:
(1) Obviously, this seems more likely with binary or categorical data.
(2) Our simulation study was designed to show that our model converges to the data-generating process on average, so a perfect fit for the majority of the observations might well occur.

You could use the existing draws and reweight them with importance sampling using a new prior. First, this would help diagnose prior sensitivity and a possibly weakly informative likelihood. Second, if the new prior mostly affects the tail, you can also get a more stable loo computation.
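A rough sketch of that idea (the parameter-name pattern and the new prior scale are assumptions; adjust them to your fit):

library(loo)

draws <- as.matrix(fit_S)                                        # one row per posterior draw
sds   <- draws[, grep("^sds_", colnames(draws)), drop = FALSE]   # smoothing SDs (name pattern assumed)

# log importance ratios: narrower half-normal(0, 10) vs the original half-normal(0, 100);
# the half-normal normalizing constants cancel once the weights are normalized
log_ratios <- rowSums(dnorm(sds, 0, 10, log = TRUE) -
                      dnorm(sds, 0, 100, log = TRUE))

psis_fit <- psis(log_ratios, r_eff = NA)
psis_fit$diagnostics$pareto_k  # low k: the draws can be reliably reweighted to the new prior
w <- weights(psis_fit)         # smoothed log weights for reweighting posterior summaries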