# Grouped kfold returns NaN

Hi,
I’ve successfully fitted two brms models in which I’m modelling canopy area as a function of different non-linear functions (an example is shown below). Each model has converged, with few (if any) divergent transitions, and each parameter has good effective sample sizes.

However, when I try to do a grouped kfold cross-validation I get NaNs (see below), and I’m not sure why.

``````
packageVersion("brms")
[1] ‘2.18.0’

packageVersion("loo")
[1] ‘2.5.1’

packageVersion("cmdstanr")
[1] ‘0.5.3’
``````

An example of the model I’m fitting looks like this:

``````
out <- brms::brm(
  bf(
    Canopy ~ log(Asym / (1 + exp(-beta * (Growth_years - Tmax)))),
    beta ~ 1 + (1 | Scientific) + street_tree,
    Tmax ~ 1 + (1 | Scientific) + street_tree,
    Asym ~ 1 + (1 | Scientific) + street_tree,
    nl = TRUE),
  prior =
    prior(normal(200, 100), lb = 0.001, nlpar = "Asym") +
    prior(normal(0.01, 100), lb = 0.001, nlpar = "Tmax") +
    prior(normal(0, 1), lb = 0, nlpar = "beta"),
  control = list(adapt_delta = 0.99, max_treedepth = 15),
  family = lognormal(),
  backend = "cmdstanr",
  init = 0,
  data = eucalyptus_growth_dat,
  chains = 3,
  cores = 3)
``````

Here, Canopy is the observed canopy area of an individual tree, street_tree is a binary variable, and Growth_years is the estimated age of the individual.

As stated above, the model seems to fit with minimal issues (apart from taking an age to finish sampling; there are over 40,000 rows of data), and the parameter estimates look biologically plausible.
What I want to do is compare the predictive capacity of different model variants using a grouped kfold approach, where the interest is in assessing how well the model predicts withheld random-effect groups. To do that I specify the following:

``````
brms::kfold(x = out,
            K = 5,
            folds = "grouped",
            group = "Scientific")
``````

Which, if I am correct, is equivalent to doing something like:

``````
brms::kfold(x = eucalypt_negexp_model,
            K = 5,
            folds = loo::kfold_split_grouped(K = 5, x = eucalyptus_growth_dat$Scientific))
``````
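As an aside for readers unfamiliar with the grouped split, here is a minimal sketch with mock data (the species labels are made up) showing that `loo::kfold_split_grouped()` assigns every row of a given group to the same fold, so whole groups are withheld:

```r
library(loo)

# Mock grouping variable standing in for Scientific (made-up labels)
species <- rep(c("sp_a", "sp_b", "sp_c", "sp_d", "sp_e", "sp_f"),
               times = c(5, 2, 9, 1, 4, 3))

folds <- kfold_split_grouped(K = 3, x = species)

# Each species should map to exactly one fold
tapply(folds, species, function(f) length(unique(f)))
```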

However, when I run this type of cross-validation, I get the following outcome with no error or warning messages:

``````
Based on 5-fold cross-validation

           Estimate SE
elpd_kfold      NaN NA
p_kfold         NaN NA
kfoldic         NaN NA
``````

I’m not sure what is causing this… though it might be related to the severe unevenness of the folds.

e.g.

``````
    1     2     3     4     5
 3084  2504 14119   986 20907
``````

I’ve looked at each of these folds and there appears to be reasonable variability in the other input variables (e.g. Growth_years and street_tree). I’ve even managed to fit each of these fold subsets individually without running into convergence issues. I’m assuming that under the hood there is a problem with the level of unevenness among folds, though when I’ve tried to replicate this unevenness in mock datasets I haven’t run into this issue. Unfortunately I can’t share the data. Any tips or advice?
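For concreteness, the per-fold checks described above were along these lines, sketched here with a mock data frame standing in for the real dataset (column names as in the model; `dat` and its contents are made up):

```r
set.seed(1)

# Mock stand-in for eucalyptus_growth_dat (made-up values)
dat <- data.frame(
  Scientific   = sample(paste0("sp_", 1:10), 500, replace = TRUE),
  Growth_years = runif(500, 1, 80),
  street_tree  = rbinom(500, 1, 0.4)
)

folds <- loo::kfold_split_grouped(K = 5, x = dat$Scientific)

table(folds)                              # fold sizes (uneven, as above)
tapply(dat$Growth_years, folds, summary)  # spread of tree age per fold
tapply(dat$street_tree, folds, mean)      # street-tree proportion per fold
```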

Oh, and in case you’re interested, here is part of the pointwise output:

``````
        elpd_kfold       p_kfold   kfoldic
[1,]  -4.539832  3.795695e-01  9.079664
[2,]  -4.605903  3.846158e-01  9.211806
[3,]  -5.111670  2.462674e-01 10.223340
[4,]  -5.059898  3.184598e-01 10.119795
[5,]  -5.220611  2.450830e-01 10.441223
[6,]  -5.311285  2.091175e-01 10.622569
[7,]  -4.737715  4.543159e-01  9.475429
[8,]  -4.828914  4.463188e-01  9.657829
[9,]  -4.673482  4.691005e-01  9.346964
[10,]  -5.298323  2.176782e-01 10.596645
[11,]  -4.792010  5.030404e-01  9.584020
[12,]  -4.881061  4.830266e-01  9.762122
[13,]  -4.788471  5.238636e-01  9.576942
[14,]  -4.787594  5.329466e-01  9.575187
[15,]  -4.811155  5.378390e-01  9.622310
[16,]  -4.788929  5.541308e-01  9.577859
[17,]  -9.743180 -4.613748e+00 19.486359
[18,] -10.342949 -4.544968e+00 20.685898
[19,]  -4.615625 -3.565669e-01  9.231250
[20,]  -3.992905  3.770351e-01  7.985810
[21,]  -5.231787  2.939448e-01 10.463575
[22,]  -5.240638  1.869140e-01 10.481277
[23,]  -5.210137  2.735241e-01 10.420273
[24,]  -4.907122  5.067121e-01  9.814244
[25,]  -5.121841  3.715039e-01 10.243681
[26,]  -4.253101  4.649373e-01  8.506202
[27,]  -4.337621  5.543479e-01  8.675243
[28,]  -4.392189  4.400904e-01  8.784379
[29,]  -4.517743  5.398474e-01  9.035485
[30,]  -3.475455  2.589708e-01  6.950910
[31,]  -4.020786  5.061724e-02  8.041572
[32,]  -4.666528  3.760590e-01  9.333056
[33,]  -4.961022  3.242280e-01  9.922045
[34,]  -4.521691  3.749092e-01  9.043382
[35,]  -4.541400  3.485298e-01  9.082800
[36,]  -4.621729  4.402935e-01  9.243458
[37,]  -5.096600  3.269840e-01 10.193199
[38,]  -4.766069  4.856513e-01  9.532138
[39,]  -5.971669 -5.074343e-01 11.943338
[40,]  -5.367172  7.109470e-02 10.734345
[41,]  -4.992893  3.610960e-01  9.985787
[42,]  -4.616530  1.087240e-01  9.233060
[43,]  -5.919764 -1.681431e-01 11.839528
[44,]  -6.410071 -4.884728e-01 12.820142
[45,]  -6.070530 -2.773045e-01 12.141059
[46,]  -6.957099 -8.632811e-01 13.914197
[47,]        NaN           NaN       NaN
[48,]        NaN           NaN       NaN
[49,]        NaN           NaN       NaN
[50,]        NaN           NaN       NaN
[51,]        NaN           NaN       NaN
``````

Hi all,
So I’ve rerun the model and saved the fold fits using `save_fits = TRUE`:

``````
eucalyptus_growth_dat$fold <- loo::kfold_split_grouped(K = 5, x = eucalyptus_growth_dat$Scientific)

test <- brms::kfold(x = eucalypt_negexp_model,
                    folds = eucalyptus_growth_dat$fold,
                    chains = 3,
                    cores = 3,
                    save_fits = TRUE)
``````

and have made a little progress in identifying the error. I am now getting the following warning message:

“NAs were found in the log-likelihood. Possibly this is because some of your responses contain NAs. If you use ‘mi’ terms, try setting ‘resp’ to those response variables without missing values. Alternatively, use ‘newdata’ to predict only complete cases.”

I’ve double-checked the data, and there are no NAs in either the response or the predictors used.
Looking at the code, the warning appears to be associated with the `log_lik` function.

I’ve applied the `log_lik` function to each fitted fold, e.g.

Calculate `log_lik` for each fold’s training data:

``````
sapply(1:5, function(x) {
  range(log_lik(test$fits[[x]]))
})
            [,1]       [,2]       [,3]       [,4]       [,5]
[1,] -22.5837527 -21.963125 -23.606130 -21.984757 -18.221019
[2,]  -0.9003099  -1.080946  -1.084199  -1.088226  -1.448915
``````

No NaNs…

Calculate `log_lik` for the withheld data of each fold:

``````
sapply(1:5, function(x) {
  range(log_lik(object = test$fits[[x]],
                newdata = subset(eucalyptus_growth_dat, fold == x),
                allow_new_levels = TRUE))
})
     [,1] [,2]        [,3]        [,4]        [,5]
[1,]  NaN  NaN -53.4418859 -30.8435437 -35.5653738
[2,]  NaN  NaN   0.4899595  -0.9395137  -0.5251746

Warning messages:
1: In log(Asym/(1 + exp(-beta * (Growth_years - Tmax)))) : NaNs produced
2: NAs were found in the log-likelihood. Possibly this is because some of your responses contain NAs. If you use 'mi' terms, try setting 'resp' to those response variables without missing values. Alternatively, use 'newdata' to predict only complete cases.
``````

Interestingly, each time I rerun this I get slightly different answers:

e.g. (repeat run):

``````
sapply(1:5, function(x) {
  range(log_lik(object = test$fits[[x]],
                newdata = subset(eucalyptus_growth_dat, fold == x),
                allow_new_levels = TRUE))
})

     [,1]        [,2]       [,3]        [,4]       [,5]
[1,]  NaN -31.5581745 -58.196591 -30.9568028 -37.215003
[2,]  NaN  -0.7810369   1.059761  -0.8951281  -0.478189
``````

In fact, if I run it enough times, all folds successfully return a range of log-likelihoods with no NaNs. I’m assuming this comes about because of the sampling required to account for uncertainty in the new levels… though if I’ve understood correctly, this is done by sampling from the full posterior, which itself doesn’t contain NAs. So I’m at a loss as to what is happening here. Possibly an underflow problem?

Your post is very clear and this is very likely the correct conclusion. This is likely to happen if your prior is weak and the groups differ so much that the learnt population distribution is also weak. You could try more informative priors.

Also, it would be interesting to know what those extreme values are that cause the likelihood computation to produce NAs. Note that in brms the likelihood computations for cross-validation are done in R, and not in Stan code, so there can be different numerical failures.


Hmmm…
That is harder to determine, as each time I rerun the `log_lik` function different observations appear to be returning the NaNs.

e.g.

``````
which(is.nan(apply(brms::log_lik(test$fits[[1]],
                                 newdata = subset(eucalyptus_growth_dat, fold == 1),
                                 allow_new_levels = TRUE),
                   2, mean)))

# Observations containing NaN
[1] 2384 2699 2700 2701 2702 2703 2704 2705 2706
``````

If I run it again I get a different set of observations:

``````
which(is.nan(apply(brms::log_lik(test$fits[[1]],
                                 newdata = subset(eucalyptus_growth_dat, fold == 1),
                                 allow_new_levels = TRUE),
                   2, mean)))

# Observations containing NaN
[1] 3733 3734 3735 3736 3737 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 9270 9271
``````

Looking at the underlying data, there doesn’t appear to be anything extreme in the response or predictor variables. Running this multiple times, I can get anywhere from zero NaNs to 20-odd, and almost never the same observations. So I’m really unsure why this is happening.

Here is an example of a fold fit summary that appears to be susceptible to these NaNs.

``````
print(summary(test$fits[[1]]), digits=4)
Family: lognormal
Links: mu = identity; sigma = identity
Formula: Canopy ~ log(Asym * (1 - exp(-beta * Growth_years)))
beta ~ 1 + (1 | Scientific) + street_tree
Asym ~ 1 + (1 | Scientific) + street_tree
Data: .x4 (Number of observations: 29263)
Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 3000

Group-Level Effects:
~Scientific (Number of levels: 50)
                   Estimate Est.Error l-95% CI u-95% CI   Rhat Bulk_ESS Tail_ESS
sd(beta_Intercept)   0.0737    0.0138   0.0511   0.1042 1.0007     1054     1878
sd(Asym_Intercept)  75.7784   10.5449  57.7407 100.4123 1.0010      675      867

Population-Level Effects:
                 Estimate Est.Error l-95% CI u-95% CI   Rhat Bulk_ESS Tail_ESS
beta_Intercept     0.1507    0.0184   0.1175   0.1901 1.0010      661     1380
beta_street_tree   0.0009    0.0018   0.0000   0.0074 1.0018      779      354
Asym_Intercept   113.1552   13.2489  88.9815 141.6632 1.0011      366      686
Asym_street_tree   0.1190    0.1153   0.0034   0.4187 1.0019     3189     1406

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI   Rhat Bulk_ESS Tail_ESS
sigma   0.9322    0.0041   0.9242   0.9403 1.0019     8028     1722

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
``````

Note: the above is a slightly different (simpler) formulation than the model shown earlier, but it suffers from the same problem.

My priors are weak. I’ll consider strengthening them.
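For what it’s worth, here is a back-of-the-envelope sketch of the suspected mechanism, using rounded values from the fold summary above (illustrative only; the real new-level draws combine a posterior intercept draw with a fresh group-level offset per draw): when sd(Asym_Intercept) is large relative to Asym_Intercept, sampling a new level of Scientific can produce a negative Asym, and `log()` of a negative argument is NaN.

```r
set.seed(1)

# Rounded values from the fold summary above
Asym_mean <- 113  # Asym_Intercept
Asym_sd   <- 76   # sd(Asym_Intercept)

# Crude stand-in for draws of Asym at a *new* level of Scientific
Asym_new <- rnorm(3000, Asym_mean, Asym_sd)

mean(Asym_new < 0)    # a non-trivial fraction of draws is negative
range(log(Asym_new))  # contains NaN, matching the warning above
```

This would also explain why different observations fail on each rerun: the new-level offsets are re-drawn every time `log_lik` is called.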

Resolved this.
The issue was that the posteriors of the non-linear parameters (e.g. Asym and beta) contained negative samples for the intercept and street effects. I thought using positive half-normal priors on these parameters would have enforced positive estimates, but I think because they were so weak, and because there were few “street trees” for some species (i.e. levels of Scientific), these negatives arose from the model not being exposed to those situations.

Anyway, the solution was to use the sub-models to estimate the non-linear parameters on the log scale and then exponentiate them in the primary model (see below). It also involved using tighter priors, as suggested by Aki.

``````
brms::brm(
  bf(
    Canopy ~ log(exp(Asym) / (1 + exp(-exp(beta) * (Growth_years - exp(Tmax))))),
    beta ~ 1 + (1 | Scientific),
    Tmax ~ 1 + (1 | Scientific),
    Asym ~ 1 + (1 | Scientific),
    nl = TRUE),
  prior =
    prior(normal(0, 4), nlpar = "Asym") +
    prior(normal(0, 2.5), nlpar = "Tmax") +
    prior(normal(0, 1), nlpar = "beta"),
  family = lognormal(),
  backend = "cmdstanr",
  data = eucalyptus_growth_dat,
  chains = 3,
  cores = 3)
``````
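A quick sketch of why this reparameterisation fixes the NaNs: whatever real value the sub-model produces on the log scale (including a newly sampled group-level offset; the offset below is hypothetical), `exp()` maps it to a strictly positive number, so the outer `log()` never sees a negative argument.

```r
set.seed(1)

# Log-scale draws under the new normal(0, 4) prior, plus a hypothetical
# group-level offset for a new species
Asym_log <- rnorm(3000, 0, 4) + rnorm(3000, 0, 0.5)

all(exp(Asym_log) > 0)  # TRUE: exp() guarantees a positive Asym
```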