Computation of WAIC and LOO for structured data

dhunfini · August 26, 2019, 5:36am

Hi,
The data structure which I work is the following. There are J subjects each with n observations, where J=67 and n=5. I have fitted two models to the data and would like to compare the models for predictive accuracy and model selection, using the loo() and waic() functions in R’s loo package. The models have the following specifications:

Model 1: 5 of the estimated parameters are common to all subjects and 1 parameter estimated specific to each individual
Model 2: Assume there are two groups and each group has its own 5 estimated parameters. 1 parameter estimated specific to each subject in the data.

Currently, the computed log_like matrix is an S-by-n-by-J array. However, the loo() and waic() functions require an S by n log_likelihood matrix, where S is the number of MCMC samples and n is number of data, I was wondering what would be the best approach to structure the log_like matrix from Stan to pass into these function. Would it be better to compute an S-by-J matrix or an S-by-nJ matrix as below?

 generated quantities {
             real log_lik1[n_obs, J];
             vector[n_obs*J]log_lik;

               for (j in 1:J){
                    for (n in 1:n_obs){
                       log_lik1[n,j] = lognormal_lpdf(y_hat[n,j]|log(y_new[n,j]),std);
                    }
              }
         log_lik = to_vector(log_lik1)
}

Or

generated quantities {
             real log_lik1[n_obs, J];
             vector[J]log_lik;

               for (j in 1:J){
                    for (n in 1:n_obs){
                       log_lik1[n,j] = lognormal_lpdf(y_hat[n,j]|log(y_new[n,j]),std);
                    }
                  log_like[j] = sum(log_lik1[,j]);
              }
}

Thanks very much for the help

avehtari · August 26, 2019, 6:41am

These answer to two different questions. See this video and this case study (available at
Model selection tutorials and talks)

With only n=5 observations per individual and an individual specific parameter, is it possible that PSIS in loo() and even more likely waic in waic() fail. In that case you can use K-fold-CV.

dhunfini · August 27, 2019, 5:40am

Thanks @avehtari. The links are very helpful.
Just to clarify, is the loo() function used in R different from the R code explained on page 14 of http://www.stat.columbia.edu/~gelman/research/unpublished/waic_stan ?
Am i correct to say that the R code given on page 14 of the link uses importance-sampling approximated
LOO and R’s loo() uses PSIS loo? Are these two approaches the same of different?

thanks

avehtari · August 27, 2019, 8:30am

That version is very old (the part of the link saying unpublished reveals that) and @andrewgelman should remove it. That old version used Truncated Importance Sampling.

Please use the new version available at Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC | Statistics and Computing or [1507.04544] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC which use Pareto smoothed Importance sampling.

dhunfini · August 27, 2019, 10:06pm

Thanks @avehtari for clarifying my confusion.
Another question; I read your replies in the post Incorrect recovery using LOO and WAIC fit indices - #6 by Rehab. I believe my data structure is same as that described in the post. With regard to your comment,

while in generated quantities you either

make it vector if you want to calculate leave one observation away

sum over i if you want to leave out one item

sum over j if you want to leave out one examinee

and if I structure the log_lik as an S-by-J matrix as given in the second code block above, and use loo(), does that mean I am leaving out one subject? Just asked this so that I can correctly interpret the result, in case implementing loo() is successful.

Suppose if the loo() is successful, how is the interpretation of result different from the leave-one-group-out CV explained in Section 5.3 of Cross-validation for hierarchical models?

Thanks

avehtari · August 28, 2019, 8:34am

Yes.

No difference if importance sampling in loo() works. If there is a subject specific parameter(s), it’s likely that importance sampling fails and K-fold-CV would be more robust (alternatively quadrature integration could be used, but we don’t have case study for that yet).

dhunfini · September 2, 2019, 1:45am

As predicted by @avehtari, the loo() produced the following warning message:

Computed from 6000 by 67 log-likelihood matrix

         Estimate    SE
elpd_loo  -5810.6 113.7
p_loo        63.3  11.2
looic     11621.2 227.4
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     34    50.7%   393       
 (0.5, 0.7]   (ok)       24    35.8%   61        
   (0.7, 1]   (bad)       7    10.4%   46        
   (1, Inf)   (very bad)  2     3.0%   2         
See help('pareto-k-diagnostic') for details.
Warning messages:
1: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
 
2: In log(z) : NaNs produced
3: In log(z) : NaNs produced

and waic() output suggested to use loo().
I am modifying Stan code to compute K-fold CV and is wondering what measure should i use to assess model validity, after K-fold CV runs are done? As I am not using rstanarm, how should I use the log_lik output from each fold fit to compare two models?

Also, just wondering if there is a helper function called kfold_split_group() in the loo version 2? Is it same as kfold_split_stratified()?
Thanks

avehtari · September 2, 2019, 5:52pm

Maybe one of these examples help K-fold cross-validation in Stan | DataScience+ or K-fold cross-validation in Stan | DataScience+ ?

I hope this helps Helper functions for K-fold cross-validation — kfold-helpers • loo

dhunfini · September 3, 2019, 3:17am

Thanks Dr. @avehtari.
I have another question regarding exact LOO computation explained on your page https://avehtari.github.io/modelselection/rats_kcv.html. The result of loo() for one of my model has only 3 points with Pareto k larger than 0.7. Below is the result

> print(loo_2)

Computed from 6000 by 67 log-likelihood matrix

         Estimate    SE
elpd_loo  -4977.9 126.1
p_loo        43.5   7.2
looic      9955.9 252.1
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     41    61.2%   374       
 (0.5, 0.7]   (ok)       23    34.3%   139       
   (0.7, 1]   (bad)       3     4.5%   45        
   (1, Inf)   (very bad)  0     0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

Then I tried the following intending to compute exact LOO for the failing folds:

(loo_2 <- loo(log_lik, r_eff = rel_n_eff, cores = 1, k_threshold=0.7))

which gives me the same result as above .

Computed from 6000 by 67 log-likelihood matrix

         Estimate    SE
elpd_loo  -4977.9 126.1
p_loo        43.5   7.2
looic      9955.9 252.1
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     41    61.2%   374       
 (0.5, 0.7]   (ok)       23    34.3%   139       
   (0.7, 1]   (bad)       3     4.5%   45        
   (1, Inf)   (very bad)  0     0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

Just wondering why the two results are same? Note: the loo() in the second would’t run without specifying r_eff and cores arguments.

Thanks

avehtari · September 4, 2019, 8:24am

dhunfini:

Then I tried the following intending to compute exact LOO for the failing folds:
(loo_2 <- loo(log_lik, r_eff = rel_n_eff, cores = 1, k_threshold=0.7))
which gives me the same result as above .

Automatic refit with k_threshold argument works only for rstanarm (I guess brms, too) as then we have enough information what data needs to be passed for Stan and how to compute lpd for left-out-observation. In your case the k_threshold argument is ignored and the result is thus same.

When using rstan, you need to write a version of your Stan code which is the same you would need to do for k-fold-CV. You then run only those 3 leave-one-out-folds and insert the results to loo object. We don’t have a case study showing how to do this.

We have a paper Pushing the Limits of Importance Sampling through Iterative Moment Matching on a method which would probably help and which could be run automatically in your case, but it’s not yet in loo package. We’ll probably get it in loo package at latest in October, but that might be too long wait for you.

dhunfini · September 5, 2019, 10:24pm

Thanks for so many helpful comments @avehtari.

I would like to clarify another aspect of cross validation. On page 179 of the book Bayesian Data Analysis, by Gelman et al. (2013) 3rd ed. various deviance and LOO-CV results are given for the 8 school examples. Its mentioned that LOO-CV for the no pooling case is not given as its not possible to predict for the held-out school.
Would that be because parameter theta is independently estimated for each school and thus each school is assumed independent?
I would like to clarify this as in my model, one of the parameter is subject-specific and there is no sharing between subjects for this parameter. Thus, reading through the 8 school example, I question myself if CV (either LOO or K-fold) is applicable for my model.

Thanks

avehtari · September 6, 2019, 8:46am

Are you interested in generalization of the results for subjects not in the data? How would you choose the subject-specific parameter for a new subject who was not in the data? If you don’t have sharing of information between subjects, do you have a fixed prior for that parameter?

dhunfini · September 9, 2019, 3:55am

I do have a fixed prior for the subject specific parameter, and wanted to generalise results for subjects not in the data. I am not entirely certain if its appropriate to perform CV, if having the subject specific parameter with a fixed prior. That’s why asked the question.

Thanks

avehtari · September 9, 2019, 8:28am

With fixed proper prior the predictive distribution for an new subject is also proper and you can use CV. The wording in BDA3 could be a bit more specific. In BDA3 the separate model was considered to be the extreme case of population prior having infinite variance. In your case you have shared prior but with fixed parameters for the population.

dhunfini · September 9, 2019, 10:35pm

Thanks Dr @avehtari. That clarifies my confusion.

Topic		Replies	Views
Fit statistics — troubles calculating log likelihood? Modeling specification , loo	6	2044	October 16, 2017
WAIC for any stanfit object Modeling	6	3280	September 2, 2019
Model comparison, Log-Likelihood and WAIC Modeling	1	1116	December 1, 2019
Extract log-likelihood from large size stanfit using the Loo package General loo	4	2282	June 7, 2018
Log likelihood > 0 for stanreg objects in rstanarm Modeling loo , rstanarm	4	654	October 2, 2020

Computation of WAIC and LOO for structured data

Related topics