Debugging (trace) with loo functions

loo

#1

Hello,
When working with stanreg objects in R, and specifically looking at functions in loo, is there a reason why debugging functions such as ‘trace’ shouldn’t work as wrappers?

I’ve been trying to figure out why kfold() is generating a series of very general errors (e.g. ‘eval(predvars, data, env)’), but it’s shooting in the dark without knowing what part of the kfold() function is throwing the actual error. I’m working on a reproducible version of my issue (my data is confidential and has PII), but right now I’m just trying to figure out why I could generate errors but not use traceback to determine what part of kfold has the error. Thanks!


#2

Can you post the output of traceback() after the error happens?


#3

Thanks for the quick response. Apologies because I was clearly having brain flatulance and having trouble with trace() and then misusing traceback().

Below is the code which is throwing “Error in eval(predvars, data, env) : object ‘EDPI’ not found” and the associated traceback().

fit1a <- stan_gamm4(ASESaffected_total_scr ~ te(pfnc_wt, pnit_wt) + s(ases_days_since) +
GENDER + ageyr ,
family = gaussian(), data = as.data.frame(complete_ASESdat_wt), iter= 4000, adapt_delta = 0.99,
random = ~(1 |EDPI))

dat <- fit1a[[‘data’]]
folds <- kfold_split_stratified(K=10, x = dat$EDPI)
kfold(fit1a, K=10, folds = folds)

traceback()

19: eval(predvars, data, env)
18: eval(predvars, data, env)
17: model.frame.default(data = data, subset = subset, weights = weights,
na.action = na.action, drop.unused.levels = TRUE, formula = ASESaffected_total_scr ~

16: stats::model.frame(data = data, subset = subset, weights = weights,
na.action = na.action, drop.unused.levels = TRUE, formula = ASESaffected_total_scr ~

15: eval(mf, parent.frame())
14: eval(mf, parent.frame())
13: lme4::lFormula(formula = as.formula(form), data = data, subset = subset,
weights = weights, na.action = na.action, control = make_glmerControl())
12: eval(mc, parent.frame())
11: eval(mc, parent.frame())
10: lme4::glFormula(as.formula(form), data, family = gaussian, subset,
weights, na.action, control = make_glmerControl())
9: stan_gamm4(formula = ASESaffected_total_scr ~ te(pfnc_wt, pnit_wt) +
s(ases_days_since) + GENDER + ageyr, random = ~(1 | EDPI),

8: eval(fit_k_call)
7: eval(fit_k_call)
6: eval(expr, pf)
5: eval(expr, pf)
4: withVisible(eval(expr, pf))
3: evalVis(expr)
2: capture.output(fit_k <- eval(fit_k_call))
1: kfold(fit1a, K = 10, folds = folds)

Am I mis-specifying somewhere?


#4

Can you re-create your problem with any public example? I was able to do

data(Orthodont, package = "nlme4")
test <- stan_gamm4(distance ~ s(age, k = 2), data = Orthodont) 
library(loo)
folds <- kfold_split_stratified(K = 2, Orthodont$Sex)
test2 <- kfold(test, K = 2, folds = folds)

and it seemed to work apart from warnings about divergent transitions.

I think the problem may be that you are stratifying by the variable used to allow the intercept to vary by, in which case there is no variation on it in the stratified samples. If that is the problem, you could call

fit1a <- stan_gamm4(ASESaffected_total_scr ~ te(pfnc_wt, pnit_wt) + s(ases_days_since) + GENDER + ageyr ,
family = gaussian(), data = as.data.frame(complete_ASESdat_wt), iter= 4000, adapt_delta = 0.99)

without the random term and then call kfold_split_stratified with K equal to the number of levels of EDPI, followed by kfold.


#5

I really can’t alter the structure of the model itself as you suggest, for theoretical reasons it should be nested with at least a two-level model. This is actually a simplified model before I work with higher-order models. I’m also assessing these stan models with the ‘standard’ gamm4 models I’ve already built.

I’m not re-creating the exact error, but I suspect they may be related. Here is a worked example:

N <- 300 #Number of students per school
sigma <- 200

x1 <- runif(N, 10, 40)
x2 <- runif(N, 25, 55)
x3 <- runif(N, 40, 70)
x4 <- runif(N, 55, 85)
x5 <- runif(N, 70, 100)

y1 <- 20 + 0 * x1 + rnorm(N, 0, sigma)
y2 <- 40 + 10 * x2 + rnorm(N, 0, sigma)
y3 <- 60 + 20 * x3 + rnorm(N, 0, sigma)
y4 <- 80 + 30 * x4 + rnorm(N, 0, sigma)
y5 <- 100 + 40 * x5 + rnorm(N, 0, sigma)

ID <- rep(LETTERS[1:20], each = N)

test <- data.frame(SES = c(x1, x2, x3, x4, x5),
Score = c(y1, y2, y3, y4, y5), ID = ID)

HLM0 <- stan_gamm4(Score ~ s(SES), family = gaussian(), random = ~(1 | ID), data = test)

strat.test <- kfold_split_stratified(K=10, x= test$ID)
kfold(HLM0, K=10, folds = strat.test)

I’ve been unable to find a worked example of a multilevel model which uses kfold(), which would help with my doing it correctly. I can always write my own code to do cross-validation, I just assumed the built in loo functions would be more efficient.


#6

If you K-fold by the ID variable, then it is not going to have any variation within each fold to estimate the ~(1 | ID) terms (or their variance) with. The kfold_split_stratified variable is more intended to be used on a variable constructed to ensure that each of the folds has a sufficiently large number of observations on all levels of ID.