I tinkered a little now and tried your solution. The good news is that I found a way to use the loo_subsample() function for an exponential model. The bad news is that probably the solution does not scale well, and I could not manage to use the rstanarm::ll_fun() function.
The first step is fitting the model based on some simulated data (here I just use one single treatment covariate, but of course in practice dozens to hundreds of covariates are more realistic, I omitted some code here, but the actual simulation used 50 covariates out of which 10 are relevant).
# Step 1: Load rstanarm in dev_mode
devtools::dev_mode(on=T)
library(rstanarm)
# Step 2: Simulate treatment covariate
library(simsurv)
set.seed(999111)
Nobs <- 1000
covs <- data.frame(id = 1:Nobs,
trt = rbinom(Nobs,1,0.5))
dat <- simsurv(dist = "exp",
lambdas = 0.1,
betas = c(trt = -0.5),
x = covs,
maxt = 10)
head(dat)
# Step 3: Fit expontial model to data
stanmodel <- stan_surv(
formula = Surv(eventtime, status) ~ trt,
data = dat,
basehaz = "exp",
chains = 2,
cores = 4,
seed = 42,
iter = 2000
)
Now, first to the solution I got working: I changed the log-likelihood function used by the loo_subsampling() function, so that it basically searches the i-th row data_i in the data frame dat, which is declared globally outside of the functions scope.
This search is probably quite inefficient, but sidesteps the problem of having no analytic access to the log-likelihood by simply using the log_lik() function in rstanarm like you proposed via the newdata argument.
library(dplyr)
# Log-likelihood function for MCMC posterior model
llfun_stanmodel <- function(data_i, draws) {
for(i in 1:nrow(dat)){
s = sum((dat[i,] == data_i)[1,])
if (s==ncol(dat)){
data_i_index = dat[i,]$id
}
}
rstanarm::log_lik(stanmodel, newdata = dat[data_i_index,])
}
After that, I simply extract the posterior draws of the fitted model, check if the loo_i() function used in loo_subsampling() works as desired, and run the subsampling function for LOO-CV:
# Exact LOO-CV with MCMC samples
parameter_draws_mcmc <- as.matrix(stanmodel)
# check loo_i function
library("loo")
loo_i(1, llfun_stanmodel, data = dat, draws = parameter_draws_mcmc)
set.seed(4711)
start.time <- Sys.time()
loo_ss_mcmc <-
loo_subsample(
llfun_stanmodel,
draws = parameter_draws_mcmc,
data = dat,
observations = 100 # take a subsample of size 100
)
print(loo_ss_mcmc)
end.time <- Sys.time()
time.taken <- end.time - start.time
For 1000 observations and 50 covariates running the loo_subsample function took still about 20 minutes on my machine for 500 subsamples, and about 17 minutes for 100 subsamples. I think maybe this is because of the inefficient implementation of the log-likelihood which has to run a search everytime. Any suggestions how to speed up my modified log-likelihood?
Now to the problem with this solution: It does not work with approximate posterior inference like Laplace or variational inference. When trying to modify the log-likelihood to compute the log-likelihood based on a Laplace or variational inference posterior, R throws an error because the log_lik() function is only implemented for precise MCMC posteriors:
# Meanfield variational inference approximation
stanmodel_mf <- stan_surv(
formula = Surv(eventtime, status) ~ trt,
algorithm = "meanfield",
data = dat,
basehaz = "exp",
seed = 42,
iter = 4000
)
# Log-likelihood function for meanfield variational inference posterior model
llfun_stanmodel_mf <- function(data_i, draws) {
for(i in 1:nrow(dat)){
s = sum((dat[i,] == data_i)[1,])
if (s==ncol(dat)){
data_i_index = dat[i,]$id
}
}
rstanarm::log_lik(stanmodel_mf, newdata = dat[data_i_index,]) # note: stanmodel_mf instead of stanmodel now
}
# Exact LOO-CV with meanfield variational inference samples
parameter_draws_mf <- as.matrix(stanmodel_mf, pars = "trt")
# check loo_i function
loo_i(1, llfun_stanmodel_mf, data = dat, draws = parameter_draws_mf)
The loo_i() function already does not work, and throws the error. I think there is no easy solution to this, because there is no implementation of log_lik() for approximate posteriors.
Now to your solution via the rstanarm::ll_fun(object) function. I copied and pasted the whole code you linked into my script. When I try to run the function, I get some problems. First, some of the functions are not known, e.g. the validate_stanreg_object(x) function call inside rstanarm::ll_fun(object) is not known. How can I include these functions?
However, I commented the function out and was able to get the likelihood function then. Problematically, there are of course various parameters the function uses. Without knowledge what these parameters are and how they relate to each other, there is few hope of reverse engineering the likelihood via this method. For some parameters it is easy to infer what they mean, for example there’s a variable eta which is the linear combination of predictors. For others, it is (for me) very unclear what they are.
A quick sidenote: Of course, writing down the likelihood function for an exponential survival model or a weibull model should be no problem whatsoever, but in particular for the more complex models like M-spline models or B-spline models this quickly becomes difficult.
Another quick sidenote: I only was able to run meanfield or variational inference for the exponential model via the stan_surv() function. It seems like variational inference and Laplace approximations are not implemented thus far for most other models, so I guess there is little hope of approximating the posterior to speed up the LOO-CV computation for those survival models by now.
Let me know what you think of my solution. I am in particular not satisfied with the time the loo_subsample() function takes to run. Maybe this is simple due to the computational effort of precise MCMC fitting of these survival models, and it would already be much faster when approximations were available. Maybe this is due to the inefficient search, which becomes costly especially when the sample size becomes large.