I implemented in Stan/R the idea I outlined in my last post for estimating the expected log predictive density with LOO-CV (elpd_loo) for a multilevel/hierarchical model while forgetting the group association of the left-out data point. If anybody is willing to check the method and my implementation of it, I would be very glad!
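To fix notation, this is the estimator I am trying to compute: for each posterior draw $\theta^{(s)}$ I marginalise the per-point density over $T$ freshly drawn standardised group effects $u^{(t)} \sim \mathcal{N}(0,1)$, and then weight with the normalised PSIS importance weights $\omega_i^{(s)}$:
$$
\tilde{p}\!\left(y_i \mid \theta^{(s)}\right) \approx \frac{1}{T}\sum_{t=1}^{T} p\!\left(y_i \mid \theta^{(s)}, u^{(t)}\right),
\qquad
\widetilde{\mathrm{elpd}}_{\mathrm{loo},i} = \log \sum_{s=1}^{S} \omega_i^{(s)}\, \tilde{p}\!\left(y_i \mid \theta^{(s)}\right).
$$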
I paste the Stan program and the R script below. The Stan program is a bit long, I’m sorry. The model consists of two separately parameterised, analogous negative-binomial models for two kinds of data (indoor and outdoor) measured for the same data points (hence almost every variable comes once with ‘IN’ and once with ‘OUT’), whose log-likelihoods are summed up at the end. There are two switches: \texttt{doexactloopdi}=1 for doing exact LOO-CV and \texttt{doforward}=1 for running forward simulation (everything inside \texttt{if (doforward)} in the generated quantities can be ignored for the purpose of LOO-CV). The model runs fine, except for warnings about Rhat being NaN, but I think this is due to some constant variables (I checked once with \texttt{monitor()} that everything is fine). To me the output looks reasonable. I didn’t simplify the code, both to save time and because simplifying could hide problems in the original code that I am not aware of (I am of course also grateful for any other suggestions on my code). Still, if nobody responds after a while, I could simplify it.
The relevant variable for ‘LOO-CV with forgetting group association’ is \texttt{log_lik_2} (sorry, there are no line numbers…). When I multiply the predictive density, \texttt{p_tilde}, by the weights (i.e. importance ratios), \texttt{omega}, in the R script, I assume that the order of the draws is the same. This is risky, I think. I know that I can keep the chains separate and un-permuted when extracting \texttt{log_lik_2} from the stanfit object, but I don’t know how to do this for extracting the weights from the psis object (with \texttt{weights.importance_sampling}). Suggestions? My current workaround is sketched just below.
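What I do for now (a sketch only; \texttt{fit} and \texttt{loo_} are as in the script further down, and please correct me if the ordering assumption is wrong): extract \texttt{log_lik_2} with \texttt{loo::extract_log_lik(..., merge_chains = TRUE)}, which stacks the chains without permuting, i.e. (as far as I understand) the same row order that \texttt{loo()} keeps inside the psis object:
# sketch: order-consistent extraction of weights and predictive densities
log_p_tilde <- loo::extract_log_lik(fit, parameter_name = "log_lik_2",
merge_chains = TRUE) # S x n matrix, unpermuted, chains stacked
omega <- weights(loo_$psis_object, log = FALSE, normalize = TRUE) # S x n, same order (I hope)
stopifnot(all(dim(omega) == dim(log_p_tilde))) # cheap dimension check
elpd_tilde_pointwise <- log(colSums(omega * exp(log_p_tilde)))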
Some background: the data comes from a field study comparing different mosquito-control tools deployed at house level with the aim of reducing mosquito densities. The data are mosquito counts per house and per day, indoor and outdoor. There are 12 houses, the study ran over 18 weeks, and the control and the 3 active treatments are assigned following a Latin square design. The aim of the model is to provide estimates of ‘protective efficacy’ for each treatment, defined as the relative reduction of mosquito counts due to the treatment, which is computed in the generated quantities after forward simulation.
Thank you very much for reading, help, suggestions, comments…
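For reference, the data list passed to the Stan program (built by my \texttt{data_stan()} function, not shown) has roughly this shape when both switches are off; the values here are placeholders, and the zero-length entries are required because of the conditional array sizes:
# illustrative data list with doforward=0 and doexactloopdi=0 (placeholder values)
input_stan <- list(
n = n, # number of house-days
m = c(12, 4, 18), # houses, treatment groups (control + 3 active), weeks
yHLC = yHLC, yCDC = yCDC, # count vectors of length n
xHOUSE = xHOUSE, xTREAT = xTREAT, xWEEK = xWEEK, # integer features of length n
priorscale = 10, # placeholder
doforward = 0, zn = integer(0), zm = integer(0),
zxHOUSE = integer(0), zxTREAT = integer(0), zxWEEK = integer(0),
indexC = integer(0), indexT = integer(0), indexR = integer(0), indexP = integer(0),
c = numeric(0),
T = 1000, doexactloopdi = 0, # T = number of draws for the unknown group parameter
yHLCi = integer(0), yCDCi = integer(0),
xHOUSEi = integer(0), xTREATi = integer(0), xWEEKi = integer(0)
)
Here is the Stan program: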
data{
int<lower=0> n; // number of 'house-days' (experimental days * houses) // This needs to be adjusted if doexactloopdi=1
int<lower=0> m[3]; // [number of houses, number of treatments, number of weeks]
int<lower=0> yHLC[n]; // data HLC
int<lower=0> yCDC[n]; // data CDC
int xHOUSE[n]; // feature HOUSE
int xTREAT[n]; // feature TREATment
int xWEEK[n]; // feature WEEK
// to control priors
int<lower=0> priorscale; // to control scale of prior
// to control forward simulation
int<lower=0, upper=1> doforward; // =1 to switch forward simulation on, =0 to switch off
int zn[doforward ? 1 : 0]; // number of simulated data points
int zm[doforward ? 3 : 0]; // [number of simulated houses, number of simulated treatments, number of simulated weeks]
int zxHOUSE[doforward ? zn[1] : 0]; // feature HOUSE
int zxTREAT[doforward ? zn[1] : 0]; // feature TREATment
int zxWEEK[doforward ? zn[1] : 0]; //feature week number
int indexC[doforward ? zm[1]*zm[3] : 0]; // indices of control in output zy
int indexT[doforward ? zm[1]*zm[3] : 0]; // indices of pull in output zy
int indexR[doforward ? zm[1]*zm[3] : 0]; // indices of push in output zy
int indexP[doforward ? zm[1]*zm[3] : 0]; // indices of push-pull in output zy
real<lower=0> c[doforward ? 1 : 0]; // added to denominators in the generated quantities to avoid division by 0 (set c=0 to disable the correction)
// to control computation of log likelihood for loo
int<lower=1> T; // number of sample points for unknown group parameter
int<lower=0, upper=1> doexactloopdi;
int<lower=0> yHLCi[doexactloopdi ? 1 : 0]; // HLC of left-out data point i
int<lower=0> yCDCi[doexactloopdi ? 1 : 0]; // CDC of left-out data point i
int xHOUSEi[doexactloopdi ? 1 : 0]; // feature HOUSE
int xTREATi[doexactloopdi ? 1 : 0]; // feature TREATment
int xWEEKi[doexactloopdi ? 1 : 0]; // feature WEEK
}
parameters{
real logaIN; // mean of lograte indoors
real logaOUT; // mean of lograte outdoors
real<lower=0> sigma_alphaIN; // standard deviation of lograte indoors due to house
real<lower=0> sigma_alphaOUT; // standard deviation of lograte outdoors due to house
vector[m[1]] thetaIN; // standardised deviation of indoor lograte from mean of lograte indoors due to house
vector[m[1]] thetaOUT; // standardised deviation of outdoor lograte from mean of lograte outdoors due to house
real<lower=0> sigma_tauIN; // standard deviation of lograte indoors due to week
real<lower=0> sigma_tauOUT; // standard deviation of lograte outdoors due to week
vector[m[3]] tauIN; // standardised deviation of indoor lograte from mean of lograte indoors due to week
vector[m[3]] tauOUT; // standardised deviation of outdoor lograte from mean of lograte outdoors due to week
vector[m[2]-1] logbetaIN; // treatment effect indoor
vector[m[2]-1] logbetaOUT; // treatment effect outdoor
vector[m[2]] logpsiIN; // one (log) scale parameter per treatment group, indoor; so far no hierarchy
vector[m[2]] logpsiOUT; // one (log) scale parameter per treatment group, outdoor; so far no hierarchy
}
transformed parameters{
// parameters for negative binomial for data
vector[n] loglambdaIN; // lograte indoor
vector[n] loglambdaOUT; // lograte outdoor
vector[n] phiIN; // scale parameter indoor
vector[n] phiOUT; // scale parameter outdoor
// defining log means
vector[m[2]] logbetaINall = append_row(0, logbetaIN); // control group fixed at 0
vector[m[2]] logbetaOUTall = append_row(0, logbetaOUT); // control group fixed at 0
loglambdaIN = logaIN + thetaIN[xHOUSE] * sigma_alphaIN + logbetaINall[xTREAT] + tauIN[xWEEK] * sigma_tauIN;
loglambdaOUT = logaOUT + thetaOUT[xHOUSE] * sigma_alphaOUT + logbetaOUTall[xTREAT] + tauOUT[xWEEK] * sigma_tauOUT;
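// i.e. loglambda_j = loga + sigma_alpha * theta_house(j) + logbeta_treat(j) + sigma_tau * tau_week(j),
// with theta, tau ~ N(0,1): non-centred parameterisation of the house and week effects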
// defining scale parameters
phiIN = exp(logpsiIN[xTREAT]);
phiOUT = exp(logpsiOUT[xTREAT]);
}
model{
//priors and hyperpriors
target += normal_lpdf(logaIN | 0, priorscale);
target += normal_lpdf(logaOUT | 0, priorscale);
target += cauchy_lpdf(sigma_alphaIN | 0, priorscale); // half-Cauchy via the <lower=0> constraint; the truncation constant log(2) can be dropped
target += cauchy_lpdf(sigma_alphaOUT | 0, priorscale);
target += cauchy_lpdf(sigma_tauIN | 0, priorscale);
target += cauchy_lpdf(sigma_tauOUT | 0, priorscale);
target += normal_lpdf(logbetaIN| 0, priorscale);
target += normal_lpdf(logbetaOUT| 0, priorscale);
target += normal_lpdf(logpsiIN| 0, priorscale);
target += normal_lpdf(logpsiOUT| 0, priorscale);
// hierarchical parameters
target += normal_lpdf(thetaIN| 0, 1);
target += normal_lpdf(thetaOUT| 0, 1);
target += normal_lpdf(tauIN| 0, 1);
target += normal_lpdf(tauOUT| 0, 1);
// model
target += neg_binomial_2_log_lpmf(yCDC | loglambdaIN, phiIN);
target += neg_binomial_2_log_lpmf(yHLC | loglambdaOUT, phiOUT);
}
generated quantities{
// definitions for log likelihood for loo
vector[doexactloopdi ? 1 : n] log_lik;
vector[doexactloopdi ? 1 : n] log_lik_2;
vector[doexactloopdi ? 1 : 0] loglambdaINi; // lograte indoor
vector[doexactloopdi ? 1 : 0] loglambdaOUTi; // lograte outdoor
vector[doexactloopdi ? T : 0] loglambdaINi_rn; // lograte indoor, fresh group effects
vector[doexactloopdi ? T : 0] loglambdaOUTi_rn; // lograte outdoor, fresh group effects
vector[doexactloopdi ? 1 : 0] phiINi; // scale parameter indoor
vector[doexactloopdi ? 1 : 0] phiOUTi; // scale parameter outdoor
vector[doexactloopdi ? T : 0] thetaINi_rn; // random numbers
vector[doexactloopdi ? T : 0] thetaOUTi_rn; // random numbers
vector[doexactloopdi ? T : 0] tauINi_rn; // random numbers
vector[doexactloopdi ? T : 0] tauOUTi_rn; // random numbers
matrix[doexactloopdi ? 0 : n,T] loglambdaIN_rn; // lograte indoor, fresh random group effects
matrix[doexactloopdi ? 0 : n,T] loglambdaOUT_rn; // lograte outdoor, fresh random group effects
matrix[doexactloopdi ? 0 : n,T] thetaIN_rn; // random numbers
matrix[doexactloopdi ? 0 : n,T] thetaOUT_rn; // random numbers
matrix[doexactloopdi ? 0 : n,T] tauIN_rn; // random numbers
matrix[doexactloopdi ? 0 : n,T] tauOUT_rn; // random numbers
// definitions for forward simulation
int<lower=0> zIN[doforward ? zn[1] : 0]; //simulate CDC data for known houses and unknown house and all treatment groups
int<lower=0> zOUT[doforward ? zn[1] : 0]; //simulate HLC data for known houses and unknown house and all treatment groups
vector[doforward ? zn[1] : 0] zmeanIN; //mean of simulated CDC data for known houses and unknown house and all treatment groups
vector[doforward ? zn[1] : 0] zmeanOUT; //mean of simulated HLC data for known houses and unknown house and all treatment groups
//vector[doforward ? zm[1] : 0] rrIN_OUT; // rates rario of indoor vs outdoor in control setting
//vector[doforward ? zm[1] : 0] SpIN; // simulated probability of mosquito going indoor in control setting
//vector[doforward ? zm[1] : 0] EpIN; // expected probability of mosquito going indoor in control setting
vector[doforward ? zn[1] : 0] PEf_IN; // protective efficacy against indoor biting in the field
vector[doforward ? zn[1] : 0] PEf_OUT; // protective efficacy against outdoor biting in the field
// expressions for log likelihood per datapoint to be used by loo function
if (doexactloopdi){
//log_lik
loglambdaINi = logaIN + thetaIN[xHOUSEi] * sigma_alphaIN + logbetaINall[xTREATi] + tauIN[xWEEKi] * sigma_tauIN;
loglambdaOUTi = logaOUT + thetaOUT[xHOUSEi] * sigma_alphaOUT + logbetaOUTall[xTREATi] + tauOUT[xWEEKi] * sigma_tauOUT;
phiINi = exp(logpsiIN[xTREATi]);
phiOUTi = exp(logpsiOUT[xTREATi]);
log_lik[1] = neg_binomial_2_log_lpmf(yCDCi | loglambdaINi, phiINi) + neg_binomial_2_log_lpmf(yHLCi | loglambdaOUTi, phiOUTi);
//log_lik_2
thetaINi_rn = to_vector(normal_rng(rep_vector(0,T),rep_vector(1,T)));
thetaOUTi_rn = to_vector(normal_rng(rep_vector(0,T),rep_vector(1,T)));
tauINi_rn = to_vector(normal_rng(rep_vector(0,T),rep_vector(1,T)));
tauOUTi_rn = to_vector(normal_rng(rep_vector(0,T),rep_vector(1,T)));
loglambdaINi_rn = logaIN + thetaINi_rn * sigma_alphaIN + logbetaINall[xTREATi[1]] + tauINi_rn * sigma_tauIN; // the scalar treatment effect broadcasts over the vector
loglambdaOUTi_rn = logaOUT + thetaOUTi_rn * sigma_alphaOUT + logbetaOUTall[xTREATi[1]] + tauOUTi_rn * sigma_tauOUT;
{ // log of the Monte Carlo average of the densities over the T fresh group effects (not the average of the log densities)
vector[T] ll;
for (t in 1:T)
ll[t] = neg_binomial_2_log_lpmf(yCDCi[1] | loglambdaINi_rn[t], phiINi[1]) + neg_binomial_2_log_lpmf(yHLCi[1] | loglambdaOUTi_rn[t], phiOUTi[1]);
log_lik_2[1] = log_sum_exp(ll) - log(T);
}
}else{
thetaIN_rn = to_matrix(normal_rng(rep_vector(0,n*T),rep_vector(1,n*T)),n,T);
thetaOUT_rn = to_matrix(normal_rng(rep_vector(0,n*T),rep_vector(1,n*T)),n,T);
tauIN_rn = to_matrix(normal_rng(rep_vector(0,n*T),rep_vector(1,n*T)),n,T);
tauOUT_rn = to_matrix(normal_rng(rep_vector(0,n*T),rep_vector(1,n*T)),n,T);
loglambdaIN_rn = logaIN + thetaIN_rn * sigma_alphaIN + rep_matrix(logbetaINall[xTREAT],T) + tauIN_rn * sigma_tauIN;
loglambdaOUT_rn = logaOUT + thetaOUT_rn * sigma_alphaOUT + rep_matrix(logbetaOUTall[xTREAT],T) + tauOUT_rn * sigma_tauOUT;
for (j in 1:n) {
// log_lik for PSIS
log_lik[j] = neg_binomial_2_log_lpmf(yCDC[j] | loglambdaIN[j], phiIN[j]) + neg_binomial_2_log_lpmf(yHLC[j] | loglambdaOUT[j], phiOUT[j]);
// log_lik_2 for PSIS with forgetting group association of the held-out data point:
// log of the Monte Carlo average of the densities over the T fresh group effects
{
vector[T] ll;
for (t in 1:T)
ll[t] = neg_binomial_2_log_lpmf(yCDC[j] | loglambdaIN_rn[j,t], phiIN[j]) + neg_binomial_2_log_lpmf(yHLC[j] | loglambdaOUT_rn[j,t], phiOUT[j]);
log_lik_2[j] = log_sum_exp(ll) - log(T);
}
}
}
// expressions for forward simulation and protective efficacy
if (doforward){
vector[zm[1]] zthetaIN; // standardised deviation of indoor lograte from mean of lograte indoors, plus unknown 13th house
vector[zm[1]] zthetaOUT; // standardised deviation of outdoor lograte from mean of lograte outdoors, plus unknown 13th house
vector[zm[3]] ztauIN; // standardised deviation of indoor lograte from mean of lograte indoors due to week, plus unknown week
vector[zm[3]] ztauOUT; // standardised deviation of outdoor lograte from mean of lograte outdoors due to week, plus unknown week
vector[zn[1]] zloglambdaIN; // lograte indoor
vector[zn[1]] zloglambdaOUT; // lograte outdoor
vector[zn[1]] zphiIN; // scale parameter indoor
vector[zn[1]] zphiOUT; // scale parameter outdoor
int zINC[zn[1]];
int zOUTC[zn[1]];
zthetaIN[:m[1]] = thetaIN;
zthetaOUT[:m[1]]= thetaOUT;
zthetaIN[zm[1]] = normal_rng(0,1);
zthetaOUT[zm[1]]= normal_rng(0,1);
ztauIN[:m[3]] = tauIN;
ztauOUT[:m[3]]= tauOUT;
ztauIN[zm[3]] = normal_rng(0,1);
ztauOUT[zm[3]]= normal_rng(0,1);
// defining log means
zloglambdaIN = logaIN + zthetaIN[zxHOUSE] * sigma_alphaIN + logbetaINall[zxTREAT] + ztauIN[zxWEEK] * sigma_tauIN;
zloglambdaOUT = logaOUT + zthetaOUT[zxHOUSE] * sigma_alphaOUT + logbetaOUTall[zxTREAT] + ztauOUT[zxWEEK] * sigma_tauOUT;
zphiIN = exp(logpsiIN[zxTREAT]);
zphiOUT = exp(logpsiOUT[zxTREAT]);
zIN = neg_binomial_2_log_rng(zloglambdaIN, zphiIN); //simulate CDC data
zOUT = neg_binomial_2_log_rng(zloglambdaOUT, zphiOUT); //simulate HLC data
zmeanIN = exp(zloglambdaIN); //mean of simulated CDC data for known houses and unknown house and all treatment groups
zmeanOUT = exp(zloglambdaOUT); //mean of simulated HLC data for known houses and unknown house and all treatment groups
//rrIN_OUT = exp(logalphaIN) ./ exp(logalphaOUT);
//SpIN = to_vector(neg_binomial_2_log_rng(logalphaIN + logalphaOUT, exp(logpsiIN[1]))) ./ (to_vector(neg_binomial_2_log_rng(logalphaIN + logalphaOUT, exp(logpsiIN[1]))) + rep_vector(0.1,12)); // the last term is correction to preclude division by zero. Rethink about this.
{ // build counterfactual vectors holding, at every position, the control simulation of the matching house-week (assumes indexC, indexT, indexR, indexP are ordered consistently)
zINC[indexC] = zIN[indexC];
zINC[indexT] = zIN[indexC];
zINC[indexR] = zIN[indexC];
zINC[indexP] = zIN[indexC];
//
zOUTC[indexC] = zOUT[indexC];
zOUTC[indexT] = zOUT[indexC];
zOUTC[indexR] = zOUT[indexC];
zOUTC[indexP] = zOUT[indexC];
}
PEf_IN= (to_vector(zINC) - to_vector(zIN) ) ./ (to_vector(zINC)+c[1]);
PEf_OUT= (to_vector(zOUTC) - to_vector(zOUT) ) ./ (to_vector(zOUTC)+c[1]);
}
}
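One remark on \texttt{log_lik_2}: it is the log of a Monte Carlo average of densities, $\log\big(\tfrac{1}{T}\sum_t p(y_i \mid \theta, u^{(t)})\big)$, computed via \texttt{log_sum_exp(ll) - log(T)}, and not the average of the log densities; by Jensen’s inequality the latter would be biased downwards. A toy check in R (illustrative numbers):
ll <- c(-1, -5, -20) # per-replicate log densities
log(mean(exp(ll))) # about -2.08: log of the averaged density (what log_lik_2 computes)
mean(ll) # about -8.67: mean of the logs, biased low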
This is the R function I wrote to run the model and compute the two LOO-CV versions:
validfit <- function(species, model, no_zHOUSE, no_zWEEK, WEEK2DAY, doforward, doexactloopdi, T_est, pareto_k_threshold, chains, warmup, iter, savetofile) {
# Returns a list containing:
# - the stanfit object for the given model fitted to the whole data
# - if doforward==1, the matrix defining the forward simulations
# - the loo() output, with the PSIS-approximated elpd (expected log predictive density)
#   replaced by the exact elpd for the data points with Pareto-k value above pareto_k_threshold
# - elpd_tilde_loo_psis: the elpd when forgetting the group association of the left-out data
#   point (again exact LOO-CV where the Pareto-k value is above pareto_k_threshold)
# - SE_elpd_tilde_loo_psis: the standard error of the latter
# - I: the indices of the data points with Pareto-k value above pareto_k_threshold
# pareto_k_threshold = 0.5 or at most 0.7 is recommended
# species is currently either 'gamb' or 'fune'
## preparations
if(chains <= 1) {stop("needs multiple chains")}
options(mc.cores = parallel::detectCores())
library('rstan')
rstan_options(auto_write = TRUE)
library("loo")
model_file <- paste0("./", model, ".stan", collapse = NULL)
m <- stan_model(file = model_file) # Stan program
## fit model to whole data
input_stan <- data_stan(species=species, doexactloopdi=0, T_est, index=NA, doforward=doforward, no_zHOUSE=no_zHOUSE, no_zWEEK=no_zWEEK, WEEK2DAY)
sim <- input_stan$sim
fit <- sampling(
m,
data = input_stan, # named list of data
chains = chains, # number of Markov chains
warmup = warmup, # number of warmup iterations per chain
iter = iter, # total number of iterations per chain
control = list(adapt_delta = 0.8),
show_messages = FALSE
)
## LOO-CV by PSIS with using group association of left-out data point for prediction
log_lik <- loo::extract_log_lik(fit, merge_chains = FALSE) # Extract pointwise log-likelihood
# as of loo v2.0.0 we can optionally provide relative effective sample sizes
# when calling loo, which allows for better estimates of the PSIS effective
# sample sizes and Monte Carlo error
r_eff <- relative_eff(exp(log_lik))
loo_ <- loo(log_lik, r_eff = r_eff, save_psis = TRUE) # loo.array keeps the draws in the unpermuted, chains-stacked order inside the psis object
I <- pareto_k_ids(loo_, threshold = pareto_k_threshold) # extracting the indices of data points with pareto-k estimate above pareto_k_threshold
## LOO-CV by PSIS with forgetting group association, storing pointwise, sum and SE to be computed further down
omega <- loo::weights.importance_sampling(loo_$psis_object, log = FALSE, normalize = TRUE) # normalised importance weights on the natural scale (each column sums to 1); rows in the unpermuted, chains-stacked order
log_p_tilde <- loo::extract_log_lik(fit, parameter_name = "log_lik_2", merge_chains = TRUE) # same unpermuted, chains-stacked row order as omega (see sketch above)
p_tilde <- exp(log_p_tilde)
elpd_tilde_loo_psis_pointwise <- log(colSums(omega * p_tilde)) # sum instead of mean as weights omega are normalised
## exact LOO-CV for data points with pareto-k estimate above pareto_k_threshold by PSIS
if (doexactloopdi==1){
# refit the model length(I) times, each time leaving out one i in I, and save the corresponding exact elpd (expected log predictive density), once using the group association of the left-out data point and once forgetting it
for (i in I) {
input_stan <- data_stan(species=species, doexactloopdi=1, T_est, index=i, doforward=0, no_zHOUSE=NA, no_zWEEK=NA, WEEK2DAY) # calling function saved in load_data.R
fit_lio <- sampling(
m,
data = input_stan, # named list of data
chains = chains, # number of Markov chains
warmup = warmup, # number of warmup iterations per chain
iter = iter, # total number of iterations per chain
control = list(adapt_delta = 0.8),
show_messages = FALSE
)
# replace pointwise log_lik
log_lik_lio_i <- loo::extract_log_lik(fit_lio, merge_chains = FALSE) # extract log likelihood for data point i
loo_$pointwise[i,1] <- log(mean(exp(log_lik_lio_i))) # replace by exact elpd for the left-out data point i; note that loo_$estimates is computed at creation and is NOT updated by this
# replace pointwise elpd_tilde_loo_psis with exact elpd_tilde_loo_psis
log_p_tilde_lio <- loo::extract_log_lik(fit_lio, parameter_name = "log_lik_2", merge_chains = TRUE) # order is irrelevant here (plain mean over draws), but kept consistent with above
p_tilde_lio <- exp(log_p_tilde_lio)
elpd_tilde_loo_psis_pointwise[i] <- log(mean(p_tilde_lio))
}
}
# compute elpd_tilde_loo_psis, the elpd when forgetting the group association of the left-out data point
elpd_tilde_loo_psis <- sum(elpd_tilde_loo_psis_pointwise)
SE_elpd_tilde_loo_psis <- sd(elpd_tilde_loo_psis_pointwise)*sqrt(length(elpd_tilde_loo_psis_pointwise))
## outputting results
if (doforward==1){
result <- list(fit=fit, sim=sim, loo_=loo_, I=I, elpd_tilde_loo_psis=elpd_tilde_loo_psis, SE_elpd_tilde_loo_psis=SE_elpd_tilde_loo_psis)
}else{
result <- list(fit=fit, loo_=loo_, I=I, elpd_tilde_loo_psis=elpd_tilde_loo_psis, SE_elpd_tilde_loo_psis=SE_elpd_tilde_loo_psis)
}
## saving results
if (!is.na(savetofile)){
savetopath_fit <- paste0("./", savetofile, "_fit.rds")
saveRDS(fit, file=savetopath_fit)
savetopath_loo <- paste0("./", savetofile, "_loo.rds")
saveRDS(list(loo_=loo_, I=I, elpd_tilde_loo_psis=elpd_tilde_loo_psis, SE_elpd_tilde_loo_psis=SE_elpd_tilde_loo_psis), file=savetopath_loo)
if (doforward==1){
savesimtopath <- paste0("./", savetofile, "_sim.rds")
saveRDS(input_stan$sim, file=savesimtopath)
}
}
return(result)
}
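For completeness, this is roughly how I call it (all values here are placeholders for my actual settings, and the model file name is made up):
# illustrative call; 'pushpull_negbin' stands in for the .stan file name (placeholder)
out <- validfit(species = "gamb", model = "pushpull_negbin",
no_zHOUSE = 13, no_zWEEK = 19, WEEK2DAY = 7, # 12 known houses + 1 unknown, 18 weeks + 1 unknown (placeholders)
doforward = 1, doexactloopdi = 1, T_est = 1000,
pareto_k_threshold = 0.7,
chains = 4, warmup = 1000, iter = 2000,
savetofile = "pushpull_gamb")
out$elpd_tilde_loo_psis # elpd_loo with the group association forgotten
out$SE_elpd_tilde_loo_psis # its standard error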