I am currently using Stan for a binary classification problem. To check the discriminatory power of my model, I want to use measures such as recall, precision, and the receiver operating characteristic (ROC) curve. Because of the size of the data set, I do not want to split it into a training set and a validation set. Instead, I thought about using an approach similar to the one in Vehtari, Gelman, and Gabry's paper, in which Pareto smoothed importance sampling (PSIS) is used to calculate the cross-validation likelihood by:

$$
p(y_i \mid y_{-i}) \approx \frac{\sum_{s=1}^{S} w_i^s \, p(y_i \mid \theta^s)}{\sum_{s=1}^{S} w_i^s}
$$

where w_i^s is the weight obtained by Pareto smoothing of the raw importance sampling ratios, and θ^s is the s-th of S posterior draws.
I want to use these cross-validation likelihoods as input for recall, precision, and the ROC curve. However, I am not sure whether this is a good approach, given the variance and bias of the approximation. Does anyone know whether it is reasonable to use these approximations for these types of measures?
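
To make the idea concrete, here is a minimal sketch of what I have in mind, under some assumptions. It takes the posterior draws of P(y_i = 1 | θ^s) exported from Stan as an S × N array, weights each draw by the raw importance ratio 1 / p(y_i | θ^s), and returns an approximate leave-one-out probability of class 1 for every observation. Note that it uses the raw ratios, not the Pareto smoothed weights from the paper, so the smoothing step would still have to be added. The function names `loo_class1_probability` and `precision_recall` are made up for this illustration.

```python
import numpy as np


def loo_class1_probability(p1_draws, y):
    """Approximate P(y_i = 1 | y_{-i}) by importance sampling.

    p1_draws: (S, N) array, P(y_i = 1 | theta^s) for each posterior draw s.
    y:        (N,) array of observed 0/1 labels.
    """
    p1_draws = np.asarray(p1_draws, dtype=float)
    y = np.asarray(y)
    # Likelihood of the observed label under each draw: p(y_i | theta^s).
    lik = np.where(y == 1, p1_draws, 1.0 - p1_draws)
    # Raw importance ratios 1 / p(y_i | theta^s), computed in log space.
    # ASSUMPTION: no Pareto smoothing is applied here; PSIS would replace
    # the largest of these weights with smoothed values.
    log_w = -np.log(lik)
    log_w -= log_w.max(axis=0)  # rescale per observation for stability
    w = np.exp(log_w)
    # Self-normalised weighted average of the class-1 probability.
    return (w * p1_draws).sum(axis=0) / w.sum(axis=0)


def precision_recall(y, prob, threshold=0.5):
    """Precision and recall when predicting class 1 if prob >= threshold."""
    y = np.asarray(y)
    pred = np.asarray(prob) >= threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn) if tp + fn > 0 else float("nan")
    return precision, recall
```

The leave-one-out probabilities returned by `loo_class1_probability` could then be passed to `precision_recall` at several thresholds to trace out a precision-recall or ROC curve, in place of the in-sample posterior mean probabilities.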