Mcse in loo 2.0

Using loo 2.0 I created a loo1 object and I am looking at loo1$pointwise. In R I get the total elpd by typing sum(loo1$pointwise[,1]) and I get the elpd_loo for the model. So far, this looks good. Now I am interested in getting the monte carlo standard error of the elpd_loo. Should I type in R, sum(loo1$pointwise[,2])?

If I understand this right, this would assume perfectly correlated errors and is likely to be an overestimate.

If I have any k>0.7, what are the dangers in doing this?

I saw your other post about having solved this, but just to have it on record for anyone who comes across this going forward, there’s an mcse_loo function included with the other diagnostic functions:

For high k values it will be NA since the estimate is not so trustworthy:

If k>0.7 we observe impractical convergence rates and unreliable Monte Carlo error estimates.

Then why do you have the pointwise mcse values when k>0.7?

Fair question. A few things:

First, it’s not just the pointwise mcse values that are a problem if k is large. The other pointwise values (elpd_loo, p_loo) are not good estimates if \hat{k} is large either.

In the papers, package documentation, and in the output from the user facing functions, there are tons of warnings about not trusting estimates if \hat{k} is large, but in the internals of the objects created by loo (or really by the psis() function) it is still useful to have what the untrustworthy values actually are instead of deleting them or making them NA. If we as researchers (or users who have an interest in the underlying algorithms, or other researchers, or anyone else) want to understand what actually happens when \hat{k} is large and why, then not being able to see the actual values would be a big impediment.

As mainly a user, I can accept that. But I do have a general question. Have you looked into alternatives to the Pareto? I view your problem as one of robust estimation of the mean of the importance ratios. It strikes me that you should be able to find a good distribution to approximate the distribution of raw importance ratios.

That is what I was trying to do with my initial post to an earlier thread with the loggamma distribution. The argument that Aki posted against that was that the loggamma may have worked for my problem, but there are other problems for which that would not work.

At the very least, it might be possible to identify a class of problems where something like a loggamma would work better.

I haven’t looked into the log-gamma dist for this. Aki often has intuitions on this topic that I don’t have until I’ve actually done the math and often simulations too, so my default is to trust his judgement if I haven’t looked into it myself. But we would certainly be interested in looking at the log-gamma if there were reason to suspect it could be useful in general. Sounds like Aki’s not convinced, but if you can convince us it’s worth looking into we’ll definitely take it seriously!

Below is the script to do the (translated) loggamma and the raw elpd calculations. It tales the loglik1 input from the stan run.

I have also done bootstrap estimates of the mcse, also below, for one of my actuarial models. I don’t want to inflict the actuarial model itself unless you really want it. but just below are the bootstrap results. Note that the translated loggamma (tlg) has the lowest mcse

Just for the record, in the paper I am writing for actuaries, I will be using the psis (and noting the pareto k issue) I consider you guys to be the gold standard on this topic. But I am interested in the mcse error, as some of my model comparisons are close calls.

print(co_model$mcse)
$mcse_test_boot
[1] 0.331

$mcse_psis_boot
[1] 0.215

$mcse_tlg_boot
[1] 0.111

$mcse_raw_boot
[1] 0.208

function to fit loggamma distribution by the method of moments

reference - p227 of Loss Distributions - Hogg and Klugman 1984

fit_loggamma=function(x){
minlogx=min(log(x))
tlogx=log(x)-minlogx
m1=mean(tlogx) # translate to make all numbers positive
m2=mean(tlogx^2)
alpha=m1^2/(m2-m1^2)
lambda=m1/(m2-m1^2)
mean_tlg=1/(1-1/lambda)^alpha*exp(minlogx) #use exp(minlogx) to untranslate
output=list(mean_tlg = mean_tlg,
alpha_tlg = alpha,
lambda_tlg = lambda)
return(output)
}

function to get elpd by three different formulas

elpd_calc=function(loglik){

calc #1 with the “loo” package

loo_psis ← loo(loglik)

calculate importance ratios

impratio=matrix(0,dim(loglik)[1],dim(loglik)[2])
for (i in 1:dim(loglik)[2]){
impratio[,i]=1/exp(loglik[,i])
}

calculate raw elpd

prloo_raw=rep(0,dim(loglik)[2])
varloo_raw=rep(0,dim(loglik)[2])
for (i in 1:dim(loglik)[2]){
prloo_raw[i]=1/mean(impratio[,i])
varloo_raw[i]=var(impratio[,i])/dim(loglik)[1]/mean(impratio[,i])^2
}
elpd_raw=sum(log(prloo_raw))
se_elpd_raw=sum(sqrt(varloo_raw))

calculate elpd_tlg

prloo_tlg=rep(0,dim(loglik)[2])
lambda_tlg=NULL
for(i in 1:dim(loglik)[2]){
tlg=fit_loggamma(impratio[,i])
prloo_tlg[i]=1/tlg$mean_tlg
lambda_tlg=c(lambda_tlg,tlg$lambda_tlg)
}
elpd_tlg=sum(log(prloo_tlg))
output=list(elpd_psis = round(loo_psis$estimates[1,1],3),
elpd_raw = round(elpd_raw,3),
elpd_tlg = round(elpd_tlg,3),
p_psis = round(loo_psis$estimates[2,1],3),
max_k_psis = round(max(loo_psis$diagnostics$pareto_k),3),
min_lam_tlg = round(min(lambda_tlg),3))
return(output)
}

bootstrap to get an estimate of the mcmc sample variance

nboot=100
set.seed(12345)
cl ← makePSOCKcluster(4)
registerDoParallel(cl)
boot.se=foreach (nb=1:nboot,.combine=rbind,.export=ls(.GlobalEnv)) %dopar%{
test_elpd_boot=0
sam=sample(1:num.mcmc,num.mcmc,replace=T)
for (i in 1:dim(test_data)[1]){
mu.test=log(Premium[test_data$w[i]])+logelr[sam]+
alpha[sam,test_data$w[i]]+
beta[sam,test_data$d[i]]*(1-gamma[sam])^(test_data$w[i]-1)
test_elpd_boot=test_elpd_boot+
log(mean(dnorm(log(test[i]),mu.test,sig[sam,test_data$d[i]],log=F)))
}

temp=elpd_calc(loglik1[sam,])
elpd_psis_boot=temp$elpd_psis
elpd_tlg_boot=temp$elpd_tlg
elpd_raw_boot=temp$elpd_raw
boot=c(test_elpd_boot,elpd_psis_boot,elpd_tlg_boot,elpd_raw_boot)
return(boot)

}
stopCluster(cl)
mcse=list(mcse_test_boot=round(sd(boot.se[,1]),3),
mcse_psis_boot=round(sd(boot.se[,2]),3),
mcse_tlg_boot =round(sd(boot.se[,3]),3),
mcse_raw_boot =round(sd(boot.se[,4]),3))

Yes, but also diagnosing when that mean is infinite.

Only in rare cases we can assume that bulk of that distribution has some know distributional form, while it’s more common that we can assume that tail to have GDP shape.

Yes, it 's fine to use loggamma if that works for you. loo package tries to be automatic, and it would be difficult to make good automatic model checking whether log-gamma is a good for arbitrary ratio distribution. For a single specific model like in your case it is possible to manually check that it makes sense.

For k<0.7 mcse’s have been so small that they are negligible in model comparison. For k>1 mean is infinite. So the interesting range where we might get useful improvement is 0.7<k<1. Figures 6 and 7 in PSIS paper however show that even if the whole ratio distribution comes exactly from a known distribution the benefits are small.

If mcse_raw_boot is close to mcse_psis_boot, then I guess k<0.5, and you could have obtained the same reduction in mcse by increasing your sample 4 fold. And this would work for any distribution of ratios.

Your mcse’s are so small that you shouldn’t make any loo comparison decision based on that kind of accuracy. Estimate of SE for elpd_loo has definitely bigger error than 0.1 or 0.2.

Can you be more specific what do you mean by “close call” here, so I can say how I would interpret that model comparison result taking into account all the different uncertainties in the estimates.

I think a more accurate statement would be that you chose to do the robust estimation by fitting a distribution that could have an infinite mean. I am not saying that it was a bad choice given the variety of situations you are trying to serve. But I wonder if there was a distribution that always has a finite first, or even second moments that might do as well.

I don’t think the loggamma is ready for prime time. As for anything that I plan to publish, I will use the psis, point out that there were some observations with large ks and use the bootstrap as justification for using the results.

I will do that mid next week. Right now I am running my models on 200 datasets to get the bootstrap mcse. They each take about 10-15 minutes or so to run (bootstrap is not to blame here). I think a better description is not a “close call” but that something different might be present in the new holdout data. Predicting the future is hard. Getting a rough estimate of the mcses will help me make that call.

I chose to use a distribution that could have an infinite mean because the true distribution can have an infinite mean.

No, because theory says that the true distribution may have infinite first and second moment, and thus any fitting distribution which has finite first or second moment will fail when the true distribution has infinite first or second moment.

There could be a distribution which is better than GPD, but it should be also able to model so thick tails that first and second moments are infinite so that it is possible to diagnose those cases.

1 Like

Aki’s point about needing to be able to handle infinite moments is important because that is the reality for some models that are fit. So the generalized Pareto is a good option for that reason, and moreover because we understand when it doesn’t work, or at least we understand it well enough to provide diagnostics based on both theory and experiment.

Like Aki said, it’s comceivable there are other good general options, and there are certainly good options for special cases, but so far the GPD is working very well for a wide variety of models. Anyway, glad you’re interested in the topic. We’re not planning to stop actively working on this, so we will definitely keep looking into improvements we can make to the package.

I have been thinking about your about your comment that the “true distribution may have infinite first and second moments.” Think of, as a thought experiment, trying an inefficient estimation of elpd_loo where you leave one out, and then do an MCMC predictive distribution many times? Are there some models where the elpd_loo statistic has an infinite variance? If not, why use the gpd?

It looks like you’re referring to exact elpd calculations where the observation is actually held out, in which case you are correct that there’s no need for the gpd.