Explanation for more efficient sampling


I’m trying to model a classification process. We observe that P of the N examples are classified as positive by some decision maker. We then observe that TP of those P positives turn out to be true positives. We model this as two binomial draws:

P ~ binomial(N, a);
TP ~ binomial(P, b);
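For concreteness, the generative story can be simulated directly. This is a minimal Python sketch (standing in for the Stan-style pseudocode above); the values of N, a, and b are illustrative assumptions, not from the question:

```python
import random

random.seed(0)

# Illustrative values (assumed, not from the question).
N, a, b = 1000, 0.45, 0.80

# P ~ binomial(N, a): number of examples classified as positive.
P = sum(random.random() < a for _ in range(N))

# TP ~ binomial(P, b): number of those positives that are true positives.
TP = sum(random.random() < b for _ in range(P))
```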

We model a and b in two ways. In the first, a = P(X > t) and b = E[X | X > t], where X ~ beta(alpha, beta). The positive rate (a) is the fraction of beta samples X above some threshold t, and the precision (b) is the mean of those samples:

a = 1 - beta_cdf(t, alpha, beta)
b = (alpha * (1 - beta_cdf(t, alpha + 1, beta))) / ((1 - beta_cdf(t, alpha, beta)) * (alpha + beta))
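As a sanity check on those two closed forms, here is a hedged Python sketch that compares them against a Monte Carlo estimate. It uses integer alpha and beta so the regularized incomplete beta can be computed with the binomial-tail identity; the specific parameter values are assumptions for illustration only:

```python
import math
import random

def beta_cdf(x, a, b):
    # Regularized incomplete beta I_x(a, b) for integer a, b > 0,
    # via the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

# Illustrative values (assumed, not from the question).
alpha, beta, t = 2, 3, 0.4

# Closed forms from the question: a = P(X > t), b = E[X | X > t].
a_cf = 1 - beta_cdf(t, alpha, beta)
b_cf = alpha * (1 - beta_cdf(t, alpha + 1, beta)) / (
    (1 - beta_cdf(t, alpha, beta)) * (alpha + beta)
)

# Monte Carlo versions: fraction of beta draws above t, and their mean.
random.seed(1)
draws = [random.betavariate(alpha, beta) for _ in range(200_000)]
above = [x for x in draws if x > t]
a_mc = len(above) / len(draws)
b_mc = sum(above) / len(above)
```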

In the second, the classification process is modeled with LDA, where the classes have equal-variance normal distributions separated by some delta.

tp = phi * (1 - normal_cdf(t, delta, 1));
fp = (1 - phi) * (1 - normal_cdf(t, 0, 1)); 
a = (tp + fp);
b = tp / (tp + fp);
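The same kind of spot check works for the LDA parameterization. A minimal Python sketch, using math.erf for the normal CDF; the values of phi, delta, and t are illustrative assumptions:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Illustrative values (assumed, not from the question).
phi, delta, t = 0.3, 1.5, 0.8

tp = phi * (1 - normal_cdf(t, delta, 1))      # mass of the positive class above t
fp = (1 - phi) * (1 - normal_cdf(t, 0, 1))    # mass of the negative class above t
a = tp + fp          # overall positive rate
b = tp / (tp + fp)   # precision among predicted positives
```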

We place reasonable priors on (t, alpha, beta) in the first case and (t, phi, delta) in the second case.

My question is: why is inference more efficient (fewer leapfrog steps per draw and a higher effective sample size per draw) in the second model?


Those are wildly different formulas. I can’t say I really understand either model.

Efficiency comes down to both the time to evaluate the log density (how well the log density is coded as a program) and the number of times it needs to be evaluated (statistical efficiency). For the latter, you can look at tree depth or the number of leapfrog steps per iteration to see how many times the log density gets evaluated.

The normal_cdf is probably more efficient than the beta_cdf; if I recall correctly, the beta_cdf relies on some nasty internal functions for derivatives that are still being refined.

Statistical efficiency is trickier. Take a look at the posterior pairs plots to see whether you’re getting problematic posterior shapes like bananas or funnels.