I recently came across this great article discussing predictive accuracy.

Bob Carpenter makes the point that binary classification in an applied setting really needs to account for the costs of false positives/negatives (if I understood correctly). This is just what I need from my model. Fraud investigators use my model output to determine what to investigate. False negatives will cost the company money, but false positives cost investigator time. I’d like the ability to tune the model accordingly.

Is it possible to incorporate these into a logistic regression model using Stan?

I’ve tried a cost-sensitive loss function using xgboost and it works well. But, I really like the interpretability of Stan models. Using Stan, I can not only tell an investigator what to look at, but I can also tell them why they should look at it. My current Stan model (hierarchical logistic regression using brms) has a very large number of false positives.

You don’t incorporate the costs of misclassification into the model block. You fit the same model as before, then evaluate your cost function in the generated quantities block.
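To make the "fit first, decide afterwards" idea concrete, here is a minimal sketch of the decision step in plain Python rather than Stan. It assumes you already have posterior draws of the fraud probability for one case; the function names and the cost values are made up for illustration, and the same arithmetic could be written in generated quantities.

```python
def expected_costs(p_fraud_draws, cost_fp, cost_fn):
    """Expected cost of each action, averaged over posterior draws.

    p_fraud_draws: posterior draws of P(fraud) for a single case (hypothetical input).
    cost_fp: cost of investigating a legitimate case (investigator time).
    cost_fn: cost of ignoring a fraudulent case (money lost).
    """
    p_mean = sum(p_fraud_draws) / len(p_fraud_draws)
    cost_investigate = (1.0 - p_mean) * cost_fp  # we only pay if the case is legitimate
    cost_ignore = p_mean * cost_fn               # we only pay if the case is fraud
    return cost_investigate, cost_ignore


def decide(p_fraud_draws, cost_fp, cost_fn):
    """Return True if investigating has the lower expected cost."""
    cost_investigate, cost_ignore = expected_costs(p_fraud_draws, cost_fp, cost_fn)
    return cost_investigate < cost_ignore


def threshold(cost_fp, cost_fn):
    """Probability above which investigating is the cheaper action."""
    return cost_fp / (cost_fp + cost_fn)


if __name__ == "__main__":
    draws = [0.02, 0.03, 0.04]  # made-up posterior draws for one case
    print(decide(draws, cost_fp=1.0, cost_fn=100.0))
    print(threshold(1.0, 100.0))
```

With these two costs the rule reduces to comparing the posterior mean probability against `cost_fp / (cost_fp + cost_fn)`, so tuning the costs moves the threshold without refitting the model.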

I’m beginning to understand that. The Bayesian Decision Theory chapter in Pattern Classification by Duda and Hart was very helpful. Along with this paper:

Having a dataset where p(+1)=p(-1) implies equal costs. But by changing p(+1) and p(-1) I can imply a particular cost ratio c(-1)/c(+1) (e.g., a 1/100 ratio of false positive cost to false negative cost).
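One way to read the link between class proportions and costs, as a rough sketch: if resampling only changes the class prior and leaves p(x | class) alone, then classifying at 0.5 on the resampled data is the same as applying a cost-weighted threshold on the original data. The helper names below are hypothetical, and the framing (comparing against the original base rate) is my assumption rather than something taken from the paper.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def implied_cost_ratio(original_prior_pos, resampled_prior_pos):
    """Cost ratio c_FP / c_FN implied by thresholding at 0.5 after resampling.

    Assumes resampling changes only the class prior, not p(x | class).
    Deciding +1 when the resampled posterior odds exceed 1 is then equivalent
    to deciding +1 when the original posterior odds exceed
    odds(original) / odds(resampled), which plays the role of c_FP / c_FN.
    """
    return odds(original_prior_pos) / odds(resampled_prior_pos)


def posterior_prob(prior_pos, likelihood_ratio):
    """P(+1 | x) from the prior and the ratio p(x | +1) / p(x | -1)."""
    post_odds = odds(prior_pos) * likelihood_ratio
    return post_odds / (1.0 + post_odds)


if __name__ == "__main__":
    # Made-up numbers: 1 positive per 100 negatives originally, balanced after resampling.
    print(implied_cost_ratio(1.0 / 101.0, 0.5))
```

In this made-up example, balancing a dataset that originally had one positive per hundred negatives behaves like a 1/100 false positive to false negative cost ratio.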

But the computation can be more efficient if the two are combined. In importance sampling it’s more common to consider proposal distributions that take into account the function in addition to the distribution. This has also been proposed for distributional variational approximations. We could also combine MCMC and IS: intentionally sample more draws where they matter for the decision task, then use importance weighting to get the correct expectations in the end.

That would be great if we could do it stably. How do you intentionally take more draws from where they matter? Do you change the log density being sampled or change the algorithm somehow?

I would change the target in Stan. How to change it is probably not trivial before some initial sampling. For example, suppose we would like to estimate some extreme tail quantile. We could do an initial run with the original log density, and after learning the approximate location of that quantile we could change the target to have higher values near that location and lower values for the bulk of the distribution. For complex models and decision tasks that depend on, e.g., predictions, this is probably more difficult. If this were easy, people would be doing it more often.
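A toy version of the tail example, as a sketch only: it uses plain importance sampling in Python instead of Stan, the "model" is a standard normal, and the tail location is simply assumed known where in practice it would come from a pilot run. The function name and defaults are made up.

```python
import math
import random


def tail_probability_is(threshold, n_draws=20000, seed=0):
    """Estimate P(X > threshold) for X ~ Normal(0, 1) by importance sampling.

    The proposal is Normal(threshold, 1), i.e. the target shifted so that about
    half of the draws land in the region that matters. Each draw in the tail
    gets the weight target_density / proposal_density, which corrects for
    having sampled from the wrong distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(threshold, 1.0)  # draw from the shifted proposal
        if x > threshold:
            # Log weight; the normalizing constants cancel because both
            # densities have unit standard deviation.
            log_w = -0.5 * x * x + 0.5 * (x - threshold) ** 2
            total += math.exp(log_w)
    return total / n_draws


if __name__ == "__main__":
    estimate = tail_probability_is(4.0)
    exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
    print(estimate, exact)
```

Sampling the standard normal directly would put only a few draws in 100,000 beyond 4, so almost all of the effort would go to the bulk. The hard part in a real model is the one described above: knowing where to shift the target before you have sampled anything.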

After working with this model more, I’ve realized there’s an additional complexity to my approach that I had ignored. Not only does the ratio P(+1)/P(-1) imply a cost ratio, but when doing hierarchical logistic regression (with group G), I also need to preserve the class ratio within each group, P(+1|G)/P(-1|G). Ignoring this gave me strange results, but once I adjusted my training data to have the correct proportions per group, things looked much better.