Question re: classification model evaluation

Hi all,

I am in the midst of peer review on a paper that developed a probabilistic binary classifier (a fancy way of saying logistic regression with frills) for a geoscience application (using Stan!). Here are a few key properties:

  • The dataset is imbalanced towards positive outcomes (~3:1) and both positive and negative predictions will be of significance to users.
  • The mis-classification cost will vary from use to use.
  • The actual population ratio of outcomes is unknown but it is generally assumed that the data collection is biased towards positive outcomes.

This may not be the best place for this question, but there are a lot of smart modelers on this forum and I wanted to throw it out there. I used my typical predictive modeling approach: up-sampling the minority class to balance the training data, then calculating performance metrics from the ROC curve and its AUC. One reviewer commented that we should also calculate the precision-recall (PR) curve (which I think is fair) because a significant difference between the PR curves when the class labels are switched indicates sampling bias (which I don’t immediately agree with).
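For context, here is a minimal sketch of the kind of evaluation I’m doing and of the label-swap comparison the reviewer is asking about. This uses scikit-learn rather than my actual Stan pipeline, and `X_train`, `y_train`, `X_test`, `y_test` are placeholder names for my data splits (1 = positive outcome, the majority class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc
from sklearn.utils import resample

# Up-sample the minority class (negatives here) so the training data are balanced
X_min, y_min = X_train[y_train == 0], y_train[y_train == 0]
X_maj, y_maj = X_train[y_train == 1], y_train[y_train == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])

model = LogisticRegression().fit(X_bal, y_bal)
p = model.predict_proba(X_test)[:, 1]

# ROC AUC is invariant to which class is called "positive"; PR metrics are not
print("ROC AUC:", roc_auc_score(y_test, p))
prec1, rec1, _ = precision_recall_curve(y_test, p)            # positives as class of interest
print("PR AUC (positives):", auc(rec1, prec1))
prec0, rec0, _ = precision_recall_curve(1 - y_test, 1 - p)    # labels switched
print("PR AUC (negatives):", auc(rec0, prec0))
```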

The paper they refer to simulated a population of yes/no outcomes from a synthetic logistic regression model using typical procedures. They then drew samples of varying class ratios from this population by fixing the size of the majority class (which they labeled 0) at 5000 and drawing an appropriate number of the minority class (labeled 1). They fit logistic regressions to these samples and calculated ROC and PR curves and performance metrics from the same sample data. Their results showed (as would be expected) that ROC-derived metrics did not change with the class ratio of the testing/training data, but PR-derived metrics did, particularly when the minority class was labeled as the class of interest. The paper claims that their results show that “the difference in the mean Prec, Rec, and F-M within the class and between the classes can be used as an indicator of sampling bias and the resulting inaccuracy of the predicted probabilities using the MLLR model”.
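If I’m reading their setup correctly, the simulation is roughly the following (my own reconstruction, not their code; the coefficients, population size, and ratio grid are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)

# Simulate a large "population" from a known logistic model (coefficients are made up)
N = 200_000
X = rng.normal(size=(N, 2))
p_true = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0] - 0.7 * X[:, 1])))
y = rng.binomial(1, p_true)

n_majority = 5000
for ratio in [1, 2, 5, 10]:                                   # majority:minority ratios to test
    idx0 = rng.choice(np.flatnonzero(y == 0), n_majority, replace=False)           # majority, labeled 0
    idx1 = rng.choice(np.flatnonzero(y == 1), n_majority // ratio, replace=False)  # minority, labeled 1
    idx = np.concatenate([idx0, idx1])
    Xs, ys = X[idx], y[idx]

    fit = LogisticRegression().fit(Xs, ys)
    prob = fit.predict_proba(Xs)[:, 1]
    # Evaluated on the same sample, as in the paper: ROC AUC stays roughly constant
    # across ratios, while PR-based metrics shift with the class ratio
    print(ratio, roc_auc_score(ys, prob), average_precision_score(ys, prob))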

My gut feeling is that this claim is false, and that the stated results would hold for any test data with an imbalanced class ratio, independent of the model being evaluated. Does anyone have any insight on this?
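To make my intuition concrete: for any fixed operating point (fixed true-positive rate and false-positive rate, so entirely independent of the model), precision is mechanically tied to the prevalence of the class of interest in the evaluation data, precision = TPR·π / (TPR·π + FPR·(1 − π)). A quick illustration with arbitrary numbers:

```python
tpr, fpr = 0.8, 0.2  # arbitrary fixed operating point
for pi in [0.75, 0.5, 0.25, 0.1]:  # prevalence of the class of interest
    prec = tpr * pi / (tpr * pi + fpr * (1 - pi))
    print(f"prevalence {pi:.2f} -> precision {prec:.2f}")
```

So precision (and everything built on it: PR curves, F-measure) changes with the class ratio and with which class you call "positive" even when the classifier itself is perfectly well specified, which is why I don’t see how the between-class difference by itself indicates sampling bias.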

Sometimes @harrelfe hangs around the forum, and I’m sure he could guide you through this. But otherwise, for this sort of question you’ll probably have better luck on https://discourse.datamethods.org.


Thanks for the suggestion, I’ll check the forum out!