
Interpretations of ROC-AUC & PR-AUC metrics

1. Help me understand how to interpret ROC-AUC score. E.g. what does a 90% ROC-AUC score mean? How is it different from PR-AUC score?

These metrics get thrown around a lot but are rarely well understood. Let’s break them down cleanly, with intuition, toy examples, and guidance on when to use each.

Part 1: ROC-AUC: What it really measures

ROC curve = Receiver Operating Characteristic curve.

  • It plots True Positive Rate (Recall) vs. False Positive Rate (FPR = FP / (FP+TN)) at all classification thresholds.
  • The AUC (Area Under Curve) is the probability that a randomly chosen positive example is ranked above a randomly chosen negative example.

Interpretation of ROC-AUC = 0.90

  • If you pick one positive sample and one negative sample, your model will score the positive higher 90% of the time.
  • Equivalently: the model is very good at separating the classes overall.

Extreme cases:

  • 0.5 → random guessing (diagonal line).
  • 1.0 → perfect separation.
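The ranking interpretation above can be checked directly. Below is a minimal sketch (assuming scikit-learn and NumPy are available; the Gaussian toy scores are made up for illustration) that computes ROC-AUC and compares it against the fraction of (positive, negative) pairs the model ranks correctly:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 500 + [1] * 500)              # balanced labels
# Toy scores: positives drawn from a higher-mean distribution.
scores = np.concatenate([rng.normal(0.0, 1.0, 500),
                         rng.normal(1.5, 1.0, 500)])

auc = roc_auc_score(y_true, scores)

# Pairwise check: fraction of (positive, negative) pairs where the
# positive outscores the negative, counting ties as half correct.
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = ((pos[:, None] > neg[None, :]).mean()
         + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(round(auc, 4), round(pairs, 4))  # the two numbers agree
```

The agreement is not a coincidence: ROC-AUC is mathematically identical to this pairwise ranking probability (the Mann-Whitney U statistic, normalized).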

Part 2: PR-AUC: What it really measures

Precision-Recall curve plots Precision vs. Recall at all thresholds.

  • The PR-AUC is the average precision across recall levels.
  • It tells you: When I want high recall, how much precision do I have to give up?

Interpretation of PR-AUC = 0.60

  • Across all recall levels, the model has about 60% average precision.
  • If the dataset is very imbalanced (say only 1% positives), 60% PR-AUC is actually quite good, because a random classifier’s PR-AUC would be ~0.01 (the base positive rate).
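As a quick sketch of the comparison to the base rate (assuming scikit-learn; the imbalanced toy data and score distributions are invented for illustration), you can trace the PR curve and compare its area to the prevalence:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(1)
y_true = np.array([1] * 100 + [0] * 9900)             # 1% positives
scores = np.concatenate([rng.normal(2.0, 1.0, 100),   # positives score higher
                         rng.normal(0.0, 1.0, 9900)])

# The PR curve itself, plus its area (average precision).
precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)

print(f"PR-AUC = {ap:.3f} vs random baseline = {y_true.mean():.2f}")
```

Note that the meaningful comparison for PR-AUC is always against the prevalence (`y_true.mean()`), not against 0.5.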

Part 3: ROC-AUC vs PR-AUC - when to use

  • ROC-AUC is good when:

    • Classes are relatively balanced.
    • You care about overall ranking ability.
    • Example: disease diagnosis where both false positives and false negatives matter.
  • PR-AUC is better when:

    • Classes are highly imbalanced (rare positives).
    • You really care about “when the model predicts positive, how often is it correct?”
    • Example: fraud detection, cancer detection, anomaly detection.

ROC-AUC can be misleading in imbalanced cases.

  • With 1% positives, a model can get ROC-AUC = 0.90 but PR-AUC = 0.10. That means: yes, it ranks positives higher, but most of the actual predicted positives are still false alarms.
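This gap is easy to reproduce. A sketch under the same assumptions as before (scikit-learn, invented Gaussian toy scores at 1% prevalence) shows a high ROC-AUC coexisting with a much lower PR-AUC:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(2)
y_true = np.array([1] * 100 + [0] * 9900)             # 1% positives
scores = np.concatenate([rng.normal(1.8, 1.0, 100),   # positives score higher
                         rng.normal(0.0, 1.0, 9900)])

roc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)
print(f"ROC-AUC = {roc:.2f}, PR-AUC = {ap:.2f}")      # large gap under imbalance
```

The reason for the gap: ROC-AUC's false positive rate divides by the huge negative class, so even many false alarms barely move it, while precision divides by the (small) predicted-positive set and suffers directly.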

Part 4: Comparison to Accuracy / Precision / Recall

  • Accuracy: fraction of correct predictions.
    • Misleading on imbalanced data (99% accuracy just by always predicting “negative”).
  • Precision: of the predicted positives, how many were correct.
  • Recall: of all true positives, how many you caught.
  • These depend on one threshold (say, a 0.5 cutoff).

ROC-AUC and PR-AUC don’t depend on a fixed threshold.

  • They summarize performance across all possible thresholds.
  • That makes them more robust for model evaluation and threshold tuning.
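The contrast can be sketched in a few lines (same assumptions: scikit-learn, invented imbalanced toy scores): precision and recall shift with the cutoff, while the two AUCs summarize the whole score ranking at once.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

rng = np.random.default_rng(3)
y_true = np.array([1] * 100 + [0] * 9900)             # 1% positives
scores = np.concatenate([rng.normal(2.0, 1.0, 100),
                         rng.normal(0.0, 1.0, 9900)])

# "Always predict negative" already gets 99% accuracy on this data.
acc_trivial = (y_true == 0).mean()

for cutoff in (0.5, 1.5, 2.5):                         # threshold-dependent
    y_pred = (scores >= cutoff).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"cutoff {cutoff}: precision={p:.2f} recall={r:.2f}")

roc = roc_auc_score(y_true, scores)                    # threshold-free
ap = average_precision_score(y_true, scores)
print(f"ROC-AUC={roc:.3f}, PR-AUC={ap:.3f}")
```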

Part 5: Rule of Thumb

  • Use ROC-AUC: when classes are balanced and you care about ranking quality.
  • Use PR-AUC: when positives are rare and you care about actual “positive predictions” being useful.
  • Use Accuracy/Precision/Recall: when the threshold is fixed and well-defined (e.g. regulatory or business rule).

So, 90% ROC-AUC means: your model ranks positives above negatives 9 times out of 10 - strong discrimination. But whether that translates to useful predictions depends on class balance. That’s why PR-AUC is often the more telling metric in imbalanced settings.


2. Explain the interpretation of PR-AUC = 0.60 and why. Also explain why a random classifier’s PR-AUC would be ~0.01 if we have 1% positive and 99% negative data.

Let’s drill into PR-AUC = 0.60 and why the random baseline is the positive prevalence (e.g. ~0.01 if only 1% positives).

Part 1: What does PR-AUC = 0.60 mean?

  • PR curve = plots Precision vs Recall at every threshold.
  • PR-AUC = area under that curve = “average precision across recall levels.”

If PR-AUC = 0.60, that means:

  • Averaged across recall levels, about 60% of the model’s predicted positives are actually correct.
  • Another way to see it: if you rank the data by model score and walk down the list, the top of the ranking is strongly enriched for true positives, with roughly 60% precision on average versus the baseline prevalence.

Part 2: Why is the random classifier’s PR-AUC equal to the prevalence?

Suppose you have a dataset of 10,000 samples:

  • 100 are positive (1%).
  • 9,900 are negative.

Now imagine a random classifier that assigns random scores. If you sweep thresholds:

  • At any threshold, the set of predicted positives is essentially a random subset of the population.
  • So the fraction of true positives inside that set = prevalence = 100 / 10,000 = 0.01.

So:

  • Precision is flat = 0.01, regardless of recall.
  • PR curve is a horizontal line at 0.01.
  • Area under that curve = 0.01.
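This baseline can be verified empirically. A small sketch (assuming scikit-learn; repeats are averaged only to smooth out sampling noise) scores the same 1%-positive dataset with purely random scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(4)
y_true = np.array([1] * 100 + [0] * 9900)              # prevalence = 0.01

# Random scores carry no signal; average over repeats to reduce noise.
aps = [average_precision_score(y_true, rng.random(10_000))
       for _ in range(20)]
print(f"mean random PR-AUC = {np.mean(aps):.3f} (prevalence = 0.01)")
```

The mean lands near 0.01, matching the argument above: a random subset of predictions has expected precision equal to the base rate.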

That’s why:

  • PR-AUC baseline = positive rate.
  • With PR-AUC = 0.60, your model is massively better than random if prevalence is low (say 1%).

Part 3: Why does this matter?

  • If your dataset has 1% positives:
    • ROC-AUC = 0.90 might sound “great” … but you could still end up with very few correct positives in the top predictions.
    • PR-AUC = 0.60 tells you that your model’s predicted positive set is, on average, 60x more enriched in true positives than random guessing. That’s far more useful.

Summary

  • PR-AUC = 0.60 → on average, 60% of predicted positives are true positives across thresholds.
  • Random classifier’s PR-AUC = prevalence because predictions are random subsets, so expected precision equals the base rate (e.g. 0.01 if 1% positives).
  • That’s why PR-AUC is the metric of choice for imbalanced classification: it reflects how much better you are than just knowing the base rate.
This post is licensed under CC BY 4.0 by the author.