Interpretations of ROC-AUC & PR-AUC metrics
1. Help me understand how to interpret ROC-AUC score. E.g. what does a 90% ROC-AUC score mean? How is it different from PR-AUC score?
These metrics get thrown around a lot but are often not well understood. Let’s break them down cleanly, with intuition, toy examples, and guidance on when to use each.
Part 1: ROC-AUC: What it really measures
ROC curve = Receiver Operating Characteristic curve.
- It plots True Positive Rate (Recall) vs. False Positive Rate (FPR = FP / (FP+TN)) at all classification thresholds.
- The AUC (Area Under Curve) is the probability that a randomly chosen positive example is ranked above a randomly chosen negative example.
Interpretation of ROC-AUC = 0.90
- If you pick one positive sample and one negative sample, your model will score the positive higher 90% of the time.
- Equivalently: the model is very good at separating the classes overall.
Extreme cases:
- 0.5 → random guessing (diagonal line).
- 1.0 → perfect separation.
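To make the pairwise-ranking interpretation concrete, here is a minimal sketch (toy labels and scores made up for illustration, scikit-learn assumed to be available) that computes ROC-AUC directly from positive/negative pairs and checks it against roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0])                  # toy labels
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])   # toy model scores

# Pairwise definition: fraction of (positive, negative) pairs where the
# positive gets the higher score (ties count as half a win).
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise_auc = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(pairwise_auc)                     # ~0.917 for this toy example
print(roc_auc_score(y_true, y_score))   # same value from scikit-learn
```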
Part 2: PR-AUC: What it really measures
Precision-Recall curve plots Precision vs. Recall at all thresholds.
- The PR-AUC is the average precision across recall levels.
- It tells you: When I want high recall, how much precision do I have to give up?
Interpretation of PR-AUC = 0.60
- Across all recall levels, the model has about 60% average precision.
- If the dataset is very imbalanced (say only 1% positives), 60% PR-AUC is actually quite good, because a random classifier’s PR-AUC would be ~0.01 (the base positive rate).
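As a hedged sketch (same kind of toy labels and scores, scikit-learn assumed), here is how the PR curve and its area are typically computed; scikit-learn’s average_precision_score is the step-wise “average precision” summary of that curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))              # points on the PR curve
print(average_precision_score(y_true, y_score))  # PR-AUC as average precision
```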
Part 3: ROC-AUC vs PR-AUC - when to use
ROC-AUC is good when:
- Classes are relatively balanced.
- You care about overall ranking ability.
- Example: disease diagnosis where both false positives and false negatives matter.
PR-AUC is better when:
- Classes are highly imbalanced (rare positives).
- You really care about “when the model predicts positive, how often is it correct?”
- Example: fraud detection, cancer detection, anomaly detection.
ROC-AUC can be misleading in imbalanced cases.
- With 1% positives, a model can get ROC-AUC = 0.90 but PR-AUC = 0.10. That means: yes, it ranks positives higher on average, but most of the predicted positives are still false alarms, as the simulation below illustrates.
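A hedged simulation of this effect, using synthetic Gaussian scores and a 1% positive rate (numbers chosen only for illustration): ROC-AUC comes out around 0.90 while PR-AUC stays far lower.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n_neg, n_pos = 99_000, 1_000                       # 1% positive prevalence

y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
y_score = np.concatenate([
    rng.normal(0.0, 1.0, n_neg),                   # negative scores
    rng.normal(1.8, 1.0, n_pos),                   # positive scores, shifted up
])

print("ROC-AUC:", roc_auc_score(y_true, y_score))            # ~0.90
print("PR-AUC :", average_precision_score(y_true, y_score))  # much lower
```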
Part 4: Comparison to Accuracy / Precision / Recall
- Accuracy: fraction of correct predictions.
- Misleading on imbalanced data (with 1% positives, you get 99% accuracy just by always predicting “negative”).
- Precision: of the predicted positives, how many were correct.
- Recall: of all actual positives, how many you caught.
- These all depend on a single threshold (say, a 0.5 cutoff).
ROC-AUC and PR-AUC don’t depend on a fixed threshold.
- They summarize performance across all possible thresholds.
- That makes them more robust for model evaluation and threshold tuning.
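To see the threshold dependence in code, here is a minimal sketch (toy data again) contrasting accuracy/precision/recall at a fixed 0.5 cutoff with the threshold-free ROC-AUC and PR-AUC:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, average_precision_score)

y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])
y_pred = (y_score >= 0.5).astype(int)    # hard labels from a fixed 0.5 cutoff

print("accuracy :", accuracy_score(y_true, y_pred))   # depends on the cutoff
print("precision:", precision_score(y_true, y_pred))  # depends on the cutoff
print("recall   :", recall_score(y_true, y_pred))     # depends on the cutoff
print("ROC-AUC  :", roc_auc_score(y_true, y_score))             # uses raw scores
print("PR-AUC   :", average_precision_score(y_true, y_score))   # uses raw scores
```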
Part 5: Rule of Thumb
- Use ROC-AUC: when classes are balanced and you care about ranking quality.
- Use PR-AUC: when positives are rare and you care about actual “positive predictions” being useful.
- Use Accuracy/Precision/Recall: when the threshold is fixed and well-defined (e.g. regulatory or business rule).
So, 90% ROC-AUC means: your model ranks positives above negatives 9 times out of 10 - strong discrimination. But whether that translates to useful predictions depends on class balance. That’s why PR-AUC is often the more telling metric in imbalanced settings.
2. Explain what’s the interpretation of PR-AUC = 0.60 and why so? Also explain why a random classifier’s PR-AUC would be ~0.01 if we’ve 1% positive and 99% negative data?
Let’s drill into PR-AUC = 0.60 and why the random baseline is the positive prevalence (e.g. ~0.01 if only 1% positives).
Part 1: What does PR-AUC = 0.60 mean?
- PR curve = Precision vs. Recall plotted at every threshold.
- PR-AUC = area under that curve = “average precision across recall levels.”
If PR-AUC = 0.60, that means:
- Averaged across recall levels (i.e. across thresholds), about 60% of the model’s predicted positives are actually correct.
- Another way to see it: if you rank the data by model score and “walk down the list,” the top portions of the ranking are strongly enriched for true positives: roughly 60% precision on average, versus the baseline prevalence you would get from a random ordering (see the sketch below).
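As a rough illustration of that “walk down the list” view (synthetic 1%-prevalence data, with an arbitrary score bump for positives, so the exact numbers are not meaningful), precision among the top-ranked samples ends up far above the base rate:

```python
import numpy as np

rng = np.random.default_rng(1)
n, prevalence = 10_000, 0.01
y_true = (rng.random(n) < prevalence).astype(int)
y_score = rng.random(n) + 0.8 * y_true        # informative but imperfect scores

order = np.argsort(-y_score)                  # rank all samples by score, descending
top_k = 100
print("precision in top", top_k, ":", y_true[order[:top_k]].mean())  # well above 0.01
print("baseline prevalence:", y_true.mean())                         # ~0.01
```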
Part 2: Why is the random classifier’s PR-AUC equal to the prevalence?
Suppose you have a dataset of 10,000 samples:
- 100 are positive (1%).
- 9,900 are negative.
Now imagine a random classifier that assigns random scores. If you sweep thresholds:
- At any threshold, the set of predicted positives is essentially a random subset of the population.
- So the fraction of true positives inside that set = prevalence = 100 / 10,000 = 0.01.
So:
- Precision is flat at 0.01 (in expectation), regardless of recall.
- PR curve is a horizontal line at 0.01.
- Area under that curve = 0.01.
That’s why:
- PR-AUC baseline = positive rate.
- With PR-AUC = 0.60, your model is massively better than random if prevalence is low (say 1%).
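A hedged sanity check of the baseline argument: assign purely random scores to 10,000 samples with 100 positives, and the metrics land roughly where the reasoning above predicts (ROC-AUC near 0.5, PR-AUC near the 0.01 prevalence).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, n_pos = 10_000, 100
y_true = np.zeros(n, dtype=int)
y_true[:n_pos] = 1                  # 100 positives out of 10,000 = 1% prevalence
y_score = rng.random(n)             # random scores: the model knows nothing

print("ROC-AUC:", roc_auc_score(y_true, y_score))            # ~0.5
print("PR-AUC :", average_precision_score(y_true, y_score))  # ~0.01 = prevalence
```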
Part 3: Why does this matter?
- If your dataset has 1% positives:
- ROC-AUC = 0.90 might sound “great” … but you could still end up with very few correct positives in the top predictions.
- PR-AUC = 0.60 tells you that your model’s predicted positive set is, on average, 60x more enriched in true positives than random guessing. That’s far more useful.
Summary
- PR-AUC = 0.60 → on average, 60% of predicted positives are true positives across thresholds.
- Random classifier’s PR-AUC = prevalence because predictions are random subsets, so expected precision equals the base rate (e.g. 0.01 if 1% positives).
- That’s why PR-AUC is the metric of choice for imbalanced classification: it reflects how much better you are than just knowing the base rate.