What Does “90 AUC” Really Mean?
An interactive exploration of AUC, prevalence, and the metrics that actually matter at deployment.
We trained a binary classifier and achieved an AUC of 0.90. Great! But what does that mean for our deployment, where we need discretized outputs? It depends entirely on two things we haven't specified yet: the decision threshold we choose and the prevalence of the positive class in the population we're deploying to.
The goal of this post is to build intuition for how AUC, prevalence, and decision thresholds interact with the metrics that matter at deployment.
Score Distributions
ROC Curve
Precision–Recall Curve
The Model
In Canonical example mode, we use a binormal model: negative cases have scores drawn from N(0, 1) and positive cases from N(μ, 1), where μ is chosen to produce the desired AUC.
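The binormal construction can be sketched in a few lines. For the equal-variance binormal model, AUC = Φ(μ/√2), so μ can be solved in closed form from the target AUC. The function names below are illustrative, not the post's actual source code:

```python
import numpy as np
from scipy.stats import norm, rankdata

def binormal_mu(target_auc):
    """Separation mu such that N(0,1) vs N(mu,1) scores yield the target AUC.

    Uses the closed form AUC = Phi(mu / sqrt(2)) for equal-variance binormal.
    """
    return np.sqrt(2) * norm.ppf(target_auc)

def empirical_auc(neg, pos):
    """Rank-based (Mann-Whitney) AUC estimate from two score arrays."""
    ranks = rankdata(np.concatenate([neg, pos]))
    n_neg, n_pos = len(neg), len(pos)
    # Sum of positive-class ranks, minus the minimum possible rank sum,
    # normalized by the number of (pos, neg) pairs.
    return (ranks[n_neg:].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
mu = binormal_mu(0.90)                   # ~1.81 for a target AUC of 0.90
neg = rng.normal(0.0, 1.0, 50_000)
pos = rng.normal(mu, 1.0, 50_000)
print(round(empirical_auc(neg, pos), 3))  # close to 0.90
```

The rank-based estimator is just the Mann–Whitney statistic, which is equivalent to the probability that a random positive outscores a random negative.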
In Sample many examples mode, we show that the same AUC can arise from very different score distributions. Each alternate curve models the positive class as a two-component Gaussian mixture, w · N(μ1, σ1²) + (1−w) · N(μ2, σ2²), with parameters chosen so the AUC matches the target. This produces bimodal and skewed distributions — a “hard” subpopulation that overlaps with negatives and an “easy” one that separates cleanly. Additional curves use a heteroscedastic binormal model (varying the positive-class variance) for further variety. Click any alternate curve to lock it as the active model and see how its metrics differ from the canonical one.
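One way to build such an alternate curve: since P(N(m, s²) > N(0, 1)) = Φ(m/√(1+s²)), the mixture's AUC is the weighted sum of its components' AUCs, so we can fix the "hard" component and solve for the "easy" component's mean so the total hits the target exactly. The specific parameter choices below (w, the hard mean, unit variances) are illustrative, not taken from the post:

```python
import numpy as np
from scipy.stats import norm

def solve_easy_mean(target_auc, w, mu_hard, s_hard=1.0, s_easy=1.0):
    """Mean of the 'easy' positive component so the mixture AUC hits target.

    Component AUC vs N(0,1) negatives: Phi(m / sqrt(1 + s^2)).
    """
    auc_hard = norm.cdf(mu_hard / np.sqrt(1 + s_hard**2))
    auc_easy = (target_auc - w * auc_hard) / (1 - w)  # must be < 1 to be feasible
    return np.sqrt(1 + s_easy**2) * norm.ppf(auc_easy)

w, mu_hard = 0.3, 1.0                        # 30% "hard" positives near the negatives
mu_easy = solve_easy_mean(0.90, w, mu_hard)  # ~2.47: the cleanly separated subpopulation

rng = np.random.default_rng(1)
n = 50_000
is_hard = rng.random(n) < w
pos = np.where(is_hard, rng.normal(mu_hard, 1.0, n), rng.normal(mu_easy, 1.0, n))
neg = rng.normal(0.0, 1.0, n)

# Pairwise AUC estimate on a subsample; should land near 0.90 despite the
# bimodal positive-class distribution.
auc = (pos[:5000, None] > neg[None, :5000]).mean()
print(round(auc, 2))
```

Sweeping w and the hard-component mean while re-solving for the easy mean traces out a family of very different-looking score distributions that all share the same AUC.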
The score distributions are scaled by prevalence, so their relative areas reflect how many positives and negatives you actually encounter.
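This is also where deployment metrics diverge from AUC. At a fixed threshold, TPR and FPR are properties of the model alone, but precision mixes in prevalence: precision = π·TPR / (π·TPR + (1−π)·FPR). A closed-form sketch with the canonical binormal model (the halfway threshold here is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import norm

mu = np.sqrt(2) * norm.ppf(0.90)   # binormal separation for AUC 0.90
t = mu / 2                         # arbitrary threshold halfway between the means

tpr = 1 - norm.cdf(t - mu)         # recall at this threshold (~0.82)
fpr = 1 - norm.cdf(t)              # false positive rate (~0.18)

def precision(pi):
    """Precision = pi*TPR / (pi*TPR + (1-pi)*FPR) at positive prevalence pi."""
    return pi * tpr / (pi * tpr + (1 - pi) * fpr)

for pi in (0.5, 0.1, 0.01):
    print(f"prevalence {pi:>4}: precision {precision(pi):.3f}")
# Same model, same threshold, same AUC -- but precision collapses as positives
# become rare, which is exactly what the precision-recall view exposes.
```

With balanced classes this threshold yields precision above 0.8; at 1% prevalence it falls below 0.05, even though the ROC curve has not moved at all.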