What Does “90 AUC” Really Mean?
An interactive exploration of AUC, prevalence, and the metrics that actually matter at deployment.
We trained a binary classifier and achieved an AUC of 0.90. Great! But what does that mean for our deployment, where we need discretized outputs? It depends entirely on two things we haven't specified yet: the decision threshold we choose and the prevalence of the positive class in the population we're deploying to.
The goal of this post is to build intuition for how AUC, prevalence, and decision thresholds interact with the metrics that matter at deployment.
Score Distributions
ROC Curve
Precision–Recall Curve
The Model
In Canonical example mode, we use a binormal model: negative cases have scores drawn from N(0, 1) and positive cases from N(μ, 1), where μ is chosen to produce the desired AUC.
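The binormal construction can be sketched in a few lines. For the equal-variance binormal model, AUC = Φ(μ/√2), so μ can be solved in closed form from the target AUC. The function names below are illustrative, not the post's actual source code:

```python
import numpy as np
from scipy.stats import norm, rankdata

def binormal_mu(target_auc):
    """Separation mu such that N(0,1) vs N(mu,1) scores yield the target AUC.

    Uses the closed form AUC = Phi(mu / sqrt(2)) for equal-variance binormal.
    """
    return np.sqrt(2) * norm.ppf(target_auc)

def empirical_auc(neg, pos):
    """Rank-based (Mann-Whitney) AUC estimate from two score arrays."""
    ranks = rankdata(np.concatenate([neg, pos]))
    n_neg, n_pos = len(neg), len(pos)
    # Sum of positive-class ranks, minus the minimum possible rank sum,
    # normalized by the number of (pos, neg) pairs.
    return (ranks[n_neg:].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
mu = binormal_mu(0.90)                   # ~1.81 for a target AUC of 0.90
neg = rng.normal(0.0, 1.0, 50_000)
pos = rng.normal(mu, 1.0, 50_000)
print(round(empirical_auc(neg, pos), 3))  # close to 0.90
```

The rank-based estimator is just the Mann–Whitney statistic, which is equivalent to the probability that a random positive outscores a random negative.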
In Sample many examples mode, we show that the same AUC can arise from very different score distributions. Each alternate curve models the positive class as a two-component Gaussian mixture, w · N(μ1, σ1²) + (1−w) · N(μ2, σ2²), with parameters chosen so the AUC matches the target. This produces bimodal and skewed distributions — a “hard” subpopulation that overlaps with negatives and an “easy” one that separates cleanly. Additional curves use a heteroscedastic binormal model (varying the positive-class variance) for further variety. Click any alternate curve to lock it as the active model and see how its metrics differ from the canonical one.
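One way to build such an alternate curve: since P(N(m, s²) > N(0, 1)) = Φ(m/√(1+s²)), the mixture's AUC is the weighted sum of its components' AUCs, so we can fix the "hard" component and solve for the "easy" component's mean so the total hits the target exactly. The specific parameter choices below (w, the hard mean, unit variances) are illustrative, not taken from the post:

```python
import numpy as np
from scipy.stats import norm

def solve_easy_mean(target_auc, w, mu_hard, s_hard=1.0, s_easy=1.0):
    """Mean of the 'easy' positive component so the mixture AUC hits target.

    Component AUC vs N(0,1) negatives: Phi(m / sqrt(1 + s^2)).
    """
    auc_hard = norm.cdf(mu_hard / np.sqrt(1 + s_hard**2))
    auc_easy = (target_auc - w * auc_hard) / (1 - w)  # must be < 1 to be feasible
    return np.sqrt(1 + s_easy**2) * norm.ppf(auc_easy)

w, mu_hard = 0.3, 1.0                        # 30% "hard" positives near the negatives
mu_easy = solve_easy_mean(0.90, w, mu_hard)  # ~2.47: the cleanly separated subpopulation

rng = np.random.default_rng(1)
n = 50_000
is_hard = rng.random(n) < w
pos = np.where(is_hard, rng.normal(mu_hard, 1.0, n), rng.normal(mu_easy, 1.0, n))
neg = rng.normal(0.0, 1.0, n)

# Pairwise AUC estimate on a subsample; should land near 0.90 despite the
# bimodal positive-class distribution.
auc = (pos[:5000, None] > neg[None, :5000]).mean()
print(round(auc, 2))
```

Sweeping w and the hard-component mean while re-solving for the easy mean traces out a family of very different-looking score distributions that all share the same AUC.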
The score distributions are scaled by prevalence, so their relative areas reflect how many positives and negatives you actually encounter.
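This is also where deployment metrics diverge from AUC. At a fixed threshold, TPR and FPR are properties of the model alone, but precision mixes in prevalence: precision = π·TPR / (π·TPR + (1−π)·FPR). A closed-form sketch with the canonical binormal model (the halfway threshold here is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import norm

mu = np.sqrt(2) * norm.ppf(0.90)   # binormal separation for AUC 0.90
t = mu / 2                         # arbitrary threshold halfway between the means

tpr = 1 - norm.cdf(t - mu)         # recall at this threshold (~0.82)
fpr = 1 - norm.cdf(t)              # false positive rate (~0.18)

def precision(pi):
    """Precision = pi*TPR / (pi*TPR + (1-pi)*FPR) at positive prevalence pi."""
    return pi * tpr / (pi * tpr + (1 - pi) * fpr)

for pi in (0.5, 0.1, 0.01):
    print(f"prevalence {pi:>4}: precision {precision(pi):.3f}")
# Same model, same threshold, same AUC -- but precision collapses as positives
# become rare, which is exactly what the precision-recall view exposes.
```

With balanced classes this threshold yields precision above 0.8; at 1% prevalence it falls below 0.05, even though the ROC curve has not moved at all.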