Picture this: A vendor walks into your AI governance committee meeting, slides gleaming with impressive statistics. “Our sepsis prediction algorithm achieves 94% accuracy!” they announce triumphantly. The CMO nods approvingly. The CFO is already calculating ROI. But your CMIO — the one who actually has to implement this thing — is squinting at the fine print.
She should be. That 94% accuracy figure might be hiding a dirty secret.
Here is the twist: if only 5% of your patients develop sepsis, an algorithm that simply predicts “no sepsis” for every single patient would achieve 95% accuracy. It would also be completely, catastrophically useless — missing every case it was designed to catch.
Welcome to the wild world of AI performance metrics, where the numbers that sound most impressive often matter least, and the metrics that actually predict clinical success require a decoder ring to understand.
This guide will give you that decoder ring.
The Smoke Detector Principle: Understanding the Tradeoffs
Before we dive into formulas, let us establish the fundamental tension at the heart of every AI system through a familiar analogy: your kitchen smoke detector.
A smoke detector has one job: detect fires. But it faces an impossible choice:
Option A: The Hair-Trigger Detector. Set the sensitivity high, and it catches every fire — even the tiny ones. Great! Except now it also screams every time you make toast, sear a steak, or take a hot shower with the bathroom door open. After the fifteenth false alarm this month, you rip the batteries out. Now it catches zero fires.
Option B: The Chill Detector. Set it to alarm only for serious, obvious fires, and you will enjoy peaceful cooking. But by the time this relaxed detector notices something is wrong, your kitchen is already engulfed in flames.
Every healthcare AI faces this exact tradeoff. A sepsis alert that fires constantly will be ignored. One that waits for certainty will fire too late. There is no free lunch — only choices about which errors you are willing to tolerate.
The metrics we are about to explore are simply different ways of measuring where an AI sits on this spectrum, and whether that position makes sense for your clinical context.
The Four Possible Outcomes: Meet the Confusion Matrix
Every time an AI makes a prediction, it lands in one of four buckets. Understanding these buckets is the foundation for everything else.
Imagine a metal detector at the beach. It beeps (positive prediction) or stays silent (negative prediction). Meanwhile, reality is that there either is or is not actually buried treasure beneath your feet.
True Positive (TP): The detector beeps, you dig, and find a gold doubloon. Victory! The AI correctly identified something real.
True Negative (TN): The detector stays silent, and there really is nothing there but sand. Correct non-detection. You saved yourself pointless digging.
False Positive (FP): The detector beeps excitedly, you dig for twenty minutes, and find… a bottle cap. The dreaded false alarm. In healthcare, this might mean an unnecessary workup, a worried patient, or another alert for an exhausted nurse to dismiss.
False Negative (FN): The detector stays silent, but six inches down was a chest of Spanish gold. You walk away, never knowing what you missed. In healthcare, this is the nightmare scenario — the missed diagnosis, the patient who deteriorates because no one was watching.
Now, here is the crucial insight: you cannot minimize both error types simultaneously. Reducing false negatives (catching more real cases) inevitably increases false positives (more false alarms). This is not a flaw in any particular AI — it is a fundamental law of detection systems.
The question is never “is this AI perfect?” The question is “does this AI make the right tradeoffs for our specific clinical situation?”
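If you prefer to see this in code, here is a minimal sketch of how the four buckets get tallied. The patients and labels are invented for illustration; a real system would pull predictions and outcomes from the EHR.

```python
# A minimal sketch of tallying the four outcomes, assuming binary labels:
# 1 = condition present / alert fired, 0 = absent / no alert.
# The ten "patients" below are invented for illustration.

def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for paired lists of 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn

actual    = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # did the patient develop sepsis?
predicted = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]  # did the AI alert?

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")  # TP=2  FP=2  TN=5  FN=1
```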
The Metrics: A Field Guide
Sensitivity: The “Did We Catch It?” Metric
The Story: Dr. Martinez runs the sepsis committee at a 400-bed community hospital. After a sentinel event where a patient deteriorated undetected, she is evaluating AI early warning systems. Her primary question: “Of all the patients who actually develop sepsis, how many will this system catch?”
What Sensitivity Measures: Of all the cases that actually have the condition, what percentage did the AI correctly flag?
The Formula: Sensitivity = True Positives / (True Positives + False Negatives)
The Analogy: Sensitivity is like grading a goalie on saves. If 100 shots come at the goal and the goalie stops 92, their “sensitivity” is 92%. The shots that got past — those are the false negatives.
When It Matters Most: Sensitivity is paramount when missing a case is catastrophic. Screening for cancer. Alerting on deteriorating patients. Detecting pulmonary embolisms. In these scenarios, you want the highest sensitivity you can tolerate — even if it means more false alarms.
The Catch: A smoke detector with 100% sensitivity would alert on every wisp of steam. An AI tuned for maximum sensitivity will flag patients who are fine. Pushing sensitivity up pushes specificity down.
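In code, sensitivity is a one-liner on the confusion-matrix counts. A quick sketch using the goalie's illustrative numbers:

```python
# Sensitivity = TP / (TP + FN): of the cases that were real, how many
# did we catch? Numbers mirror the goalie analogy, not real data.

def sensitivity(tp, fn):
    return tp / (tp + fn)

print(sensitivity(tp=92, fn=8))  # 0.92 -- 92 saves out of 100 shots on goal
```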
Specificity: The “Did We Leave Them Alone?” Metric
The Story: Dr. Okafor is the hospitalist who has to respond to those sepsis alerts. He is drowning in notifications — his pager buzzes constantly, and 80% of the time, the patient is fine. He wants to know: “Of all the patients who do NOT have sepsis, how many does this system correctly leave alone?”
What Specificity Measures: Of all the cases without the condition, what percentage did the AI correctly clear as negative?
The Formula: Specificity = True Negatives / (True Negatives + False Positives)
The Analogy: If sensitivity is grading the goalie on saves, specificity is grading them on not diving for shots that were going wide anyway. A goalie who flops dramatically at every ball, even ones headed five feet outside the post, has low specificity — lots of wasted effort.
When It Matters Most: Specificity matters when false positives create real harm: alert fatigue, unnecessary procedures, patient anxiety, or wasted resources. A skin cancer AI with low specificity means many unnecessary biopsies. A chest X-ray AI with low specificity means radiologists drowning in false findings.
The Catch: You can achieve 100% specificity by never flagging anything. Perfect peace and quiet — and zero catches. Specificity must be balanced against sensitivity.
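And the mirror-image sketch for specificity, again with invented numbers:

```python
# Specificity = TN / (TN + FP): of the patients without the condition,
# how many did the AI correctly leave alone? Illustrative numbers only.

def specificity(tn, fp):
    return tn / (tn + fp)

print(specificity(tn=720, fp=180))  # 0.8 -- quiet for 720 of 900 healthy patients
```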
Positive Predictive Value (PPV): The “Can I Trust This Alert?” Metric
The Story: Nurse Thompson sees an AI alert pop up on her dashboard: “Patient in 412B at elevated risk for deterioration.” She has been burned before — last week she rushed to three “high risk” patients who were perfectly stable. She wonders: “When this system says something is wrong, how often is it actually right?”
What PPV Measures: When the AI says positive, how often is it correct?
The Formula: PPV = True Positives / (True Positives + False Positives)
The Analogy: PPV is the “spam filter” metric. If your email spam filter flags 100 messages as spam and 75 of them really are spam, its PPV is 75%. The other 25 are legitimate emails you might miss — the false positives that erode your trust in the filter.
Why PPV Is Sneaky: Here is where things get interesting. PPV depends heavily on how common the condition is (prevalence). The same AI with the same sensitivity and specificity will have vastly different PPV depending on your patient population.
A Tale of Two Hospitals:
Hospital A is a specialized cardiac center. 30% of chest pain patients have acute coronary syndrome (ACS). Their AI has 90% sensitivity and 80% specificity.
- When the AI alerts, PPV is 66% — two-thirds of alerts are real.
Hospital B is a suburban urgent care. Only 2% of chest pain patients have ACS. Same AI, same sensitivity, same specificity.
- When the AI alerts, PPV is only 8% — eleven out of twelve alerts are false alarms.
Same AI. Radically different real-world experience.
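The two-hospital arithmetic is worth working through yourself. Here is a small sketch that derives PPV from sensitivity, specificity, and prevalence, using the figures from the story above; the per-1,000-patient framing is just for convenience.

```python
# PPV depends on prevalence. Derive it from sensitivity, specificity,
# and prevalence for a notional population of n patients.

def ppv(sensitivity, specificity, prevalence, n=1000):
    with_condition = n * prevalence
    without_condition = n - with_condition
    tp = sensitivity * with_condition            # real cases the AI flags
    fp = (1 - specificity) * without_condition   # healthy patients it flags anyway
    return tp / (tp + fp)

print(ppv(0.90, 0.80, 0.30))  # Hospital A, cardiac center: ~0.66
print(ppv(0.90, 0.80, 0.02))  # Hospital B, urgent care:    ~0.08
```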
The Lesson: Always ask vendors: “What is the PPV in a population like ours?” Sensitivity and specificity look impressive in slide decks. PPV tells you what your clinicians will actually experience.
Negative Predictive Value (NPV): The “Can I Trust the All-Clear?” Metric
The Story: Dr. Patel is using an AI to help rule out pulmonary embolism in low-risk patients. She wants to know: “When this system says a patient is negative, how confident can I be that they really are negative?”
What NPV Measures: When the AI says negative, how often is it correct?
The Formula: NPV = True Negatives / (True Negatives + False Negatives)
The Analogy: NPV is the “peace of mind” metric. When the security system says “all clear,” how confidently can you go to sleep?
Why NPV Is Usually High: Here is a counterintuitive truth — NPV is almost always impressively high, even for mediocre AIs. Why? Because most patients do not have most conditions. If only 2% of your patients have a disease, even a so-so AI will correctly clear most of the 98% who are healthy. You will see NPVs of 98%, 99%, even 99.5%.
Do Not Be Fooled: A high NPV does not mean the AI is good. It often just means the condition is rare. An AI that flipped a coin would still have stellar NPV for rare diseases.
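To see how far prevalence alone can carry NPV, here is a quick sketch. The 70/70 performance figures are deliberately mediocre and purely illustrative.

```python
# NPV = TN / (TN + FN). At 2% prevalence, even a mediocre AI
# (70% sensitivity, 70% specificity) posts a gaudy NPV.

def npv(sensitivity, specificity, prevalence, n=1000):
    with_condition = n * prevalence
    without_condition = n - with_condition
    tn = specificity * without_condition       # healthy patients correctly cleared
    fn = (1 - sensitivity) * with_condition    # real cases it missed
    return tn / (tn + fn)

print(npv(0.70, 0.70, 0.02))  # ~0.99 -- mostly a reflection of the 98% who are healthy
```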
Accuracy: The Seductive Deceiver
The Story: A vendor proudly presents their diabetic retinopathy AI: “99.2% accuracy in our validation study!” The room is impressed. But the skeptical data scientist in the back raises her hand: “What was the prevalence of retinopathy in your study population?”
The vendor hesitates. “About… 0.8%.”
The Problem with Accuracy: Accuracy is the percentage of all predictions that were correct. It sounds intuitive, democratic, comprehensive. It is also deeply misleading for healthcare applications.
The Formula: Accuracy = (True Positives + True Negatives) / Total Cases
Why It Lies: Remember our opening example? An AI that predicts “no sepsis” for everyone achieves 95% accuracy when sepsis prevalence is 5%. An AI that predicts “no rare cancer” for everyone achieves 99.9% accuracy when that cancer affects 0.1% of patients.
Accuracy is dominated by the majority class. For rare conditions — which is most of what we screen for in healthcare — accuracy tells you almost nothing about what matters.
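The do-nothing baseline from the opening example takes only a few lines to reproduce, assuming 5% sepsis prevalence in a notional 1,000-patient population:

```python
# An "AI" that never alerts, evaluated at 5% sepsis prevalence.
n = 1000
tp, fp = 0, 0        # it never flags anyone
fn = 50              # so it misses all 50 real cases (5% of 1,000)
tn = n - fn          # and is "right" about the 950 who are fine

print((tp + tn) / n)    # accuracy = 0.95 -- looks great on a slide
print(tp / (tp + fn))   # sensitivity = 0.0 -- catches nothing
```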
The Rule: Be deeply skeptical of accuracy claims, especially for rare conditions. Ask for sensitivity and PPV instead.
AUC-ROC: The Threshold-Agnostic Referee
The Story: Your governance committee is evaluating three competing sepsis AI vendors. Each presents different sensitivity and specificity numbers, but you suspect they have cherry-picked favorable thresholds. You need a way to compare their underlying discriminative ability — how well they separate sick patients from healthy ones, regardless of where you set the alarm threshold.
What AUC Measures: The AI’s ability to correctly rank a randomly chosen positive case higher than a randomly chosen negative case.
The Plain English: If you picked one patient who has sepsis and one who does not, how likely is the AI to assign a higher risk score to the one who actually has sepsis?
The Scale:
- 0.50 = Coin flip. The AI has no discriminative ability.
- 0.70-0.80 = Acceptable discrimination.
- 0.80-0.90 = Excellent discrimination.
- 0.90+ = Outstanding discrimination.
The Analogy: Imagine a talent show judge who must rank singers from best to worst. AUC asks: if you gave the judge one good singer and one bad singer, how likely are they to correctly rank the good one higher? A perfect judge (AUC = 1.0) always gets it right. A random guesser (AUC = 0.5) gets it right half the time by chance.
Why AUC Is Useful: AUC lets you compare algorithms without committing to a specific threshold. It answers: “Setting aside where we put the cutoff, how good is this AI at its fundamental job of distinguishing sick from healthy?”
The Limitations: AUC does not tell you about calibration (whether the AI’s probability estimates are accurate), and it does not tell you what happens at any specific operating threshold. An AI with excellent AUC might still have terrible PPV if you deploy it for a rare condition.
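The pairwise-ranking definition translates directly into code. A small sketch with invented risk scores:

```python
# AUC as a ranking exercise: the probability that a randomly chosen
# positive case gets a higher risk score than a randomly chosen
# negative case (ties count as half). Scores below are invented.

def auc_by_ranking(scores_pos, scores_neg):
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in scores_pos
        for q in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

septic  = [0.82, 0.67, 0.91, 0.58]   # risk scores for patients who developed sepsis
healthy = [0.31, 0.45, 0.72, 0.12]   # risk scores for patients who did not

print(auc_by_ranking(septic, healthy))  # 0.875 -- ranks the sick patient higher 87.5% of the time
```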
F1 Score: The Balanced Scorecard
The Story: Your team is building a custom AI model to identify patients for a care management program. You care about both finding eligible patients (sensitivity) and not wasting care manager time on ineligible referrals (PPV). You need a single number that balances both concerns.
What F1 Measures: The harmonic mean of precision (PPV) and recall (sensitivity).
The Formula: F1 = 2 x (Precision x Recall) / (Precision + Recall)
Why Harmonic Mean? The harmonic mean harshly penalizes imbalances. An AI with 95% sensitivity but only 10% PPV does not get a nice average score of 52.5%. It gets an F1 of 0.18. The harmonic mean forces both metrics to be reasonable.
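A quick sketch of the harmonic mean's punishing arithmetic, using the 95%/10% example above:

```python
# F1 = harmonic mean of precision (PPV) and recall (sensitivity).
# Imbalanced performance gets punished; balanced performance gets rewarded.

def f1(precision, recall):
    return 2 * (precision * recall) / (precision + recall)

print(f1(precision=0.10, recall=0.95))  # ~0.18, nowhere near the arithmetic mean of 0.525
print(f1(precision=0.60, recall=0.70))  # ~0.65 -- balance pays
```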
The Analogy: F1 is like a decathlon score — you cannot just be amazing at one event and terrible at others. You need balanced performance across the board.
When to Use It: F1 is most valuable when you need a single summary number and both false positives and false negatives matter roughly equally. It is particularly useful for imbalanced datasets where accuracy would be misleading.
Metrics by Use Case: The Right Tool for the Job
Different AI tools have fundamentally different failure modes and clinical stakes. Let us map the right metrics to three common healthcare AI applications.
AI Scribes: A Different Beast Entirely
AI scribes do not make binary predictions — they generate free-text clinical notes from conversations. This requires entirely different quality measures.
The Horror Story: A physician discusses a patient’s issues with their hands, feet, and mouth. The AI scribe confidently documents a diagnosis of “hand, foot, and mouth disease.” This is not a classification error — it is a hallucination, the AI inventing clinical content that was never discussed.
What to Measure:
| Quality Dimension | What It Means | Why It Matters |
|---|---|---|
| Factual Accuracy | Did the note capture correct information? | Errors become part of the medical record |
| Completeness | Did it capture ALL relevant information? | Omissions are silent — clinicians must remember what was missed |
| Hallucination Rate | Did it add information never discussed? | Fabricated content can drive incorrect treatment decisions |
| Organization | Is the note well-structured? | Affects downstream usability and comprehension |
The Omission Problem: Omission errors are particularly insidious because they require the clinician to actively recall what was discussed and notice its absence. After seeing twelve patients, can you reliably remember that the AI failed to document the medication change you discussed with patient number four?
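There is no standard formula here, but the arithmetic of a chart-review audit is simple. A hypothetical sketch, assuming reviewers tag each note’s fabricated statements and omitted items; the field names and counts are invented for illustration, not any vendor’s reporting format.

```python
# Hypothetical audit results: reviewers compared each AI-generated note
# against the encounter and tagged fabricated statements and omissions.
reviewed_notes = [
    {"statements": 42, "hallucinated": 1, "relevant_items": 18, "omitted": 2},
    {"statements": 35, "hallucinated": 0, "relevant_items": 15, "omitted": 1},
    {"statements": 51, "hallucinated": 2, "relevant_items": 22, "omitted": 0},
]

total_statements = sum(n["statements"] for n in reviewed_notes)
total_relevant = sum(n["relevant_items"] for n in reviewed_notes)

hallucination_rate = sum(n["hallucinated"] for n in reviewed_notes) / total_statements
omission_rate = sum(n["omitted"] for n in reviewed_notes) / total_relevant

print(f"Hallucination rate: {hallucination_rate:.1%}")  # 2.3% of documented statements
print(f"Omission rate: {omission_rate:.1%}")            # 5.5% of clinically relevant items
```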
What to Ask Vendors:
- “How do you measure hallucination rate, and what is it?”
- “What percentage of clinically relevant information is captured?”
- “Can clinicians trace AI-generated text back to source audio?”
Clinical Decision Support: The Alert Fatigue Battleground
CDS tools face the classic sensitivity-specificity tension at its most acute. Miss a deteriorating patient, and someone could die. Alert too often, and clinicians tune out entirely.
The Cautionary Tale: A major health system implemented a sepsis AI with 97% sensitivity — it almost never missed a case. But the PPV was 15%. Nurses received an average of 47 alerts per 12-hour shift. Within three months, the override rate exceeded 90%. The AI had effectively been turned off by exhausted clinicians.
The Metrics That Matter:
| Metric | Target Range | Why This Range |
|---|---|---|
| Sensitivity | 85-95% | High enough to catch most cases, not so high that everything alerts |
| PPV | >30% | Clinicians need to see enough real cases to maintain trust |
| Alert Volume | Site-specific | Must be manageable within existing workflows |
The Key Question: “At your recommended threshold, what is the PPV in a population with our prevalence?”
Sensitivity is what vendors want to show you. PPV is what your nurses will experience.
Medical Imaging AI: Detection Plus Localization
Imaging AI has the added complexity that being right is not enough — the AI must also point to the right place.
The Localization Problem: An AI says “positive for pulmonary embolism.” The radiologist looks and sees… nothing in the area the AI highlighted. Is this a true positive because there was a PE elsewhere in the scan? Is it a false positive because the AI pointed to the wrong spot? Traditional metrics do not capture this nuance.
The Metrics That Matter:
| Metric | Why It Matters |
|---|---|
| Sensitivity | Core value proposition is not missing pathology |
| Specificity | Determines false positive volume in high-throughput reading |
| PPV | Real-world experience when AI highlights findings |
| AUC | Comparison across algorithms at various thresholds |
| Localization Accuracy | Is the AI pointing to the right place? |
The Prevalence Trap: A pulmonary embolism AI might show 95% sensitivity and 99% specificity in vendor materials. But PE prevalence in routine chest CTs is around 1-2%. At that prevalence, even excellent specificity translates to a PPV of only 50-70%. At the lower end of that range, half of the AI’s “findings” will be false positives.
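You can check that arithmetic yourself. A quick sketch at the stated 95% sensitivity and 99% specificity, per 1,000 routine chest CTs:

```python
# PPV at 1-2% PE prevalence with 95% sensitivity and 99% specificity.
def ppv(sens, spec, prev, n=1000):
    tp = sens * (n * prev)
    fp = (1 - spec) * (n * (1 - prev))
    return tp / (tp + fp)

print(ppv(0.95, 0.99, 0.01))  # ~0.49 -- about half of flagged studies are false positives
print(ppv(0.95, 0.99, 0.02))  # ~0.66
```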
What to Ask Vendors:
- “Was the AI validated on equipment similar to ours?”
- “How does performance vary by anatomic location?”
- “What is the PPV at our expected disease prevalence?”
Practical Wisdom for Governance Committees
The Vendor Slide Deck Decoder
When a vendor presents performance metrics, apply these filters:
| They Show | You Ask |
|---|---|
| “94% accuracy” | “What is the prevalence? What is the sensitivity?” |
| “98% sensitivity” | “What is the PPV at that threshold? What is the alert volume?” |
| “AUC of 0.95” | “What is the sensitivity and PPV at your recommended threshold?” |
| “Validated on 10,000 cases” | “Were those cases representative of our population?” |
The Three Questions That Matter
Before approving any AI deployment, your committee should be able to answer:
- “What is the worst failure mode, and how likely is it?” (Usually: missed cases or alert fatigue)
- “What metrics actually predict success in OUR context?” (Usually: PPV and sensitivity at the specific threshold, not AUC or accuracy)
- “How will we know if performance degrades after go-live?” (Hint: if you do not have a monitoring plan, you do not have a deployment plan)
The Local Validation Imperative
An AI that performs brilliantly in a vendor’s test environment may fail in yours because of:
- Different disease prevalence
- Different patient demographics
- Different imaging equipment or protocols
- Different documentation styles
- Different workflow integration points
Demand local validation data or pilot periods with your own performance measurement before full deployment.
Conclusion: Beyond the Numbers
The metrics we have explored are not just academic exercises — they are the language of AI accountability. A governance committee that cannot interrogate sensitivity, specificity, and PPV claims is flying blind.
But here is the deeper truth: no metric captures everything that matters. Patient trust is not a number. Clinician workflow disruption does not show up in an AUC. The downstream consequences of a missed diagnosis ripple through families and communities in ways no confusion matrix can quantify.
Use these metrics as tools for asking better questions, not as substitutes for clinical judgment. The right AI for your health system is not the one with the most impressive statistics — it is the one whose tradeoffs align with your clinical priorities, your patient population, and your operational reality.
And when that vendor walks in with their gleaming slide deck, you will know exactly which questions to ask.
This post is part of our series on preparing for Joint Commission AI certification. For a complete overview of the RUAIH framework, see Part 1: Understanding the RUAIH Framework. For guidance on ongoing quality monitoring of AI tools, see Part 5: Quality Monitoring for Healthcare AI.
Harness.health provides AI governance infrastructure purpose-built for regional health systems preparing for the Joint Commission’s 2026 voluntary AI certification program. Learn how we can help.