
Who's Running QC on Artificial Intelligence?

Why we should stop treating AI as an information technology problem and start governing it like our other laboratory instruments

By Caitlin Raymond | 04/09/2026 | Opinion | 4 min read


Artificial intelligence is arriving in medicine faster than medicine is ready for it. Ambient documentation tools are transcribing clinical encounters. Natural language processing models are screening radiology reports for incidental findings. Clinical decision support algorithms are flagging sepsis, predicting readmissions, and recommending medication adjustments. In many health systems, these tools are already live, influencing care, and accumulating the kind of quiet institutional trust that accrues to anything that seems to be working.

What most health systems do not have is a coherent framework for governing these tools. Who validated the model before deployment? Who monitors its performance over time? Who is responsible when it drifts, degrades, or fails? These questions are surprisingly hard to answer, and the difficulty is not accidental. It reflects a fundamental misclassification of what AI tools actually are.

The wrong mental model

Most institutions are treating AI as an information technology problem. In this framing, deploying a clinical AI tool is roughly analogous to implementing a new module in the electronic medical record (EMR) – a software project managed by IT, governed by vendor contracts, updated in the background, and measured primarily by user adoption and operational efficiency. This framing is intuitive. AI tools are software. They live in the same infrastructure as the EMR. The procurement process looks similar.

But this classification is incorrect – and it has consequences.

An EMR is fundamentally a database. It stores and retrieves information. It does not make claims about the world. When it malfunctions, it typically does so in legible ways – data is missing, a field doesn't populate, a report won't generate. The failure is visible. More importantly, the EMR is not in the business of telling you what to do with a patient.

A clinical AI tool is doing something categorically different. It takes an input, processes it through a methodology, and returns a result intended to inform a clinical decision. It is making a claim about the world: this patient is at high risk, this finding is abnormal, this intervention is indicated. That is not a database function, but a diagnostic one. And we already have a well-developed framework for governing tools that perform diagnostic functions.

We call it the clinical laboratory.

A framework we already have

Consider what we already do with laboratory instruments. Before a new analyzer goes live, it undergoes rigorous analytical validation – we establish its precision, accuracy, reference intervals, and reportable range. We run it in parallel with existing methods. We define the conditions under which it performs reliably and the conditions under which it does not. Only after that process is complete does it report a result that touches a patient.
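What might that first step look like for a model? Below is a minimal sketch of a pre-deployment check, under stated assumptions: a local held-out cohort run in parallel with existing practice, a bootstrapped confidence interval for the model's AUROC, and a comparison against the vendor's claimed figure. The dataset, the vendor number, and the go/no-go rule are all hypothetical placeholders, not a validated protocol.

```python
# Minimal sketch: pre-deployment "analytical validation" of a risk model
# on a local held-out cohort. Data, threshold, and the vendor-claimed
# AUROC are hypothetical placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def bootstrap_auroc(y_true, y_score, n_boot=2000):
    """Bootstrap a 95% CI for AUROC on the local population."""
    aurocs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(set(y_true[idx])) < 2:   # a resample needs both classes
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aurocs, [2.5, 97.5])

# In practice, y_local / scores_local would come from running the model
# in parallel with existing methods on your own patients before go-live.
y_local = rng.integers(0, 2, 500)        # placeholder outcome labels
scores_local = rng.random(500)           # placeholder model outputs

lo, hi = bootstrap_auroc(y_local, scores_local)
VENDOR_CLAIMED_AUROC = 0.85              # hypothetical vendor figure
print(f"Local AUROC 95% CI: [{lo:.3f}, {hi:.3f}]")
if hi < VENDOR_CLAIMED_AUROC:
    print("Local performance below vendor claim -- do not go live yet.")
```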

And then we don't assume it keeps working. Every morning, before the first patient sample is run, we run controls. We define action limits, participate in proficiency testing, track performance over time, and investigate shifts before they become failures. If a reagent lot changes, we revalidate. If the instrument behaves unexpectedly, we take it offline until we understand why. There is an entire infrastructure of ongoing vigilance built into how we practice, because we understand that a tool that was accurate last month may not be accurate today.
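That machinery translates almost directly. As a hedged illustration, here is a Levey-Jennings-style check applied to a daily model performance metric, with the familiar 2SD warning and 3SD action limits; the baseline mean and SD would come from the validation period, and every number below is invented.

```python
# Minimal sketch: a Levey-Jennings-style daily check on a model metric,
# mirroring morning QC on an analyzer. Baseline values would come from
# the validation period; all numbers here are placeholders.

BASELINE_MEAN = 0.82   # e.g., mean daily AUROC during validation (hypothetical)
BASELINE_SD   = 0.015  # day-to-day SD during validation (hypothetical)

def qc_check(todays_metric: float) -> str:
    """Classify today's metric the way we classify a QC result."""
    z = (todays_metric - BASELINE_MEAN) / BASELINE_SD
    if abs(z) > 3:
        return "ACTION: take the model out of service and investigate"
    if abs(z) > 2:
        return "WARNING: review recent inputs and recheck tomorrow"
    return "in control"

print(qc_check(0.81))   # -> in control
print(qc_check(0.77))   # -> ACTION (more than 3 SD below baseline)
```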

Clinical AI tools have analogous failure modes and almost none of this infrastructure. A model trained on one patient population may perform poorly when deployed at a different institution with different demographics, different documentation practices, and different disease prevalence.

A model that performed well at validation may degrade silently as the underlying data distribution shifts, coding practices change, patient acuity fluctuates, and clinical workflows evolve. This is called model drift, and it is the AI equivalent of calibration drift in an analyzer. Unlike calibration drift, it rarely triggers an alarm.
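One way a lab could watch for this, borrowed from model-risk monitoring practice rather than from any specific vendor's tooling, is the Population Stability Index (PSI): compare today's input distribution against the validation-era distribution and flag large shifts before output quality visibly degrades. The sketch below uses simulated data, and the 0.1 / 0.25 cutoffs are conventional rules of thumb, not regulatory limits.

```python
# Minimal sketch: detecting input drift with the Population Stability
# Index (PSI), one common analogue of watching for calibration drift.
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, n_bins: int = 10) -> float:
    """PSI between the validation-era distribution and today's inputs."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    e_frac = np.clip(e_frac, 1e-6, None)         # avoid log(0)
    o_frac = np.clip(o_frac, 1e-6, None)
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 10_000)   # e.g., patient age at validation (simulated)
today = rng.normal(58, 12, 1_000)       # shifted population (simulated)

score = psi(baseline, today)
print(f"PSI = {score:.3f}")  # < 0.1 stable; 0.1-0.25 moderate; > 0.25 investigate
```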

Adapting QC models for AI failure modes

There is one important way in which AI fails differently than a laboratory instrument, and it is worth naming directly. An analyzer that returns an erroneous result returns a number. A number can be questioned, repeated, correlated with clinical context. An AI system can return a fluent, confident, well-structured wrong answer – one that sounds authoritative, is difficult to challenge, and may be more persuasive precisely because of its apparent coherence. This is a failure mode without a clean laboratory analogy, and it argues for heightened rather than relaxed oversight.

The governance questions that follow from taking the laboratory instrument framing seriously are the same questions we already know how to ask. Who performs analytical validation before deployment, and what does that process require? Who monitors ongoing performance, and on what schedule? What are the action thresholds that trigger review or removal from service? Who holds accountability when a tool underperforms – the vendor, the institution, the ordering clinician? What is the regulatory framework, and who enforces it?

Currently, the answers to most of these questions are unclear, inconsistent, or buried in vendor contracts that most clinicians will never read. The FDA has begun developing frameworks for AI as a medical device, but regulatory guidance is lagging well behind deployment. Most health systems have not established anything resembling a QC program for their clinical AI tools. The clinician at the end of the chain – the one making the decision influenced by the AI output – typically has no visibility into how the tool was validated, whether its performance has been monitored, or whether the result on their screen is reliable.

Who should be at the table?

This is not a sustainable situation, and it is one that pathologists and laboratory medicine professionals are uniquely positioned to address. We have spent decades building the intellectual and operational infrastructure for exactly this problem – validating diagnostic tools, monitoring their performance, maintaining accountability chains, and advocating for regulatory standards that protect patients. We understand, in a way that most clinicians and most IT departments do not, that a diagnostic tool is only as trustworthy as the oversight structure around it.

The argument is not that pathologists should govern all clinical AI. It is that the laboratory medicine framework – validation before deployment, ongoing QC, defined accountability, and regulatory oversight with teeth – should be the model for how medicine governs AI broadly. And the people who know that framework best are the ones who should be at the table when governance structures are designed.

AI is not going to slow down while medicine catches up. The tools are already deployed, and more are coming. The question is not whether to govern them, but whether governance will be intentional and rigorous or reactive and superficial. In the laboratory, we long ago decided that the stakes are too high for anything less than rigor. The same logic applies here.

We run QC on our analyzers every morning. It is time to ask who is running QC on the algorithm.


About the Author(s)

Caitlin Raymond

Caitlin Raymond is an Assistant Professor of Pathology and Laboratory Medicine at the University of Wisconsin – Madison.

