Deep learning models that predict cancer biomarkers from routine histology images are frequently confounded by other molecular and clinical factors, limiting their reliability as substitutes for genomic testing.
In a large, multi-cohort analysis of 8,221 patients across breast, colorectal, lung, and endometrial cancers, researchers examined whether artificial intelligence (AI) systems can accurately infer gene mutations and biomarker status directly from hematoxylin and eosin–stained whole slide images. Although several models achieved high overall accuracy, deeper analysis revealed that performance often depended on related biomarkers, tumor grade, or overall mutation burden rather than the target biomarker itself.
The study, published in Nature Biomedical Engineering, aimed to distinguish an “ideal” model – one that predicts a biomarker based only on its biological effects in tissue – from a confounded model that also relies on unrelated variables such as grade or other mutations. In practice, many biomarkers tend to occur together or exclude one another, so a model can appear to predict one biomarker by detecting another. Heatmaps of biomarker pairs showed strong patterns of co-occurrence and mutual exclusivity among genes across datasets.
When researchers tested models within subgroups defined by these related factors, predictive accuracy often dropped substantially. For example, models predicting microsatellite instability in colorectal cancer showed lower performance when cases were stratified by other linked molecular features. Similar declines were seen when patients were stratified by histologic grade or tumor mutational burden.
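The stratified check described here can be illustrated with a minimal sketch. It assumes binary biomarker labels, one model score per patient, and a single categorical confounder. The function names and the eight-patient toy data are invented for illustration and are not taken from the study, and AUROC is used as a stand-in metric because this summary does not say which metric the authors reported.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auroc(labels, scores, strata):
    """Compute AUROC separately inside each level of a potential confounder."""
    groups = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, strata):
        groups[g][0].append(y)
        groups[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in groups.items()}


# Hypothetical toy data: stratum "A" is enriched for the target biomarker,
# stratum "B" is depleted, and the scores follow the stratum, not the label.
strata = ["A"] * 4 + ["B"] * 4
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.8, 0.8, 0.2, 0.3, 0.1, 0.2]

print(auroc(labels, scores))                     # 0.75 on the pooled cohort
print(stratified_auroc(labels, scores, strata))  # {'A': 0.5, 'B': 0.5}
```

In this toy cohort the pooled AUROC looks respectable, yet it falls to chance level inside each stratum, which is the pattern the stratified analyses are designed to expose.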
Importantly, simple models using only pathologist-assigned grade sometimes approached the performance of complex AI systems. This finding suggests that image-based algorithms may be capturing grade-associated morphology rather than biomarker-specific patterns.
Co-author Kim Branson, SVP Global Head of Artificial Intelligence and Machine Learning, said, “We've found that predicting a BRAF mutation by looking at correlated features like microsatellite instability is often like predicting rain by looking at umbrellas – it works, but it doesn't mean you understand meteorology.”
“Crucially, if a model cannot demonstrate information gain above a simple pathologist-assigned grade,” Branson continued, “we haven't advanced the field; we've just automated a shortcut. The roadmap for the next generation of pathology AI isn't necessarily bigger models; it's stricter evaluation protocols that force algorithms to stop cheating and learn the hard biology.”
The researchers conclude that, while AI tools can screen large image datasets rapidly and may assist with triage, current approaches are not yet robust enough to replace molecular assays. Aggregate accuracy metrics can overstate clinical utility unless results are also examined within clinically relevant subgroups.
The authors call for bias-aware validation strategies, including stratified analyses and permutation testing, before deploying such systems in routine practice. Until models can disentangle true biological signals from correlated features, confirmatory molecular testing remains essential for treatment decisions.
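One way to combine stratification with permutation testing is sketched below. This is an illustrative assumption, not the authors' protocol: labels are shuffled only within each confounder stratum, so the null distribution keeps the confounder-label association intact and asks whether the model carries any signal beyond it. The function names, the pooled within-stratum AUROC statistic, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def pooled_within_stratum_auroc(labels, scores, strata):
    """Concordance over positive/negative pairs that share a stratum."""
    by_stratum = defaultdict(list)
    for y, s, g in zip(labels, scores, strata):
        by_stratum[g].append((y, s))
    wins = pairs = 0.0
    for members in by_stratum.values():
        pos = [s for y, s in members if y == 1]
        neg = [s for y, s in members if y == 0]
        wins += sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        pairs += len(pos) * len(neg)
    return wins / pairs if pairs else None


def stratified_permutation_pvalue(labels, scores, strata, n_perm=1000, seed=0):
    """One-sided p-value from shuffling labels within each stratum."""
    observed = pooled_within_stratum_auroc(labels, scores, strata)
    if observed is None:
        raise ValueError("no stratum contains both classes")
    rng = random.Random(seed)
    index_by_stratum = defaultdict(list)
    for i, g in enumerate(strata):
        index_by_stratum[g].append(i)
    hits = 0
    for _ in range(n_perm):
        shuffled = list(labels)
        for idx in index_by_stratum.values():
            values = [labels[i] for i in idx]
            rng.shuffle(values)  # permute labels inside this stratum only
            for i, v in zip(idx, values):
                shuffled[i] = v
        if pooled_within_stratum_auroc(shuffled, scores, strata) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data in which scores follow the stratum, not the label.
strata = ["A"] * 4 + ["B"] * 4
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.8, 0.8, 0.2, 0.3, 0.1, 0.2]

observed, p_value = stratified_permutation_pvalue(labels, scores, strata)
print(observed, p_value)
```

For this confounded toy example the within-stratum AUROC is 0.5 and the p-value is well above conventional significance thresholds, meaning the scores show no evidence of signal once the confounder is held fixed. A real analysis would need far larger strata and a pre-specified choice of confounders.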
