A recent study published in VIEW has compared the performance of large multimodal models (LMMs) in interpreting pulmonary CT scans for lung cancer diagnostics, offering insight into the potential roles and limitations of generative AI in radiologic workflows.
The research evaluated the diagnostic capabilities of several publicly available LMMs, including GPT-4V, LLaVA-1.5, Gemini, and HuggingGPT, alongside a fine-tuned version of BiomedCLIP. The models were tested on a dataset of 100 representative axial chest CT images drawn from the Lung Image Database Consortium (LIDC) collection. Each image was selected to reflect radiologic features commonly seen in lung nodules, such as spiculation, cavitation, and pleural retraction.
The study explored multiple question types: binary classification (benign vs malignant), nodule description, TNM staging, and NCCN risk categorization. Ground truth labels were derived from LIDC consensus annotations, with further validation from two board-certified radiologists.
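The paper does not reproduce its evaluation code, but the scoring logic for the binary task can be illustrated with a brief sketch: a fixed benign-versus-malignant prompt is sent to a multimodal model for each CT slice, the free-text answer is normalized, and percent agreement with the consensus label is computed. The `query_model` helper, the CSV layout, and the prompt wording below are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch only: scoring an LMM's benign/malignant answers against
# consensus labels. `query_model` is a hypothetical stand-in for whichever
# multimodal API (GPT-4V, Gemini, LLaVA-1.5, ...) is being evaluated.
import csv

PROMPT = (
    "You are reviewing a single axial chest CT slice containing a lung nodule. "
    "Answer with exactly one word: benign or malignant."
)

def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a multimodal model.

    A real implementation would submit the image and prompt and return the
    model's free-text answer; it is left unimplemented here on purpose.
    """
    raise NotImplementedError("Connect this to the LMM under evaluation.")

def normalize(answer: str) -> str:
    """Map a free-text answer onto one of the two expected labels."""
    answer = answer.strip().lower()
    if "malignant" in answer:
        return "malignant"
    if "benign" in answer:
        return "benign"
    return "unparseable"

def percent_agreement(cases_csv: str) -> float:
    """Compute percent agreement with consensus labels.

    Expects a CSV with columns: image_path, consensus_label
    (consensus_label being 'benign' or 'malignant').
    """
    correct, total = 0, 0
    with open(cases_csv, newline="") as f:
        for row in csv.DictReader(f):
            prediction = normalize(query_model(row["image_path"], PROMPT))
            correct += int(prediction == row["consensus_label"].strip().lower())
            total += 1
    return 100.0 * correct / total if total else 0.0

if __name__ == "__main__":
    print(f"Agreement with consensus: {percent_agreement('lidc_cases.csv'):.1f}%")
```

Percent agreement of this kind is the metric behind the binary classification figures reported below; the staging and risk-categorization tasks would require more elaborate answer parsing.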
In binary classification tasks, GPT-4V showed the highest agreement with expert consensus (75 percent), followed by Gemini and LLaVA-1.5. However, performance dropped notably on NCCN risk categorization, with accuracy for all models falling below 50 percent. When generating descriptive radiologic terms, LLaVA-1.5 produced the most consistent results, while BiomedCLIP, despite its domain-specific training, underperformed in free-text outputs.
Hallucinations and inconsistencies were common, particularly when models were asked to justify diagnostic decisions or assign a stage. Several models struggled to align their textual descriptions with the visual evidence, and the authors noted substantial inter-model variability depending on prompt design and image content.
Importantly, none of the models reached the diagnostic reliability of expert radiologists. The findings suggest that while current LMMs can mimic aspects of radiologic reasoning, their interpretive abilities are limited by a lack of medical grounding, inconsistent response generation, and insufficient alignment with clinical ontologies.
The study underscores the need for improved model training approaches, including integration of structured clinical data, domain-specific fine-tuning, and prompt engineering tailored to medical imaging tasks. The authors suggest that future versions of LMMs may benefit from closer collaboration with radiology professionals to develop more medically aligned benchmarks and safety constraints.
For laboratories and diagnostic teams exploring the integration of generative AI tools, the study provides a comparative reference point for assessing current LMM capabilities in medical imaging – highlighting areas where caution, validation, and human oversight remain essential.