Researchers at the University of Hong Kong have developed a named entity (NE) framework using large language models (LLMs) to extract cancer staging and risk classification data from semi-structured clinical notes of patients with thyroid cancer. In validation tests, the model achieved over 90 percent accuracy.
The framework was validated using 289 pathology reports from the Cancer Genome Atlas Thyroid Cancer (TCGA-THCA) database and 35 simulated clinical cases. It focused on classifying thyroid cancer in accordance with the American Joint Committee on Cancer (AJCC) 8th edition staging system and the American Thyroid Association (ATA) risk categories.
By combining outputs from multiple lightweight LLMs (including Mistral-7B and Llama-3), the system achieved F1-scores of 95 percent for ATA risk and 98 percent for AJCC staging on the real reports, and up to 93 percent on the simulated cases.
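For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a single class from gold and predicted labels. The function name and the sample stage labels are made up for illustration, and the article does not say how the researchers averaged scores across classes, so this is not a description of their evaluation code.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted AJCC stages for four reports.
gold = ["I", "I", "II", "III"]
pred = ["I", "II", "II", "III"]
print(round(f1_for_label(gold, pred, "I"), 3))
```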
The model was trained on 50 annotated pathology reports, where human annotators marked critical clinical details such as tumor size, lymph node spread, extrathyroidal extension, and vascular invasion, all necessary for proper staging and risk classification.
It used a variety of prompting methods (zero-shot, chain-of-thought, few-shot) and tools such as LangChain and Excel templates to process and organize the extracted data. A majority-vote ensemble approach helped improve consistency across different model outputs, and most models achieved 100 percent F1-scores for detecting metastatic disease (M stage).
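A majority-vote ensemble of this kind can be sketched in a few lines. The example below is a minimal illustration, not the researchers' implementation: the function name, the strict-majority rule, and the `NEEDS_REVIEW` fallback for ties are all assumptions introduced here.

```python
from collections import Counter


def majority_vote(predictions, fallback="NEEDS_REVIEW"):
    """Return the label chosen by a strict majority of models.

    If no label wins more than half of the votes (or there are no
    votes), return the fallback so the case can be flagged for a human.
    """
    if not predictions:
        return fallback
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else fallback


# Hypothetical ATA risk outputs from three models for one report.
votes = ["intermediate", "intermediate", "high"]
print(majority_vote(votes))  # intermediate
```

Routing ties to a fallback rather than picking a label arbitrarily is one possible design choice, made here so that disagreements between models surface for manual checking.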
Despite strong performance, challenges remained – particularly when reports lacked clarity on extrathyroidal extension, lymph node involvement, or aggressive cancer variants. Errors also arose from confusing documentation on the number of lymph nodes involved versus those removed. These issues affected the accuracy of both AJCC staging and ATA risk scoring.
Although the tool performed well, the researchers emphasized that human verification is still needed due to occasional errors and variability in model outputs. The framework was designed to be generalizable across both standard TCGA pathology reports and custom clinical notes mimicking local documentation styles. Limitations included a relatively small number of advanced-stage cases, no access to surgery or imaging reports in the TCGA dataset, and reliance on outdated AJCC and ATA guidelines, which may need updating for future use.
The researchers believe the tool can be adapted to other cancers with minimal changes. They have made their annotated data and model code publicly available to support future work in clinical natural language processing and automated diagnostics.
“In line with the government's strong advocacy of AI adoption in healthcare, as exemplified by the recent launch of LLM-based medical report writing system in the Hospital Authority, our next step is to evaluate the performance of this AI assistant with a large amount of real-world patient data,” explained corresponding author Carlos Wong in the press release. “Once validated, the AI model can be readily deployed in real clinical settings and hospitals to help clinicians improve operational and treatment efficiency.”