Publication:
Enhancing the CAD-RADS™ 2.0 category assignment performance of ChatGPT and DeepSeek through "few-shot" prompting

dc.contributor.author: Kaya, Hasan Emin
dc.contributor.buuauthor: KAYA, HASAN EMİN
dc.contributor.department: Faculty of Medicine
dc.contributor.department: Department of Radiology
dc.date.accessioned: 2025-12-11T10:14:24Z
dc.date.issued: 2025-09-23
dc.description.abstract:
Objective: To assess whether few-shot prompting improves the performance of two popular large language models (LLMs), ChatGPT o1 and DeepSeek-R1, in assigning Coronary Artery Disease Reporting and Data System (CAD-RADS™ 2.0) categories.
Methods: A detailed few-shot prompt based on the CAD-RADS™ 2.0 framework was developed using 20 reports from the MIMIC-IV database. Subsequently, 100 modified reports from the same database were categorized using zero-shot and few-shot prompts through the models' user interfaces. Model accuracy was evaluated by comparing assignments with a reference radiologist's classifications, including stenosis categories and modifiers. To assess reproducibility, 50 reports were reclassified using the same few-shot prompt. McNemar tests and Cohen kappa were used for statistical analysis.
Results: With zero-shot prompting, accuracy was low for both models (ChatGPT: 14%, DeepSeek: 8%), and correct assignments occurred almost exclusively in CAD-RADS 0 cases. Hallucinations were frequent (ChatGPT: 19%, DeepSeek: 54%). Few-shot prompting significantly improved accuracy to 98% for ChatGPT and 93% for DeepSeek (both P<0.001) and eliminated hallucinations. Kappa values for agreement between model-generated and radiologist-assigned classifications were 0.979 (0.950, 1.000) for ChatGPT and 0.916 (0.859, 0.973) for DeepSeek (both P<0.001), indicating almost perfect agreement for both models with no significant difference between them (P=0.180). Reproducibility analysis yielded kappa values of 0.957 (0.900, 1.000) for ChatGPT and 0.873 (0.779, 0.967) for DeepSeek (both P<0.001), indicating almost perfect and strong agreement between repeated assignments, respectively, again with no significant difference between the models (P=0.125).
Conclusion: Few-shot prompting substantially enhances LLMs' accuracy in assigning CAD-RADS™ 2.0 categories, suggesting potential for clinical application and facilitating adoption of the system.
dc.identifier.doi: 10.1097/RCT.0000000000001802
dc.identifier.pubmed: 41004838
dc.identifier.uri: https://hdl.handle.net/11452/57272
dc.language.iso: en
dc.publisher: Wolters Kluwer
dc.relation.journal: Journal of Computer Assisted Tomography
dc.subject: CAD-RADS 2.0
dc.subject: Coronary CT angiography
dc.subject: Large language models
dc.title: Enhancing the CAD-RADS™ 2.0 category assignment performance of ChatGPT and DeepSeek through "few-shot" prompting
dc.type: Article
dspace.entity.type: Publication
local.contributor.department: Faculty of Medicine/Department of Radiology
local.indexed.at: PubMed
relation.isAuthorOfPublication: 820ae5d8-78dc-4cbe-84ad-3afa735304d2
relation.isAuthorOfPublication.latestForDiscovery: 820ae5d8-78dc-4cbe-84ad-3afa735304d2
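
As context for the abstract above, the few-shot prompting approach it describes can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the example reports, category labels, prompt wording, and the build_few_shot_prompt helper are placeholders, not the study's actual 20-example prompt, which is not reproduced in this record.

# A minimal sketch of few-shot prompt assembly, using hypothetical examples.
FEW_SHOT_EXAMPLES = [
    # (report excerpt, CAD-RADS 2.0 assignment) -- illustrative pairs only
    ("Coronary arteries are normal with no plaque or stenosis.", "CAD-RADS 0"),
    ("Mixed plaque in the proximal LAD with 60% stenosis.", "CAD-RADS 3"),
    ("Calcified RCA plaque with 30% stenosis; Agatston score 450.", "CAD-RADS 2/P3"),
]

def build_few_shot_prompt(report_text: str) -> str:
    """Assemble task instructions, worked examples, then the new report."""
    lines = [
        "Assign a CAD-RADS 2.0 category to the following coronary CTA report.",
        "Return the stenosis category and any applicable modifiers.",
        "",
        "Examples:",
    ]
    for example_report, label in FEW_SHOT_EXAMPLES:
        lines += [f"Report: {example_report}", f"Assignment: {label}", ""]
    lines += [f"Report: {report_text}", "Assignment:"]
    return "\n".join(lines)

A zero-shot prompt would simply omit the Examples block; per the abstract, adding a detailed prompt with worked examples of this kind is what raised accuracy from 14%/8% to 98%/93%.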
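The statistical analysis reported in the abstract (McNemar tests on the paired zero-shot versus few-shot accuracies, Cohen kappa for agreement with the radiologist) can likewise be sketched, assuming the model and radiologist assignments are available as parallel Python lists; the evaluate helper and variable names are illustrative, not the study's code.

from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

def evaluate(zero_shot, few_shot, reference):
    """McNemar test on paired correctness; kappa for few-shot agreement."""
    zs_correct = [m == r for m, r in zip(zero_shot, reference)]
    fs_correct = [m == r for m, r in zip(few_shot, reference)]
    # 2x2 table of paired outcomes: rows = zero-shot, columns = few-shot
    table = [[0, 0], [0, 0]]
    for z, f in zip(zs_correct, fs_correct):
        table[int(not z)][int(not f)] += 1
    mcnemar_p = mcnemar(table, exact=True).pvalue  # exact test on discordant pairs
    kappa = cohen_kappa_score(reference, few_shot)
    return kappa, mcnemar_p

The same kappa computation, applied to two runs of the few-shot prompt on the 50 reclassified reports, would correspond to the reproducibility analysis the abstract describes.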

Files

Original bundle

Name: Kaya_2025.pdf
Size: 121.8 KB
Format: Adobe Portable Document Format