Publication: Enhancing the CAD-RADS™ 2.0 category assignment performance of ChatGPT and DeepSeek through "few-shot" prompting
| dc.contributor.author | Kaya, Hasan Emin | |
| dc.contributor.buuauthor | KAYA, HASAN EMİN | |
| dc.contributor.department | Faculty of Medicine | |
| dc.contributor.department | Department of Radiology | |
| dc.date.accessioned | 2025-12-11T10:14:24Z | |
| dc.date.issued | 2025-09-23 | |
| dc.description.abstract | Objective: To assess whether few-shot prompting improves the performance of two popular large language models (LLMs), ChatGPT o1 and DeepSeek-R1, in assigning Coronary Artery Disease Reporting and Data System (CAD-RADS™ 2.0) categories. Methods: A detailed few-shot prompt based on the CAD-RADS™ 2.0 framework was developed using 20 reports from the MIMIC-IV database. Subsequently, 100 modified reports from the same database were categorized using zero-shot and few-shot prompts through the models' user interfaces. Model accuracy was evaluated by comparing assignments to a reference radiologist's classifications, including stenosis categories and modifiers. To assess reproducibility, 50 reports were reclassified using the same few-shot prompt. McNemar tests and Cohen's kappa were used for statistical analysis. Results: With zero-shot prompting, accuracy was low for both models (ChatGPT: 14%, DeepSeek: 8%), and correct assignments occurred almost exclusively in CAD-RADS 0 cases. Hallucinations were frequent (ChatGPT: 19%, DeepSeek: 54%). Few-shot prompting significantly improved accuracy to 98% for ChatGPT and 93% for DeepSeek (both P<0.001) and eliminated hallucinations. Kappa values for agreement between model-generated and radiologist-assigned classifications were 0.979 (0.950, 1.000) (P<0.001) for ChatGPT and 0.916 (0.859, 0.973) (P<0.001) for DeepSeek, indicating almost perfect agreement for both models with no significant difference between the models (P=0.180). Reproducibility analysis yielded kappa values of 0.957 (0.900, 1.000) (P<0.001) for ChatGPT and 0.873 (0.779, 0.967) (P<0.001) for DeepSeek, indicating almost perfect and strong agreement between repeated assignments, respectively, with no significant difference between the models (P=0.125). Conclusion: Few-shot prompting substantially enhances LLMs' accuracy in assigning CAD-RADS™ 2.0 categories, suggesting potential for clinical application and facilitating system adoption. | |
| dc.identifier.doi | 10.1097/RCT.0000000000001802 | |
| dc.identifier.pubmed | 41004838 | |
| dc.identifier.uri | https://hdl.handle.net/11452/57272 | |
| dc.language.iso | en | |
| dc.publisher | Wolters Kluwer | |
| dc.relation.journal | Journal of Computer Assisted Tomography | |
| dc.subject | CAD-RADS 2.0 | |
| dc.subject | Coronary CT angiography | |
| dc.subject | Large language models | |
| dc.title | Enhancing the CAD-RADS™ 2.0 category assignment performance of ChatGPT and DeepSeek through "few-shot" prompting | |
| dc.type | Article | |
| dspace.entity.type | Publication | |
| local.contributor.department | Faculty of Medicine/Department of Radiology | |
| local.indexed.at | PubMed | |
| relation.isAuthorOfPublication | 820ae5d8-78dc-4cbe-84ad-3afa735304d2 | |
| relation.isAuthorOfPublication.latestForDiscovery | 820ae5d8-78dc-4cbe-84ad-3afa735304d2 |
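
The abstract above reports agreement using Cohen's kappa and compares zero-shot versus few-shot accuracy with McNemar tests. The publication itself does not include code; the snippet below is a minimal, hypothetical Python sketch of how such paired-agreement statistics are commonly computed. All labels and correctness vectors in it are illustrative placeholders, not data from the study.

```python
# Hypothetical sketch (not the authors' code) of the statistics named in the
# abstract: Cohen's kappa between model-assigned and radiologist-assigned
# CAD-RADS categories, and McNemar's test comparing zero-shot vs. few-shot
# accuracy on the same set of reports.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

# Illustrative category assignments (the study used 100 reports).
radiologist   = ["0", "1", "2", "3", "4A", "0", "2", "5"]
model_fewshot = ["0", "1", "2", "3", "4A", "0", "2", "4B"]

kappa = cohen_kappa_score(radiologist, model_fewshot)
print(f"Cohen's kappa (few-shot vs. radiologist): {kappa:.3f}")

# McNemar's test uses paired correct/incorrect outcomes per report:
# rows = zero-shot outcome, columns = few-shot outcome.
zero_shot_correct = np.array([0, 1, 0, 0, 0, 1, 0, 0], dtype=bool)
few_shot_correct  = np.array([1, 1, 1, 1, 1, 1, 1, 0], dtype=bool)
table = np.array([
    [np.sum(zero_shot_correct & few_shot_correct),
     np.sum(zero_shot_correct & ~few_shot_correct)],
    [np.sum(~zero_shot_correct & few_shot_correct),
     np.sum(~zero_shot_correct & ~few_shot_correct)],
])
result = mcnemar(table, exact=True)  # exact binomial test, suited to small discordant counts
print(f"McNemar P-value (zero-shot vs. few-shot accuracy): {result.pvalue:.3f}")
```

The exact (binomial) form of McNemar's test is shown because discordant-pair counts can be small at this sample size; with larger counts the chi-square approximation (`exact=False`) is also common.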