Comparative performance evaluation of multimodal large language models, radiologist, and anatomist in visual neuroanatomy questions

Güneş, Yasin Celal; Ülkir, Mehmet

dc.contributor.author	Güneş, Yasin Celal
dc.contributor.author	Ülkir, Mehmet
dc.date.accessioned	2025-02-25T10:28:44Z
dc.date.available	2025-02-25T10:28:44Z
dc.date.issued	2025-01-02
dc.description.abstract	This study examined the performance of four multimodal Large Language Models (LLMs), namely GPT4-V, GPT-4o, LLaVA, and Gemini 1.5 Flash, on multiple-choice visual neuroanatomy questions and compared them with a radiologist and an anatomist. The study used a cross-sectional design and evaluated responses to 100 visual questions sourced from the Radiopaedia website. Response accuracy was compared using the McNemar test. The radiologist achieved the highest accuracy at 90%, while the anatomist achieved 67%. Among the multimodal LLMs, GPT-4o performed best with an accuracy of 45%, followed by Gemini 1.5 Flash at 35%, GPT4-V at 22%, and LLaVA at 15%. The radiologist significantly outperformed both the anatomist and all multimodal LLMs (p<0.001). GPT-4o significantly outperformed GPT4-V and LLaVA (p<0.001), but no significant difference was found between GPT-4o and Gemini 1.5 Flash (p=0.123). Gemini 1.5 Flash, in turn, significantly outperformed both LLaVA (p<0.001) and GPT4-V (p=0.004). These results show a substantial performance gap between multimodal LLMs and medical professionals. While multimodal LLMs hold great potential in the medical field, they have not yet reached the accuracy of medical experts in identifying neuroanatomical regions.
dc.identifier.doi	10.32708/uutfd.1568479
dc.identifier.endpage	556
dc.identifier.issue	3
dc.identifier.startpage	551
dc.identifier.uri	https://doi.org/10.32708/uutfd.1568479
dc.identifier.uri	https://dergipark.org.tr/tr/pub/uutfd/issue/89968/1568479
dc.identifier.uri	https://dergipark.org.tr/tr/download/article-file/4293044
dc.identifier.uri	https://hdl.handle.net/11452/50612
dc.identifier.volume	50
dc.language.iso	en
dc.publisher	Bursa Uludağ Üniversitesi
dc.relation.journal	Uludağ Üniversitesi Tıp Fakültesi Dergisi
dc.relation.publicationcategory	Article - International Peer-Reviewed Journal
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	Neuroanatomy
dc.subject	Large language models
dc.subject	GPT-4o
dc.subject	Gemini 1.5 Flash
dc.title	Comparative performance evaluation of multimodal large language models, radiologist, and anatomist in visual neuroanatomy questions
dc.title.alternative	Çok modlu büyük dil modelleri, bir radyolog ve bir anatomistin görsel nöroanatomi sorularındaki karşılaştırmalı performans değerlendirmesi
dc.type	Article
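
The abstract reports pairwise accuracy comparisons using the McNemar test, which looks only at the questions where two respondents disagree. The following is a minimal sketch of the exact (binomial) form of that test, not the authors' analysis: the record does not say which variant or software was used, and the per-question answers are not included here, so the function names and the ten-question answer vectors below are hypothetical and for illustration only.

```python
from math import comb


def discordant_counts(first, second):
    """Count questions where exactly one of the two respondents was correct.

    `first` and `second` are equal-length sequences of 0/1 correctness flags
    for the same questions, in the same order.
    """
    if len(first) != len(second):
        raise ValueError("both respondents must answer the same questions")
    b = sum(1 for x, y in zip(first, second) if x and not y)  # first right, second wrong
    c = sum(1 for x, y in zip(first, second) if y and not x)  # second right, first wrong
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis, each discordant question is equally likely
    to favour either respondent, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


# Hypothetical toy data (NOT from the study): 10 questions, 1 = correct.
respondent_a = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
respondent_b = [0, 1, 0, 0, 0, 1, 0, 1, 0, 0]

b, c = discordant_counts(respondent_a, respondent_b)
p_value = mcnemar_exact(b, c)
print(f"discordant pairs: b={b}, c={c}, exact p={p_value}")
```

Because the test depends on the discordant pairs rather than on overall accuracy, the reported percentages alone (for example 45% versus 35%) are not enough to recompute the published p-values; the paired per-question results would be needed.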

Files

Name: 50_3_26.pdf
Size: 556.69 KB
Format: Adobe Portable Document Format

Collections

2024 Volume 50 Issue 3