PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark
PersianMedQA is a large-scale, expert-validated multiple-choice question answering dataset covering 23 medical specialties, collected from 14 years of Iranian residency and pre-residency board examinations.
Every question is available in the original Persian and in a high-quality English translation, enabling systematic evaluation of large language models on medical reasoning in both languages. Our work represents the first comprehensive Persian medical QA benchmark, bridging the gap between multilingual AI and specialized medical knowledge.
Question (Persian): بیمار ۴۸ ساله‌ای با درد قفسه سینه حاد و تغییرات ECG مشخص‌کننده STEMI مراجعه کرده است. مؤثرترین اقدام درمانی کدام است؟
Question (English translation): A 48-year-old patient presents with acute chest pain and ECG changes indicative of STEMI. Which is the most effective therapeutic intervention?
Options:
Metadata: Specialty: Cardiology | Clinical: Yes | Patient: 48-year-old male
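For illustration, a single item might be represented as the Python dictionary below. The field names (`question_fa`, `question_en`, `options`, `answer`, `specialty`, `is_clinical`, `patient`) and the placeholder answer choices are assumptions made for this sketch, not the official schema of the released files.

```python
# Illustrative layout of one PersianMedQA item.
# NOTE: field names and the placeholder answer choices are assumptions for this
# sketch, not the official schema of the released dataset.
sample_record = {
    "question_fa": "بیمار ۴۸ ساله‌ای با درد قفسه سینه حاد و تغییرات ECG مشخص‌کننده STEMI مراجعه کرده است. مؤثرترین اقدام درمانی کدام است؟",
    "question_en": "A 48-year-old patient presents with acute chest pain and ECG changes indicative of STEMI. Which is the most effective therapeutic intervention?",
    "options": ["Option A", "Option B", "Option C", "Option D"],  # placeholder choices
    "answer": "A",                      # gold answer label (placeholder)
    "specialty": "Cardiology",
    "is_clinical": True,
    "patient": {"age": 48, "sex": "male"},
}
```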
We evaluated a range of state-of-the-art language models on PersianMedQA. A clear pattern emerges: frontier models perform almost identically in both languages, while smaller general-purpose and medical models lose substantially more accuracy on Persian:
Model | Accuracy (Persian) | Accuracy (English) | Language Gap (English − Persian) |
---|---|---|---|
GPT-4.1 | 83.1% | 83.3% | +0.2% |
Gemini 2.5 Flash | 82.4% | 83.7% | +1.3% |
Llama 3.1-405B-Instruct | 69.3% | 75.8% | +6.5% |
Meditron3-8B | 39.7% | 51.6% | +11.9% |
Dorna2-Llama3-8B | 36.0% | 53.1% | +17.1% |
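As a rough sketch of how these numbers can be reproduced (this is not the authors' evaluation harness), per-language accuracy and the language gap reduce to simple counting over model predictions. The `ask_model` function and the record fields below are placeholders for an actual LLM call and the real dataset schema.

```python
# Minimal evaluation sketch: per-language accuracy and the Persian-English gap.
# NOTE: `ask_model`, the prompt handling, and the record fields are assumptions,
# not the paper's actual evaluation code.
def ask_model(question: str, options: list[str]) -> str:
    """Return the model's chosen option label (e.g. 'A') for one question."""
    raise NotImplementedError  # plug in an API call to the model under test

def accuracy(items: list[dict], lang: str) -> float:
    """Fraction of items answered correctly in the given language ('fa' or 'en')."""
    correct = sum(
        int(ask_model(item[f"question_{lang}"], item["options"]) == item["answer"])
        for item in items
    )
    return correct / len(items)

def language_gap(items: list[dict]) -> float:
    """The gap reported in the table: English accuracy minus Persian accuracy."""
    return accuracy(items, "en") - accuracy(items, "fa")
```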
See our paper for comprehensive analysis including chain-of-thought experiments, ensembling strategies, and selective-answering performance.
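To give a flavour of one such strategy, the snippet below sketches simple majority-vote ensembling over the answer letters produced by several models. This is a generic illustration of the idea, not the exact ensembling scheme used in the paper.

```python
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Return the most frequent answer label among several models' predictions."""
    # Ties are broken by first occurrence, a simplification for this sketch.
    return Counter(predictions).most_common(1)[0][0]

# Example: three models answer the same question with 'A', 'C', 'A'.
print(majority_vote(["A", "C", "A"]))  # -> A
```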
The dataset will be made available through multiple channels to ensure broad accessibility for the research community.
The paper is under review. Meanwhile, the preprint is available on arXiv:
Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, & Azadeh Shakery.
PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark. arXiv preprint arXiv:2506.00250, 2025.
@misc{ranjbar2025persianmedqa,
title={{PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark}},
author={Mohammad Javad Ranjbar Kalahroodi and Amirhossein Sheikholselami and Sepehr Karimi and Sepideh Ranjbar Kalahroodi and Heshaam Faili and Azadeh Shakery},
year={2025},
eprint={2506.00250},
archivePrefix={arXiv},
primaryClass={cs.CL}
}