Language-Centric Evaluation of LLMs in the Persian Medical Domain
PersianMedQA is a large-scale, expert-validated multiple-choice question answering dataset covering 23 medical specialties, collected from 14 years of Iranian residency and pre-residency board examinations.
This dataset enables comprehensive evaluation of large language models in medical reasoning tasks, with questions available in both Persian (original) and high-quality English translations. Our work represents the first comprehensive Persian medical QA benchmark, bridging the gap between multilingual AI and specialized medical knowledge.
Question (Persian): بیمار ۴۸ سالهای با درد قفسه سینه حاد و تغییرات ECG مشخص کننده STEMI مراجعه کرده است. مؤثرترین اقدام درمانی کدام است؟
English translation: A 48-year-old patient presents with acute chest pain and ECG changes indicative of STEMI. Which is the most effective therapeutic measure?
Options:
Metadata: Specialty: Cardiology | Clinical: Yes | Patient: 48-year-old male
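To make the sample above concrete, here is a minimal sketch of how one record might be represented; the field names (`question_fa`, `question_en`, `options`, `answer`, etc.) and placeholder options are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical PersianMedQA record; field names and option texts are
# illustrative placeholders, not the released schema.
record = {
    "question_fa": "بیمار ۴۸ سالهای با درد قفسه سینه حاد ...",  # original Persian
    "question_en": "A 48-year-old patient presents with acute chest pain "
                   "and ECG changes indicative of STEMI. Which is the most "
                   "effective therapeutic measure?",
    "options": ["Option A", "Option B", "Option C", "Option D"],  # 4 choices
    "answer": 0,                # index of the correct option (placeholder)
    "specialty": "Cardiology",
    "is_clinical": True,
    "patient": "48-year-old male",
}
print(record["specialty"])
```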
We evaluated several state-of-the-art language models on PersianMedQA. Accuracy on the English translations generally matches or exceeds accuracy on the original Persian, and the gap widens for smaller models:
| Model | Accuracy (Persian) | Accuracy (English) | Language Gap (English − Persian) |
|---|---|---|---|
| GPT-4.1 | 83.1% | 83.3% | +0.2% |
| Gemini 2.5 Flash | 82.4% | 83.7% | +1.3% |
| Llama 3.1-405B-Instruct | 69.3% | 75.8% | +6.5% |
| Meditron3-8B | 39.7% | 51.6% | +11.9% |
| Dorna2-Llama3-8B | 36.0% | 53.1% | +17.1% |
See our paper for comprehensive analysis including chain-of-thought experiments, ensembling strategies, and selective-answering performance.
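The scoring behind the table above can be sketched in a few lines: per-language accuracy over answer indices, with the language gap as the English minus Persian difference. The prediction lists below are toy placeholders, not real model outputs.

```python
# Minimal sketch of computing per-language accuracy and the "language gap"
# reported in the table. All predictions here are toy placeholder values.
def accuracy(preds, gold):
    """Fraction of predicted answer indices that match the gold indices."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

gold = [0, 2, 1, 3, 0]
preds_fa = [0, 2, 3, 3, 1]   # hypothetical answers on Persian questions
preds_en = [0, 2, 1, 3, 1]   # hypothetical answers on English translations

acc_fa = accuracy(preds_fa, gold)   # 3/5 correct
acc_en = accuracy(preds_en, gold)   # 4/5 correct
gap = acc_en - acc_fa               # positive gap = English advantage
print(f"fa={acc_fa:.1%} en={acc_en:.1%} gap={gap:+.1%}")
```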
The dataset will be made available through multiple channels to ensure maximum accessibility for the research community.
The paper is under review. Meanwhile, here is the arXiv version:
Ranjbar Kalahroodi, M. J., Sheikholselami, A., Karimi, S., Ranjbar Kalahroodi, S., Faili, H., & Shakery, A. (2025).
PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain.
arXiv preprint arXiv:2506.00250.
https://arxiv.org/abs/2506.00250