Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark
Mohammad Javad Ranjbar Kalahroodi · Amirhossein Sheikholselami · Sepehr Karimi Arpanahi · Sepideh Ranjbar Kalahroodi · Heshaam Faili · Azadeh Shakery
University of Tehran · Shahid Beheshti University of Medical Sciences · IPM
PersianMedQA is a large-scale, expert-validated multiple-choice medical question answering dataset covering 23 specialties, collected from 14 years of Iranian national residency and pre-residency board examinations.
We benchmark 41 state-of-the-art LLMs — including general-purpose, Persian, and medical models — in zero-shot and chain-of-thought (CoT) settings in both Persian and English. Our cross-linguistic analysis reveals that 3–10% of questions can only be answered correctly in Persian due to cultural and clinical context lost in translation, demonstrating that translate-test evaluation is inadequate for medical AI in non-English settings.
The dataset, along with a comprehensive bilingual Persian medical dictionary, is publicly available to support reproducible research and further development of multilingual clinical NLP.
20,785 Expert-Validated Questions
23 Medical Specialties
14 Years of Exam Data
41 Evaluated LLMs
2 Languages (Persian, English)
83.09% Best Accuracy (GPT-4.1, Persian)
Dataset Features
Comprehensive Coverage: 20,785 questions spanning 23 medical specialties — from cardiology and surgery to pharmacology and medical statistics — sourced from 14 years of Iranian national residency exams (2011–2024).
Expert Validation: All questions underwent a rigorous three-level verification process by the National Center for Medical Education Assessment (Sanjesh), with additional review by a board-certified internal medicine specialist.
Clinical Focus: ~70% clinical case scenarios and 30% non-clinical questions, reflecting real-world high-stakes evaluation standards for Iranian medical graduates.
Rich Metadata: Specialty labels, clinical/non-clinical flags, patient demographics (age, gender), and domain classification for granular bias and performance analysis.
Bilingual Excellence: High-quality English translations validated by a medical expert, plus a comprehensive bilingual Persian medical dictionary (64,000+ terms across 10 categories) released alongside the dataset.
Contamination Controls: Temporal consistency analysis (2011–2024) and exact-match search confirm minimal data leakage into LLM training corpora.
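The page does not describe how the temporal consistency analysis was run. One plausible reading, offered here only as an assumption, is that per-year accuracy should stay roughly flat: if older exams had leaked into training corpora, a model would be expected to score noticeably higher on them than on recent ones. A minimal sketch of that check, using toy data and a hypothetical `accuracy_by_year` helper:

```python
from collections import defaultdict

def accuracy_by_year(records):
    """Return {year: accuracy} from (year, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, is_correct in records:
        totals[year] += 1
        hits[year] += int(is_correct)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Toy records, NOT results from the benchmark.
toy_records = [(2011, True), (2011, False), (2024, True), (2024, False)]
per_year = accuracy_by_year(toy_records)

# A small spread between the best and worst year is what "temporally
# consistent" would look like under this reading.
spread = max(per_year.values()) - min(per_year.values())
print(per_year, spread)
```

A large gap in favour of early years would be a warning sign under this interpretation; it would not by itself prove leakage, since exam difficulty can also change over time.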
Dataset Splits
| Split | Questions | Sampling |
| --- | --- | --- |
| Training | 14,549 | Stratified by year & specialty |
| Validation | 1,000 | Stratified by year & specialty |
| Test | 5,236 | Stratified by year & specialty |
| Total | 20,785 | - |
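Stratifying by year and specialty means each (year, specialty) group is divided in the same proportions, so every split keeps the same mix. The sketch below is one simple way to do this and is not the authors' script; `stratified_split`, the record fields, and the seed are hypothetical. The fractions are derived from the published split sizes.

```python
import random
from collections import defaultdict

def stratified_split(items, key, fractions, seed=0):
    """Split items into len(fractions) parts, preserving the mix of key(item)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    parts = [[] for _ in fractions]
    last = len(fractions) - 1
    for group in groups.values():
        rng.shuffle(group)
        start = 0
        for i, frac in enumerate(fractions):
            # The last part takes the remainder so that no item is dropped.
            end = len(group) if i == last else start + round(frac * len(group))
            parts[i].extend(group[start:end])
            start = end
    return parts

# Proportions implied by the published sizes (14,549 / 1,000 / 5,236 of 20,785).
TOTAL = 20785
fractions = (14549 / TOTAL, 1000 / TOTAL, 5236 / TOTAL)

# Toy records: 2 years x 2 specialties x 100 questions each.
toy = [
    {"id": f"{year}-{spec}-{i}", "year": year, "specialty": spec}
    for year in (2011, 2024)
    for spec in ("Cardiology", "Surgery")
    for i in range(100)
]
train, val, test = stratified_split(
    toy, key=lambda r: (r["year"], r["specialty"]), fractions=fractions
)
print(len(train), len(val), len(test))
```

Because rounding happens per group, very small groups can end up with no validation items; a real pipeline would need a rule for those cases.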
Example Questions
Clinical: A 48-year-old man presents with chest pain (4h), anterior ST-elevation, sweating, BP 90/60 mmHg, distended neck veins, and basal rales. Most effective treatment?
1. Fibrinolytic + emergency angioplasty if needed
2. Fibrinolytic only
3. Emergency angioplasty ← Correct Answer
4. Fibrinolytic + angioplasty after 48h
Specialty: Cardiology · Type: Clinical · Patient: 48-year-old male
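To make the metadata concrete, here is a sketch of how one record and its prompt might be represented in code. The class name, field names, and prompt wording are assumptions for illustration; the released dataset may use a different schema, and the paper's actual prompts are not reproduced on this page.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MCQRecord:
    # Hypothetical field names; check the released dataset for the real schema.
    question: str
    options: List[str]
    answer_index: int            # 0-based index into options
    specialty: str
    is_clinical: bool
    patient_age: Optional[int] = None
    patient_gender: Optional[str] = None

def build_prompt(rec: MCQRecord) -> str:
    """Format a record as a numbered multiple-choice prompt."""
    lines = [rec.question]
    lines += [f"{i}. {opt}" for i, opt in enumerate(rec.options, start=1)]
    lines.append("Answer with the number of the correct option.")
    return "\n".join(lines)

example = MCQRecord(
    question=(
        "A 48-year-old man presents with chest pain (4h), anterior ST-elevation, "
        "sweating, BP 90/60 mmHg, distended neck veins, and basal rales. "
        "Most effective treatment?"
    ),
    options=[
        "Fibrinolytic + emergency angioplasty if needed",
        "Fibrinolytic only",
        "Emergency angioplasty",
        "Fibrinolytic + angioplasty after 48h",
    ],
    answer_index=2,
    specialty="Cardiology",
    is_clinical=True,
    patient_age=48,
    patient_gender="male",
)
prompt = build_prompt(example)
print(prompt)
```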
Evaluation Results
Zero-shot evaluation of 41 models on the 5,236-question test set. All models were evaluated at temperature 0 with English prompts.
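Scoring a zero-shot run comes down to mapping each model output to an option and counting matches with the answer key. The sketch below assumes four numbered options and a deliberately simple extraction rule (the first standalone digit from 1 to 4); both `extract_choice` and that rule are assumptions, not the paper's parser.

```python
import re

def extract_choice(output, n_options=4):
    """Return the first standalone option number as a 0-based index, or None."""
    match = re.search(rf"\b([1-{n_options}])\b", output)
    return int(match.group(1)) - 1 if match else None

def accuracy(outputs, gold):
    """Percentage of outputs whose extracted choice equals the gold index."""
    preds = [extract_choice(o) for o in outputs]
    correct = sum(p == g for p, g in zip(preds, gold))
    return 100.0 * correct / len(gold)

# Toy outputs, NOT real model responses.
outputs = ["3", "Answer: 1", "The answer is 2.", "no idea"]
gold = [2, 0, 3, 1]
score = accuracy(outputs, gold)
print(score)
```

An output with no recognizable option counts as wrong here. With four options, uniform random guessing is expected to land at 25%.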
Top Model Performance
LLM accuracy on PersianMedQA test set
| Model | Type | Persian (%) | English (%) | Average (%) |
| --- | --- | --- | --- | --- |
| GPT-4.1 | Closed-weight | 83.09 | 80.71 | 81.90 |
| Gemini-2.5-Flash-Preview | Closed-weight | 82.37 | 79.09 | 80.73 |
| Claude-3.7-Sonnet | Closed-weight | 75.19 | 77.37 | 76.28 |
| GPT-4.1-Mini | Closed-weight | 74.76 | 77.10 | 75.93 |
| DeepSeek-Chat-V3 | Open-weight | 68.05 | 73.30 | 70.67 |
| LLaMA-3.1-405B-Instruct | Open-weight | 67.02 | 73.49 | 70.25 |
| Meditron3-8B | Medical | 38.67 | 50.00 | 44.34 |
| Dorna2-LLaMA-3.1-8B | Persian LLM | 34.87 | 51.24 | 43.06 |
| PersianMind-1.0 | Persian LLM | 24.22 | 25.17 | 24.69 |
Full results for all 41 models are available in the paper (Table 4). Persian LLMs (identified in the Type column) performed at or near random-guess level (25%). Human baseline: ~75% accuracy.
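The Average column appears to be the plain mean of the Persian and English scores. The sketch below re-derives it from the figures in the table and also lists the models that score higher in Persian. The 0.01 tolerance is an assumption to absorb rounding, since the published averages were presumably computed from unrounded scores.

```python
# (Persian %, English %, reported Average %) copied from the table above.
rows = {
    "GPT-4.1": (83.09, 80.71, 81.90),
    "Gemini-2.5-Flash-Preview": (82.37, 79.09, 80.73),
    "Claude-3.7-Sonnet": (75.19, 77.37, 76.28),
    "GPT-4.1-Mini": (74.76, 77.10, 75.93),
    "DeepSeek-Chat-V3": (68.05, 73.30, 70.67),
    "LLaMA-3.1-405B-Instruct": (67.02, 73.49, 70.25),
    "Meditron3-8B": (38.67, 50.00, 44.34),
    "Dorna2-LLaMA-3.1-8B": (34.87, 51.24, 43.06),
    "PersianMind-1.0": (24.22, 25.17, 24.69),
}

def mean_matches(persian, english, reported, tol=0.01):
    """True if the reported average equals the mean within a rounding tolerance."""
    return abs((persian + english) / 2 - reported) <= tol + 1e-9

all_consistent = all(mean_matches(*scores) for scores in rows.values())

# Models whose Persian score exceeds their English score.
persian_better = [name for name, (fa, en, _) in rows.items() if fa > en]
print(all_consistent, persian_better)
```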
Key Findings
Closed-weight models dominate: GPT-4.1 (83.09% Persian) and Gemini-2.5-Flash (82.37% Persian) far outperform all open-weight alternatives, which peak at ~68% in Persian.
Persian-first advantage: Some top models (GPT-4.1, Gemini-2.5-Flash) actually score higher on Persian than English, consistent with semantic drift introduced by machine translation.
Translation is insufficient: 3–10% of questions across models can only be answered correctly in Persian, due to Iran-specific clinical protocols, vaccination schedules, and antibiotic guidelines lost in translation.
Size ≠ performance: Medical and Persian LLMs consistently underperform smaller general-purpose models, demonstrating that scale must be paired with high-quality domain-relevant training data.
Ensembling helps: Five diverse open-weight models achieve 73.7% accuracy via majority vote, +3.3% over the best single open-weight model.
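Two of these findings reduce to simple computations over per-question predictions: the majority-vote ensemble, and the share of questions a model answers correctly only in Persian. The sketch below illustrates both on toy data with three models (the reported ensemble uses five). The function names and the tie-breaking rule are assumptions; the page does not say how ties were resolved.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of option indices per model; returns one vote per question."""
    voted = []
    for votes in zip(*predictions):
        # Counter.most_common lists equal counts in first-seen order,
        # so ties go to the model listed earliest.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

def persian_only_rate(correct_fa, correct_en):
    """Percentage of questions answered correctly in Persian but not in English."""
    only_fa = sum(fa and not en for fa, en in zip(correct_fa, correct_en))
    return 100.0 * only_fa / len(correct_fa)

# Toy predictions for 4 questions, NOT real model outputs.
model_a = [0, 1, 2, 3]
model_b = [0, 2, 2, 1]
model_c = [1, 2, 0, 1]
gold = [0, 2, 2, 3]

ensemble = majority_vote([model_a, model_b, model_c])
ensemble_acc = 100.0 * sum(p == g for p, g in zip(ensemble, gold)) / len(gold)

rate = persian_only_rate(
    correct_fa=[True, True, False, True],
    correct_en=[True, False, False, True],
)
print(ensemble, ensemble_acc, rate)
```

In the toy data each single model gets two of four questions right, while the vote gets three, which is the effect the ensembling finding describes.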
Access the Dataset
All resources are publicly available to support reproducible research and further development of Persian medical NLP.
PersianMedQA Dataset: 20,785 questions with train/validation/test splits, specialty labels, clinical flags, and patient demographics in both Persian and English.
Bilingual Medical Dictionary: 64,000+ entries across 10 categories (diseases, medications, procedures, symptoms, lab tests, anatomical terms, and more).
⚠️ Important Disclaimer: This dataset is designed exclusively for research purposes. The models evaluated are not certified for clinical use and must not be deployed in real-world healthcare settings without strict validation, expert oversight, and regulatory approval.
How to Cite
Choose the citation format that matches your use case. Use the published proceedings entry when citing for academic papers; the arXiv entry when linking to the preprint.
BibTeX
@inproceedings{kalahroodi-etal-2026-persianmedqa,
title = {{P}ersian{M}ed{QA}: Evaluating Large Language Models on a {P}ersian-{E}nglish Bilingual Medical Question Answering Benchmark},
author = {Ranjbar Kalahroodi, Mohammad Javad and Sheikholselami, Amirhossein and Karimi Arpanahi, Sepehr and Ranjbar Kalahroodi, Sepideh and Faili, Heshaam and Shakery, Azadeh},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference},
month = may,
year = {2026},
publisher = {European Language Resources Association},
url = {https://doi.org/10.63317/3yixio7ngbkh},
doi = {10.63317/3yixio7ngbkh},
pages = {4371--4386},
}
@misc{kalahroodi2025persianmedqa,
title = {{PersianMedQA}: Evaluating Large Language Models on a {Persian}-{English} Bilingual Medical Question Answering Benchmark},
author = {Ranjbar Kalahroodi, Mohammad Javad and Sheikholselami, Amirhossein and Karimi Arpanahi, Sepehr and Ranjbar Kalahroodi, Sepideh and Faili, Heshaam and Shakery, Azadeh},
year = {2025},
eprint = {2506.00250},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2506.00250},
}