PersianMedQA — Persian-English Bilingual Medical Question Answering Benchmark

PersianMedQA is a large-scale, expert-validated multiple-choice medical question answering dataset covering 23 specialties, collected from 14 years of Iranian national residency and pre-residency board examinations.

We benchmark 41 state-of-the-art LLMs (general-purpose, Persian, and medical models) in zero-shot and chain-of-thought (CoT) settings in both Persian and English. Our cross-linguistic analysis shows that, depending on the model, 3–10% of questions are answered correctly only in Persian because cultural and clinical context is lost in translation. This demonstrates that translate-test evaluation is inadequate for medical AI in non-English settings.

The dataset, along with a comprehensive bilingual Persian medical dictionary, is publicly available to support reproducible research and further development of multilingual clinical NLP.

Expert-Validated Questions: 20,785
Medical Specialties: 23
Years of Exam Data: 14
Evaluated LLMs: 41
Languages: Persian, English
Best Accuracy (GPT-4.1, Persian): 83.09%

Dataset Features

Comprehensive Coverage: 20,785 questions spanning 23 medical specialties — from cardiology and surgery to pharmacology and medical statistics — sourced from 14 years of Iranian national residency exams (2011–2024).
Expert Validation: All questions underwent a rigorous three-level verification process by the National Center for Medical Education Assessment (Sanjesh), with additional review by a board-certified internal medicine specialist.
Clinical Focus: ~70% clinical case scenarios and 30% non-clinical questions, reflecting real-world high-stakes evaluation standards for Iranian medical graduates.
Rich Metadata: Specialty labels, clinical/non-clinical flags, patient demographics (age, gender), and domain classification for granular bias and performance analysis.
Bilingual Excellence: High-quality English translations validated by a medical expert, plus a comprehensive bilingual Persian medical dictionary (64,000+ terms across 10 categories) released alongside the dataset.
Contamination Controls: Temporal consistency analysis (2011–2024) and exact-match search confirm minimal data leakage into LLM training corpora.
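The two contamination checks named above can be illustrated with a minimal sketch. It assumes a small sample of candidate training text is available as plain strings and that per-question results are stored as (year, is_correct) pairs; the function names are hypothetical and this is not the authors' actual pipeline.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide a match."""
    return " ".join(text.lower().split())


def exact_match_hits(questions, corpus_docs):
    """Return the questions whose normalized text occurs verbatim in any corpus document."""
    docs = [normalize(d) for d in corpus_docs]
    hits = []
    for q in questions:
        needle = normalize(q)
        # Skip empty questions: an empty string is a substring of everything.
        if needle and any(needle in d for d in docs):
            hits.append(q)
    return hits


def accuracy_by_year(results):
    """Map each exam year to accuracy, given (year, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for year, ok in results:
        totals[year] += 1
        correct[year] += int(ok)
    return {year: correct[year] / totals[year] for year in sorted(totals)}
```

Under this reading, a model that scores markedly higher on older exam years than on recent ones would be a warning sign for leakage, while roughly flat per-year accuracy is consistent with minimal contamination.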

Dataset Splits

Split	Questions	Sampling
Training	14,549	Stratified by year & specialty
Validation	1,000	Stratified by year & specialty
Test	5,236	Stratified by year & specialty
Total	20,785	—
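As an illustration of what "stratified by year & specialty" means, the sketch below splits records so that every (year, specialty) group contributes to each split in roughly the same proportion. The record keys `year` and `specialty` and the 70/5/25 fractions (which only approximate the sizes in the table) are assumptions for the example, not the dataset's documented schema or the authors' exact procedure.

```python
import random
from collections import defaultdict


def stratified_split(records, fractions=(0.70, 0.05, 0.25), seed=0):
    """Split records into (train, validation, test), stratified by (year, specialty)."""
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["year"], rec["specialty"])].append(rec)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_val = round(len(group) * fractions[1])
        n_test = round(len(group) * fractions[2])
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])  # remainder goes to training
    return train, val, test
```

Because counts are rounded per group, the totals only approximate the requested fractions, and very small groups may contribute nothing to the validation split.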

Example Questions

Clinical: A 48-year-old man presents with chest pain (4h), anterior ST-elevation, sweating, BP 90/60 mmHg, distended neck veins, and basal rales. Most effective treatment?

Fibrinolytic + emergency angioplasty if needed
Fibrinolytic only
Emergency angioplasty ← Correct Answer
Fibrinolytic + angioplasty after 48h

Specialty: Cardiology · Type: Clinical · Patient: 48-year-old male
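A record like the one above could be represented and turned into a zero-shot prompt roughly as follows. The class name, field names, and prompt wording are hypothetical and chosen for illustration; the released dataset's actual column names may differ.

```python
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class MCQ:
    question: str
    options: list[str]   # exactly four answer options
    answer_index: int    # 0-based index of the correct option
    specialty: str
    is_clinical: bool


def format_prompt(item: MCQ) -> str:
    """Render the question and its lettered options, followed by an English instruction."""
    lines = [item.question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


example = MCQ(
    question="A 48-year-old man presents with chest pain (4h), anterior ST-elevation, "
             "sweating, BP 90/60 mmHg, distended neck veins, and basal rales. "
             "Most effective treatment?",
    options=[
        "Fibrinolytic + emergency angioplasty if needed",
        "Fibrinolytic only",
        "Emergency angioplasty",
        "Fibrinolytic + angioplasty after 48h",
    ],
    answer_index=2,
    specialty="Cardiology",
    is_clinical=True,
)
```

With this layout, `LETTERS[example.answer_index]` gives the gold label for the example, and `format_prompt(example)` produces the text sent to a model.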

Evaluation Results

Zero-shot evaluation of all 41 models on the 5,236-question test set. All models were evaluated at temperature 0 with English prompts.
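Scoring a multiple-choice run of this kind comes down to mapping each model output to an option letter and comparing it with the gold label. The sketch below assumes models are asked to reply with a letter A to D; the parsing rule is a simple heuristic invented for this example, not the paper's answer-extraction method.

```python
import re

# Matches an option letter at the very start of the output, optionally in
# parentheses, followed by punctuation, whitespace, or the end of the string.
_CHOICE = re.compile(r"^\(?([A-D])\)?(?:[.):\s]|$)")


def extract_choice(output: str):
    """Return the leading option letter (A-D) of a model output, or None if absent."""
    match = _CHOICE.match(output.strip().upper())
    return match.group(1) if match else None


def accuracy(outputs, gold):
    """Fraction of outputs whose extracted letter equals the gold letter."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)
```

Unparseable outputs count as wrong here. The heuristic would also misread a free-text reply that happens to begin with the word "A", so a real evaluation would need a stricter output format or more careful parsing.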

Top Model Performance

LLM accuracy on PersianMedQA test set
Model	Type	Persian (%)	English (%)	Average (%)
GPT-4.1	Closed-weight	83.09	80.71	81.90
Gemini-2.5-Flash-Preview	Closed-weight	82.37	79.09	80.73
Claude-3.7-Sonnet	Closed-weight	75.19	77.37	76.28
GPT-4.1-Mini	Closed-weight	74.76	77.10	75.93
DeepSeek-Chat-V3	Open-weight	68.05	73.30	70.67
LLaMA-3.1-405B-Instruct	Open-weight	67.02	73.49	70.25
Meditron3-8B	Medical	38.67	50.00	44.34
Dorna2-LLaMA-3.1-8B	Persian LLM	34.87	51.24	43.06
PersianMind-1.0	Persian LLM	24.22	25.17	24.69

Full results for all 41 models are available in the paper (Table 4). Persian LLMs such as PersianMind-1.0 performed at or near the random-guess level (25%). Human baseline: ~75% accuracy.

Key Findings

Closed-weight models dominate: GPT-4.1 (83.09% Persian) and Gemini-2.5-Flash (82.37% Persian) far outperform all open-weight alternatives, which peak at ~68% in Persian.
Persian-first advantage: Some top models (GPT-4.1, Gemini-2.5-Flash) actually score higher on Persian than English, consistent with semantic drift introduced by machine translation.
Translation is insufficient: 3–10% of questions across models can only be answered correctly in Persian, due to Iran-specific clinical protocols, vaccination schedules, and antibiotic guidelines lost in translation.
Size ≠ performance: Medical and Persian LLMs consistently underperform smaller general-purpose models, demonstrating that scale must be paired with high-quality domain-relevant training data.
Ensembling helps: Five diverse open-weight models achieve 73.7% accuracy via majority vote — +3.3% over the best single open-weight model.
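Two of the findings above rest on simple per-question computations, sketched here under stated assumptions: predictions are option letters aligned with the gold labels, ties in the vote go to the earliest-listed model, and "answered correctly only in Persian" is read as correct in Persian but wrong in English for the same model. These definitions and function names are illustrative and may differ from the paper's exact procedure.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent answer; on a tie, the earliest-listed model's answer wins."""
    counts = Counter(votes)
    best = max(counts.values())
    for vote in votes:
        if counts[vote] == best:
            return vote


def ensemble_accuracy(model_predictions, gold):
    """Accuracy of the majority vote over several models' aligned prediction lists."""
    per_question = zip(*model_predictions)  # regroup answers by question
    correct = sum(
        majority_vote(list(votes)) == g for votes, g in zip(per_question, gold)
    )
    return correct / len(gold)


def persian_only_share(correct_fa, correct_en):
    """Fraction of questions a model answers correctly in Persian but not in English."""
    only_fa = sum(fa and not en for fa, en in zip(correct_fa, correct_en))
    return only_fa / len(correct_fa)
```

The vote helps when models make different mistakes: three models that each miss a different question out of three can still reach a fully correct ensemble.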

Access the Dataset

All resources are publicly available to support reproducible research and further development of Persian medical NLP.

Hugging Face Dataset · Paper (LREC 2026) · arXiv Preprint

Included Resources

PersianMedQA Dataset: 20,785 questions with train/validation/test splits, specialty labels, clinical flags, and patient demographics in both Persian and English.
Bilingual Medical Dictionary: 64,000+ entries across 10 categories (diseases, medications, procedures, symptoms, lab tests, anatomical terms, and more).

⚠️ Important Disclaimer: This dataset is designed exclusively for research purposes. The models evaluated are not certified for clinical use and must not be deployed in real-world healthcare settings without strict validation, expert oversight, and regulatory approval.

How to Cite

Choose the citation format that matches your use case. Use the published proceedings entry when citing in academic papers, and the arXiv entry when linking to the preprint.

BibTeX

@inproceedings{kalahroodi-etal-2026-persianmedqa,
    title     = {{P}ersian{M}ed{QA}: Evaluating Large Language Models on a {P}ersian-{E}nglish Bilingual Medical Question Answering Benchmark},
    author    = {Ranjbar Kalahroodi, Mohammad Javad  and  Sheikholselami, Amirhossein  and  Karimi Arpanahi, Sepehr  and  Ranjbar Kalahroodi, Sepideh  and  Faili, Heshaam  and  Shakery, Azadeh},
    booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference},
    month     = may,
    year      = {2026},
    publisher = {European Language Resources Association},
    url       = {https://doi.org/10.63317/3yixio7ngbkh},
    doi       = {10.63317/3yixio7ngbkh},
    pages     = {4371--4386},
}

DOI: 10.63317/3yixio7ngbkh · ISBN: 979-8-89176-283-1