PersianMedQA is a large-scale, expert-validated multiple-choice medical question answering dataset covering 23 specialties, collected from 14 years of Iranian national residency and pre-residency board examinations.

We benchmark 41 state-of-the-art LLMs (general-purpose, Persian, and medical models) in zero-shot and chain-of-thought (CoT) settings, in both Persian and English. Our cross-linguistic analysis shows that 3–10% of questions can be answered correctly only in Persian, because cultural and clinical context is lost in translation. This indicates that translate-test evaluation is inadequate for medical AI in non-English settings.
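As an illustration of how a Persian-only share can be measured for a single model, the sketch below compares per-question correctness in the two languages. It is a minimal sketch, not the procedure used in the paper: the function name `persian_only_share`, the ID-to-boolean input format, and the toy data are all assumptions made for this example.

```python
def persian_only_share(correct_fa, correct_en):
    """Fraction of questions answered correctly in Persian but not in English.

    Both arguments map a question ID to True/False (hypothetical format).
    Only questions present in both mappings are compared.
    """
    shared = correct_fa.keys() & correct_en.keys()
    if not shared:
        raise ValueError("the two result sets share no question IDs")
    only_fa = sum(1 for q in shared if correct_fa[q] and not correct_en[q])
    return only_fa / len(shared)


# Toy data for illustration only; these are not results from the benchmark.
fa = {"q1": True, "q2": True, "q3": False, "q4": True}
en = {"q1": True, "q2": False, "q3": False, "q4": True}
share = persian_only_share(fa, en)
print(f"Persian-only share: {share:.1%}")
```

Swapping the two arguments gives the opposite quantity: questions answered correctly only in English.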

The dataset, along with a comprehensive bilingual Persian medical dictionary, is publicly available to support reproducible research and further development of multilingual clinical NLP.

  • Expert-Validated Questions: 20,785
  • Medical Specialties: 23
  • Years of Exam Data: 14
  • Evaluated LLMs: 41
  • Languages: 2 (Persian, English)
  • Best Accuracy (GPT-4.1, Persian): 83.09%

Dataset Features

Dataset Splits

Split        Questions   Sampling
Training     14,549      Stratified by year & specialty
Validation   1,000       Stratified by year & specialty
Test         5,236       Stratified by year & specialty
Total        20,785
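One way to produce splits stratified by year and specialty is to partition each (year, specialty) group separately, as sketched below. This is an assumption about how such a split could be done, not the authors' script: the record keys `year` and `specialty`, the function `stratified_split`, and the toy records are hypothetical. Only the split sizes come from the table above.

```python
import random
from collections import defaultdict


def stratified_split(records, fractions, seed=0):
    """Split records so every (year, specialty) stratum is divided by `fractions`.

    `records` is a list of dicts with hypothetical "year" and "specialty" keys.
    `fractions` maps split name -> fraction; the last split takes the remainder.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["year"], rec["specialty"])].append(rec)

    names = list(fractions)
    splits = {name: [] for name in names}
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        start = 0
        for name in names[:-1]:
            # Never take more than what is left in this stratum.
            take = min(round(len(group) * fractions[name]), len(group) - start)
            splits[name].extend(group[start:start + take])
            start += take
        splits[names[-1]].extend(group[start:])
    return splits


# Fractions derived from the split sizes in the table (total 20,785).
fractions = {
    "train": 14549 / 20785,
    "validation": 1000 / 20785,
    "test": 5236 / 20785,
}

# Toy records: 4 strata of 100 questions each (illustration only).
records = [
    {"id": i, "year": 2010 + i % 2, "specialty": ("Cardiology", "Neurology")[i % 4 // 2]}
    for i in range(400)
]
splits = stratified_split(records, fractions)
print({name: len(items) for name, items in splits.items()})
```

Because rounding happens per stratum, the resulting sizes only approximate the target fractions when strata are small.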

Example Questions

Clinical: A 48-year-old man presents with chest pain (4h), anterior ST-elevation, sweating, BP 90/60 mmHg, distended neck veins, and basal rales. Most effective treatment?

  1. Fibrinolytic + emergency angioplasty if needed
  2. Fibrinolytic only
  3. Emergency angioplasty ← Correct Answer
  4. Fibrinolytic + angioplasty after 48h
Specialty: Cardiology  ·  Type: Clinical  ·  Patient: 48-year-old male
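The example above can be represented as a record and turned into a zero-shot multiple-choice prompt, as in the following sketch. The field names in `Question` and the prompt wording in `build_prompt` are illustrative assumptions; the released dataset and the paper's prompts may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    # Field names are illustrative; the released dataset may use different ones.
    text: str
    options: List[str]
    answer: int  # 1-based index of the correct option
    specialty: str
    is_clinical: bool


def build_prompt(q: Question) -> str:
    """Format a question as a zero-shot multiple-choice prompt (hypothetical wording)."""
    lines = [q.text, ""]
    lines += [f"{i}. {opt}" for i, opt in enumerate(q.options, start=1)]
    lines += ["", f"Answer with the number of the correct option (1-{len(q.options)})."]
    return "\n".join(lines)


example = Question(
    text=(
        "A 48-year-old man presents with chest pain (4h), anterior ST-elevation, "
        "sweating, BP 90/60 mmHg, distended neck veins, and basal rales. "
        "Most effective treatment?"
    ),
    options=[
        "Fibrinolytic + emergency angioplasty if needed",
        "Fibrinolytic only",
        "Emergency angioplasty",
        "Fibrinolytic + angioplasty after 48h",
    ],
    answer=3,
    specialty="Cardiology",
    is_clinical=True,
)
prompt = build_prompt(example)
print(prompt)
```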

Evaluation Results

Zero-shot evaluation of 41 models on the 5,236-question test set. All models were evaluated at temperature 0 with English prompts.
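Scoring a zero-shot run reduces to extracting the chosen option from each reply and comparing it with the gold answer. The sketch below shows one simple way to do that; the first-standalone-digit heuristic in `extract_choice` is an assumption for illustration and is not the answer-parsing rule described in the paper. The replies are toy strings standing in for model outputs generated at temperature 0.

```python
import re
from typing import List, Optional


def extract_choice(output: str) -> Optional[int]:
    """Return the first standalone digit 1-4 in a model reply, or None if absent."""
    match = re.search(r"\b([1-4])\b", output)
    return int(match.group(1)) if match else None


def accuracy(outputs: List[str], gold: List[int]) -> float:
    """Percentage of replies whose extracted choice equals the gold answer."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    hits = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return 100.0 * hits / len(gold)


# Toy replies for illustration only; unparseable replies count as wrong.
outputs = ["3", "Answer: 2", "I am not sure", "The answer is 4."]
gold = [3, 1, 2, 4]
score = accuracy(outputs, gold)
print(f"Accuracy: {score:.2f}%")
```

A reply that restates the question before answering could fool this heuristic, so a stricter output format would be needed in practice.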

Top Model Performance

LLM accuracy on PersianMedQA test set
Model                      Type           Persian (%)   English (%)   Average (%)
GPT-4.1                    Closed-weight  83.09         80.71         81.90
Gemini-2.5-Flash-Preview   Closed-weight  82.37         79.09         80.73
Claude-3.7-Sonnet          Closed-weight  75.19         77.37         76.28
GPT-4.1-Mini               Closed-weight  74.76         77.10         75.93
DeepSeek-Chat-V3           Open-weight    68.05         73.30         70.67
LLaMA-3.1-405B-Instruct    Open-weight    67.02         73.49         70.25
Meditron3-8B               Medical        38.67         50.00         44.34
Dorna2-LLaMA-3.1-8B        Persian LLM    34.87         51.24         43.06
PersianMind-1.0            Persian LLM    24.22         25.17         24.69

Full results for all 41 models are available in the paper (Table 4). Persian LLMs shown in italics performed at or near random-guess level (25%). Human baseline: ~75% accuracy.
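The Average column appears to be the unweighted mean of the Persian and English scores. The sketch below checks that reading against the table and also computes the Persian-minus-English gap per model. The numbers are copied from the table above; the 0.01 tolerance is an assumption that allows for rounding in the published figures.

```python
# (Persian %, English %, Average %) as listed in the table above.
rows = {
    "GPT-4.1": (83.09, 80.71, 81.90),
    "Gemini-2.5-Flash-Preview": (82.37, 79.09, 80.73),
    "Claude-3.7-Sonnet": (75.19, 77.37, 76.28),
    "GPT-4.1-Mini": (74.76, 77.10, 75.93),
    "DeepSeek-Chat-V3": (68.05, 73.30, 70.67),
    "LLaMA-3.1-405B-Instruct": (67.02, 73.49, 70.25),
    "Meditron3-8B": (38.67, 50.00, 44.34),
    "Dorna2-LLaMA-3.1-8B": (34.87, 51.24, 43.06),
    "PersianMind-1.0": (24.22, 25.17, 24.69),
}

TOLERANCE = 0.01  # assumed allowance for rounding in the reported averages

gaps = {}
consistent = True
for model, (fa, en, avg) in rows.items():
    mean = (fa + en) / 2
    if abs(mean - avg) > TOLERANCE:
        consistent = False
    gaps[model] = fa - en  # positive: the model scores higher in Persian
    print(f"{model:26s} mean={mean:6.2f}  gap={gaps[model]:+6.2f}")

persian_stronger = [m for m, g in gaps.items() if g > 0]
print("Higher in Persian:", persian_stronger)
```

In this excerpt of the results, only the two highest-scoring models do better in Persian than in English.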

Key Findings

Access the Dataset

All resources are publicly available to support reproducible research and further development of Persian medical NLP.

🤗 HuggingFace Dataset  ·  📄 Paper (LREC 2026)  ·  arXiv Preprint

Included Resources

  • PersianMedQA Dataset: 20,785 questions with train/validation/test splits, specialty labels, clinical flags, and patient demographics in both Persian and English.
  • Bilingual Medical Dictionary: 64,000+ entries across 10 categories (diseases, medications, procedures, symptoms, lab tests, anatomical terms, and more).

How to Cite

Choose the citation format that matches your use case: use the published proceedings entry when citing in academic papers, and the arXiv entry when linking to the preprint.

BibTeX
@inproceedings{kalahroodi-etal-2026-persianmedqa,
  title     = {{P}ersian{M}ed{QA}: Evaluating Large Language Models on a {P}ersian-{E}nglish Bilingual Medical Question Answering Benchmark},
  author    = {Ranjbar Kalahroodi, Mohammad Javad and Sheikholselami, Amirhossein and Karimi Arpanahi, Sepehr and Ranjbar Kalahroodi, Sepideh and Faili, Heshaam and Shakery, Azadeh},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference},
  month     = may,
  year      = {2026},
  publisher = {European Language Resources Association},
  url       = {https://doi.org/10.63317/3yixio7ngbkh},
  doi       = {10.63317/3yixio7ngbkh},
  pages     = {4371--4386},
}

DOI: 10.63317/3yixio7ngbkh  ·  ISBN: 979-8-89176-283-1