Language-Centric Evaluation of LLMs in the Persian Medical Domain
PersianMedQA is a large-scale, expert-validated multiple-choice question answering dataset covering 23 medical specialties, collected from 14 years of Iranian residency and pre-residency board examinations.
This dataset enables comprehensive evaluation of large language models in medical reasoning tasks, with questions available in both Persian (original) and high-quality English translations. Our work represents the first comprehensive Persian medical QA benchmark, bridging the gap between multilingual AI and specialized medical knowledge.
Question (Persian): بیمار ۴۸ سالهای با درد قفسه سینه حاد و تغییرات ECG مشخص کننده STEMI مراجعه کرده است. مؤثرترین اقدام درمانی کدام است؟
English translation: A 48-year-old patient presents with acute chest pain and ECG changes indicative of STEMI. Which is the most effective therapeutic measure?
Options:
Metadata: Specialty: Cardiology | Clinical: Yes | Patient: 48-year-old male
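To make the sample above concrete, here is a minimal sketch of how one record might be represented; the field names (`question_fa`, `question_en`, `options`, `answer`, etc.) and placeholder options are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical PersianMedQA record; field names and option texts are
# illustrative placeholders, not the released schema.
record = {
    "question_fa": "بیمار ۴۸ سالهای با درد قفسه سینه حاد ...",  # original Persian
    "question_en": "A 48-year-old patient presents with acute chest pain "
                   "and ECG changes indicative of STEMI. Which is the most "
                   "effective therapeutic measure?",
    "options": ["Option A", "Option B", "Option C", "Option D"],  # 4 choices
    "answer": 0,                # index of the correct option (placeholder)
    "specialty": "Cardiology",
    "is_clinical": True,
    "patient": "48-year-old male",
}
print(record["specialty"])
```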
We evaluated several state-of-the-art language models on PersianMedQA. Accuracy on the English translations generally matches or exceeds accuracy on the original Persian, and the gap widens for smaller models:
| Model | Accuracy (Persian) | Accuracy (English) | Language Gap (English − Persian) |
|---|---|---|---|
| GPT-4.1 | 83.1% | 83.3% | +0.2% |
| Gemini 2.5 Flash | 82.4% | 83.7% | +1.3% |
| Llama 3.1-405B-Instruct | 69.3% | 75.8% | +6.5% |
| Meditron3-8B | 39.7% | 51.6% | +11.9% |
| Dorna2-Llama3-8B | 36.0% | 53.1% | +17.1% |
See our paper for comprehensive analysis including chain-of-thought experiments, ensembling strategies, and selective-answering performance.
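The scoring behind the table above can be sketched in a few lines: per-language accuracy over answer indices, with the language gap as the English minus Persian difference. The prediction lists below are toy placeholders, not real model outputs.

```python
# Minimal sketch of computing per-language accuracy and the "language gap"
# reported in the table. All predictions here are toy placeholder values.
def accuracy(preds, gold):
    """Fraction of predicted answer indices that match the gold indices."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

gold = [0, 2, 1, 3, 0]
preds_fa = [0, 2, 3, 3, 1]   # hypothetical answers on Persian questions
preds_en = [0, 2, 1, 3, 1]   # hypothetical answers on English translations

acc_fa = accuracy(preds_fa, gold)   # 3/5 correct
acc_en = accuracy(preds_en, gold)   # 4/5 correct
gap = acc_en - acc_fa               # positive gap = English advantage
print(f"fa={acc_fa:.1%} en={acc_en:.1%} gap={gap:+.1%}")
```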
The dataset will be made available through multiple channels to ensure maximum accessibility for the research community.
The paper is under review. Meanwhile, here is the arXiv version:
Ranjbar Kalahroodi, M. J., Sheikholselami, A., Karimi, S., Ranjbar Kalahroodi, S., Faili, H., & Shakery, A. (2025).
PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain.
arXiv preprint arXiv:2506.00250.
https://arxiv.org/abs/2506.00250