PersianMedQA

Language-Centric Evaluation of LLMs in the Persian Medical Domain

PersianMedQA is a large-scale, expert-validated multiple-choice question answering dataset covering 23 medical specialties, collected from 14 years of Iranian residency and pre-residency board examinations.

This dataset enables systematic evaluation of large language models on medical reasoning tasks, with questions available in both the original Persian and high-quality English translations. Our work represents the first comprehensive Persian medical QA benchmark, bridging the gap between multilingual AI and specialized medical knowledge.

Dataset Features

  • Total questions: 20,785
  • Medical specialties: 23
  • Years of exams: 14
  • Languages: 2 (Persian and English)

Example Question

Persian Clinical Case:

Question: بیمار ۴۸ ساله‌ای با درد قفسه سینه حاد و تغییرات ECG مشخص‌کننده STEMI مراجعه کرده است. مؤثرترین اقدام درمانی کدام است؟
(Translation: A 48-year-old patient presents with acute chest pain and ECG changes indicative of STEMI. What is the most effective treatment?)

Options:

  1. تجویز فیبرینولیتیک و در صورت لزوم آنژیوپلاستی اورژانس (Fibrinolytic therapy and, if needed, emergency angioplasty)
  2. تجویز فیبرینولیتیک (Fibrinolytic therapy)
  3. آنژیوپلاستی اورژانس (Emergency angioplasty) ← Correct Answer
  4. تجویز فیبرینولیتیک و آنژیوپلاستی ۴۸ ساعت بعد (Fibrinolytic therapy and angioplasty 48 hours later)

Metadata: Specialty: Cardiology | Clinical: Yes | Patient: 48-year-old male
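
For illustration only, a single record could look roughly like the Python dict below. The field names (`question_fa`, `question_en`, `options_fa`, `answer_idx`, and so on) are hypothetical placeholders that mirror the information shown above, not the dataset's published schema.

```python
# Hypothetical record layout mirroring the example above; field names are
# illustrative placeholders, not the dataset's published schema.
record = {
    "question_fa": "بیمار ۴۸ ساله‌ای با درد قفسه سینه حاد و تغییرات ECG مشخص‌کننده STEMI مراجعه کرده است. مؤثرترین اقدام درمانی کدام است؟",
    "question_en": "A 48-year-old patient presents with acute chest pain and ECG changes indicative of STEMI. What is the most effective treatment?",
    "options_fa": [
        "تجویز فیبرینولیتیک و در صورت لزوم آنژیوپلاستی اورژانس",
        "تجویز فیبرینولیتیک",
        "آنژیوپلاستی اورژانس",
        "تجویز فیبرینولیتیک و آنژیوپلاستی ۴۸ ساعت بعد",
    ],
    "answer_idx": 2,          # zero-based index of the correct option (option 3)
    "specialty": "Cardiology",
    "is_clinical": True,
    "patient": "48-year-old male",
}
```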

Evaluation Results

We evaluated a range of state-of-the-art language models on PersianMedQA. Most models score higher on the English translations than on the original Persian, and the gap widens sharply for smaller and domain-specific models:

| Model | Accuracy (Persian) | Accuracy (English) | Language Gap (En − Fa) |
|---|---|---|---|
| GPT-4.1 | 83.1% | 83.3% | +0.2% |
| Gemini 2.5 Flash | 82.4% | 83.7% | +1.3% |
| Llama 3.1-405B-Instruct | 69.3% | 75.8% | +6.5% |
| Meditron3-8B | 39.7% | 51.6% | +11.9% |
| Dorna2-Llama3-8B | 36.0% | 53.1% | +17.1% |

See our paper for comprehensive analysis including chain-of-thought experiments, ensembling strategies, and selective-answering performance.
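
As a rough illustration of how the numbers in the table relate, the sketch below computes per-language accuracy and the language gap (English minus Persian) from lists of predicted option indices. The data is made up and the helper is not taken from the paper's evaluation code.

```python
# Illustrative only: per-language accuracy and "language gap" from
# hypothetical model predictions (option indices 0-3).
def accuracy(predictions, gold):
    """Fraction of questions where the predicted option matches the gold option."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

gold          = [2, 0, 1]   # made-up gold option indices for three questions
preds_persian = [2, 3, 1]   # model answers on the Persian questions
preds_english = [2, 0, 1]   # model answers on the English translations

acc_fa = accuracy(preds_persian, gold)
acc_en = accuracy(preds_english, gold)
print(f"Persian: {acc_fa:.1%}  English: {acc_en:.1%}  Gap: {acc_en - acc_fa:+.1%}")
```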

Access the Dataset

The dataset will be made available through multiple channels to ensure maximum accessibility for the research community:

  • 🤗 Hugging Face Dataset
  • 📄 Paper (arXiv): https://arxiv.org/abs/2506.00250
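
Once the Hugging Face release is live, loading the dataset should look roughly like the sketch below. The repository ID `ORG/PersianMedQA` is a placeholder, and the split name and record fields are assumptions about the eventual layout.

```python
# Minimal sketch, assuming the dataset is hosted on the Hugging Face Hub.
# "ORG/PersianMedQA" is a placeholder repository ID; the split and field
# names are assumptions, not the published schema.
from datasets import load_dataset

dataset = load_dataset("ORG/PersianMedQA")  # replace with the actual repo ID

print(dataset)                  # inspect available splits
example = dataset["train"][0]   # assumes a "train" split exists
print(example)                  # inspect the record fields
```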

Intended Uses & Applications

  • Benchmarking: Evaluate multilingual and domain-specific language models on high-stakes medical reasoning tasks
  • Few-shot Learning: Test zero-shot and few-shot capabilities of Persian QA systems
  • Cross-lingual Research: Investigate translation effects and cross-lingual transfer in medical domains
  • Cultural Context: Analyze cultural and linguistic nuances in medical AI systems
  • Educational Tools: Develop AI-powered medical education platforms for Persian-speaking regions

⚠️ Important Disclaimer: This dataset is designed exclusively for research purposes. It must not be used to provide real-world medical advice or to support clinical decision-making, and it must not be deployed in healthcare systems without proper validation and regulatory approval.

Reference

The paper is under review. Meanwhile, here is the arXiv version:

Ranjbar Kalahroodi, M. J., Sheikholselami, A., Karimi, S., Ranjbar Kalahroodi, S., Faili, H., & Shakery, A. (2025). PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain. arXiv preprint arXiv:2506.00250. https://arxiv.org/abs/2506.00250

Citation Format

@article{ranjbar2025persianmedqa,
  title         = {PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain},
  author        = {Mohammad Javad Ranjbar Kalahroodi and Amirhossein Sheikholselami and Sepehr Karimi and Sepideh Ranjbar Kalahroodi and Heshaam Faili and Azadeh Shakery},
  journal       = {arXiv preprint arXiv:2506.00250},
  year          = {2025},
  url           = {https://arxiv.org/abs/2506.00250},
  archivePrefix = {arXiv},
  eprint        = {2506.00250},
  primaryClass  = {cs.CL}
}