Poster presentation at SilkRoadNLP 2026, the First Workshop on NLP for Iranian Languages, EACL 2026
01 — Overview
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet it remains underexplored for Persian. We introduce PersianPunc, a large-scale, high-quality dataset for Persian punctuation restoration, constructed through automated processing of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune BERT-based models to achieve state-of-the-art performance.
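The token-level formulation can be sketched as follows: each word receives a label naming the punctuation mark that follows it, or "O" for none. This is a minimal illustration with our own label names and a whitespace tokenizer; the paper's actual label set and subword tokenization may differ.

```python
# Hypothetical label set covering the five marks in the dataset;
# "O" means no punctuation follows the token.
PUNCT_LABELS = {"،": "COMMA", ".": "PERIOD", ":": "COLON",
                "!": "EXCLAMATION", "؟": "QUESTION"}

def to_labeled_tokens(text: str):
    """Convert a punctuated sentence into (token, label) training pairs."""
    tokens, labels = [], []
    for word in text.split():
        mark = word[-1] if word and word[-1] in PUNCT_LABELS else None
        if mark:
            word = word[:-1]  # strip the trailing mark from the token
        if word:
            tokens.append(word)
            labels.append(PUNCT_LABELS.get(mark, "O"))
    return tokens, labels

# "Hello, how are you?" in Persian
tokens, labels = to_labeled_tokens("سلام، حال شما چطور است؟")
print(labels)  # → ['COMMA', 'O', 'O', 'O', 'QUESTION']
```

At inference time the model predicts these labels for unpunctuated ASR output, and the marks are reinserted after the corresponding tokens.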
Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from two critical limitations: a tendency to over-correct and substantially higher computational cost. Our lightweight BERT-based approach achieves an F1 score of 97.28% while remaining efficient enough for real-time applications, significantly outperforming LLM baselines on both accuracy and computational metrics.
02 — Why It Matters
03 — Contributions
A curated corpus of 17 million samples spanning formal and informal Persian text across diverse domains and topics.
Fine-tuned ParsBERT achieving an F1 score of 97.28%, restoring 4 distinct Persian punctuation marks with real-time efficiency.
Rigorous evaluation showing our lightweight model outperforms GPT-4o and GPT-4o-mini on accuracy while using far less compute.
04 — Performance
| Model | F1 Score (%) | Full Sentence Match (%) |
|---|---|---|
| CRF (Hosseini et al.) | 69.00 | N/A |
| ViraPart | 92.13 | N/A |
| GPT-4o-mini | 89.08 | 38.01 |
| GPT-4o | 93.85 | 50.10 |
| PersianPunc-ParsBERT (Ours) | 97.28 | 61.80 |
05 — Breakdown
06 — Data
Bijankhan-Peykare Corpus, Persian Medical QA, and Persian Wikipedia — standardised punctuation in formal academic contexts.
Persian Telegram Channels, Farsi Stories, and Blog Dataset V2 — modern conversational patterns with varied punctuation styles.
Multi-stage filtering: structural, content, and linguistic checks. SHA-256 hash-based exact deduplication across all sources.
Persian comma (50.1%), period (35.5%), colon (10.0%), exclamation (2.9%), question mark (1.6%) — avg 2.51 marks per sentence.
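The exact-deduplication step above can be sketched as a simple hash-set filter: each sample's normalized text is hashed with SHA-256, and any sample whose digest has been seen before is dropped. This is an illustrative sketch, not the paper's pipeline code; the actual normalization applied before hashing may differ.

```python
import hashlib

def dedup_exact(samples):
    """Drop exact duplicates by SHA-256 digest of the stripped text."""
    seen, unique = set(), []
    for text in samples:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

# Two distinct samples survive; the repeated one is removed.
print(len(dedup_exact(["نمونه اول", "نمونه دوم", "نمونه اول"])))  # → 2
```

Hashing digests instead of storing full strings keeps memory bounded when deduplicating across all source corpora at the 17M-sample scale.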
07 — Resources
HuggingFace Datasets: The curated evaluation & training subset used for model development and benchmarking.
HuggingFace Datasets: The complete large-scale 17M-sample dataset covering formal and informal Persian text.
HuggingFace Model: Fine-tuned ParsBERT model for Persian punctuation restoration — lightweight, fast, and production-ready.
arXiv Preprint: Full paper with methodology, detailed experiments, appendices, and evaluation prompts.
09 — Reference