Accepted · SilkRoadNLP 2026 — EACL 2026 Workshop

PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

M. J. Ranjbar · A. Shakery · H. Faili

Poster presentation at SilkRoadNLP 2026, the First Workshop on NLP for Iranian Languages, EACL 2026

Abstract

Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet it remains underexplored for Persian. We introduce PersianPunc, a large-scale, high-quality dataset for Persian punctuation restoration, constructed through automated processing of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune BERT-based models to achieve state-of-the-art performance.

Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies and substantially higher computational requirements. Our lightweight BERT-based approach achieves an F1 score of 97.28% while maintaining efficiency suitable for real-time applications, significantly outperforming LLM baselines on both accuracy and computational metrics.
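The token-level sequence labeling formulation can be sketched as follows: each word is assigned a label naming the punctuation mark that follows it (or "O" for none), and decoding re-attaches the predicted marks. This is a minimal illustrative sketch, not the paper's released code; the label names and function names are hypothetical.

```python
# Label scheme: one label per word, naming the punctuation that follows it.
PUNCT_LABELS = {".": "PERIOD", "،": "COMMA", ":": "COLON", "?": "QUESTION"}
LABEL_MARKS = {v: k for k, v in PUNCT_LABELS.items()}

def text_to_labels(text: str):
    """Strip punctuation from `text` and emit (word, label) training pairs."""
    pairs = []
    for raw in text.split():
        label = "O"
        while raw and raw[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[raw[-1]]
            raw = raw[:-1]
        if raw:
            pairs.append((raw, label))
    return pairs

def labels_to_text(pairs):
    """Inverse step: re-attach predicted punctuation marks to words."""
    return " ".join(word + LABEL_MARKS.get(label, "") for word, label in pairs)

pairs = text_to_labels("nah، baba rast migi.")
# [('nah', 'COMMA'), ('baba', 'O'), ('rast', 'O'), ('migi', 'PERIOD')]
```

In practice the labels would be predicted per subword token by a fine-tuned BERT encoder; this sketch only shows the data transformation around the model.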

The Power of a Single Comma

Without punctuation
bakhshesh lazem nist e'damesh konid
"No mercy needed, execute him" — Negative
↓  Add comma
With punctuation
bakhshesh, lazem nist e'damesh konid
"Forgiveness, no need to execute him" — Positive

Without punctuation
nah baba rast migi
"No way, are you serious?" — Sarcastic
↓  Add comma & period
With punctuation
nah, baba rast migi.
"No, dad, you're right." — Affirmative

Key Contributions

01

Large-Scale Dataset

A curated corpus of 17 million samples spanning formal and informal Persian text across diverse domains and topics.

02

State-of-the-Art Model

Fine-tuned ParsBERT achieving an F1 score of 97.28%, restoring 4 distinct Persian punctuation marks with real-time efficiency.

03

LLM Comparison

Rigorous evaluation showing our lightweight model outperforms GPT-4o and GPT-4o-mini on accuracy while using far less compute.

Results & Benchmarks

97.28
F1 Score (%)
61.80
Full Match (%)
17M
Samples
4
Punctuation Marks
Model                        | F1 Score (%) | Full Sentence Match (%)
CRF (Hosseini et al.)        | 69.00        | N/A
ViraPart                     | 92.13        | N/A
GPT-4o-mini                  | 89.08        | 38.01
GPT-4o                       | 93.85        | 50.10
PersianPunc-ParsBERT (Ours)  | 97.28        | 61.80
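The Full Sentence Match column can be read as an exact-match rate: the fraction of test sentences whose predicted punctuation agrees with the reference at every position. A minimal sketch of this metric (function name hypothetical):

```python
def full_sentence_match(predictions, references):
    """Fraction of sentences where the predicted punctuated text
    matches the reference exactly (the strictest sentence-level metric)."""
    assert len(predictions) == len(references)
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)

rate = full_sentence_match(
    ["nah، baba rast migi.", "bakhshesh lazem nist."],
    ["nah، baba rast migi.", "bakhshesh، lazem nist."],
)
# 0.5: one of two sentences matches exactly
```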

Per-Punctuation F1 Scores

Period (.)          98.71
Colon (:)           90.45
Question (?)        88.89
Persian Comma (،)   80.03
Macro Avg           91.33

Dataset Composition

📚 Formal Academic

Bijankhan-Peykare Corpus, Persian Medical QA, and Persian Wikipedia — standardised punctuation in formal academic contexts.

💬 Informal Contemporary

Persian Telegram Channels, Farsi Stories, and Blog Dataset V2 — modern conversational patterns with varied punctuation styles.

🔍 Quality Pipeline

Multi-stage filtering: structural, content, and linguistic checks. SHA-256 hash-based exact deduplication across all sources.
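The exact-deduplication stage can be sketched with the standard library alone: hash each sample with SHA-256 and keep only first occurrences. Whether the pipeline normalizes text (e.g. trims whitespace) before hashing is an assumption here; the function name is hypothetical.

```python
import hashlib

def deduplicate(samples):
    """Exact deduplication via SHA-256 digests, keeping first occurrences.
    Whitespace-stripping before hashing is an illustrative assumption."""
    seen = set()
    unique = []
    for text in samples:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Storing 64-character digests rather than full texts keeps memory bounded when deduplicating across all 17M samples.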

📊 Punctuation Stats

Persian comma (50.1%), Period (35.5%), Colon (10.0%), Exclamation (2.9%), Question (1.6%) — avg 2.51 marks per sentence.

Open Resources

How to Cite

BibTeX
@inproceedings{ranjbar2026persianpunc,
  author    = {Ranjbar, Mohammad Javad and Shakery, Azadeh and Faili, Heshaam},
  title     = {PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration},
  booktitle = {Proceedings of SilkRoadNLP 2026: The First Workshop on NLP for Iranian Languages},
  year      = {2026},
  venue     = {EACL 2026 Workshop},
  note      = {Poster presentation}
}