Poster presentation at SilkRoadNLP 2026, the First Workshop on NLP for Iranian Languages, EACL 2026
01 — Overview
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet it remains underexplored for Persian. We introduce PersianPunc, a large-scale, high-quality dataset for Persian punctuation restoration, constructed through automated processing of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune BERT-based models to achieve state-of-the-art performance.
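The token-level formulation can be sketched as follows: each word receives a label naming the punctuation mark that follows it, or "O" for none. This is a minimal illustration with our own label names and a whitespace tokenizer; the paper's actual label set and subword tokenization may differ.

```python
# Hypothetical label set covering the five marks in the dataset;
# "O" means no punctuation follows the token.
PUNCT_LABELS = {"،": "COMMA", ".": "PERIOD", ":": "COLON",
                "!": "EXCLAMATION", "؟": "QUESTION"}

def to_labeled_tokens(text: str):
    """Convert a punctuated sentence into (token, label) training pairs."""
    tokens, labels = [], []
    for word in text.split():
        mark = word[-1] if word and word[-1] in PUNCT_LABELS else None
        if mark:
            word = word[:-1]  # strip the trailing mark from the token
        if word:
            tokens.append(word)
            labels.append(PUNCT_LABELS.get(mark, "O"))
    return tokens, labels

# "Hello, how are you?" in Persian
tokens, labels = to_labeled_tokens("سلام، حال شما چطور است؟")
print(labels)  # → ['COMMA', 'O', 'O', 'O', 'QUESTION']
```

At inference time the model predicts these labels for unpunctuated ASR output, and the marks are reinserted after the corresponding tokens.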
Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from two critical limitations: a tendency to over-correct and substantially higher computational cost. Our lightweight BERT-based approach achieves an F1 score of 97.28% while remaining efficient enough for real-time applications, significantly outperforming LLM baselines on both accuracy and computational metrics.
02 — Why It Matters
03 — Contributions
A curated corpus of 17 million samples spanning formal and informal Persian text across diverse domains and topics.
Fine-tuned ParsBERT achieving an F1 score of 97.28%, restoring 4 distinct Persian punctuation marks with real-time efficiency.
Rigorous evaluation showing our lightweight model outperforms GPT-4o and GPT-4o-mini on accuracy while using far less compute.
04 — Performance
| Model | F1 Score (%) | Full Sentence Match (%) |
|---|---|---|
| CRF (Hosseini et al.) | 69.00 | N/A |
| ViraPart | 92.13 | N/A |
| GPT-4o-mini | 89.08 | 38.01 |
| GPT-4o | 93.85 | 50.10 |
| PersianPunc-ParsBERT (Ours) | 97.28 | 61.80 |
05 — Breakdown
06 — Data
Bijankhan-Peykare Corpus, Persian Medical QA, and Persian Wikipedia — standardised punctuation in formal academic contexts.
Persian Telegram Channels, Farsi Stories, and Blog Dataset V2 — modern conversational patterns with varied punctuation styles.
Multi-stage filtering: structural, content, and linguistic checks. SHA-256 hash-based exact deduplication across all sources.
Persian comma (50.1%), period (35.5%), colon (10.0%), exclamation (2.9%), question mark (1.6%) — avg 2.51 marks per sentence.
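The exact-deduplication step above can be sketched as a simple hash-set filter: each sample's normalized text is hashed with SHA-256, and any sample whose digest has been seen before is dropped. This is an illustrative sketch, not the paper's pipeline code; the actual normalization applied before hashing may differ.

```python
import hashlib

def dedup_exact(samples):
    """Drop exact duplicates by SHA-256 digest of the stripped text."""
    seen, unique = set(), []
    for text in samples:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

# Two distinct samples survive; the repeated one is removed.
print(len(dedup_exact(["نمونه اول", "نمونه دوم", "نمونه اول"])))  # → 2
```

Hashing digests instead of storing full strings keeps memory bounded when deduplicating across all source corpora at the 17M-sample scale.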
07 — Resources
HuggingFace Datasets: The curated evaluation & training subset used for model development and benchmarking.
HuggingFace Datasets: The complete large-scale 17M-sample dataset covering formal and informal Persian text.
HuggingFace Model: Fine-tuned ParsBERT model for Persian punctuation restoration — lightweight, fast, and production-ready.
arXiv Preprint: Full paper with methodology, detailed experiments, appendices, and evaluation prompts.
09 — Reference