PersianPunc — Persian Punctuation Restoration Dataset & Model

01 — Overview

Abstract

Proceedings of SilkRoadNLP 2026, EACL · Rabat, Morocco · March 29, 2026 · pp. 105–113

Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) output, yet it remains underexplored for Persian. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources.

We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from over-correction tendencies and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% while maintaining efficiency suitable for real-time applications.
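As a minimal sketch of the token-level formulation, the snippet below turns a punctuated sentence into (token, label) pairs, where each label names the mark that follows the token. The label names, the whitespace tokenisation, and the helper `to_token_labels` are illustrative assumptions; the paper's actual label scheme and ParsBERT subword handling may differ.

```python
# Hypothetical label set: the four restored marks plus "O" for "no mark".
PUNCT_TO_LABEL = {
    "\u060c": "COMMA",     # Persian comma
    ",": "COMMA",          # ASCII comma, accepted only so transliterated examples work
    ".": "PERIOD",
    "\u061f": "QUESTION",  # Persian question mark
    ":": "COLON",
}


def to_token_labels(text):
    """Split on whitespace, strip trailing marks, and label each token with its mark.

    Free-standing marks (not attached to a word) are dropped in this sketch.
    """
    pairs = []
    for raw in text.split():
        label = "O"
        while raw and raw[-1] in PUNCT_TO_LABEL:
            if label == "O":
                # If several marks trail a token, the rightmost one is kept.
                label = PUNCT_TO_LABEL[raw[-1]]
            raw = raw[:-1]
        if raw:
            pairs.append((raw, label))
    return pairs


print(to_token_labels("nah, baba rast migi."))
# [('nah', 'COMMA'), ('baba', 'O'), ('rast', 'O'), ('migi', 'PERIOD')]
```

A classifier trained on such pairs sees only the unpunctuated tokens and predicts one label per token.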

02 — Motivation

The Power of a Single Comma

Persian punctuation placement can completely reverse a sentence's meaning — a critical challenge for ASR pipelines.

Without punctuation

bakhshesh lazem nist e'damesh konid

"No mercy needed, execute him" — Negative

↓ Add comma

With punctuation

bakhshesh, lazem nist e'damesh konid

"Forgiveness, no need to execute him" — Positive

Without punctuation

nah baba rast migi

"No way, are you serious?" — Sarcastic

↓ Add comma & period

With punctuation

nah, baba rast migi.

"No, dad, you're right." — Affirmative

03 — Contributions

Key Contributions

Three primary contributions to Persian NLP and low-resource language processing.

PersianPunc Dataset

A curated corpus of 17 million filtered and deduplicated samples spanning formal and informal Persian text across six complementary corpora and diverse domains.

Fine-tuned ParsBERT Model

Token-level sequence labeling achieving 91.33% macro F1 and 97.28% micro F1, restoring four distinct Persian punctuation marks with real-time efficiency.

Rigorous LLM Comparison

Systematic evaluation showing that our lightweight model surpasses GPT-4o and GPT-4o-mini on macro F1, full sentence match, and computational efficiency.

04 — Performance

Results & Benchmarks

Evaluated on 1,000 held-out Persian sentences from PersianPunc.

Macro F1 (%): 91.33
Micro F1 (%): 97.28
Full Sentence Match (%): 61.80
Training Samples: 17M

Persian punctuation restoration model comparison
Model	Test Set	Macro F1 (%)	Full Sent. Match (%)
CRF (Hosseini & Sameti, 2017)	Hosseini et al. corpus †	69.00	—
ViraPart (Farokhshad et al., 2021)	Bijankhan corpus †	92.13	—
GPT-4o-mini	PersianPunc test set	79.54	38.01
GPT-4o (OpenAI, 2023)	PersianPunc test set	85.96	50.10
PersianPunc-ParsBERT (Ours)	PersianPunc test set	91.33	61.80

† CRF and ViraPart results are on different test sets and cannot be directly compared to our results; shown for contextual reference only.
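The two metrics in the table can be sketched as follows, assuming gold and predicted labels are given as one list of label strings per sentence. The function names, the toy data, and the choice of which labels enter the macro average are assumptions for illustration; the page does not specify the exact evaluation script.

```python
from collections import Counter


def per_label_f1(gold, pred, labels):
    """F1 for each label, pooling token decisions over all sentences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g_sent, p_sent in zip(gold, pred):
        for g, p in zip(g_sent, p_sent):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1
                fn[g] += 1
    scores = {}
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        scores[lab] = 2 * tp[lab] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


def full_sentence_match(gold, pred):
    """Share of sentences whose entire predicted label sequence equals the gold one."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data: the first sentence misses a comma, the second is fully correct.
gold = [["COMMA", "O", "O", "PERIOD"], ["O", "O", "PERIOD"]]
pred = [["O", "O", "O", "PERIOD"], ["O", "O", "PERIOD"]]
print(macro_f1(gold, pred, ["COMMA", "PERIOD"]))  # 0.5
print(full_sentence_match(gold, pred))            # 0.5
```

Full sentence match is the stricter of the two: a single wrong label anywhere in a sentence counts the whole sentence as a miss, which is why it sits well below the F1 figures.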

05 — Breakdown

Per-Punctuation F1 Scores

Performance across each of the four restored punctuation marks on the 1,000-sentence test set.

Period (.)

98.71

Colon (:)

90.45

Question (؟)

88.89

Persian Comma (،)

80.03

Macro Average

91.33
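A quick arithmetic check, offered as a sketch: the unweighted mean of the four per-mark scores listed above is 89.52, not 91.33. One possible reading, which is an assumption and not stated on this page, is that the reported macro average also covers a class that is not listed here, such as a "no punctuation" label. The snippet computes the score such a fifth class would need.

```python
# Per-mark F1 scores as listed on this page.
per_mark_f1 = {
    "period": 98.71,
    "colon": 90.45,
    "question": 88.89,
    "persian_comma": 80.03,
}
reported_macro = 91.33

# Unweighted mean of the four listed scores.
mean_of_listed = sum(per_mark_f1.values()) / len(per_mark_f1)

# Score a hypothetical fifth class would need for a five-way mean to equal 91.33.
implied_fifth = 5 * reported_macro - sum(per_mark_f1.values())

print(round(mean_of_listed, 2))  # 89.52
print(round(implied_fifth, 2))   # 98.57
```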

06 — Data

Dataset Composition

17,102,014 unique samples after SHA-256 hash-based exact deduplication across six source corpora.

Formal Academic Text

Bijankhan-Peykare Corpus, Persian Medical QA, and Persian Wikipedia — standardised punctuation in formal literary, medical, and encyclopedic contexts.

Contemporary Informal Text

Persian Telegram Channels, Farsi Stories, and Blog Dataset V2 — modern conversational patterns covering social media, narrative fiction, and personal blogging.

Quality Pipeline

Multi-stage filtering: structural, content, and linguistic checks. Exact deduplication via SHA-256 with whitespace normalisation across all sources.
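Exact deduplication of this kind can be sketched as below. The normalisation rule (collapsing whitespace runs and trimming), the UTF-8 encoding, and the helper names are assumptions; the page only states that SHA-256 hashing with whitespace normalisation was used.

```python
import hashlib


def normalise_whitespace(text):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return " ".join(text.split())


def deduplicate(samples):
    """Keep the first occurrence of each sample, comparing SHA-256 digests
    of the whitespace-normalised text."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(normalise_whitespace(sample).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


print(deduplicate(["a  b", "a b ", "c"]))  # ['a  b', 'c']
```

Storing fixed-size digests instead of full strings keeps the `seen` set small, which matters when merging millions of sentences from several corpora.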

Punctuation Distribution

Persian comma (50.1%), Period (35.5%), Colon (10.0%), Exclamation (2.9%), Question (1.6%). Average 2.51 punctuation marks per sentence.
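Statistics of this form can be computed with a simple character count. The sketch below assumes the five marks named above and counts every occurrence in each sentence; the function name and the toy input are illustrative only.

```python
from collections import Counter

# The five marks named in the distribution, keyed by character.
MARKS = {
    "\u060c": "Persian comma",
    ".": "Period",
    ":": "Colon",
    "!": "Exclamation",
    "\u061f": "Question",
}


def punctuation_stats(sentences):
    """Return (percentage share per mark, average number of marks per sentence)."""
    counts = Counter(MARKS[ch] for s in sentences for ch in s if ch in MARKS)
    total = sum(counts.values())
    shares = {name: 100 * n / total for name, n in counts.items()} if total else {}
    average = total / len(sentences) if sentences else 0.0
    return shares, average


shares, average = punctuation_stats(["a\u060c b.", "c\u061f"])
print(average)  # 1.5
```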

07 — Open Resources

All Resources Are Public

The dataset and paper are freely available to support future research in Persian and low-resource NLP; the model upload is still pending.

Paper — ACL Anthology

Full paper with methodology, experiments, appendices, and zero-shot LLM evaluation prompts. DOI: 10.18653/v1/2026.silkroadnlp-1.11

ACL Anthology

PersianPunc Eval Dataset

The curated 1M-sample subset used for model training and benchmarking, with train / validation / test splits.

HuggingFace Datasets

Full 17M Dataset

The complete large-scale corpus of 17,102,014 unique Persian sentences spanning formal and informal text across all six source domains.

HuggingFace Datasets

PersianPunc-ParsBERT

Fine-tuned ParsBERT for Persian punctuation restoration — lightweight, fast, and ready for real-time ASR pipelines.

Model upload coming soon
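Since the model is not yet published, no loading code is shown here. As a sketch of the post-processing step in an ASR pipeline, the snippet below merges per-token predictions back into punctuated text. The label names and the helper `apply_labels` are hypothetical and would need to match whatever label set the released model uses.

```python
# Hypothetical mapping from predicted label to the mark appended after the token.
LABEL_TO_PUNCT = {
    "COMMA": "\u060c",     # Persian comma
    "PERIOD": ".",
    "QUESTION": "\u061f",  # Persian question mark
    "COLON": ":",
}


def apply_labels(tokens, labels):
    """Append the predicted mark, if any, to each token and join with spaces."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok + LABEL_TO_PUNCT.get(lab, "") for tok, lab in zip(tokens, labels))


restored = apply_labels(["nah", "baba", "rast", "migi"], ["COMMA", "O", "O", "PERIOD"])
print(restored)
```

Labels outside the mapping, including "O", leave the token unchanged.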

08 — Reference

How to Cite

Please use the official ACL Anthology citation.

BibTeX

@inproceedings{ranjbar-kalahroodi-etal-2026-persianpunc,
    title = {{P}ersian{P}unc: A Large-Scale Dataset and {BERT}-Based Approach for {P}ersian Punctuation Restoration},
    author = {Ranjbar Kalahroodi, Mohammad Javad and Faili, Heshaam and Shakery, Azadeh},
    editor = {Merchant, Rayyan and Megerdoomian, Karine},
    booktitle = {The Proceedings of the First Workshop on {NLP} and {LLM}s for the {I}ranian Language Family},
    month = mar,
    year = {2026},
    address = {Rabat, Morocco},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2026.silkroadnlp-1.11/},
    doi = {10.18653/v1/2026.silkroadnlp-1.11},
    pages = {105--113},
    ISBN = {979-8-89176-371-5},
}

DOI: 10.18653/v1/2026.silkroadnlp-1.11 · ISBN: 979-8-89176-371-5